This chapter is dedicated to getting a better understanding of Internet technologies. This is optional reading, but strongly recommended for users with little or medium experience, as well as for the curious-at-heart interested in fully mastering their use of Free and Batch Data Collector.
If at any time you realize that your extraction requirements can be met by processing the underlying code exclusively as one long string of text, know that you can skip this section in its entirety. If, however, you need to treat the code as a hierarchical map of information to be identified and parsed by means of simple or complex selectors, please read on.
What technologies are involved?
First up is HTML, the famous HyperText Markup Language. Historically this has been the language used to represent documents published on the Internet. It’s called a markup language because it borrows the typographical practice of marking up a text, that is, delimiting (by highlighting) the parts to be reviewed before printing. HTML does more or less the same, delimiting sections of a text to be published between codes called tags. In the last chapter we explained what they do and roughly how they work, and it bears repeating that when a tag is opened, the application of certain rules begins (generally visual formatting or a hypertext link), and ceases when the tag itself is closed.
Next up is XHTML (eXtensible HyperText Markup Language), roughly the successor of HTML: a stricter reformulation of the language that the W3C consortium published as a standard in 2000. The main rules of the language are as follows:
- tags must always be written in lowercase letters;
- each opening tag must correspond to a closing tag;
- tags that have no closing tag must be self-closed (for example, the line break tag, which in HTML is written as <br>, becomes <br /> in XHTML);
- attribute values are delimited by quotation marks (" "), and attribute names must be written using only lowercase letters.
Other rules exist, however they’ve been intentionally left out as they do not pertain to this chapter’s discussion.
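Because XHTML must also be valid XML, the closing and quoting rules above can be tried out with any XML parser. The following is a minimal sketch in Python using the standard library; the function name is_well_formed and the three sample fragments are invented for this illustration, and the check does not enforce the lowercase rule.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, which XHTML must."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# XHTML style: the empty element is self-closed and the attribute is quoted.
xhtml_fragment = '<p class="note">First line<br />Second line</p>'
# HTML style: <br> is never closed, so an XML parser rejects the fragment.
html_fragment = '<p class="note">First line<br>Second line</p>'
# An unquoted attribute value is rejected as well.
unquoted_fragment = "<p class=note>Text</p>"

print(is_well_formed(xhtml_fragment))     # True
print(is_well_formed(html_fragment))      # False
print(is_well_formed(unquoted_fragment))  # False
```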
Finally, there is XML, which looks quite similar to HTML and XHTML in terms of logic, structure, and tags that follow the same rules. Its peculiarity, however, is that it also allows non-standard tags to be defined. So if HTML is thought of as a set of roughly a hundred predefined tags, XML can contain a practically unlimited set of custom ones.
What need was there for XML?
XML represents an excellent system of communication between applications, devices, or databases, even if they’re very different. In practice, if your refrigerator is actively communicating with your microwave, it probably does so in XML. Logically your refrigerator *wouldn’t* use HTML because it likely has no need to use text formatting elements to communicate whatever top-secret messages it needs to transmit.
We see lots of services on the web that respond in XML format. Many software interfaces (called APIs) use XML, and so do specialized newspapers and magazines, telephone archives, localization services, and so on. Heck, even Google uses it on request. Free and Batch Data Collector can analyze XML and collect the values enclosed between tags or the attributes that define them, even if the tags are non-standard. Downloading an XML file, by the way, is usually much faster than rendering an entire HTML page, as there are no images or formatting styles to slow things down. In short, XML is a high-performance communication solution suitable for multiple devices and work environments, and we strongly suggest you consider it when undertaking massive scrapes.
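To make the idea of collecting values between tags and from attributes concrete, here is a small sketch in Python using the standard library. The XML reply, including the fridge, temperature, and item tags and their attributes, is entirely hypothetical and only stands in for whatever non-standard tags a real service might return; it does not show how Free or Batch Data Collector work internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML reply: every tag and attribute name here is invented.
xml_reply = """
<fridge id="kitchen-01">
  <temperature unit="celsius">4</temperature>
  <item name="milk" quantity="2" />
  <item name="eggs" quantity="12" />
</fridge>
"""

root = ET.fromstring(xml_reply.strip())

# A value enclosed between an opening and a closing tag.
temperature = root.find("temperature")
value = float(temperature.text)

# Values stored in attributes.
unit = temperature.get("unit")
items = {item.get("name"): int(item.get("quantity")) for item in root.findall("item")}

print(root.get("id"), value, unit, items)
```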
HTML, Instructions for Use
An HTML page is nothing more than a text file containing opening and closing tags, with portions of additional text between them, generally drawn on the browser window.
It is important to familiarize yourself with the most common tags used in the source code, and it is equally important not to get scared by anything weird or unknown. Remembering the general structure of an HTML page will help you recognize the main sections of the code, and will make it intriguing and fun, with a little patience, to trace the elements to be captured from a web page. The choice of the best element, like a precise Google query, is a bit of an art. You should think of yourself as the artist. The web-scraping, time-saving artist.
Here are the tags, in order, that will always appear on a well-formed webpage:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="description" content="A short webpage description" />
<meta name="author" content="Author Name" />
<link rel="stylesheet" type="text/css" href="css/style.css" />
</head>
<body>
Visible page content
</body>
</html>
As demonstrated above, each opening tag corresponds to a closing one, except for tags with implicit termination (meta is an example). In its entirety, the code is hierarchically structured: one tag can contain another, provided that the tags are closed in the reverse order of their opening. Last in, first out, or LIFO. A combination such as the following is therefore permitted:
<p><b>Some text</b></p>
While the following is a no-go:
<p><b>Some text</p></b>
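The last-in, first-out rule maps naturally onto a stack: every opening tag is pushed, and every closing tag must match the tag on top. The following is a minimal sketch in Python built on the standard library's html.parser module. The names NestingChecker, is_properly_nested, and VOID_TAGS are invented for this illustration, and the set of tags without a closing tag is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a subset, enough for this sketch).
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and flag any out-of-order closing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # self-closed tags such as <br /> never sit on the stack
        # The tag being closed must be the most recently opened one.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.ok = False


def is_properly_nested(markup: str) -> bool:
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    # Valid only if no mismatch was seen and nothing was left open.
    return checker.ok and not checker.stack


print(is_properly_nested("<p><b>Some text</b></p>"))  # True
print(is_properly_nested("<p><b>Some text</p></b>"))  # False
```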
Let’s define the tags used in the previous example:
- <!DOCTYPE html>
This is a declaration at the very beginning of the document that states the document type: it tells the browser that what follows is an HTML page.
- <html>
Identifies the beginning and end of the entire HTML document. All other tags will be this tag’s “children.”
- <head> and <meta />
The head contains information for proper page interpretation and/or useful details for search engines to index the website correctly. In our example, in order of appearance, the meta tags declare that the page is encoded in UTF-8 (a character encoding that can represent practically every alphabet, not only Latin ones), and provide search engines such as Google with a description of the page and a reference to the author.
- <link />
Webpage source code is not always contained within the same document. For example, CSS rules are usually defined in external files, similar to support libraries. The link tag does not create hyperlinks that can be navigated to by the user, but rather loads other parts of the code necessary to display the page correctly. In our example we call the style.css file in the css subfolder.
- <body>
This is the heart of the webpage, and is what is shown on your browser, including directives, commands, and the true readable textual content.
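To see how these sections can be told apart programmatically, here is a rough sketch in Python using the standard library's html.parser module. It assumes the sample page is wrapped in the usual html, head, and body tags; the class name HeadReader and its attributes are invented for this illustration.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="description" content="A short webpage description" />
<meta name="author" content="Author Name" />
<link rel="stylesheet" type="text/css" href="css/style.css" />
</head>
<body>
Visible page content
</body>
</html>"""


class HeadReader(HTMLParser):
    """Collect the charset, named meta values, stylesheets, and body text."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.meta = {}
        self.stylesheets = []
        self.in_body = False
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            if "charset" in attributes:
                self.charset = attributes["charset"]
            elif "name" in attributes:
                self.meta[attributes["name"]] = attributes.get("content", "")
        elif tag == "link" and attributes.get("rel") == "stylesheet":
            self.stylesheets.append(attributes.get("href"))
        elif tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-empty text found between <body> and </body>.
        if self.in_body and data.strip():
            self.body_text.append(data.strip())


reader = HeadReader()
reader.feed(PAGE)
reader.close()

print(reader.charset)      # UTF-8
print(reader.meta)
print(reader.stylesheets)  # ['css/style.css']
print(reader.body_text)    # ['Visible page content']
```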
Selectors and Other Tags
Tags are already generic selectors in and of themselves. “html”, “head” and “body” are unique selectors that appear only once on a page, and can be used to capture elements. The most important task when defining a Recipe is to find the right selector. A basic understanding of the hierarchical structure described above, along with a smattering of the existing tags, is the key to choosing it.
Here’s a quick list of the most common tags:
- <h1> <h2> <h3> <h4> <h5> <h6>
Defines a title or subtitle.
The “h” in this case stands for heading. There are 6 predefined levels, from h1 (the most prominent) to h6.
- <p>
Defines a paragraph.
A paragraph is usually separated from previous and subsequent blocks by vertical spacing. It also generates a line break at the end of the paragraph itself.
Example: <p>In this chapter we will talk about ecology</p>
- <strong>
Defines text in bold.
In the past, the <b> tag was commonly used for this.
- <em>
Defines text in italics.
In the past, the <i> tag was commonly used for this.
- <span>
Defines a part of the text to which you usually want to assign an id or a class. On its own, however, it basically does nothing.
- <div>
Defines a block of the page to which you usually want to assign an id or a class. On its own, however, it does basically nothing different than the p tag.
- <br />
Creates a line break (carriage return).
- <img />
Allows you to display an image at the point where the tag is positioned relative to the rest of the text.
The file path is specified in the src attribute, while the (optional) description of the photo goes in the alt attribute.
Example: <img src="filename.jpg" alt="Description" />
- <a>
Defines elements linked to a URL. One or more strings and/or one or more images may be present between the opening and closing tags.
The URL is specified in the href attribute.
Example: <a href="https://www.google.com">Go to Google</a>
- <ul> <li>
Common structure for a bulleted list. The entire list is enclosed between ul tags while the child elements, each bullet point, are delimited by li tags.
Example: <ul><li>First element</li><li>Second element</li></ul>
- <table> <tr> <td>
Common structure for a table that starts and ends with the table tag. Each row of the table is enclosed in tr, while each cell within a row (one per column) is enclosed in td.
The following example defines a table with a single row and two columns, each containing a number.
Example: <table><tr><td>1</td><td>2</td></tr></table>
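As a rough illustration of how the content of these common tags can be picked out of a page, here is a minimal sketch in Python using the standard library's html.parser module. The names TagCollector and collect are invented for this illustration, the table values 1 and 2 are placeholders, and the sketch assumes the wanted tag is never nested inside itself.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the attributes and inner text of every occurrence of one tag."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.inside = False
        self.found = []  # one [attributes, text] pair per occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            self.inside = True
            self.found.append([dict(attrs), ""])

    def handle_endtag(self, tag):
        if tag == self.wanted:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.found[-1][1] += data


def collect(markup, tag):
    parser = TagCollector(tag)
    parser.feed(markup)
    parser.close()
    return [(attributes, text) for attributes, text in parser.found]


snippet = (
    '<a href="https://www.google.com">Go to Google</a>'
    "<ul><li>First element</li><li>Second element</li></ul>"
    "<table><tr><td>1</td><td>2</td></tr></table>"
)

print(collect(snippet, "a"))
print([text for _, text in collect(snippet, "li")])  # ['First element', 'Second element']
print([text for _, text in collect(snippet, "td")])  # ['1', '2']
```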
That’s the list. Your biggest takeaway is to remember that each tag can be given an id and one or more classes. Here are some practical examples of relative selectors to be used in Free or Batch Data Collector:
<img src="filename.jpg" id="mypicture" />
Selector: img#mypicture
Captures a single image, as ids are unique throughout the webpage.
<span id="mytext">Relevant note for the reader</span>
Selector: span#mytext
Only captures the text contained in the tag.
<div class="daysoftheweek">Monday</div>
Selector: div.daysoftheweek
Captures every div that carries this class (remember, a single class can be attributed to multiple tags). If three divs on the page share the class daysoftheweek, all three are captured.
<div class="daysoftheweek monday">Monday</div>
Selector: div.daysoftheweek.monday
Only captures the divs that carry both classes: of those three elements, just the one that is also marked monday.
<span id="mytext" class="redText">Relevant note for the reader</span>
Selector: span#mytext.redText
Only captures the text contained in the tag. Even though the class is not required, you can still specify it.
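For the curious, the logic behind selectors of the form tag#id.class can be sketched in a few lines of Python using only the standard library. This is an illustration of the matching rules, not the way Free or Batch Data Collector are implemented: the names parse_selector, SelectorMatcher, and select are invented here, the Tuesday and Wednesday elements are hypothetical additions to the sample, and nested tags of the same name are not handled.

```python
import re
from html.parser import HTMLParser


def parse_selector(selector):
    """Split 'tag#id.class1.class2' into (tag, id, set of classes)."""
    tag = re.match(r"[a-zA-Z][a-zA-Z0-9]*", selector)
    element_id = re.search(r"#([\w-]+)", selector)
    classes = set(re.findall(r"\.([\w-]+)", selector))
    return (
        tag.group(0) if tag else None,
        element_id.group(1) if element_id else None,
        classes,
    )


class SelectorMatcher(HTMLParser):
    """Collect the text of every element matching a simple selector."""

    def __init__(self, selector):
        super().__init__()
        self.tag, self.element_id, self.classes = parse_selector(selector)
        self.open_tag = None  # tag name of the match being captured, if any
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        element_classes = set((attributes.get("class") or "").split())
        if (
            (self.tag is None or tag == self.tag)
            and (self.element_id is None or attributes.get("id") == self.element_id)
            and self.classes <= element_classes  # every requested class is present
        ):
            self.open_tag = tag
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None:
            self.matches[-1] += data


def select(markup, selector):
    matcher = SelectorMatcher(selector)
    matcher.feed(markup)
    matcher.close()
    return matcher.matches


# Tuesday and Wednesday are hypothetical siblings added for the demonstration.
page = (
    '<div class="daysoftheweek monday">Monday</div>'
    '<div class="daysoftheweek">Tuesday</div>'
    '<div class="daysoftheweek">Wednesday</div>'
    '<span id="mytext" class="redText">Relevant note for the reader</span>'
)

print(select(page, "div.daysoftheweek"))         # ['Monday', 'Tuesday', 'Wednesday']
print(select(page, "div.daysoftheweek.monday"))  # ['Monday']
print(select(page, "span#mytext.redText"))       # ['Relevant note for the reader']
```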
Our HTML excursus ends here. For those who are interested, we recommend the W3Schools HTML tutorial (a popular learning site, not an official W3C resource), which can be found here: https://www.w3schools.com/html/.
Let’s now turn our focus towards creating your first Recipe, as well as how to build increasingly powerful selectors.