05. Recipes and Programming Languages | Batch Data Collector

In the previous chapters we introduced the term Recipe. Almost all scraping software uses this word to define the collection of settings used for capturing elements to be extracted from a webpage. Think of it as the record-type (or recordset) desired for the output file, or even more simply as the columns of the final file.

Free and Batch Data Collector extend this concept. Their Recipes integrate 1) a series of rules and 2) actions to be performed, so a Recipe covers not only raw data extraction but also “pre-processing” of that data before it is captured and exported.

Each column has the following Recipe capabilities:

Exclusion rules that discard collected data when it does not comply with specific conditions;
Full- or partial-string replacement functions to manipulate collected data;
String divider that splits a value wherever one or more given characters occur;
String sub-part selector to isolate desirable or undesirable pieces;
Addition of prefix and/or suffix to collected String values;
Text capture between two text sequences;
Automatic data or format detection of things like telephone numbers, email addresses, webpage language, webpage URL, requested URL…
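
To make these per-column capabilities more concrete, here is a minimal Python sketch of the same kinds of pre-processing steps. It is not the software's actual implementation or API: the function names (replace_text, split_and_pick, add_affixes, capture_between, is_excluded) and the sample value are assumptions invented for illustration.

```python
import re


def replace_text(value, old, new):
    """Full- or partial-string replacement."""
    return value.replace(old, new)


def split_and_pick(value, divider, index):
    """Divide a string on a divider and select one sub-part."""
    parts = value.split(divider)
    return parts[index].strip()


def add_affixes(value, prefix="", suffix=""):
    """Add a prefix and/or suffix to a collected value."""
    return f"{prefix}{value}{suffix}"


def capture_between(value, start, end):
    """Capture the text found between two text sequences, or None."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, value, re.DOTALL)
    return match.group(1) if match else None


def is_excluded(value, forbidden):
    """Exclusion rule: flag values that contain a forbidden sequence."""
    return forbidden in value


# Hypothetical raw value, as it might be captured from a webpage.
raw = "Price: 1.250,00 EUR | SKU: AB-123"

price_part = split_and_pick(raw, "|", 0)
amount = capture_between(price_part, "Price: ", " EUR")
normalized = replace_text(replace_text(amount, ".", ""), ",", ".")
labeled = add_affixes(normalized, prefix="EUR ")

print(labeled)  # EUR 1250.00
```

Chaining small, single-purpose steps like this mirrors the idea of applying several rules to one column before the value reaches the output file.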

As we work to align our active vocabulary, you should know that we’ll define any Internet resource accessible via an address as a URL (Uniform Resource Locator, basically any sequence of characters that defines a specific position in a computer network) or often more generally as a URI (Uniform Resource Identifier).
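
As a small illustration of what a URL is made of, the sketch below splits an address into its parts with Python's standard urllib.parse module. The address itself is a made-up example, not one used by the software.

```python
from urllib.parse import urlsplit

# Hypothetical address, chosen only for illustration.
address = "https://www.example.com/menu/dishes?page=2#dish_1"

parts = urlsplit(address)

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /menu/dishes
print(parts.query)     # page=2
print(parts.fragment)  # dish_1
```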

Remember that for each recordset, the list of Events (actions to be performed before, during and/or after the extraction) is also part of the Recipe. Common examples include clicking links, filling in forms, substituting parts of the code, or setting predefined loops, where a loop is the autonomous navigation of a preset series of pages, from the first to the last, with data extracted upon each page load. More complex workflows could include actions dependent on IF conditions, partial source code deletion, forced pauses, page scrolls, or even manipulation of the environment, such as executing JavaScript (a language widely used in webpages) or uploading an image.

Back on the subject of columns, each will refer to elements on the webpage called nodes. Nodes can be expressed through CSS selectors.

CSS (Cascading Style Sheets) is a language used to define the visual design of HTML, XHTML and XML elements. More simply, it’s the language that controls the style of any webpage. Before its introduction, the Internet was a messy collection of webpages whose code jumbled content with formatting instructions. Some of you will remember the Browser Wars, where each provider introduced their own set of instructions independently of others. The absence of concrete rules gave rise to a sort of programming anarchy which often left users feeling frustrated. Then, in 1996, W3C decided to standardize the playing field by creating the first ever set of real guidelines. Acceptance and adherence to these guidelines has taken nearly twenty years, finally putting an end to the competition for browser dominance, and today allows us to enjoy a web experience that is far more adherent to clear rules, and is increasingly usable by different means.

When a code writer decides to adhere to XHTML and CSS formats, they’re ultimately agreeing to respect hierarchical code logic, and to separate visual style elements from content. These new rules revolve around the use of a tag, or keyword, which is enclosed between the symbols “<” and “>”.

A few examples of predefined XHTML tags you may have seen are “<br />”, which tells the browser to insert a line break, or the tags “<p>” and “</p>”, which represent the opening and closing of a paragraph. We’re not here to list them all for you, because there are a ton and there’s Google. That said, we do want to underline one of the fundamental rules of XHTML established by the W3C, which is that all open tags must be closed. No exceptions. In well-formed XHTML you will never see a “<div>” tag without the corresponding “</div>” (an empty tag such as “<br />” closes itself with the trailing slash).
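
The “all open tags must be closed” rule can be checked mechanically. The following is a minimal sketch, assuming the markup is handed to a strict XML parser (Python's standard xml.etree.ElementTree); the helper name is_well_formed is our own and is not part of any scraping tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if every opened tag is properly closed and nested."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Every tag is closed, and <br /> closes itself.
print(is_well_formed("<div><p>Hello<br /></p></div>"))  # True

# The <p> tag is never closed, so a strict parser rejects the markup.
print(is_well_formed("<div><p>Hello</div>"))  # False
```

Real-world browsers are far more forgiving than this strict check, which is why badly closed markup still shows up on live pages.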

Many tags have attributes. Attributes are a set of details that describe an element. Let’s take a fictitious example, using the fancy “dish” element.

<dish weight="20" temperature="60" calories="360" ingredients="bellpepper, zucchini, eggplant, fish" id="dish_1" class="maincourse"></dish>

In the example above, weight, temperature, calories, ingredients, id and class are the attributes of the tag dish. The dish is correctly defined because its opening tag corresponds to the closing one (</dish>). This code is considered well-formed, suitable for any device and software, and its parts can be incorporated into a Recipe.
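
To see how software reads these attributes, here is a short sketch using Python's standard html.parser module. The class name AttributeCollector is an assumption made for this example; it simply records each opening tag together with its attributes.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


markup = (
    '<dish weight="20" temperature="60" calories="360" '
    'ingredients="bellpepper, zucchini, eggplant, fish" '
    'id="dish_1" class="maincourse"></dish>'
)

collector = AttributeCollector()
collector.feed(markup)
tag, attributes = collector.elements[0]

print(tag)                  # dish
print(attributes["id"])     # dish_1
print(attributes["class"])  # maincourse
```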

Among the attributes listed above, there are two not-so-fancy ones: id and class.

Ids are unique identifiers that lead back to one (and only one) tag within an HTML document. There cannot be two tags with the same id, unless you’re unfortunate enough to be evaluating some low-quality code.

Classes are simply element classes, or groups, united by similar graphic properties. For example, all links that share the same color and appearance within a webpage may belong to the same class. Or perhaps all similar buttons, or all titles of the body of text.

When reviewing or writing a CSS selector, you’ll notice that an id is always prefixed by the “#” symbol (hash mark or hashtag), while a class is preceded by the “.” symbol (period or dot).

Therefore, going back to the example above, we can attach our imaginary element <dish> to a Recipe through the following selectors:

dish
#dish_1
.maincourse
#dish_1.maincourse
dish#dish_1
dish.maincourse

There are tons of different ways, some unique, others not so much. The first one, for example, is hardly unique, as there could be lots of other <dish> tags. The second, because of our now-shared definition of id, is undoubtedly unique.
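
The selectors listed above can be tested against the <dish> element with a few lines of Python. This is a deliberately simplified sketch: it only understands a tag name, one “#id” and any number of “.class” parts, which is a tiny subset of real CSS selectors, and the function names parse_selector and matches are our own.

```python
import re


def parse_selector(selector):
    """Split a compound selector such as 'dish#dish_1.maincourse'."""
    tag_match = re.match(r"[A-Za-z][\w-]*", selector)
    id_match = re.search(r"#([\w-]+)", selector)
    classes = re.findall(r"\.([\w-]+)", selector)
    return {
        "tag": tag_match.group(0) if tag_match else None,
        "id": id_match.group(1) if id_match else None,
        "classes": classes,
    }


def matches(selector, tag, attributes):
    """Check whether one element satisfies every part of the selector."""
    parts = parse_selector(selector)
    element_classes = attributes.get("class", "").split()
    if parts["tag"] and parts["tag"] != tag:
        return False
    if parts["id"] and parts["id"] != attributes.get("id"):
        return False
    return all(name in element_classes for name in parts["classes"])


# The fictitious element from the example, reduced to id and class.
dish_attributes = {"id": "dish_1", "class": "maincourse"}

selectors = [
    "dish",
    "#dish_1",
    ".maincourse",
    "#dish_1.maincourse",
    "dish#dish_1",
    "dish.maincourse",
]

for selector in selectors:
    print(selector, matches(selector, "dish", dish_attributes))
```

All six selectors match the element. Only the ones containing “#dish_1” are guaranteed to match nothing else on a well-written page.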

Why choose a unique or repetitive element? It depends. If the goal of your Recipe is to capture text present only once on the page, the unique selector is obviously recommended. But sometimes things aren’t quite that simple, such as needing to retrieve some or all rows within a table, at which point you’ll need to use a repetitive selector.
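
The difference between a unique and a repetitive selector shows up in how many values come back. The sketch below assumes a made-up page and a hypothetical helper class, TextBySelector, built on Python's standard html.parser. It selects either by id (unique) or by tag name (repetitive), and it does not handle nested elements of the same tag.

```python
from html.parser import HTMLParser


class TextBySelector(HTMLParser):
    """Collect the text of elements selected by tag name and/or id."""

    def __init__(self, tag=None, element_id=None):
        super().__init__()
        self.tag = tag
        self.element_id = element_id
        self.results = []
        self._capturing_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._capturing_tag is not None:
            return  # already inside a selected element
        attributes = dict(attrs)
        tag_ok = self.tag is None or tag == self.tag
        id_ok = self.element_id is None or attributes.get("id") == self.element_id
        if tag_ok and id_ok:
            self._capturing_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capturing_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capturing_tag:
            # Join the pieces and normalize whitespace.
            text = " ".join(" ".join(self._buffer).split())
            self.results.append(text)
            self._capturing_tag = None


def extract(page, **selector):
    parser = TextBySelector(**selector)
    parser.feed(page)
    parser.close()
    return parser.results


# Hypothetical page: one title with an id, and a table with three rows.
page = """
<h1 id="menu_title">Today's menu</h1>
<table>
  <tr><td>Ratatouille</td><td>12</td></tr>
  <tr><td>Grilled fish</td><td>18</td></tr>
  <tr><td>Fruit salad</td><td>6</td></tr>
</table>
"""

print(extract(page, element_id="menu_title"))  # one value
print(extract(page, tag="tr"))                 # one value per table row
```

The id-based selection returns a single value, suitable for a column that appears once per page, while the tag-based selection returns one value for each row of the table.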