Part 3 - Data munging

Posted on Jan 23, 2024
(Last updated: May 26, 2024)

Introduction

In this part we’re going to discuss different data munging techniques.

Data munging is the process of transforming “raw” data into a readable format.

One of the most common use cases is scraping data from a website.

HTML

The web is built on Hypertext Markup Language.

An HTML document forms a tree of nested elements, marked with tags.

<p class="foo">
    bar
</p>

This creates a paragraph element.

<img src="url" alt="text">

This creates an image element.

Elements can have attributes that affect their behavior or appearance. Above, class, src, and alt are all attributes. For scraping, the class attribute is often the most useful one.

It is what a CSS file, for example, uses to select groups of elements, and we can use it the same way. The id attribute can also be used to uniquely identify an element.

In the paragraph element, we have some content in between the tags. This can be pure text as in the above case, but we can also have new tags in between, or nothing, as in the img tag.

Web scraping

Due to this convenient tree structure that HTML is built upon, information and content can easily be extracted from web pages.

This is called web scraping. In the simplest cases this can be done by writing some code by hand. On more modern and complex websites, where the HTML is automatically generated and not structured for human readers, libraries can do the job for us.
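As a rough illustration of the hand-written approach, the sketch below uses only Python's standard-library html.parser module to collect the text of every element with a given class. The names ClassTextExtractor and VOID_TAGS and the sample HTML are made up for this example, and the code ignores many real-world complications such as malformed markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a simplified, incomplete list).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that has the wanted class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0   # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


html = """
<p class="foo">
    bar
</p>
<img src="url" alt="text">
<p class="other">baz</p>
"""

parser = ClassTextExtractor("foo")
parser.feed(html)
print(parser.texts)  # ['bar']
```

Even this small example needs its own bookkeeping for nesting and void tags, which is the main reason to reach for a library on anything larger.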

Golden rule of web scraping:

If the user can read it, it can be scraped

BeautifulSoup

BeautifulSoup is a Python library that parses HTML (and XML) documents and builds a tree from the elements.

This lets us easily navigate the tree and access any element along with its parent, siblings, and children.
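A minimal sketch of this kind of navigation, assuming the bs4 package is installed and using Python's built-in "html.parser" backend. The sample HTML and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <p class="foo">bar</p>
  <img src="url" alt="text">
  <p class="foo">baz</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find every <p> element with class "foo" and read its text.
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="foo")]

# Find the first <img> element and read one of its attributes.
image = soup.find("img")
image_src = image["src"]

# Move around the tree: up to the parent, sideways to the next <p> sibling.
parent_id = image.parent["id"]
next_paragraph = image.find_next_sibling("p").get_text(strip=True)

print(paragraph_texts)  # ['bar', 'baz']
print(image_src)        # url
print(parent_id)        # main
print(next_paragraph)   # baz
```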

Selenium

However, modern web pages are often built with JavaScript and rendered in the end user’s web browser.

The HTML served for these websites isn’t necessarily a complete description of the data; instead, the data is loaded dynamically.

Selenium is a browser automation framework that is often used for testing web pages. It so happens that Selenium is also very convenient for web scraping, since we can perform user actions.

So we can click “I accept cookies”, “load next page” etc.
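The idea can be sketched as a small helper that takes an already-created Selenium WebDriver, opens a page, clicks a button by its visible text, and returns the rendered HTML. The function name, the default button text, and the XPath are assumptions for illustration. A real page will likely need a different locator and an explicit wait until the button has loaded.

```python
def accept_cookies_and_get_html(driver, url, button_text="I accept cookies"):
    """Open url, click the button containing button_text, return the page HTML.

    driver is expected to be a Selenium WebDriver instance. The string "xpath"
    is the value behind Selenium's By.XPATH constant.
    """
    driver.get(url)
    xpath = f"//button[contains(., '{button_text}')]"
    driver.find_element("xpath", xpath).click()
    # page_source holds the HTML as currently rendered by the browser,
    # including content that JavaScript has added.
    return driver.page_source


# Usage sketch (needs the selenium package and a browser driver installed):
#
#   from selenium import webdriver
#
#   driver = webdriver.Chrome()
#   try:
#       html = accept_cookies_and_get_html(driver, "https://example.com")
#   finally:
#       driver.quit()
```

The returned HTML can then be handed to a parser such as BeautifulSoup.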

Regular expressions

A regular expression or a regex is a sequence of characters that can match text.

We use regular expressions to:

  • Determine if a string matches a pattern completely
  • Find the first or all matches of a pattern
  • Extract groups that have been matched within the pattern
  • Replace the matched text with some other text or a new pattern composed of matched groups.

Matching characters

When matching characters, most characters simply match themselves, but some characters have special meaning:

  • . matches any character (except a newline, by default).
  • ^ matches start of line.
  • $ matches end of line.
  • [acf] matches any one of the characters a, c, f.
  • [a-z] matches any lowercase letter.
  • [A-Z] matches any uppercase letter.
  • [0-9] matches any digit.
  • \w matches alphanumeric characters (and the underscore).
  • \W matches non-alphanumeric characters.
  • \d matches digits.
  • \D matches non-digits.
  • \s matches whitespace.
  • \S matches non-whitespace.

Any special character that should be matched literally needs to be escaped with a \. E.g. \. matches a period.
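A few of these character matchers can be tried out with Python's re module. The sample text and variable names below are made up for the example.

```python
import re

text = "Order 66 shipped to A. Smith, ref. x9"

digits = re.findall(r"\d", text)          # every single digit
uppercase = re.findall(r"[A-Z]", text)    # every uppercase letter
periods = re.findall(r"\.", text)         # escaped: a literal period
x_and_next = re.findall(r"x.", text)      # "x" followed by any character
starts_with_order = re.search(r"^Order", text) is not None
ends_with_digit = re.search(r"\d$", text) is not None

print(digits)             # ['6', '6', '9']
print(uppercase)          # ['O', 'A', 'S']
print(periods)            # ['.', '.']
print(x_and_next)         # ['x9']
print(starts_with_order)  # True
print(ends_with_digit)    # True
```

The patterns are written as raw strings (r"...") so that Python itself does not interpret the backslashes.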

Repetitions

By default, exactly one character is matched, but this behavior can be changed:

  • * matches 0 or more occurrences of the preceding character
  • + matches 1 or more
  • ? matches 0 or 1
  • {m,n} matches at least m but no more than n occurrences of the preceding character (write it without a space after the comma, otherwise Python treats the braces as literal text).
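The four repetition operators can be compared side by side on the same input. The sample string is invented for the example.

```python
import re

words = "ct cat caat caaat"

zero_or_more = re.findall(r"ca*t", words)      # any number of a's, including none
one_or_more = re.findall(r"ca+t", words)       # at least one a
zero_or_one = re.findall(r"ca?t", words)       # at most one a
two_to_three = re.findall(r"ca{2,3}t", words)  # two or three a's

print(zero_or_more)  # ['ct', 'cat', 'caat', 'caaat']
print(one_or_more)   # ['cat', 'caat', 'caaat']
print(zero_or_one)   # ['ct', 'cat']
print(two_to_three)  # ['caat', 'caaat']
```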

Groups

Regexes can include groups, which are sub-regexes within parentheses.

Groups can include alternatives, separated by the pipe |. Exactly one of these alternatives is matched.

When doing replacements, groups can be referenced with backreferences, e.g. \1 refers to the text matched by the first group.
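Groups, alternatives, and backreferences might be combined as in the sketch below. The file names and the name-swapping example are invented for illustration.

```python
import re

# Two groups: the file name, and an alternative between two extensions.
image_pattern = r"(\w+)\.(jpg|png)"

match = re.search(image_pattern, "see photo.png for details")
whole = match.group(0)       # the full match
name = match.group(1)        # first group
extension = match.group(2)   # second group: whichever alternative matched

no_match = re.search(image_pattern, "see notes.txt for details")

# Backreferences in the replacement: \1 and \2 refer to the matched groups.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Smith, Anna")

print(whole)      # photo.png
print(name)       # photo
print(extension)  # png
print(no_match)   # None
print(swapped)    # Anna Smith
```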

Examples

Swedish social security numbers are written in the format yyyymmddxxxx.

We could match the general shape of this with:

[12]\d{3}[01]\d[0-3]\d{5}
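A quick way to see what this pattern accepts is to run it against a few strings. The helper name and the sample numbers below are made up, and the check only tests the shape of the string, not whether the date or the final digits are actually valid.

```python
import re

# The pattern from the text: yyyy, mm, dd, then four more digits.
PATTERN = re.compile(r"[12]\d{3}[01]\d[0-3]\d{5}")


def has_expected_shape(candidate):
    """Return True if the whole string has the yyyymmddxxxx shape."""
    return PATTERN.fullmatch(candidate) is not None


plain = has_expected_shape("199001011234")
with_hyphen = has_expected_shape("19900101-1234")
day_41 = has_expected_shape("199001411234")
month_19_day_39 = has_expected_shape("199019391234")

print(plain)            # True
print(with_hyphen)      # False: the pattern has no hyphen
print(day_41)           # False: the first day digit must be 0-3
print(month_19_day_39)  # True: the pattern does not know real calendars
```

The last case shows why this is only a first filter: month 19 and day 39 pass, so a stricter check would need either a more detailed regex or a proper date parser.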

Regex in Python

The most important functions in Python’s re module are:

  • re.match(regex, string) returns a Match object if the beginning of the string matches the regex, otherwise None. A Match object evaluates to True.
  • re.fullmatch(regex, string) returns a Match object if the whole string matches the regex, otherwise None.
  • re.search(regex, string) returns a Match object for the first match of the regex anywhere in the string, otherwise None.
  • re.findall(regex, string) returns a list of all non-overlapping matches of the regex in the string, scanning left to right.
  • re.sub(regex, replacement, string) returns a copy of the string with all occurrences of the regex replaced by replacement. The replacement can contain backreferences to groups in the match.
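The five functions can be compared on one small input. The sample text and variable names are invented for this sketch.

```python
import re

text = "2024-01-23 and 2024-05-26"
date = r"\d{4}-\d{2}-\d{2}"

# match: only looks at the beginning of the string.
at_start = re.match(date, text).group()

# fullmatch: the whole string must match, so this returns None here.
whole_text = re.fullmatch(date, text)
whole_date = re.fullmatch(date, "2024-01-23") is not None

# search: the first match anywhere in the string.
in_may = re.search(r"\d{4}-05-\d{2}", text).group()

# findall: every non-overlapping match, left to right.
all_dates = re.findall(date, text)

# sub: replace every match; \1, \2, \3 refer to the three groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3.\2.\1", text)

print(at_start)    # 2024-01-23
print(whole_text)  # None
print(whole_date)  # True
print(in_may)      # 2024-05-26
print(all_dates)   # ['2024-01-23', '2024-05-26']
print(reordered)   # 23.01.2024 and 26.05.2024
```

Because match, fullmatch, and search return None when nothing matches, real code should check the result before calling .group() on it.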