String parsing is a common task for data analysts and data engineers working with raw data. With the growth of unstructured qualitative data, parsing strings efficiently has become increasingly important for fast analysis.
One common use of string parsing is parsing URLs for marketing channel attribution. With the numerous sources and all the effort and money that goes into marketing, it is essential to be able to accurately attribute activity.
Common String Searching Methods
Equality is denoted with an equal sign and tests if one string exactly matches a second string. Equality is a great string searching algorithm if you are trying to match the whole string with a specified value. For even better performance, you can index the column you are searching for lightening fast search speeds.
Equality Equality is denoted with an equal sign and tests if one string exactly matches a second string. Equality is a great string searching algorithm if you are trying to match the whole string with a specified value. For even better performance, you can index the column you are searching for lightening fast search speeds.
The position function returns the position of the first occurrence of a substring in a given string. If the substring is not found, the result is 0. Since the position function is searching for a defined string, without considering any leading or trailing characters, it is pretty efficient.
select count(1) from pings where position('https://www.sisense.com' in url) > 0
This is not the same query as equality. Where equality matches the entire column, position matches anywhere in the url string. With the position function, indexing does not help. Each row of the column must be evaluated in order to determine if the substring is contained within.
The like operator matches the rows in a column based on a string expression. Like has a couple of wildcard characters that are very useful:
- The ‘%’ character matches any sequence of 0 or more characters.
- The ‘_’ character matches any single character.
Some common examples:
- like ‘awesome’ matches a string that is exactly awesome.
- like ‘S%’ matches any string starting with S.
- like ‘s_s’ matches any three character string starting and ending with s.
- like ‘s%s’ matches any string that both starts and ends with an s.
Here, we use like to find pings starting with our domain:
select count(1) from pings where url like 'https://www.sisense.com%'
and URLs that contain our domain:
select count(1) from pings where url like '%sisense.com%'
Regular expressions (regex) are the most expensive of these string searching methods, but also the most flexible. It is possible to write regular expressions that are extremely complex and take a lot of resources to compute.
For this post, we’ll keep it simple with our regular expressions. Let’s take a look at a common regex task that pulls ad set data from marketing data.
Our marketing ping urls look something like: ‘https://www.sisense.com/ab?source=facebook&ad_set=db-size-1-3&ad=dashboard-2’.
For the purposes of our analysis, we only want to see the ad set details. To do this, we’ll use the ~ regex operator to match only the data we want.
select count(1) from pings where url ~ 'ad_set=[_a-z0-9-]+''
In this example, we filter selected rows on the url column. The expression matches urls with the regular expression ad_set=[_a-z0-9-]+.
Since the start or end of the string is not indicated in the expression, it will match anywhere in the provided string, in this case the url. The bit of syntax that is of particular interest is [_a-z0-9-]+.
Square parentheses denote a set of characters to match. In this case, we match anything that is an underscore, between a and z inclusive, characters 1 to 9 inclusive, and a dash. These are denoted by:
- _ matches an underscore.
- a-z matches any character between a and z inclusive.
- 1-9 matches any character between 1 and 9 inclusive.
- – matches the dash character. Usually a backslash is required to match a dash because the dash is also used to communicate a range in matched values. It is not required in our example above since the dash is at the end of the expression.
Putting it all together, the square parentheses specify a set of characters that will be matched. The + at the end of the expression indicates that it will match one or more characters that are indicated within the square parentheses.
With these techniques under your belt, you can be parsing strings for marketing and other analysis in no time!