Before I start, let me say that I am looking forward to seeing lots of my readers next week at Inspire. You will most often find me in the Solutions Center. Please don’t hesitate to come ask questions, give suggestions or just chat. I love the opportunity to teach & learn.
Recently I got a question on our internal support board: How do I parse a file that is Ctrl-A delimited? Normally for reading delimited files you just read the file as a CSV and set the delimiter in the input settings and you are done. The problem with Ctrl-A is that it is a special (unprintable) character and it is not possible to set in the GUI. The normal backup for parsing issues like this – the Text To Columns tool – has the same limitation.
The answer of course is to use the Regular Expression tool which has all kinds of flexibility, but has some increased complexity as well. Specifically there is a tokenize mode that is designed for exactly this. The tokenize mode was designed for ultimate flexibility, but it has confused many people. People are used to the Text To Columns tool (or the input) where they specify the delimiter, or another way of thinking of it is that they specify the characters that they don’t want. The tokenize method is exactly the opposite, you specify what you do want. At first this seems very confusing, but once you understand it you start seeing parsing problems in a whole new light.
Before I get started with specific examples, any time you are using regular expressions it is nice to test them 1st. There is a super handy web site for just that: http://regexr.com/ This allows you to put in some example text and a regular expression and it will show you what matches and what doesn’t. Pay particular attention to the mouse over on the matches – it gives you all kinds of useful information.
Example 1: Parse a Ctrl-A delimited field
Hence, the regex for tokenizing a Ctrl-A delimited string is simply:
Breaking down the regex:
- [^…] The brackets specify a match to a single character in a set of characters. Starting the set with ^ changes the it to match any character not in the set.
- \cA This simply matches the Ctrl-A character.
- + This causes the match to match 1 or more of the previous.
In short we have said match as many characters that are not Ctrl-A as we can. The tokenize will then skip over any characters that don’t match the pattern, i.e. it skips the delimiter. See Example 1 in the in this module.
Example 2: Allowing blank tokens
For many situations the above example works perfectly, but sometimes blanks matter. For instance, with the following input text:
I want 3 fields out, not 2. How do I match the second field, since I just said you have to match what you want, not what you don’t? The answer is you have to include the delimiter in the match, making a slightly more complicated regular expression:
Breaking down this regex:
- (…) Parenthesis create a marked group of the expression. One feature of the tokenize mode is that it will allow you to match a bigger part of the input field, but only return the subset that you marked. This way we avoid returning the delimiter. You may only have 1 marked expression.
- [^,] Like above, match any character other than a ,
- * Match 0 or more of the previous. This differs from Example 1 in that we allow an empty set. You couldn’t end here though, because the regex engine doesn’t like a match of 0 characters, because there would be an infinite # of matches, so we have to terminate on something.
- (?:…) This is an unmarked group. We need this for the or (|) next
- | This is just saying match either the thing before or after, but not both. This almost always needs to be used with a marked or unmarked group.
- $ Matches the end of the string. Hence (?:,|$) matches up to a , or the end of the string.
To see this in action, see Example 2 in the in this module.
Example 3: HTML Links
A very common task that people want to do is parsing HTML documents. I have seem all kinds of crazy difficult ways of parsing them, but the RegEx tokenize makes it super easy. For this example I am going to parse links out of the home page of my blog. The expression is actually very simple:
Breaking down this regex:
- <a This is just match the literal text <a
- .*? . is any character, * is 0 or more, the ? is where it gets interesting. The ? modifies the * to make in non-greedy. What this means is to match to the shortest possible match. Without this, the expression might find a single token from the beginning of the 1st link to the end of the last.
- > Again, just matches the literal > in the HTML data.
- .*? Again, the shortest possible match of any characters that still satisfies the entire regex.
- </a> Again, just the literal characters </a>. This ends the match.
Almost like magic, this pulls every link out of a large HTML document into a series of records (or fields). In Example 3, I follow it up with another RegEx tool in Parse mode to extract the link target & description. This example is the most interesting to me, because by specifying just what we want we were able to extract interesting information and leave behind a sea of junk text. To see this in action, see Example 3 in the in this module.