One of the really great strengths of Alteryx is that is can handle any amount of data that you throw at it. If your data is small enough, it might all be in memory, but when Alteryx gets more data than fits, it silently swaps out to disk. This way people are routinely processing data sets that are 2, 10 or even 100 times bigger than they have enough memory for!
Mostly the user never notices this aspect of the Alteryx engine and it just works. There are times though when we get feature requests that would be much easier to implement if all the data was in memory. One example of that is aggregate functions in the formula tool. Since other desktop products that are similarly easy to use, like Tableau and Excel, have simple SUM and AVG type functions in their formulas, it is assumed that Alteryx would too.
When all the data is in memory, implementing this type of function is very straightforward. It gets quite a bit more complicated when the data can be arbitrarily large. While I would like to get this into Alteryx as a built in feature at some point, for now I’ve used it as another excuse to show off just how powerful the macro system in Alteryx is.
If you are just interested in using the Aggregate Formula and Filter tools, go ahead and download them here. Otherwise read on.
How it Works
You will note that for this macro I’ve made a few compromises in usability to make the parsing easier. All the aggregate functions start with AGG_ and I use { and } instead of ( and ). Because of the limitations of regular expression parsing, recursively parsing parentheses is difficult, so the little cheat of curly braces seemed worth it.
The aggregate functions only need to be calculated once. Presuming there are 125 widgets in all rows, we need to change:
[NumWidgets]/Agg_Sum{[NumWidgets]}
to
[NumWidgets]/125
The first (and more difficult) side of the aggregate formula is to parse the user’s expression and pull out all the aggregate expressions. The aggregates can contain any formula you would like – like Agg_Sum{([a]+[b])/[c]} – so we need to pull out the inner expressions and pre-calculate them. Like many parsing problems, regular expressions come to the rescue. In particular, I am using a RegEx tool in tokenize mode with a cool little trick:
(?:Agg_.*?\{.*?\})|(?:.*?(?=Agg_))
RegEx Tokenize is doing a positive look for matching tokens. I want to match everything in the string, but break them into rows when I find the aggregate functions. In case you have never seen the (?=…) syntax, that is a negative lookahead. This is really cool with the RegEx tokenize because it allows me to parse all the non-aggregate parts without bleeding over into the aggregate side. A very advanced RegEx trick, but one that is worth learning.
Once the expression is all parsed out, the next trick is to actually do the processing. Here I used a helper macro to do the work. The macro itself is very simple, basically a pre-formula for the insides of the aggregate functions, a summarize to actually perform the aggregates and then a formula to put it all together. The difficulty is that I don’t know ahead of time how many aggregate functions are used in the expression.
When writing Alteryx macros, there are a bunch of actions you can use to update a module. Most of them are nice and easy to use, like update a cell in a text input. When the simple (and easy to use) actions don’t work, there is 1 more catch all for when the easy ones don’t work: Update Raw XML. The advantage of it is that you can absolutely change anything about how a tool is configured. This allows me to update the formula and summarize tools with a variable number of rows. A little bit of a pain to set up, but it allows you to do almost anything within a macro. Using regular Alteryx formula and summarize I produce the full inner XML for the formula and summarize tools in the helper macro.
So in the end, it is a really small helper macro that actually does all the work. The outer macro is just there to get everything set up and ready to run.
You will also find an Aggregate Filter tool in the package. It is very simple – it just uses the aggregate formula to produce a new field and then filter on it – removing the new tool in the macro.
Like many things in the future, these macros will hopefully serve as prototypes for future features that get integrated into the product.
The way I set up the macros, it should work with any numeric function in the Summarize tool. As an exercise for the reader, what would it take to extend this to work with string and spatial functions?
Again, you can download the package here.
Thanks for reading,
Ned.