One of the most common questions I get about Alteryx is: “How can I make my module run faster?” Although Alteryx can be very fast, since it is such a general tool, it is only as good as the module that you have authored. There is a very simple guideline that you can follow to make a module faster: do less work. The most common example of doing less work is to use the select tool as early as possible to remove fields that you are no longer using. In order to walk you through the process I use to make an Alteryx module run faster, I am going to walk through the process of optimizing my Percentile Macro to run as fast as possible.
Benchmarking
It is important to know if you are improving your process or not. The 1st step to making something faster is to know how fast it is now so that you can assess the improvement. This is complicated by the fact that the operating system likes to cache things so the next time you use a file it will go faster. What this means is that to get an accurate benchmark, you need to either make sure your data is in the cache or out of the cache. The only easy was to make sure your data is not in the cache is to reboot between runs of you modules. Since this is such a pain, I almost always go the other direction. To make sure that data is in the cache, you simply can run your module twice and then only record the speed on the second run. In some cases there can be some variation in the speed based on other processes running or the state of the cache, in which case I will run 3 to 5 times and then take the fastest time. Why the fastest? Because that tends to be the most repeatable.
The next step is selecting what you are going to benchmark. If your process takes hours, that can be a pain to work with. I like to reduce the data so my benchmark test takes about 10 seconds to run. Any shorter than 10 seconds tends to have too much variation. If your module has lots of data and disk I/O it make be worthwhile to set up a benchmark that runs for a minute or two, but any longer than that is not necessary.
In this case I chose a D&B extract from Pennsylvania and am computing the Percentile Rank on “Employees Here.” I figured this would be a fairly typical use case of this macro. This test ran in 9.8 seconds.
Optimizing the Percentile Macro
So now we need to make this module do less work. Each of these steps in annotated in the module so you can open the module and follow along.
- Step 1: This module has 2 sorts. Sorting data can be expensive, so the less data we have to sort, the better. A cool technique to avoid this is to remove data not needed in the middle of the process and then add it back in later.
- Add a select tool in the beginning deselecting all “unknown” fields and then add a join at the end using “join by record position” to put them back. 7.8 seconds.
- Step 2: The last sort tool has 5 more fields than it needs. They are selected away immediately after the sort, so why not do it before.
- Added 1 select tool – 7.7 seconds. A modest improvement, but why not?
- Step 3: The original macro copied the input field in order to avoid messing with the field order. Now we don’t need to since we can easily patch it all up in the end.
- Delete formula tool. Connect update action to select tool. Update Select to rename __InputValue to __Value. 6.8 seconds.
- Step 4: Make data as small as possible. Since we are rounding the output percentile to 2 decimal places with no more than 3 on the left, it will fit in a float.
- Change the field type to float. 6.7 seconds.
That is all I see for this module… There were not that many tools in this module, but with a few small changes I was able to make it run in 32% less time – or 46% more records per hour. That is significant.
The new macro can be downloaded here – but I did not include the data I used for benchmarking.
Thanks for Reading,
Ned.
Pingback: Alteryx: Percentile Macro | Inspiring Ingenuity