or how do I get free spatial data for Alteryx…
I keep having this big idea for a blog post, but getting sidetracked by things that I need before I start. Its always nice to share, and I figure many other Alteryx users would have the same needs as me. Last weeks post, Weighted Medians, was one example of this. This week, for the next step in what I have been working on, I needed a US Block polygon file. Blocks are the lowest level for which Census data is tabulated. The census helpfully makes the polygons available for download on their ftp site, but the data is in a separate ZIP file per state containing SHP files. What I really want is a single YXDB file for the entire layer. Downloading 50+ files and then unzipping and converting by hand sounds like a lot of work. I am a programmer and programmer’s by nature are lazy, so I wanted to make this process easier.
Just because I want Blocks today, doesn’t mean I won’t want some other layer tomorrow. The TIGER data has all kinds of useful layers from roads to landmarks to water and more. On any given day I could easily want any of those layers. The answer of course is to create an Analytic App in Alteryx. You can see by the screenshot on the right, that it is very simple to use, just pick a layer and an output file and away you go. However, when downloading a layer like the Blocks you do need to be patient – it could be pulling multiple GBs from the FTP site and may take a while to download.
So – if you are just looking for some free spatial data, you can go ahead and get the app here and download away.
Technical Details:
In order to download all the files, there are 2 download tools required – 1 to download the directory of files and then a second one to download each individual ZIP file in batch.
We now need to extract and read each individual file. To do this, I chose to create a Alteryx batch macro. The way batch macros work is they get re-run from start to finish for each record coming in to the control input (the first input – the one labeled ¿) . This made writing it easy, because I only need to think about unzipping and reading 1 file at a time. A lot of people think of batch macros as super advanced and difficult, but really they make a bunch of things much easier. I may have been able to do this without the batch macro, but then I would have had to think about all the files coming through one stream. With the batch macro I figure out how to solve my problem for one file (or group, or item or whatever you are working on) and then easily extend it to work on any size set of files (or groups, or…) The batch macro part of this module was far and away the easiest part to create.
Like the Reading XLSX Directly, the unzipping part is handled by a RunCommand tool. Since there is only 1 SHP file to be read by each batch iteration, the reading can be done via the output portion of the RunCommand tool. In order to place a little less size pressure on the temp directory, since some of these layers can be very big, I added a second RunCommand tool to delete the unzipped temporary files. Since I am writing inside of %TEMP%, I could have skipped this step and Alteryx would clean up after me, but not until the entire module is entirely done running.
Again, the modules for downloading layers from the Census can be found here.
Thanks for reading,
ned.
September 11, 2013 at 2:54 pm
Hi Ned,
Great post! I have a question for you…
Is there a short explanation of how your RegEx expression works? More specifically, I’m stumped at how the initial match knows to stop at the final single whitespace character rather than the initial one. I tried figuring it out myself by writing some expressions which I thought were 1:1 translations: .*\s{1}(.*) and .*\s(.*) . As I’m sure you already know neither worked. This one did work: .*?(\w+[.].*?)$ but I now realize that it’s much more rigid than yours: .* (.*)
Thanks and keep the blogs comin’!!
Garth
September 11, 2013 at 3:04 pm
Good question – and probably worthy of a blog post on its own. Regular expressions are so easy to mess up, I usually try to keep them as simple as possible. Just to remind, the string in question is:
“-rwxrw-r-x 1 15952 8188 8599274 Aug 16 14:29 tl_2013_us_state.zip”
and the regular expression is “.* (.*)” The goal is to match the file name, which is the final part of the string that is delimited by a variable number of spaces. Since the .* in “.* ” is greedy, it will match the longest possible match, which always has to go to the very last space since . can match anything including spaces. The (.*) is a marked expression that matches everything left over which is the file name we care about.
Pingback: Alteryx: Processing Groups of Records in a Macro | Ned Harding
Pingback: Month in Review: September | Ned Harding
Pingback: Alteryx: Big Data and Current Events | Ned Harding