Inspiring Ingenuity

Alteryx, Bicycles and Teaching Kids Programming.

Alteryx: Ensuring fields are in a data stream

3 Comments

This week’s question:  How do I ensure fields are in a data stream?

This is a common issue when you are reading data that changes from time to time.  If you use a Crosstab Tool or are reading from an XML file, the output fields can change based on the presence (or absence) of specific input data.  This makes it very hard to write formulas, or really do any down stream processing on those fields.  In particular, what we want to do is ensure fields are present, but NOT change them if they are already in the record.

imageSo to start with the answer:  EnsureFieldsOnDataStream.yxzp is a macro that makes this easy and adds very little overhead whether or not the fields are already present in the record.  It wraps it up into a simple tool with a simple user interface.  All you have to do is specify a list of fields and the type.  Hopefully this leverages the macro ability in Alteryx to make a tool that is as good as it would have been if it was written in native code.

Technical Details:

The basis of adding the fields is the Union tool.  The Union tool’s main purpose is to combine 2 streams together into a single, longer, stream.  It has the side effect of adding all the fields from all the inputs into the output.  I utilize this side effect here with a second stream with no records, that just has the fields the user wants to add.  If the fields were already present in the original stream, it has no effect.

I used a text input tool with no records in order to create the fields.  Since the Text Input auto types its fields, with no records they all default to bool so it needs to be followed by a select tool to convert all the new fields to the proper type.

The macro would have been very easy if it asked for a single field.  In that case it would just be simple actions to update the field name in the Text Input tool and the Select tool.  Since I chose to make the tool accept any number of fields it was a little more complicated.  In this case it has to use Update Raw XML actions and a slightly complex formula to manufacture the same XML that the GUI would have; had I specified multiple fields.  If you are an advanced Alteryx developer, it is probably worth opening the macro to see how this works.  This is a rarely used action (I hope) but it enables building macros that do all kinds of interesting things that would be impossible otherwise.

As a side note:  I had a choice of connecting the ‘no record text input’ 1st or 2nd into the Union.  Aesthetically, connecting it 2nd would have been nicer since it would add the new fields at the end and not change the order of existing fields.  The downside is that it would cause the data in the stream to have to be cached to a temporary file (or memory if it fits) and stall processing.  Definitely doesn’t meet the goal of low overhead.

Thanks for listening,

Ned.

If you don’t yet have a copy of Alteryx yet, by all means go get one.  Its free (for limited use)…  http://www.alteryx.com/download

Advertisements

3 thoughts on “Alteryx: Ensuring fields are in a data stream

  1. Just an FYI – there is a small bug in the attached macro. If you select String or WString, it ends up creating HUGE fields. Better to pick V_String or V_WString.

  2. Thanks Ned, especially the side note, that’s not something I had really appreciated before. Definitely something I’ll watch out for in future.

    I’m definitely a big fan of the update raw XML action, I think the action really gives an advanced developer the ability to push Alteryx above and beyond on those rare occasions where you can’t get do what you need through more traditional methods.

  3. Chris,

    Thanks for reading! (and commenting!) Usually that behavior of the union tool is not something to worry about because the streams might be similar size. In order to figure out the full field info it has to wait for the start of both streams of data which usually (but not always) means one of them has to get completely cached. In this case it matters because one of the streams has no records, so clearly it is more efficient for that one to be 1st.

    ned.