Sometimes, it seems that Excel is the most popular database format for our customers. My response has always been that Excel is not a database, but at the end of the day it doesn’t matter – they have data there that they want to use in Alteryx. In reality I am sure other formats get used more, its just that I hear about Excel so often because there are so many issues.
Our support for Excel files is through a Microsoft driver. In the case of modern Excel (XLSX files) that support comes from the Access Database Engine. There are a bunch of issues with this; if you have a 32bit version of office installed, you can only have the 32bit driver which makes us have to jump through additional hoops if you have a 64 bit Alteryx. For various reasons, it is a hard one for us to install as part of Alteryx, so it is left to the user. This leaves these files unreadable on many machines.
Recently I got the following on our internal support email:
… The issue here is that when you bring the data of the spreadsheet into Alteryx, it rounds up the decimal points to fewer than from data that is showing in Excel. Where the spreadsheet shows 5 decimal points, Alteryx only brings in 4 decimal points. The user doesn’t need the entire decimal range, but at least wants to have the 5 decimals of data. …
It’s issues like this that we have no control over. If you bring the same file into Microsoft Access, you see the same issue so it is clearly an issue in the underlying driver. There are many other issues with the MS driver, like having a mixed type column that doesn’t have any strings in the 1st 16 rows…
Frustrated with always answering the same thing: “Its an issue in the Microsoft Driver,” I decided to see what I could do for this customer. XLSX files are actually a ZIP file containing a bunch of XML files. I decided to build a macro to read XLSX files directly, without need of the Microsoft driver. While it is not production worthy (no tester has ever looked at it), it seems to work pretty well. I tried to make it as full featured as possible – you can see the options on the right. The one downside is that I have no way of getting the sheet names in the macro interface – you have to specify a sheet number. It will show you all the sheet names to the output log when you run, so its easy to figure them out.
If this is what you need, go ahead and Download ReadXlsx.yxzp. For more information, continue reading…
Since the file is a ZIP file, the 1st step is unzipping it. I used unzip.exe from info-zip which is embedded into the package. When unzipping files or otherwise creating temporary files in Alteryx it is a good idea to put them in a directory starting with %TEMP%. That will be the Alteryx specific temp directory and in fact it creates a new directory every time you run a module. This allows Alteryx to clean up after us so we don’t have to worry about deleting the files.
XLSX files are not simple XML files that contain the data you need. The individual cells have a type, which can refer to a string, or a shared string. If it is a shared string, then the value is in the sharedStrings.xml file. The problem is that if a document has no shared strings, than that file does not even exist. Alteryx doesn’t deal with failure to read a file very gracefully; it essentially stops processing at that point. In order to get around this I use a formula to create the path with the FileExisits function to check for existence, and if not present point it to another file that is always in the XLSX file. After this, I can use a DynamicInput to actually read the file.
The XML parse only outputs fields corresponding to attributes and children that it sees in the document, where Excel tries to save the document with as little XML as possible, probably to make it smaller. I use last weeks macro, Ensuring fields are in a data stream, several times in this macro in order to make sure that fields are present.
One of the bigger challenges was Date/Time values. Date and Time fields are stored as numbers in Excel. The date is represented as the number of days since December 30th, 1899 and the time is the fractional portion of the number. The reason it is the 30th vs. the 31st is interesting – it actually goes back to a bug in Lotus 123! I used the 30th as a cheat to avoid the whole issue, but it does mean that dates before March 1st, 1900 will be off by a day. It would be easy to fix if anyone needed dates from then. In order to know that a particular number is a date or time, we have to look at the style which means we have to load styles.xml from the ZIP. In reading the styles, the field we are interested in is numFmtId. This value sometimes can be user defined and other times coming from a standard list. Some documentation of the values can be found here. In particular, we are concerned with #’s 14 & 15 (Dates), 18 to 20 (Times) and 21 (DateTime).
Properly identifying string fields was interesting. I wanted to use an AutoField tool to properly select types, but some #’s, like ZIP codes, are entered in to Excel as strings and you really want them to stay that way. In order to force the AutoField to leave these fields as strings, I added some punctuation to the front of them that I later remove. The net effect is that field types are all recognized optimally, even for mixed type columns.
Once again, you can go ahead and Download ReadXlsx.yxzp. If you run in to any issues, or just want to make comments about how this works, leave a comment here. This is not officially supported at all.
Thanks for reading,
If you don’t yet have a copy of Alteryx yet, by all means go get one. Its free (for limited use)… http://www.alteryx.com/download