Inspiring Ingenuity

Alteryx, Bicycles and Teaching Kids Programming.

COVID-19 Data

Update – new data for 2020-03-24: I have no words!

Update – new data for 2020-03-23: More of the same – not looking good.

Update – new data for 2020-03-22: Lots more cases and older people getting hit harder than ever.

Update – new data as of 2020-03-21: It shows that it continues to be a big problem for the older population.

Wow – I did not think that I was ever going to be writing another Alteryx blog post… I have been out of day-to-day operations for a few years and, starting this year, I have no association with Alteryx at all. However, it is still the best data analysis tool out there and Alteryx has been kind enough to let me keep a license. If you just want the conclusion, here is my Colorado COVID-19 report.

Like everyone, I have been obsessively following the COVID-19 data. We are all scared. Colorado has upped its game on data reporting for this crisis with its portal at https://covid19.colorado.gov/data, but I have a few issues with it. While Tableau produces very nice looking charts and maps, Tableau is not the best way to publish high volume data. The site is crashing periodically and having various issues. If you look at the number of network requests it takes to serve up this one page, it is insane. I have argued for years that static reports are generally more appropriate than interactive ones, for this reason among many. If the report were just a static PDF, it would be much easier to put on a content distribution network.

My second issue with the Colorado portal was one particular chart. The reported positive tests by age just end up looking like an age histogram of the state. I was afraid that this diminished the threat of this virus to older people. See the chart below (copied from https://covid19.colorado.gov/data on 2020-03-21):

It kind of makes it look like there is no problem with the older crowd. I found this hard to believe. I am not an expert on data visualization (I am a software architect/programmer), but I do know a thing or two. Most important is that reporting absolute numbers can often be misleading. It is always better to normalize values. In this case a chart with infection rate instead of raw numbers paints a very different picture:

In this case it is very obvious that older individuals are not getting this disease at a lower rate. And the hospitalization rate for older individuals is very high.
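
To make the normalization concrete, here is a minimal sketch in Python (not the Alteryx workflow used for the report). The age groups, case counts, and population figures are made-up placeholders, not the Colorado data, and `rate_per_100k` is a hypothetical helper name. The point is only that the group with the fewest raw cases can still have the highest rate once you divide by its population.

```python
from typing import Dict


def rate_per_100k(cases: Dict[str, int], population: Dict[str, int]) -> Dict[str, float]:
    """Return cases per 100,000 residents for each age group."""
    return {
        group: count / population[group] * 100_000
        for group, count in cases.items()
    }


# Hypothetical numbers for illustration only; these are NOT the Colorado figures.
cases = {"20-29": 120, "50-59": 110, "80+": 40}
population = {"20-29": 800_000, "50-59": 700_000, "80+": 150_000}

rates = rate_per_100k(cases, population)
for group, rate in rates.items():
    print(f"{group}: {cases[group]} cases, {rate:.1f} per 100k")
```

With these placeholder inputs the 80+ group has the smallest raw count but the largest rate, which is the kind of reversal a raw-count chart hides.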

So long story short, I decided to make my own static report on COVID-19 for the state of Colorado. I used the data from the Colorado open data portal: here. This data seems to be 1 day out of date for Colorado. I also wanted to add some national data charts which I got here. Because of the nature of the update schedule, this data seems to be 2 days out of date. And finally the age breakdown data for Colorado does not seem to be anywhere for download, but it was a small enough amount of data that I just typed it in. This would all be very easy to adapt for your own state. For obvious reasons, Colorado is where I am interested right now.

So here is what I did: Colorado COVID-19.yxmd and Colorado COVID-19 report. If there is interest, I will re-run and update the report daily. Let me know.

Footnote: thoughts about Alteryx now that I am a few years removed from it (note: many of these issues are probably originally my fault, I am not placing blame):

  • The download tool’s temporary file mode is not really documented. Being a few years out, it took me a few minutes to figure out how to follow it with a dynamic input tool. That should be easier and better documented, because it makes it really easy to read a CSV from the web.
  • The interactive chart needs an option to have a logarithmic scale. Especially for this data.
  • Why is the chart tool in pixels when the other report tools are in inches/cm? And why does it default to 72dpi when the rest of the report tools (and Windows in general) default to 96dpi? And what’s up with LaTeX? And finally – it sometimes doesn’t work if you haven’t run your workflow recently. It would be awesome if it told you that.
  • Getting all the tools lining up and connections not intersecting is as hard as ever.
  • I have a fairly big laptop screen, but since the config and output windows need to be up almost all the time, you have a very small amount of work area left. And it still doesn’t support high resolution screens very well.
  • Tabs within tabs within tabs. All with different styles to try to make you think you aren’t in tabs.

13 thoughts on “COVID-19 Data”

  1. Awesome! Thanks for sharing your analysis of the Colorado data and the tool. I would enjoy seeing you analyze other data sources around Covid-19, that could either help or inform our approach to tackling this growing situation. I know the data sources may not be particularly good or consistent everywhere.

  2. Thanks Ned for sharing and good insights on presenting data. The news media could follow your advice!

  3. When you say “high volume data”, what kind of volume are you referring to?

    • Hey Chris, I meant data for high volume consumption. Tableau is jumping through a bunch of hoops for every request. Specifically, the Tableau report for the state of Colorado is generating 43 separate network requests! A pre-formatted report would generate 1. Much easier on a server.

      • Gotcha, it would probably be relieved by caching in an analytic store, like hyper? or if it was Qlik it would just handle the concurrency because the data is ingested as an analytic store itself.

    • Caching would help – but it is still rebuilding the entire report for every request. Because of the nature of how Tableau works and the use of HTTPS, browsers and intermediate routers can’t easily cache any of those dozens of requests, so the load always goes back to the server. So you might help with disk load – which might be the biggest part – but you still have CPU and memory load to deal with. Whereas a PDF you can just throw on S3 or Google Drive or something and you are done.

      Don’t get me wrong – interactive visualizations are great! Some people need to be able to dive in and explore data. But they can’t compete with a static PDF for just publishing data. And it’s not like you couldn’t use Tableau to produce that static PDF – it’s just not the focus of the product.

      • Qlik caches those types of requests, any data source is cached in the in-memory model. I didn’t know hyper doesn’t cache everything.

      • It might cache the data in the server, but at least Tableau is still sending out dozens of network requests every time. I don’t know about Qlik. There can be (and are) many layers of cache. But for server load, nothing compares to a single simple file.

      • Yep, not to sound like a fanboy, but for concurrency Qlik wins. All data (into the high hundreds of millions of rows) is compressed, cached, indexed and stored in the app itself, which is a single file 🙂

  4. Ned, to one of your points, here’s a NYTimes blurb about the virtues of logarithmic scales in situations like this: https://www.nytimes.com/2020/03/20/health/coronavirus-data-logarithm-chart.html

  5. Ned, Glad to see you are still using Alteryx! Thanks for the feedback on logarithmic scale in Interactive Chart tool and calling out the inconsistency with pixels/inches, I’ll look into both of these. Thanks for building Alteryx.

  6. Hey Ned – thanks for doing this. Seeing the data always helps me understand. Have you thought about comparing hospitalizations to hospital beds? The only data source I could find was here: https://en.wikipedia.org/wiki/List_of_hospitals_in_Colorado. It’s about 18 months old, but probably close to true values.

    • Hey Jay, I am going to leave that one to the state. I believe they have good data around how many free beds they have, and besides I suspect the results would scare me too much.