Interactive Data Analysis

The stereotypical notion that comes to my mind when I begin thinking about interactive data visualization and analysis is 3D rendering. Advances in video card technology and rendering algorithms in the mid-to-late 90s ushered in an era of 3D data visualization. In contrast to 2D plots, it’s usually not possible to see an entire 3D surface from a single perspective. One of the coolest parts about 3D data visualization was the ability to rotate the rendered object in 3-space, and to zoom in and out. 3D data visualization was a very cool advance in the display and analysis of information: by finding a way to render all of the information in a single visual perspective, you could identify trends and relations that you otherwise might have missed. The next big step, one presumes, is 4D data visualization, in which you can watch a 3D rendering evolve over time. Cool!

Cool factor aside, I actually disagree with this general trend toward more and more complicated visualization. The often unstated expectation of high-dimensional visualization is that it taps into our innate ability to visually absorb vast amounts of information, so that by using high-dimensional visualization we can process and synthesize more data in less time. In my experience, this hope is not realized in practice. In fact, most data is far too complex to boil down to a single, simple 3D plot. Oftentimes the parameters of the visualization do not give smooth results as they are varied continuously, or are not themselves continuous. 3D rendering is useful for identifying structures, but sometimes the expectation of structure impacts our approach to and interpretation of our data, blinding us to information that would have been obvious with a different approach.

One sort of information that cannot be effectively visualized using 3D rendering or animation is a large collection of categorical data. In one of my projects, I examine user data for thousands of users. How could I possibly use 3D rendering to visualize all of their behavior at once? I suppose I could stack the users along one axis, and then plot their data using the remaining two axes, but how would I order the stack? The problem is that I have lots of information, but I do not expect any relation between the raw data of the users. Aye, there’s the rub: raw data. How often do we actually draw results directly from raw data?

Obtaining accurate and useful information can be quite tricky. For part of my Ph.D. work I analyzed oscillators vibrating on a metal plate: our raw data were time series of plate motion. Universally, the first step in analyzing these plate movements was to take the Fourier transform of the time series and construct the power spectrum. When I ran a single motor at a time, all the plate motion was due to that motor, which meant that to determine the motor’s speed, I simply needed to find the highest peak in the power spectrum. Except sometimes that wasn’t the case: sometimes the peak of the power spectrum occurred at a multiple of the motor’s speed due to interactions between the resonant frequencies of the plate and higher harmonics of the motor’s motion. After finding the position of the global maximum, I had to check fractions of that frequency to see if there were nontrivial peaks there, in which case the smaller frequency was probably the motor’s true speed. This series of steps, this algorithm for determining a single motor’s speed on the plate, was one of the harder parts of the whole project. Tuning the parameters of the algorithm to get good results took time and left me with a highly specialized technique that only applied to this particular experimental setup.
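
To make the shape of that procedure concrete, here is a rough sketch in Python and NumPy. It is not the code from the project, and the function name, the divisor range, the search window, and the relative threshold are placeholder guesses rather than the carefully tuned values described above.

    import numpy as np

    def estimate_motor_speed(signal, sample_rate, max_divisor=3, rel_threshold=0.2):
        """Peak-then-check-subharmonics speed estimate (illustrative only)."""
        # Power spectrum of the plate-motion time series
        spectrum = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

        # Naive estimate: frequency of the global maximum (skipping the DC bin)
        peak_idx = np.argmax(spectrum[1:]) + 1
        peak_freq = freqs[peak_idx]
        peak_power = spectrum[peak_idx]

        # Check integer fractions of the peak frequency; a nontrivial peak
        # there suggests the global maximum was a harmonic of the true speed.
        best_freq = peak_freq
        for n in range(2, max_divisor + 1):
            candidate = peak_freq / n
            idx = np.argmin(np.abs(freqs - candidate))
            lo, hi = max(idx - 2, 1), idx + 3   # small window around the candidate bin
            local_power = spectrum[lo:hi].max()
            if local_power > rel_threshold * peak_power:
                best_freq = freqs[lo + np.argmax(spectrum[lo:hi])]
        return best_freq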

There are a few aspects of my previous description that bear further discussion. First, I had to have enough familiarity with the system and the data to know that the peak of the power spectrum sometimes identified speeds that were multiples of the true speed. Fortunately, a professor down the hall lent us a stroboscope with an adjustable flash rate, so I could literally watch the motors and determine their true speeds. I knew what I was looking for, and I knew that the naive approach gave me incorrect results. For the second step, I had to scrutinize the data even more closely for patterns that would help, and concoct some method that would faithfully extract the information from those patterns. The third step was to find the set of parameters that gave the algorithm its best performance. In most work, this last step is wrapped into the second step, but it is critical: you may have the best algorithm for an analysis, yet it performs poorly because some of the algorithm’s tuning parameters are not well suited to your data. All three of these steps required a strong familiarity with the quirks of the object of study: first the system, then the data, and finally the algorithm.
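
As an illustration of that third step, a parameter sweep against independently known speeds (in my case, readings taken with the stroboscope) might look something like the following. The function tune_estimator is hypothetical, the candidate grids are made up, and estimate_motor_speed refers to the sketch above rather than to any real code.

    import itertools
    import numpy as np

    def tune_estimator(estimator, signals, known_speeds, sample_rate):
        """Grid-search two tuning parameters of a speed estimator.

        `estimator` is any function with the signature of estimate_motor_speed
        from the sketch above; `known_speeds` are independent ground-truth
        measurements. The candidate grids below are placeholders.
        """
        best_params, best_err = None, np.inf
        for max_div, thresh in itertools.product([2, 3, 4], [0.1, 0.2, 0.3, 0.5]):
            estimates = [estimator(s, sample_rate, max_div, thresh) for s in signals]
            err = np.mean(np.abs(np.asarray(estimates) - np.asarray(known_speeds)))
            if err < best_err:
                best_params, best_err = (max_div, thresh), err
        return best_params, best_err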

Think about that last part: How do I develop familiarity with an algorithm, especially one as specialized as my motor-speed extraction algorithm? The first generation of data-processing algorithms includes such things as fitting time series to trends, computing fast Fourier transforms, and diagonalizing matrices. All of these techniques are largely if not entirely data-agnostic. They have their quirks, but they can be used for many analyses. Once you’ve learned the signature of one kind of quirk from one project, you’ll recognize it the next time you use that algorithm. However, data is becoming more plentiful and at the same time more complex, and many algorithms are beginning to crop up that are specific to their problem domain, just like my motor-speed extraction algorithm. For a more important example, consider the reconstruction of MRI data using the massively parallel computing capabilities of advanced video cards. The resulting software required a very carefully crafted algorithm specific not only to the computing hardware but also to the problem domain. In her article “A fourth ‘r’ for 21st century literacy,” Cathy N. Davidson argues that algorithms (’rithms, if you want it to start with an “r”) are a new form of literacy that we must teach in our elementary and secondary schools, and in our colleges and universities. Generic, data-agnostic algorithms are giving way to new, highly targeted algorithms that are domain specific, and as the quantity of data grows, so too does the number of algorithms to process those data. How do we train our intuition about the benefits and blind spots of these new procedures? As we develop more algorithms more quickly, what can we do to increase the speed at which we become familiar with them?
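
Before answering, it is worth pausing on how little ceremony those first-generation, data-agnostic algorithms require. In a library such as NumPy (used here purely for illustration, on made-up data), each one is essentially a single call:

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.linspace(0.0, 1.0, 500)
    y = 2.0 * t + 0.1 * rng.standard_normal(t.size)   # made-up data

    # Fit a time series to a (linear) trend
    slope, intercept = np.polyfit(t, y, deg=1)

    # Fast Fourier transform, here reduced to a power spectrum
    power = np.abs(np.fft.rfft(y)) ** 2

    # Diagonalize a symmetric matrix, e.g. the covariance of two series
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(np.vstack([t, y])))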

I believe that the best way to become familiar with the intricacies of an algorithm is to write a program that uses the algorithm to analyze real data and gives swift and meaningful feedback as you alter the algorithm’s parameters. Although one can achieve feedback using text-based command-line tools, I believe that these sorts of programs are much better written as GUI programs in which changes to a slider or a text input field lead to immediate changes in graphs and charts.
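
As a concrete, if toy, version of what I mean, the following sketch uses matplotlib’s Slider widget to wire a tuning parameter directly to a plot. My own tools are built on Prima rather than matplotlib, the data here is made up, and the moving-average “algorithm” is merely a stand-in for something more interesting.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.widgets import Slider

    # Made-up "raw data": a noisy sinusoid standing in for real measurements.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 10.0, 2000)
    data = np.sin(2 * np.pi * 3.0 * t) + 0.5 * rng.standard_normal(t.size)

    # The "algorithm" under study: a moving-average smoother whose window
    # length is the tunable parameter.
    def smooth(x, window):
        kernel = np.ones(window) / window
        return np.convolve(x, kernel, mode="same")

    fig, ax = plt.subplots()
    fig.subplots_adjust(bottom=0.25)
    (line,) = ax.plot(t, smooth(data, 25))
    ax.set_xlabel("time")
    ax.set_ylabel("smoothed signal")

    # A slider wired directly to the parameter: drag it and the curve redraws.
    slider_ax = fig.add_axes([0.2, 0.1, 0.6, 0.03])
    window_slider = Slider(slider_ax, "window", 1, 200, valinit=25, valstep=1)

    def update(_value):
        line.set_ydata(smooth(data, int(window_slider.val)))
        fig.canvas.draw_idle()

    window_slider.on_changed(update)
    plt.show()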

My belief in the need for interactivity led me to write my own plotting library, and that library now partly drives the belief in turn. It is realized in the form of a plot “widget” for a GUI toolkit called Prima. You probably think I’m a little crazy for writing my own library, and there may be some truth to that thought. In my defense, I needed a responsive plotting library that could function as part of a larger GUI toolkit, and I had a chance encounter with Prima’s particularly simple, elegant, and cross-platform low-level drawing operations. It has been nearly two years since I embarked on this project. At first, most of my time with the library was spent working on the library itself, with occasional uses in research that fed back into further development. Fortunately, the initially large investment is beginning to pay dividends: I have used it extensively over the last two months, creating many small programs to determine the effectiveness (or lack thereof) of some of the algorithms used in our research. Writing my own plotting library has led to deep knowledge of both the library (obviously) and the surrounding GUI toolkit, and tasks that once took weeks to execute poorly or months to execute robustly I now complete in days.

The problem with GUI programming in general is that it is a verbose art, time-consuming to write and difficult to do well. Graphic designers spend years honing their craft, creating well-designed interfaces for everything from sound boards to web sites. I can spend a full day writing an interactive program to study the intricacies of our proposed algorithms. In contrast, writing a small program that performs the analysis and dumps the output to a few text files, which I then plot with another small script, takes only a couple of hours. The steep learning curve for writing useful GUI programs and the lengthier development process both lead scientists to favor the simpler script-based analyses.

The benefit of writing GUI programs comes not in speed but in robustness. Well-written GUI programs make thorough analyses simple and appealing. Giving ourselves interactive methods for playing with our algorithms and their analyses of our data is not only enjoyable and rewarding in its own right but also draws us into a much deeper understanding of our data and our algorithms. The process happens to be faster in the long run and it gives us fantastic live demos to use at talks, but these are secondary benefits. The true benefit comes when we describe our choice of algorithm in our papers or in conversations with our colleagues: we know how our algorithms behave because we have played with them.

For more about writing good user interfaces, check out this excellent—if a little long—talk by Bret Victor.

— David Mertens