Extracting Numbers from Graphs in PDFs

Josiah Johnston, October 27, 2008

Year    REER-Unit Labor
1980    310.9
1981    260.3
1982    201.3
...
2004    105.8

Note July 6, 2015: These instructions will still work, but there are now more software options that automate parts of this, or let you zoom in on a graph and click the relevant points. The last time I did this task, I used GraphClick for OS X, which is no longer maintained but is available for free. Engauge Digitizer is an open source option, but I haven't used it. For a current list of software, try googling "digitize graphs and plots".

In research, it is common to find publications that have graphs of data you are interested in, but no numeric tables of the underlying data. Some analysis requires the numeric form, and getting that data from authors can be a hit-or-miss proposition. This page describes how to extract numeric data from graphs with a high degree of accuracy. I'll start with some background on vector images and selecting objects in Illustrator; you can skim or skip the intro if you already know that stuff.

This exact method will work on most recent PDFs that use vector-based images. If you have old papers (e.g. from the 1970s) that were scanned in and have rasterized graphs, you'll need to do some extra steps to clip out the graph and have Illustrator construct a vector overlay. Vector-based images are a family of file formats that record a series of lines, shapes, etc. in a geometric form (e.g. x1, y1 to x2, y2). Bitmap images (sometimes called rasterized images) are a family that basically draws a grid and records the color of every point on the grid. The PDF file format can store either type of specification. Vector-based images are better for publications because they don't get pixelated when you zoom in or resize them. But until recently, Word couldn't deal with vector-based images other than Excel graphs copied in. Here's a longer explanation you can read if you want more information. Moving on...

When you open a vector-based PDF in Illustrator, you can select text or figures in two ways. In the toolbox, you'll see a black arrow in the upper left and a white arrow to its right. The black arrow selects entire groups of objects, and the white arrow selects individual components from a group. These groups have the same functionality as the ones you can make in PowerPoint or Word. That is, clicking any member of a group selects the entire group, so all the components will get deleted, moved, or resized together. To break apart a group, select it, right click, and choose "Ungroup" from the pop-up menu. Occasionally in Illustrator, you'll also need to choose "Release Compound Path" or "Release Clipping Mask" in order to get at the pieces you are interested in.

Using either the black or white arrow in Illustrator, you can drag a box to select many objects at one time. If your box contains too many objects, you can hold down shift and click on objects to unselect them. Holding down shift while you click also lets you add objects to your selection. It takes a little practice to efficiently select lots of objects, but with the above knowledge and some playing around, you can figure it out no problem.

In this example, I'm using Figure 3 at the bottom of page 22 of an IMF report that shows the Real Effective Exchange Rate in Senegal in 1980-2004 calculated using Unit Labor Cost. The X variable is year, and the Y variable is REER-index, which I'll sometimes call REER for short.

The steps:

Use Adobe Illustrator to isolate the graph

You can probably use other software to extract the graph from a pdf file, but I have access to Illustrator and am familiar with it.

Reformat the numbers into a table

In Excel, convert the pixel positions to the original data

Line start X  Line start Y  Line delta x  Line delta y  Point X   Point Y   Year  REER - Unit Labor
0.313         0.313         7.74          16.68         0.313     0.313     1980
8.053         16.993        7.8           19.44         8.053     16.993    1981
...
178.873       69.014        7.74          -1.14         178.873   69.014    2003
                                                        186.613   67.874    2004

Each row is a line segment, so if we have N data points, we'll have N-1 rows. We need to extract the points from the line data.
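If you'd rather script this step than do it in a spreadsheet, here is a minimal Python sketch of the idea. It is not what I used for this example. The function name segments_to_points is made up for illustration, and it assumes each row holds (start X, start Y, delta x, delta y) as in the table above. Every segment gives you its start point, and the last segment also gives you its end point, which is how N-1 rows turn into N points.

```python
def segments_to_points(segments):
    """Turn N-1 (start_x, start_y, delta_x, delta_y) rows into N (x, y) points."""
    if not segments:
        return []
    # Every segment contributes its start point...
    points = [(x, y) for x, y, _dx, _dy in segments]
    # ...and the last segment also contributes its end point.
    # Rounding to 3 decimals only hides floating-point noise from the addition.
    x, y, dx, dy = segments[-1]
    points.append((round(x + dx, 3), round(y + dy, 3)))
    return points


# First two rows of the example table (1980 and 1981).
segments = [
    (0.313, 0.313, 7.74, 16.68),
    (8.053, 16.993, 7.8, 19.44),
]
for point in segments_to_points(segments):
    print(point)
```

Two segments produce three points, so the final point (1982 here) comes from adding the deltas to the last start position.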

In my data, the x-values were supposed to be evenly spaced and each represented a single year, so I could basically ignore the graph's x-values except for diagnostics. If this is the case for you, set those data values manually. If it is not the case, you can adapt the strategy used to get Y-values that is explained below.
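A small Python sketch of that idea, assuming one point per year with no gaps: the years are assigned by position, and the graph's x-values are only used to print the spacing between neighbors as a diagnostic. The name assign_years is hypothetical, and the three x-values are the start positions implied by the first two rows of the table.

```python
def assign_years(points_x, first_year):
    """Label evenly spaced points with consecutive years.

    Returns (years, spacings); spacings are the gaps between neighboring
    x-values and are only meant as a sanity check.
    """
    years = [first_year + i for i in range(len(points_x))]
    spacings = [round(b - a, 3) for a, b in zip(points_x, points_x[1:])]
    return years, spacings


years, spacings = assign_years([0.313, 8.053, 15.853], first_year=1980)
print(years)
print(spacings)
```

If one of the spacings comes out roughly double the others, a point is probably missing and the years after it would be off by one.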

Get reference points for two data points

You can populate the "Data Y" (REER) column from the "graph Y" column with a simple linear equation if you know the REER values for two data points. You can get these reference values by either eyeballing the graph, or reading off numbers in Illustrator's info panel and doing some calculations. If you use the eyeball method and are a little off, you can update your guess in the "Check your work" step below.
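Either way, the two reference points boil down to a single slope: data units per pixel. Here is a minimal Python sketch, not taken from the spreadsheet. The function name conversion_factor is made up, and the two reference pairs are borrowed from the 1980 and 1981 rows of this example purely to have numbers to plug in; your own references would come from the axis ticks or the info panel.

```python
def conversion_factor(ref_a, ref_b):
    """Data units per pixel, from two (pixel_y, data_y) reference pairs."""
    (pix_a, data_a), (pix_b, data_b) = ref_a, ref_b
    if pix_a == pix_b:
        raise ValueError("reference points must have different pixel Y values")
    return (data_b - data_a) / (pix_b - pix_a)


# Assumed references: (pixel Y, REER) for 1980 and 1981.
conversion = conversion_factor((0.313, 310.9), (16.993, 260.3))
print(conversion)
```

The result is negative here because pixel Y grows downward in these coordinates while REER grows upward on the graph. Picking two reference points that are far apart vertically makes the slope less sensitive to small reading errors.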

Calculate the values for the Y Data column

See REER_UnitLabor.xls

Whether you calculate the reference points numerically or eyeball them, you end up with two points. From those you calculate the slope and apply a linear formula to get the data Y. The formula follows a strategy similar to the one described above: calculate the conversion from pixels in the SVG file to data units (REER index in this example), then use that conversion and one of your reference points as constants in the formula:

Y_REER = (Y_pixels - Y_ref_pixels) * Conversion + Y_ref_REER
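The same formula as a short Python sketch, in case you want to apply it outside of Excel. The function name pixels_to_data is hypothetical, and the constants assume 1980 as the reference point with the slope taken from the 1980 and 1981 rows; they are for illustration and are not necessarily the references used in the spreadsheet.

```python
def pixels_to_data(y_pixels, y_ref_pixels, y_ref_data, conversion):
    """Apply the linear formula: pixel Y in, data units (REER here) out."""
    return (y_pixels - y_ref_pixels) * conversion + y_ref_data


# Assumed constants: reference point (pixel Y 0.313, REER 310.9) and a slope
# computed from the 1980 and 1981 values.
conversion = (260.3 - 310.9) / (16.993 - 0.313)

for y_pixels in (0.313, 16.993, 69.014, 67.874):
    reer = pixels_to_data(y_pixels, 0.313, 310.9, conversion)
    print(y_pixels, round(reer, 1))
```

The two reference rows come back exactly by construction. The other rows depend on how good the references are, which is what the "Check your work" step is for.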

Now, your main table should look something like this:

Line start X  Line start Y  Line delta x  Line delta y  Point X   Point Y   Year  REER - Unit Labor
0.313         0.313         7.74          16.68         0.313     0.313     1980  310.9
8.053         16.993        7.8           19.44         8.053     16.993    1981  260.3
...
178.873       69.014        7.74          -1.14         178.873   69.014    2003  102.3
                                                        186.613   67.874    2004  105.8
See REER_UnitLabor.xls

Check your work


The final test is to overlay a graph drawn from the extracted numbers on top of the original graph.

See REER_UnitLabor.xls
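If you want to build the overlay without a spreadsheet, one option is to convert the extracted numbers back into pixel coordinates and write them out as a line you can place over the original in Illustrator. The Python sketch below does that under some assumptions: the function names are made up, the constants are the illustrative 1980 and 1981 reference values, and whether the line lands in the right spot depends on your file using the same coordinate origin as the extracted path.

```python
def data_to_pixels(y_data, y_ref_pixels, y_ref_data, conversion):
    """Invert the linear formula: data units back to pixel Y."""
    return (y_data - y_ref_data) / conversion + y_ref_pixels


def overlay_svg(xs, ys_data, y_ref_pixels, y_ref_data, conversion):
    """Build a tiny SVG document with one red polyline through the points."""
    coords = " ".join(
        f"{x:.3f},{data_to_pixels(y, y_ref_pixels, y_ref_data, conversion):.3f}"
        for x, y in zip(xs, ys_data)
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        f'<polyline fill="none" stroke="red" points="{coords}"/>'
        "</svg>"
    )


# Assumed constants: 1980 as the reference point, slope from 1980 and 1981.
conversion = (260.3 - 310.9) / (16.993 - 0.313)
svg = overlay_svg([0.313, 8.053], [310.9, 260.3], 0.313, 310.9, conversion)
print(svg)
```

The red line only tells you something when it sits on top of the original graph: anywhere the two lines separate, either a reference point or an extracted value needs another look.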

El fin







