Note July 6, 2015: These instructions will still work, but there are now more software options that automate parts of this, or let you zoom in on a graph and click the relevant points. The last time I did this task, I used GraphClick for OS X, which is no longer maintained but available for free. Engauge Digitizer is an open source option, but I haven't used it before. For a current list of software, try googling digitize graphs and plots.
In research, it is common to find publications that have graphs of data you are interested in, but no numeric tables of the underlying data. Some analysis requires the numeric form, and getting that data from authors can be a hit-or-miss proposition. This page describes how to extract numeric data from graphs with a high degree of accuracy. I'll start with some background on vector images and selecting objects in Illustrator; you can skim or skip the intro if you already know that stuff.
This exact method will work on most recent PDFs that use vector-based images. If you have old papers (e.g. from the 1970s) that were scanned in and have rasterized graphs, you'll need to do some extra steps to clip out the graph and have Illustrator construct a vector overlay. Vector-based images are a family of file formats that record a series of lines, shapes, etc. in a geometric form (e.g. x1, y1 to x2, y2). Bitmap images (sometimes called rasterized images) are a family that basically draw a grid and record the color of every point on the grid. The PDF file format can store either type of specification. Vector-based images are better for publications because they don't get pixelated when you zoom in or resize them. But until recently, Word couldn't deal with vector-based images other than Excel graphs copied in. Here's a longer explanation you can read if you want more information. Moving on...
When you open a vectorized PDF in Illustrator, you can select text or figures in two ways. In the toolbox, you'll see a black arrow in the upper left and a white arrow to its right. The black arrow selects entire groups of objects, and the white arrow selects individual components from a group. These groups have the same functionality as ones you can make in PowerPoint or Word. That is, clicking any member of a group selects the entire group, so all the components will get deleted, moved, or resized together. To break apart a group, select it, right click, and choose "Ungroup" from the pop-up menu. Occasionally in Illustrator, you'll also need to choose "Release Compound Path" or "Release Clipping Mask" in order to get at the pieces you are interested in.
Using either the black or white arrow in Illustrator, you can drag a box to select many objects at one time. If your box contains too many objects, you can hold down shift and click on objects to unselect them. Holding down shift while you click also lets you add objects to your selection. It takes a little practice to efficiently select lots of objects, but with the above knowledge and some playing around, you can figure it out no problem.
In this example, I'm using Figure 3 at the bottom of page 22 of an IMF report that shows the Real Effective Exchange Rate in Senegal in 1980-2004 calculated using Unit Labor Cost. The X variable is year, and the Y variable is REER-index, which I'll sometimes call REER for short.
You can probably use other software to extract the graph from a pdf file, but I have access to Illustrator and am familiar with it.
<path fill="none" stroke="#000000" stroke-width="0.626" stroke-linecap="round" stroke-linejoin="round" stroke-miterlimit="10" d=" M178.873,69.014l7.74-1.14 M171.073,72.073l7.8-3.06 M163.333,73.754l7.74-1.681 M155.533,69.794l7.8,3.96 M147.793,70.153 l7.74-0.359 M140.053,68.714l7.74,1.439 M132.313,70.273l7.74-1.56 M124.513,79.334l7.8-9.061 M116.773,79.394l7.74-0.06 M108.973,80.294l7.8-0.9 M101.233,58.813l7.74,21.48 M93.493,56.594l7.74,2.22 M85.693,46.813l7.8,9.78 M77.953,44.953l7.74,1.86 M70.153,46.153l7.8-1.2 M62.413,41.413l7.74,4.74 M54.613,39.313l7.8,2.1 M46.873,39.974l7.74-0.66 M39.133,46.754l7.74-6.78 M31.333,44.233l7.8,2.521 M23.593,39.374l7.74,4.859 M15.853,36.434l7.74,2.94 M8.053,16.993l7.8,19.44 M0.313,0.313l7.74,16.68"/>
There are many ways of drawing that path, and my copy of Illustrator uses the pattern M x,y l delta_x,delta_y, only without a lot of spacing or punctuation. "M" means move the "pencil" to absolute coordinates on the page. "l" (lower case L) means draw a line segment from that point over by delta_x and down by delta_y. In the SVG format, the Y-axis is reversed from Cartesian space, so a bigger Y means a point is lower on the page: a positive delta_y moves down, and a negative delta_y moves up. Illustrator occasionally uses "v" and "h" for vertical and horizontal line segments. Those are infrequent in a graph and can be normalized in a step below. If you have a different copy of Illustrator or use another program to convert the graph to an SVG file, you may need to reference the SVG path specification, and/or work out another strategy for processing the data after it is in Excel.
M178.873,69.014l7.74-1.14 M171.073,72.073l7.8-3.06 M163.333,73.754l7.74-1.681 M155.533,69.794l7.8,3.96 M147.793,70.153 l7.74-0.359 M140.053,68.714l7.74,1.439 M132.313,70.273l7.74-1.56 M124.513,79.334l7.8-9.061 M116.773,79.394l7.74-0.06 M108.973,80.294l7.8-0.9 M101.233,58.813l7.74,21.48 M93.493,56.594l7.74,2.22 M85.693,46.813l7.8,9.78 M77.953,44.953l7.74,1.86 M70.153,46.153l7.8-1.2 M62.413,41.413l7.74,4.74 M54.613,39.313l7.8,2.1 M46.873,39.974l7.74-0.66 M39.133,46.754l7.74-6.78 M31.333,44.233l7.8,2.521 M23.593,39.374l7.74,4.859 M15.853,36.434l7.74,2.94 M8.053,16.993l7.8,19.44 M0.313,0.313l7.74,16.68
Now you need to massage these numbers so they can be copied into Excel. The replacement strings below assume you are using Word. Most text editors I'm familiar with use "\t" for tabs and "\n" or "\r" for new lines, whereas Word uses "^t" and "^p" for tabs and new lines. If you are in doubt, select a tab or new line from text and paste it into the replace field of your text editor.
M155.533 69.794 l7.8 3.96
A few lines may be blank or have the "l" section on a different line. Delete the blank lines, as well as the extra returns that push the l onto a different line.
Occasionally, you'll see an h followed by one number instead of an l followed by two, or maybe a v. An "h" denotes a horizontal line, so only the x value is given. Change the h to an l, and add a zero for the y, e.g. change "h7.8" to "l7.8 0". A "v" denotes a vertical line and should be dealt with similarly, e.g. change "v3.4" to "l0 3.4".
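If you'd rather not hunt for these by hand, the same substitution can be scripted. This is only a minimal sketch in Python, not part of the Word workflow above: the function name `normalize_hv` is made up for illustration, and it assumes the h and v commands are the lowercase (relative) kind with plain decimal numbers, as in the examples in this section.

```python
import re

# Matches plain decimal numbers such as 7.8, -0.359, or .5 (no exponents).
NUM = r"-?\d*\.?\d+"

def normalize_hv(path_data):
    """Rewrite relative h/v commands as l commands with an explicit zero.

    Hypothetical helper: only lowercase h and v are handled.
    """
    # h<dx> is a horizontal segment, so the missing y offset is 0.
    path_data = re.sub(rf"h\s*({NUM})", r"l\1 0", path_data)
    # v<dy> is a vertical segment, so the missing x offset is 0.
    path_data = re.sub(rf"v\s*({NUM})", r"l0 \1", path_data)
    return path_data

print(normalize_hv("M155.533,69.794h7.8 M147.793,70.153v3.4"))
```

Uppercase H and V use absolute coordinates and would need different handling, so check for those separately if your file has them.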
178.873 69.014 7.74 -1.14
171.073 72.073 7.8 -3.06
...
8.053 16.993 7.8 19.44
0.313 0.313 7.74 16.68
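As an alternative to find-and-replace, the whole path string can be split into four-number rows with a short script. The following is a rough Python sketch under the assumption that every segment follows the M x,y l delta_x,delta_y pattern (with any h or v already rewritten as l); `path_to_rows` is a name I made up, not something Illustrator or Excel provides.

```python
import re

# Plain decimal numbers; a leading minus sign can double as the separator,
# which is how "7.74-1.14" splits into 7.74 and -1.14.
NUM = r"-?\d*\.?\d+"
SEGMENT = re.compile(rf"M\s*({NUM})[\s,]*({NUM})\s*l\s*({NUM})[\s,]*({NUM})")

def path_to_rows(path_data):
    """Return one (x, y, delta_x, delta_y) tuple of floats per M...l segment."""
    return [tuple(float(v) for v in m.groups())
            for m in SEGMENT.finditer(path_data)]

# Three segments copied from the path above, including one with a stray space.
sample = ("M178.873,69.014l7.74-1.14 M147.793,70.153 l7.74-0.359 "
          "M0.313,0.313l7.74,16.68")

# Tab-separated output pastes into a spreadsheet as four columns.
for row in path_to_rows(sample):
    print("\t".join(str(v) for v in row))
```

Segments that don't fit the pattern are silently skipped, so it's worth comparing the number of rows against the number of M's in the path.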
|Point X||Point Y||Year||REER - |
Each row is a line segment, so if we have N data points, we'll have N-1 rows. We need to extract the points from the line data.
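To make the N-1 rows versus N points relationship concrete, here is a small Python sketch of one way to do the extraction. It assumes the segments form a single line that runs left to right with no gaps, which is true of the example path but may not hold for every graph; `segments_to_points` is a hypothetical helper name.

```python
def segments_to_points(rows):
    """Turn N-1 (x, y, dx, dy) segments into N (x, y) points, left to right.

    Assumes at least one segment and that each segment ends where the
    next one starts.
    """
    ordered = sorted(rows)  # tuples sort by their starting x first
    # Every segment contributes its starting point...
    points = [(x, y) for x, y, _dx, _dy in ordered]
    # ...and the rightmost segment also contributes its end point.
    x, y, dx, dy = ordered[-1]
    points.append((round(x + dx, 3), round(y + dy, 3)))
    return points

# The last two rows of the table above (the two leftmost segments).
rows = [(8.053, 16.993, 7.8, 19.44), (0.313, 0.313, 7.74, 16.68)]
for point in segments_to_points(rows):
    print(point)
```

Expect small rounding differences: the end of one segment can be off by 0.001 from the start of the next, because each number in the file was rounded separately.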
In my data, the x-values were supposed to be evenly spaced and each represented a single year, so I could basically ignore the graph's x-values except for diagnostics. If this is the case for you, set those data values manually. If it is not the case, you can adapt the strategy used to get Y-values that is explained below.
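The "diagnostics" part can be as simple as checking that the gaps between x-values are roughly equal before you trust a manual year assignment. Here is a hedged Python sketch of that idea; `assign_years` and its return shape are my own invention, and what counts as "roughly equal" is a judgment call for your graph.

```python
def assign_years(xs, first_year):
    """Map each graph x-value to a year, assuming one point per year.

    Returns (years, unevenness), where unevenness is the largest deviation
    of any gap from the mean gap, as a fraction of the mean gap.
    Assumes at least two x-values.
    """
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    unevenness = max(abs(g - mean_gap) for g in gaps) / mean_gap
    years = {x: first_year + i for i, x in enumerate(ordered)}
    return years, unevenness

# The three leftmost x-values from the example path; the series starts in 1980.
years, unevenness = assign_years([15.853, 0.313, 8.053], 1980)
print(years)
print(f"largest gap deviation: {unevenness:.1%}")
```

A large deviation suggests a missing point or x-values that are not evenly spaced, in which case the x-values need the same treatment as the Y-values.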
You can populate the "Data Y" (REER) column, from the "graph Y" column with a simple linear equation if you know REER values for two data points. You can get these reference values by either eyeballing the graph, or reading off numbers in Illustrator's info panel and doing some calculations. If you use the eyeball method and are a little off, you can update your guess in the "Check your work" step below.
In this example, I got lucky and was able to eyeball near-exact intersections between points in the graph and y-tick marks: 200 for 1982 and 100 for 1997. Normally, I'm not this lucky, and have to read the coordinates of two tick marks and a line segment in Illustrator and use those to calculate the REER values for two reference points.
Read the coordinates of an object in Illustrator by selecting it with the white selection arrow, then clicking "Info" under the "Window" menu. A small panel will pop up that shows X, Y, W, & H.
|REER value||Y-coordinate (inches)|
Now I have two reference points: (1980, 310.9290) and (1981, 260.2842).
Whether you calculate the reference points numerically or eyeball them, you end up with two points from which you calculate the slope, and then apply a linear formula to get the data Y. The linear formula follows a strategy similar to the one described in the paragraph above. You calculate the conversion from pixels in the SVG file to data units (REER index in this example), and use that and one of your reference points as constants in the formula:
|Year||REER value||Pixel value (Point Y)|
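Written out as code, the linear formula is just a slope and an offset. This Python sketch is my own illustration rather than the spreadsheet formula itself: `make_converter` is a made-up name, and the pixel values 0.313 and 16.993 are the Point Y values of the two leftmost points in the path, which I'm assuming correspond to 1980 and 1981.

```python
def make_converter(ref_a, ref_b):
    """Build a pixel-Y to data-Y function from two (pixel_y, value) pairs."""
    (pixel_a, value_a), (pixel_b, value_b) = ref_a, ref_b
    # Data units per pixel. This comes out negative because SVG's Y-axis
    # points down while the graph's Y-axis points up.
    slope = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel_y: value_a + slope * (pixel_y - pixel_a)

# Reference points: (Point Y, REER) for 1980 and 1981.
to_reer = make_converter((0.313, 310.9290), (16.993, 260.2842))

# Convert the Point Y of the third point in the path.
print(round(to_reer(36.434), 2))
```

In a spreadsheet the same thing is one formula copied down the column, with the slope and one reference point stored in fixed cells.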
Now, your main table should look something like this:
|Point X||Point Y||Year||REER - |
The final test is to overlay a graph drawn from the extracted numbers on top of the original graph.