PDFs, for all their versatility, too often end up as one-way repositories for data. Summaries, catalogs, real estate listings, convention attendance directories and other types of reports are designed to inform. If you want to analyze and act on the data, you need to retrieve it. Data extraction to Excel or CSV is the first step. That's where I come in.
I use dedicated PDF extraction software that can extract multiple pages from 1,000 PDFs in less than 2 minutes. But there's more to extraction than just retrieving data. The next step is to "wrangle" it: reshape it from an end-user report into a format that is useful to you.
PDFs come in all styles. Some are created directly from data tables, but more often than not some tidying up is required. I have helped clients turn single-page PDFs into CSV files, converted 500-page directories into Excel worksheets, and extracted pages from 1,000 individual PDFs and rebuilt them as an Excel-style database.
The key to completing these conversions is knowing which tools will help at each step. I use all of the following tools, and sometimes I need more than one on a single set of extracted data:
* Power Query - part of Excel
* Microsoft Access - when I need fast filtering and sorting on huge datasets
* RegexBuddy - a really powerful text processor that uses regular expressions to parse data
* VBA (macros) - for smaller datasets that need special Excel-centric handling
* VB.NET - custom, ad-hoc programs to help manage files and folders
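
To give a feel for the regular-expression parsing step mentioned in the list, here is a minimal sketch in Python (not one of the tools listed above, used here only for illustration). It assumes a hypothetical directory layout of `Name, City ST (phone)` per extracted line; the pattern, the field names, and the `parse_lines` and `rows_to_csv` helpers are all made up for this example, and a real PDF would need its own pattern.

```python
import csv
import io
import re

# Hypothetical layout of one extracted text line:
#   "Jane Doe, Springfield IL  (555) 123-4567"
# A real directory would need a pattern tailored to its own layout.
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?),\s+"                    # everything up to the first comma
    r"(?P<city>.+?)\s+"                      # city, possibly several words
    r"(?P<state>[A-Z]{2})\s+"                # two-letter state code
    r"(?P<phone>\(\d{3}\)\s*\d{3}-\d{4})$"   # phone such as (555) 123-4567
)

FIELDS = ["name", "city", "state", "phone"]


def parse_lines(lines):
    """Return one dict per line that matches the pattern; skip the rest."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def rows_to_csv(rows):
    """Serialize the parsed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        "Jane Doe, Springfield IL  (555) 123-4567",
        "Page 3 of 500",  # page furniture: no match, so it is dropped
        "John Q. Public, New York NY (555) 987-6543",
    ]
    print(rows_to_csv(parse_lines(sample)))
```

Lines that do not match the pattern, such as page headers and footers, are simply skipped, which is often most of the "tidying up" a report-style PDF needs.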