R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Getting data from PDFs the easy way with R

Posted on August 24, 2018 by Andrew Treadway in R bloggers

[This article was first published on Open Source Automation, and kindly contributed to R-bloggers].

Earlier this year, a new R package called tabulizer was released that lets you automatically pull tables and text out of PDFs. Note that this package only works if the PDF’s text is selectable (that is, typed rather than scanned), so it won’t work for scanned-in PDFs or image files converted to PDFs.

If you don’t have tabulizer installed, just run install.packages("tabulizer") to get started.

Initial Setup

Once tabulizer is installed, we’ll load it and define a variable referencing an example PDF.

library(tabulizer)
site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"

The PDFs you manipulate with this package don’t have to be located on your machine — you can use tabulizer to reference a PDF by a URL. For our first example, we’re going to use a sample PDF file found here: http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf

How to extract all the tables from a PDF

You can extract tables from this PDF using the aptly-named extract_tables function, like this:

# default call with no parameters changed
matrix_results <- extract_tables(site)
# get back the tables as data frames, keeping their headers
df_results <- extract_tables(site, output = "data.frame", header = TRUE)

By default, this function returns a list with one matrix per table, as in the first line of code above. However, as in the second line, we can set output = "data.frame" and header = TRUE to get back a list of data frames corresponding to the tables in the PDF.

Once we have the results back, we can refer to any individual PDF table like any data frame we normally would in R.

first_df <- df_results[[1]]
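If you stick with the default matrix output, the header row arrives as the first row of a character matrix. Here is a minimal sketch of promoting it to column names by hand. It uses a made-up stand-in matrix rather than real output from the sample PDF, and the helper name matrix_to_df is hypothetical:

```r
# Stand-in for one element of the list returned by extract_tables(site);
# the values are invented for illustration.
tbl <- matrix(
  c("Trial", "Coils", "Paperclips",
    "1",     "5",     "3",
    "2",     "10",    "7"),
  ncol = 3, byrow = TRUE
)

# Hypothetical helper: treat the first row as the header and
# convert the remaining rows to a data frame with typed columns.
matrix_to_df <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  # type.convert turns columns that look numeric into numbers
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

df <- matrix_to_df(tbl)
str(df)
```

With a real PDF you would apply the same idea to each matrix in the list, for example with lapply, though tables whose headers span several rows would need more care.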

How to scrape text from a PDF

Scraping text from our sample PDF can be done using extract_text:

text <- extract_text(site)
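Once the text is extracted, ordinary string functions apply. A minimal sketch, assuming the text comes back as one string with newline-separated lines; the sample string below is invented and stands in for real extract_text output:

```r
# Stand-in for the string returned by extract_text(site);
# the content is invented for illustration.
text <- "Data Table 1\nTrial 1: 5 coils\n\nTrial 2: 10 coils\n"

# Split on newlines and drop blank lines
lines <- strsplit(text, "\r?\n")[[1]]
lines <- lines[nzchar(trimws(lines))]

# Keep only the lines that start with "Trial"
trial_lines <- grep("^Trial", lines, value = TRUE)
print(trial_lines)
```

From here you could pull numbers out of each line with regmatches or a similar tool, depending on how regular the layout of your PDF is.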

How to split up a PDF by its pages

tabulizer can also create separate files for the pages in a PDF. This can be done using the split_pdf function:

# split PDF referenced above
# output separate page files to current directory
split_pdf(site, getwd())
# or output to different directory
split_pdf(site, "C:/path/to/other/folder")

The first argument of split_pdf is the filename or URL of your PDF; the second argument is the directory where you want the individual pages to be output.

How to merge a collection of PDFs

What if we want to reverse what we just did? We can use the merge_pdfs function, which takes a vector of file names and the name of the output file that will hold the merged result.

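# merge_pdfs expects a vector of file paths, so list the PDFs in the folder first
pdf_files <- list.files("C:/path/to/pdf/files", pattern = "\\.pdf$", full.names = TRUE)
merge_pdfs(pdf_files, "C:/path/to/merged_result.pdf")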
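If you build the vector of file names with list.files, keep in mind that it returns names in alphabetical order, so a file ending in 10 sorts before one ending in 2. Here is a minimal sketch of putting page files back in numeric order before merging. The report_N.pdf names are made up for illustration and are not necessarily what split_pdf produces:

```r
# Hypothetical page files; real names depend on how the pages were produced.
pdf_files <- c("report_10.pdf", "report_2.pdf", "report_1.pdf")

# Pull the trailing number out of each name and sort by it numerically
page_num <- as.integer(sub(".*_(\\d+)\\.pdf$", "\\1", pdf_files))
ordered_files <- pdf_files[order(page_num)]
print(ordered_files)
```

The ordered vector can then be passed as the first argument to merge_pdfs so the pages end up in their original sequence.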

How to get the number of pages in a PDF

Getting the number of pages in a PDF is made easy with the get_n_pages function, which you can call like this:

get_n_pages(site)

How to get metadata associated with a PDF

You can get metadata associated with our PDF using extract_metadata:

extract_metadata(site)

This function returns a list containing the number of pages, the title, the created and modified dates, and more.

That’s it for this post! Check out other R articles of mine here: http://theautomatic.net/category/r/
