I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity.
I need to scrape many tables that appear across many pages. In some cases, the rows will continue onto the next page, and additional columns will appear on other pages as well. I've included a link to one example. I'm comfortable with R, but I can also use Python if that will be better for scraping. I haven't found many resources indicating how to deal with tables that carry onto additional pages for either language though. I need to get these tables into a CSV or XLSX format.
Thank you in advance!
In this example, Pages 15-28 should be one table.
https://www.co.tehama.ca.us/images/images/Elections/StatementOfVotesCastNOV2020v2excel.pdf
I was able to get the entire table using the following procedure.
Open the PDF in MS Word, not Adobe Acrobat. Word will convert the document.
After the conversion has completed, select all. (Both steps may take some time.)
Paste into a blank Excel worksheet. Save and enjoy.
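If you'd rather stay in Python, a programmatic route is worth trying too. Below is a minimal sketch using tabula-py (a wrapper around tabula-java, so it needs a Java runtime); the page range comes from your question, but the filename and the lattice setting are assumptions to adjust. Note that a plain concat stacks page fragments as extra rows - for the pages that add columns instead, you'd merge/join on the precinct column rather than concatenate.

```python
# pip install tabula-py  (requires a Java runtime)
import pandas as pd
import tabula

# Pages 15-28 of the linked PDF form one logical table.
# lattice=True suits tables with ruled cell borders; if the
# detection comes out wrong, try stream=True instead.
frames = tabula.read_pdf(
    "StatementOfVotesCastNOV2020v2excel.pdf",
    pages="15-28",
    lattice=True,
)

# Stack the per-page fragments into one table and export it.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("votes.csv", index=False)
```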
I have thousands of scanned field books as PDFs. Each has a unique filename. In a spreadsheet I have metadata for each, where each row has:
index number, filename, info1, info2, info3, info4, etc.
filename is the exact file name of the PDF. info1 is just an example of a metadata field, such as 'Year' or whatever. There are only about 8 fields or so, and not every PDF is relevant to all of them.
I assume there should be a reasonable way to create a database (MySQL or other) by reading the spreadsheet (which I can just save as .csv or .txt or something). This part I am sure I can handle.
I want to be able to look up/search for a PDF file by entering various search terms based on the metadata and get a list of results, in a web interface or a custom window, and be able to click on a result to open the file. Basically a typical search window with predefined fields you can fill in to get results - like at an old-school library terminal.
I have decent coding skills in Python, mostly math-related, but some file handling as well. I'm looking for guidance on what tools and approach I should take. My short-term goal is to be able to query, find files, and open the results. Long term, I want to be able to share this with the public so they can search and find stuff.
After trying to figure out what to search for online, I am obviously at a loss. How do you suggest I do this, and what tools or libraries should I use? I cannot find an example of this online - I'm not sure how to word it.
The actual data stuff could be done with Pandas:
read the excel file into Pandas
perform the search on the Pandas dataframe, e.g. using df.query()
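A minimal sketch of that Pandas part, assuming hypothetical column names ('Year', 'info2') that you'd swap for your real ones:

```python
import pandas as pd

# Load the metadata spreadsheet (read_csv works the same way
# if you saved it as .csv instead).
df = pd.read_excel("fieldbooks.xlsx")

# Example query: all field books from 1950 whose info2 mentions "survey".
# engine="python" is needed because .str methods don't work with the
# default numexpr engine.
hits = df.query("Year == 1950 and info2.str.contains('survey', case=False)",
                engine="python")
print(hits[["filename", "Year", "info2"]])
```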
But this does not give you a GUI. For that you could go for a web app, using the Flask or Django framework. That, however, one does not master overnight :)
This is a good course to learn that kind of stuff: https://www.edx.org/course/cs50s-web-programming-with-python-and-javascript?index=product&queryID=01efddd992de28a8b1b27d136111a2a8&position=3
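That said, a bare-bones Flask search page is less work than a full course might suggest. Here is a rough sketch - the CSV name, the column handling, and the /pdfs/ link path are all assumptions; for public sharing you'd add a route that actually serves the PDF files:

```python
from flask import Flask, render_template_string, request
import pandas as pd

app = Flask(__name__)
df = pd.read_csv("fieldbooks.csv")  # hypothetical export of the spreadsheet

TEMPLATE = """
<form><input name="q" placeholder="Year, keyword, ..."><button>Search</button></form>
<ul>
{% for _, r in rows.iterrows() %}
  <li><a href="/pdfs/{{ r.filename }}">{{ r.filename }}</a></li>
{% endfor %}
</ul>
"""

@app.route("/")
def search():
    q = request.args.get("q", "")
    # Match the query (treated as a regex) against every metadata
    # column, case-insensitively; an empty query matches everything.
    mask = df.apply(lambda col: col.astype(str)
                    .str.contains(q, case=False)).any(axis=1)
    return render_template_string(TEMPLATE, rows=df[mask])

if __name__ == "__main__":
    app.run(debug=True)
```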
I want to edit a certain page while creating a PDF with ReportLab only. (There are some solutions with PyPDF2, but I want to use ReportLab only - if it is possible).
Description of what I am doing / trying to do:
I am creating a PDF-File which is giving the reader a good overview of certain data. I get the data from a server. I know the data structure, but from PDF to PDF it varies how much data I get. That's why some PDFs will be 20 pages long, some can be 50 pages+.
After getting all the data and creating a beautiful PDF (this part is already done), I want to go back to page 2 of this PDF and create my own, very individual table of contents.
But I can't find anywhere how to edit a certain page after creating several new pages.
What I've done so far to try to solve my problem:
I read the documentation
I checked stackoverflow
I checked git-repos
Help would be really appreciated. In case it is not possible to add a certain page after other pages have been added with ReportLab, I'll think about using PyPDF2 instead. I have little to no experience with PyPDF2 so far, so if you have some good links you can send me, I'd be very thankful.
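For what it's worth, ReportLab's platypus layer supports this pattern directly: a TableOfContents flowable placed where page 2 will be, filled in by doc.multiBuild(), which re-runs the build until the TOC's page numbers settle - so you never have to revisit a finished page. A sketch along the lines of the user-guide example (names like TocDocTemplate and the frame geometry are my own placeholders):

```python
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import cm
from reportlab.platypus import (BaseDocTemplate, Frame, PageTemplate,
                                Paragraph, PageBreak)
from reportlab.platypus.tableofcontents import TableOfContents

class TocDocTemplate(BaseDocTemplate):
    def afterFlowable(self, flowable):
        # Register every Heading1 paragraph as a TOC entry.
        if isinstance(flowable, Paragraph) and flowable.style.name == "Heading1":
            self.notify("TOCEntry", (0, flowable.getPlainText(), self.page))

styles = getSampleStyleSheet()
doc = TocDocTemplate("report.pdf")
doc.addPageTemplates([PageTemplate(id="body",
                                   frames=[Frame(2*cm, 2*cm, 17*cm, 25*cm)])])

story = [Paragraph("Title page", styles["Title"]), PageBreak(),
         TableOfContents(), PageBreak()]          # the TOC lands on page 2
for i in range(1, 6):                             # stand-in for the real data
    story.append(Paragraph(f"Section {i}", styles["Heading1"]))
    story.append(Paragraph("Body text ...", styles["Normal"]))
    story.append(PageBreak())

# multiBuild runs extra passes until the TOC page numbers are stable.
doc.multiBuild(story)
```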
first post so go easy on me :)
The situation is that I'm trying to scrape information off of a web-based customer-management system (CMS) that has sales information on it, and then get those values into Excel or Google Sheets to ultimately build a report, thus saving the time and errors of flipping through everything manually.
I remember once using a solution (multiple tools) that would basically go through the pages, take values from defined fields on those pages, and then throw that information into columns on a sheet that we'd then manipulate manually. I'm pretty sure it was Python-based and (I think) used the Tampermonkey extension to get the information in a dev/debugger version of Chrome.
The process looked something like this:
Already logged into the CMS -> Execute the tool/script that would then automatically open an order in a new window
It'd then go through that order and take values from specific fields and then copy those values in a sheet
It'd then close the window and proceed on to the next order in the specified range
Once it completes the specified (date) range, the columns would be something like salesperson / order number / sale amount / attachment amount / etc - to then be manually manipulated, no further automation needed (beyond the formulas in the sheet)
Anyone have any ideas on how to get this done or any guides anyone knows of for this specific type of task? Trying to automate this as much as possible - Thanks in advance.
Python should be a good choice as it provides you with many different tools. Depending on the functionality of the CMS you can choose different packages.
Simple HTML scraping
For simple scraping of static HTML content, Scrapy or Beautiful Soup should be enough.
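A minimal static-HTML sketch of that first option - the URL, the selector, and the cookie-based login are all placeholders, and it only works if the CMS renders the values without JavaScript:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical order page; pass your logged-in session cookie if needed.
resp = requests.get("https://cms.example.com/orders/12345",
                    cookies={"sessionid": "..."})
soup = BeautifulSoup(resp.text, "html.parser")

# Hypothetical field selector -- inspect the page to find the real one.
sale_amount = soup.select_one("#sale-amount").get_text(strip=True)
print(sale_amount)
```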
Scraping including executable content
For these cases you can use Selenium, which you can combine with Beautiful Soup. More details can be found in this related question and this one.
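And here's a rough Selenium sketch of the open-order / read-fields / next-order loop from the question. Every URL and CSS selector below is a placeholder to replace with what you find in the CMS's markup; since Selenium drives a real browser, you can log in first and reuse the session.

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Hypothetical listing URL filtered to the date range you care about.
driver.get("https://cms.example.com/orders?start=2020-11-01&end=2020-11-30")
# Collect the links up front so navigation doesn't stale the elements.
links = [a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, "a.order-link")]

rows = []
for link in links:
    driver.get(link)  # open each order and read its fields
    rows.append([driver.find_element(By.CSS_SELECTOR, sel).text
                 for sel in ("#salesperson", "#order-number",
                             "#sale-amount", "#attachment-amount")])
driver.quit()

with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["salesperson", "order number",
                     "sale amount", "attachment amount"])
    writer.writerows(rows)
```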
I am scraping websites for a research project using Python Beautifulsoup.
I have scraped a few thousand records and put them in Excel.
In essence, I want to extract a substring of text (e.g. "python" from a post-title "Introduction to python for dummies").
The post title is scraped and stored in a cell in Excel.
I want to extract "python" and put it in another cell.
I need some advice on whether it is better to do the extraction while scraping OR offline in Excel.
Since this is a research project, there is no need for real-time speed; I am looking at saving effort.
Another related question is whether Python can be used to do the extraction in offline mode - i.e. open Excel, do the extraction, close Excel.
Any help or advice is really appreciated.
Do it at the same time. It will probably only take a handful of lines of code. There's no reason to do the work of walking over the whole file twice.
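Something like this inside your existing scraping loop would do it - the keyword list and the title selector are stand-ins for whatever your project actually needs:

```python
import re

KEYWORDS = ["python", "java", "excel"]  # hypothetical terms to pull out

def extract_keyword(title):
    """Return the first keyword found in a post title, or None."""
    for kw in KEYWORDS:
        if re.search(rf"\b{re.escape(kw)}\b", title, re.IGNORECASE):
            return kw
    return None

# Inside the existing BeautifulSoup loop, something like:
#   title = post.find("h2").get_text(strip=True)   # hypothetical selector
#   row = [title, extract_keyword(title)]
print(extract_keyword("Introduction to python for dummies"))  # -> "python"
```

And if you do end up working on the already-saved spreadsheet instead, pandas.read_excel() / DataFrame.to_excel() handle the open-extract-save round trip without launching Excel itself.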
I am trying to scrape information from the tables at this website >>Here<<
I want to be able to get the scores whenever I want and export them into Excel; I would also like the data to be organised under the hole number. The data that I want is wrapped in a <table> tag with a class of "scoreboard" - that is the bit that I want. I would also like the players' names.
Is this possible, and if so, how?
Please answer.
Excel has its own import-data-from-website feature. It has a nice GUI and lets you easily make dynamic web queries in your Excel sheet so that the data will update every time you open the workbook. This might be the easiest and most efficient way for you to go.
Scrapy is much better, especially for larger projects, for use in Python, but if you're going to put it back into Excel it might not be worth the extra effort.
Check out the official Excel docs on creating dynamic web queries here.
You might wanna take a look at Scrapy. It's a web scraper framework written in Python. It's powerful and easily extensible and customizable.
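Since the table has a predictable class, pandas can also do this in a few lines without a full Scrapy project - a sketch with a placeholder URL (read_html needs lxml or html5lib installed):

```python
import pandas as pd

# Placeholder URL -- substitute the real scores page.
# attrs narrows the match to the <table class="scoreboard"> element.
tables = pd.read_html("https://example.com/scores",
                      attrs={"class": "scoreboard"})
scores = tables[0]  # one DataFrame per matching table
scores.to_excel("scores.xlsx", index=False)
```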