Split page in pdf by specific title [duplicate] - python

I have a PDF file. It contains four columns, and the pages have no grid lines. The data is students' marks.
I would like to run some analysis on this distribution (histograms, line graphs, etc.).
I want to parse this PDF file into a spreadsheet or an HTML file (which I can then parse very easily).
The link to the pdf is:
Pdf
This is a public document, openly available on this domain to anyone.
Note: I know this can be done by exporting the file to text from Adobe Reader and then importing it into LibreOffice Calc or Excel, but I want to do it with a Python script.
Kindly help me with this issue.
specs:
Windows 7
Python 2.7

Use PyPDF2:
from PyPDF2 import PdfFileReader

with open('CT1-All.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    contents = reader.getPage(0).extractText().split('\n')
When you print contents, it will look like this (I have trimmed it here):
[u'Serial NoRoll NoNameCT1 Marks (50)111MA20026KARADI KALYANI212AR10029MUKESH K
MAR5', u'312MI31004DEEPAK KUMAR7', u'413AE10008FADKE PRASAD DIPAK27', u'513AE10
22RAHUL DUHAN37', u'613AE30005HIMANSHU PRABHAT26.5', u'713AE30019VISHAL KUMAR39
, u'813AG10014HEMANT17', u'913AG10028SHRESTH KR KRISHNA37.51013AG30009HITESH ME
RA33.5', u'1113AG30023RACHIT MADHUKAR40.5', u'1213AR10002ACHARY SUDHEER11', u'1
13AR10004AMAN ASHISH20.5', u'1413AR10008ANKUR44', u'1513AR10010CHUKKA SHALEM RA
U11.5', u'1613AR10012DIKKALA VIJAYA RAGHAVA20.5', u'1713AR10014HRISHABH AMRODIA
1', u'1813AR10016JAPNEET SINGH CHAHAL19.5', u'1913AR10018K VIGNESH42.5', u'2013
R10020KAARTIKEY DWIVEDI49.5', u'2113AR10024LAKSHMISRI KEERTI MANNEY49', u'2213A
10026MAJJI DINESH9.5', u'2313AR10028MOUNIKA BHUKYA17.5', u'2413AR10030PARAS PRA
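Each extracted row runs the serial number, roll number, name, and marks together. Assuming the roll numbers always follow the pattern seen in the sample above (two digits, two letters, five digits, e.g. 13AR10002 — an assumption, check it against the full file), a regex can split each row apart and the csv module can then write the spreadsheet:

```python
import csv
import re

# Assumed row shape: serial no + roll no (2 digits, 2 letters, 5 digits) + name + marks
ROW = re.compile(r"^(\d+?)(\d{2}[A-Z]{2}\d{5})(.+?)([\d.]+)$")

def parse_row(row):
    # Returns (serial, roll, name, marks) or None if the row doesn't match
    m = ROW.match(row)
    return m.groups() if m else None

rows = ['413AE10008FADKE PRASAD DIPAK27', '613AE30005HIMANSHU PRABHAT26.5']
parsed = [parse_row(r) for r in rows]

# Write the parsed rows to a CSV that Excel/LibreOffice can open directly
with open('marks.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Serial No', 'Roll No', 'Name', 'CT1 Marks (50)'])
    writer.writerows(p for p in parsed if p)
```

Rows that span two list items in the extracted output (like the header line fused with the first record) will need to be joined or skipped before matching.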

Related

PDF form filled out with PyPDF2 is shown inconsistently in different PDF viewers

Here is a simple very easily reproducible example of the bug I am having in PyPDF2. Please help me figure out what is going on.
I want to generate this pdf:
As you can see, I have generated it, so the code is working. However, it looks like this only when viewed in macOS Preview or Safari. When I view THE SAME FILE in Google Chrome, Brave, or Adobe I see this:
So page 2 just shows page 1 again. If I generate n pages, all n pages are identical and they all say "page 1".
I put all the versions of the software I am using in the comments below. I want this to render correctly in Chrome, Brave, and Adobe; it's probably wrong in many other viewers as well.
# Download the original form and save a local copy: https://www.irs.gov/pub/irs-pdf/f8949.pdf
from PyPDF2 import PdfReader, PdfWriter

path = "f8949.pdf"  # local copy of the downloaded form

writer = PdfWriter()

reader = PdfReader(path)
page = reader.pages[0]
writer.add_page(page)
writer.update_page_form_field_values(writer.pages[0], {"f1_3[0]": "page1"})

reader = PdfReader(path)
page = reader.pages[0]
writer.add_page(page)
writer.update_page_form_field_values(writer.pages[1], {"f1_3[0]": "page2"})

with open("output.pdf", "wb") as output_stream:
    writer.write(output_stream)
#PyPDF2==2.11.2
#PyPDF4==1.27.0
#Python 3.9.7
#Mac 12.1
#Brave 1.45.127
#Adobe 2022.003.20258
#safari good
#preview good
#google bad
#adobe bad
#brave bad
Yes, this is a bug in PyPDF2. We have had issues with form filling for a long time; see https://github.com/py-pdf/PyPDF2/issues/355

Is it possible to capture specific parts of a PDF text with AWS Textract?

I need to extract the text from the PDF, but I don't want the entire PDF to be parsed. I wonder if it's possible to get specific parts of the parsed PDF. For example, I have a PDF with information about: Address, city and country. I don't want everything returned, just the Address field, not the other information.
Code that returns the text to me:
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string
response = call_textract(input_document="s3://my-bucket/myfile.pdf")
print(get_lines_string(response))
Try this method (it doesn't use AWS Textract, but works as well):
import PyPDF2

def extract_text(filename, page_number):
    # Returns the content of a given page
    pdf_file_object = open(filename, 'rb')
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
    # page_number - 1 because PyPDF2 counts pages from zero
    page_object = pdf_reader.getPage(page_number - 1)
    text = page_object.extractText()
    pdf_file_object.close()
    return text
This function extracts the text from one single PDF page.
If you haven't got PyPDF2 yet, install it through the command line with 'pip install PyPDF2'.
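If you do want to stay with full-page extraction (Textract or otherwise), a common approach is to extract everything and then filter for the one field you need afterwards. A minimal sketch — the sample text and the "Field: value" line layout are assumptions, real PDFs vary:

```python
import re

def extract_field(text, field):
    # Look for a "Field: value" (or "Field - value") line in the extracted text
    match = re.search(rf"{re.escape(field)}\s*[:\-]\s*(.+)", text, re.IGNORECASE)
    return match.group(1).strip() if match else None

# Hypothetical output of get_lines_string(response)
sample = "Name: Jane Doe\nAddress: 12 Main St\nCity: Lisbon\nCountry: Portugal"
print(extract_field(sample, "Address"))  # → 12 Main St
```

For a Textract-native route, the AnalyzeDocument FORMS feature returns key-value pairs directly, which may fit this use case better than parsing raw lines.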

How to extract the title, authors, creation date of a PDF in Python

I manage papers locally and rename each PDF file in the form of "creationdate_authors_title.pdf". Hence, extracting the title, authors, creation date of each paper from the PDF file automatically is required.
I have written a python script using the package pdfminer to extract info. However, for certain files, after parsing them, the file info stored in the dictionary doc.info[0] by using PDFDocument may not contain some keys such as "Author", or these keys' values are empty.
I'm wondering how I can locate the required info, such as the paper's title, directly from the PDF using a function like "extract_pages". Or, more generally, how can I accurately and efficiently extract the info I need?
Any hint would be appreciated! Many thanks in advance.
You can use this script to extract all the metadata using the library PyPDF2
from PyPDF2 import PdfFileReader

def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
    print(info)
    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title

if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    get_info(path)
As you can see, the info variable holds everything you need; check the PyPDF2 documentation for details.
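For the renaming use case in the question, once you have the metadata strings you can assemble the "creationdate_authors_title.pdf" name with plain string handling. A sketch — the date handling assumes the usual PDF `D:YYYYMMDDHHmmSS` form, and the sanitizing rules are my own choice:

```python
import re

def build_filename(creation_date, authors, title):
    # Keep only filename-safe characters and hyphenate spaces
    def clean(s):
        return re.sub(r"[^\w\- ]", "", s).strip().replace(" ", "-")
    # PDF dates usually look like D:20210305120000+01'00'; keep YYYYMMDD
    date = re.sub(r"^D:", "", creation_date)[:8]
    return "{}_{}_{}.pdf".format(date, clean(authors), clean(title))

print(build_filename("D:20210305120000+01'00'", "A. Smith", "On PDF Parsing"))
# → 20210305_A-Smith_On-PDF-Parsing.pdf
```

When the Author or Title key is missing or empty (as the question notes happens), you would still need a fallback, e.g. extracting the largest text on page 1 with pdfminer's layout analysis.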

PyPDF4 not reading certain characters

I'm compiling some data for a project and I've been using PyPDF4 to read this data from its source PDF file, but I've been having trouble with certain characters not showing up correctly. Here's my code:
from PyPDF4 import PdfFileReader
import pandas as pd

# File name
pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"
# Results storage
results = {}
# Start page
page = 5
# Lambda to assign votes
serify = lambda voters, vote: pd.Series({voter.strip(): vote for voter in voters})

with open(pdf_path, 'rb') as f:
    # Get PDF reader for PDF file f
    pdf = PdfFileReader(f)
    while page < pdf.numPages:
        # Get text of page in PDF
        text = pdf.getPage(page).extractText()
        proposal = text.split("\n+\n")[0].split("\n")[3]
        # Collect all relevant pages
        while text.find("\n0\n") == -1:
            page += 1
            text += "\n".join(pdf.getPage(page).extractText().split("\n")[3:])
        # Remove corrections
        text, corrections = text.split("CORRECCIONES")
        # Grab relevant text !!! This is where the missing characters show up.
        text = "\n, ".join([n[:n.rindex("\n")] for n in text.split("\n:")])
        for_list = "".join(text[text.index("\n+\n")+3:text.index("\n-\n")].split("\n")[:-1]).split(", ")
        nay_list = "".join(text[text.index("\n-\n")+3:text.index("\n0\n")].split("\n")[:-1]).split(", ")
        abs_list = "".join(text[text.index("\n0\n")+3:].split("\n")[:-1]).split(", ")
        # Store data in results
        results.update({proposal: dict(pd.concat([serify(for_list, 1), serify(nay_list, -1), serify(abs_list, 0)]).items())})
        page += 1
        print(page)

results = pd.DataFrame(results)
The characters I'm having difficulty with don't show up in the text extracted with extractText. Ždanoka, for instance, becomes "danoka; Štefanec becomes -tefanc. Most of the affected characters seem to be Eastern European, which makes me think I need one of the Latin decoders.
I've looked through some of PyPDF4's capabilities, it seems like it has plenty of relevant codecs, including latin1. I've attempted decoding the file using different functions from the PyPDF4.generic.codecs module, and either the characters don't show still, or the code throws an error at an unrecognised byte.
I haven't yet attempted using multiple codecs on different bytes from the same file, that seems like it would take some time. Am I missing something in my code that can easily fix this? Or is it more likely I will have to tailor fit a solution using PyPDF4's functions?
Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations.
pypdf has received a lot of updates in December 2022. Especially the text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
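The migration is mostly mechanical renames. A quick reference for the calls used elsewhere on this page — based on the pypdf migration guide, so verify against your installed version:

```python
# Old PyPDF2/PyPDF4 name  ->  current pypdf equivalent
PYPDF2_TO_PYPDF = {
    "PdfFileReader(f)": "PdfReader(f)",
    "PdfFileWriter()": "PdfWriter()",
    "reader.getNumPages()": "len(reader.pages)",
    "reader.numPages": "len(reader.pages)",
    "reader.getPage(i)": "reader.pages[i]",
    "page.extractText()": "page.extract_text()",
    "reader.getDocumentInfo()": "reader.metadata",
}

print(PYPDF2_TO_PYPDF["page.extractText()"])  # → page.extract_text()
```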

How to extract Table from PDF in Python? [duplicate]

This question already has answers here:
How can I extract tables from PDF documents?
(4 answers)
Closed 11 days ago.
I have thousands of PDF files, composed only by tables, with this structure:
pdf file
However, despite being fairly structured, I cannot read the tables without losing the structure.
I tried PyPDF2, but the data comes completely messed up.
import PyPDF2
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])
I also tried Tabula, but it only reads the header (and not the content of the tables)
from tabula import read_pdf
pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')    # Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)    # Option 2: reads only the first header and a few lines of content
Any thoughts?
After struggling a little bit, I found a way.
For each page of the file, it was necessary to pass the area of the table and the limits of the columns to tabula's read_pdf function.
Here is the working code:
import pypdf
from tabula import read_pdf

# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)

# For each page, the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
)
Try this: pip install tabula-py
from tabula import read_pdf
df = read_pdf("file_name.pdf")
Use the tabula library:
pip install tabula-py
Then extract the tables:
import tabula

# this reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)
# if you want to read all pages
dfs = tabula.read_pdf(url, pages='all')
dfs[1]
By the way, I tried reading PDF files another way, and it works better than the tabula library. I will post it soon.

@fmarques You could also try a new Python package (SLICEmyPDF) developed by StatCan specifically for extracting tabular data from PDFs:
https://github.com/StatCan/SLICEmyPDF
From my experience SLICEmyPDF outperforms other free Python or R packages.
The catch is that it requires installing a few extra pieces of free software. The installation instructions can be found at
https://dataworldofredhairedgirl.blogspot.com/2022/04/how-to-install-statcan-slicemypdf-on.html
