Extract Text into Column using Regex

Extract Text into Column using Regex - python

I want to extract data (S no, Item Code, Price and Size) from the attached PDF Document in to columns.
The re.compile works for the S no, Item Code and Price, but as soon as I put the Size - it gives a limited output. I am unable to figure out why? Can you please help
(Attached picture of the PDF page)
Import pandas as pd
Import re
Import PyPDF2
file = open("Petchem.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(file)
my_dict = {"S no":[], "Item Code":[], "Price":[], "Size":[]}
for page in range (1,25):
pageObj = pdfReader.getPage(page)
data = pageObj.extractText()
size = re.compile(r'((\d{2,4}?)(\d{10})EA\s(\d?\d?,?\d?\d?\d.\d\d)[\s\w\d,:/.()-])')
for number in size.findall(data):
S_No = my_dict["S No"].append(number[1])
Item_Code = my_dict["Item Code"].append(number[2])
Price = my_dict["Price"].append(number[3])
Size = my_dict["Size"].append(number[4])
print(number[1])
a_file = open("Column_Breakup.csv", "w")
datadf = pd.DataFrame(my_dict)
datadf.to_csv("Column_Breakup.csv")
a_file.close()

Related

Filtering the extract of a pdf

I am currently extracting data from a pdf.
My code is able to extract each line of that PDF.
What I want is:
to extract some information, ie: 'Téléc', 'Courriel', 'Territoire desservi'
to have the data that matches such information, ie: Téléc = 579-240-XXX, and so on.
Here is my code, and what can we do to filter my extract?
Code:
import PyPDF2
pdfileObj = xxxxxxxxxxxx
pdf = open(pdfileObj, 'rb')
pdfreader = PyPDF2.PdfFileReader(pdfileObj)
pageObj = pdfreader.getPage(10)
text = pageObj.extract_text()
for words in text:
search_keywords = ['Téléc', 'Courriel', 'Territoire desservi']
print(text)

Unable to convert multiple PDF pages of a PDF File into a CSV using tabula

I have PDF file whose 1st page data format is different however rest of the pages has the same tabular format.
I want to convert this PDF file which has multiple pages into a CSV file using Python Tabula.
The current code is able to convert PDF to CSV if the PDF has only 2 pages and if it has more that two pages it gives error out of range.
I want to count total number of PDF pages of a PDF File and depending upon the same I want python script to convert the PDF to CSV for different data frames.
I am using Linux box to run this python script.
The code is as given below:
#!/usr/bin/env python3
import tabula
import pandas as pd
import csv
pdf_file='/root/scripts/pdf2xls/Test/21KJAZP011.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
'Net Wt.kg','Blender','Remarks','Operator']
df_results=[] # store results in a list
# Page 1 processing
try:
df1 = tabula.read_pdf('/root/scripts/pdf2xls/Test/21KJAZP011.pdf', pages=1,area=(95,20, 800, 840),columns=[93,180,220,252,310,315,333,367,
410,450,480,520]
,pandas_options={'header': None}) #(top,left,bottom,right)
df1[0]=df1[0].drop(columns=5)
df1[0].columns=column_names
df_results.append(df1[0])
df1[0].head(2)
except Exception as e:
print(f"Exception page not found {e}")
# Page 2 processing
try:
df2 = tabula.read_pdf('/root/scripts/pdf2xls/Test/21KJAZP011.pdf', pages=2,area=(10,20, 800, 840),columns=[93,180,220,252,310,315,330,370,
410,450,480,520]
,pandas_options={'header': None}) #(top,left,bottom,right)
row_with_Sta = df2[0][df2[0][0] == 'Sta'].index.tolist()[0]
df2[0] = df2[0].iloc[:row_with_Sta]
df2[0]=df2[0].drop(columns=5)
df2[0].columns=column_names
df_results.append(df2[0])
df2[0].head(2)
except Exception as e:
print(f"Exception page not found {e}")
#res:wult = pd.concat([df1[0],df2[0],df3[0]]) # concate both the pages and then write to CSV
result = pd.concat(df_results) # concate list of pages and then write to CSV
result.to_csv("result.csv")
with open('/root/scripts/pdf2xls/Test/result.csv', 'r') as f_input, open('/root/scripts/pdf2xls/Test/FinalOutput_21KJAZP011.csv', 'w') as f_output:
csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)
csv_output.writerow(next(csv_input)) # write header
for cols in csv_input:
for i in range(7, 9):
cols[i] = '{:.2f}'.format(float(cols[i]))
csv_output.writerow(cols)
Please suggest how can achieve the same. I am very new to Python and hence unable to put together things.

Try pdfpumber https://github.com/jsvine/pdfplumber, worked for me like a charm
pdffile = 'your file'
with pdfplumber.open(pdffile) as pdf:
for i in range(len(pdf.pages)):
first_page = pdf.pages[i]
rawdata = first_page.extract_table()

Extract Multiple Tables from PDF using multiple_tables options using tabula
multiple_tables=True
from tabula import convert_into
table_file = r"PDF_path"
output_csv = r"out_csv"
df = convert_into(table_file, output_csv, output_format='csv', lattice=False, stream=True, multiple_tables=True, pages="all")

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks AND b) pagelabels (custom page numbering) need to be retained.
Retaining bookmarks when merging happens by default with PyPDF2 and pdftk, but not with pdfrw.
Pagelabels are consistently not retained in PyPDF2, pdftk or pdfrw.
I am guessing, after having searched a lot, that there is no straightforward approach to doing what I want. If I'm wrong then I hope someone can point to this easy solution. But, if there is no easy solution, any tips on how to get this going in python will be much appreciated!
Some example code:
1) With PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
#extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
#but PdfFileWriter or PdfFileMerger does not support writing from what I understand
So I dont know how to proceed from here
2) With pdfrw (has more promise)
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good
Page numbers of bookmarks from 2nd file need to be offset before adding them, but when reading outlines I almost always get (IndirectObject, XXX) instead of page numbers. Its unclear how to get page numbers for each label and bookmark using pdfrw. So, I'm stuck again
zp

As mentioned in my comment, I'm posting a generic solution to merge several pdfs that works in PyPDF2. Dont know what is different to make this work in PyPDF2 other than initializing pls as ArrayObject()
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
import PyPDF2.pdf as PDF
# pls holds all the pagelabels as we iterate through multiple pdfs
pls = PDF.ArrayObject()
# used to offset bookmarks
pageCount = 0
cpdf = PdfFileMerger()
# pdffiles is a list of all files to be merged
for i in range(len(pdffiles)):
tmppdf = PdfFileReader(pdffiles[i], 'rb')
cpdf.append(tmppdf)
# copy all the pagelabels which I assume is present in all files
# you could use 'try' in case no pagelabels are present
plstmp = tmppdf.trailer['/Root']['/PageLabels']['/Nums']
# sometimes keys are indirect objects
# so, iterate through each pagelabel and...
for j in range(len(plstmp)):
# ... get the actual values
plstmp[j] = plstmp[j].getObject()
# offset pagenumbers by current count of pages
if isinstance(plstmp[j], int):
plstmp[j] = PDF.NumberObject(plstmp[j] + pageCount)
# once all the pagelabels are processed I append to pls
pls += plstmp
#increment pageCount
pageCount += tmppdf.getNumPages()
# rest follows KevinM's answer
pagenums = PDF.DictionaryObject()
pagenums.update({PDF.NameObject('/Nums') : pls})
pagelabels = PDF.DictionaryObject()
pagelabels.update({PDF.NameObject('/PageLabels') : pagenums})
cpdf.output._root_object.update(pagelabels)
cpdf.write("filename.pdf")

You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.
This solution also requires PyPDF4, since PyPDF2 produces a weird error (see bottom).
from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader
# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF
import logging
def add_nums(num_entry, page_offset, nums_array):
for num in num_entry['/Nums']:
if isinstance(num, (int)):
logging.debug("Found page number %s, offset %s: ", num, page_offset)
# Add the physical page information
nums_array.append(PDF.NumberObject(num+page_offset))
else:
# {'/S': '/r'}, or {'/S': '/D', '/St': 489}
keys = num.keys()
logging.debug("Found page label, keys: %s", keys)
number_type = PDF.DictionaryObject()
# Always copy the /S entry
s_entry = num['/S']
number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
logging.debug("Adding /S entry: %s", s_entry)
if '/St' in keys:
# If there is an /St entry, fetch it
pdf_label_offset = num['/St']
# and add the new offset to it
logging.debug("Found /St %s", pdf_label_offset)
number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})
# Add the label information
nums_array.append(number_type)
return nums_array
def write_merged(pdf_readers):
# Output
merger = PdfFileMerger()
# For PageLabels information
page_labels = []
page_offset = 0
nums_array = PDF.ArrayObject()
# Iterate through all the inputs
for pdf_reader in pdf_readers:
try:
# Merge the content
merger.append(pdf_reader)
# Handle the PageLabels
# Fetch page information
old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
page_count = pdf_reader.getNumPages()
# Add PageLabel information
add_nums(old_page_labels, page_offset, nums_array)
page_offset = page_offset + page_count
except Exception as err:
print("ERROR: %s" % err)
# Add PageLabels
page_numbers = PDF.DictionaryObject()
page_numbers.update({PDF.NameObject("/Nums"): nums_array})
page_labels = PDF.DictionaryObject()
page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
root_obj = merger.output._root_object
root_obj.update(page_labels)
# Write output
merger.write('merged.pdf')
pdf_readers = []
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)
write_merged(pdf_readers)
Note: PyPDF2 produces this weird error:
...
...
File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
data[key] = value
File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject

BeautifulSoup - scraping a forum page

I'm trying to scrape a forum discussion and export it as a csv file, with rows such as "thread title", "user", and "post", where the latter is the actual forum post from each individual.
I'm a complete beginner with Python and BeautifulSoup so I'm having a really hard time with this!
My current problem is that all the text is split into one character per row in the csv file. Is there anyone out there who can help me out? It would be fantastic if someone could give me a hand!
Here's the code I've been using:
from bs4 import BeautifulSoup
import csv
import urllib2
f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(f)
b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this
writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

Ok here we go. Not quite sure what I'm helping you do here, but hopefully you have a good reason to be analyzing silk road posts.
You have a few issues here, the big one is that you aren't parsing the data at all. What you're essentially doing with .get_text() is going to the page, highlighting the whole thing, and then copying and pasting the whole thing to a csv file.
So here is what you should be trying to do:
Read the page source
Use soup to break it into sections you want
Save sections in parallel arrays for author, date, time, post, etc
Write data to csv file row by row
I wrote some code to show you what that looks like, it should do the job:
from bs4 import BeautifulSoup
import csv
import urllib2
# get page source and create a BeautifulSoup object based on it
print "Reading page..."
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(page)
# if you look at the HTML all the titles, dates,
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")
# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")
# define where we will store info
titles = []
authors = []
times = []
posts = []
# now we iterate through the metaData and parse it
# into titles, authors, and dates
print "Parsing data..."
for html in metaData:
text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
times.append(text.split(" on ")[1].strip()) # get date
# now we go through the actual post data and extract it
for post in postData:
posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())
# now we write data to csv file
# ***csv files MUST be opened with the 'b' flag***
csvfile = open('silkroad.csv', 'wb')
writer = csv.writer(csvfile)
# create template
writer.writerow(["Time", "Author", "Title", "Post"])
# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
writer.writerow([time, author, title, post])
# close file
csvfile.close()
# done
print "Operation completed successfully."
EDIT: Included solution that can read files from directory and use data from that
Okay, so you have your HTML files in a directory. You need to get a list of files in the directory, iterate through them, and append to your csv file for each file in the directory.
This is the basic logic of our new program.
If we had a function called processData() that took a file path as an argument and appended data from the file to your csv file here is what it would look like:
# the directory where we have all our HTML files
dir = "myDir"
# our csv file
csvFile = "silkroad.csv"
# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()
# get a list of files in the directory
fileList = os.listdir(dir)
# define variables we need for status text
totalLen = len(fileList)
count = 1
# iterate through files and read all of them into the csv file
for htmlFile in fileList:
path = os.path.join(dir, htmlFile) # get the file path
processData(path) # process the data in the file
print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
count = count + 1 # increment counter
As it happens our processData() function is more or less what we did before, with a few changes.
So this is very similar to our last program, with a few small changes:
We write the column headers first thing
Following that we open the csv with the 'ab' flag to append
We import os to get a list of files
Here's what that looks like:
from bs4 import BeautifulSoup
import csv
import urllib2
import os # added this import to process files/dirs
# ** define our data processing function
def processData( pageFile ):
''' take the data from an html file and append to our csv file '''
f = open(pageFile, "r")
page = f.read()
f.close()
soup = BeautifulSoup(page)
# if you look at the HTML all the titles, dates,
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")
# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")
# define where we will store info
titles = []
authors = []
times = []
posts = []
# now we iterate through the metaData and parse it
# into titles, authors, and dates
for html in metaData:
text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
times.append(text.split(" on ")[1].strip()) # get date
# now we go through the actual post data and extract it
for post in postData:
posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())
# now we write data to csv file
# ***csv files MUST be opened with the 'b' flag***
csvfile = open('silkroad.csv', 'ab')
writer = csv.writer(csvfile)
# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
writer.writerow([time, author, title, post])
# close file
csvfile.close()
# ** start our process of going through files
# the directory where we have all our HTML files
dir = "myDir"
# our csv file
csvFile = "silkroad.csv"
# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()
# get a list of files in the directory
fileList = os.listdir(dir)
# define variables we need for status text
totalLen = len(fileList)
count = 1
# iterate through files and read all of them into the csv file
for htmlFile in fileList:
path = os.path.join(dir, htmlFile) # get the file path
processData(path) # process the data in the file
print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
count = count + 1 # incriment counter

finding on which page a search string is located in a pdf document using python

Which python packages can I use to find out out on which page a specific “search string” is located ?
I looked into several python pdf packages but couldn't figure out which one I should use.
PyPDF does not seem to have this functionality and PDFMiner seems to be an overkill for such simple task.
Any advice ?
More precise:
I have several PDF documents and I would like to extract pages which are between a string “Begin” and a string “End” .

I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.
(1) a function to locate the string
def fnPDF_FindText(xFile, xString):
# xfile : the PDF file in which to look
# xString : the string to look for
import pyPdf, re
PageFound = -1
pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
for i in range(0, pdfDoc.getNumPages()):
content = ""
content += pdfDoc.getPage(i).extractText() + "\n"
content1 = content.encode('ascii', 'ignore').lower()
ResSearch = re.search(xString, content1)
if ResSearch is not None:
PageFound = i
break
return PageFound
(2) a function to extract the pages of interest
def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
from pyPdf import PdfFileReader, PdfFileWriter
output = PdfFileWriter()
pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
for i in range(xPageStart, xPageEnd):
output.addPage(pdfOne.getPage(i))
outputStream = file(xFileNameOutput, "wb")
output.write(outputStream)
outputStream.close()
I hope this will be helpful to somebody else

I was able to successfully get the output using the code below.
Code:
import PyPDF2
import re
# Open the pdf file
object = PyPDF2.PdfFileReader(r"C:\TEST.pdf")
# Get number of pages
NumPages = object.getNumPages()
# Enter code here
String = "Enter_the_text_to_Search_here"
# Extract text and do the search
for i in range(0, NumPages):
PageObj = object.getPage(i)
Text = PageObj.extractText()
if re.search(String,Text):
print("Pattern Found on Page: " + str(i))
Sample Output:
Pattern Found on Page: 7

Finding on which page a search string is located in a pdf document using python
PyPDF2
# import packages
import PyPDF2
import re
# open the pdf file
object = PyPDF2.PdfFileReader(r"source_file_path")
# get number of pages
NumPages = object.getNumPages()
# define keyterms
String = "P4F-21B"
# extract text and do the search
for i in range(0, NumPages):
PageObj = object.getPage(i)
Text = PageObj.extractText()
ResSearch = re.search(String, Text)
if ResSearch != None:
print(ResSearch)
print("Page Number" + str(i+1))
Output:
<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1
PyMuPDF
import fitz
import re
# load document
doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")
# define keyterms
String = "P4F-21B"
# get text, search for string and print count on page.
for page in doc:
text = ''
text += page.get_text()
if len(re.findall(String, text)) > 0:
print(f'count on page {page.number + 1} is: {len(re.findall(String, text))}')

In addition to what #user1043144 mentioned,
To use with python 3.x
Use PyPDF2
import PyPDF2
Use open instead of file
PdfFileReader(open(xFile, 'rb'))

updated answer with PYDF2
import re
import PyPDF2
def pdf_find_text(xfile_pdf, xsearch_string, ignore_case = False):
'''
find page(s) on which a given text is located in a pdf
input: pdf file and the string to search
(string to search can be in a regex like 'references\n')
N.B:
results need to be checked
in case of pdf whose page numbers are not zero indexed ,
the results seems off (by one page)
'''
xlst_res = []
xreader = PyPDF2.PdfFileReader(xfile_pdf)
for xpage_nr, xpage in enumerate(xreader.pages):
xpage_text = xpage.extractText()
xhits = None
if ignore_case == False:
xhits = re.search(xsearch_string, xpage_text.lower())
else:
xhits = re.search(xsearch_string, xpage_text.lower(), re.IGNORECASE)
if xhits:
xlst_res.append(xpage_nr)
return {'num_pages': xreader.numPages, 'page_hits': xlst_res}
def pdf_extract_pages(xpdf_original, xpdf_new , xpage_start, xpage_end):
'''
given a pdf extract a page range and save it in a new pdf file
'''
with open(xpdf_original, 'rb') as xfile_1, open(xpdf_new , 'wb') as xfile_2:
xreader = PyPDF2.PdfFileReader(xfile_1)
xwriter = PyPDF2.PdfFileWriter()
for xpage_nr in range(xpage_start, xpage_end ):
xwriter.addPage(xreader.getPage(xpage_nr))
xwriter.write(xfile_2)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract Text into Column using Regex - python

Related

Filtering the extract of a pdf

Unable to convert multiple PDF pages of a PDF File into a CSV using tabula

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

BeautifulSoup - scraping a forum page

finding on which page a search string is located in a pdf document using python

Categories

Resources