Extracting tables from 1000s of PDFs using the tabula area argument - python

I have around 970 PDF files with the same format, and I want to extract the table from each of them. After some research I am able to extract the table using tabula's area argument. Unfortunately the area parameters are not the same for every PDF, so I cannot simply iterate over the files. If anyone can help me automate finding these area arguments for each PDF, it would be a great help.
As you can see in the image, I have to use area, otherwise the junk in the header is also parsed. Here is the script I was able to run successfully for the first PDF, but I need to extract from 970 files, which is not possible manually. Please help!
# author: Jiku-tlenova

import os
import re

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import PyPDF2 as rdpdf
import tabula

path = "/codes/python/"
os.chdir(path)
from convert_pdf_to_txt import convert_pdf_to_txt
os.getcwd()

pa = "s/"
os.chdir(path + pa)
files = os.listdir(".")

ar = [187.65, 66.35, 606.7, 723.11]  # table area: top, left, bottom, right (in points)
tablist = []
for file in files:
    i = 0
    pgnum = 2
    endval = 0
    weind = re.findall(r"\d+", file)
    print(file)
    reader = rdpdf.PdfFileReader(file)
    while endval == 0:
        table0 = tabula.read_pdf(file, pages=i + 2, spreadsheet=True,
                                 multiple_tables=False, lattice=True, area=ar)  # pandas_options={'header': 'infer'}
        table0 = table0.dropna(how="all", axis=1)
        # formatting headers
        head = (table0.iloc[0, :] + table0.iloc[1, :]).T
        table0.columns = head
        table0 = table0.drop([0, 1])
        table0 = table0.iloc[:-1]  # delete last row - not needed
        mys = table0[table0.columns[-1]]
        val = mys.isnull().all()
        if val:  # stop when the last column of the page is entirely empty
            endval = 1
        tablist.append(table0)
        i = i + 1

Finally able to do it myself... basically I took the code from R and used a wrapper. The R support community on Stack Overflow seems much more active than the Python one. Thanks.
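For anyone hitting the same wall, below is a minimal sketch of one way to avoid hard-coding the area at all: use pdfplumber to find a heading word that always sits at the top of the table and derive tabula's area (top, left, bottom, right, in points) from its coordinates. The anchor string, margins, and file name are hypothetical placeholders, and this is not the R-wrapper route the asker ended up taking.

import pdfplumber
import tabula

ANCHOR = "Description"  # hypothetical column heading that marks the top of the table


def guess_area(pdf_path, page_number):
    """Return [top, left, bottom, right] in points for tabula, or None if the anchor is missing."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        for word in page.extract_words():
            if word["text"] == ANCHOR:
                # start just above the anchor row, span the full page width,
                # and stop a little above the bottom margin (margins are guesses)
                return [float(word["top"]) - 5, 0, float(page.height) - 40, float(page.width)]
    return None


area = guess_area("example.pdf", 2)
if area is not None:
    tables = tabula.read_pdf("example.pdf", pages=2, lattice=True, area=area)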

Related

Import csv from Kaggle url into a pandas DataFrame

I want to import a public dataset from Kaggle (https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv) into a local Jupyter notebook. I don't want to use any credentials in the process.
I saw diverse solutions, including pd.read_html, pd.read_csv and pd.read_table (pd = pandas).
I also found solutions that require a login.
The first set of solutions is the one I am interested in, though I see that they work on other websites because there is a link to the raw data.
I have been clicking everywhere in the Kaggle interface but find no direct URL to the raw data.
Bottom line: is it possible to use, say, pd.read_csv to get data from the website directly into a local notebook? If so, how?
You can automate kaggle.cli.
Follow the instructions at https://github.com/Kaggle/kaggle-api to download and save kaggle.json for authentication.
import kaggle.cli
import sys
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
# download data set
# https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv
dataset = "unsdsn/world-happiness"
sys.argv = [sys.argv[0]] + f"datasets download {dataset}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f"{dataset.split('/')[1]}.zip")
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
dfs["2017.csv"]

COVID-19 data analysis with Python from Github CSV

This link contains CSV files for daily reports of COVID-19.
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports
What is the best solution to get all the csv files in a dataframe?
I tried the code below from other questions but it doesn't work.
from pathlib import Path
import pandas as pd
files = Path('https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports')
csv_only = files.rglob('*.csv')
combo = [pd.read_csv(f)
           .assign(date=f.stem)  # the file name carries the report date
           .fillna(0)
         for f in csv_only]
one_df = pd.concat(combo,ignore_index=True)
one_df = one_df.drop_duplicates('date')
print(one_df)
How could I use requests to read all the files?
You can simply use the requests module to get the names of all the .csv files present, which eliminates the need for glob:
import requests
url = "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports"
csv_only = [i.split("=")[1][1:-1] for i in requests.get(url).text.split(" ") if '.csv' in i and 'title' in i]
pathlib only works with filesystems, so this won't do; csv_only will be an empty generator because there is no such location on your disk. You need to fetch the data from GitHub with actual HTTP requests. I did something similar for personal use some time ago; you can have a look and modify it accordingly (it uses the GitHub API, so you'll need to get a token).
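For completeness, here is a minimal sketch, separate from both answers, that lists the CSVs through the public GitHub contents API and reads each one straight from its raw download URL (the unauthenticated API call is rate-limited, and fetching every daily report means a lot of requests):

import pandas as pd
import requests

api_url = ("https://api.github.com/repos/CSSEGISandData/COVID-19/contents/"
           "csse_covid_19_data/csse_covid_19_daily_reports")

listing = requests.get(api_url).json()
csv_files = [f for f in listing if f["name"].endswith(".csv")]

frames = [
    pd.read_csv(f["download_url"]).assign(source=f["name"]).fillna(0)
    for f in csv_files
]
one_df = pd.concat(frames, ignore_index=True)
print(one_df.shape)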

Extracting Tables from PDFs Using Tabula

I came across a great library called Tabula and it almost did the trick. Unfortunately, there is a lot of useless area on the first page that I don't want Tabula to extract. According to documentation, you can specify the page area you want to extract from. However, the useless area is only on the first page of my PDF file, and thus, for all subsequent pages, Tabula will miss the top section. Is there a way to specify the area condition to only apply to the first page of the PDF?
from tabula import read_pdf
df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages='all')
I'm trying to work on something similar (parsing bank statements) and had the same issue. The only way to solve this I have found so far is to parse each page individually.
The only problem is that this requires knowing in advance how many pages your file is composed of. For the moment I have not found a way to do this directly with Tabula, so I've decided to use the pyPdf module to get the number of pages.
import pyPdf
from tabula import read_pdf

reader = pyPdf.PdfFileReader(open(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", mode='rb'))
n = reader.getNumPages()

df = []
for page in [str(i + 1) for i in range(n)]:
    if page == "1":
        # only the first page needs the area restriction
        df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf",
                           area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", pages=page))
Notice that there are some known and open issues when reading each page individually, or all at the same time.
Good luck!
08/03/2017 EDIT:
Found a simpler way to count the pages of the PDF without going through pyPdf:
import re

def count_pdf_pages(file_path):
    # byte-string pattern because the file is opened in binary mode
    rxcountpages = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))
where file_path is the path to your file of course
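A quick usage sketch, reusing the same hypothetical bank-statement path as above, that plugs this into the per-page loop:

n = count_pdf_pages(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf")
print(n)  # page count, usable in place of reader.getNumPages() in the loop above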
Use the code below; it may help you.
import os
from tabula import wrapper

os.chdir("E:/Documents/myPy/")  # work in the folder that contains the PDF
tables = wrapper.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

i = 1
for table in tables:
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)
    i = i + 1
The parameter guess=False will solve the problem.
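A minimal sketch of what that looks like, reusing the area values from the question (depending on the tabula-py version, read_pdf returns a single DataFrame or a list of DataFrames):

from tabula import read_pdf

# guess=False stops tabula from auto-detecting the table region,
# so the explicit area (top, left, bottom, right, in points) is honoured as given
df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf",
              pages="all", guess=False, area=(530, 12.75, 790.5, 561))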
pip install tabula-py
pip install tabulate
from tabula import read_pdf
from tabulate import tabulate
from PyPDF2 import PdfFileReader

# pages=[2:] is not valid Python; build an explicit list of pages 2..last instead
n_pages = PdfFileReader(open("abc.pdf", "rb")).getNumPages()
df = read_pdf("abc.pdf", pages=list(range(2, n_pages + 1)))  # reads the tables, skipping page 1
print(tabulate(df))
Parameters:
pages (str, int, list of int, optional)
An optional value specifying the pages to extract from. It accepts a str, an int, or a list of ints. Default: 1
Examples: '1-2,3', 'all', [1,2]
Since the first page is useless, we drop it and read from page 2 up to the last page.

Extracting text from multiple powerpoint files using python

I am trying to find a way to look in a folder and search the contents of all of the powerpoint documents within that folder for specific strings, preferably using Python. When those strings are found, I want to report out the text after that string as well as what document it was found in. I would like to compile the information and report it in a CSV file.
So far I've only come across the olefile package (https://bitbucket.org/decalage/olefileio_pl/wiki/Home). This provides all of the text contained in a specific document, which is not what I am looking to do. Please help.
Actually working
If you want to extract text:
import Presentation from pptx (pip install python-pptx)
for each file in the directory (using the glob module)
look at every slide and at every shape in each slide
if a shape has a text attribute, print shape.text
from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
tika-python
A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.
Note: it also works charmingly with PyInstaller.
Install with pip :
pip install tika
Sample:
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file
Link to official GitHub
python-pptx can be used to do what you propose. Just at a high level, you would do something like this (not working code, just an idea of the overall approach):
from pptx import Presentation

for pptx_filename in directory:  # directory: an iterable of .pptx paths
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            print(shape.text)
You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
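Since the searching and CSV reporting is what the question actually asks for, here is a minimal sketch of one way to finish the job; the search string, folder, and output file name are placeholders, and "the text after that string" is interpreted as the remainder of the matching line:

import csv
import glob

from pptx import Presentation

SEARCH_STRING = "Project Code:"  # placeholder for the string you are looking for

rows = []
for pptx_filename in glob.glob("*.pptx"):
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            if not hasattr(shape, "text"):
                continue
            for line in shape.text.splitlines():
                if SEARCH_STRING in line:
                    # report the text that follows the search string, plus the file it came from
                    after = line.split(SEARCH_STRING, 1)[1].strip()
                    rows.append([pptx_filename, after])

with open("matches.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "text_after_string"])
    writer.writerows(rows)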
Textract-Plus
Use textract-plus, which can extract text from most document extensions, including pptx and pptm.
Refer to the docs.
Install-
pip install textract-plus
Sample-
import textractplus as tp
text=tp.process('path/to/yourfile.pptx')
For your case:
import os
import pandas as pd
import textractplus as tp

files_csv = []
your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        text = tp.process(os.path.join(your_dir, f))
        files_csv.append([f, text])

pd.DataFrame(files_csv, columns=['filename', 'text']).to_csv('your_csv.csv')
This code fetches all the pptx and pptm files from the directory and creates a CSV whose first column is the filename and whose second column is the text extracted from that file.
import os
import textract

your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        text = textract.process(os.path.join(your_dir, f))
        print(text)

Import csv Python with Spyder

I am trying to import a csv file into Python but it doesn't seem to work unless I use the Import Data icon.
I've never used Python before, so apologies if I am doing something obviously wrong. I use R, and I am trying to replicate in Python the same tasks I do in R.
Here is some sample code:
import os
import pandas as pd

Main_Path = "C:/Users/fan0ia/Documents/Python_Files"
Area = "Pricing"
Project = "Elasticity"

Path = os.path.join(Main_Path, Area, Project)
os.chdir(Path)
#Read in the data
Seasons = pd.read_csv("seasons.csv")
Dep_Sec_Key = pd.read_csv("DepSecKey.csv")
These files import without any issues but when I execute the following:
UOM = pd.read_csv("FINAL_UOM.csv")
Nothing shows in the variable explorer panel and I get this in the IPython console:
In [3]: UOM = pd.read_csv("FINAL_UOM.csv")
If I use the Import Data icon and use the wizard selecting DataFrame on the preview tab it works fine.
The same file imports into R with the same kind of command, so I don't know what I am doing wrong. Is there any way to see what code was generated by the wizard so I can compare it to mine?
Turns out the data had imported; it just wasn't showing in the variable explorer.
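An added note, not part of the original answer: a quick way to confirm the read succeeded independently of the variable explorer is to inspect the frame in the IPython console.

UOM = pd.read_csv("FINAL_UOM.csv")
print(UOM.shape)   # (rows, columns) shows the frame is populated
print(UOM.head())  # first few rows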
