Problem extracting tabular data from a pdf - python

I'm trying to extract tables from a PDF that contains a long list of media source names. The desired output is a single, comprehensive CSV file with one column listing all the sources.
I wrote a simple Python script to extract the table data. What I manage to get is one CSV per table, which I then merge with pandas' concat function.
The result is messy: it contains redundant punctuation and a lot of stray spaces.
Can somebody help me reach a better result?
Code:
from camelot import read_pdf
import glob
import os
import pandas as pd
import numpy as np
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# Get all the tables within the file
all_tables = read_pdf("/Users/zago/code/pdftext/pdftextvenv/mimesiweb.pdf", pages = 'all')
# Show the total number of tables in the file
print("Total number of table: {}".format(all_tables.n))
# print all the tables in the file
for t in range(all_tables.n):
    print("Table n°{}".format(t))
    print((all_tables[t].df).head())
#convert to excel or csv
#all_tables.export('table.xlsx', f="excel")
all_tables.export('table.csv', f="csv")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',') for f in all_filenames ])
#export to csv
combined_csv.to_csv("combined_csv_tables.csv", index=False, encoding="utf-8")
(Screenshots: the starting-point PDF, the result for one CSV, and the combined CSV.)
Thanks

Select only the first column before concatenating and then save.
Just use this line of code:
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',').iloc[:,0] for f in all_filenames ])
Output:
In [25]: combined_csv
Out[25]:
0 Interni.it
1 Intima.fr
2 Intimo Retail
3 Intimoda Magazine - En
4 Intorno Tirano.it
...
47 Alessandria Oggi
48 Aleteia.org
49 Alibi Online
50 Alimentando
51 All About Italy.net
Length: 2824, dtype: object
And final csv output:
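If the leftover punctuation and stray spaces the question mentions survive the first-column trick, a small cleanup pass can be added before saving. A minimal sketch, where the characters stripped and the output column name are my assumptions:
import glob
import pandas as pd

all_filenames = glob.glob('*.csv')

# Keep only the first column of every exported table
combined = pd.concat([pd.read_csv(f, encoding='utf-8').iloc[:, 0] for f in all_filenames])

# Drop empty cells, collapse repeated whitespace, trim stray punctuation, remove duplicates
cleaned = (combined.dropna()
                   .astype(str)
                   .str.replace(r'\s+', ' ', regex=True)
                   .str.strip(' .,;:')
                   .drop_duplicates())

cleaned.to_csv("combined_csv_tables_clean.csv", index=False, header=["source"], encoding="utf-8")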

There are oddities to beware of when using the CSV format.
PDF pages generally store their text as a single column running from edge to edge, unless the page areas are tagged as multi-column. (That is one reason data extractors are needed to split the text into tabular form.)
With a single column of text, a standard CSV output/input is no different from plain text, except that ONE entry in this list contains a comma (no other entry would need quoting in a CSV output), so if the above PDF is exported and imported into Excel it will appear as a single column.
Thus the only commands needed are to export with pdftotext, wrap each line in quotation marks, and rename the result to .csv.
HOWEVER, see the comment after the commands.
pdftotext -layout -nopgbrk -enc UTF-8 mimesiweb.pdf
for /f "tokens=*" %t in (mimesiweb.txt) do echo "%t" >>mimesiweb.csv
This will correctly generate the desired output, ready to open in Excel, from the command line.
The UTF-8 characters are written correctly to the text/CSV, but my old Excel always loses them on import (e.g. Accènto becomes AccÃ¨nto), even if I invoke it with CHCP 65001 (Unicode).
Here the file was exported as UTF8.csv (it reads accented correctly in Notepad) with no issue, but when it is re-imported as UTF8.csv the accents are lost! Newer Excel versions may fare better.
So that is a failing in Excel; without a better Excel import, the workaround for me would be to simply cut and paste the 2880 lines of text so that the accents are preserved. The alternative is to import the text into LibreOffice, which supports UTF-8.
However, remember to either UNCHECK the comma delimiter on import OR pad "each, line" with quotation marks as I did earlier for the CSV with the existing comma, otherwise a second column is generated. :-)
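For anyone who would rather stay in Python, the same two steps (run pdftotext, then quote every line and save as .csv) can be scripted. A sketch, assuming Poppler's pdftotext binary is on the PATH and that the text file is produced next to the PDF:
import csv
import subprocess

# Step 1: run Poppler's pdftotext exactly as in the command above
subprocess.run(["pdftotext", "-layout", "-nopgbrk", "-enc", "UTF-8", "mimesiweb.pdf"], check=True)

# Step 2: write every non-empty line out as one quoted CSV field per row
with open("mimesiweb.txt", encoding="utf-8") as txt, \
        open("mimesiweb.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for line in txt:
        line = line.strip()
        if line:
            writer.writerow([line])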

I've found pdfplumber simpler to use for such tasks.
import pdfplumber

pdf = pdfplumber.open("Downloads/mimesiweb.pdf")
rows = []
for page in pdf.pages:
    rows.extend(page.extract_text().splitlines())
>>> len(rows)
2881
>>> rows[:3]
['WEB', 'Ultimo aggiornamento: 03 06 2020', '01net.it']
>>> rows[-3:]
['Zoneombratv.it', 'Zoom 24', 'ZOOMsud.it']
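To get from those rows to the single-column CSV the question asks for, the per-page header lines would still need to be dropped. A rough sketch, where the strings to skip ("WEB" and the "Ultimo aggiornamento" line) are taken from the sample output above:
import csv

with open("sources.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["source"])
    for row in rows:
        row = row.strip()
        # skip page headings that are not media sources
        if row and row != "WEB" and not row.startswith("Ultimo aggiornamento"):
            writer.writerow([row])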

Related

Combine csv files with identical columns and unescape html code

Dear all, I often need to concat CSV files with identical headers (i.e. put them into one big file). Usually I just use pandas, but I now need to operate in an environment where I am not at liberty to install any library. The csv and the html libs do exist.
I also need to remove all remaining HTML-encoded symbols, like &amp; for the ampersand, which are still present in the data. I know in which columns these can come up.
I thought about doing it like this, and the concat part of my code seems to work fine:
import csv
import html

for file in files:  # files is a list of csv files
    with open(file, "rt", encoding="utf-8-sig") as source, open(outfilePath, "at", newline='',
              encoding='utf-8-sig') as result:
        d_reader = csv.DictReader(source, delimiter=";")
        # Set header based on first file in file list:
        if file == files[0]:
            Common_header = d_reader.fieldnames
        # Define DictWriter object
        wtr = csv.DictWriter(result, fieldnames=Common_header, lineterminator='\n', delimiter=";")
        # Write header only once to empty file
        if result.tell() == 0:
            wtr.writeheader()
        # If I remove this block I get my concatenated single csv file as a result.
        # However the html tags/encoded symbols are still present.
        for row in d_reader:
            print(html.unescape(row['ColA']))  # This prints the unescaped values in the column correctly
            # If I keep these two lines, I get an empty file with just the header as a result of the concatenation
            row['ColA'] = html.unescape(row['ColA'])
            row['ColB'] = html.unescape(row['ColB'])
        wtr.writerows(d_reader)
I would have thought that simply supplying encoding='utf-8-sig' for the result file would be sufficient to get rid of the HTML symbols, but that does not work. If you could give me a hint about what I am doing wrong in the part of my code that uses html.unescape, that would be nice.
Thank you in advance
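Not having the environment at hand, one possible arrangement (only a sketch, reusing the file, outfilePath, ColA and ColB names from the question) is to write each row inside the loop right after unescaping it, since the DictReader is exhausted once the for loop has run, which is why calling writerows(d_reader) afterwards produces nothing:
import csv
import html

with open(file, "rt", encoding="utf-8-sig") as source, \
        open(outfilePath, "at", newline="", encoding="utf-8-sig") as result:
    d_reader = csv.DictReader(source, delimiter=";")
    wtr = csv.DictWriter(result, fieldnames=d_reader.fieldnames, lineterminator="\n", delimiter=";")
    if result.tell() == 0:
        wtr.writeheader()
    for row in d_reader:
        # unescape the known columns, then write the row immediately
        row["ColA"] = html.unescape(row["ColA"])
        row["ColB"] = html.unescape(row["ColB"])
        wtr.writerow(row)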

Problem in storing dataframe to csv format

I am currently using the below code to web scrape data and then store it in a CSV file.
from bs4 import BeautifulSoup
import requests

url = 'https://www.business-standard.com/rss/companies-101.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')

news_items = []
for item in soup.findAll('item'):
    news_item = {}
    news_item['title'] = item.title.text
    news_item['excerpt'] = item.description.text
    print(item.link.text)
    s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')
    news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_item['Category'] = 'Company'
    news_items.append(news_item)

import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('company_data.csv', index=False)
When displaying the data frame, the results look fine, as shown in the attached screenshot.
But when opening the CSV file, the columns are not as expected (see the second screenshot). Can anyone tell me the reason?
The issue is that your data contains commas and the default separator for to_csv is ",", so each comma in your data set is treated as a separate column.
If you use df.to_excel('company_data.xlsx', index=False) you won't have this issue, since the output is not comma separated.
A CSV file is not an Excel file, despite what Microsoft pretends; it is a text file containing records and fields. The records are separated by \r\n sequences and the fields are separated by a delimiter, normally the comma. Fields can contain newlines or delimiters provided they are enclosed in quotation marks.
But Excel is known to have a very poor CSV handling module. More exactly, it can read what it has written, or what is formatted the way it would have written it. To be locale friendly, MS decided to use the locale decimal separator (which is the comma in some West European locales) and, in that case, another field separator (the semicolon when the comma is the decimal separator). As a result, using Excel to read CSV files produced by other tools is a nightmare, with possible workarounds like changing the separator when writing the CSV file, but no clean way. LibreOffice is far behind MS Office for most features except CSV handling, so my advice is to avoid using Excel for CSV files and use LibreOffice Calc instead.
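If Excel must stay in the workflow, the separator workaround mentioned above can be applied when writing the file. A minimal sketch (utf-8-sig is my assumption, added so that Excel detects the encoding via the BOM):
# Semicolon separator plus a BOM tends to open cleanly in West European Excel locales
df.to_csv('company_data.csv', index=False, sep=';', encoding='utf-8-sig')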

Multiple txt files as separate rows in a csv file without breaking into lines (in pandas dataframe)

I have many txt files (which have been converted from pdf) in a folder. I want to create a csv/excel dataset where each text file will become a row. Right now I am opening the files in pandas dataframe and then trying to save it to a csv file. When I print the dataframe, I get one row per txt file. However, when saving to csv file, the texts get broken and create multiple rows/lines for each txt file rather than just one row. Do you know how I can solve this problem? Any help would be highly appreciated. Thank you.
Following is the code I am using now.
import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))

corpus = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())

df = pd.DataFrame({'col': corpus})
print(df)
df.to_csv('K:\\out.csv')
Update
If this solution is not possible, it would also be helpful to transform the data a bit in the pandas dataframe. I want to create a column with the names of the txt files, that is, the name of each txt file in the folder will become the identifier of the respective text. I will then save it in TSV format so that the lines do not get separated because of commas, as suggested by someone here.
I need something like the following.
identifier col
txt1 example text in this file
txt2 second example text in this file
...
txtn final example text in this file
Use
import csv
df.to_csv('K:\\out.csv', quoting=csv.QUOTE_ALL)
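For the update (a file-name identifier column, saved as TSV), a possible sketch along the lines of the original loop; the column names and output path mirror the question, the rest is an assumption:
import csv
import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join("K:\\text_all", "*.txt"))

corpus, identifiers = [], []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())
        # file name without extension becomes the identifier
        identifiers.append(os.path.splitext(os.path.basename(file_path))[0])

df = pd.DataFrame({'identifier': identifiers, 'col': corpus})
# Tab-separated output; QUOTE_ALL keeps multi-line texts inside a single cell
df.to_csv('K:\\out.tsv', sep='\t', index=False, quoting=csv.QUOTE_ALL)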

How to search for a combination of keywords in a text-file, extract lines above and below, and then export to Excel using pandas

I am trying to extract 5 lines before and after a specific combination of keywords from several SEC 10-K filings and then export that data into Excel so that I can then further process it manually.
Unfortunately I have to rely on the .txt format filings rather than the .html or .xbrl ones, because the latter are not always available. I already downloaded and partially cleaned the .txt files to remove unneeded tags.
In short, my goal is to tell Python to loop through the downloaded .txt files (e.g. all those in the same folder, or simply by providing a reference .txt list with all the file names), open each one, look for the word "cumulative effect" (ideally combined with other keywords, see code below), extract 5 lines before and after it, and then export the output to an Excel file with the file name in column A and the extracted paragraph in column B.
Using this code I managed to extract 5 lines above and below the keyword "cumulative effect" for one .txt file (which you can find here, for reference).
However I am still struggling with automating/looping the whole process and exporting the extracted text to Excel using pandas.
import collections
import itertools
import sys
from pandas import DataFrame

filing = '0000950123-94-002010_1.txt'

#with open(filing, 'r') as f:
with open(filing, 'r', encoding='utf-8', errors='replace') as f:
    before = collections.deque(maxlen=5)
    for line in f:
        if ('cumulative effect' in line or 'Cumulative effect' in line) and ('accounting change' in line or 'adoption' in line or 'adopted' in line or 'charge' in line):
            sys.stdout.writelines(before)
            sys.stdout.write(line)
            sys.stdout.writelines(itertools.islice(f, 5))
            break
        before.append(line)

findings = {'Filing': [filing],
            'Extracted_paragraph': [line]
            }
df = DataFrame(findings, columns=['Filing', 'Extracted_paragraph'])

export_excel = df.to_excel(r'/Users/myname/PYTHON/output.xlsx', index=None, header=True)
print(df)
Using this code I obtain the paragraph I need, but I only managed to export to Excel the single line that contains the keyword, not the entire text.
The screenshots show the Python output and the text exported to Excel.
How do I go about creating the loop and properly exporting the entire paragraph of interest into Excel?
Thanks a lot in advance!
I believe your basic error was in
'Extracted_paragraph': [line]
which should have been
'Extracted_paragraph': [before]
So with some simplifying changes, the main section of your code should look like this:
with open(filing, 'r', encoding='utf-8', errors='replace') as f:
    before = collections.deque(maxlen=5)
    for line in f:
        if ('cumulative effect' in line or 'Cumulative effect' in line) and ('accounting change' in line or 'adoption' in line or 'adopted' in line or 'charge' in line):
            break
        before.append(line)

before = ''.join(before)

findings = {'Filing': [filing],
            'Extracted_paragraph': [before]
            }
df = DataFrame(findings, columns=['Filing', 'Extracted_paragraph'])
And then continue from there to export to Excel, etc.
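A possible way to continue from there is to wrap the same logic in a loop over all filings and collect one row per file before the export. A sketch, where the glob pattern for the folder is an assumption:
import collections
import glob
import itertools
from pandas import DataFrame

results = []
for filing in glob.glob('*.txt'):  # assumes the filings sit in the current folder
    with open(filing, 'r', encoding='utf-8', errors='replace') as f:
        before = collections.deque(maxlen=5)
        paragraph = None
        for line in f:
            if ('cumulative effect' in line or 'Cumulative effect' in line) and \
               ('accounting change' in line or 'adoption' in line or 'adopted' in line or 'charge' in line):
                after = list(itertools.islice(f, 5))  # the 5 lines following the match
                paragraph = ''.join(before) + line + ''.join(after)
                break
            before.append(line)
    if paragraph:
        results.append({'Filing': filing, 'Extracted_paragraph': paragraph})

df = DataFrame(results, columns=['Filing', 'Extracted_paragraph'])
df.to_excel(r'/Users/myname/PYTHON/output.xlsx', index=False, header=True)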

How to convert PDF to CSV with tabula-py?

In Python 3, I have a PDF file "Ativos_Fevereiro_2018_servidores_rj.pdf" with 6,041 pages. I'm on a machine running Ubuntu.
On each page there are two lines of text at the top. Below that is a table with a header and two columns. Each table has 36 rows, fewer on the last page.
At the end of each page, after the table, there is also a line of text.
I want to create a CSV from this PDF, considering only the tables on the pages and ignoring the text before and after them.
Initially I tested tabula-py, but it generates an empty file:
from tabula import convert_into
convert_into("Ativos_Fevereiro_2018_servidores_rj.pdf", "test_s.csv", output_format="csv")
Please, does anyone know of another way to use tabula-py for this kind of task?
Or another way to convert this type of PDF to CSV?
Ok, I've found the issue: you have to set spreadsheet=True and keep utf-8 encoding:
df = tabula.read_pdf("Ativos_Fevereiro_2018_servidores_rj.pdf", encoding='utf-8', spreadsheet=True, pages='1-6041')
In the picture below I tested it with just the first page (because your file is huge):
You can save the DataFrame as csv afterwards:
df.to_csv('output.csv', encoding='utf-8')
Edit:
Ok, the error could be a Java memory issue. To make it faster I added the pages option. There was also an encoding problem, so encoding='utf-8' was added to the CSV export.
If you keep running into the Java error, try parsing the file in chunks, e.g. pages='1-300'. I just did all 6,041 pages (on a 64 GB RAM machine) and it worked fine.
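If the single 6,041-page call keeps hitting the Java memory limit, the pages option can also be used to parse the file in chunks and append each chunk to one CSV. A rough sketch that mirrors the snippet above (it assumes read_pdf returns a single DataFrame here, as in that snippet; newer tabula-py versions return a list and call the option lattice instead of spreadsheet):
import tabula

total_pages = 6041
chunk = 300

for start in range(1, total_pages + 1, chunk):
    end = min(start + chunk - 1, total_pages)
    # spreadsheet=True as in the answer above
    df = tabula.read_pdf("Ativos_Fevereiro_2018_servidores_rj.pdf",
                         encoding='utf-8', spreadsheet=True,
                         pages='{}-{}'.format(start, end))
    # append each chunk; write the header only for the first chunk
    df.to_csv('output.csv', mode='a', header=(start == 1), index=False, encoding='utf-8')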
Converting PDF to CSV with tabula-py
from tabula import convert_into

table_file = r"ActualPathtoPDF"
output_csv = r"DestinationDirectory/file.csv"
# convert_into writes the CSV directly; it does not return a DataFrame
convert_into(table_file, output_csv, output_format='csv', lattice=True, stream=False, pages="all")
