In Python 3, I have a PDF file "Ativos_Fevereiro_2018_servidores_rj.pdf" with 6,041 pages. I'm on a machine running Ubuntu.
At the top of each page there are two lines of text, and below them a table with a header and two columns. Each table has 36 rows, fewer on the last page.
At the end of each page, after the table, there is also a line of text.
I want to create a CSV from this PDF, considering only the tables on the pages and ignoring the text before and after them.
Initially I tested tabula-py, but it generates an empty file:
from tabula import convert_into
convert_into("Ativos_Fevereiro_2018_servidores_rj.pdf", "test_s.csv", output_format="csv")
Please, does anyone know another way to use tabula-py for this kind of task?
Or another way to convert a PDF like this to CSV?
Ok, I've found the issue: you have to set spreadsheet=True and keep utf-8 encoding:
df = tabula.read_pdf("Ativos_Fevereiro_2018_servidores_rj.pdf", encoding='utf-8', spreadsheet=True, pages='1-6041')
I tested it with just the first page first, because your file is huge.
You can save the DataFrame as csv afterwards:
df.to_csv('output.csv', encoding='utf-8')
Edit:
OK, the error could be a Java memory issue. To make it faster I added the pages option, and there was also an encoding problem, so I added encoding='utf-8' to the CSV export.
If you keep running into the Java error, try parsing it in chunks, e.g. pages='1-300'. I just did all 6,041 pages (on a 64 GB RAM machine) and it worked fine.
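If you do need to chunk it, a minimal sketch (the chunk size of 300 pages and the output file name are my own choices; it assumes read_pdf returns a single DataFrame per call, as in the snippet above):
import pandas as pd
import tabula

chunks = []
for start in range(1, 6042, 300):
    end = min(start + 299, 6041)
    # Parse one block of pages at a time to keep Java memory usage down
    chunk = tabula.read_pdf("Ativos_Fevereiro_2018_servidores_rj.pdf",
                            encoding='utf-8', spreadsheet=True,
                            pages='{}-{}'.format(start, end))
    chunks.append(chunk)

pd.concat(chunks).to_csv('output.csv', index=False, encoding='utf-8')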
Converting PDF to CSV with tabula-py
from tabula import convert_into
table_file = r"ActualPathtoPDF"
output_csv = r"DestinationDirectory/file.csv"
# convert_into writes the CSV directly and returns None, so there is no DataFrame to assign
convert_into(table_file, output_csv, output_format='csv', lattice=True, stream=False, pages="all")
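If you want the tables as DataFrames to post-process before saving, a hedged sketch with the same path and options (lattice=True makes Tabula detect cells from the table's ruling lines, stream=False disables the whitespace-based mode; the per-table output names are my own choice):
from tabula import read_pdf

tables = read_pdf(table_file, lattice=True, stream=False, pages="all", multiple_tables=True)
for i, df in enumerate(tables):
    df.to_csv("table_{}.csv".format(i), index=False, encoding="utf-8")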
Related
I'm trying to extract tables from a PDF that contains a long list of media source names. The desired output is a comprehensive CSV file with a single column listing all the sources.
I'm trying to write a simple Python script to extract the table data from the PDF. The best output I've been able to reach is one CSV per table, which I then merge with pandas' concat function.
The result is messy: I have redundant punctuation and a lot of spaces in the file.
Can somebody help me reach a better result?
Code:
from camelot import read_pdf
import glob
import os
import pandas as pd
import numpy as np

# Get all the tables within the file
all_tables = read_pdf("/Users/zago/code/pdftext/pdftextvenv/mimesiweb.pdf", pages='all')

# Show the total number of tables in the file
print("Total number of tables: {}".format(all_tables.n))

# Print all the tables in the file
for t in range(all_tables.n):
    print("Table n°{}".format(t))
    print(all_tables[t].df.head())

# Convert to excel or csv
#all_tables.export('table.xlsx', f="excel")
all_tables.export('table.csv', f="csv")

# Collect the per-table CSVs that were just exported
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# Combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, encoding='utf-8', sep=',') for f in all_filenames])

# Export to csv
combined_csv.to_csv("combined_csv_tables.csv", index=False, encoding="utf-8")
Starting point PDF
Result for 1 csv
Combined csv
Thanks
Select only the first column before concatenating and then save.
Just use this line of code:
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',').iloc[:,0] for f in all_filenames ])
Output:
In [25]: combined_csv
Out[25]:
0 Interni.it
1 Intima.fr
2 Intimo Retail
3 Intimoda Magazine - En
4 Intorno Tirano.it
...
47 Alessandria Oggi
48 Aleteia.org
49 Alibi Online
50 Alimentando
51 All About Italy.net
Length: 2824, dtype: object
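To finish the "then save" step, a minimal sketch (the output file name and column label are my own choices; combined_csv is the Series built by the one-liner above):
combined_csv.to_csv("combined_sources.csv", index=False, header=["source"], encoding="utf-8")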
And final csv output:
There are oddities to beware of when using CSV format.
A PDF page generally stores its text as a single column running from page edge to page edge, unless the page is tagged with multi-column areas. (That is one reason data extractors are needed to split the text into tabular form.)
For a single column of text, standard CSV output/input is no different from plain text, except that ONE entry here contains a comma (no other entry does), so if the above PDF is imported/exported to Excel it will still look like a single column.
Thus the only commands needed are to export with pdftotext, wrap each line in quotes, and rename the result to .csv.
HOWEVER, see the comment after the commands.
pdftotext -layout -nopgbrk -enc UTF-8 mimesiweb.pdf
for /f "tokens=*" %t in (mimesiweb.txt) do echo "%t" >>mimesiweb.csv
This correctly generates the desired output for opening in Excel from the command line.
The UTF-8 characters are output correctly in the .csv, but my old Excel always loses them on import (e.g. Accènto comes through with its accent garbled), even if I use CHCP 65001 (Unicode) to invoke it.
Here, exported as UTF8.csv, the file reads with its accents in Notepad with no issue, but re-imported as UTF8.csv the symbology is lost! Newer Excel may fare better?
So that is a failing in Excel; without a better Excel import, the workaround for me would be to simply cut and paste the 2880 lines of text so that those accents are preserved. The alternative is to import the text into LibreOffice, which supports UTF-8.
However, remember to either UNCHECK the comma delimiter OR quote "each, line" as I did earlier, because of the one entry with an existing comma, otherwise a second column is generated. :-)
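If the Windows for-loop is not available, a minimal sketch of the same quoting step in Python (file names taken from the commands above; utf-8-sig writes a BOM, which helps Excel pick the right encoding and so also addresses the accent problem described above):
with open("mimesiweb.txt", encoding="utf-8") as src, open("mimesiweb.csv", "w", encoding="utf-8-sig") as dst:
    for line in src:
        # Quote every line so the one entry containing a comma stays in a single column
        dst.write('"{}"\n'.format(line.rstrip("\n")))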
I've found pdfplumber simpler to use for such tasks.
import pdfplumber

pdf = pdfplumber.open("Downloads/mimesiweb.pdf")
rows = []
for page in pdf.pages:
    rows.extend(page.extract_text().splitlines())
>>> len(rows)
2881
>>> rows[:3]
['WEB', 'Ultimo aggiornamento: 03 06 2020', '01net.it']
>>> rows[-3:]
['Zoneombratv.it', 'Zoom 24', 'ZOOMsud.it']
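A minimal sketch for turning those lines into the one-column CSV asked for in the question (the file name and header label are my own choices):
import csv

with open("sources.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["source"])
    for row in rows:
        writer.writerow([row])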
I am using Python to extract Arabic tweets from Twitter and save them as a CSV file, but when I open the saved file in Excel the Arabic text displays as symbols. However, in Python, Notepad, or Word it looks fine.
May I know where the problem is?
This is a problem I face frequently with Microsoft Excel when opening CSV files that contain Arabic characters. Try the following workaround that I tested on latest versions of Microsoft Excel on both Windows and MacOS:
Open Excel on a blank workbook
Within the Data tab, click the From Text button (if it is not activated, make sure an empty cell is selected)
Browse and select the CSV file
In the Text Import Wizard, change the File origin to "Unicode (UTF-8)"
Go next and, from the Delimiters, select the delimiter used in your file, e.g. comma
Finish and select where to import the data
The Arabic characters should show correctly.
Just use encoding='utf-8-sig' instead of encoding='utf-8' as follows:
import csv

data = u"اردو"
with open('example.csv', 'w', encoding='utf-8-sig') as fh:
    writer = csv.writer(fh)
    writer.writerow([data])
It worked on my machine.
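If you build the CSV with pandas instead of the csv module, the same encoding applies; a small sketch (the DataFrame contents are just an example):
import pandas as pd

df = pd.DataFrame({"text": [u"اردو"]})
df.to_csv("example.csv", index=False, encoding="utf-8-sig")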
The only solution that I've found to save Arabic into an Excel file from Python is to use pandas and save to the xlsx extension instead of csv; xlsx seems a million times better here. Here's the code I've put together, which worked for me:
import pandas as pd

def turn_into_csv(data, csver):
    ids = []
    texts = []
    for each in data:
        texts.append(each["full_text"])
        ids.append(str(each["id"]))

    df = pd.DataFrame({'ID': ids, 'FULL_TEXT': texts})

    writer = pd.ExcelWriter(csver + '.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1', encoding="utf-8-sig")

    # Close the Pandas Excel writer and output the Excel file.
    writer.save()
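A hypothetical call, assuming data is a list of tweet dicts with "id" and "full_text" keys as in the function above:
tweets = [{"id": 1, "full_text": u"تجربة"}, {"id": 2, "full_text": u"مثال"}]
turn_into_csv(tweets, "tweets")  # writes tweets.xlsx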
Fastest way, after saving the file to .csv from Python:
open the .csv file using Notepad++
from the Encoding drop-down menu choose UTF-8-BOM
click Save As and save under the same name with the .csv extension (e.g. data.csv), keeping the file type as it is (.txt)
re-open the file with Microsoft Excel.
Excel is known to have an awful CSV import system. Long story short: if, on the same system, you import a CSV file that you have just exported, it will work smoothly. Otherwise, the CSV file is expected to use the Windows system encoding and delimiter.
A rather awkward but robust approach is to use LibreOffice or Oracle OpenOffice. Both fall far behind Excel on most features, but not on CSV handling: they let you specify the delimiters and optional quoting characters along with the encoding of the CSV file, and you will be able to save the resulting file as xlsx.
Although my CSV file's encoding was already UTF-8, explicitly re-saving it with Notepad resolved the issue.
Steps:
Open your CSV file in Notepad.
Click File --> Save as...
In the "Encoding" drop-down, select UTF-8.
Rename your file using the .csv extension.
Click Save.
Reopen the file with Excel.
I am using the cmis package available in Python to download a document from a FileNet repository. I am using the getContentStream method available in the package. However, it returns content that begins with 'PK' and ends with 'PK'. When I googled this, I learned it is the content of a zipped Excel package. Is there a way to save the content into an Excel file so that I can open the downloaded spreadsheet? I am using the code below, but I get "a bytes-like object is required, not 'str'". I noticed the type of result is StringIO.
# expport the result
result = testDoc.getContentStream()
outfile = open('sample.xlsx', 'wb')
outfile.write(result.read())
result.close()
outfile.close()
Hi there and welcome to Stack Overflow. There are a few bits I noticed about your post.
To answer the error you are getting directly: you opened the outfile stream in binary mode, but result.read() returns a Unicode string, which is why you are getting this error. You can try to encode it before passing it to the outfile.write() function (e.g. outfile.write(result.read().encode())).
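A minimal sketch of that encode() route (testDoc and the target file name come from the question; whether the re-encoded bytes reproduce a valid .xlsx depends on how the stream was decoded in the first place, so treat this as an assumption):
result = testDoc.getContentStream()
with open('sample.xlsx', 'wb') as outfile:
    # Re-encode the text back to bytes before writing to the binary file handle
    outfile.write(result.read().encode())
result.close()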
You can also simply write the Unicode string into a zip entry directly:
result = testDoc.getContentStream()
result_text = result.read()
from zipfile import ZipFile
with ZipFile(filepath, 'w') as zf:
    zf.writestr('filename_that_is_zipped', result_text)
Note, I am not sure what you have in your ContentStream, but an Excel file is made up of XML files zipped up. The minimum file structure you need for an Excel file is as follows:
_rels/.rels contains excel schemas
docProps/app.xml contains number of sheets and sheet names
docProps/core.xml boilerplate user info and date created
xl/workbook.xml contains sheet names and the rdId-to-workbook link
xl/worksheets/sheet1.xml (and more sheets in this folder) contains cell data for each sheet
xl/_rels/workbook.xml.rels contains sheet file locations within the zipfile
xl/sharedStrings.xml if you have string-only cell values
[Content_Types].xml applies schemas to file types
I recently went through piecing together an excel file from scratch, if you want to see the code check out https://github.com/PydPiper/pylightxl
If only one table is present in a PDF file, it can simply be extracted using this code:
from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf")
But if there is more than one table present in a PDF file I am unable to extract those tables because it's only extracting the first one.
Hi there! Hope the code below is helpful; I haven't tested it with large tables yet. Let me know if there is any scenario that could break this code, as I'm new to Python and want to improve my knowledge :)
import os
from tabula import wrapper

os.chdir("E:/Documents/myPy/")
tables = wrapper.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', spreadsheet=True)

i = 1
for table in tables:
    table.columns = table.iloc[0]
    table = table.reindex(table.index.drop(0)).reset_index(drop=True)
    table.columns.name = None

    # To write Excel
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)

    # To write CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)

    i = i + 1
Even when using the tabula-py wrapper, you can use all the same options found in the Tabula Java docs.
In your case you can simply add pages = "all":
from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf", pages = "all")
If your PDF has multiple tables, you can use the multiple_tables=True option.
If the tables have the same structure (i.e. the same table layout and the same relative position) on all pages of the PDF, then you can set pages='all' to get the correct result.
If not, you may need to iterate over the pages to parse the PDF.
There is documentation that explains this in detail.
Using the multiple_tables=True parameter in read_pdf will solve the issue.
Example:
from tabula import wrapper
df = wrapper.read_pdf("sample.pdf",multiple_tables=True)
read_pdf is now in wrapper, so we need to import that and use it as shown above.
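Each element of df is then a DataFrame, so the tables can be saved one by one; a short sketch (the output file names are my own choice):
for i, table in enumerate(df):
    table.to_csv("table_{}.csv".format(i), index=False, encoding="utf-8")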
For an in-class assignment, I am attempting to load a CSV file into a DataFrame in Python using a Jupyter notebook.
Below is my attempt. I have defined the columns as follows:
gnacs_y = "id|postedTime|body|None1|['twitter_entiteis:urls:url']|None2|['actor:languages_list-items']|gnip:language:value|twitter_lang|[u'geo:coordinates_list-items']|geo:type|None3|None4|None5|None6|actor:utcOffset|None7|None8|None9|None10|None11|None12|None13|None14|None15|actor:displayName|actor:preferredUsername|actor:id|gnip:klout_score|actor:followersCount|actor:friendsCount|actor:listedCount|actor:statusesCount|Tweet|None16|None17|None18"
colnames = gnacs_y.split('|')
Then I have the following:
df_3 = pd.read_csv('../data/twitter_sample.csv', sep='|', names=colnames)
df_3.tail(10)
However, when the data gets loaded I see only the ID column, filled with what looks like HTML text, and all the other columns are NaN, even though there is data in the .csv file. I have attached screenshots of what I see in the Jupyter notebook and of the content of the CSV file. I am not sure if I messed up the initial declaration of the column names under gnacs_y.
Link to the CSV file for the assignment:
https://github.com/terratenney/yorkBigData/blob/master/assignments/data/twitter_sample.csv
Any help would be greatly appreciated
Your file is not a CSV file; it's an HTML file that has tables in it. If your assignment says you should have a CSV file, have you checked that you downloaded the right file?
EDIT: It looks like you messed up saving your file. If you click the Raw button on GitHub and download that, it will give you the correct CSV file without the HTML.
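A short sketch of reading the raw file directly instead (the raw URL is the standard GitHub transformation of the blob link above, and colnames is the list defined in the question):
import pandas as pd

raw_url = ("https://raw.githubusercontent.com/terratenney/yorkBigData/"
           "master/assignments/data/twitter_sample.csv")
df_3 = pd.read_csv(raw_url, sep='|', names=colnames)
df_3.tail(10)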