So basically I have a simple program where I'm reading a tab-delimited file (it's big, around 1M rows) and I'm getting an error like the example below. In the third row, because of a space after "Jan Verheijen", the tab that follows isn't registered and the date value isn't put into the next column. This happens on some random rows in the files.
Can it be fixed with the code or does it have to be fixed directly in the tsv files?
Thank you so much.
This is the code I'm using.
df = pd.read_csv(data, sep='\t', header=None, index_col=None, encoding='unicode_escape')
Language     Translated By     Date         Version
Portuguese   Paulo Guzmán      25/04/2019   2.41
Czech        Čampulka Jiří     11/06/2014   1.96
Dutch        Jan Verheijen 17/07/2022 2.60
French       skorpix38         28/02/2018   2.39
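It can be fixed in code after loading: rows where the tab was swallowed end up with the date and version stuck inside the translator column and NaN in the remaining columns, so they can be detected and split. A sketch (the `repair_row` helper is hypothetical, and it assumes the lost tab always sits between the last three fields):

```python
import pandas as pd

def repair_row(row):
    # When the tab after the translator name is lost, the date and version
    # end up inside column 1 and columns 2 and 3 are left empty.
    if pd.isna(row[2]) and pd.isna(row[3]):
        row[1], row[2], row[3] = row[1].rsplit(maxsplit=2)
    return row

# minimal reproduction of the broken Dutch row
df = pd.DataFrame([
    ['Portuguese', 'Paulo Guzmán', '25/04/2019', '2.41'],
    ['Dutch', 'Jan Verheijen 17/07/2022 2.60', None, None],
])
df = df.apply(repair_row, axis=1)
print(df.iloc[1].tolist())  # ['Dutch', 'Jan Verheijen', '17/07/2022', '2.60']
```

Note that rsplit also absorbs the stray extra space after the name, since consecutive whitespace is treated as one separator.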
I am trying to extract tables from this PDF link using camelot. However, when I try the following code:
import camelot
file = 'relacao_medicamentos_rename_2020.pdf'
tables = camelot.read_pdf(file)
tables.export('relacao_medicamentos_rename_2020.csv', f='csv', compress=False)
Simply nothing happens. This is very strange, because when I try the same code with this other PDF link it works very well.
As Stefano suggested, you need to specify the relevant pages and set the option flavor='stream'. The default flavor='lattice' only works if there are lines between the cells.
Additionally, increasing row_tol helps to group rows together, so that, for example, the header of the first table is read as one row rather than as three separate rows; specifically, 'Concentração/Composição' is identified as coherent text.
You may also want to use strip_text='\n' to remove newline characters.
This results in (reading page 17 and 18 as an example):
import camelot
file = 'relacao_medicamentos_rename_2020.pdf'
tables = camelot.read_pdf(file, pages='17, 18', flavor='stream', row_tol=20, strip_text='\n')
tables.export('foo.csv', f='csv', compress=False)
Still, this way you end up with one table per page and one csv file per table. I.e. in the example above you get two .csv files. This needs to be handled outside camelot.
To merge tables spanning multiple pages using pandas:
import pandas as pd

dfs = []  # list to store dataframes
for table in tables:
    df = table.df
    df.columns = df.iloc[0]  # use first row as header
    df = df[1:]              # remove the header row from the data
    dfs.append(df)

df = pd.concat(dfs, axis=0)  # concatenate all dataframes in the list
df.to_csv('foo.csv')         # export dataframe to csv
Also, there are difficulties identifying table areas on pages containing both text and tables (e.g. pdf page 16).
In these cases the table area can be specified. For the table on page 16, this would be:
tables = camelot.read_pdf(file, pages='16', flavor='stream', row_tol=20, strip_text='\n', table_areas=['35,420,380,65'])
Note: Throughout the post I referenced pages by 'counting' the pages of the file and not by the page numbers printed on each page (the latter one starts at the second page of the document).
Further to Stefano's comment, you need to specify both flavor='stream' and a page number. To get the number of pages, I used PyPDF2, which is installed as a dependency of camelot.
In addition, I also suppressed the "no tables found" warning (which is purely optional).
import camelot
import PyPDF2
import warnings

file = 'relacao_medicamentos_rename_2020.pdf'
reader = PyPDF2.PdfFileReader(file)
num_pages = reader.getNumPages()
warnings.simplefilter('ignore')

for page in range(1, num_pages + 1):
    tables = camelot.read_pdf(file, flavor='stream', pages=f'{page}')
    if tables.n > 0:
        tables.export(f'relacao_medicamentos_rename_2020.{page}.csv', f='csv', compress=False)
It is hard to tell why this works on one PDF file but not on the other. There are so many different ways PDFs are created and written by authoring software that one has to resort to trial and error.
Also, not all tables found are actual tables, and the results are a bit messy and need some cleaning.
Consider doing this straight away with something like:
tables = camelot.read_pdf(file, flavor='stream', pages=f'{page}')
for table in tables:
    df = table.df
    # clean data
    ...
    df.to_csv(...)
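For instance, a minimal cleaning pass (a sketch; `clean_table` is a hypothetical helper, and what it should do depends on how the extracted tables actually look) could drop empty rows and columns and strip stray whitespace:

```python
import pandas as pd

def clean_table(df):
    df = df.replace('', pd.NA)         # treat empty strings as missing
    df = df.dropna(how='all')          # drop fully empty rows
    df = df.dropna(axis=1, how='all')  # drop fully empty columns
    # strip stray whitespace from every remaining text cell
    return df.apply(lambda col: col.str.strip() if col.dtype == object else col)

raw = pd.DataFrame([[' dasabuvir ', ''], ['', '']])
print(clean_table(raw).iloc[0, 0])  # dasabuvir
```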
In many ways there is no such thing as a human-conventional "table" in a PDF. When converting from, say, a spreadsheet or word processor, the PDF writer needs to add tagging for audio readers (accessibility) and may need to store the page's block of data as one single object. Many PDF writers do not bother, since the printer is not disabled by its absence :-). Thus every PDF with tabular data can be built completely differently from any other application's tabular method.
Thus there is often no connection with a partial "table" continued on the next page (so "headings" will not be associated with "columns" on a secondary page). A PDF has no everyday constructs for "rows" or "columns"; in most cases those were neutered when the content was written to a blank canvas.
So when it comes to extraction, ignore the foreground impression of a table and break down the body text.
Take this section from page 189.
This is the order of that text placement (remember there are no such things as line feeds, only strings of fresh text). The gap I have shown is just to highlight, in human terms, the challenge of what an extractor should do with this seemingly random order: how could it know to insert those five lines after "comprimido"?
Exclusões
Medicamento
Situação clínica
Relatório de
Recomendação da
Conitec
Portaria SCTIE
dasabuvir* 250 mg
comprimido
Hepatite C
Relatório nº 429
– Ombitasvir 12,5
mg/ veruprevir 75
mg/ ritonavir 50
mg comprimido e
dasabuvir 250 mg
comprimido para
o tratamento da
hepatite C
Nº 44, de 10/9/2019
ombitasvir +
veruprevir +
ritonavir* 12,5 mg
+ 75 mg + 50 mg
comprimido
Thus the only way to deal with a challenge like this is to treat text proximity as block-like and position it exactly the way it is done in the PDF, as x columns and y rows, while allowing for overlaps in both directions, much like "spanning" or "cell merging". For that to work you need a free-flowing spreadsheet that is intelligent enough to self-adjust.
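The proximity idea can be sketched in a few lines: given the (x, y, text) placements an extractor sees, cluster words into rows by vertical distance and sort each row left to right. `words_to_rows` is a hypothetical helper; real pages need the tolerance tuned to the font size, plus column assignment on top:

```python
def words_to_rows(words, y_tol=3):
    """Group extracted (x, y, text) tuples into rows by vertical proximity,
    then order each row left to right."""
    rows = []
    for word in sorted(words, key=lambda w: w[1]):
        if rows and abs(rows[-1][0][1] - word[1]) <= y_tol:
            rows[-1].append(word)  # close enough vertically: same row
        else:
            rows.append([word])    # otherwise start a new row
    return [[w[2] for w in sorted(row, key=lambda w: w[0])] for row in rows]

# text placed out of order, as in the page 189 example
words = [(100, 50, 'Medicamento'), (10, 50, 'Exclusões'),
         (10, 70, 'dasabuvir'), (100, 71, 'Hepatite C')]
print(words_to_rows(words))
# [['Exclusões', 'Medicamento'], ['dasabuvir', 'Hepatite C']]
```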
I've been working on an image captioning project in Nepali. For the dataset, I tried to translate all the English captions of the Flickr8k dataset to Nepali. For this I'm using the Python translate package, as follows:
import pandas as pd
from translate import Translator

dataset = pd.read_csv('/content/gdrive/My Drive/out.csv', delimiter='\t')
dataset = dataset.drop('Unnamed: 0', axis=1)

def trans(x):
    translator = Translator(to_lang="ne")
    return translator.translate(x)

dataset['caption'] = dataset['caption'].apply(trans)
print('done')
But it only translated 130 rows of captions to Nepali; every text after that comes back as
MYMEMORY WARNING: YOU USED ALL AVAILABLE FREE TRANSLATIONS FOR TODAY. NEXT AVAILABLE IN 23 HOURS 24 MINUTES 38 SECONDSVISIT TO TRANSLATE MORE
Is there any way of translating all the texts at once?
I've tried googletrans too, but it also fails due to frequent requests to the API.
Note: the dataset contains 40,458 rows with English sentences in the caption column.
It would be a great help if there's any way to translate all the text. Thanks in advance :)
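For what it's worth, one Python-side mitigation (a sketch; `translate_fn` stands in for the Translator call from the question, and both the helper name and the checkpoint file name are arbitrary) is to checkpoint finished rows to disk, so each day's free quota extends previous progress instead of starting over:

```python
import pandas as pd

def translate_resumable(dataset, translate_fn, checkpoint='captions_ne.csv'):
    # load whatever was translated on earlier runs
    try:
        done = pd.read_csv(checkpoint, index_col=0)['caption_ne'].to_dict()
    except FileNotFoundError:
        done = {}
    for idx, text in dataset['caption'].items():
        if idx in done:
            continue                       # already translated earlier
        translated = translate_fn(text)
        if 'MYMEMORY WARNING' in translated:
            break                          # quota exhausted; resume tomorrow
        done[idx] = translated
    result = pd.Series(done, name='caption_ne')
    result.to_csv(checkpoint)              # persist progress for the next run
    return result
```

Each run skips rows already in the checkpoint file and stops as soon as the quota warning appears, so repeated daily runs eventually cover all 40,458 rows.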
Okay, I figured it out by myself. Use Google Sheets: import your CSV file, add a column whose header is the target language's name, and use the formula =GOOGLETRANSLATE(cell_with_text, "source_language", "target_language").
For example: =GOOGLETRANSLATE(A2, "en", "ne"). Now grab the corner of the cell (where the mouse pointer appears as a + sign), drag all the way down, and bingo, you can translate all the text in a column at once.
I tried to use the Python package tabula-py to read tables in a PDF. It seems that line breaks inside PDF table cells separate the contents of the original cell into multiple cells.
I have searched through all kinds of Python packages to solve this problem, and tabula-py seems to be the most stable package for converting PDF tables into pandas data. However, if this problem cannot be solved, I'll have to turn to an online service, which produces the ideal Excel output for me.
from tabula import read_pdf
df=read_pdf("C:/Users/Desktop/test.pdf", pages='all')
I expected the pdf table can be converted correctly with this.
Tabula no longer has 'spreadsheet' as an option. Instead, use the 'lattice' option to avoid the line breaks separating cell contents into new rows. Code like this:
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases (updated March 2018.pdf", pages='all', lattice=True)
print(df)
You can use the 'spreadsheet' option with the value True to omit the multiple rows of NaN values caused by line breaks.
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases (updated March 2018.pdf", pages='all', spreadsheet=True)
print(df)
#print(df['Active Moiety Name'])
#print(df['FDA Established Pharmacologic Class\r(EPC) Text Phrase\rPLR regulations require that the following\rstatement is included in the Highlights\rIndications and Usage heading if a drug is a\rmember of an EPC [see 21 CFR\r201.57(a)(6)]: “(Drug) is a (FDA EPC Text\rPhrase) indicated for [indication(s)].” For\reach listed active moiety, the associated\rFDA EPC text phrase is included in this\rdocument. For more information about how\rFDA determines the EPC Text Phrase, see\rthe 2009 "Determining EPC for Use in the\rHighlights" guidance and 2013 "Determining\rEPC for Use in the Highlights" MAPP\r7400.13.'])
Output:
1758 ziconotide N-type calcium channel antagonist
1759 zidovudine HIV nucleoside analog reverse transcriptase in...
1760 zileuton 5-lipoxygenase inhibitor
1761 zinc cation copper absorption inhibitor
1762 ziprasidone atypical antipsychotic
1763 zoledronic acid bisphosphonate
1764 zoledronic acid anhydrous bisphosphonate
1765 zolmitriptan serotonin 5-HT1B/1D receptor agonist (triptan)
1766 zolmitriptan serotonin 5-HT1B/1D receptor agonist (triptan)
1767 zolpidem gamma-aminobutyric acid (GABA) A agonist
1768 zonisamide antiepileptic drug (AED)
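As the commented-out column name above shows, tabula keeps the PDF's \r line breaks inside the header names; these can be normalised after reading. A sketch, with a hard-coded header standing in for tabula's output:

```python
import pandas as pd

# a hard-coded header standing in for what tabula returns
df = pd.DataFrame({'Active\rMoiety\rName': ['zolpidem']})

# replace the carriage returns tabula leaves inside header names
df.columns = df.columns.str.replace('\r', ' ', regex=False)
print(list(df.columns))  # ['Active Moiety Name']
```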
I would like to save some tables in a Word document to a CSV file, or Excel, it doesn't matter.
I tried readlines(), but it doesn't work and I don't know why.
Tables in word document are like this..
Name Age Gender
Alex 12 F
Willy 14 M
.
.
.
However, I would like to save this table in a single row. I mean, I would like to save it in the CSV or Excel file as
Alex 12 F Willy 14 M ....
import win32com.client

word = win32com.client.Dispatch('Word.Application')
f = word.Documents.Open('C:/3.doc')
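Once the cell text has been read out of the document (Word's COM object model exposes it via f.Tables(1).Cell(row, col).Range.Text, though each cell's text ends with control characters that need stripping), flattening everything into the single row shown above is plain Python. A sketch with the sample table hard-coded (`flatten_table` is a hypothetical helper):

```python
import csv
import io

def flatten_table(rows):
    """Flatten a table (a list of rows) into one CSV line, dropping the header."""
    flat = [cell for row in rows[1:] for cell in row]
    buf = io.StringIO()
    csv.writer(buf).writerow(flat)
    return buf.getvalue()

table = [['Name', 'Age', 'Gender'],
         ['Alex', '12', 'F'],
         ['Willy', '14', 'M']]
print(flatten_table(table).strip())  # Alex,12,F,Willy,14,M
```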
Have a look at www.ironpython.com: IronPython runs on .NET, so it has all the libraries needed to access the Microsoft world.
For your case, read this small tutorial about converting a .doc to a .txt file. It should be very useful for you:
http://www.ironpython.info/index.php/Converting_a_Word_document_to_Text