Let me try to explain to the best of my ability as I am not a Python wizard. I have read with PyPDF2 a PDF table of data regarding covid-19 in Mexico and tokenize it—long story, I tried doing it with tabula but did not get the format I was expecting and I was going to spend more time reformatting the CSV document I have gotten back than analyzing it—and have gotten a list of strings back with len of 16792 which is fine.
Now, the problem I am facing is that I need to format it in the appropriate way by concatenating some (not all) of those strings together so I can create a list of lists with the same length which is 9 columns.
This is an example of how it looks right now, the columns are Case number, State, Locality, Gender, Age, Date when symptoms started, Status, Type of contagion, Date of arrival to Mexico:
['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020'
What I would want is to get certain strings like 'ZONA', 'NORTE' as 'ZONA NORTE' or 'CIUDAD', 'DE', 'MEXICO' as 'CIUDAD DE MEXICO' or 'ESTADOS', 'UNIDOS' as 'ESTADOS UNIDOS'...
I seriously do not know how to tackle this. I have tried, split(), replace(), trying to find the index of each frequency, read all questions about manipulating lists, tried almost all the responses provided... and haven't been able to do it.
Any guidance, will be greatly appreciated. Sorry if this is a very basic question, but I know there has to be a way, I just don't know it.
Since the phrase split is not the same for each row, you have a way to "recognize" it. One way would be if 2 list item that are next to each other have more than 3 letter, then join them.
import re
row_list = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020']
words_longer_than_3 = r'([^\d\W]){3,}'
def is_a_word(text):
return bool(re.findall(words_longer_than_3, text))
def get_next_item(row_list, i):
try:
next_item = row_list[i+1]
except IndexError:
return
return is_a_word(next_item)
for i, item in enumerate(row_list):
item_is_a_word = is_a_word(row_list[i])
if not item_is_a_word:
continue
next_item_is_a_word = get_next_item(row_list, i)
while next_item_is_a_word:
row_list[i] += f' {row_list[i+1]}'
del row_list[i+1]
next_item_is_a_word = get_next_item(row_list, i)
print(row_list)
result:
['1', 'PUEBLA PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso Contacto', 'NA', '2', 'GUERRERO ZONA NORTE', 'M', '29', '15/03/2020', 'Sospechoso Contacto', 'NA', '3', 'BAJA CALIFORNIA TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso Estados Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso Italia', '03/03/2020', '5', 'JALISCO CENTRO GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso España', '17/03/2020']
I suppose that the data you want to process comes from a similar file like this one here, which contains 2623 rows x 8 columns.
You can load the data from the PDF file using tabula-py. You can install it through pip install tabula-py==1.3.0. One issue with extracting the table like this is that tabula-py can mess up things sometimes. For example, the header of the table was extracted from the PDF file like this:
"",,,,Identificación Fecha de,
"",,,,de COVID-19,Fecha del llegada a
N° Caso,Estado,Sexo,Edad,Inicio de Procedencia por RT-PCR en,México
"",,,,síntomas,
"",,,,tiempo real,
Pretty nasty huh?
In addition, tabula-py could not separate some of the columns, that is writing a comma in the right place so that the output CSV file would be well-parsed. For instance, the line with número de caso (case number) 8:
8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020
could be fixed by just replacing " Sospechoso " by ",Sospechoso,". And you are lucky becuase this is the only parse issue you will have to deal with for now. Thus, iterating through the lines of the output CSV file and replacing " Sospechoso " by ",Sospechoso," takes care of everything.
Finally, I added an option (removeaccents) for removing the accents from the data. This can help you avoid encoding issues in the future. For this you will need unicode: pip install unidecode.
Putting everything together, the code that reads the PDF file and converts it to a CSV file and loads it as a pandas dataframe is as follows (You can download the preprocessed CSV file here):
import tabula
from tabula import wrapper
import pandas as pd
import unidecode
"""
1. Adds the header.
2. Skips "corrupted" lines from tabula.
3. Replacing " Sospechoso " for ",Sospechoso," automatically separates
the previous column ("fecha_sintomas") from the next one ("procedencia").
4. Finally, elminiates accents to avoid encoding issues.
"""
def simplePrep(
input_path,
header = [
"numero_caso",
"estado",
"sexo",
"edad",
"fecha_sintomas",
"identification_tiempo_real",
"procedencia",
"fecha_llegada_mexico"
],
lookfor = " Sospechoso ",
replacewith = ",Sospechoso,",
output_path = "preoprocessed.csv",
skiprowsupto = 5,
removeaccents = True
):
fin = open(input_path, "rt")
fout = open(output_path, "wt")
fout.write(",".join(header) + "\n")
count = 0
for line in fin:
if count > skiprowsupto - 1:
if removeaccents:
fout.write(unidecode.unidecode(line.replace(lookfor, replacewith)))
else:
fout.write(line.replace(lookfor, replacewith))
count += 1
fin.close()
fout.close()
"""
Reads all the pdf pages specifying that the table spans multiple pages.
multiple_tables = True, otherwise the first row of each page will be missed.
"""
tabula.convert_into(
input_path = "data.pdf",
output_path = "output.csv",
output_format = "csv",
pages = 'all',
multiple_tables = True
)
simplePrep("output.csv", removeaccents = True)
# reads preprocess data set
df = pd.read_csv("preoprocessed.csv", header = 0)
# prints the first 5 samples in the dataframe
print(df.head(5))
I have a few Python dataframes in Pandas, I want to loop through them to find out which data frame meet my rows' criteria and save it in a new data frame.
d = {'Count' : ['10', '11', '12', '13','13.4','12.5']}
df_1= pd.DataFrame(data=d)
df_1
d = {'Count' : ['10', '-11', '-12', '13','16','2']}
df_2= pd.DataFrame(data=d)
df_2
Here is the logic I want to use, but it does not contain the right syntax,
for df in (df_1,df_2)
if df['Count'][0] >0 and df['Count'][1] >0 and df['Count'][2]>0 and df['Count'][3]>0
and (df['Count'][4] is between df['Count'][3]+0.5 and df['Count'][3]-0.5) is True:
df.save
The correct output is df_1... because it meets my condition. How do I create a new DataFrame or LIST to save the result as well?
Let me know if you have any questions in the comments. Main updates I made to your code was:
Replacing your chained indexing with .loc
Consolidating your first few separate and'd comparisons into a comparison on a slice of the series, reduced down to a single T/F with .all()
Code below:
import pandas as pd
# df_1 & df_2 input taken from you
d = {'Count' : ['10', '11', '12', '13','13.4','12.5']}
df_1= pd.DataFrame(data=d)
d = {'Count' : ['10', '-11', '-12', '13','16','2']}
df_2= pd.DataFrame(data=d)
# my solution here
df_1['Count'] = df_1['Count'].astype('float')
df_2['Count'] = df_2['Count'].astype('float')
my_dataframes = {'df_1': df_1, 'df_2': df_2}
good_dataframes = []
for df_name, df in my_dataframes.items():
if (df.loc[0:3, 'Count'] > 0).all() and (df.loc[3,'Count']-0.5 <= df.loc[4, 'Count'] <= df.loc[3, 'Count']+0.5):
good_dataframes.append(df_name)
good_dataframes_df = pd.DataFrame({'good': good_dataframes})
TEST:
>>> print(good_dataframes_df)
good
0 df_1
I have a csv file full of data, which is all type string. The file is called Identifiedλ.csv.
Here is some of the data from the csv file:
['Ref', 'Ion', 'ULevel', 'UConfig.', 'ULSJ', 'LLevel', 'LConfig.', 'LLSJ']
['12.132', 'Ne X', '4', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.132', 'Ne X', '3', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
What I would like to do is the read the file and search for a number in the column 'Ref', for example 12.846. And if the number I search matches a number in the file, print the whole row of that number .
eg. something like:
csv_g = csv.reader(open('Identifiedλ.csv', 'r'), delimiter=",")
for row in csv_g:
if 12.846 == (row[0]):
print (row)
And it would return (hopefully)
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
However this returns nothing and I think it's because the 'Ref' column is type string and the number I search is type float. I'm trying to find a way to convert the string to float but am seemingly stuck.
I've tried:
df = pd.read_csv('Identifiedλ.csv', dtype = {'Ref': np.float64,})
and
array = b = np.asarray(array,
dtype = np.float64, order = 'C')
but am confused on how to incorporate this with the rest of the search.
Any tips would be most helpful! Thank you!
Python has a function to convert strings to floats. For example, the following evaluates to True:
float('3.14')==3.14
I would try this conversion while comparing the value in the first column.
I have a csv file like this :
id;verbatim
1;je veux manger
2;tu as manger
I have my script which return a dictionary like this :
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
T would like to add 4 news columns like this :
id;verbatim,key,value1,value2,value3
1;je veux manger
2;tu as manger
And then add my dictionary in each columns like this :
id;verbatim;key,value1,value2,value3
1je veux manger;manger;7;1;0
2;tu as manger;être laid;0;21;1041
Below the script which allow me to get my dictionary :
with open('output.csv','wb') as f:
w = csv.writer(f)
w.writerow(dico_phrases.keys())
w.writerow(dico_phrases.values())
I have this :
['manger'],"['être', 'laid']"
"['7', '1', '0']","['0', '21', '1041']"
It is not quite I have imagined
Consider using pandas for this -
df = pd.read_csv("input.csv", index_col='id')
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
df['key'] = [" ".join(eval(x)) for x in dico_phrases.keys()]
df = df.join(pd.DataFrame([x for x in dico_phrases.values()], index=df.index))
df.to_csv("output.csv")
Output:
id,verbatim,key,0,1,2
1,je veux manger,manger,7,1,0
2,tu as manger,être laid,0,21,1041
I am working in python using ReportLab. I need to generate report in PDF format. The data is retrieving from the database and insert into table.
Here is simple code:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.units import inch
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', 'Here is large field retrieve from database', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data)
columnWidth = 1.9*inch;
for x in range(5):
t._argW[x]= cellWidth
elements.append(t)
doc.build(elements)
There are three issues:
The lengthy data in a cell overlap on the other cell in a row.
When I increase the column-width manually such as cellWidth = 2.9*inch; , the page is not visible and not scroll from Left-Right
I do not know how to append the data in a cell , mean if the size of the data is large ,it should append to the next line in the same cell.
How I reach this problem?
For starters i would not set the column size as you did. just pass Table the colWidths argument like this:
Table(data, colWidths=[1.9*inch] * 5)
Now to your problem. If you don't set the colWidth parameter reportlab will do this for you and space the columns according to your data. If this is not what you want, you can encapsulate your data into Paragraph's, like Bertrand said. Here is an modified example of your code:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
styles = getSampleStyleSheet()
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', Paragraph('Here is large field retrieve from database', styles['Normal']), '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data)
elements.append(t)
doc.build(elements)
I think you will get the idea.
I was facing the same problem today. I was looking for a solution where I could only resize a single column - with much different content length than other columns - and have reportlab do the rest for me.
This is what worked for me:
Table(data, colWidths=[1.9*inch] + [None] * (len(data[0]) - 1))
This specifies only the first column. But of course you can easily place the number somewhere in between the Nones as well
Hi I was getting same issue while I was resizing content of table and then I found solution. This solution might will help you and resolve all your 3 issues which is mention here.
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.units import inch
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', 'Here is large field retrieve from database', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data,colWidths=[1.9*inch]*5, rowHeights=[0.9*inch] *4)
#colWidth = size * number of columns
#rowHeights= size * number of rows
elements.append(t)
doc.build(elements)