Let me try to explain this as best I can, as I am not a Python wizard. I used PyPDF2 to read a PDF table of COVID-19 data for Mexico and tokenized it. (Long story: I tried tabula first, but did not get the format I expected, and I would have spent more time reformatting the resulting CSV than analyzing it.) I got back a list of 16792 strings, which is fine.
Now, the problem I am facing is that I need to concatenate some (not all) of those strings so I can build a list of lists in which every row has the same length, 9 columns.
This is an example of how it looks right now, the columns are Case number, State, Locality, Gender, Age, Date when symptoms started, Status, Type of contagion, Date of arrival to Mexico:
['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020'
What I would want is to get certain strings like 'ZONA', 'NORTE' as 'ZONA NORTE' or 'CIUDAD', 'DE', 'MEXICO' as 'CIUDAD DE MEXICO' or 'ESTADOS', 'UNIDOS' as 'ESTADOS UNIDOS'...
I seriously do not know how to tackle this. I have tried split() and replace(), tried finding the index of each frequency, read all the questions about manipulating lists, and tried almost all of the answers provided, and I still haven't been able to do it.
Any guidance will be greatly appreciated. Sorry if this is a very basic question, but I know there has to be a way; I just don't know it.
Since the phrases are not split the same way in every row, you need a way to "recognize" them. One way: if two list items that are next to each other each contain a run of three or more letters, join them.
import re
row_list = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020']
words_longer_than_3 = r'([^\d\W]){3,}'

def is_a_word(text):
    return bool(re.findall(words_longer_than_3, text))

def get_next_item(row_list, i):
    try:
        next_item = row_list[i+1]
    except IndexError:
        return
    return is_a_word(next_item)

for i, item in enumerate(row_list):
    item_is_a_word = is_a_word(row_list[i])
    if not item_is_a_word:
        continue
    next_item_is_a_word = get_next_item(row_list, i)
    while next_item_is_a_word:
        row_list[i] += f' {row_list[i+1]}'
        del row_list[i+1]
        next_item_is_a_word = get_next_item(row_list, i)

print(row_list)
result:
['1', 'PUEBLA PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso Contacto', 'NA', '2', 'GUERRERO ZONA NORTE', 'M', '29', '15/03/2020', 'Sospechoso Contacto', 'NA', '3', 'BAJA CALIFORNIA TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso Estados Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso Italia', '03/03/2020', '5', 'JALISCO CENTRO GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso España', '17/03/2020']
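Note that in this result neighbouring columns that both contain words (state and locality, status and type of contagion) end up merged, and short words such as 'DE' are not joined. A different approach, sketched here under several assumptions, is to use the tokens that are easy to recognize (gender, age, dates) as anchors and cut each row around them. The names STATES, split_place and to_rows are hypothetical; STATES is a hand-written list that only covers the sample and would need to be completed. The sketch also assumes the status is always a single token and that the type of contagion is never 'NA' or a date.

```python
import re

DATE = re.compile(r"\d{2}/\d{2}/\d{4}")

# Assumption: a hand-written list of state names (only those in the sample).
STATES = ["BAJA CALIFORNIA", "CIUDAD DE MÉXICO", "GUERRERO", "JALISCO", "PUEBLA"]


def split_place(words):
    """Split place tokens into (state, locality) using the longest known state."""
    text = " ".join(words)
    for state in sorted(STATES, key=len, reverse=True):
        if text == state or text.startswith(state + " "):
            return state, text[len(state):].strip()
    return text, ""  # unknown state: leave everything in the state column


def to_rows(tokens):
    """Group a flat token list into 9-column rows.

    Raises IndexError if the tokens do not follow the expected layout.
    """
    rows, i = [], 0
    while i < len(tokens):
        # The gender token is the first M/F followed by an age and a date.
        g = i + 1
        while not (tokens[g] in ("M", "F") and tokens[g + 1].isdigit()
                   and DATE.fullmatch(tokens[g + 2])):
            g += 1
        state, locality = split_place(tokens[i + 1:g])
        # The type of contagion runs until the arrival column (a date or 'NA').
        end = g + 4
        while not (tokens[end] == "NA" or DATE.fullmatch(tokens[end])):
            end += 1
        rows.append([tokens[i], state, locality,
                     tokens[g], tokens[g + 1], tokens[g + 2], tokens[g + 3],
                     " ".join(tokens[g + 4:end]), tokens[end]])
        i = end + 1
    return rows


tokens = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso',
          'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34',
          '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020']
rows = to_rows(tokens)
for row in rows:
    print(row)
```

Every row produced this way has exactly 9 items, so the result can be passed straight to csv.writer or pandas.DataFrame.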
I suppose the data you want to process comes from a file similar to this one here, which contains 2623 rows x 8 columns.
You can load the data from the PDF file using tabula-py, which you can install with pip install tabula-py==1.3.0. One issue with extracting the table this way is that tabula-py sometimes garbles things. For example, the header of the table was extracted from the PDF file like this:
"",,,,Identificación Fecha de,
"",,,,de COVID-19,Fecha del llegada a
N° Caso,Estado,Sexo,Edad,Inicio de Procedencia por RT-PCR en,México
"",,,,síntomas,
"",,,,tiempo real,
Pretty nasty huh?
In addition, tabula-py could not separate some of the columns; that is, it did not write a comma in the right place for the output CSV file to parse correctly. For instance, take the line with número de caso (case number) 8:
8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020
It can be fixed by just replacing " Sospechoso " with ",Sospechoso,". And you are lucky, because this is the only parsing issue you will have to deal with for now. Thus, iterating through the lines of the output CSV file and replacing " Sospechoso " with ",Sospechoso," takes care of everything.
Finally, I added an option (removeaccents) for removing the accents from the data. This can help you avoid encoding issues in the future. For this you will need unidecode: pip install unidecode.
Putting everything together, the code that reads the PDF file and converts it to a CSV file and loads it as a pandas dataframe is as follows (You can download the preprocessed CSV file here):
import tabula
import pandas as pd
import unidecode


def simplePrep(
    input_path,
    header = [
        "numero_caso",
        "estado",
        "sexo",
        "edad",
        "fecha_sintomas",
        "identification_tiempo_real",
        "procedencia",
        "fecha_llegada_mexico"
    ],
    lookfor = " Sospechoso ",
    replacewith = ",Sospechoso,",
    output_path = "preoprocessed.csv",
    skiprowsupto = 5,
    removeaccents = True
):
    """
    1. Adds the header.
    2. Skips "corrupted" lines from tabula.
    3. Replacing " Sospechoso " with ",Sospechoso," automatically separates
       the previous column ("fecha_sintomas") from the next one ("procedencia").
    4. Finally, eliminates accents to avoid encoding issues.
    """
    with open(input_path, "rt") as fin, open(output_path, "wt") as fout:
        fout.write(",".join(header) + "\n")
        count = 0
        for line in fin:
            # The first skiprowsupto lines hold the mangled header
            if count > skiprowsupto - 1:
                if removeaccents:
                    fout.write(unidecode.unidecode(line.replace(lookfor, replacewith)))
                else:
                    fout.write(line.replace(lookfor, replacewith))
            count += 1

"""
Reads all the pdf pages specifying that the table spans multiple pages.
multiple_tables = True, otherwise the first row of each page will be missed.
"""
tabula.convert_into(
    input_path = "data.pdf",
    output_path = "output.csv",
    output_format = "csv",
    pages = 'all',
    multiple_tables = True
)
simplePrep("output.csv", removeaccents = True)
# reads preprocess data set
df = pd.read_csv("preoprocessed.csv", header = 0)
# prints the first 5 samples in the dataframe
print(df.head(5))
How to add a column based on one of the values of a list of lists in python?
I have the following list and I need to add a new column based on the value of Currency.
If Currency is Pound, Euro = Amount * 0.9
If Currency is USD, Euro = Amount * 1.2
I need to code without libraries.
[['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
 ['100', '200', '4923', 'c', 'Pound'],
 ['600', '429', '838672', 'a', 'USD'],
 ['650', '400', '8672', 'a', 'Euro']]
Result
[['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency', 'Euro'],
['100', '200', '5000', 'c', 'Livre', '6000'],
['600', '429', '10000', 'a', 'USD', '9000'],
['650', '400', '8600', 'a', 'Euro', '8600']
Thank you very much, any readings on how to import a csv and manipulate it, without libraries, would be much appreciated.
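Assuming the columns are always in the same order, and that rows holds the list of lists from the question...
from decimal import Decimal

EXCH_RATES = {
    'Pound': Decimal('0.9'),
    'USD': Decimal('1.2'),
    'Euro': Decimal('1'),
}

rows[0].append('Euro')  # extend the header row
for row in rows[1:]:
    exch_rate = EXCH_RATES[row[4]]  # row[4] is the Currency column
    row.append(str(exch_rate * Decimal(row[2])))  # row[2] is the Amount column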
Check the last item of each inner list to see what its currency is, then convert the amount accordingly:
lst = [['Buyer', 'Seller', 'Amount', 'Property_Type','Currency'], ['100', '200', '4923', 'c', 'Pound'], ['600', '429', '838672', 'a', 'USD'], ['650', '400', '8672', 'a', 'Euro']]
lst[0].append('Euro')  # new column name in the header row
for row in lst[1:]:  # skip the header row
    if row[4] == 'Pound':  # index 4 is the Currency column
        row.append(str(int(row[2]) * 0.9))
    elif row[4] == 'USD':
        row.append(str(int(row[2]) * 1.2))
    else:
        row.append(row[2])
You would be better off storing the data in a CSV file, although then you would want to use the csv module (it ships with Python's standard library).
Tell me if this helps, and if you want to use the csv module, let me know so I can explain how to use it.
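On the request for a way to import a CSV and manipulate it without libraries: a minimal sketch, assuming the file is simple (comma-separated, no quoted fields that contain commas). The csv_text string stands in for the contents of a real file, which could be obtained with open(path).read(); the names RATES and add_euro_column are made up for this example.

```python
# Hypothetical sample standing in for the contents of a CSV file.
csv_text = """Buyer,Seller,Amount,Property_Type,Currency
100,200,4923,c,Pound
600,429,838672,a,USD
650,400,8672,a,Euro
"""

# Conversion rates taken from the question.
RATES = {"Pound": 0.9, "USD": 1.2, "Euro": 1.0}


def add_euro_column(text):
    """Parse simple CSV text by hand and append a Euro column to every row."""
    rows = [line.split(",") for line in text.splitlines() if line]
    header, body = rows[0], rows[1:]
    # Look the columns up by name instead of hard-coding their positions.
    amount_idx = header.index("Amount")
    currency_idx = header.index("Currency")
    header.append("Euro")
    for row in body:
        euro = int(row[amount_idx]) * RATES[row[currency_idx]]
        row.append(f"{euro:.2f}")  # keep two decimals, as a string like the rest
    return rows


rows = add_euro_column(csv_text)
for row in rows:
    print(",".join(row))
```

Splitting on commas by hand breaks as soon as a field is quoted or contains a comma, which is the main reason the csv module is usually preferred.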
I have a csv file full of data, which is all type string. The file is called Identifiedλ.csv.
Here is some of the data from the csv file:
['Ref', 'Ion', 'ULevel', 'UConfig.', 'ULSJ', 'LLevel', 'LConfig.', 'LLSJ']
['12.132', 'Ne X', '4', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.132', 'Ne X', '3', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
What I would like to do is read the file and search for a number in the column 'Ref', for example 12.846. If the number I search for matches a number in the file, print the whole row for that number.
eg. something like:
csv_g = csv.reader(open('Identifiedλ.csv', 'r'), delimiter=",")
for row in csv_g:
    if 12.846 == (row[0]):
        print (row)
And it would return (hopefully)
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
However this returns nothing and I think it's because the 'Ref' column is type string and the number I search is type float. I'm trying to find a way to convert the string to float but am seemingly stuck.
I've tried:
df = pd.read_csv('Identifiedλ.csv', dtype = {'Ref': np.float64,})
and
array = b = np.asarray(array,
dtype = np.float64, order = 'C')
but am confused on how to incorporate this with the rest of the search.
Any tips would be most helpful! Thank you!
Python has a function to convert strings to floats. For example, the following evaluates to True:
float('3.14')==3.14
I would try this conversion while comparing the value in the first column.
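Applied to the search loop from the question, that could look like the sketch below. It uses an in-memory sample instead of the real file so that it runs on its own; with the real data you would pass open('Identifiedλ.csv', newline='') instead. The helper name find_rows is made up, and math.isclose is used rather than == as a precaution against tiny floating-point differences.

```python
import csv
import io
import math

# In-memory stand-in for the CSV file from the question.
sample = io.StringIO(
    "Ref,Ion,ULevel,UConfig.,ULSJ,LLevel,LConfig.,LLSJ\n"
    "12.132,Ne X,4,2p1,2P3/2,1,1s1,1S0\n"
    "12.132,Ne X,3,2p1,2P3/2,1,1s1,1S0\n"
    "12.846,Fe XX,58,1s2.2s2.2p2.3d1,4P5/2,1,1s2.2s2.2p3,4S3/2\n"
)


def find_rows(lines, target):
    """Return every row whose first column, read as a float, matches target."""
    reader = csv.reader(lines, delimiter=",")
    next(reader)  # skip the header row, which cannot be converted to float
    return [row for row in reader if math.isclose(float(row[0]), target)]


matches = find_rows(sample, 12.846)
for row in matches:
    print(row)
```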
I am currently trying to learn how to apply Data Science skills which I am learning through Coursera and Dataquest to little personal projects.
I found a dataset on Google BigQuery from the US Department of Health and Human Services which includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
I exported the data to a .csv file and imported it into a Jupyter notebook which I am running through Anaconda. Upon looking at the header of the dataset I noticed that the dates/weeks are shown as 'epi_week'.
I am trying to make the data more readable and usable for analysis. To do this, I was hoping to convert it into something along the lines of DD/MM/YYYY or Week/Month/Year.
I did some research, apparently epi-weeks are also referred to as CDC weeks and so far I found an extension/package for python 3 which is called "epiweeks".
Using the epiweeks package I can turn 'normal' dates into what the package creator refers to as epi weeks, but they look nothing like what I see in the dataset.
For example, if I use today's date, the 24th of May 2019 (24/05/2019), the output is "Week 21 of Year 2019", but this is what the first four entries in the data look like (all the others follow the same format):
epi_week
'197006'
'197007'
'197008'
'197012'
In [1]: disease_header
Out [1]:
[['epi_week', 'state', 'loc', 'loc_type', 'disease', 'cases', 'incidence_per_100000']]
In [2]: disease[:4]
Out [2]:
[['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']]
The epiweeks package was developed to solve problems like the one you have here.
With the example data you provided, let's create a new column with week ending date:
import pandas as pd
from epiweeks import Week
columns = ['epi_week', 'state', 'loc', 'loc_type',
'disease', 'cases', 'incidence_per_100000']
data = [
['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']
]
df = pd.DataFrame(data, columns=columns)
# Now create a new column with week ending date in ISO format
df['week_ending'] = df['epi_week'].apply(lambda x: Week.fromstring(x).enddate())
That adds a week_ending column holding the last day of each epi week as a date.
I recommend having a look at the epiweeks package documentation for more examples.
If you only need to have year and week columns, that can be done without using the epiweeks package:
df['year'] = df['epi_week'].apply(lambda x: int(x[:4]))
df['week'] = df['epi_week'].apply(lambda x: int(x[4:6]))
That adds integer year and week columns; for '197006' they are 1970 and 6.
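If installing a package is not an option, the week-ending date can also be computed with the standard library alone. This is a sketch under the assumption that the data follows the CDC (MMWR) convention, where weeks run from Sunday to Saturday and week 1 is the first week with at least four days in the new year, which is the week containing 4 January. The function name epi_week_end is made up, and the week number is not validated.

```python
from datetime import date, timedelta


def epi_week_end(epi_week):
    """Return the Saturday that ends a CDC/MMWR week given as 'YYYYWW'."""
    year, week = int(epi_week[:4]), int(epi_week[4:6])
    jan4 = date(year, 1, 4)
    # date.weekday() counts Monday as 0 and Sunday as 6, so this is the
    # number of days elapsed since the most recent Sunday.
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    # Move to the start of the requested week, then to its Saturday.
    return week1_start + timedelta(weeks=week - 1, days=6)


ends = [epi_week_end(w) for w in ['197006', '197007', '197008', '197012']]
for d in ends:
    print(d.isoformat())
```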
I have a csv file like this :
id;verbatim
1;je veux manger
2;tu as manger
I have my script which return a dictionary like this :
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
I would like to add 4 new columns like this:
id;verbatim;key;value1;value2;value3
1;je veux manger
2;tu as manger
And then fill those columns from my dictionary like this:
id;verbatim;key;value1;value2;value3
1;je veux manger;manger;7;1;0
2;tu as manger;être laid;0;21;1041
Below is the script I use to write my dictionary:
with open('output.csv','wb') as f:
    w = csv.writer(f)
    w.writerow(dico_phrases.keys())
    w.writerow(dico_phrases.values())
I get this:
['manger'],"['être', 'laid']"
"['7', '1', '0']","['0', '21', '1041']"
That is not quite what I had imagined.
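Consider using pandas for this. The input file is semicolon-delimited, so pass sep=';' when reading it -
import ast
import pandas as pd

df = pd.read_csv("input.csv", sep=';', index_col='id')
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
# The keys are string representations of lists; literal_eval parses them safely
df['key'] = [" ".join(ast.literal_eval(x)) for x in dico_phrases.keys()]
df = df.join(pd.DataFrame([x for x in dico_phrases.values()], index=df.index))
df.to_csv("output.csv")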
Output:
id,verbatim,key,0,1,2
1,je veux manger,manger,7,1,0
2,tu as manger,être laid,0,21,1041
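The same result can be sketched with only the csv module, keeping the semicolon delimiter and the column names the question asked for. Like the pandas version, this pairs the rows with the dictionary entries purely by position, which is an assumption about the data; it also relies on dictionaries preserving insertion order (Python 3.7+). In-memory buffers stand in for the real input and output files.

```python
import ast
import csv
import io

# In-memory stand-in for the semicolon-delimited input file.
source = io.StringIO("id;verbatim\n1;je veux manger\n2;tu as manger\n")
dico_phrases = {"['manger']": ['7', '1', '0'],
                "['être', 'laid']": ['0', '21', '1041']}

reader = csv.reader(source, delimiter=";")
header = next(reader) + ["key", "value1", "value2", "value3"]

out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerow(header)
# Pair each input row with one dictionary entry, in order.
for row, (key, values) in zip(reader, dico_phrases.items()):
    words = ast.literal_eval(key)  # "['être', 'laid']" -> ['être', 'laid']
    writer.writerow(row + [" ".join(words)] + values)

result = out.getvalue()
print(result)
```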
I am having a bit of trouble getting started on an assignment. We were given a tab-delimited .txt file with 6 columns of data and around 50 lines. I need help setting up a list to store this data for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit: I really haven't done much besides research on this kind of thing. I know I'll be looking into csv, and I have done single-column .txt files before, but I'm not sure how to tackle this situation. How do I give names to the separate columns? How do I tell the program where one row ends and the next begins?
The dataframe structure in Pandas basically does exactly what you want. It's highly analogous to the data frame in R if you're familiar with that. It has built in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from CSV and even picks up the column names automatically. You'd call:
import pandas as pd
df = pd.read_csv(yourfilename,
                 sep='\t',   # makes it tab delimited
                 header=0)   # makes the first row the header row
Works in Python 3.
Let's say you have a csv like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read them into a dictionary like so:
>>> import csv
>>> reader = csv.DictReader(open('test.csv', 'r'), fieldnames=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
...     print(row)
...
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
But Pandas library might be better suited for this. http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
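For the listing, sorting and counting the question mentions, the standard library may already be enough. A minimal sketch, assuming the first line of the file holds the column names; the sample data and the names name, city and age are invented for illustration and use three columns instead of six to keep it short.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited sample; a real file would be opened with
# open('yourfile.txt', newline='') instead.
sample = io.StringIO(
    "name\tcity\tage\n"
    "Ana\tLima\t31\n"
    "Bo\tOslo\t25\n"
    "Cy\tLima\t40\n"
)

# DictReader takes the column names from the first line when fieldnames
# is not given; each later line becomes one dictionary.
records = list(csv.DictReader(sample, delimiter="\t"))


def column(rows, name):
    """Return all values of one column, in file order."""
    return [row[name] for row in rows]


ages_sorted = sorted(int(a) for a in column(records, "age"))
city_counts = Counter(column(records, "city"))
by_age = sorted(records, key=lambda r: int(r["age"]))

print(ages_sorted)
print(city_counts)
print([r["name"] for r in by_age])
```

All values come back as strings, so numeric columns need an explicit int() or float() before sorting.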
Sounds like a job better suited to a database. You could use something like PostgreSQL's COPY FROM operation to import the CSV data into a table, then use Python + SQL for all your sorting, searching and matching needs.
If you feel a real database is overkill, there are still options like SQLite and Berkeley DB, which both have Python modules.
EDIT: Berkeley DB is deprecated, but anydbm (dbm in Python 3) is similar in concept.
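As a rough illustration of the SQLite route, the sqlite3 module from the standard library can hold the rows in an in-memory database. The table name people, its columns and the sample data are all made up for this sketch.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited sample with a header line.
sample = io.StringIO(
    "name\tcity\tage\n"
    "Ana\tLima\t31\n"
    "Bo\tOslo\t25\n"
    "Cy\tLima\t40\n"
)
reader = csv.reader(sample, delimiter="\t")
next(reader)  # skip the header line; the table defines its own columns

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT, age INTEGER)")
# Each remaining row is a list of three strings, one per placeholder.
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", reader)

# Sorting and counting are then plain SQL.
names_by_age = [r[0] for r in conn.execute("SELECT name FROM people ORDER BY age")]
lima_count = conn.execute(
    "SELECT COUNT(*) FROM people WHERE city = ?", ("Lima",)
).fetchone()[0]
conn.close()

print(names_by_age)
print(lima_count)
```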
I think using a db for 50 lines and 6 columns is overkill, so here's my idea:
from __future__ import print_function

import os
from operator import itemgetter


def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
        # Use the first line to get names for columns
        # (strip the newline so it does not end up in the last name)
        fields = [e.lower() for e in f.readline().rstrip('\n').split('\t')]
        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            for i, field in enumerate(line.rstrip('\n').split('\t')):
                record[fields[i]] = field
            records.append(record)
    return records


if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)
    print('Number of records: {0}'.format(len(records)))
    # Assumes the file has a column named "id"
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.