I am currently trying to apply the data science skills I am learning through Coursera and Dataquest to small personal projects.
I found a dataset on Google BigQuery from the US Department of Health and Human Services which includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
I exported the data to a .csv file and imported it into a Jupyter notebook which I am running through Anaconda. Upon looking at the header of the dataset I noticed that the dates/weeks are shown as 'epi_week'.
I am trying to make the data more readable and usable for analysis; to do this, I was hoping to convert it into something along the lines of DD/MM/YYYY or Week/Month/Year, etc.
I did some research; apparently epi weeks are also referred to as CDC weeks, and so far I have found a Python 3 package called "epiweeks".
Using the epiweeks package I can turn 'normal' dates into what the package creator calls an epi week form, but the output looks nothing like what I see in the dataset.
For example, if I use today's date, the 24th of May 2019 (24/05/2019), the output is "Week 21 of Year 2019", but this is what the first four entries in the data look like (and all the others follow the same format):
epi_week
'197006'
'197007'
'197008'
'197012'
In [1]: disease_header
Out [1]:
[['epi_week', 'state', 'loc', 'loc_type', 'disease', 'cases', 'incidence_per_100000']]
In [2]: disease[:4]
Out [2]:
[['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']]
The epiweeks package was developed to solve problems like the one you have here.
With the example data you provided, let's create a new column with the week-ending date:
import pandas as pd
from epiweeks import Week

columns = ['epi_week', 'state', 'loc', 'loc_type',
           'disease', 'cases', 'incidence_per_100000']
data = [
    ['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
    ['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
    ['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
    ['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']
]
df = pd.DataFrame(data, columns=columns)

# Now create a new column with the week-ending date in ISO format
df['week_ending'] = df['epi_week'].apply(lambda x: Week.fromstring(x).enddate())
That gives you a week_ending column holding the date each epi week ends on (for example, 1970-02-14 for '197006').
I recommend having a look at the epiweeks package documentation for more examples.
If you only need to have year and week columns, that can be done without using the epiweeks package:
df['year'] = df['epi_week'].apply(lambda x: int(x[:4]))
df['week'] = df['epi_week'].apply(lambda x: int(x[4:6]))
That gives you plain integer year and week columns (1970 and 6, 7, 8, 12 for the sample rows).
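If you would rather avoid .apply over 125 years of rows, pandas' vectorized string methods do the same slicing (a minimal sketch on the sample rows):

```python
import pandas as pd

df = pd.DataFrame({'epi_week': ['197006', '197007', '197008', '197012']})

# Vectorized slicing: the first four characters are the year, the rest the week
df['year'] = df['epi_week'].str[:4].astype(int)
df['week'] = df['epi_week'].str[4:].astype(int)

print(df)
```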
Related
Pandas is not labelling the columns correctly, which doesn't make sense to me because I have used this method a few times already. I cannot think of anything that could be going wrong, and I can reproduce the error in a fresh Jupyter notebook as well.
import pandas as pd
columns = {'Start Time', 'Open', 'High', 'Low', 'Close', 'Volume',
           'End Time', 'Amount', 'No. Trades', 'Taker Buy Base',
           'Taker Buy Quote', 'N'}
df = pd.DataFrame(test_data, columns=columns)
df
Sample data
test_data = [[1617231600000,
'538.32000000',
'545.15000000',
'535.05000000',
'541.06000000',
'8663.58299000',
1617235199999,
'4686031.35064850',
11361,
'5051.86680000',
'2733476.69840350',
'0'],
[1617235200000,
'541.11000000',
'554.67000000',
'541.00000000',
'552.49000000',
'13507.97221000',
1617238799999,
'7404389.49931720',
14801,
'7736.80002000',
'4242791.14275430',
'0'],
[1617238800000,
'552.58000000',
'553.73000000',
'544.82000000',
'548.50000000',
'7115.15238000',
1617242399999,
'3907155.60293150',
5556,
'3580.46860000',
'1964701.27448790',
'0'],
[1617242400000,
'548.49000000',
'550.63000000',
'544.70000000',
'545.45000000',
'3589.18173000',
1617245999999,
'1964514.51702120',
3974,
'1742.76278000',
'954042.85262340',
'0'],
[1617246000000,
'545.80000000',
'545.80000000',
'540.48000000',
'541.56000000',
'4767.67233000',
1617249599999,
'2586841.14566500',
4960,
'2516.25626000',
'1364734.14900580',
'0']]
Expected output: the column names should be labelled in the same order as given in the columns variable. For me they come out in a clearly wrong order. There is another post but it did not help.
Python sets do not maintain insertion order. Use a list (square brackets instead of curly braces) instead.
columns = ['Start Time', 'Open', 'High', 'Low', 'Close', 'Volume',
           'End Time', 'Amount', 'No. Trades', 'Taker Buy Base',
           'Taker Buy Quote', 'N']
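To see the difference directly, here is a minimal sketch with a few toy columns:

```python
import pandas as pd

data = [[1617231600000, '538.32000000', '545.15000000']]

# A list preserves exactly the order you wrote; a set would not
columns = ['Start Time', 'Open', 'High']
df = pd.DataFrame(data, columns=columns)

print(list(df.columns))  # → ['Start Time', 'Open', 'High']
```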
How to add a column based on one of the values of a list of lists in python?
I have the following list and I need to add a new column based on the value of Currency.
If Currency is Pound: Euro = Amount * 0.9
If Currency is USD: Euro = Amount * 1.2
I need to code without libraries.
[['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
 ['100', '200', '4923', 'c', 'Pound'],
 ['600', '429', '838672', 'a', 'USD'],
 ['650', '400', '8672', 'a', 'Euro']]
Result
[['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency', 'Euro'],
 ['100', '200', '5000', 'c', 'Pound', '6000'],
 ['600', '429', '10000', 'a', 'USD', '9000'],
 ['650', '400', '8600', 'a', 'Euro', '8600']]
Thank you very much; any reading on how to import a CSV and manipulate it without libraries would be much appreciated.
Assuming the columns are always in the same order, and using Decimal to avoid floating-point rounding on money values:

from decimal import Decimal

EXCH_RATES = {
    'Pound': Decimal('0.9'),
    'USD': Decimal('1.2'),
    'Euro': 1,
}

rows[0].append('Euro')
for row in rows[1:]:
    exch_rate = EXCH_RATES[row[4]]
    row.append(str(exch_rate * Decimal(row[2])))
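Run against the sample rows from the question (with the Decimal import it needs and the currency looked up in the fifth column), it comes out like this:

```python
from decimal import Decimal

EXCH_RATES = {
    'Pound': Decimal('0.9'),
    'USD': Decimal('1.2'),
    'Euro': 1,
}

rows = [['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
        ['100', '200', '4923', 'c', 'Pound'],
        ['600', '429', '838672', 'a', 'USD'],
        ['650', '400', '8672', 'a', 'Euro']]

rows[0].append('Euro')
for row in rows[1:]:
    exch_rate = EXCH_RATES[row[4]]                # currency is the 5th column
    row.append(str(exch_rate * Decimal(row[2])))  # Decimal(...), a call, not indexing

print(rows[1])  # → ['100', '200', '4923', 'c', 'Pound', '4430.7']
```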
Check the Currency item of each inner list, then append the converted amount, like so (note that the currency sits at index 4, not 3, and the data rows start at index 1, after the header):

lst = [['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
       ['100', '200', '4923', 'c', 'Pound'],
       ['600', '429', '838672', 'a', 'USD'],
       ['650', '400', '8672', 'a', 'Euro']]

lst[0].append('Euro')
for i in range(1, len(lst)):
    if lst[i][4] == 'Pound':
        lst[i].append(str(int(lst[i][2]) * 0.9))
    elif lst[i][4] == 'USD':
        lst[i].append(str(int(lst[i][2]) * 1.2))
    else:
        lst[i].append(lst[i][2])

You would be better off storing the data in a CSV file, although then you would want the csv library. Tell me if this helps, and if you want to use the csv library, let me know and I can show you how.
Let me try to explain to the best of my ability, as I am not a Python wizard. I have read a PDF table of COVID-19 data for Mexico with PyPDF2 and tokenized it (long story: I tried doing it with tabula but did not get the format I was expecting, and I was going to spend more time reformatting the CSV I got back than analyzing it). I now have a list of strings with a len of 16792, which is fine.
Now, the problem I am facing is that I need to concatenate some (not all) of those strings together so I can create a list of lists of equal length, one per row across the 9 columns.
This is an example of how it looks right now, the columns are Case number, State, Locality, Gender, Age, Date when symptoms started, Status, Type of contagion, Date of arrival to Mexico:
['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020'
What I would want is to get certain strings like 'ZONA', 'NORTE' as 'ZONA NORTE' or 'CIUDAD', 'DE', 'MEXICO' as 'CIUDAD DE MEXICO' or 'ESTADOS', 'UNIDOS' as 'ESTADOS UNIDOS'...
I seriously do not know how to tackle this. I have tried split(), replace(), finding the index of each occurrence, reading every question about manipulating lists, and almost all of the answers provided... and I haven't been able to do it.
Any guidance, will be greatly appreciated. Sorry if this is a very basic question, but I know there has to be a way, I just don't know it.
Since the phrase split is not the same for each row, you need a way to "recognize" the fragments. One heuristic: if two adjacent list items each contain a word of 3 or more letters, join them.
import re

row_list = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020']

words_longer_than_3 = r'([^\d\W]){3,}'

def is_a_word(text):
    return bool(re.findall(words_longer_than_3, text))

def get_next_item(row_list, i):
    try:
        next_item = row_list[i+1]
    except IndexError:
        return
    return is_a_word(next_item)

for i, item in enumerate(row_list):
    item_is_a_word = is_a_word(row_list[i])
    if not item_is_a_word:
        continue
    next_item_is_a_word = get_next_item(row_list, i)
    while next_item_is_a_word:
        row_list[i] += f' {row_list[i+1]}'
        del row_list[i+1]
        next_item_is_a_word = get_next_item(row_list, i)

print(row_list)
result:
['1', 'PUEBLA PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso Contacto', 'NA', '2', 'GUERRERO ZONA NORTE', 'M', '29', '15/03/2020', 'Sospechoso Contacto', 'NA', '3', 'BAJA CALIFORNIA TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso Estados Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso Italia', '03/03/2020', '5', 'JALISCO CENTRO GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso España', '17/03/2020']
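A different heuristic worth trying (a sketch, not part of the answer above): since the first field of every record is a strictly increasing case number, you can split the flat token list into one sub-list per record and then handle the multi-word fields inside each record. This can misfire if, say, an age happens to equal the next expected case number, so treat it as a starting point:

```python
# Start a new record whenever the next expected case number appears
tokens = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso',
          'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29',
          '15/03/2020', 'Sospechoso', 'Contacto', 'NA']

records = []
current = []
expected = 1                      # case numbers count up from 1
for tok in tokens:
    if tok == str(expected):      # boundary: the next case number was reached
        if current:
            records.append(current)
        current = [tok]
        expected += 1
    else:
        current.append(tok)
if current:
    records.append(current)

print(records)
```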
I suppose the data you want to process comes from a file similar to this one here, which contains 2623 rows x 8 columns.
You can load the data from the PDF file using tabula-py, which you can install with pip install tabula-py==1.3.0. One issue with extracting the table like this is that tabula-py can sometimes mess things up. For example, the header of the table was extracted from the PDF file like this:
"",,,,Identificación Fecha de,
"",,,,de COVID-19,Fecha del llegada a
N° Caso,Estado,Sexo,Edad,Inicio de Procedencia por RT-PCR en,México
"",,,,síntomas,
"",,,,tiempo real,
Pretty nasty huh?
In addition, tabula-py could not separate some of the columns, that is, write a comma in the right place so that the output CSV file parses cleanly. For instance, the line with número de caso (case number) 8:
8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020
could be fixed by just replacing " Sospechoso " with ",Sospechoso,". You are lucky, because this is the only parsing issue you will have to deal with for now. Thus, iterating through the lines of the output CSV file and replacing " Sospechoso " with ",Sospechoso," takes care of everything.
Finally, I added an option (removeaccents) for removing the accents from the data, which can help you avoid encoding issues later. For this you will need unidecode: pip install unidecode.
Putting everything together, the code that reads the PDF file, converts it to a CSV file, and loads it into a pandas dataframe is as follows (you can download the preprocessed CSV file here):
import tabula
import pandas as pd
import unidecode

"""
1. Adds the header.
2. Skips "corrupted" lines from tabula.
3. Replacing " Sospechoso " with ",Sospechoso," automatically separates
   the previous column ("fecha_sintomas") from the next one ("procedencia").
4. Finally, eliminates accents to avoid encoding issues.
"""
def simplePrep(
    input_path,
    header = [
        "numero_caso",
        "estado",
        "sexo",
        "edad",
        "fecha_sintomas",
        "identification_tiempo_real",
        "procedencia",
        "fecha_llegada_mexico"
    ],
    lookfor = " Sospechoso ",
    replacewith = ",Sospechoso,",
    output_path = "preprocessed.csv",
    skiprowsupto = 5,
    removeaccents = True
):
    fin = open(input_path, "rt")
    fout = open(output_path, "wt")
    fout.write(",".join(header) + "\n")
    count = 0
    for line in fin:
        if count > skiprowsupto - 1:
            if removeaccents:
                fout.write(unidecode.unidecode(line.replace(lookfor, replacewith)))
            else:
                fout.write(line.replace(lookfor, replacewith))
        count += 1
    fin.close()
    fout.close()

"""
Reads all the PDF pages, specifying that the table spans multiple pages
(multiple_tables = True); otherwise the first row of each page is missed.
"""
tabula.convert_into(
    input_path = "data.pdf",
    output_path = "output.csv",
    output_format = "csv",
    pages = 'all',
    multiple_tables = True
)

simplePrep("output.csv", removeaccents = True)

# read the preprocessed data set
df = pd.read_csv("preprocessed.csv", header = 0)

# print the first 5 rows of the dataframe
print(df.head(5))
I am getting one HTML table per day, so if I search for 20 days I get back 20 tables, and I want to combine them all into one table so I can verify the data as a time series.
I have tried pandas' merge and add functions, but add just concatenates the values as strings.
Table one
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
Table two
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
My code, which adds the values as strings:
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]

df = pd.DataFrame(tab_data)
df2 = pd.DataFrame(tab_data)
df3 = df.add(df2, fill_value=0)
df
If you want to convert the numeric cells into integers, you would need to do that explicitly, as follows:
tab_data = [[int(item.text) if item.text.isdigit() else item.text
             for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
Hope it helps.
The way you are converting the data frame treats all values as text. There are two options here:
1. Explicitly convert the strings to the data type you want using astype.
2. Use read_html to create data frames from HTML tables, which also attempts the data type conversion.
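A sketch of option 1 with the table data above (assuming the first row is the header and the first column is a row label; everything else is cast to int before adding):

```python
import pandas as pd

table = [['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
         ['Total Issues Traded', '8039', '5456', '2386', '197'],
         ['Advances', '3834', '2671', '1075', '88']]

def to_frame(tab):
    # First row becomes the header, first column the row label;
    # the remaining cells are digit strings, so cast them to int
    df = pd.DataFrame(tab[1:], columns=tab[0]).set_index('\xa0')
    return df.astype(int)

df1 = to_frame(table)
df2 = to_frame(table)   # pretend this is the second day's table
total = df1.add(df2)    # a real numeric sum, not string concatenation

print(total)
```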
So, this is my first time using the Census API (or any API at all), and I have had some amount of luck so far. Rather than using the census package or cenpy, I have just been creating the URL's by hand since it is not that difficult.
For example, the following successfully gets me the number of people who speak different languages by state and congressional district.
This Works:
import requests
import pandas as pd
r = requests.get('https://api.census.gov/data/2016/acs/acs1/subject?get=S1601_C01_002E,S1601_C01_004E,S1601_C01_008E,S1601_C01_012E&for=congressional%20district:*&in=state:*&key=<my key>')
r.status_code
data = r.json()
df = pd.DataFrame(data)
header = df.iloc[0]
df = df[1:]
df.columns = header
lang = df.rename(columns = {'congressional district': 'cd',
                            'S1601_C01_002E': 'English',
                            'S1601_C01_004E': 'Spanish',
                            'S1601_C01_008E': 'IndoEuropean',
                            'S1601_C01_012E': 'AsianPacific'})
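As a side note, the header-promotion dance above (iloc[0], slicing, reassigning columns) can be collapsed into the DataFrame constructor; a sketch with made-up values in the shape the API returns:

```python
import pandas as pd

# The Census API returns a JSON array whose first row is the header
# (the values below are made up, for illustration only)
data = [['S1601_C01_002E', 'state', 'congressional district'],
        ['239806', '01', '01'],
        ['210843', '01', '02']]

df = pd.DataFrame(data[1:], columns=data[0])
print(df)
```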
However, I am now interested in retrieving a variable that is a percentage. When I try to pull it, I get -888888888 for all the values in the field. I assume there is a simple way to retrieve this data in the correct format that I am missing. Does anyone know what that is?
Faulty Code:
r = requests.get('https://api.census.gov/data/2016/acs/acs1/subject?get=S1501_C01_015E&for=congressional%20district:*&in=state:*&key=<my key>')
r.status_code
data = r.json()
Results:
[['S1501_C01_015E', 'state', 'congressional district'],
['-888888888', '01', '01'],
['-888888888', '01', '02'],
['-888888888', '01', '03'],
['-888888888', '01', '04'],
['-888888888', '01', '05'],...