Getting date field from JSON url as pandas DataFrame - python

I am trying to load this API URL into a pandas DataFrame. I can get the values, but I still need to add the date as a column alongside the other values:
import pandas as pd
import ssl

# Work around certificate verification issues when fetching the URL
ssl._create_default_https_context = ssl._create_unverified_context

df = pd.read_json("https://covidapi.info/api/v1/country/DOM")
df = pd.DataFrame(df['result'].values.tolist())
print(df)
Getting this output:
confirmed deaths recovered
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
72 1488 68 16
73 1488 68 16
74 1745 82 17
75 1828 86 33
76 1956 98 36

You need to pass the index from your dataframe as well as the data itself:
df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
The line above creates the same columns, but keeps the original date index from the API call.
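The idea can be sketched offline with a stand-in for the parsed 'result' field (hypothetical data; the real API keys dates to dicts of counts):

```python
import pandas as pd

# Stand-in for df['result'] after read_json: a date-indexed Series of dicts
result = pd.Series({
    '2020-04-20': {'confirmed': 1828, 'deaths': 86, 'recovered': 33},
    '2020-04-21': {'confirmed': 1956, 'deaths': 98, 'recovered': 36},
})

# Rebuild the frame, keeping the date index, then promote it to a column
df = pd.DataFrame(index=result.index, data=result.values.tolist())
df = df.rename_axis('date').reset_index()
print(df)
```

The rename_axis/reset_index step turns the preserved date index into an ordinary 'date' column next to the counts.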

Related

How do I assign year & month in a pandas DataFrame?

My DataFrame looks very weird after running the code. The data doesn't come with a year/month variable, so I have to add them manually. Is there a way I could do that?
import json
import requests
import pandas as pd

sample = []
url1 = "https://api.census.gov/data/2018/cps/basic/jan?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url2 = "https://api.census.gov/data/2018/cps/basic/feb?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url3 = "https://api.census.gov/data/2018/cps/basic/mar?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
sample.append(requests.get(url1).text)
sample.append(requests.get(url2).text)
sample.append(requests.get(url3).text)
sample = [json.loads(i) for i in sample]
sample = pd.DataFrame(sample)
sample
Consider read_json to read the Census API URL directly inside a user-defined function. Then iterate through every (year, month) pair with itertools.product, building a data frame and assigning the corresponding columns each time:
import pandas as pd
import calendar
import itertools

def get_census_data(year, month):
    # BUILD DYNAMIC URL
    url = (
        f"https://api.census.gov/data/{year}/cps/basic/{month.lower()}?"
        "get=PEFNTVTY,PEMNTVTY&for=state:01"
    )

    # CLEAN RAW DATA FOR APPROPRIATE ROWS AND COLS, ASSIGN YEAR/MONTH COLS
    raw_df = pd.read_json(url)
    cps_df = (
        pd.DataFrame(raw_df.iloc[1:, ])
          .set_axis(raw_df.iloc[0, ], axis="columns")
          .assign(year=year, month=month)
    )
    return cps_df

# MONTH AND YEAR PAIRS
months_years = itertools.product(
    range(2010, 2021),
    calendar.month_abbr[1:13]
)

# ITERATE PAIRWISE THROUGH PAIRS
cps_list = [get_census_data(yr, mo) for yr, mo in months_years]

# COMPILE AND CLEAN FINAL DATA FRAME
cps_df = (
    pd.concat(cps_list, ignore_index=True)
      .drop_duplicates()
      .reset_index(drop=True)
      .rename_axis(None, axis="columns")
)
Output
cps_df
PEFNTVTY PEMNTVTY state year month
0 57 57 1 2010 Jan
1 303 303 1 2010 Jan
2 233 233 1 2010 Jan
3 57 233 1 2010 Jan
4 73 73 1 2010 Jan
... ... ... ... ...
6447 210 139 1 2020 Dec
6448 363 363 1 2020 Dec
6449 301 57 1 2020 Dec
6450 57 242 1 2020 Dec
6451 416 416 1 2020 Dec
[6452 rows x 5 columns]
The response to each API call is a JSON array of arrays. You called the wrong DataFrame constructor. Try this:
import requests
import pandas as pd

base_url = "https://api.census.gov/data/2018/cps/basic"
params = {
    "get": "PEFNTVTY,PEMNTVTY",
    "for": "state:01",
    "PEEDUCA": 39,
}

df = []
for month in ["jan", "feb", "mar"]:
    r = requests.get(f"{base_url}/{month}", params=params)
    r.raise_for_status()
    j = r.json()
    df.append(pd.DataFrame.from_records(j[1:], columns=j[0]).assign(month=month))
df = pd.concat(df)
Result:
PEFNTVTY PEMNTVTY PEEDUCA state month
0 57 57 39 1 jan
1 57 57 39 1 jan
2 57 57 39 1 jan
3 57 57 39 1 jan
4 57 57 39 1 jan
...
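The array-of-arrays shape is easy to test without a network call; a small stand-in payload:

```python
import pandas as pd

# Stand-in for r.json(): the first row is the header, the rest are records
j = [
    ['PEFNTVTY', 'PEMNTVTY', 'state'],
    ['57', '57', '1'],
    ['303', '303', '1'],
]

df = pd.DataFrame.from_records(j[1:], columns=j[0]).assign(month='jan')
print(df)
```

from_records treats each inner list as one row, which is exactly the shape the Census API returns.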

Concat function not giving desired result

I want to add dataframe to excel every time the code executes, in the last row available in the sheet. Here is the code I am using:
import pandas as pd

def append_df_to_excel(df, excel_path):
    df_excel = pd.read_excel(excel_path)
    result = pd.concat([df_excel, df], ignore_index=True)
    result.to_excel(excel_path)

data_set1 = {
    'Name': ['Rohit', 'Mohit'],
    'Roll no': ['01', '02'],
    'maths': ['93', '63']}
df1 = pd.DataFrame(data_set1)
append_df_to_excel(df1, r'C:\Users\kashk\OneDrive\Documents\ScreenStocks.xlsx')
My desired output(after 3 code runs):
Rohit 1 93
Mohit 2 63
Rohit 1 93
Mohit 2 63
Rohit 1 93
Mohit 2 63
But what I get:
Unnamed: 0.1 Unnamed: 0 Name Roll no maths
0 0 0 Rohit 1 93
1 1 1 Mohit 2 63
2 2 Rohit 1 93
3 3 Mohit 2 63
4 Rohit 1 93
5 Mohit 2 63
Not sure where I am going wrong.
It happens because, by default, functions like to_excel or to_csv write the DataFrame's index as an extra column. So every time you save the file, another index column is added.
That's why you should change the line where you save your dataframe to a file:
result.to_excel(excel_path, index=False)
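The same pitfall is easy to demonstrate without Excel, using an in-memory CSV round-trip (the mechanics of the index column are identical):

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Rohit', 'Mohit'], 'maths': [93, 63]})

# Default save: the index is written as an extra, unnamed column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False: the round-trip preserves the original columns
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

Repeating the first round-trip is what stacks up the "Unnamed: 0", "Unnamed: 0.1", ... columns in the question.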

Changing a coded min to a datetime in python pandas

I have a data set which looks like this. I must mention that 263 means (0-15 min), 264 means (16-30 min), 265 means (31-45 min), and 266 means (46-60 min). I need to combine these columns into a single datetime column formatted as YYYY-MM-DD HH:MM:SS:
LOCAL_YEAR LOCAL_MONTH LOCAL_DAY LOCAL_HOUR VALUE FLAG STATUS MEAS_TYPE_ELEMENT_ALIAS
2006 4 11 0 0 R 263
2006 4 11 0 0 R 264
2006 4 11 0 0 R 265
2006 4 11 0 0 R 266
2006 4 11 1 0 R 263
2006 4 11 1 0 R 264
2006 4 11 1 0 R 265
2006 4 11 1 0 R 266
I was wondering if anyone could help me with this?
This is the code:
import datetime
import pandas as pd

raw_data = pd.read_csv('Squamish_263_264_265_266.csv')

# Reading rainfall and years
df = raw_data.iloc[:, [2, 3, 4, 5, 6, 9]]
#print(df)

dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['MEAS_TYPE_ELEMENT_ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)

for row, v in df.iterrows():
    df.loc[row, 'date'] = datetime.datetime(
        v['LOCAL_YEAR'], v['LOCAL_MONTH'], v['LOCAL_DAY'],
        v['LOCAL_HOUR'], v['MEAS_TYPE_ELEMENT_ALIAS_map'])
but it gives this error:
TypeError: integer argument expected, got float
and
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Use a map to translate the alias into minutes, then iterate to build your dates. Cast each component to int so datetime.datetime does not receive floats:
dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)
df.reset_index(inplace=True)

for row in df.head(50).itertuples():
    df.loc[row[0], 'date'] = datetime.datetime(
        int(row[1]), int(row[2]), int(row[3]), int(row[4]), int(row[-1]))
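An alternative that avoids the row loop entirely: pandas can assemble datetimes from component columns with pd.to_datetime, provided the columns are named year/month/day/hour/minute. A sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    'LOCAL_YEAR': [2006, 2006], 'LOCAL_MONTH': [4, 4],
    'LOCAL_DAY': [11, 11], 'LOCAL_HOUR': [0, 0],
    'MEAS_TYPE_ELEMENT_ALIAS': [263, 264],
})

dmap = {263: 0, 264: 16, 265: 31, 266: 46}

# Build a frame of datetime components with the names to_datetime expects
parts = df[['LOCAL_YEAR', 'LOCAL_MONTH', 'LOCAL_DAY', 'LOCAL_HOUR']].copy()
parts.columns = ['year', 'month', 'day', 'hour']
parts['minute'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)

df['date'] = pd.to_datetime(parts)
print(df['date'])
```

This also sidesteps the SettingWithCopyWarning, since the date column is assigned in one vectorized step.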

Create a pandas dataframe from dictionary whilst maintaining order of columns

When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact? (i.e. Day, Visitors, Bounce Rate)
One approach is to pass the columns parameter explicitly.
Ex:
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict

web_stats = OrderedDict([('Day', [1, 2, 3, 4, 5, 6]),
                         ('Visitors', [43, 34, 65, 56, 29, 76]),
                         ('Bounce Rate', [65, 67, 78, 65, 45, 52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes inconvenient when you have many keys, you may use:
df = pd.DataFrame(web_stats, columns = web_stats.keys())
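On Python 3.7+ plain dicts preserve insertion order, and modern pandas keeps that order when building the frame, so the workarounds above are only needed on older versions. A quick check:

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3],
             'Visitors': [43, 34, 65],
             'Bounce Rate': [65, 67, 78]}
df = pd.DataFrame(web_stats)
print(list(df.columns))
```

The columns come out in the order the keys were inserted: Day, Visitors, Bounce Rate.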

Looping through pandas DataFrame and having the output switch from a DataFrame to a Series between loops causes an error

I have vehicle information that I want to evaluate over several different time periods, and I'm modifying different columns in the DataFrame as I move through the information. I'm working with the current and previous time periods, so I need to concat the two and work on them together.
The problem I'm having is that when I use the 'time' column as an index in pandas and loop through the data, the object that is returned is either a DataFrame or a Series depending on the number of vehicles (or rows) in the time period. This change in object type causes an error because I'm trying to use DataFrame methods on Series objects.
I created a small sample program that shows what I'm trying to do and the error that I'm receiving. Note this is a sample, not the real code. I have tried simply querying the data by time period instead of using an index, and that works, but it is too slow for what I need to do.
import pandas as pd

df = pd.DataFrame({
    'id': range(44, 51),
    'time': [99, 99, 97, 97, 96, 96, 100],
    'spd': [13, 22, 32, 41, 42, 53, 34],
})
df = df.set_index(['time'], drop=False)

st = True
for ind in df.index.unique():
    data = df.loc[ind]
    print(data)
    if st:
        old_data = data
        st = False
    else:
        c = pd.concat([data, old_data])
        # do some work here
OUTPUT IS:
id spd time
time
99 44 13 99
99 45 22 99
id spd time
time
97 46 32 97
97 47 41 97
id spd time
time
96 48 42 96
96 49 53 96
id 50
spd 34
time 100
Name: 100, dtype: int64
Traceback (most recent call last):
File "C:/Users/m28050/Documents/Projects/fhwa/tca/v_2/code/pandas_ind.py", line 24, in <module>
c = pd.concat([data, old_data])
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 873, in concat
return op.get_result()
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 946, in get_result
new_data = com._concat_compat([x.values for x in self.objs])
File "C:\Python27\lib\site-packages\pandas\core\common.py", line 1737, in _concat_compat
return np.concatenate(to_concat, axis=axis)
ValueError: all the input arrays must have same number of dimensions
If anyone has the correct way to loop through the DataFrame and update the columns or can point out a different method to use, that would be great.
Thanks for your help.
Jim
I think groupby could help here:
In [11]: spd_lt_40 = df[df.spd < 40]
In [12]: spd_lt_40_count = spd_lt_40.groupby('time')['id'].count()
In [13]: spd_lt_40_count
Out[13]:
time
97     1
99     2
100    1
dtype: int64
and then set this to a column in the original DataFrame:
In [14]: df['spd_lt_40_count'] = spd_lt_40_count
In [15]: df['spd_lt_40_count'].fillna(0, inplace=True)
In [16]: df
Out[16]:
id spd time spd_lt_40_count
time
99 44 13 99 2
99 45 22 99 2
97 46 32 97 1
97 47 41 97 1
96 48 42 96 0
96 49 53 96 0
100 50 34 100 1
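Also worth noting: if you keep the loop, selecting with a list of labels (df.loc[[ind]] rather than df.loc[ind]) always returns a DataFrame, even when only one row matches, so the concat never mixes dimensions. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(44, 51),
    'time': [99, 99, 97, 97, 96, 96, 100],
    'spd': [13, 22, 32, 41, 42, 53, 34],
}).set_index('time', drop=False)

single = df.loc[100]    # one matching row -> Series
framed = df.loc[[100]]  # list of labels -> always a DataFrame
print(type(single).__name__, type(framed).__name__)
```

Using the list form inside the loop keeps every iteration working with the same object type.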
