My pandas DataFrame looks very weird after running the code. The data doesn't come with a year/month variable, so I have to add them manually. Is there a way I could do that?
import json
import requests
import pandas as pd

sample = []
url1 = "https://api.census.gov/data/2018/cps/basic/jan?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url2 = "https://api.census.gov/data/2018/cps/basic/feb?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url3 = "https://api.census.gov/data/2018/cps/basic/mar?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
sample.append(requests.get(url1).text)
sample.append(requests.get(url2).text)
sample.append(requests.get(url3).text)
sample = [json.loads(i) for i in sample]
sample = pd.DataFrame(sample)
sample
Consider using read_json to read the Census API URL directly inside a user-defined function. Then iterate through all pairs of years and months with itertools.product to build the data frames and assign the corresponding year/month columns:
import pandas as pd
import calendar
import itertools
def get_census_data(year, month):
    # BUILD DYNAMIC URL
    url = (
        f"https://api.census.gov/data/{year}/cps/basic/{month.lower()}?"
        "get=PEFNTVTY,PEMNTVTY&for=state:01"
    )
    # CLEAN RAW DATA FOR APPROPRIATE ROWS AND COLS, ASSIGN YEAR/MONTH COLS
    raw_df = pd.read_json(url)
    cps_df = (
        raw_df.iloc[1:, :]
        .set_axis(raw_df.iloc[0, :], axis="columns")
        .assign(year=year, month=month)
    )
    return cps_df
# MONTH AND YEAR LISTS
months_years = itertools.product(
    range(2010, 2021),
    calendar.month_abbr[1:13]
)

# ITERATE PAIRWISE THROUGH LISTS
cps_list = [get_census_data(yr, mo) for yr, mo in months_years]

# COMPILE AND CLEAN FINAL DATA FRAME
cps_df = (
    pd.concat(cps_list, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis(None, axis="columns")
)
Output
cps_df
PEFNTVTY PEMNTVTY state year month
0 57 57 1 2010 Jan
1 303 303 1 2010 Jan
2 233 233 1 2010 Jan
3 57 233 1 2010 Jan
4 73 73 1 2010 Jan
... ... ... ... ...
6447 210 139 1 2020 Dec
6448 363 363 1 2020 Dec
6449 301 57 1 2020 Dec
6450 57 242 1 2020 Dec
6451 416 416 1 2020 Dec
[6452 rows x 5 columns]
The response to each API call is a JSON array of arrays, so wrapping the raw response texts in pd.DataFrame is the wrong constructor; use DataFrame.from_records on each parsed payload instead. Try this:
import requests
import pandas as pd

base_url = "https://api.census.gov/data/2018/cps/basic"
params = {
    "get": "PEFNTVTY,PEMNTVTY",
    "for": "state:01",
    "PEEDUCA": 39,
}
df = []
for month in ["jan", "feb", "mar"]:
    r = requests.get(f"{base_url}/{month}", params=params)
    r.raise_for_status()
    j = r.json()
    df.append(pd.DataFrame.from_records(j[1:], columns=j[0]).assign(month=month))
df = pd.concat(df)
Result:
PEFNTVTY PEMNTVTY PEEDUCA state month
0 57 57 39 1 jan
1 57 57 39 1 jan
2 57 57 39 1 jan
3 57 57 39 1 jan
4 57 57 39 1 jan
...
Related
I have the following long df:
df = pd.DataFrame({
    'stations': ["Toronto", "Toronto", "Toronto", "New York", "New York", "New York"],
    'forecast_date': ["Jul 30", "Jul 31", "Aug 1", "Jul 30", "Jul 31", "Aug 1"],
    'low': [58, 57, 59, 70, 72, 71],
    'high': [65, 66, 64, 88, 87, 86],
})
print(df)
I want to pivot the table to a wide df that looks like this:
Desired Output
so I used the following function:
df = df.pivot_table(index='stations', columns='forecast_date', values=['high', 'low'], aggfunc='first').reset_index()
print(df)
but with this, I get the following df:
Output Received (Undesired)
So basically pd.pivot_table seems to be sorting the columns alphabetically, whereas I want them sorted in chronological order.
Any help would be appreciated.
(Note that the dates are continuously changing, so other months will have a similar problem.)
You won't be able to prevent the sorting, but you can always enforce the original ordering by using .reindex with the unique values from the column!
table = df.pivot_table(index='stations', columns='forecast_date', values=['high', 'low'], aggfunc='first')
print(
table
)
high low
forecast_date Aug 1 Jul 30 Jul 31 Aug 1 Jul 30 Jul 31
stations
New York 86 88 87 71 70 72
Toronto 64 65 66 59 58 57
print(
table.reindex(columns=df['forecast_date'].unique(), level='forecast_date')
)
high low
forecast_date Jul 30 Jul 31 Aug 1 Jul 30 Jul 31 Aug 1
stations
New York 88 87 86 70 72 71
Toronto 65 66 64 58 57 59
Note that this is different from sorting in chronological order. To do that, you would have to cast the labels to datetime and sort on that.
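For instance, a sketch of that chronological sort, parsing the labels with an assumed `%b %d` format (all dates land in the same placeholder year, so data spanning a year boundary would need a real year in the label):

```python
import pandas as pd

df = pd.DataFrame({
    'stations': ["Toronto", "Toronto", "Toronto", "New York", "New York", "New York"],
    'forecast_date': ["Jul 30", "Jul 31", "Aug 1", "Jul 30", "Jul 31", "Aug 1"],
    'low': [58, 57, 59, 70, 72, 71],
    'high': [65, 66, 64, 88, 87, 86],
})

table = df.pivot_table(index='stations', columns='forecast_date',
                       values=['high', 'low'], aggfunc='first')

# Sort the unique labels by their parsed datetime, then reindex that level
order = sorted(df['forecast_date'].unique(),
               key=lambda d: pd.to_datetime(d, format='%b %d'))
table = table.reindex(columns=order, level='forecast_date')
```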
I am trying to load this API URL into a pandas DataFrame. I am getting the values, but I still need to add the date as a column alongside the other values:
import pandas as pd
from pandas.io.json import json_normalize
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
df = pd.read_json("https://covidapi.info/api/v1/country/DOM")
df = pd.DataFrame(df['result'].values.tolist())
print (df)
Getting this output:
confirmed deaths recovered
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
72 1488 68 16
73 1488 68 16
74 1745 82 17
75 1828 86 33
76 1956 98 36
You need to pass the index from your dataframe as well as the data itself:
df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
The line above creates the same columns, but keeps the original date index from the API call.
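A minimal sketch of what that does, using a made-up stand-in for the API payload (the endpoint above may no longer respond, so the dates and counts here are invented):

```python
import pandas as pd

# Toy 'result' column shaped like the API response: date index -> stats dict
raw = pd.DataFrame({'result': pd.Series({
    '2020-03-01': {'confirmed': 0, 'deaths': 0, 'recovered': 0},
    '2020-03-02': {'confirmed': 5, 'deaths': 1, 'recovered': 0},
})})

# Passing index= keeps the dates; data= expands the dicts into columns
df = pd.DataFrame(index=raw.index, data=raw['result'].values.tolist())
print(df)
```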
I grouped my data by month. Now I need to know at which observation/index each group starts and ends.
What I have is the following output, where the second column represents the number of observations in each month:
date
01 145
02 2232
03 12785
04 16720
Name: date, dtype: int64
with this code:
leave.groupby([leave['date'].dt.strftime('%m')])['date'].count()
What I want, though, is an index range I could access later. Something like this (the format doesn't really matter, and I don't mind if it returns a list or a data frame):
date
01 0 - 145
02 146 - 2378
03 2378 - 15163
04 15164 - 31884
Try the following, using shift:
df['data'] = (
    df['data'].shift(1).add(1).fillna(0).astype(int).astype(str)
    + ' - '
    + df['data'].astype(str)
)
OUTPUT:
data
date
1 0 - 145
2 146 - 2232
3 2233 - 12785
4 12786 - 16720
5 16721 - 30386
6 30387 - 120157
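If the counts column holds per-month totals rather than cumulative positions, a cumulative-sum variant of the same shift idea gives 0-based start/end indices; the series below mirrors the question's numbers:

```python
import pandas as pd

counts = pd.Series([145, 2232, 12785, 16720],
                   index=['01', '02', '03', '04'], name='date')

ends = counts.cumsum() - 1                         # last 0-based row of each month
starts = ends.shift(1).fillna(-1).astype(int) + 1  # one past the previous end
ranges = starts.astype(str) + ' - ' + ends.astype(str)
print(ranges)
```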
I think you are asking for a data frame containing the indices of the first and last occurrences of each value.
How about something like this?
Example data (note -- it's better to include reproducible data in your question so I don't have to guess):
import pandas as pd
import numpy as np
np.random.seed(123)
n = 500
df = pd.DataFrame(
    {'date': pd.to_datetime(
        pd.DataFrame({'year': np.random.choice(range(2017, 2019), size=n),
                      'month': np.random.choice(range(1, 13), size=n),
                      'day': np.random.choice(range(1, 28), size=n)})
    )}
)
Approach:
pd.DataFrame(({'_month_': x, 'firstIndex': y[0], 'lastIndex': y[-1]}
              for x, y in df.index.groupby(df['date'].dt.month).items()))
Result:
_month_ firstIndex lastIndex
0 1 0 495
1 2 21 499
2 3 1 488
3 4 5 498
4 5 14 492
5 6 12 470
6 7 15 489
7 8 2 494
8 9 18 475
9 10 3 491
10 11 10 473
11 12 7 497
If you are only going to use it for indexing in a loop, you wouldn't have to wrap it in pd.DataFrame() -- you could just leave it as a generator.
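For comparison, a groupby-based sketch of the same first/last lookup (same seeded data as above), which skips the generator entirely:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
n = 500
df = pd.DataFrame({'date': pd.to_datetime(pd.DataFrame({
    'year': np.random.choice(range(2017, 2019), size=n),
    'month': np.random.choice(range(1, 13), size=n),
    'day': np.random.choice(range(1, 28), size=n),
}))})

# Group the positional index itself and take min/max per month
bounds = (df.index.to_series()
            .groupby(df['date'].dt.month)
            .agg(firstIndex='min', lastIndex='max'))
```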
I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. Is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot('new field', 'item', 'value').reset_index()
mypiv['year'] = mypiv['new field'].apply(lambda x: int(x) // 100)
mypiv['month'] = mypiv['new field'] % 100
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
Or use pivot_table:
>>> df.pivot_table(
...     values='value',
...     index=['year', 'month'],
...     columns='item',
...     aggfunc=np.sum)
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
value
item item 1 item 2
year month
2004 1 21 277
2 43 244
3 12 262
4 80 201
5 22 287
6 52 284
7 90 249
8 14 229
9 52 205
10 76 207
11 88 259
12 90 200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
The following worked for me (passing a list of columns to pivot's index requires pandas 1.1 or later):
mypiv = df.pivot(index=['year', 'month'], columns='item', values='value')
Thanks to gmoutso's comment, you can use this:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
If you want a simple flat column structure with columns of their intended type, simply add this:
(df
 .infer_objects()              # coerce to the intended column type
 .rename_axis(None, axis=1))  # flatten column headers
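A concrete run of `multiindex_pivot` on a tiny frame (the helper is repeated here so the snippet is self-contained; the sample values are invented):

```python
import pandas as pd

def multiindex_pivot(df, index=None, columns=None, values=None):
    # Same helper as above: pivot on a tuple column, then rebuild the MultiIndex
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    tuples_index = [tuple(i) for i in df[names].values]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    df.index = pd.MultiIndex.from_tuples(df.index, names=names)
    return df

df = pd.DataFrame({'year': [2004, 2004, 2004, 2004],
                   'month': [1, 1, 2, 2],
                   'item': ['item 1', 'item 2', 'item 1', 'item 2'],
                   'value': [33, 250, 44, 224]})

pivoted = df.pipe(multiindex_pivot, index=['year', 'month'],
                  columns='item', values='value')
print(pivoted)
```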
I have a Datset that looks like :
data="""cruiseid year station month day date lat lon depth_w taxon count
AA8704 1987 1 04 13 13-APR-87 35.85 -75.48 18 Centropages_typicus 75343
AA8704 1987 1 04 13 13-APR-87 35.85 -75.48 18 Gastropoda 0
AA8704 1987 1 04 13 13-APR-87 35.85 -75.48 18 Calanus_finmarchicus 2340
AA8704 1987 1 07 13 13-JUL-87 35.85 -75.48 18 Acartia_spp. 5616
AA8704 1987 1 07 13 13-JUL-87 35.85 -75.48 18 Metridia_lucens 468
AA8704 1987 1 08 13 13-AUG-87 35.85 -75.48 18 Evadne_spp. 0
AA8704 1987 1 08 13 13-AUG-87 35.85 -75.48 18 Salpa 0
AA8704 1987 1 08 13 13-AUG-87 35.85 -75.48 18 Oithona_spp. 468
"""
datafile = open('data.txt','w')
datafile.write(data)
datafile.close()
I read it into pandas with :
import datetime as dt
import pandas as pd

parse = lambda x: dt.datetime.strptime(x, '%d-%b-%y')
df = pd.read_csv('data.txt', index_col=0, header=0,
                 parse_dates={"Datetime": [1, 3, 4]},
                 skipinitialspace=True, sep=' ', skiprows=0)
How can I generate a subset from this dataframe with all the records in April where the taxon is 'Calanus_finmarchicus' or 'Gastropoda'?
I can query the dataframe where taxon is equal to 'Calanus_finmarchicus' or 'Gastropoda' using
df[(df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda')]
But I'm having trouble querying by time. Something similar in numpy would look like:
import numpy as np
data = np.genfromtxt(
    'data.txt',
    dtype=[('cruiseid', 'S6'), ('year', 'i4'), ('station', 'i4'),
           ('month', 'i4'), ('day', 'i4'), ('date', 'S9'),
           ('lat', 'f8'), ('lon', 'f8'), ('depth_w', 'i8'),
           ('taxon', 'S60'), ('count', 'i8')],
    skip_header=1)
selection = [np.where((data['taxon'] == 'Calanus_finmarchicus') |
                      (data['taxon'] == 'Gastropoda') &
                      ((data['month'] == 4) | (data['month'] == 3)))[0]]
data[selection]
Here's a link with a notebook to reproduce the example
You can refer to datetime's month attribute:
>>> df.index.month
array([4, 4, 4, 7, 7, 8, 8, 8], dtype=int32)
>>> df[((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda'))
... & (df.index.month == 4)]
cruiseid station date lat lon depth_w \
Datetime
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
taxon count Unnamed: 11
Datetime
1987-04-13 Gastropoda 0 NaN
1987-04-13 Calanus_finmarchicus 2340 NaN
As others said, you can use df.index.month to filter by month, but I also suggest to use pandas.Series.isin() to check your taxon condition:
>>> df[df.taxon.isin(['Calanus_finmarchicus', 'Gastropoda']) & (df.index.month == 4)]
cruiseid station date lat lon depth_w \
Datetime
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
taxon count Unnamed: 11
Datetime
1987-04-13 Gastropoda 0 NaN
1987-04-13 Calanus_finmarchicus 2340 NaN
Use the month attribute of your index:
df[(df.index.month == 4) & ((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda'))]
I didn't pay attention to the syntax (bracket order) and to the DataFrame index attributes. This line gives me what I was looking for:
results = df[((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda')) & (df.index.month == 4)]