Merge/Concat 2 dataframe with different holiday dates - python

I would like to do an "outer" merge/concat of two dataframes that follow different holiday calendars. The Date column is a string. Both dataframes exclude non-pricing days, e.g. public holidays and weekends.
Assuming Dataframe 1 follows US holiday:
df1_US_holiday
Date Price_A
5/6/2020 2
5/5/2020 3
5/4/2020 4
5/1/2020 5
4/30/2020 6
4/29/2020 1
4/28/2020 3
4/27/2020 1
Assuming Dataframe 2 follows China holiday (note: 1-5 May is China holiday):
df2_China_holiday
Date Price_B
5/6/2020 4
4/30/2020 3
4/29/2020 2
4/28/2020 2
4/27/2020 5
Expected merge/concat results:
Date Price_A Price_B
5/6/2020 2 4
5/5/2020 3 NaN
5/4/2020 4 NaN
5/1/2020 5 NaN
4/30/2020 6 3
4/29/2020 1 2
4/28/2020 3 2
4/27/2020 1 5
Ultimately, I would like to fill the NaN values with fillna(method='bfill'). Do I need a holiday library for this merge/concat?

Pandas provides various facilities for easily combining together Series or DataFrame with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
The merge, join and concatenate sections of the pandas documentation may be useful for what you want to achieve.
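A minimal sketch of the outer merge on the sample data from the question, assuming the dates are month/day/year strings. No holiday library appears to be needed here: each frame already carries only its own pricing days, so the outer join produces the NaN rows by itself. Because the rows are sorted newest first, bfill() takes the value from the next row down, which is the most recent earlier date.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["5/6/2020", "5/5/2020", "5/4/2020", "5/1/2020",
             "4/30/2020", "4/29/2020", "4/28/2020", "4/27/2020"],
    "Price_A": [2, 3, 4, 5, 6, 1, 3, 1],
})
df2 = pd.DataFrame({
    "Date": ["5/6/2020", "4/30/2020", "4/29/2020", "4/28/2020", "4/27/2020"],
    "Price_B": [4, 3, 2, 2, 5],
})

# Parse the string dates so that sorting is chronological, not alphabetical
# (assumption: month/day/year order).
for frame in (df1, df2):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m/%d/%Y")

# Outer merge keeps every date found in either frame.
merged = (
    df1.merge(df2, on="Date", how="outer")
       .sort_values("Date", ascending=False)
       .reset_index(drop=True)
)

# Fill the gaps from the next row down (the previous pricing day).
filled = merged.bfill()
print(filled)
```

DataFrame.bfill() is used here in place of fillna(method='bfill'); both express the same fill direction.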


Python: Comparing rows values in a time period conditional

This is a sample of a pandas dataframe that I'm working on.
ID DATE HOUR TYPE CODE CITY
0 222304678 27/09/22 15:19:00 50201 3 Manila
1 222304694 18/09/22 10:46:00 30202 2 Innsbruck
2 222081537 18/09/22 10:47:00 30202 1 Innsbruck
3 221848197 17/09/22 21:54:00 30202 2 Austin
4 221455590 13/09/22 4:50:00 30409 2 Panama
5 220540157 06/09/22 12:29:00 30603 3 Sydney
6 220367113 06/09/22 12:32:00 30202 2 Sydney
7 221380583 06/09/22 12:56:00 30204 4 Sydney
8 221381826 06/09/22 12:58:00 30202 1 Sydney
9 221365584 22/08/22 12:35:00 50202 1 Tokyo
When a row has Code = 1, I need to compare it against the rows that occurred in the 30 minutes before it, with the following conditions:
The same city
The same date
Codes other than 1
I then need to create another dataframe with the rows that met the conditions (or at least just highlight them).
I have tried with df.loc but I don't know how to express the time range.
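One possible way to express the time range, sketched on the sample above: pair every non-1 row with the Code = 1 rows of the same city and date, then keep the pairs whose time difference falls inside the window. The helper column names TS and TS_ONE are made up for this illustration, and "30 minutes before" is assumed to mean strictly earlier and at most 30 minutes earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [222304678, 222304694, 222081537, 221848197, 221455590,
           220540157, 220367113, 221380583, 221381826, 221365584],
    "DATE": ["27/09/22", "18/09/22", "18/09/22", "17/09/22", "13/09/22",
             "06/09/22", "06/09/22", "06/09/22", "06/09/22", "22/08/22"],
    "HOUR": ["15:19:00", "10:46:00", "10:47:00", "21:54:00", "4:50:00",
             "12:29:00", "12:32:00", "12:56:00", "12:58:00", "12:35:00"],
    "TYPE": [50201, 30202, 30202, 30202, 30409,
             30603, 30202, 30204, 30202, 50202],
    "CODE": [3, 2, 1, 2, 2, 3, 2, 4, 1, 1],
    "CITY": ["Manila", "Innsbruck", "Innsbruck", "Austin", "Panama",
             "Sydney", "Sydney", "Sydney", "Sydney", "Tokyo"],
})

# Combine DATE and HOUR into one timestamp (assumption: day/month/year).
df["TS"] = pd.to_datetime(df["DATE"] + " " + df["HOUR"],
                          format="%d/%m/%y %H:%M:%S")

# Reference rows (CODE == 1) and candidate rows (everything else).
ones = (df.loc[df["CODE"] == 1, ["CITY", "DATE", "TS"]]
          .rename(columns={"TS": "TS_ONE"}))
others = df.loc[df["CODE"] != 1].reset_index()  # original labels go to "index"

# Pair candidates with reference rows of the same city and date.
pairs = others.merge(ones, on=["CITY", "DATE"])
delta = pairs["TS_ONE"] - pairs["TS"]
in_window = (delta > pd.Timedelta(0)) & (delta <= pd.Timedelta(minutes=30))

# Rows of the original frame that met all conditions.
matched = df.loc[pairs.loc[in_window, "index"].unique()].sort_index()
print(matched)
```

On this sample the result holds the 10:46 Innsbruck row and the three Sydney rows before 12:58.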

pandas groupby by customized year, e.g. a school year

In a pandas data frame I would like to find the mean values of a column, grouped by a 'customized' year.
An example would be to compute the mean values of school marks for a school year (e.g. Sep/YYYY to Aug/YYYY+1).
The pandas docs give some information on offsets, business years etc., but I can't make enough sense of it to get a working example.
Here is a minimal example where mean values of school marks are computed per year (Jan-Dec), which is what I do not want.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(low=1, high=5, size=36),
                  index=pd.date_range('2001-09-01', freq='M', periods=36),
                  columns=['marks'])
df_yearly = df.groupby(pd.Grouper(freq="A")).mean()
This could yield e.g.:
print(df):
marks
2001-09-30 1
2001-10-31 4
2001-11-30 2
2001-12-31 1
2002-01-31 4
2002-02-28 1
2002-03-31 2
2002-04-30 1
2002-05-31 3
2002-06-30 3
2002-07-31 3
2002-08-31 3
2002-09-30 4
2002-10-31 1
...
2003-11-30 4
2003-12-31 2
2004-01-31 1
2004-02-29 2
2004-03-31 1
2004-04-30 3
2004-05-31 4
2004-06-30 2
2004-07-31 2
2004-08-31 4
print(df_yearly):
marks
2001-12-31 2.000000
2002-12-31 2.583333
2003-12-31 2.666667
2004-12-31 2.375000
My desired output would correspond to something like:
2001-09/2002-08 mean_value
2002-09/2003-08 mean_value
2003-09/2004-08 mean_value
Many thanks!
We can manually compute the school years:
# if month>=9 we move it to the next year
school_years = df.index.year + (df.index.month>8).astype(int)
Another option is to use fiscal year starting from September:
school_years = df.index.to_period('Q-AUG').qyear
And we can groupby:
df.groupby(school_years).mean()
Output:
marks
2002 2.333333
2003 2.500000
2004 2.500000
One more approach
a = (df.index.month == 9).cumsum()
val = df.groupby(a, sort=False)['marks'].mean().reset_index()
dates = df.index.to_series().groupby(a, sort=False).agg(['first', 'last']).reset_index()
dates.merge(val, on='index')
Output
index first last marks
0 1 2001-09-30 2002-08-31 2.750000
1 2 2002-09-30 2003-08-31 2.333333
2 3 2003-09-30 2004-08-31 2.083333
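To get labels in the "2001-09/2002-08" form asked for in the question, the manually computed school year can be turned into a string before grouping. This is a sketch under the same September-to-August assumption; the random data and the name school_year are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"marks": rng.integers(1, 5, size=36)},
    index=pd.date_range("2001-09-30", periods=36, freq=pd.offsets.MonthEnd()),
)

# Calendar year in which each school year ends (September onwards -> next year).
end_year = df.index.year + (df.index.month >= 9).astype(int)

# Build labels such as "2001-09/2002-08" and group on them.
labels = pd.Index([f"{y - 1}-09/{y}-08" for y in end_year], name="school_year")
school_means = df.groupby(labels)["marks"].mean()
print(school_means)
```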

Pandas Map creating NaNs

My intention is to replace labels. I found out about using a dictionary and map it to the dataframe. To that end, I first extracted the necessary fields and created a dictionary which I then fed to the map function.
My programme is as follows:
import pandas as pd
factor_name = 'Help in household'
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
fact_df = labels.loc[labels['Column'] == factor_name]
fact_dict = dict(zip(fact_df['Level'], fact_df['Rename']))
print(df.index.to_series().map(fact_dict))
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
State,AN,AN,Andaman & Nicobar
State,AP,AP,Andhra Pradesh
State,AR,AR,Arunachal Pradesh
State,BR,BR,Bihar
State,CG,CG,Chattisgarh
State,CH,CH,Chandigarh
State,DD,DD,Daman & Diu
State,DL,DL,Delhi
State,DN,DN,Dadra & Nagar Haveli
State,GA,GA,Goa
State,GJ,GJ,Gujarat
State,HP,HP,Himachal Pradesh
State,HR,HR,Haryana
State,JH,JH,Jharkhand
State,JK,JK,Jammu & Kashmir
State,KA,KA,Karnataka
State,KL,KL,Kerala
State,MG,MG,Meghalaya
State,MH,MH,Maharashtra
State,MN,MN,Manipur
State,MP,MP,Madhya Pradesh
State,MZ,MZ,Mizoram
State,NG,NG,Nagaland
State,OR,OR,Orissa
State,PB,PB,Punjab
State,PY,PY,Pondicherry
State,RJ,RJ,Rajasthan
State,SK,SK,Sikkim
State,TN,TN,Tamil Nadu
State,TR,TR,Tripura
State,UK,UK,Uttarakhand
State,UP,UP,Uttar Pradesh
State,WB,WB,West Bengal
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
Intended result is as follows:
4 Every day
1 Never
2 Once a month
3 Once a week
The mapping fails. The result always causes NaNs to appear which I do not want. Can anyone tell me why?
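The Level column in labels.csv mixes numbers with state codes, so pandas reads every level as a string, while the 'Help in household' column in dat.csv is read as integers; integer values never match string keys, hence the NaNs. Note also that df.index.to_series() maps the row index (0, 1, 2, ...) rather than the column. Cast the column to str before mapping: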
In [140]: df['Help in household'] \
              .astype(str) \
              .map(labels.loc[labels['Column']=='Help in household',['Level','Rename']]
                         .set_index('Level')['Rename'])
Out[140]:
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
Name: Help in household, dtype: object
You may also consider using merge:
In [147]: df.assign(Level=df['Help in household'].astype(str)) \
            .merge(labels.loc[labels['Column']=='Help in household',['Level','Rename']],
                   on='Level')
Out[147]:
Id Help in household Maths Reading Science Social Level Rename
0 11011001001 4 20.37 NaN 27.78 NaN 4 Every day
1 11011001003 4 27.78 70.00 NaN NaN 4 Every day
2 11011001004 4 NaN 56.67 NaN 36.00 4 Every day
3 11011001006 4 NaN 23.33 NaN 30.00 4 Every day
4 11011001007 4 40.74 70.00 NaN NaN 4 Every day
5 11011001002 3 12.96 NaN 38.18 NaN 3 Once a week
6 11011001008 3 NaN 26.67 NaN 22.92 3 Once a week
7 11011001005 1 NaN NaN 14.55 8.33 1 Never
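The effect of the key types can be reproduced in isolation. The sketch below uses a shortened, in-memory copy of labels.csv and a hand-written Series standing in for the 'Help in household' column; both are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Shortened copy of labels.csv; the State row makes Level a string column.
labels = pd.read_csv(io.StringIO(
    "Column,Name,Level,Rename\n"
    "Help in household,Every day,4,Every day\n"
    "Help in household,Never,1,Never\n"
    "Help in household,Once a month,2,Once a month\n"
    "Help in household,Once a week,3,Once a week\n"
    "State,AN,AN,Andaman & Nicobar\n"
))
values = pd.Series([4, 3, 4, 1], name="Help in household")

fact_df = labels.loc[labels["Column"] == "Help in household"]
fact_dict = dict(zip(fact_df["Level"], fact_df["Rename"]))  # keys are "4", "1", ...

unmatched = values.map(fact_dict)            # int values vs. str keys -> NaN
matched = values.astype(str).map(fact_dict)  # str values vs. str keys

print(unmatched.isna().all())
print(matched.tolist())
```

Converting the dictionary keys to int instead of casting the column to str would be an equally valid choice.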

Applying function to Pandas Groupby

I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
1. Compute a rolling average for each time step in a given column, per group ID
2. Create a new column containing the difference between the original value and the moving average [x_t - (x_{t-1} + x_t)/2]
3. Store column in a new DataFrame, which would be identical to original data set, except that it has the residual from #2 instead of the original value.
4. Repeat and append new residuals to df_resid (as seen below)
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's a small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted by 'id' (sort it if it isn't), you don't have to do the rolling within a groupby. Since your window only has length 2, compute the rolling mean over the whole frame and mask the result by checking where id == id.shift(), which blanks the first row of each group, where the window would straddle two ids. This works because the frame is sorted.
d1 = df[['rev', 'exp']]
df.join(
    d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50
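For a window of length 2 the residual also simplifies algebraically: x_t - (x_{t-1} + x_t)/2 = (x_t - x_{t-1})/2, so a per-group difference divided by two gives the same numbers. The sketch below swaps in groupby(...).diff() for the rolling mean; it is a different technique from the masking approach above and assumes the rows are ordered by date within each id.

```python
import pandas as pd

dates = ["2005-09-01", "2005-12-01", "2006-03-01", "2006-06-01", "2006-09-01"]
df = pd.DataFrame({
    "date": pd.to_datetime(dates * 2),
    "id": [1] * 5 + [2] * 5,
    "rev": [745168.0, 725168.0, 20000.0, 500000.0, 665297.5,
            956884.0, 965297.0, 360149.0, 566507.0, 461207.0],
    "exp": [545168.0, 534168.0, 10051.0, 390150.0, 465598.5,
            736987.0, 785693.0, 121791.0, 356602.0, 212009.0],
})

cols = ["rev", "exp"]

# (x_t - x_{t-1}) / 2 within each id; the first row of every id becomes NaN.
resid = df.groupby("id")[cols].diff().div(2).add_suffix("_resid")
df_resid = df[["date", "id"]].join(resid)
print(df_resid)
```

Extending this to more columns only requires adding their names to cols.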

Grouping records with close DateTimes in Python pandas DataFrame

I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to index your data by its Time column.
You can build a list of datetimes at the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use the resulting positions for slicing (as suggested in the questions "python pandas dataframe slicing by date conditions" and "Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval").
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you can adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)}, index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define periods to define groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5*pto.Milli())
# find nearest indexes matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id',])
# sets the group ids (positional slices, end exclusive)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# update original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
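The fixed ticks above do not follow the rule in the question, where a window opens at the first transaction not covered by the previous window. A minimal sketch of that rule is given below; it is a plain loop rather than the searchsorted approach, the function name assign_window_ids is made up for this illustration, the calendar date is arbitrary, and the 2-second limit is assumed to be inclusive, as the sample table suggests.

```python
import pandas as pd

def assign_window_ids(times, window):
    """Number the windows: a new one opens at the first timestamp that lies
    more than `window` after the start of the current one."""
    ids = []
    current = 0
    start = None
    for t in times:
        if start is None or t - start > window:
            current += 1
            start = t
        ids.append(current)
    return pd.Series(ids, index=times.index)

clock = ["08:10:02", "08:10:03", "08:10:50", "08:10:55", "08:11:00",
         "08:11:01", "08:11:02", "08:11:03", "08:11:04", "08:15:00"]
df = pd.DataFrame({
    "Transaction ID": range(1, 11),
    "Time": pd.to_datetime(["2014-11-03 " + c for c in clock]),
}).sort_values("Time")

raw = assign_window_ids(df["Time"], pd.Timedelta(seconds=2))

# Keep only windows holding more than one transaction, then renumber 1, 2, 3...
sizes = raw.map(raw.value_counts())
df["Grouped ID"] = raw.where(sizes > 1).rank(method="dense").astype("Int64")
print(df)
```

For the millisecond case, passing pd.Timedelta(milliseconds=50) as the window should be the only change needed.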
