Resampling or Reindexing Two Columns of Data with different frequencies - python

I have a dataframe which contains a Time Stamp column, and two data columns (data1 and data2).
The data1 column spans the entire Time Stamp range, while the data2 column stops about halfway. When I was collecting my data, data1 and data2 were recorded over the same period of time, just at different frequencies.
I would like the data2 column to be spread over the full Time Stamp range so that each value lines up with the time it was actually recorded. I understand that I should be leaning towards the resample or reindex functions, but I am unsure how to do this. My Time Stamp column is an object, while my two data columns are float64 types.
What is the easiest way for me to accomplish this goal?
I have tried to refer to the following question, but I was having trouble implementing it:
PANDAS - Loop over two datetime indexes with different sizes to compare days and values

Here's what I think you're trying to do. My assumption is that the timestamps of the two columns are related by a fixed multiple. I've used every 2 minutes in my example, since that's what your data appears to be. Here's my sample dataframe:
df
a b
DATE
2017-05-29 06:30:00 0.0 0.0
2017-05-29 06:31:00 9.0 24.0
2017-05-29 06:32:00 10.0 1.0
2017-05-29 06:33:00 10.0 1.0
2017-05-29 06:34:00 0.0 7.0
2017-05-29 06:35:00 3.0 3.0
2017-05-29 06:36:00 0.0 4.0
2017-05-29 06:37:00 0.0 1.0
2017-05-29 06:38:00 0.0 0.0
2017-05-29 06:39:00 0.0 2.0
2017-05-29 06:40:00 0.0 NaN
2017-05-29 06:41:00 0.0 NaN
2017-05-29 06:42:00 0.0 NaN
2017-05-29 06:43:00 0.0 NaN
2017-05-29 06:44:00 0.0 NaN
2017-05-29 06:45:00 2.0 NaN
2017-05-29 06:46:00 4.0 NaN
2017-05-29 06:47:00 0.0 NaN
2017-05-29 06:48:00 4.0 NaN
2017-05-29 06:49:00 8.0 NaN
Extract the misaligned column into its own dataframe and add a counter column, then add the timedelta to the index, replace the old index, and concatenate the data columns.
# pull the first 10 rows of 'b' (the misaligned column) into their own frame
b = df['b'][:10].to_frame()
b.insert(0, 'counter', range(len(b)))
# shift each timestamp forward by its row number in minutes (0, 1, 2, ...),
# spreading the rows out to every 2 minutes
b.index = b.index.to_series().apply(lambda x: x + pd.Timedelta(minutes=b.loc[x].counter))
pd.concat([df['a'], b['b']], axis=1)
a b
DATE
2017-05-29 06:30:00 0.0 0.0
2017-05-29 06:31:00 9.0 NaN
2017-05-29 06:32:00 10.0 24.0
2017-05-29 06:33:00 10.0 NaN
2017-05-29 06:34:00 0.0 1.0
2017-05-29 06:35:00 3.0 NaN
2017-05-29 06:36:00 0.0 1.0
2017-05-29 06:37:00 0.0 NaN
2017-05-29 06:38:00 0.0 7.0
2017-05-29 06:39:00 0.0 NaN
2017-05-29 06:40:00 0.0 3.0
2017-05-29 06:41:00 0.0 NaN
2017-05-29 06:42:00 0.0 4.0
2017-05-29 06:43:00 0.0 NaN
2017-05-29 06:44:00 0.0 1.0
2017-05-29 06:45:00 2.0 NaN
2017-05-29 06:46:00 4.0 0.0
2017-05-29 06:47:00 0.0 NaN
2017-05-29 06:48:00 4.0 2.0
2017-05-29 06:49:00 8.0 NaN
It probably goes without saying, but it would be much better to apply correct timestamps to each of the columns when you ingest them.
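For instance, if you know the slower sensor logged one reading every 2 minutes, you can stamp it at that frequency when you ingest it. A minimal sketch; the start time, the 2-minute frequency, and the variable names are assumptions:
import pandas as pd

# raw readings from the slower sensor (hypothetical values)
data2_values = [0.0, 24.0, 1.0, 1.0, 7.0, 3.0, 4.0, 1.0, 0.0, 2.0]

# stamp each reading at its true 2-minute interval
idx = pd.date_range('2017-05-29 06:30:00', periods=len(data2_values), freq='2min')
data2 = pd.Series(data2_values, index=idx, name='b')

# joining on the index then lines the columns up with no fixing afterwards,
# e.g. df['a'].to_frame().join(data2)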

Related

How do I plot my cohorts by week and month without changing the underlying pivot table data?

I'm doing a cohort analysis. The cohorts are tied to the day that the user was created in the database.
Is there a way to plot this such that I can group cohorts together by week and month? I'd prefer not to change the pivot table data if possible.
periods_weeks 0.0 1.0 2.0 3.0 4.0 5.0 6.0
cohort
2021-03-01 2192.0 NaN 10.0 NaN NaN NaN NaN
2021-06-02 270.0 141.0 46.0 71.0 96.0 63.0 57.0
2021-06-03 338.0 97.0 62.0 30.0 10.0 21.0 15.0
2021-06-04 17801.0 7611.0 6342.0 6062.0 5064.0 4731.0 4518.0
2021-06-05 4105.0 1123.0 982.0 859.0 597.0 486.0 448.0
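A possible approach, sketched here under the assumption that the cohort index can be parsed with pd.to_datetime, is to aggregate a copy of the pivot table with pd.Grouper; the weekly and monthly frames are new objects, so the underlying pivot table is left untouched:
import pandas as pd

# `pivot` stands for the cohort pivot table shown above
grouped = pivot.copy()
grouped.index = pd.to_datetime(grouped.index)

# new weekly and monthly views; `pivot` itself is not modified
weekly = grouped.groupby(pd.Grouper(freq='W')).sum()
monthly = grouped.groupby(pd.Grouper(freq='M')).sum()

weekly.plot(kind='bar')  # requires matplotlib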

Subset pandas dataframe based on first non zero occurrence

Here is the sample dataframe:-
Trade_signal
2007-07-31 0.0
2007-08-31 0.0
2007-09-28 0.0
2007-10-31 0.0
2007-11-30 0.0
2007-12-31 0.0
2008-01-31 0.0
2008-02-29 0.0
2008-03-31 0.0
2008-04-30 0.0
2008-05-30 0.0
2008-06-30 0.0
2008-07-31 -1.0
2008-08-29 0.0
2008-09-30 -1.0
2008-10-31 -1.0
2008-11-28 -1.0
2008-12-31 0.0
2009-01-30 -1.0
2009-02-27 -1.0
2009-03-31 0.0
2009-04-30 0.0
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
1 represents buy and -1 represents sell. I want to subset the dataframe so that the new dataframe starts at the first occurrence of 1. Expected output:
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
Please suggest the way forward. Apologies if this is a repeated question.
Simply do the following, where "Trade_signal" is the column containing the buy/sell data. Because the index holds date labels rather than positions, slice with .loc from the first index label where the value is 1:
new_df = df.loc[df[df["Trade_signal"] == 1].index[0]:]
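An equivalent one-liner, sketched here assuming the dates form the index as in the sample above, uses idxmax, which returns the label of the first True value:
# slice from the first row where Trade_signal equals 1
new_df = df.loc[(df["Trade_signal"] == 1).idxmax():]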

How to find maximum values in DataFrame and return a resulting DataFrame

I have a DataFrame looking partly like this:
df_all_q
Out[43]:
Qtot Ptot Q_G1 Q_G2 P_G1 P_G2
0 0.0 0.000000 0.0 0.0 0.000000 0.000000
1 5.0 0.576190 0.0 5.0 0.000000 0.576190
2 5.0 0.581900 5.0 0.0 0.581900 0.000000
3 10.0 1.152380 0.0 10.0 0.000000 1.152380
4 10.0 1.163800 10.0 0.0 1.163800 0.000000
5 10.0 1.158090 5.0 5.0 0.581900 0.576190
6 15.0 1.805147 15.0 0.0 1.805147 0.000000
7 15.0 1.734280 5.0 10.0 0.581900 1.152380
8 15.0 1.739990 10.0 5.0 1.163800 0.576190
9 15.0 1.569220 0.0 15.0 0.000000 1.569220
10 20.0 2.381337 15.0 5.0 1.805147 0.576190
11 20.0 2.151120 5.0 15.0 0.581900 1.569220
12 20.0 2.466860 20.0 0.0 2.466860 0.000000
13 20.0 1.782640 0.0 20.0 0.000000 1.782640
14 20.0 2.316180 10.0 10.0 1.163800 1.152380
15 25.0 2.713030 0.0 25.0 0.000000 2.713030
16 25.0 2.364540 5.0 20.0 0.581900 1.782640
17 25.0 3.043050 20.0 5.0 2.466860 0.576190
18 25.0 3.111990 25.0 0.0 3.111990 0.000000
19 25.0 2.957527 15.0 10.0 1.805147 1.152380
20 25.0 2.733020 10.0 15.0 1.163800 1.569220
Now I need to create another DataFrame with the maximum value of Ptot for each Qtot, like this:
df_result
Out[45]:
Qtot Ptot Q_G1 Q_G2 P_G1 P_G2
0 0.0 0.000000 0.0 0.0 0.000000 0.0
2 5.0 0.581900 5.0 0.0 0.581900 0.0
4 10.0 1.163800 10.0 0.0 1.163800 0.0
6 15.0 1.805147 15.0 0.0 1.805147 0.0
12 20.0 2.466860 20.0 0.0 2.466860 0.0
18 25.0 3.111990 25.0 0.0 3.111990 0.0
I guess this should be possible quite easily, however I'm stuck.
You can try the .groupby method. It works similarly to GROUP BY in SQL and returns a dataframe.
After grouping, you define an aggregation for each column: max for Ptot (as you want) and, for example, mean for the others.
Another option is to return just the Ptot column and then merge the resulting dataframe with the original one.
df_all_q.groupby('Qtot').agg({'Ptot': 'max', 'Q_G1': 'mean',
                              'Q_G2': 'mean', 'P_G1': 'mean', 'P_G2': 'mean'}).reset_index()
So, by parts:
.groupby groups all rows with the same Qtot value
.agg sets the aggregating function for each column
.reset_index makes Qtot an ordinary column instead of the index of the new dataframe
If, for instance, you want the P_G1 column in the result to be the max instead of the mean, substitute 'mean' with 'max' for that column.
Common aggregation functions include sum, max, min, mean, size, and first. The full list can be found in the docs.
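If you want the whole winning row exactly as it appears in df_all_q (matching the expected df_result above) rather than a mean of the other columns, a sketch using idxmax per group:
# label of the row with the largest Ptot within each Qtot group
winners = df_all_q.groupby('Qtot')['Ptot'].idxmax()
df_result = df_all_q.loc[winners]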

Python Pandas: Denormalize data from one data frame into another

I have a Pandas data frame which you might describe as "normalized". For display purposes, I want to "de-normalize" the data: that is, I want to take data that is spread across multiple key values and put it on the same row in the output records. Some records need to be summed as they are combined. (Aside: if anyone has a better term for this than "denormalization", please make an edit to this question, or say so in the comments.)
I am working with a pandas data frame with many columns, so I will show you a simplified version below.
The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block, and this code block is just to provide some context.) Similar to my actual data, there are some duplications in the identifying data, and some numbers to be summed:
import pandas as pd
import numpy as np
dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)
cols=['Date','Type','Count']
df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11)})
df_A2 = pd.DataFrame({'Date': datesA2,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=10)})
df_B1 = pd.DataFrame({'Date': datesB1,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=9)})
df_B2 = pd.DataFrame({'Date': datesB2,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=11)})
df_C1 = pd.DataFrame({'Date': datesC1,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=5)})
df_C2 = pd.DataFrame({'Date': datesC2,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=9)})
frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]
dat_fra_source = pd.concat(frames)
Further, the following code achieves my intention. The source data frame has multiple rows per date and type of fruit (A, B, and C). The destination data has a single row per day, with a sum of A, B, and C.
dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples', 'Berries', 'Canteloupes'])
for index, row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'], row['Type']] += row['Count']
My question is whether there is a cleaner way to do this: one that doesn't require the zero-initialization and/or one that operates on the entire data frame instead of row by row. I am also skeptical that my implementation is efficient. I'll also note that while I am only dealing with "Count" in this simplified example, I have additional columns in my real-world data: imagine that for A, B, and C there is not only a count but also a weight and a volume.
Option 1
dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)
Out[63]:
Count
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
Option 2
pd.pivot_table(dat_fra_source,index=['Date'],columns=['Type'],values='Count',aggfunc=sum).fillna(0)
Out[75]:
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
And assuming you have columns vol and weight
dat_fra_source['vol']=2
dat_fra_source['weight']=2
dat_fra_source.groupby(['Date','Type']).apply(lambda x: sum(x['vol']*x['weight']*x['Count'])).unstack().fillna(0)
Out[88]:
Type Apples Berries Canteloupes
Date
2017-07-01 52.0 0.0 96.0
2017-07-02 72.0 0.0 64.0
2017-07-03 44.0 0.0 116.0
2017-07-04 52.0 0.0 28.0
2017-07-05 96.0 44.0 92.0
2017-07-06 24.0 16.0 0.0
2017-07-07 116.0 104.0 0.0
2017-07-08 124.0 76.0 0.0
2017-07-09 152.0 68.0 104.0
2017-07-10 228.0 216.0 4.0
2017-07-11 16.0 164.0 40.0
2017-07-12 64.0 112.0 92.0
2017-07-13 100.0 80.0 80.0
2017-07-14 76.0 24.0 60.0
2017-07-15 24.0 88.0 28.0
2017-07-16 64.0 0.0 20.0
2017-07-17 116.0 28.0 16.0
2017-07-18 0.0 84.0 0.0
2017-07-19 0.0 76.0 0.0
2017-07-20 0.0 32.0 0.0
Use pd.crosstab:
pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)
Output:
Type Apples Berries Canteloupes
Date
2017-07-01 19.0 0.0 4.0
2017-07-02 25.0 0.0 4.0
2017-07-03 11.0 0.0 26.0
2017-07-04 27.0 0.0 8.0
2017-07-05 8.0 18.0 12.0
2017-07-06 10.0 11.0 0.0
2017-07-07 6.0 17.0 0.0
2017-07-08 10.0 5.0 0.0
2017-07-09 51.0 25.0 16.0
2017-07-10 31.0 23.0 21.0
2017-07-11 35.0 40.0 10.0
2017-07-12 16.0 30.0 9.0
2017-07-13 13.0 23.0 20.0
2017-07-14 21.0 26.0 27.0
2017-07-15 20.0 17.0 19.0
2017-07-16 12.0 4.0 2.0
2017-07-17 27.0 0.0 5.0
2017-07-18 0.0 5.0 0.0
2017-07-19 0.0 26.0 0.0
2017-07-20 0.0 6.0 0.0

Copy nan values from one dataframe to another

I have two dataframes df1 and df2 of the same size and dimensions. Is there a simple way to copy all the NaN values in 'df1' to 'df2' ? The example below demonstrates the output I want from .copynans()
In: df1
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 100.0 0.353 0.300 0.326
2012-07-01 00:30:00 101.0 0.522 0.258 0.304
2012-07-01 01:00:00 102.0 0.311 0.369 0.228
2012-07-01 01:30:00 103.0 NaN 0.478 0.247
2012-07-01 02:00:00 101.0 NaN NaN 0.259
2012-07-01 02:30:00 102.0 0.281 NaN 0.239
2012-07-01 03:00:00 125.0 0.320 NaN 0.217
2012-07-01 03:30:00 136.0 0.288 NaN 0.283
In: df2
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 1.0 2.0 3.0 4.0
2012-07-01 00:30:00 1.0 2.0 3.0 4.0
2012-07-01 01:00:00 1.0 2.0 3.0 4.0
2012-07-01 01:30:00 1.0 2.0 3.0 4.0
2012-07-01 02:00:00 1.0 2.0 3.0 4.0
2012-07-01 02:30:00 1.0 2.0 3.0 4.0
2012-07-01 03:00:00 1.0 2.0 3.0 4.0
2012-07-01 03:30:00 1.0 2.0 3.0 4.0
In: df2.copynans(df1)
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 1.0 2.0 3.0 4.0
2012-07-01 00:30:00 1.0 2.0 3.0 4.0
2012-07-01 01:00:00 1.0 2.0 3.0 4.0
2012-07-01 01:30:00 1.0 NaN 3.0 4.0
2012-07-01 02:00:00 1.0 NaN NaN 4.0
2012-07-01 02:30:00 1.0 2.0 NaN 4.0
2012-07-01 03:00:00 1.0 2.0 NaN 4.0
2012-07-01 03:30:00 1.0 2.0 NaN 4.0
Either
df2.where(df1.notnull())
Or
df2.mask(df1.isnull())
# Use the null cells of df1 as a boolean mask to set the corresponding cells in df2 to NaN
import numpy as np
df2[df1.isnull()] = np.nan
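For completeness, a minimal self-contained sketch of the mask approach on toy data (hypothetical frames, just to show that it returns a new frame rather than modifying df2 in place):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})
df2 = pd.DataFrame({'x': [10.0, 20.0, 30.0], 'y': [40.0, 50.0, 60.0]})

# NaN wherever df1 is NaN, df2's values everywhere else
result = df2.mask(df1.isnull())
#       x     y
# 0  10.0   NaN
# 1   NaN  50.0
# 2  30.0  60.0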
