loop through multiple dataframes and create datetime index then join dataframes - python

I have 9 dataframes of different lengths but similar formats. Each dataframe has a year, month, and day column with dates that span from 1/1/2009-12/31/2019, but some dataframes are missing data for some days. I would like to build one large dataframe with a DatetimeIndex, but I am having trouble creating a loop to convert the year, month, and day columns into a datetime index for each dataframe, and I don't know which function to use to join the dataframes together. I have one dataframe called Temp that has all 4017 rows of data, one for every day of the 11-year period, but the rest of the dataframes are missing some dates.
import pandas as pd
# just creating some sample data to make it easier
Temp = pd.DataFrame({
    'year': [2009, 2009, 2009, 2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012,
             2013, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015],
    'month': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'day': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'T1': [20, 21, 25, 28, 30, 33, 39, 35, 34, 34, 31, 30, 27, 24, 20, 21, 25, 28, 30, 33, 39],
    'T2': [33, 39, 35, 34, 34, 31, 30, 27, 24, 20, 21, 25, 28, 30, 33, 39, 20, 21, 25, 28, 30]})
WS = pd.DataFrame({
    'year': [2009, 2009, 2010, 2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2013,
             2014, 2014, 2014, 2015, 2015, 2015],
    'month': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'day': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'WS1': [5.4, 5.1, 5.2, 4.3, 4.4, 4.4, 1.2, 1.5, 1.6, 2.3, 2.5, 3.1, 2.5, 4.6, 4.4, 4.4, 1.2, 1.5],
    'WS2': [5.4, 5.1, 4.4, 4.4, 1.2, 1.5, 1.6, 2.3, 2.5, 5.2, 4.3, 4.4, 4.4, 1.2, 1.5, 1.6, 2.3, 2.5]})
RH = pd.DataFrame({
    'year': [2009, 2009, 2010, 2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 2014],
    'month': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'day': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'RH1': [33, 38, 30, 45, 52, 60, 61, 66, 60, 59, 30, 45, 52, 60, 61],
    'RH2': [33, 38, 59, 30, 45, 52, 60, 61, 30, 45, 52, 60, 61, 66, 60]})
So far I have tried to create a loop that converts the year, month, and day columns into a DatetimeIndex and then drops the original year, month, and day columns.
df = [Temp, WS, RH]
for dfs in df:
    dfs['date'] = pd.to_datetime(dfs[['year', 'month', 'day']])
    dfs.set_index(['date'], inplace=True)
    dfs.drop(columns=['year', 'month', 'day'], inplace=True)
But I keep getting errors that say TypeError: tuple indices must be integers or slices, not list or TypeError: list indices must be integers or slices, not list. Since I can't get past this issue, I'm having trouble working out what to do afterwards to merge all the dataframes together. I assume I will have to build an index like idx = pd.date_range('2018-01-01 00:00:00', '2018-12-31 23:00:00', freq='H') and then reset the index for the dataframes that are missing data. Then, couldn't I use a left join or concatenate, since they would all share the same index? The sample dataframes above do not have the desired date range; I just didn't know how else to make small examples.
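Roughly, the reindex-and-join plan I had in mind looks like this (an untested sketch, assuming the datetime conversion above worked and each frame already has a DatetimeIndex):
# Build the full daily index, align every frame to it, then join on the shared index
idx = pd.date_range('2009-01-01', '2019-12-31', freq='D')
aligned = [d.reindex(idx) for d in [Temp, WS, RH]]
combined = aligned[0].join(aligned[1:], how='left')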

Is this what you are looking for?
dfs = [Temp, WS, RH]
data = []
for df in dfs:
    data.append(df.set_index(pd.to_datetime(df[["year", "month", "day"]]))
                  .drop(columns=["year", "month", "day"]))
out = pd.concat(data, axis="columns")
>>> out
T1 T2 WS1 WS2 RH1 RH2
2009-01-01 20 33 5.4 5.4 33.0 33.0
2009-02-02 21 39 5.1 5.1 38.0 38.0
2009-03-03 25 35 NaN NaN NaN NaN
2010-01-01 28 34 NaN NaN NaN NaN
2010-02-02 30 34 NaN NaN NaN NaN
2010-03-03 33 31 5.2 4.4 30.0 59.0
2011-01-01 39 30 4.3 4.4 45.0 30.0
2011-02-02 35 27 4.4 1.2 52.0 45.0
2011-03-03 34 24 4.4 1.5 60.0 52.0
2012-01-01 34 20 1.2 1.6 61.0 60.0
2012-02-02 31 21 1.5 2.3 66.0 61.0
2012-03-03 30 25 1.6 2.5 60.0 30.0
2013-01-01 27 28 2.3 5.2 59.0 45.0
2013-02-02 24 30 2.5 4.3 30.0 52.0
2013-03-03 20 33 3.1 4.4 45.0 60.0
2014-01-01 21 39 2.5 4.4 52.0 61.0
2014-02-02 25 20 4.6 1.2 60.0 66.0
2014-03-03 28 21 4.4 1.5 61.0 60.0
2015-01-01 30 25 4.4 1.6 NaN NaN
2015-02-02 33 28 1.2 2.3 NaN NaN
2015-03-03 39 30 1.5 2.5 NaN NaN
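If you also need every calendar day of the full 2009-2019 range to be present (with days missing from all three frames showing up as all-NaN rows), you could reindex the concatenated result, for example (a sketch using the out frame above):
full_idx = pd.date_range('2009-01-01', '2019-12-31', freq='D')
out = out.reindex(full_idx)  # days absent from every frame become all-NaN rows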

Related

Merging multiple dataframes with overlapping rows and different columns

I have multiple pandas data frames with some common columns and some overlapping rows. I would like to combine them in such a way that I have one final data frame with all of the columns and all of the unique rows (overlapping/duplicate rows dropped). The remaining gaps should be nans.
I have come up with the function below. In essence it goes through all columns one by one, appending all of the values from each data frame, dropping the duplicates (overlap), and building a new output data frame column by column.
import numpy as np
import pandas as pd

def combine_dfs(dataframes: list):
    ## Identifying all unique columns in all data frames
    columns = []
    for df in dataframes:
        columns.extend(df.columns)
    columns = np.unique(columns)
    ## Appending values from each data frame per column
    output_df = pd.DataFrame()
    for col in columns:
        column = pd.Series(dtype="object", name=col)
        for df in dataframes:
            if col in df.columns:
                column = column.append(df[col])
        ## Removing overlapping data (assuming consistent values)
        column = column[~column.index.duplicated()]
        ## Adding column to output data frame
        column = pd.DataFrame(column)
        output_df = pd.concat([output_df, column], axis=1)
    output_df.sort_index(inplace=True)
    return output_df
df_1 = pd.DataFrame([[10,20,30],[11,21,31],[12,22,32],[13,23,33]], columns=["A","B","C"])
df_2 = pd.DataFrame([[33,43,54],[34,44,54],[35,45,55],[36,46,56]], columns=["C","D","E"], index=[3,4,5,6])
df_3 = pd.DataFrame([[50,60],[51,61],[52,62],[53,63],[54,64]], columns=["E","F"])
print(combine_dfs([df_1,df_2,df_3]))
The output looks like this, as intended:
A B C D E F
0 10.0 20.0 30 NaN 50 60.0
1 11.0 21.0 31 NaN 51 61.0
2 12.0 22.0 32 NaN 52 62.0
3 13.0 23.0 33 43.0 54 63.0
4 NaN NaN 34 44.0 54 64.0
5 NaN NaN 35 45.0 55 NaN
6 NaN NaN 36 46.0 56 NaN
This method works well on small data sets. Is there a way to optimize this?
IIUC you can chain combine_first:
print (df_1.combine_first(df_2).combine_first(df_3))
A B C D E F
0 10.0 20.0 30 NaN 50.0 60.0
1 11.0 21.0 31 NaN 51.0 61.0
2 12.0 22.0 32 NaN 52.0 62.0
3 13.0 23.0 33 43.0 54.0 63.0
4 NaN NaN 34 44.0 54.0 64.0
5 NaN NaN 35 45.0 55.0 NaN
6 NaN NaN 36 46.0 56.0 NaN
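If the number of frames is not fixed, the same chaining can be written with functools.reduce over an arbitrary list (a small sketch, equivalent to the line above):
from functools import reduce
frames = [df_1, df_2, df_3]
print(reduce(lambda left, right: left.combine_first(right), frames))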

maximum sum of consecutive n-days using pandas

I've seen solutions in different languages (e.g. SQL, Fortran, or C++) which mainly use for loops.
I am hoping that someone can help me solve this task using pandas instead.
If I have a data frame that looks like this.
date        pcp    sum_count  sumcum
7/13/2013   0.1    3.0        48.7
7/14/2013   48.5
7/15/2013   0.1
7/16/2013
8/1/2013    1.5    1.0        1.5
8/2/2013
8/3/2013
8/4/2013    0.1    2.0        3.6
8/5/2013    3.5
9/22/2013   0.3    3.0        26.3
9/23/2013   14.0
9/24/2013   12.0
9/25/2013
9/26/2013
10/1/2014   0.1    11.0
10/2/2014   96.0              135.5
10/3/2014   2.5
10/4/2014   37.0
10/5/2014   9.5
10/6/2014   26.5
10/7/2014   0.5
10/8/2014   25.5
10/9/2014   2.0
10/10/2014  5.5
10/11/2014  5.5
And I was hoping I could do the following:
STEP 1 : create the sum_count column by determining total count of consecutive non-zeros in the 'pcp' column.
STEP 2 : create the sumcum column and calculate the sum of non-consecutive 'pcp'.
STEP 3 : create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I previously emphasized that the result should only use the maximum sum of 3 consecutive pcp values. But I mistakenly gave the wrong data frame at first, so I have edited it. Sorry.
The sumcum of 135.5 comes from 96.0 + 2.5 + 37.0; it is the maximum sum of 3 consecutive pcp values within the group whose sum_count is 11.
Thank you
Use:
#filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#mask of NaNs
m = df['pcp'].isna()
#label groups of consecutive non-NaNs
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#drop the NaN rows
df = df[~m].copy()
#keep only groups with at least N rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#rolling N-day sum per group
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#take the yearly maximum and add missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
First, convert date to a real datetime dtype and create a boolean mask which keeps the rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6

How to apply a function/impute on an interval in Pandas

I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  nan
1991-02-01  nan
1991-03-01  24
1991-04-01  nan
1991-05-01  nan
1991-06-01  nan
1991-07-01  nan
1991-08-01  34
1991-09-01  nan
1991-10-01  nan
1991-11-01  22
1991-12-01  nan
I want to linearly interpolate the values to fill the NaNs. However, it has to be applied within 6-month blocks (non-rolling). For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01; within it, we would interpolate forwards and backwards, and at the edges of the block the interpolation should descend towards a final value of 0. So for the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas however. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then remove the first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print (df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
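As a quick sanity check (a sketch, assuming the df above), you can print how pd.Grouper(freq='6MS', key='Date') bins the rows into the two half-year blocks:
for start, g in df.groupby(pd.Grouper(freq='6MS', key='Date')):
    print(start.date(), g['Date'].dt.strftime('%Y-%m-%d').tolist())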

Find a value in a column in function of another column

Assuming that the value exists, how can I, for example, create another column "testFinal" in the dataframe that holds the absolute value of df["test"] minus the value of df["test"] 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2 - 8) = 6.
My goal is to calculate "testFinal".
I don't know if that's clear, so here is an example.
NB: the Timestamp spacing is not homogeneous, so the interval between two values can change over time.
Thanks a lot
Here is the code for the dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89],
                   'testFinal': [6, 18, 3, 0, 0, 1, 49, 20, 35, np.NaN, np.NaN]})
First, create a new temporary column temp by converting the Timestamp column to timedelta with pd.to_timedelta, and set this temp column as the dataframe index. Then create a new column testFinal holding the values of this new index + 0.2 seconds. Using Series.map, map the testFinal column to the values of the df['test'] column, so that testFinal now holds the value of test 0.2 s later. Finally, subtract the test column from testFinal and take the absolute value to get the desired result:
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')
df = df.set_index('temp')
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()
df = df.reset_index(drop=True)
# print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
You could use numpy as follows. I created a new column test_final to compare with the expected testFinal column.
import numpy as np
test = df.test.values
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan]*2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
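Note that the shift by two rows assumes a constant 0.1 s step between samples. Since the question says the Timestamp spacing is not homogeneous, one option is pd.merge_asof, which looks up the sample nearest to each Timestamp + 0.2 s (a sketch, assuming the df above and a tolerance of 0.05 s that you may want to adjust):
shifted = df[['Timestamp']].assign(Timestamp=df['Timestamp'] + 0.2)
nearest = pd.merge_asof(shifted, df[['Timestamp', 'test']],
                        on='Timestamp', direction='nearest', tolerance=0.05)
df['test_final_asof'] = np.abs(nearest['test'].to_numpy() - df['test'].to_numpy())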

How to add conditions to columns at grouped by pivot table Pandas

I've used group by and pivot table from pandas package in order to create the following table:
Input:
q4 = q1[['category','Month']].groupby(['category','Month']).Month.agg({'Count':'count'}).reset_index()
q4 = pd.DataFrame(q4.pivot(index='category',columns='Month').reset_index())
then the output :
category Count
Month 6 7 8
0 adult-classes 29.0 109.0 162.0
1 air-pollution 27.0 43.0 13.0
2 babies-and-toddlers 4.0 51.0 2.0
3 bicycle 210.0 96.0 23.0
4 building NaN 17.0 NaN
5 buildings-maintenance 23.0 12.0 NaN
6 catering 1351.0 4881.0 1040.0
7 childcare 9.0 NaN NaN
8 city-planning 105.0 81.0 23.0
9 city-services 2461.0 2130.0 1204.0
10 city-taxes 1.0 4.0 42.0
I'm trying to add a condition on the months. The problem I'm having is that after pivoting I can't access the columns.
How can I show only the rows where the count for month 6 < month 7 < month 8?
To flatten your MultiIndex columns, you can rename them by joining the level values:
q4.columns = [''.join([str(c) for c in col]).strip() for col in q4.columns.values]
To remove NaNs:
q4.fillna(0, inplace=True)
To select according to your constraint:
result = q4[(q4['Count6'] < q4['Count7']) & (q4['Count7'] < q4['Count8'])]
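Alternatively, you can skip the flattening and select with tuples on the MultiIndex columns directly (a sketch, assuming the Month level holds the integers 6, 7 and 8):
result = q4[(q4[('Count', 6)] < q4[('Count', 7)]) & (q4[('Count', 7)] < q4[('Count', 8)])]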
