Find a value in a column as a function of another column - python
Assuming that the value exists, how can I, for example, create another column "testFinal" in the dataframe that holds the absolute value of df["test"] minus the value of df["test"] 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2 - 8) = 6.
My goal is to calculate "testFinal".
I don't know if it's clear, so here is the example.
NB: the Timestamp is not homogeneous, so the interval between two values can differ over time.
Thanks a lot.
Here is the code for the dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89],
                   'testFinal': [6, 18, 3, 0, 0, 1, 49, 20, 35, np.nan, np.nan]})
First, create a temporary column temp by converting the Timestamp column to timedelta with pd.to_timedelta, and set it as the dataframe index. Next, create a new column testFinal whose values are this index plus 0.2 seconds. Then use Series.map to map testFinal onto the values of df['test'], so that testFinal now holds the value of test 0.2 s later. Finally, subtract the test column and take the absolute value to get the desired result:
# convert the float seconds to a timedelta and use it as the index
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')
df = df.set_index('temp')
# the timestamp 0.2 s after each row
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)
# look up the 'test' value at that later timestamp, subtract and take the absolute value
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()
df = df.reset_index(drop=True)
# print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
You could use numpy as follows. I created a new column test_final to compare with the expected testFinal column.
import numpy as np

test = df.test.values
# shift the array two positions (two 0.1 s steps = 0.2 s), padding the end with NaN
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan] * 2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
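Since the question notes that the Timestamp interval is not guaranteed to stay at 0.1 s, here is a minimal sketch (not from either answer above) of an alternative lookup based on pd.merge_asof, which pairs each row with the nearest available sample around Timestamp + 0.2 s instead of relying on a fixed positional shift; the 0.05 s tolerance is an assumption you would tune to your own sampling.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89]})

# target times: 0.2 s after each existing timestamp
target = df[['Timestamp']].copy()
target['Timestamp'] = target['Timestamp'] + 0.2

# nearest 'test' value to each target time, within an assumed 0.05 s tolerance
later = pd.merge_asof(target, df, on='Timestamp', direction='nearest', tolerance=0.05)

df['testFinal'] = (later['test'] - df['test']).abs()
print(df)

Rows whose target time has no sample within the tolerance (here the last two) come out as NaN, matching the expected output.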
Related
maximum sum of consecutive n-days using pandas
I've seen solutions in different languages (e.g. SQL, Fortran, or C++) which mainly use for loops. I am hoping that someone can help me solve this task using pandas instead. I have a data frame that looks like this:
date        pcp    sum_count  sumcum
7/13/2013   0.1    3.0        48.7
7/14/2013   48.5
7/15/2013   0.1
7/16/2013
8/1/2013    1.5    1.0        1.5
8/2/2013
8/3/2013
8/4/2013    0.1    2.0        3.6
8/5/2013    3.5
9/22/2013   0.3    3.0        26.3
9/23/2013   14.0
9/24/2013   12.0
9/25/2013
9/26/2013
10/1/2014   0.1    11.0
10/2/2014   96.0              135.5
10/3/2014   2.5
10/4/2014   37.0
10/5/2014   9.5
10/6/2014   26.5
10/7/2014   0.5
10/8/2014   25.5
10/9/2014   2.0
10/10/2014  5.5
10/11/2014  5.5
And I was hoping I could do the following:
STEP 1: create the sum_count column by determining the total count of consecutive non-zeros in the 'pcp' column.
STEP 2: create the sumcum column and calculate the sum of the consecutive 'pcp' values.
STEP 3: create a pivot table that will look like this:
year  max_sum_count
2013           48.7
2014          135.5
BUT!! the max_sum_count is based on the condition sum_count = 3. I'd appreciate any help, thank you!
UPDATED QUESTION: I had previously emphasized that sum_count should only return the maximum of 3 consecutive pcps, but I mistakenly gave the wrong data frame and had to edit it. Sorry. The sumcum of 135.5 comes from 96.0 + 2.5 + 37.0; it is the maximum over 3 consecutive pcps within the sum_count of 11. Thank you.
Use:
# filtering + rolling by days
N = 3

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# test NaNs
m = df['pcp'].isna()
# groups of consecutive non-NaNs
df['g'] = m.cumsum()[~m]
# extract years
df['year'] = df.index.year
# drop the NaN rows
df = df[~m].copy()

# keep only groups of size >= N
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()

# rolling sum over N days within each group
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))

# take the maximum per year, adding missing years if any
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)

   year  max_sum_count
0  2013           48.7
1  2014          135.5
First, convert date to a real datetime dtype and create a boolean mask which keeps the rows where pcp is not null. Then you can create the groups and compute your variables.
Input data:
>>> df
          date   pcp
0    7/13/2013   0.1
1    7/14/2013  48.5
2    7/15/2013   0.1
3    7/16/2013   NaN
4     8/1/2013   1.5
5     8/2/2013   NaN
6     8/3/2013   NaN
7     8/4/2013   0.1
8     8/5/2013   3.5
9    9/22/2013   0.3
10   9/23/2013  14.0
11   9/24/2013  12.0
12   9/25/2013   NaN
13   9/26/2013   NaN
14   10/1/2014   0.1
15   10/2/2014  96.0
16   10/3/2014   2.5
17   10/4/2014  37.0
18   10/5/2014   9.5
19   10/6/2014  26.5
20   10/7/2014   0.5
21   10/8/2014  25.5
22   10/9/2014   2.0
23  10/10/2014   5.5
24  10/11/2014   5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()

# new group whenever a non-null date does not follow its predecessor by exactly one day
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

# aggregate each run of consecutive days and write the result onto its first row
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
         date   pcp  sum_count  sumcum
0  2013-07-13   0.1        3.0    48.7
1  2013-07-14  48.5        NaN     NaN
2  2013-07-15   0.1        NaN     NaN
3  2013-07-16   NaN        NaN     NaN
4  2013-08-01   1.5        1.0     1.5
5  2013-08-02   NaN        NaN     NaN
6  2013-08-03   NaN        NaN     NaN
7  2013-08-04   0.1        2.0     3.6
8  2013-08-05   3.5        NaN     NaN
9  2013-09-22   0.3        3.0    26.3
10 2013-09-23  14.0        NaN     NaN
11 2013-09-24  12.0        NaN     NaN
12 2013-09-25   NaN        NaN     NaN
13 2013-09-26   NaN        NaN     NaN
14 2014-10-01   0.1       11.0   210.6
15 2014-10-02  96.0        NaN     NaN
16 2014-10-03   2.5        NaN     NaN
17 2014-10-04  37.0        NaN     NaN
18 2014-10-05   9.5        NaN     NaN
19 2014-10-06  26.5        NaN     NaN
20 2014-10-07   0.5        NaN     NaN
21 2014-10-08  25.5        NaN     NaN
22 2014-10-09   2.0        NaN     NaN
23 2014-10-10   5.5        NaN     NaN
24 2014-10-11   5.5        NaN     NaN
>>> pivot
   date  max_sum_count
0  2013           48.7
1  2014          210.6
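Just to spell out the grouping trick used above, here is a tiny sketch on three made-up dates: the comparison is True whenever a date does not follow its predecessor by exactly one day, so the cumulative sum assigns a new group id at every break in consecutive days.

import pandas as pd

dates = pd.Series(pd.to_datetime(['2013-07-13', '2013-07-14', '2013-08-01']))
# True at the start of each run of consecutive days
breaks = dates.ne(dates.shift().add(pd.Timedelta(days=1)))
grp = breaks.cumsum()
print(grp.tolist())   # [1, 1, 2]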
Sorting values by two columns in Pandas Python
The idea is to sort values by two columns, such that, given the two columns, I am expecting output something like this:
Expected output
      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
3   NaN   5.0
4  10.0   NaN
5  24.0  24.7
6  31.0  31.4
However, the code below
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'x': [2, 3, 4, 24, 31, '', 10],
                    'y': ['', '', 4.1, 24.7, 31.4, 5, '']})
df1.replace(r'^\s*$', np.nan, regex=True, inplace=True)
rslt_df = df1.sort_values(by=['x', 'y'], ascending=(True, True))
print(rslt_df)
produces the following:
      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
6  10.0   NaN
3  24.0  24.7
4  31.0  31.4
5   NaN   5.0
Notice that in the last row, the 5.0 in column y is placed at the bottom. What modification to the code is needed in order to obtain the intended output?
Try sorting by x fillna y, then reindex from those sorted values:
df1.reindex(df1['x'].fillna(df1['y']).sort_values().index).reset_index(drop=True)
To update the df1 variable:
df1 = (
    df1.reindex(df1['x'].fillna(df1['y']).sort_values().index)
       .reset_index(drop=True)
)
df1:
      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
3   NaN   5.0
4  10.0   NaN
5  24.0  24.7
6  31.0  31.4
With np.sort and argsort:
df1.iloc[np.sort(df1[['x','y']], axis=1)[:, 0].argsort()]

      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
5   NaN   5.0
6  10.0   NaN
3  24.0  24.7
4  31.0  31.4
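Note that the np.sort/argsort answer keeps the original index labels (0, 1, 2, 5, 6, 3, 4 above); if you want the clean 0..6 index shown in the expected output, a reset_index can be chained on. A minimal sketch, assuming the df1 from the question:

rslt = df1.iloc[np.sort(df1[['x', 'y']], axis=1)[:, 0].argsort()].reset_index(drop=True)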
loop through multiple dataframes and create datetime index then join dataframes
I have 9 dataframes of different lengths but similar formats. Each dataframe has a year, month, and day column with dates that span from 1/1/2009-12/31/2019, but some dataframes are missing data for some days. I would like to build one large dataframe with a DateTime index, but I am having trouble creating a loop to convert the year, month, and day columns to a datetime index for each dataframe, and I don't know which function to use to join the dataframes together. I have one dataframe called Temp that has all 4017 lines of data for every day of the 11-year period, but the rest of the dataframes are missing some dates.
import pandas as pd

# just creating some sample data to make it easier
Temp = pd.DataFrame({'year':[2009,2009,2009,2010,2010,2010,2011,2011,2011,2012,2012,2012,2013,2013,2013,2014,2014,2014,2015,2015,2015],
                     'month':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                     'day':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                     'T1':[20,21,25,28,30,33,39,35,34,34,31,30,27,24,20,21,25,28,30,33,39],
                     'T2':[33,39,35,34,34,31,30,27,24,20,21,25,28,30,33,39,20,21,25,28,30]})

WS = pd.DataFrame({'year':[2009,2009,2010,2011,2011,2011,2012,2012,2012,2013,2013,2013,2014,2014,2014,2015,2015,2015],
                   'month':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'day':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'WS1':[5.4,5.1,5.2,4.3,4.4,4.4,1.2,1.5,1.6,2.3,2.5,3.1,2.5,4.6,4.4,4.4,1.2,1.5],
                   'WS2':[5.4,5.1,4.4,4.4,1.2,1.5,1.6,2.3,2.5,5.2,4.3,4.4,4.4,1.2,1.5,1.6,2.3,2.5]})

RH = pd.DataFrame({'year':[2009,2009,2010,2011,2011,2011,2012,2012,2012,2013,2013,2013,2014,2014,2014],
                   'month':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'day':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'RH1':[33,38,30,45,52,60,61,66,60,59,30,45,52,60,61],
                   'RH2':[33,38,59,30,45,52,60,61,30,45,52,60,61,66,60]})
So far, what I have tried was to first create a loop that would convert the year, month, and day columns into a DateTime index and drop the remaining year, month, and day columns:
df = [Temp, WS, RH]
for dfs in df:
    dfs['date'] = pd.to_datetime(dfs[['year','month','day']])
    dfs.set_index(['date'], inplace=True)
    dfs.drop(columns=['year','month','day'], inplace=True)
But I keep getting errors that say TypeError: tuple indices must be integers or slices, not list or TypeError: list indices must be integers or slices, not list. Since I can't get past this issue, I'm having trouble discerning what to do next in order to merge all the dataframes together. I assume that I will have to set an index like idx = pd.date_range('2018-01-01 00:00:00', '2018-12-31 23:00:00', freq='H') and then reset_index for the dataframes that are missing data. And then, couldn't I use a left join or concatenate since they would all have the same index? The dataframe examples given above do not have the desired date range; I just didn't know how else to make sample dataframes.
Is this what you are looking for?
dfs = [Temp, WS, RH]
data = []
for df in dfs:
    data.append(df.set_index(pd.to_datetime(df[["year", "month", "day"]]))
                  .drop(columns=["year", "month", "day"]))
out = pd.concat(data, axis="columns")

>>> out
            T1  T2  WS1  WS2   RH1   RH2
2009-01-01  20  33  5.4  5.4  33.0  33.0
2009-02-02  21  39  5.1  5.1  38.0  38.0
2009-03-03  25  35  NaN  NaN   NaN   NaN
2010-01-01  28  34  NaN  NaN   NaN   NaN
2010-02-02  30  34  NaN  NaN   NaN   NaN
2010-03-03  33  31  5.2  4.4  30.0  59.0
2011-01-01  39  30  4.3  4.4  45.0  30.0
2011-02-02  35  27  4.4  1.2  52.0  45.0
2011-03-03  34  24  4.4  1.5  60.0  52.0
2012-01-01  34  20  1.2  1.6  61.0  60.0
2012-02-02  31  21  1.5  2.3  66.0  61.0
2012-03-03  30  25  1.6  2.5  60.0  30.0
2013-01-01  27  28  2.3  5.2  59.0  45.0
2013-02-02  24  30  2.5  4.3  30.0  52.0
2013-03-03  20  33  3.1  4.4  45.0  60.0
2014-01-01  21  39  2.5  4.4  52.0  61.0
2014-02-02  25  20  4.6  1.2  60.0  66.0
2014-03-03  28  21  4.4  1.5  61.0  60.0
2015-01-01  30  25  4.4  1.6   NaN   NaN
2015-02-02  33  28  1.2  2.3   NaN   NaN
2015-03-03  39  30  1.5  2.5   NaN   NaN
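If you also want a row for every calendar day (the pd.date_range idea mentioned in the question), one option is to reindex the combined frame onto a full daily index afterwards. A minimal sketch, assuming the out frame built above and the 2009-2019 span from the question:

# complete daily index covering the period; days absent from every dataframe become NaN rows
idx = pd.date_range('2009-01-01', '2019-12-31', freq='D')
out_full = out.reindex(idx)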
How to apply a function/impute on an interval in Pandas
I have a Pandas dataset with a monthly date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  nan
1991-02-01  nan
1991-03-01  24
1991-04-01  nan
1991-05-01  nan
1991-06-01  nan
1991-07-01  nan
1991-08-01  34
1991-09-01  nan
1991-10-01  nan
1991-11-01  22
1991-12-01  nan
I want to linearly interpolate the values to fill the nans. However, it has to be applied within 6-month blocks (non-rolling). So for example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01, where we would do forward and backward linear imputation such that if there is a nan the interpolation would be descending to a final value of 0. For the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas, however. Any ideas?
The idea is to group by 6-month periods, prepend and append a 0 value to each group, interpolate, and then drop those first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])

# pad each group with 0 on both ends, interpolate linearly, then drop the padding
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print (df)
         Date  orders
0  1991-01-01     8.0
1  1991-02-01    16.0
2  1991-03-01    24.0
3  1991-04-01    18.0
4  1991-05-01    12.0
5  1991-06-01     6.0
6  1991-07-01    17.0
7  1991-08-01    34.0
8  1991-09-01    30.0
9  1991-10-01    26.0
10 1991-11-01    22.0
11 1991-12-01    11.0
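As a tiny illustration of what the lambda does to a single 6-month block, using the first block from the question (nan, nan, 24, nan, nan, nan): padding with 0 on both ends is what makes the interpolation slope down to 0 at the block edges instead of leaving leading or trailing nans.

import pandas as pd

block = pd.Series([float('nan'), float('nan'), 24.0, float('nan'), float('nan'), float('nan')])
padded = pd.Series([0] + block.tolist() + [0]).interpolate()
print(padded.iloc[1:-1].tolist())   # [8.0, 16.0, 24.0, 18.0, 12.0, 6.0]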
Write previous entries of a time series into additional columns
I have a dataframe that contains values for individual days:
day  value
1    10.1
2    15.4
3    12.1
4    14.1
5    -9.7
6     2.0
8     3.4
There is not necessarily a value for each day (day 7 is missing in my example), but there is never more than one value per day. I want to add additional columns to this dataframe, containing per row the value of the day before, the value of two days ago, the value of three days ago, etc. The result would be:
day  value  value-of-1  value-of-2  value-of-3
1     10.1         NaN         NaN         NaN
2     15.4        10.1         NaN         NaN
3     12.1        15.4        10.1         NaN
4     14.1        12.1        15.4        10.1
5     -9.7        14.1        12.1        15.4
6      2.0        -9.7        14.1        12.1
8      3.4         NaN         2.0        -9.7
At the moment, I add to the original dataframe a column containing the required day and then merge the original dataframe using this new column as the join condition. After some reorganizing of the columns, I get my result:
data = [[1, 10.1], [2, 15.4], [3, 12.1], [4, 14.1], [5, -9.7], [6, 2.0], [8, 3.4]]
df = pd.DataFrame(data, columns=['day', 'value'])

def add_column_for_prev_day(df, day):
    df[f"day-{day}"] = df["day"] - day
    df = df.merge(df[["day", "value"]], how="left",
                  left_on=f"day-{day}", right_on="day", suffixes=("", "_r")) \
           .drop(["day_r", f"day-{day}"], axis=1) \
           .rename({"value_r": f"value-of-{day}"}, axis=1)
    return df

df = add_column_for_prev_day(df, 1)
df = add_column_for_prev_day(df, 2)
df = add_column_for_prev_day(df, 3)
I wonder if there is a better and faster way to get the same result, especially without having to merge the dataframe over and over again. A simple shift does not help, as there are days without data.
You can use:
m = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))
l = [1, 2, 3]
for i in l:
    m[f"value_of_{i}"] = m['value'].shift(i)
m.reindex(df.day).reset_index()

   day  value  value_of_1  value_of_2  value_of_3
0    1   10.1         NaN         NaN         NaN
1    2   15.4        10.1         NaN         NaN
2    3   12.1        15.4        10.1         NaN
3    4   14.1        12.1        15.4        10.1
4    5   -9.7        14.1        12.1        15.4
5    6    2.0        -9.7        14.1        12.1
6    8    3.4         NaN         2.0        -9.7
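The reindex over the full day range is what turns a positional shift into a calendar shift: the missing day 7 becomes a NaN row, so shift(1) really means "one day earlier" rather than "one row earlier". A minimal sketch of that intermediate frame, assuming the df from the question:

m = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))
print(m)
#      value
# day
# 1     10.1
# 2     15.4
# 3     12.1
# 4     14.1
# 5     -9.7
# 6      2.0
# 7      NaN   <- gap inserted so the shifts stay aligned to days
# 8      3.4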