Find a value in a column as a function of another column - python
Assuming that the value exists, how can I, for example, create another column "testFinal" in the dataframe that holds the absolute value of df["test"] minus the value of df["test"] 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2 - 8) = 6.
My goal is to calculate "testFinal".
I don't know if it's clear, so here is the example.
NB: the Timestamp is not homogeneous, so the interval between two values can differ over time.
Thanks a lot.
Here is the code for the dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89],
                   'testFinal': [6, 18, 3, 0, 0, 1, 49, 20, 35, np.nan, np.nan]})
First, create a temporary column temp by converting the Timestamp column to timedelta with pd.to_timedelta, and set it as the dataframe index. Next, create a new column testFinal whose values are this index plus 0.2 seconds. Then use Series.map to map testFinal onto the values of df['test'], so that testFinal now holds the value of test 0.2 s later. Finally, subtract the test column and take the absolute value to get the desired result:
# convert the float seconds to a timedelta and use it as the index
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')
df = df.set_index('temp')
# the timestamp 0.2 s after each row
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)
# look up the 'test' value at that later timestamp, subtract and take the absolute value
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()
df = df.reset_index(drop=True)
# print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
You could use numpy as follows. I created a new column test_final to compare with the expected testFinal column.
import numpy as np

test = df.test.values
# shift the array two positions (two 0.1 s steps = 0.2 s), padding the end with NaN
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan] * 2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
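Since the question notes that the Timestamp interval is not guaranteed to stay at 0.1 s, here is a minimal sketch (not from either answer above) of an alternative lookup based on pd.merge_asof, which pairs each row with the nearest available sample around Timestamp + 0.2 s instead of relying on a fixed positional shift; the 0.05 s tolerance is an assumption you would tune to your own sampling.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89]})

# target times: 0.2 s after each existing timestamp
target = df[['Timestamp']].copy()
target['Timestamp'] = target['Timestamp'] + 0.2

# nearest 'test' value to each target time, within an assumed 0.05 s tolerance
later = pd.merge_asof(target, df, on='Timestamp', direction='nearest', tolerance=0.05)

df['testFinal'] = (later['test'] - df['test']).abs()
print(df)

Rows whose target time has no sample within the tolerance (here the last two) come out as NaN, matching the expected output.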
Related
maximum sum of consecutive n-days using pandas
I've seen solutions in different languages (e.g. SQL, Fortran, or C++) which mainly use for loops. I am hoping that someone can help me solve this task using pandas instead. I have a data frame that looks like this:
date        pcp    sum_count  sumcum
7/13/2013   0.1    3.0        48.7
7/14/2013   48.5
7/15/2013   0.1
7/16/2013
8/1/2013    1.5    1.0        1.5
8/2/2013
8/3/2013
8/4/2013    0.1    2.0        3.6
8/5/2013    3.5
9/22/2013   0.3    3.0        26.3
9/23/2013   14.0
9/24/2013   12.0
9/25/2013
9/26/2013
10/1/2014   0.1    11.0
10/2/2014   96.0              135.5
10/3/2014   2.5
10/4/2014   37.0
10/5/2014   9.5
10/6/2014   26.5
10/7/2014   0.5
10/8/2014   25.5
10/9/2014   2.0
10/10/2014  5.5
10/11/2014  5.5
And I was hoping I could do the following:
STEP 1: create the sum_count column by determining the total count of consecutive non-zeros in the 'pcp' column.
STEP 2: create the sumcum column and calculate the sum of the consecutive 'pcp' values.
STEP 3: create a pivot table that will look like this:
year  max_sum_count
2013           48.7
2014          135.5
BUT!! the max_sum_count is based on the condition sum_count = 3. I'd appreciate any help, thank you!
UPDATED QUESTION: I had previously emphasized that sum_count should only return the maximum of 3 consecutive pcps, but I mistakenly gave the wrong data frame and had to edit it. Sorry. The sumcum of 135.5 comes from 96.0 + 2.5 + 37.0; it is the maximum over 3 consecutive pcps within the sum_count of 11. Thank you.
Use:
# filtering + rolling by days
N = 3

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# test NaNs
m = df['pcp'].isna()
# groups of consecutive non-NaNs
df['g'] = m.cumsum()[~m]
# extract years
df['year'] = df.index.year
# drop the NaN rows
df = df[~m].copy()

# keep only groups of size >= N
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()

# rolling sum over N days within each group
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))

# take the maximum per year, adding missing years if any
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)

   year  max_sum_count
0  2013           48.7
1  2014          135.5
First, convert date to a real datetime dtype and create a boolean mask which keeps the rows where pcp is not null. Then you can create the groups and compute your variables.
Input data:
>>> df
          date   pcp
0    7/13/2013   0.1
1    7/14/2013  48.5
2    7/15/2013   0.1
3    7/16/2013   NaN
4     8/1/2013   1.5
5     8/2/2013   NaN
6     8/3/2013   NaN
7     8/4/2013   0.1
8     8/5/2013   3.5
9    9/22/2013   0.3
10   9/23/2013  14.0
11   9/24/2013  12.0
12   9/25/2013   NaN
13   9/26/2013   NaN
14   10/1/2014   0.1
15   10/2/2014  96.0
16   10/3/2014   2.5
17   10/4/2014  37.0
18   10/5/2014   9.5
19   10/6/2014  26.5
20   10/7/2014   0.5
21   10/8/2014  25.5
22   10/9/2014   2.0
23  10/10/2014   5.5
24  10/11/2014   5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()

# new group whenever a non-null date does not follow its predecessor by exactly one day
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

# aggregate each run of consecutive days and write the result onto its first row
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
         date   pcp  sum_count  sumcum
0  2013-07-13   0.1        3.0    48.7
1  2013-07-14  48.5        NaN     NaN
2  2013-07-15   0.1        NaN     NaN
3  2013-07-16   NaN        NaN     NaN
4  2013-08-01   1.5        1.0     1.5
5  2013-08-02   NaN        NaN     NaN
6  2013-08-03   NaN        NaN     NaN
7  2013-08-04   0.1        2.0     3.6
8  2013-08-05   3.5        NaN     NaN
9  2013-09-22   0.3        3.0    26.3
10 2013-09-23  14.0        NaN     NaN
11 2013-09-24  12.0        NaN     NaN
12 2013-09-25   NaN        NaN     NaN
13 2013-09-26   NaN        NaN     NaN
14 2014-10-01   0.1       11.0   210.6
15 2014-10-02  96.0        NaN     NaN
16 2014-10-03   2.5        NaN     NaN
17 2014-10-04  37.0        NaN     NaN
18 2014-10-05   9.5        NaN     NaN
19 2014-10-06  26.5        NaN     NaN
20 2014-10-07   0.5        NaN     NaN
21 2014-10-08  25.5        NaN     NaN
22 2014-10-09   2.0        NaN     NaN
23 2014-10-10   5.5        NaN     NaN
24 2014-10-11   5.5        NaN     NaN
>>> pivot
   date  max_sum_count
0  2013           48.7
1  2014          210.6
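Just to spell out the grouping trick used above, here is a tiny sketch on three made-up dates: the comparison is True whenever a date does not follow its predecessor by exactly one day, so the cumulative sum assigns a new group id at every break in consecutive days.

import pandas as pd

dates = pd.Series(pd.to_datetime(['2013-07-13', '2013-07-14', '2013-08-01']))
# True at the start of each run of consecutive days
breaks = dates.ne(dates.shift().add(pd.Timedelta(days=1)))
grp = breaks.cumsum()
print(grp.tolist())   # [1, 1, 2]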
Sorting values by two columns in Pandas Python
The idea is to sort values by two columns, such that, given the two columns, I am expecting output something like this:
Expected output
      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
3   NaN   5.0
4  10.0   NaN
5  24.0  24.7
6  31.0  31.4
However, the code below
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'x': [2, 3, 4, 24, 31, '', 10],
                    'y': ['', '', 4.1, 24.7, 31.4, 5, '']})
df1.replace(r'^\s*$', np.nan, regex=True, inplace=True)
rslt_df = df1.sort_values(by=['x', 'y'], ascending=(True, True))
print(rslt_df)
produces the following:
      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
6  10.0   NaN
3  24.0  24.7
4  31.0  31.4
5   NaN   5.0
Notice that in the last row, the 5.0 in column y is placed at the bottom. What modification to the code is needed in order to obtain the intended output?
Try sorting by x fillna y, then reindex from those sorted values:
df1.reindex(df1['x'].fillna(df1['y']).sort_values().index).reset_index(drop=True)
To update the df1 variable:
df1 = (
    df1.reindex(df1['x'].fillna(df1['y']).sort_values().index)
       .reset_index(drop=True)
)
df1:
      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
3   NaN   5.0
4  10.0   NaN
5  24.0  24.7
6  31.0  31.4
With np.sort and argsort:
df1.iloc[np.sort(df1[['x','y']], axis=1)[:, 0].argsort()]

      x     y
0   2.0   NaN
1   3.0   NaN
2   4.0   4.1
5   NaN   5.0
6  10.0   NaN
3  24.0  24.7
4  31.0  31.4
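Note that the np.sort/argsort answer keeps the original index labels (0, 1, 2, 5, 6, 3, 4 above); if you want the clean 0..6 index shown in the expected output, a reset_index can be chained on. A minimal sketch, assuming the df1 from the question:

rslt = df1.iloc[np.sort(df1[['x', 'y']], axis=1)[:, 0].argsort()].reset_index(drop=True)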
loop through multiple dataframes and create datetime index then join dataframes
I have 9 dataframes of different lengths but similar formats. Each dataframe has a year, month, and day column with dates that span from 1/1/2009-12/31/2019, but some dataframes are missing data for some days. I would like to build one large dataframe with a DateTime index, but I am having trouble creating a loop to convert the year, month, and day columns to a datetime index for each dataframe, and I don't know which function to use to join the dataframes together. I have one dataframe called Temp that has all 4017 lines of data for every day of the 11-year period, but the rest of the dataframes are missing some dates.
import pandas as pd

# just creating some sample data to make it easier
Temp = pd.DataFrame({'year':[2009,2009,2009,2010,2010,2010,2011,2011,2011,2012,2012,2012,2013,2013,2013,2014,2014,2014,2015,2015,2015],
                     'month':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                     'day':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                     'T1':[20,21,25,28,30,33,39,35,34,34,31,30,27,24,20,21,25,28,30,33,39],
                     'T2':[33,39,35,34,34,31,30,27,24,20,21,25,28,30,33,39,20,21,25,28,30]})

WS = pd.DataFrame({'year':[2009,2009,2010,2011,2011,2011,2012,2012,2012,2013,2013,2013,2014,2014,2014,2015,2015,2015],
                   'month':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'day':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'WS1':[5.4,5.1,5.2,4.3,4.4,4.4,1.2,1.5,1.6,2.3,2.5,3.1,2.5,4.6,4.4,4.4,1.2,1.5],
                   'WS2':[5.4,5.1,4.4,4.4,1.2,1.5,1.6,2.3,2.5,5.2,4.3,4.4,4.4,1.2,1.5,1.6,2.3,2.5]})

RH = pd.DataFrame({'year':[2009,2009,2010,2011,2011,2011,2012,2012,2012,2013,2013,2013,2014,2014,2014],
                   'month':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'day':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                   'RH1':[33,38,30,45,52,60,61,66,60,59,30,45,52,60,61],
                   'RH2':[33,38,59,30,45,52,60,61,30,45,52,60,61,66,60]})
So far, what I have tried was to first create a loop that would convert the year, month, and day columns into a DateTime index and drop the remaining year, month, and day columns:
df = [Temp, WS, RH]
for dfs in df:
    dfs['date'] = pd.to_datetime(dfs[['year','month','day']])
    dfs.set_index(['date'], inplace=True)
    dfs.drop(columns=['year','month','day'], inplace=True)
But I keep getting errors that say TypeError: tuple indices must be integers or slices, not list or TypeError: list indices must be integers or slices, not list. Since I can't get past this issue, I'm having trouble discerning what to do next in order to merge all the dataframes together. I assume that I will have to set an index like idx = pd.date_range('2018-01-01 00:00:00', '2018-12-31 23:00:00', freq='H') and then reset_index for the dataframes that are missing data. And then, couldn't I use a left join or concatenate since they would all have the same index? The dataframe examples given above do not have the desired date range; I just didn't know how else to make sample dataframes.
Is this what you are looking for?
dfs = [Temp, WS, RH]
data = []
for df in dfs:
    data.append(df.set_index(pd.to_datetime(df[["year", "month", "day"]]))
                  .drop(columns=["year", "month", "day"]))
out = pd.concat(data, axis="columns")

>>> out
            T1  T2  WS1  WS2   RH1   RH2
2009-01-01  20  33  5.4  5.4  33.0  33.0
2009-02-02  21  39  5.1  5.1  38.0  38.0
2009-03-03  25  35  NaN  NaN   NaN   NaN
2010-01-01  28  34  NaN  NaN   NaN   NaN
2010-02-02  30  34  NaN  NaN   NaN   NaN
2010-03-03  33  31  5.2  4.4  30.0  59.0
2011-01-01  39  30  4.3  4.4  45.0  30.0
2011-02-02  35  27  4.4  1.2  52.0  45.0
2011-03-03  34  24  4.4  1.5  60.0  52.0
2012-01-01  34  20  1.2  1.6  61.0  60.0
2012-02-02  31  21  1.5  2.3  66.0  61.0
2012-03-03  30  25  1.6  2.5  60.0  30.0
2013-01-01  27  28  2.3  5.2  59.0  45.0
2013-02-02  24  30  2.5  4.3  30.0  52.0
2013-03-03  20  33  3.1  4.4  45.0  60.0
2014-01-01  21  39  2.5  4.4  52.0  61.0
2014-02-02  25  20  4.6  1.2  60.0  66.0
2014-03-03  28  21  4.4  1.5  61.0  60.0
2015-01-01  30  25  4.4  1.6   NaN   NaN
2015-02-02  33  28  1.2  2.3   NaN   NaN
2015-03-03  39  30  1.5  2.5   NaN   NaN
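If you also want a row for every calendar day (the pd.date_range idea mentioned in the question), one option is to reindex the combined frame onto a full daily index afterwards. A minimal sketch, assuming the out frame built above and the 2009-2019 span from the question:

# complete daily index covering the period; days absent from every dataframe become NaN rows
idx = pd.date_range('2009-01-01', '2019-12-31', freq='D')
out_full = out.reindex(idx)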
How to apply a function/impute on an interval in Pandas
I have a Pandas dataset with a monthly date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  nan
1991-02-01  nan
1991-03-01  24
1991-04-01  nan
1991-05-01  nan
1991-06-01  nan
1991-07-01  nan
1991-08-01  34
1991-09-01  nan
1991-10-01  nan
1991-11-01  22
1991-12-01  nan
I want to linearly interpolate the values to fill the nans. However, it has to be applied within 6-month blocks (non-rolling). So for example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01, where we would do forward and backward linear imputation such that if there is a nan the interpolation would be descending to a final value of 0. For the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas, however. Any ideas?
The idea is to group by 6-month periods, prepend and append a 0 value to each group, interpolate, and then drop those first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])

# pad each group with 0 on both ends, interpolate linearly, then drop the padding
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print (df)
         Date  orders
0  1991-01-01     8.0
1  1991-02-01    16.0
2  1991-03-01    24.0
3  1991-04-01    18.0
4  1991-05-01    12.0
5  1991-06-01     6.0
6  1991-07-01    17.0
7  1991-08-01    34.0
8  1991-09-01    30.0
9  1991-10-01    26.0
10 1991-11-01    22.0
11 1991-12-01    11.0
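As a tiny illustration of what the lambda does to a single 6-month block, using the first block from the question (nan, nan, 24, nan, nan, nan): padding with 0 on both ends is what makes the interpolation slope down to 0 at the block edges instead of leaving leading or trailing nans.

import pandas as pd

block = pd.Series([float('nan'), float('nan'), 24.0, float('nan'), float('nan'), float('nan')])
padded = pd.Series([0] + block.tolist() + [0]).interpolate()
print(padded.iloc[1:-1].tolist())   # [8.0, 16.0, 24.0, 18.0, 12.0, 6.0]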
Write previous entries of a time series into additional columns
I have a dataframe that contains values for individual days:
day  value
1    10.1
2    15.4
3    12.1
4    14.1
5    -9.7
6     2.0
8     3.4
There is not necessarily a value for each day (day 7 is missing in my example), but there is never more than one value per day. I want to add additional columns to this dataframe, containing per row the value of the day before, the value of two days ago, the value of three days ago, etc. The result would be:
day  value  value-of-1  value-of-2  value-of-3
1     10.1         NaN         NaN         NaN
2     15.4        10.1         NaN         NaN
3     12.1        15.4        10.1         NaN
4     14.1        12.1        15.4        10.1
5     -9.7        14.1        12.1        15.4
6      2.0        -9.7        14.1        12.1
8      3.4         NaN         2.0        -9.7
At the moment, I add to the original dataframe a column containing the required day and then merge the original dataframe using this new column as the join condition. After some reorganizing of the columns, I get my result:
data = [[1, 10.1], [2, 15.4], [3, 12.1], [4, 14.1], [5, -9.7], [6, 2.0], [8, 3.4]]
df = pd.DataFrame(data, columns=['day', 'value'])

def add_column_for_prev_day(df, day):
    df[f"day-{day}"] = df["day"] - day
    df = df.merge(df[["day", "value"]], how="left",
                  left_on=f"day-{day}", right_on="day", suffixes=("", "_r")) \
           .drop(["day_r", f"day-{day}"], axis=1) \
           .rename({"value_r": f"value-of-{day}"}, axis=1)
    return df

df = add_column_for_prev_day(df, 1)
df = add_column_for_prev_day(df, 2)
df = add_column_for_prev_day(df, 3)
I wonder if there is a better and faster way to get the same result, especially without having to merge the dataframe over and over again. A simple shift does not help, as there are days without data.
You can use:
m = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))
l = [1, 2, 3]
for i in l:
    m[f"value_of_{i}"] = m['value'].shift(i)
m.reindex(df.day).reset_index()

   day  value  value_of_1  value_of_2  value_of_3
0    1   10.1         NaN         NaN         NaN
1    2   15.4        10.1         NaN         NaN
2    3   12.1        15.4        10.1         NaN
3    4   14.1        12.1        15.4        10.1
4    5   -9.7        14.1        12.1        15.4
5    6    2.0        -9.7        14.1        12.1
6    8    3.4         NaN         2.0        -9.7
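The reindex over the full day range is what turns a positional shift into a calendar shift: the missing day 7 becomes a NaN row, so shift(1) really means "one day earlier" rather than "one row earlier". A minimal sketch of that intermediate frame, assuming the df from the question:

m = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))
print(m)
#      value
# day
# 1     10.1
# 2     15.4
# 3     12.1
# 4     14.1
# 5     -9.7
# 6      2.0
# 7      NaN   <- gap inserted so the shifts stay aligned to days
# 8      3.4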