Create a dataframe based on column values of another dataframe - python

I have a dataframe of shape 20000 x 50. Two of the columns are Date and Time (represented as the hour). The remaining columns hold observations of some parameters at that time. What I am trying to achieve is a new dataframe that averages all the remaining columns over every 3 hours per day and adds an ID column numbered from 1 to 8, each value representing one 3-hour range.
I have enclosed an image of the source and what should be result. Any help is very much appreciated.
Data

Group by the Date column and a binned Hour column, then aggregate with mean. The bin is built by subtracting 1 with sub, integer-dividing by 3 with floordiv, and adding 1 back with add:
df['Hour'] = df['Hour'].sub(1).floordiv(3).add(1)
df = df.groupby(['Date', 'Hour'], as_index=False).mean()
print (df)
Date Hour col1 col2 col3
0 05/01/2018 1 5.333333 5.333333 7.666667
1 05/01/2018 2 6.000000 6.000000 4.000000
2 06/01/2018 1 4.000000 6.333333 7.000000
3 06/01/2018 3 6.000000 6.000000 3.666667
Detail:
print (df['Hour'].sub(1).floordiv(3).add(1))
0 1
1 1
2 1
3 2
4 1
5 1
6 1
7 3
8 3
9 3
Name: Hour, dtype: int64
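For a self-contained check of the binning, here is a minimal sketch. The sample frame is hypothetical (the original data was only shown as an image), and it assumes Hour runs from 1 to 24; the bin is stored in a separate ID column so the original Hour values are not overwritten.

```python
import pandas as pd

# Hypothetical sample: one day of hourly readings, Hour running 1..24.
df = pd.DataFrame({
    'Date': ['05/01/2018'] * 24,
    'Hour': range(1, 25),
    'col1': range(24),
})

# Hours 1-3 -> 1, 4-6 -> 2, ..., 22-24 -> 8
df['ID'] = df['Hour'].sub(1).floordiv(3).add(1)

# Average every parameter column within each (Date, ID) block
out = df.drop(columns='Hour').groupby(['Date', 'ID'], as_index=False).mean()
print(out)
```

If Hour were recorded as 0 to 23 instead, the sub(1) step would have to be dropped.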

Related

Create a column that contains the sum of rows above within group

Here's the dataset I've got. Basically I would like to create a column containing the sum of the values before the date (which means the sum of the values that is above the row) within the same group. So the first row of each group is supposed to be always 0.
group  date        value
1      10/04/2022  2
1      12/04/2022  3
1      17/04/2022  5
1      22/04/2022  1
2      11/04/2022  3
2      15/04/2022  2
2      17/04/2022  4
The column I want would look like this.
Could you give me an idea how to create such a column?
group  date        value  sum
1      10/04/2022  2      0
1      12/04/2022  3      2
1      17/04/2022  5      5
1      22/04/2022  1      10
2      11/04/2022  3      0
2      15/04/2022  2      3
2      17/04/2022  4      5
You can try groupby.transform and call Series.cumsum().shift()
df['sum'] = (df
# sort the dataframe if needed
.assign(date=pd.to_datetime(df['date'], dayfirst=True))
.sort_values(['group', 'date'])
.groupby('group')['value']
.transform(lambda col: col.cumsum().shift())
.fillna(0))
print(df)
group date value sum
0 1 10/04/2022 2 0.0
1 1 12/04/2022 3 2.0
2 1 17/04/2022 5 5.0
3 1 22/04/2022 1 10.0
4 2 11/04/2022 3 0.0
5 2 15/04/2022 2 3.0
6 2 17/04/2022 4 5.0
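An alternative sketch that avoids the lambda and keeps integer dtype: take the running total per group and subtract the current row, which leaves the sum of the rows above. It assumes the rows are already sorted by date within each group, as in the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 1, 1, 2, 2, 2],
    'date': ['10/04/2022', '12/04/2022', '17/04/2022', '22/04/2022',
             '11/04/2022', '15/04/2022', '17/04/2022'],
    'value': [2, 3, 5, 1, 3, 2, 4],
})

# Running total per group minus the current row = sum of the rows above.
# Assumes rows are already in date order within each group.
df['sum'] = df.groupby('group')['value'].cumsum() - df['value']
print(df)
```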

Sum if column name is higher than row value

I am trying to sum the row values of each of the columns below, counting only the rows whose "date" value is on or before the date in the column name:
01-01-2020 01-01-2021 01-01-2022 date
1 1 3 6 01-01-2020
2 4 4 2 01-10-2021
3 5 1 9 01-12-2021
For instance for column 1, the only row whose date value is equal or lower than column 1's name (01-01-2020) is the first row, thus the sum is 1 for column 1.
Likewise, as all dates in the "date" column are lower than the last column's name (01-01-2022), the total is 6+2+9=17, which would result in this:
01-01-2020 01-01-2021 01-01-2022 date
1 1 3 6 01-01-2020
2 4 4 2 01-10-2021
3 5 1 9 01-12-2021
Total 1 3 17
Is there any way to do it more elegantly than looping over each column and then each row?
We can build the comparison with np.greater_equal.outer, then use the resulting boolean mask to turn the unwanted cells into NaN before summing:
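s = pd.to_datetime(df['date']).values
# True where the column's date is on or after the row's date
m = np.greater_equal.outer(pd.to_datetime(df.columns[:-1]).values, s).T
total = df.iloc[:, :-1].where(m).sum().to_frame('Total').T
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df = pd.concat([df, total])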
df
Out[381]:
01-01-2020 01-01-2021 01-01-2022 date
1 1.0 3.0 6.0 01-01-2020
2 4.0 4.0 2.0 01-10-2021
3 5.0 1.0 9.0 01-12-2021
Total 1.0 3.0 17.0 NaN
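The same mask can also be written with plain NumPy broadcasting. This is a sketch rather than the answer's exact method, and it assumes the dates are day-first (dd-mm-yyyy); the totals for this sample come out the same under either reading.

```python
import pandas as pd

df = pd.DataFrame(
    {'01-01-2020': [1, 4, 5], '01-01-2021': [3, 4, 1], '01-01-2022': [6, 2, 9],
     'date': ['01-01-2020', '01-10-2021', '01-12-2021']},
    index=[1, 2, 3],
)

# Assumption: dates are day-first (dd-mm-yyyy); adjust the format if not.
fmt = '%d-%m-%Y'
col_dates = pd.to_datetime(df.columns[:-1], format=fmt).to_numpy()
row_dates = pd.to_datetime(df['date'], format=fmt).to_numpy()

# mask[i, j] is True when row i's date is on or before column j's date
mask = row_dates[:, None] <= col_dates[None, :]

# Cells outside the mask become NaN and are skipped by sum()
totals = df.iloc[:, :-1].where(mask).sum()
print(totals)
```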

Count number of columns above a date

I have a pandas dataframe with several columns, and I would like to know, for each row, the number of columns holding a date after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with a column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is better when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
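Putting the pieces together, a runnable sketch on the question's data could look like this. It converts the Date columns to real datetimes first, so the comparison does not depend on the strings happening to be in ISO order; that conversion step is an addition, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [4, 6, 12],
    'Bill': [6, 8, 14],
    'Date 1': ['2000-10-04', '2016-05-03', '2016-11-16'],
    'Date 2': ['2000-11-05', '2017-08-09', '2017-05-04'],
    'Date 3': ['1999-12-05', '2018-07-14', '2017-07-04'],
    'Date 4': ['2001-05-04', '2015-09-12', '2018-07-04'],
    'Bill 2': [8, 17, 35],
})

# Pick every column whose name contains 'Date'
date_cols = df.filter(like='Date').columns

# Parse to datetimes so the comparison does not rely on string ordering
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# Boolean mask per cell, then count the True values in each row
df['Count'] = (df[date_cols] > pd.Timestamp('2016-12-31')).sum(axis=1)
print(df)
```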

Is there a way to do rolling percentage in pandas with three columns?

I'm looking for a method to perform a rolling percentage calculation in a Pandas DataFrame with three columns. For each row in my df, I want to calculate the percentage change between that row and the row three positions earlier in that column, and then do this for each column. With the output, I want to average each row and sum those averages. Below, I will try to show you what I mean and what I have tried. However, as you will be able to tell, my knowledge is limited and I'm looking for a faster and more efficient way to produce output like the one below, but for each row in a larger DataFrame...
I'm grateful for any feedback!
My test dataset looks like this:
df1 = pd.DataFrame([[1,3,2,4,5,6,3,4],[1,3,4,6,7,2,3,4],[1,2,2,4,12,9,8,4]]).T
print(df1)
0 1 2
0 1 1 1
1 3 3 2
2 2 4 2
3 4 6 4
4 5 7 12
5 6 2 9
6 3 3 8
7 4 4 4
If I was to do this "manually" it would start with this:
df1.columns = ['First', 'Second', 'Third']
pctChange = pd.DataFrame([df1.First.pct_change(periods=3),df1.Second.pct_change(periods=3),df1.Third.pct_change(periods=3)]).T
print(pctChange)
First Second Third
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 3.000000 5.000000 3.000000
4 0.666667 1.333333 5.000000
5 2.000000 -0.500000 3.500000
6 -0.250000 -0.500000 1.000000
7 -0.200000 -0.428571 -0.666667
Then taking the average of each row.
ave = pctChange.mean(axis=1)
print(ave)
0 NaN
1 NaN
2 NaN
3 3.666667
4 2.333333
5 1.666667
6 0.083333
7 -0.431746
Finally, sum the latest three rows
SumOfLastThree = ave.iloc[-3:].sum()
print(SumOfLastThree)
#desired output
1.3182539682539682
Maybe you could try this to get your 3-day moving average:
df1 = pd.DataFrame([[1,3,2,4,5,6,3,4],[1,3,4,6,7,2,3,4],[1,2,2,4,12,9,8,4]]).T
df1.columns = ['First','Second','Third']
#3 day rolling average of value
df1['PctChange1']=pd.to_numeric(df1.First.rolling(3,min_periods=3).mean().fillna(''))
I did pd.to_numeric because it was returning an object. To change the size of the moving average, change the first parameter of .rolling(). If you still want a rolling average for the first couple of rows, you can change it to min_periods=1. That should allow you to add the moving average as a new column in your frame.
Then your rolling sum would be:
df1['RollingSum'] = pd.to_numeric(df1.PctChange1.rolling(3,min_periods=3).sum().fillna(''))
Add it all up:
df1 = pd.DataFrame([[1,3,2,4,5,6,3,4],[1,3,4,6,7,2,3,4],[1,2,2,4,12,9,8,4]]).T
df1.columns = ['First','Second','Third']
#3 day rolling average of value
df1['PctChange1']=pd.to_numeric(df1.First.rolling(3,min_periods=3).mean().fillna(''))
#sum of last three rolling averages
df1['RollingSum'] = pd.to_numeric(df1.PctChange1.rolling(3,min_periods=3).sum().fillna(''))
df1
Let me know if that works!
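For comparison, the manual steps from the question (3-row percentage change, row-wise mean, sum of the latest three means) can be applied to the whole frame at once. This is a sketch that follows the question's own definition rather than the moving-average answer above; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 3, 2, 4, 5, 6, 3, 4],
                    [1, 3, 4, 6, 7, 2, 3, 4],
                    [1, 2, 2, 4, 12, 9, 8, 4]]).T
df1.columns = ['First', 'Second', 'Third']

# 3-row percentage change for every column at once, then the row-wise mean
ave = df1.pct_change(periods=3).mean(axis=1)

# Sum of the latest three averages, as in the question
sum_of_last_three = ave.iloc[-3:].sum()
print(sum_of_last_three)

# The same sum for every row of a larger frame, as a rolling window
rolling_sum = ave.rolling(3).sum()
print(rolling_sum)
```

The last value of rolling_sum is the same quantity as sum_of_last_three, so the rolling version extends the calculation to every row without a loop.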

Shifting the values of a column in pandas dataframe one month forward

Is there a way to shift the values of a column in a pandas dataframe one month forward? (Note that I want to shift the column value and not the date value.)
For example, if I have:
ColumnA ColumnB
2016-10-01 1 0
2016-09-30 2 1
2016-09-29 5 1
2016-09-28 7 1
.
.
2016-09-01 3 1
2016-08-31 4 7
2016-08-30 4 7
2016-08-29 9 7
2016-08-28 10 7
Then I want to be able to shift the values in ColumnB
one month forward, to get the desired output:
ColumnA ColumnB
2016-10-01 1 1
2016-09-30 2 7
2016-09-29 5 7
2016-09-28 7 7
.
.
2016-09-01 3 7
2016-08-31 4 X
2016-08-30 4 X
2016-08-29 9 X
2016-08-28 10 X
In the data I have, the value is fixed for each month (for example, the value in ColumnB was 1 during September), so the fact that the number of days differs a bit from month to month should not be a problem.
This seems related to "Python/Pandas - DataFrame Index - Move one month forward", but in the linked question the OP wanted to shift the whole frame, and I want to shift only selected columns.
It is not too elegant, but you can do something like that:
df=df.reset_index()
df['index']=pd.to_datetime(df['index'])
df['offset']=df['index']-pd.DateOffset(months=1)
res=df.merge(df,right_on='index',left_on='offset',how='left')
Then just take the columns you want from res.
You can first create a new index of pandas Periods for each month, then get the value of each month, and use pandas automatic index alignment to create a new column.
df1 = df.copy()
orig_idx = df.index
df1.index = orig_idx.to_period('M')
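col_b_new = df1.groupby(level=0)['ColumnB'].first()
# tshift was removed in pandas 2.0; move the monthly index forward by one period instead
col_b_new.index = col_b_new.index + 1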
df1['ColumnB_new'] = col_b_new
df1.index = orig_idx
Output
ColumnA ColumnB ColumnB_new
2016-10-01 1 0 1.0
2016-09-30 2 1 7.0
2016-09-29 5 1 7.0
2016-09-28 7 1 7.0
2016-09-01 3 1 7.0
2016-08-31 4 7 NaN
2016-08-30 4 7 NaN
2016-08-29 9 7 NaN
2016-08-28 10 7 NaN
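A self-contained sketch of the Period-based approach on a shortened version of the sample data is shown below. It assumes a DatetimeIndex and that ColumnB is constant within each month, as the question states; the names months, monthly and shifted are illustrative.

```python
import pandas as pd

idx = pd.to_datetime(['2016-10-01', '2016-09-30', '2016-09-29', '2016-09-01',
                      '2016-08-31', '2016-08-30'])
df = pd.DataFrame({'ColumnA': [1, 2, 5, 3, 4, 4],
                   'ColumnB': [0, 1, 1, 1, 7, 7]}, index=idx)

# Month of each row as a pandas Period
months = df.index.to_period('M')

# One value per month (assumes ColumnB is constant within a month)
monthly = df.groupby(months)['ColumnB'].first()

# Relabel each monthly value with the following month
shifted = monthly.copy()
shifted.index = shifted.index + 1

# Look up every row's month; months with no earlier data become NaN
df['ColumnB_new'] = shifted.reindex(months).to_numpy()
print(df)
```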
