Slice values of a column and calculate average in Python

I have a dataframe with three columns:
a b c
0 73 12
73 80 2
80 100 5
100 150 13
Values in "a" and "b" are days. I need to find the average values of "c" in each 30 day-interval (slice values inside [min(a),max(b)] in 30 days and calculate average of c). I want as a result have a dataframe like this:
aa bb c_avg
0 30 12
30 60 12
60 90 6.33
90 120 9
120 150 13
Another sample dataset could be:
a b c
0 1264.0 1629.0 0.000000
1 1629.0 1632.0 133.333333
6 1632.0 1699.0 0.000000
2 1699.0 1706.0 21.428571
7 1706.0 1723.0 0.000000
3 1723.0 1726.0 50.000000
8 1726.0 1890.0 0.000000
4 1890.0 1893.0 33.333333
1 1893.0 1994.0 0.000000
How can I get to the final table?

First create a ranges DataFrame covering the span defined by the a and b columns:
import numpy as np
import pandas as pd

a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa':a[:-1], 'bb':a[1:]})
#print (df1)
Then cross join all rows by helper column tmp:
df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)
Finally, filter. There are two possible solutions, depending on which columns are used for the filtering:
df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
aa bb tmp a b c
0 0 30 1 0 73 12
4 30 60 1 0 73 12
8 60 90 1 0 73 12
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
aa bb c
0 0 30 12.0
1 30 60 12.0
2 60 90 8.5
3 90 120 9.0
4 120 150 13.0
df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
aa bb tmp a b c
0 0 30 1 0 73 12
8 60 90 1 0 73 12
9 60 90 1 73 80 2
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
aa bb c
0 0 30 12.000000
1 60 90 6.333333
2 90 120 9.000000
3 120 150 13.000000
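Putting the steps together, here is a minimal end-to-end sketch using the sample data from the question. Note that it swaps the two between filters above for a full interval-overlap test, (a <= bb) & (b >= aa); that overlap test is an adjustment on my part rather than part of the original answer, but it reproduces the expected output from the question, including the 30-60 row and the 6.33 average:
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'a': [0, 73, 80, 100],
                   'b': [73, 80, 100, 150],
                   'c': [12, 2, 5, 13]})

# 30-day bin edges spanning [min(a), max(b)]
edges = np.arange(df['a'].min(), df['b'].max() + 30, 30)
bins = pd.DataFrame({'aa': edges[:-1], 'bb': edges[1:]})

# Cross join every bin with every data row
# (on pandas >= 1.2 you can use pd.merge(bins, df, how='cross') instead of the tmp trick)
m = pd.merge(bins.assign(tmp=1), df.assign(tmp=1), on='tmp')

# Keep bin/row pairs whose intervals overlap, then average c per bin
overlap = (m['a'] <= m['bb']) & (m['b'] >= m['aa'])
out = m[overlap].groupby(['aa', 'bb'], as_index=False)['c'].mean()
print(out)
#     aa   bb          c
# 0    0   30  12.000000
# 1   30   60  12.000000
# 2   60   90   6.333333
# 3   90  120   9.000000
# 4  120  150  13.000000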

Related

Multiple cumulative sum based on grouped columns

I have a dataset where I would like to sum two columns and then perform a subtraction while displaying a cumulative sum.
Data
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2
a q122 4 1 5 50 25 20 55 21 1 1
a q222 1 1 2 50 25 20 57 22 0 0
a q322 0 0 0 50 25 20 57 22 5 5
b q122 5 5 10 100 30 40 110 27 4 4
b q222 2 2 4 100 30 70 114 29 5 1
b q322 3 4 7 100 30 70 121 33 0 1
Desired
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2 finalt1
a q122 4 1 5 50 25 20 55 21 1 1 28
a q222 1 1 2 50 25 20 57 22 0 0 29
a q322 0 0 0 50 25 20 57 22 5 5 24
b q122 5 5 10 100 30 40 110 27 4 4 31
b q222 2 2 4 100 30 70 114 29 5 1 28
b q322 3 4 7 100 30 70 121 33 0 1 31
Logic
Create the 'finalt1' column by summing 't1' and 'cur_t1' initially, and then subtracting 'de_t1' cumulatively, grouping by 'id' and 'date'.
Doing
df['finalt1'] = df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
I am still researching how to subtract the 'de_t1' column cumulatively.
I can't test right now, but logically:
(df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
.sub(df.groupby('id')['de_t1'].cumsum())
)
Of note, there is also this possibility to avoid grouping twice (it calculates both cumsums at once and takes the difference), but it is actually slower:
df['cur_t1'].add(df.groupby('id')[['de_t1', 't1']].cumsum().diff(axis=1)['t1'])
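For completeness, a small runnable sketch that reconstructs the relevant columns of the sample data (the DataFrame construction is my assumption) and assigns the result back to finalt1:
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b', 'b'],
    't1': [4, 1, 0, 5, 2, 3],
    'cur_t1': [25, 25, 25, 30, 30, 30],
    'de_t1': [1, 0, 5, 4, 5, 0],
})

# cur_t1 plus the cumulative sum of t1, minus the cumulative sum of de_t1, per id
df['finalt1'] = (df['cur_t1']
                 .add(df.groupby('id')['t1'].cumsum())
                 .sub(df.groupby('id')['de_t1'].cumsum()))
print(df['finalt1'].tolist())   # [28, 29, 24, 31, 28, 31]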

Pivoting a Pandas Table - Peculiar Problem

It seemed I had a simple problem of pivoting a pandas table, but unfortunately the problem turned out to be a bit complicated for me.
I am providing a tiny sample table and the desired output to illustrate the problem I am facing:
Say, I have a table like this:
df =
AF BF AT BT
1 4 100 70
2 7 102 66
3 11 200 90
4 13 300 178
5 18 403 200
So I need it in a wide/pivot format, but the parameter name in each case will be the same. (I am not looking to subset the string, if possible.)
My output table should look like the following:
dfout =
PAR F T
A 1 100
B 4 70
A 2 102
B 7 66
A 3 200
B 11 90
A 4 300
B 13 178
A 5 403
B 18 200
I tried pivoting, but was not able to achieve the desired output. Any help will be immensely appreciated. Thanks.
You can use pandas wide_to_long, but first you have to reverse each column name so the stub letters (F, T) come first:
pd.wide_to_long(
    df.rename(columns=lambda x: x[::-1]).reset_index(),
    stubnames=["F", "T"],
    i="index",
    sep="",
    j="PAR",
    suffix=".",
).reset_index("PAR")
PAR F T
index
0 A 1 100
1 A 2 102
2 A 3 200
3 A 4 300
4 A 5 403
0 B 4 70
1 B 7 66
2 B 11 90
3 B 13 178
4 B 18 200
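For reference, a self-contained version of this route, reconstructing the sample frame from the question (the DataFrame construction itself is an assumption, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'AF': [1, 2, 3, 4, 5],
                   'BF': [4, 7, 11, 13, 18],
                   'AT': [100, 102, 200, 300, 403],
                   'BT': [70, 66, 90, 178, 200]})

out = pd.wide_to_long(
    df.rename(columns=lambda x: x[::-1]).reset_index(),  # 'AF' -> 'FA', 'BF' -> 'FB', ...
    stubnames=["F", "T"],   # value columns after the rename
    i="index",              # unique row identifier
    j="PAR",                # where the A/B suffix ends up
    sep="",
    suffix=".",             # the suffix is a single character
).reset_index("PAR")
print(out)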
Alternatively, you could use the pivot_longer function from pyjanitor to reshape the data:
# pip install pyjanitor
import janitor
df.pivot_longer(names_to=("PAR", ".value"), names_pattern=r"(.)(.)")
PAR F T
0 A 1 100
1 B 4 70
2 A 2 102
3 B 7 66
4 A 3 200
5 B 11 90
6 A 4 300
7 B 13 178
8 A 5 403
9 B 18 200
Update: Using data from #jezrael:
df
C AF BF AT BT
0 10 1 4 100 70
1 20 2 7 102 66
2 30 3 11 200 90
3 40 4 13 300 178
4 50 5 18 403 200
pd.wide_to_long(
    df.rename(columns=lambda x: x[::-1]),
    stubnames=["F", "T"],
    i="C",
    sep="",
    j="PAR",
    suffix=".",
).reset_index()
C PAR F T
0 10 A 1 100
1 20 A 2 102
2 30 A 3 200
3 40 A 4 300
4 50 A 5 403
5 10 B 4 70
6 20 B 7 66
7 30 B 11 90
8 40 B 13 178
9 50 B 18 200
And if you use the pivot_longer function:
df.pivot_longer(index="C", names_to=("PAR", ".value"), names_pattern=r"(.)(.)")
C PAR F T
0 10 A 1 100
1 10 B 4 70
2 20 A 2 102
3 20 B 7 66
4 30 A 3 200
5 30 B 11 90
6 40 A 4 300
7 40 B 13 178
8 50 A 5 403
9 50 B 18 200
pivot_longer is being worked on; in the next release of pyjanitor it should be much better. But pd.wide_to_long can solve your task pretty easily, and the other answers can easily solve it as well.
The idea is to create a MultiIndex in the columns from the last and first letters of each column name, then use DataFrame.stack to reshape, and finally do some cleanup of the MultiIndex in the index:
df.columns = [df.columns.str[-1], df.columns.str[0]]
df = (df.stack()
        .reset_index(level=0, drop=True)
        .rename_axis('PAR')
        .reset_index())
print (df)
PAR F T
0 A 1 100
1 B 4 70
2 A 2 102
3 B 7 66
4 A 3 200
5 B 11 90
6 A 4 300
7 B 13 178
8 A 5 403
9 B 18 200
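As a self-contained sketch of the same idea (again reconstructing the sample frame, and spelling out the MultiIndex construction explicitly):
import pandas as pd

df = pd.DataFrame({'AF': [1, 2, 3, 4, 5],
                   'BF': [4, 7, 11, 13, 18],
                   'AT': [100, 102, 200, 300, 403],
                   'BT': [70, 66, 90, 178, 200]})

# Split each column name into (last letter, first letter) -> MultiIndex columns
df.columns = pd.MultiIndex.from_arrays([df.columns.str[-1],   # F / T
                                        df.columns.str[0]])   # A / B

# Stack the A/B level into the rows, then tidy up the index
out = (df.stack()
         .reset_index(level=0, drop=True)
         .rename_axis('PAR')
         .reset_index())
print(out)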
EDIT:
print (df)
C AF BF AT BT
0 10 1 4 100 70
1 20 2 7 102 66
2 30 3 11 200 90
3 40 4 13 300 178
4 50 5 18 403 200
df = df.set_index('C')
df.columns = pd.MultiIndex.from_arrays([df.columns.str[-1],
                                        df.columns.str[0]],
                                       names=[None, 'PAR'])
df = df.stack().reset_index()
print (df)
C PAR F T
0 10 A 1 100
1 10 B 4 70
2 20 A 2 102
3 20 B 7 66
4 30 A 3 200
5 30 B 11 90
6 40 A 4 300
7 40 B 13 178
8 50 A 5 403
9 50 B 18 200
Let's try:
(pd.wide_to_long(df.reset_index(), stubnames=['A', 'B'],
                 i='index',
                 j='PAR', sep='', suffix='[FT]')
   .stack().unstack('PAR').reset_index(level=1)
)
Output:
PAR level_1 F T
index
0 A 1 100
0 B 4 70
1 A 2 102
1 B 7 66
2 A 3 200
2 B 11 90
3 A 4 300
3 B 13 178
4 A 5 403
4 B 18 200

Pandas Python highest 2 rows of every 3 and tabling the results

Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want to do is possible. I want to compare every three rows of Column1, take the highest 2 of those three rows, and assign the corresponding 2 Column2 values to a new column. It does not matter whether the Column3 values are joined or arranged in any particular order, because I know every 2 rows of Column3 belong to every 3 rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array which groups every 3 values.
Then use GroupBy.nlargest, extract the indices of those values using pd.Index.get_level_values, and assign them to Column3; pandas handles the index alignment.
n_grps = len(df) // 3
g = np.repeat(np.arange(n_grps), 3)
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get the output exactly as shown in the question, you have to sort within each group so the NaN is pushed to the end, which requires this additional step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x:x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
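For reference, an end-to-end sketch of this answer with the sample data rebuilt (the DataFrame construction is an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column1': [25, 89, 59, 78, 99, 38, 89, 57, 87],
                   'Column2': [1, 2, 3, 10, 20, 30, 100, 200, 300]})

# Label every block of 3 consecutive rows with the same group id
g = np.repeat(np.arange(len(df) // 3), 3)

# Index positions of the 2 largest Column1 values inside each block of 3
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)

# Pull the matching Column2 values; index alignment leaves the third row of each block NaN
df['Column3'] = df.loc[idx, 'Column2']
print(df)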

Labeling by period

My dataset:
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values for these days, but I want to label them because there are too many unique day numbers.
I would like to label them consistently.
Is there a way to speed up labeling by cutting every 7 days (one week)?
For example, days 1-7 = week 1, days 8-14 = week 2, and so on.
output what I want
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
Thank you for reading.
Subtract 1, then use integer division by 7, and finally add 1:
df['week'] = (df['day'] - 1) // 7 + 1
print (df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
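As a side note, the same binning can also be written as a ceiling division. This is just an equivalent sketch, assuming day numbers are positive integers starting at 1:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': list('AAAABBBCC'),
                   'day': [7, 15, 21, 29, 21, 30, 35, 8, 16],
                   'value': [88, 101, 121, 56, 131, 78, 102, 80, 101]})

# (day - 1) // 7 + 1 is the same as ceil(day / 7) for positive integer days
df['week'] = np.ceil(df['day'] / 7).astype(int)
print(df)
#   name  day  value  week
# 0    A    7     88     1
# 1    A   15    101     3
# ...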

Comparing two consecutive rows and creating a new column based on a specific logical operation

I have a data frame with two columns, 'xPos' and 'lineNum':
import pandas as pd
data = '''\
xPos lineNum
40 1
50 1
75 1
90 1
42 2
75 2
110 2
45 3
70 3
95 3
125 3
38 4
56 4
74 4'''
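The question does not show how df is built from this string; one common way to do it (an assumption on my part, not from the original post) is:
import io

# Hypothetical reconstruction: parse the whitespace-separated block into a DataFrame
df = pd.read_csv(io.StringIO(data), sep=r'\s+')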
I have created the aggregate data frame for this using the command:
aggrDF = df.describe(include='all')
I am interested in the minimum of the xPos value, so I get it using:
minxPos = aggrDF.loc['min', 'xPos']
Desired output
data = '''\
xPos lineNum xDiff
40 1 2
50 1 10
75 1 25
90 1 15
42 2 4
75 2 33
110 2 35
45 3 7
70 3 25
95 3 25
125 3 30
38 4 0
56 4 18
74 4 18'''
The logic
I want to compare two consecutive rows of the data frame and calculate a new column based on this logic:
if( df['lineNum'] != df['lineNum'].shift(1) ):
    df['xDiff'] = df['xPos'] - minxPos
else:
    df['xDiff'] = df['xPos'] - df['xPos'].shift(1)
Essentially, I want the new column to hold the difference between two consecutive rows in the df, as long as the line number is the same.
If the line number changes, the xDiff column should instead hold the difference from the minimum xPos value that I have from the aggregate data frame.
Can you please help? Thanks.
These two lines should do it:
df['xDiff'] = df.groupby('lineNum').diff()['xPos']
df.loc[df['xDiff'].isnull(), 'xDiff'] = df['xPos'] - minxPos
>>> df
xPos lineNum xDiff
0 40 1 2.0
1 50 1 10.0
2 75 1 25.0
3 90 1 15.0
4 42 2 4.0
5 75 2 33.0
6 110 2 35.0
7 45 3 7.0
8 70 3 25.0
9 95 3 25.0
10 125 3 30.0
11 38 4 0.0
12 56 4 18.0
13 74 4 18.0
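An equivalent single-expression variant, assuming df and minxPos as defined above (a sketch that uses groupby diff with fillna instead of the boolean mask):
# Difference to the previous row within each line; the first row of each line
# (NaN after diff) falls back to xPos - minxPos
df['xDiff'] = (df.groupby('lineNum')['xPos'].diff()
                 .fillna(df['xPos'] - minxPos))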
You just need to group by lineNum and apply the condition you were already writing down:
df['xDiff'] = np.concatenate(
    df.groupby('lineNum')
      .apply(lambda x: np.where(x['lineNum'] != x['lineNum'].shift(1),
                                x['xPos'] - x['xPos'].min(),
                                x['xPos'].shift(1)).astype(int))
      .values)
df
Out[76]:
xPos lineNum xDiff
0 40 1 0
1 50 1 40
2 75 1 50
3 90 1 75
4 42 2 0
5 75 2 42
6 110 2 75
7 45 3 0
8 70 3 45
9 95 3 70
10 125 3 95
11 38 4 0
12 56 4 38
13 74 4 56
