Find average based upon two conditions; create column from these averages - python

I have a df with weather reporting data. It has over 2 million rows and the following columns.
ID MONTH TEMP
1 1 0
1 1 10
2 1 50
2 1 60
3 1 80
3 1 90
1 2 0
1 2 10
2 2 50
2 2 60
3 2 80
3 2 90
I am looking to create a column for the average monthly temperature, and I need something faster than for-loops. The averages are computed from the TEMP column, and I would like them to be specific to each ID within each MONTH.
ID MONTH TEMP AVE MONTHLY TEMP
1 1 0 5
1 1 10 5
2 1 50 55
2 1 60 55
3 1 80 85
3 1 90 85
1 2 0 5
1 2 10 5
2 2 50 55
2 2 60 55
3 2 80 85
3 2 90 85

Use groupby.transform:
df['AVE MONTHLY TEMP']=df.groupby(['ID','MONTH'])['TEMP'].transform('mean')
print(df)
Output
ID MONTH TEMP AVE MONTHLY TEMP
0 1 1 0 5
1 1 1 10 5
2 2 1 50 55
3 2 1 60 55
4 3 1 80 85
5 3 1 90 85
6 1 2 0 5
7 1 2 10 5
8 2 2 50 55
9 2 2 60 55
10 3 2 80 85
11 3 2 90 85
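As a quick check of the approach above, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and compares transform('mean') with an aggregate-then-merge version. The merge variant and the column name MEAN_VIA_MERGE are only there for comparison and are not part of the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID':    [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    'MONTH': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'TEMP':  [0, 10, 50, 60, 80, 90, 0, 10, 50, 60, 80, 90],
})

# transform('mean') returns one value per original row, aligned on the index,
# so the result can be assigned straight back as a new column.
df['AVE MONTHLY TEMP'] = df.groupby(['ID', 'MONTH'])['TEMP'].transform('mean')

# Two-step version for comparison: aggregate once per group, then merge back.
means = (df.groupby(['ID', 'MONTH'], as_index=False)['TEMP'].mean()
           .rename(columns={'TEMP': 'MEAN_VIA_MERGE'}))
merged = df.merge(means, on=['ID', 'MONTH'], how='left')

print(merged)
```

Both columns should agree row by row; transform simply skips the intermediate table and the merge.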

I think this solution may work better if you have millions of rows, since the (ID, MONTH) groupings may repeat. It assumes that rows with the same ID always sit in contiguous blocks, as they do in your sample data. I'm trying to think outside the box here, as you said you have over two million rows:
df['AVG MONTHLY TEMP'] = df.groupby(df['ID'].ne(df['ID'].shift()).cumsum(), as_index=False)['TEMP'].transform('mean')
Also, if your temperatures are ALWAYS grouped in pairs, you can use this formula as well (it needs import numpy as np):
df.groupby(np.arange(len(df))//2)['TEMP'].transform('mean')
output:
ID MONTH TEMP AVG MONTHLY TEMP
0 1 1 0 5
1 1 1 10 5
2 2 1 50 55
3 2 1 60 55
4 3 1 80 85
5 3 1 90 85
6 1 2 0 5
7 1 2 10 5
8 2 2 50 55
9 2 2 60 55
10 3 2 80 85
11 3 2 90 85
I hope this helps or gives you some ideas, as a couple of million rows is a lot of data.
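To make the contiguity assumption concrete, here is a small sketch on hypothetical data (the two frames below are made up for illustration, not taken from the question). It shows the run labels produced by ne/shift/cumsum, and a case where grouping by runs stops matching grouping by (ID, MONTH).

```python
import pandas as pd

# Hypothetical frame where a different ID always separates two months.
df = pd.DataFrame({
    'ID':    [1, 1, 2, 2, 1, 1],
    'MONTH': [1, 1, 1, 1, 2, 2],
    'TEMP':  [0, 10, 50, 60, 20, 40],
})

# A new run starts wherever ID differs from the previous row.
run_id = df['ID'].ne(df['ID'].shift()).cumsum()
by_run = df.groupby(run_id)['TEMP'].transform('mean')
by_key = df.groupby(['ID', 'MONTH'])['TEMP'].transform('mean')
print(run_id.tolist())        # run label for each row
print(by_run.equals(by_key))  # the two groupings agree on this frame

# Hypothetical frame with a single ID: the months collapse into one run,
# so the run-based mean mixes month 1 and month 2.
single = pd.DataFrame({
    'ID':    [1, 1, 1, 1],
    'MONTH': [1, 1, 2, 2],
    'TEMP':  [0, 10, 20, 40],
})
single_run = single['ID'].ne(single['ID'].shift()).cumsum()
by_run_single = single.groupby(single_run)['TEMP'].transform('mean')
by_key_single = single.groupby(['ID', 'MONTH'])['TEMP'].transform('mean')
print(by_run_single.tolist())
print(by_key_single.tolist())
```

So the run-based shortcut is only safe when two adjacent months never share the same ID at their boundary.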

Pandas: groupby and get tail based on some column value

I have a dataframe looks like
id week value
1 1 15
1 2 29
1 3 49
1 3 19
2 6 10
2 7 99
2 8 53
How can I extract the rows that belong to the last 2 weeks for each id?
It's like tail, but applied to distinct weeks rather than to records.
Desired output
id week value
1 2 29
1 3 49
1 3 19
2 7 99
2 8 53
This is more like factorizing the weeks within each id (in reverse order) and then picking the last two distinct values of each group
m = df.iloc[::-1].groupby('id')['week'].transform(lambda x :x.factorize()[0]).isin([0,1])
out = df[m]
id week value
1 1 2 29
2 1 3 49
3 1 3 19
5 2 7 99
6 2 8 53
Or we can fix the tail approach with drop_duplicates
df.merge(df.drop_duplicates(['id','week']).groupby('id').tail(2).drop(columns='value'))
id week value
0 1 2 29
1 1 3 49
2 1 3 19
3 2 7 99
4 2 8 53
Assuming the data has been sorted by id and week, groupby tail will do the job
df.groupby('id').tail(2)
Revision:
(df[['id', 'week']]
.drop_duplicates()
.groupby('id')
.tail(2)
.merge(df)
)
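As an alternative to the answers above (not taken from them), a dense rank computed per id can express "last two distinct weeks" directly, and it does not rely on the rows being sorted. This is a minimal sketch on the sample data from the question; week_rank is just an illustrative name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 2, 2],
    'week':  [1, 2, 3, 3, 6, 7, 8],
    'value': [15, 29, 49, 19, 10, 99, 53],
})

# Dense rank, descending: the latest week in each id gets rank 1, the next
# distinct week gets rank 2, and rows with the same week share a rank.
week_rank = df.groupby('id')['week'].rank(method='dense', ascending=False)

# Keep every row that falls in one of the two latest weeks of its id.
out = df[week_rank <= 2]
print(out)
```

Changing the 2 to another number generalizes this to the last n distinct weeks.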

In pandas, how to operate on the row with the first instance of a string?

I have a csv file, and I'm trying to convert a column with cumulative values to individual values. I can form most of the column with
df['delta'] = df['expenditure'].diff()
For each person (A, B, ...) I want the change in expenditure since they last attended. The code above gives me
person days expenditure delta
A 1 10
A 2 24 14
A 10 45 21
B 2 0 -45
B 7 2 2
B 8 10 8
C 5 50 40
C 6 78 28
C 7 90 12
and what I want is
person days expenditure delta
A 1 10 ---> 10
A 2 24 14
A 10 45 21
B 2 0 ---> 0
B 7 2 2
B 8 10 8
C 5 50 ---> 50
C 6 78 28
C 7 90 12
so for each person, I want their lowest day's expenditure value put in delta.
Additionally, if I'm trying to average delta by the days, how would I go about it? That is if I wanted
person days expenditure delta
A 1 10 10
A 2 24 14
A 10 45 21/8
B 2 0 0
B 7 2 2/5
B 8 10 8
So 21/8 is the (change in expenditure)/(change in days) for A
Use DataFrameGroupBy.diff, then replace the first missing value of each group with the original value using Series.fillna:
df['delta'] = df.groupby('person')['expenditure'].diff().fillna(df['expenditure'])
print (df)
person days expenditure delta
0 A 1 10 10.0
1 A 2 24 14.0
2 A 10 45 21.0
3 B 2 0 0.0
4 B 7 2 2.0
5 B 8 10 8.0
6 C 5 50 50.0
7 C 6 78 28.0
8 C 7 90 12.0
For the second part, it is possible to process both columns and then divide them with DataFrame.eval:
df['delta'] = (df.groupby('person')[['expenditure', 'days']].diff()
.fillna(df[['expenditure','days']])
.eval('expenditure / days'))
This works the same as:
df['delta'] = (df.groupby('person')['expenditure'].diff().fillna(df['expenditure'])
.div(df.groupby('person')['days'].diff().fillna(df['days'])))
print (df)
person days expenditure delta
0 A 1 10 10.000
1 A 2 24 14.000
2 A 10 45 2.625
3 B 2 0 0.000
4 B 7 2 0.400
5 B 8 10 8.000
6 C 5 50 10.000
7 C 6 78 28.000
8 C 7 90 12.000
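Note that in the output above the first row of each person is expenditure divided by days (for example 50 / 5 = 10 for C). The question's table does not settle whether that is wanted or whether the first row should keep the raw expenditure. Assuming the latter, a possible variant is sketched below; the column name rate and the helper names are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'person': list('AAABBBCCC'),
    'days': [1, 2, 10, 2, 7, 8, 5, 6, 7],
    'expenditure': [10, 24, 45, 0, 2, 10, 50, 78, 90],
})

g = df.groupby('person')
d_exp = g['expenditure'].diff()   # NaN on each person's first row
d_days = g['days'].diff()
is_first = d_exp.isna()

# Plain delta: first row of each person keeps the original expenditure.
df['delta'] = d_exp.fillna(df['expenditure'])

# Rate: change in expenditure per day, except on each person's first row,
# which keeps the raw expenditure (an assumption about the intent).
df['rate'] = (d_exp / d_days).where(~is_first, df['expenditure'])
print(df)
```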

Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (Weight, Age) and (Weight, Height); however, when I ran this example, I found out it was doing something else.
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data,columns=["Lbl","Weight","Age","Height"])
print (df)
def group_fea(df,key,target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key+target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
#Add feature combinations
feature_key = ['Weight']
feature_target = ['Age','Height']
for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df,key,target)
        df = df.merge(tmp,on=key,how='left')
print (df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique WeightHeight_nunique mean
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
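The explanation above can be made visible by looking at the per-Weight table before it is merged back. This is a minimal sketch that uses named aggregation (available since pandas 0.25) in place of the dict-renaming form from the question, which, as far as I know, newer pandas versions no longer accept; the variable name counts is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
data = [[0, 33, 15, 4], [1, 44, 12, 3], [0, 44, 12, 5], [1, 33, 15, 4],
        [0, 77, 13, 4], [1, 33, 15, 4], [1, 99, 40, 7], [0, 58, 45, 4],
        [1, 11, 13, 4]]
df = pd.DataFrame(data, columns=["Lbl", "Weight", "Age", "Height"])

# One row per Weight: how many distinct Age / Height values share that Weight.
counts = df.groupby('Weight').agg(
    WeightAge_nunique=('Age', 'nunique'),
    WeightHeight_nunique=('Height', 'nunique'),
).reset_index()
print(counts)

# The left merge broadcasts those per-Weight counts back onto every row.
out = df.merge(counts, on='Weight', how='left')
print(out)
```

In the intermediate table, only Weight 44 has two distinct Heights, which is why only those rows show a 2.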
Let us try transform
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1

Subtract fixed row value in reference to column value in pandas dataframe

I would like to subtract a fixed row value in rows, in reference to their values in another column.
My data looks like this:
TRACK TIME POSITION_X
0 1 0 12
1 1 30 13
2 1 60 15
3 1 90 11
4 2 0 10
5 2 20 11
6 2 60 13
7 2 90 17
I would like to subtract a fixed row value (WHEN TIME=0) of the POSITION_X column in reference to the TRACK column, and create a new column ("NEW_POSX") with those values. The output should be like this:
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
I have been using the following code to get this done:
import pandas as pd
data = {'TRACK': [1,1,1,1,2,2,2,2],
'TIME': [0,30,60,90,0,20,60,90],
'POSITION_X': [12,13,15,11,10,11,13,17],
}
df = pd.DataFrame (data, columns = ['TRACK','TIME','POSITION_X'])
df['NEW_POSX']= df.groupby('TRACK')['POSITION_X'].diff().fillna(0).astype(int)
df.head(8)
... but I don't get the desired output. Instead, I get a new column where every row is subtracted by the previous row (according to the "TRACK" column):
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 2
3 1 90 11 -4
4 2 0 10 0
5 2 20 11 1
6 2 60 13 2
7 2 90 17 4
can anyone help me with this?
You can use transform and first to get the value at time 0 (this assumes the TIME=0 row comes first within each track), and then subtract it from the 'POSITION_X' column:
s=df.groupby('TRACK')['POSITION_X'].transform('first')
df['NEW_POSX']=df['POSITION_X']-s
#Same as:
#df['NEW_POSX']=df['POSITION_X'].sub(s)
Output:
df
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
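The transform('first') approach depends on the TIME=0 row being the first row of each track. If the row order cannot be trusted, one possible variant is to look up the baseline explicitly. This is a sketch that assumes every track has exactly one row with TIME == 0; the name baseline is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'TRACK': [1, 1, 1, 1, 2, 2, 2, 2],
    'TIME': [0, 30, 60, 90, 0, 20, 60, 90],
    'POSITION_X': [12, 13, 15, 11, 10, 11, 13, 17],
})

# Baseline position per track, taken from the row where TIME is 0.
# Assumption: exactly one such row exists for every track.
baseline = df.loc[df['TIME'].eq(0)].set_index('TRACK')['POSITION_X']

# Map each row's TRACK to its baseline and subtract.
df['NEW_POSX'] = df['POSITION_X'] - df['TRACK'].map(baseline)
print(df)
```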

How to check if the value is between two consecutive rows in dataframe or numpy array?

I need to write code that checks whether a certain value lies between 2 consecutive rows, for example:
row <50 < next row
meaning if the value is between row and its consecutive row.
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
The output is:
A
0 67
1 78
2 53
3 44
4 84
5 2
6 63
7 13
8 56
9 24
What I'd like to do is to check if (let's say I have a set value) "50" is between all consecutive rows.
Say, we check if 50 is between 67 and 78 and then between 78 and 53, obviously the answer is no, therefore in column B the result would be 0.
Now, if we check if 50 is between 53 and 44, then we'll get 1 in column B and we'll use cumsum() to count how many times the value of 50 is between consecutive rows in column A.
UPDATE: Let's say, if I have column C where I have 2 categories only: 1 and 2. How would I ensure that the check is performed within each of the categories separately? In other words, the check is reset once the category changes?
The desired output is:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1
Greatly appreciate your help.
Let's just subtract "50" from series and check sign change:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[67,78,53,44,84,2,63,13,56,24]}, columns=list('A'))
s = df['A'] - 50
df['count'] = np.sign(s).diff().fillna(0).ne(0).cumsum()
print(df)
Output:
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
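To see why the sign trick above works, it can help to keep the intermediate steps as columns. This is a small sketch on the same data; the column names sign and crossed and the variable threshold are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [67, 78, 53, 44, 84, 2, 63, 13, 56, 24]})
threshold = 50

steps = pd.DataFrame({'A': df['A']})
# +1 above the threshold, -1 below it (0 if a value equals the threshold).
steps['sign'] = np.sign(df['A'] - threshold)
# The sign changes exactly when the threshold lies between a row and the
# previous row; the first row has no predecessor, so its NaN becomes 0.
steps['crossed'] = steps['sign'].diff().fillna(0).ne(0)
# Running total of crossings.
steps['count'] = steps['crossed'].cumsum()
print(steps)
```

One thing to keep in mind: a value exactly equal to the threshold has sign 0, so touching the threshold also registers as a change with this approach.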
This should work:
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
or
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 45 0
1 53 1
2 44 2
3 87 3
4 47 4
5 13 4
6 20 4
7 89 5
8 81 5
9 53 5
Would your second output look like this:
df
A B C
0 67 0 1
1 78 0 1
2 53 0 1
3 44 1 2
4 84 2 1
5 2 3 2
6 63 4 1
7 13 5 2
8 56 6 1
9 24 7 1
df_new = df
what = ((df_new.A < 50) | (50 > df_new.A.shift())) & ((df_new.A > 50) | (50 < df_new.A.shift())) & ((df_new.C == df_new.C.shift() ))
df['count'] = what.astype(int).cumsum()
df
Output:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1
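The same per-category idea can also be combined with the sign-change check. The sketch below treats "within a category" the way the code above does, meaning a crossing only counts when the row and the previous row share the same C value; that reading is an assumption, and the names crossed and same_cat are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [67, 78, 53, 44, 84, 2, 63, 13, 56, 24],
    'C': [1, 1, 1, 2, 1, 2, 1, 2, 1, 1],
})

# True where 50 lies between this row and the previous one.
crossed = np.sign(df['A'] - 50).diff().fillna(0).ne(0)
# True where this row belongs to the same category as the previous row.
same_cat = df['C'].eq(df['C'].shift())

# Count only the crossings that happen inside an unbroken category run.
df['count'] = (crossed & same_cat).cumsum()
print(df)
```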
