Iterating over groups into a dataframe - python

A B
0 2002-01-16 0
1 2002-01-16 4
2 2002-01-16 -2
3 2002-01-16 11
4 2002-01-16 12
5 2002-01-17 0
6 2002-01-17 -18
7 2002-01-17 16
8 2002-01-18 0
9 2002-01-18 -1
10 2002-01-18 4
results = {}
grouped = df.groupby("A")
for name, group in grouped:
    if (df["B"] >= 10).any():
        results[name] = df.loc[df["B"] >= 10].head(1)
        print(results[name])
    elif (df["B"] <= -10).any():
        results[name] = df.loc[df["B"] <= -10].head(1)
        print(results[name])
    else:
        results[name] = df.loc[df["B"] > -10, :].tail(1)
        print(results[name])
Output:
A B
3 2002-01-16 11
A B
3 2002-01-16 11
A B
3 2002-01-16 11
I want to iterate and get one result for each A group, with the following conditions:
If any B column value is >= 10 or <= -10, add just the first such row to "results" and skip to the next group to continue iterating.
If there is no B column value >= 10 or <= -10, add the last row to "results" and skip to the next group to continue iterating.
The desired output would be:
A B
3 2002-01-16 11
A B
6 2002-01-17 -18
A B
10 2002-01-18 4

Your code contains two errors that prevent the correct output. The first, and most obvious, is that you are not using your group in the for loop. Instead you operate on the full df frame. That is why you get the same result for every entry.
When that is fixed, you will get almost the expected result. Not exactly, though, due to your second mistake. According to your description, you want to treat >= 10 and <= -10 the same way. Your code, however, runs the greater-than check first, and if that one succeeds it generates your output. Thus, the result for group 2002-01-17 will be 16 rather than -18.
The fix for the second problem is to ensure that you test both conditions in the same if clause, generally using an or. In your current situation, however, it is possible to collapse these two tests into one using absolute values (the abs() function). This is somewhat of a special case (albeit a pretty common one), though; it is good to understand both this and the more general way, using or.
This reduces the number of cases to two, removing the elif branch. In addition, some minor modifications increase readability. Taken together, that leaves you with something similar to:
results = {}
grouped = df.groupby("A")
for name, group in grouped:
    if (abs(group["B"]) >= 10).any():
        results[name] = group[abs(group["B"]) >= 10].head(1)
    else:
        results[name] = group.tail(1)
    print(results[name])
which generates the wanted output:
A B
3 2002-01-16 11
A B
6 2002-01-17 -18
A B
10 2002-01-18 4
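For reference, the more general form mentioned above, testing both conditions with an or, would look like this (a sketch, reconstructing the sample frame from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": ["2002-01-16"] * 5 + ["2002-01-17"] * 3 + ["2002-01-18"] * 3,
                   "B": [0, 4, -2, 11, 12, 0, -18, 16, 0, -1, 4]})

results = {}
for name, group in df.groupby("A"):
    # the general form: both thresholds combined in one boolean mask
    mask = (group["B"] >= 10) | (group["B"] <= -10)
    if mask.any():
        results[name] = group[mask].head(1)
    else:
        results[name] = group.tail(1)
```

Both versions produce identical results here; the or form is the one to reach for when the two thresholds are not symmetric around zero.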

Here's another way, following your approach:
# we'll use this function to get output
def get_values(df):
    # rows passing either threshold
    extreme = df.loc[(df["B"] >= 10) | (df["B"] <= -10), "B"]
    # check whether any row passes the condition
    if not extreme.empty:
        # take the first such value
        val = extreme.head(1)
    else:
        val = df["B"].tail(1)
    return val

df.groupby("A").apply(get_values)
A
2002-01-16 3 11
2002-01-17 6 -18
2002-01-18 10 4
Name: B, dtype: int64

If you don't want to use a loop, try this:
df["C"] = df["B"].apply(lambda x: abs(x)>=10)
df.groupby("A", as_index=False).apply(lambda x: x[x["C"]].head(1) if not x[x["C"]].empty else x.tail(1))[["A", "B"]]
Result:
A B
0 3 2002-01-16 11
1 6 2002-01-17 -18
2 10 2002-01-18 4

Related

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1
for i in range(len(df)):
    if i != 0:
        df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations
pandas-rolling-apply-using-multiple-columns for reference
I have also included a simpler calc to demonstrate behaviour in simpler way
import numpy as np
import pandas as pd
import numpy_ext as npe

df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1

# for i in range(len(df)):
#     if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25

# SO example - function of previous values in A and B
def f(A, B):
    r = np.sum(A[:-1] / 3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]
    return r

# much simpler example, sum of previous values
def g(A):
    return np.sum(A[:-1])

df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
Sample output:
|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |          1 |           0 |
|  1 |   2 |   5 |   -29.6667 |           1 |
|  2 |   2 |   5 |        -59 |           3 |
|  3 |   2 |   5 |   -88.3333 |           5 |
|  4 |   2 |   5 |   -117.667 |           7 |
|  5 |   2 |   5 |       -147 |           9 |
|  6 |   2 |   5 |   -176.333 |          11 |
|  7 |   2 |   5 |   -205.667 |          13 |
|  8 |   2 |   5 |       -235 |          15 |
|  9 |   2 |   5 |   -264.333 |          17 |
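For the simpler running-sum case (g above), plain pandas can produce the same column without numpy_ext: a cumulative sum shifted down one row. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 2, 2]})
# sum of all previous values of A = cumulative sum shifted down by one row
df["A_running"] = df["A"].cumsum().shift(fill_value=0)
print(df["A_running"].tolist())  # [0, 1, 3, 5, 7]
```

Note that the AB_combo-style recurrence, where each row depends on the previously computed result, cannot be expressed with shift alone; that is exactly what expanding_apply (or an explicit loop) is for.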

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
    if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
        test = (i-numPoints):i;
        ind_noaaHigh_end(test) = 1;
        n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
        co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
        co_noaaHigh = [co_noaaHigh mean(co_c(test))];
        h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
    end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = []
co2_noaaHigh = []
co_noaaHigh = []
h2o_noaaHigh = []
t_c_High = []  # time
for i in range(len(valveW_c)):
    # NOAA HIGH
    if valveW_c[i] == 1:
        t_c_High.append(t_c[i])
        n2o_noaaHigh.append(n2o_c[i])
        co2_noaaHigh.append(co2_c[i])
        co_noaaHigh.append(co_c[i])
        h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use groupby method to get the average for the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle
1 246.9768
2 246.6579
3 246.7269
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
    results[k] = v
    print(f'[group {k}]')
    print(v)
shift(), as its name suggests, shifts the valve column by one row, which lets us detect changes in the number sequence. cumsum() then assigns a unique number to each run of identical values. We can then do a groupby() on this column (which was not possible before, because the groups consisted only of ones or twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then to get the mean for each cycle; you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
    if len(v) < n:
        mean_n_last.append(np.nan)
    else:
        mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 !
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted, you can easily sort it like this (note that sort_values returns a new DataFrame, so assign the result back):
df = df.sort_values(by=['time'])
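Combining the cycle-numbering idea with the last-N averaging, here is a sketch that produces one mean per cycle while keeping track of the valve number (using the snippet's data):

```python
import pandas as pd

df = pd.DataFrame({'gas1': [246.9438, 247.5367, 246.7167, 246.6770, 245.9197,
                            245.9518, 246.9207, 246.1517, 246.9015, 246.3712, 247.0826],
                   'valveW': [2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2],
                   'time': range(1, 12)})

n = 3
# number each run of identical valve values
cycle = (df['valveW'].diff() != 0).cumsum().rename('cycle')
# mean of the last n points of each cycle, indexed by (valve, cycle)
cycle_means = df.groupby([df['valveW'], cycle])['gas1'].apply(lambda s: s.tail(n).mean())
print(cycle_means)
```

The resulting Series is indexed by (valveW, cycle), so the per-valve averages can be pulled out with `cycle_means.loc[1]` and `cycle_means.loc[2]`.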

replicating data in same dataFrame

I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate the dataframe when going through a loop and there is a difference greater than 4 in row.hour.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate the rows when, iterating through all the rows, there is a difference greater than 4 in row.hour:
row.hour[0] = 1 and row.hour[1] = 2: here the difference is only 1. But between row.hour[2] = 4 and row.hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the whole dataframe with df = pd.concat([df]*2, ignore_index=False), but it does not replicate anything when I run it with an if statement.
I tried the code below but nothing is happening.
for i in range(0, len(df)-1):
    if (df.iloc[i,0] - df.iloc[i+1,0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4 you want to add the previous row to the DF.
If that is what you want try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d)-2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4 - 10 instead of 10 - 4, so you check -6 > 4 instead of 6 > 4.
You have to swap the items:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would see it.
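A loop-free sketch of the same duplication is also possible (assuming, as in f1 above, that the row two positions before each gap larger than 4 is the one to copy):

```python
import pandas as pd

df = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                   'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})

# positions p where Hour jumps by more than 4 relative to row p-1
gap_pos = df.index[df['Hour'].diff() > 4]
# duplicate the row two positions before each such gap
dups = df.loc[gap_pos - 2]
out = pd.concat([df, dups]).sort_index().reset_index(drop=True)
print(out)
```

This reproduces f1's output without iterating row by row; diff() does the pairwise comparison in one vectorized step.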

Is there a pandas function to sum a set number of previous row elements in a dataframe?

I'm trying to create a function which can look at previous rows in a DataFrame and sum them based on a set number of rows to look back over. Here I have used 3, but ideally I would like to scale it up to look back over more rows. My solution works but doesn't seem very efficient. The other criterion is that each time a new team is reached the count must start again, so the first row for each new team is always 0. The data will be ordered by team, but if a solution is known for data that isn't in team order, that would be incredible.
Is there a function in Pandas which could help with this?
So far I've tried the code below and tried googling the issue, the closest example I could find is: here! but this groups the index and I'm unsure how to apply this when the value has to keep resetting each time it hits a new team, as it wouldn't distinguish each time there is a new team.
import numpy as np
import pandas as pd

np.random.seed(0)
data = {'team': ['a','a','a','a','a','a','a','a','b','b',
                 'b','b','b','b','b','b','c','c','c','c','c','c','c','c'],
        'teamPoints': np.random.randint(0, 4, 24)}
df = pd.DataFrame.from_dict(data)
df.reset_index(inplace=True)
def find_sum_last_3(x):
    if x == 0:
        return 0
    elif x == 1:
        return df['teamPoints'][x-1]
    elif x == 2:
        return df['teamPoints'][x-1] + df['teamPoints'][x-2]
    elif df['team'][x] != df['team'][x-1]:
        return 0
    elif df['team'][x] != df['team'][x-2]:
        return df['teamPoints'][x-1]
    elif df['team'][x] != df['team'][x-3]:
        return df['teamPoints'][x-1] + df['teamPoints'][x-2]
    else:
        return (df['teamPoints'][x-1] + df['teamPoints'][x-2] +
                df['teamPoints'][x-3])

df['team_form_3games'] = df['index'].apply(lambda x: find_sum_last_3(x))
The first part of the function addresses the edge cases where a sum of 3 isn't possible because there are less than 3 elements
The second part of the function addresses the problem of the 'team' changing. When the team changes, the sum needs to start again, so each 'team' is considered separately.
The final part simply looks at the previous 3 elements of the dataFrame and sums them together.
This example works as expected and gives a new column with expected output as follows:
0, 0, 3, 4, 4, 4, 6, 9, 0, 1, 4, 5, 6, 3, 5, 5, 0, 0, 0, 2, 3, 5, 6, 8
The 1st element is 0 as it is an edge case; the 2nd is 0 because the sum of the first element is 0. The 3rd is 3, as the sum of the 1st and 2nd elements is 3. The 4th is the sum of the 1st, 2nd and 3rd; the 5th of the 2nd, 3rd and 4th; the 6th of the 3rd, 4th and 5th.
However, when scaled up to 10, this approach proves very inefficient, which makes it difficult to extend to 10 or 15. It is also inelegant, and a new function needs to be written for each different window length.
I think you are looking for GroupBy.apply + rolling:
r3=df.groupby('team')['teamPoints'].apply(lambda x: x.rolling(3).sum().shift())
r2=df.groupby('team')['teamPoints'].apply(lambda x: x.rolling(2).sum().shift())
r1=df.groupby('team')['teamPoints'].apply(lambda x: x.shift())
df['team_form_3games'] = r3.fillna(r2.fillna(r1).fillna(0))
print(df)
Output:
index team teamPoints team_form_3games
0 0 a 0 0.0
1 1 a 3 0.0
2 2 a 1 3.0
3 3 a 0 4.0
4 4 a 3 4.0
5 5 a 3 4.0
6 6 a 3 6.0
7 7 a 3 9.0
8 8 b 1 0.0
9 9 b 3 1.0
10 10 b 1 4.0
11 11 b 2 5.0
12 12 b 0 6.0
13 13 b 3 3.0
14 14 b 2 5.0
15 15 b 0 5.0
16 16 c 0 0.0
17 17 c 0 0.0
18 18 c 2 0.0
19 19 c 1 2.0
20 20 c 2 3.0
21 21 c 3 5.0
22 22 c 3 6.0
23 23 c 2 8.0
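As a side note, the chain of fillna calls can be avoided by passing min_periods to rolling, which also makes the window length a single parameter, addressing the scaling concern (a sketch using the question's data):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'team': list('aaaaaaaabbbbbbbbcccccccc'),
                   'teamPoints': np.random.randint(0, 4, 24)})

window = 3  # scale to 10 or 15 by changing this one number
# shift() excludes the current row; min_periods=1 lets short windows
# at the start of each team produce partial sums instead of NaN
df['team_form'] = (df.groupby('team')['teamPoints']
                     .transform(lambda s: s.shift()
                                           .rolling(window, min_periods=1)
                                           .sum())
                     .fillna(0))
```

The remaining fillna(0) handles only the first row of each team, where shift() leaves nothing to sum.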

Pythonic way of copying values in an array with a complex rule

If the value in column a is 1 then the value of b is copied in column c until a is -1.
In the example below, a is 1 in row 2 and -1 in row 5. Then the second value in column b (13) is copied in column c from row 2 to 5.
row a b c
1 0 12 0
2 1 13 13
3 0 15 13
4 0 2 13
5 -1 19 13
6 0 34 0
7 0 11 0
8 1 23 23
9 0 14 23
10 -1 9 23
11 0 18 0
12 0 19 0
I've done this with a for loop, but there must be a more elegant way to do this manipulating series (I'm using pandas, numpy). All your help is greatly appreciated.
Here's a solution that does use a for loop but is pretty succinct while still being understandable.
I'm assuming you have the data stored in table, with a as table[:,0] and that a always appears as (1, -1)*, with 0 interspersed.
starts = table[:,0] == 1
ends = table[:,0] == -1
for start, end in zip(starts.nonzero()[0], ends.nonzero()[0]):
    table[start:end+1, 2] = table[start, 1]
I bet there's some fancy way to get rid of that loop, but I'd also bet that it's harder to tell what's going on.
I agree with everyone else that if you post what you currently have it'd help to go from there.
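Here is one fully vectorized sketch with pandas, using ffill to carry the start value and cumsum to mark the active span (the `+ (a == -1)` term keeps the closing row inside the span):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 0, -1, 0, 0, 1, 0, -1, 0, 0],
                   'b': [12, 13, 15, 2, 19, 34, 11, 23, 14, 9, 18, 19]})

# True from each a == 1 row through the matching a == -1 row
inside = (df['a'].cumsum() + (df['a'] == -1)).astype(bool)
# b at the start rows, carried forward
start_vals = df['b'].where(df['a'] == 1).ffill()
df['c'] = (start_vals * inside).fillna(0).astype(int)
print(df['c'].tolist())  # [0, 13, 13, 13, 13, 0, 0, 23, 23, 23, 0, 0]
```

This relies on the same assumption as the loop answer: the 1 and -1 markers strictly alternate, with zeros interspersed.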
