Pandas Sampling Every Time a Condition Is Met (python)

Given a pandas dataframe, I want to get the index of each row at which the running sum of the column values (up to and including the current row) becomes equal to or greater than n; the sum then restarts from zero. So for example, if our dataframe has values:
index colB
1 10
2 20
3 5
4 5
5 15
6 5
7 7
8 3
and say n=10, then the indices I want are [1, 2, 4, 5, 7], since at each of those rows the accumulated values of colB add up to at least 10.
So far, I can do a for-loop on this dataframe to get the indices I want, but when there are many rows, it is very slow. Therefore, I am seeking help on a faster method. Thanks!

There may be a clever way with some combination of cumsum(), but this is a tough problem because the running total needs to reset after the sum reaches n. So it's kind of like a rolling sum whose total, rather than its window, is capped at n.
I would probably use a custom function for this.
import pandas as pd

def go(s, n=10):
    r = []   # group id for each row
    c = 0    # current group counter
    cv = 0   # running sum within the current group
    for v in s.tolist():
        if v >= n:
            # a single value >= n closes a group on its own
            r.append(c)
            c += 1
        else:
            cv += v
            if cv >= n:
                # the running sum reached n: close the group and reset
                r.append(c)
                c += 1
                cv = 0
            else:
                r.append(c)
    return r

df = pd.DataFrame.from_dict(
    {'colB': {0: 10, 1: 20, 2: 5, 3: 5, 4: 15, 5: 5, 6: 1, 7: 3, 8: 1, 9: 12}})
df['group'] = go(df['colB'], n=10)
indices = df['group'].drop_duplicates().index
Note that I modified your example numbers a bit.
If df is this:
colB
0 10
1 20
2 5
3 5
4 15
5 5
6 1
7 3
8 1
9 12
indices is:
Int64Index([0, 1, 2, 4, 5, 9], dtype='int64')
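If the plain-Python scan is still too slow on large frames, the same logic compiles well with numba, since it is a single pass over one array. A sketch, assuming numba is installed (group_ids is a hypothetical helper name, not a pandas API):

import numpy as np
from numba import njit

@njit
def group_ids(values, n):
    out = np.empty(len(values), dtype=np.int64)
    c = 0   # current group id
    cv = 0  # running sum within the group
    for i in range(len(values)):
        v = values[i]
        if v >= n:
            out[i] = c
            c += 1
        else:
            cv += v
            out[i] = c
            if cv >= n:
                c += 1
                cv = 0
    return out

df['group'] = group_ids(df['colB'].to_numpy(), 10)
indices = df['group'].drop_duplicates().index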

Related

pandas for loop for running average does not work

I tried to make a kind of running average: out of 90 rows, every 3 values in column A should produce an average that is written to those same 3 rows in column B.
For example:
From this:
A  B
2  0
3  0
4  0
7  0
9  0
8  0
to this:
A  B
2  3
3  3
4  3
7  8
9  8
8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B. I know there is a better way to do it, and if anyone knows one, it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
You could group by the index floor-divided by 3, then use transform to compute the mean of each group and assign it to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead (with numpy imported as np), i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
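As to why your loop prints y but leaves B unchanged: df['B'][x] = y is chained indexing, and depending on the pandas version (always under copy-on-write, the default behaviour from pandas 3.0) the assignment can land on a temporary copy rather than on df itself. A minimal sketch of the fix, keeping your loop structure but writing through .loc:

import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0.0] * 6})

x = 0
while x < len(df):
    y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
    # .loc writes into df itself; the label slice x:x+2 is end-inclusive
    df.loc[x:x + 2, 'B'] = y
    x += 3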
An alternative is an explicit loop over fixed-size batches, again writing through .loc:
i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1.0] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)   # last row of this batch
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
For the corner case when len(df) % batch_size != 0, this takes the average over just the leftover rows.

data frame and list operation

There are three columns in df: mins, maxs, and col. I would like to generate a binary list according to the following rule: if col[i] is less than or equal to mins[i], append a 1 to the list, and keep appending 1 for each row until a row i+n where col[i+n] is greater than or equal to maxs[i+n]. From that row on, append 0 for each row until the next row where col is again less than or equal to mins. Repeat this process over all rows.
For example (the desired output is in parentheses):
col mins maxs
2 1 6 (0)
4 2 6 (0)
2 3 7 (1)
5 5 6 (1)
4 3 8 (1)
4 2 5 (1)
5 3 5 (0)
4 0 5 (0)
3 3 8 (1)
......
So the list would be [0,0,1,1,1,1,0,0,1]. Does this make sense?
I gave it a shot and wrote the following, which unfortunately did not achieve what I wanted.
def get_list(col, mins, maxs):
    l = []
    i = 0
    while i <= len(col):
        if col[i] <= mins[i]:
            l.append(1)
            while col[i+1] <= maxs[i+1]:
                l.append(1)
                i += 1
            break
        break
    return l
Thank you so much folks!
My answer may not be elegant but should work according to your expectation.
Import the pandas library.
import pandas as pd
Create the dataframe from the data provided.
input_data = {
    'col':  [2, 4, 2, 5, 4, 4, 5, 4, 3],
    'mins': [1, 2, 3, 5, 3, 2, 3, 0, 3],
    'maxs': [6, 6, 7, 6, 8, 5, 5, 5, 8]
}
dataframe_ = pd.DataFrame(data=input_data)
Iterate over the rows with a for loop. The binary_switch variable flips according to the conditions you provided, and its state populates the binary column.
binary_switch = False
for index, row in dataframe_.iterrows():
    if row['col'] <= row['mins']:
        binary_switch = True
    elif row['col'] >= row['maxs']:
        binary_switch = False
    binary_output = 1 if binary_switch else 0
    dataframe_.at[index, 'binary'] = binary_output
dataframe_['binary'] = dataframe_['binary'].astype('int')
print(dataframe_)
Output from code.
col mins maxs binary
0 2 1 6 0
1 4 2 6 0
2 2 3 7 1
3 5 5 6 1
4 4 3 8 1
5 4 2 5 1
6 5 3 5 0
7 4 0 5 0
8 3 3 8 1
Your rules give the following decision tree:
1: is col <= mins?
    True: l.append(1)
    False: next question
2: was col <= mins before?
    False: l.append(0)
    True: next question
3: is col >= maxs?
    True: l.append(0)
    False: l.append(1)
Making this into a function with an if/else tree, you get this:
def make_binary_list(df):
    l = []
    col_lte_mins = False
    for index, row in df.iterrows():
        col = row["col"]
        mins = row["mins"]
        maxs = row["maxs"]
        if col <= mins:
            col_lte_mins = True
            l.append(1)
        else:
            if col_lte_mins:
                if col >= maxs:
                    col_lte_mins = False
                    l.append(0)
                else:
                    l.append(1)
            else:
                l.append(0)
    return l
make_binary_list(df) gives [0, 0, 1, 1, 1, 1, 0, 0, 1]
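For long frames, the same on/off state machine can be vectorized instead of iterated: mark the rows that force the switch on or off, then forward-fill the state between them. A sketch under the same tie-breaking assumption as the if/elif above (col <= mins wins when both conditions hold on one row):

import numpy as np
import pandas as pd

def binary_column(df):
    state = pd.Series(np.nan, index=df.index)
    state[df['col'] >= df['maxs']] = 0  # rows that switch off
    state[df['col'] <= df['mins']] = 1  # rows that switch on (set second, so it wins ties)
    # carry the last switch state forward; rows before any switch default to 0
    return state.ffill().fillna(0).astype(int)

binary_column(dataframe_).tolist()  # [0, 0, 1, 1, 1, 1, 0, 0, 1]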

Checking the length of a part of a dataframe in conditional row selection in pandas

Suppose I have a pandas dataframe like this:
first second third
1 2 2 1
2 2 1 0
3 3 4 5
4 4 6 3
5 5 4 3
6 8 8 4
7 3 4 2
8 5 6 6
and could be created with the code:
dataframe = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5],
        'second': [2, 1, 4, 6, 4, 8, 4, 6],
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)
I want to select the rows in which the value of the second column is greater than the value of the first column, and at the same time the values in the third column are less than the values in the second column for k consecutive rows, where the last of those k rows is the row immediately before the one in which the second column exceeds the first; k can be any integer in the closed interval [2, 4].
So, the output should be rows:
3, 7, 8
To get the above-mentioned result using conditional row selection in pandas, I know I should write code like this:
dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index
But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?
The trick here is to compute a rolling sum over a boolean mask to count, among the k previous rows, how many have the third column less than the second. Note that a run of k >= 2 all-True rows always contains a run of 2, so checking k = 2 already covers 'any k between 2 and 4'.
k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)
print(df[m1 & m2])
first second third
3 3 4 5
7 3 4 2
8 5 6 6
I will focus my answer on the second part of your question. You need to use the shift function, which lets you compare a row against shifted rows.
Assuming your k is fixed at 2, you should do something like this:
import pandas as pd

df = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5],
        'second': [2, 1, 4, 6, 4, 8, 4, 6],
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)
# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]
What's going on?
Start by comparing 'third' with the previous value of 'second' by shifting one row, then shift two places for the second condition.
Note this only works for fixed values of k. What if k is variable?
In that case, you need to build the condition dynamically. The following code selects the rows where the comparison holds for none of the values of n in [1, k] (note the leading ~; see the final note on any vs. all):
k = 2 # pick any k > 1
df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
What's going on here? The long answer:
first, we check, using the shift trick, which rows meet your criteria for each value of n in [1, k]:
In [1]: [df['third'] < df['second'].shift(n) for n in range(1, k+1)]
Out[1]:
[0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
dtype: bool,
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool]
then, we concatenate them into a single Series, with one stacked block (repeating the index) per value of n.
In [2]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)])
Out[2]:
0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
finally, we build a mask of all rows that meet the criteria for any value of n (i.e. if the comparison is true for some n, the row is flagged):
In [3]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)
Out[3]:
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
Then, all you need to do is negate that mask (the leading ~), project it over your original dataframe, and pick up the index:
In [3]: df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
Out[3]: Int64Index([0, 2], dtype='int64')
Final note
If the criteria must be met for all the values of n in [1, k], then replace .any with .all. Also note that the level argument of Series.any was deprecated in pandas 1.3 and removed in 2.0; on current versions use .groupby(level=0).any() instead.
# First condition is easy
cond1 = df["second"] > df["first"]
# Since the second condition compares the second and third columns, compute
# the row-wise result beforehand for convenience
s = df["third"] < df["second"]
# Now we are gonna run a rolling window over `s`. What we want is that the
# previous `k` rows of `s` are all True.
# A rolling window always ends on the current row but you want it to end on
# the previous row. So we will increase the window size by 1 and exclude the
# last element from comparison.
all_true = lambda arr: arr[:-1].all()
# rolling.apply returns floats (NaN for incomplete windows), so convert back
# to boolean with .eq(1) before combining the masks with |
cond2_with_k_equal_2 = s.rolling(3).apply(all_true, raw=True).eq(1)
cond2_with_k_equal_3 = s.rolling(4).apply(all_true, raw=True).eq(1)
cond2_with_k_equal_4 = s.rolling(5).apply(all_true, raw=True).eq(1)
cond2 = cond2_with_k_equal_2 | cond2_with_k_equal_3 | cond2_with_k_equal_4
# Or you can consolidate them into a loop
cond2 = pd.Series(False, index=df.index)
for k in range(2, 5):
    cond2 |= s.rolling(k + 1).apply(all_true, raw=True).eq(1)
# Get the result
df[cond1 & cond2]
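On the sample frame this selects the same rows (3, 7 and 8 in the question's 1-based numbering) as the rolling-sum mask in the first answer.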

Replacing duplicate values in a dataframe

I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
Which gives:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 1
7 1
8 2
9 2
10 5
11 9
12 1
13 2
14 3
15 3
16 3
17 5
I want to replace duplicate values in the 'ID' column with the lowest not-yet-used value. However, consecutive identical values should be seen as a group and their values should be changed in the same way. For example: the first two values are both 1. They are consecutive, so they form a group, and the second '1' should therefore not be replaced with a '2'. Rows 14-16 are three consecutive threes. The value 3 has already been used to replace values above, so these threes need to be replaced; but they are consecutive, thus a group, and should get the same replacement value. The expected outcome below will make it clearer:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})

def fun():
    v, dub = 1, set()
    d = yield
    while True:
        num = d.iloc[0]['ID']
        if num in dub:
            while v in dub:
                v += 1
            d.ID = num = v
        dub.add(num)
        d = yield d

f = fun()
next(f)
df = df.groupby([df['ID'].diff().ne(0).cumsum(), 'ID'], as_index=False).apply(lambda x: f.send(x))
print(df)
Output:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
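If the generator feels too magical, here is a more pedestrian sketch of the same bookkeeping (replace_dups is a hypothetical helper name): walk the column once, reuse the previous output value for consecutive duplicates, and otherwise keep the input value if unseen, or else take the lowest unused value:

def replace_dups(ids):
    used = set()    # values already present in the output
    out = []
    prev = None     # previous input value
    next_free = 1   # lowest candidate replacement value
    for v in ids:
        if v == prev:
            out.append(out[-1])           # same consecutive group: repeat
        else:
            if v in used:
                while next_free in used:  # find the lowest unused value
                    next_free += 1
                new = next_free
            else:
                new = v
            used.add(new)
            out.append(new)
        prev = v
    return out

df['ID'] = replace_dups(df['ID'].tolist())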
I made up a way to get your outcome using for loops and dictionaries. It was more difficult than I expected, to be fair; the code can seem a bit complex at first, but it isn't. There is probably a way to do it with a few logical vectors, but I don't know it.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
####################
diffs = np.diff(df.ID)           # differences ID(k) - ID(k-1)
uniq = sorted(pd.unique(df.ID))  # unique values in the ID column
# dict with a range of numbers from min to max in the ID column
d = {}  # empty dict
a = range(uniq[0], uniq[-1] * int(df.shape[0] / len(uniq)))  # range of values
d = {a[k]: False for k in range(len(a))}  # fill dict
d[df.ID[0]] = True  # set the first value in the column to True
for m in range(1, df.shape[0]):
    # Find a value different from the previous one,
    # therefore the beginning of a new subgroup
    if diffs[m-1] != 0:
        # Check if the value appeared before in the ID column
        if d[df.ID[m]] == True:
            # Get the lowest value which wasn't used
            lowest = [k for k, v in d.items() if v == False][0]
            # Loop over the subgroup (whose differences are 0)
            for n in range(m+1, df.shape[0]):
                if diffs[n-1] > 0:  # if we find a new subgroup
                    break           # then stop looping
            # Replace the subgroup with the lowest value
            df.ID[m:n] = lowest  # n is the final index of the subgroup
            # *Exception: if the last number is a subgroup by itself,
            # the previous for loop doesn't run, so handle it here
            if m == df.shape[0]-1:
                df.ID[m] = lowest
        # Update the dictionary for values retrieved from the ID column
        d[df.ID[m]] = True
print(df)
The idea is to treat the ID column as a sequence of subgroups (runs of equal values), checking conditions on each and operating accordingly. You can picture the column as a set of multiple arrays:
[1, 1 | 2 | 5, 5 | 6 | 1, 1 | 2, 2 | 5 | 9 | 1 | 2 | 3, 3, 3 | 5]
What you need to do is find the limits of those subgroups and check whether they meet certain conditions (1. the value has not appeared before; 2. if it has, replace it with the lowest number not yet used). We can find the subgroups by calculating the difference between each value and the previous one:
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
We can track the conditions with a dictionary whose keys are the integers in the array (plus the larger values we might need) and whose values record whether each number has been used (True or False).
To build it, we need the max value of the ID column. However, the dictionary must contain more numbers than appear in the column (in your example max(input) = 9 but max(output) = 12). You could pick the upper bound arbitrarily; I chose to scale it by the ratio of the number of rows to the number of unique values in the column (the last input in a = range...).
uniq = sorted(pd.unique(df.ID))  # unique values in the ID column
# dict with a range of numbers from min to max in the ID column
d = {}
a = range(uniq[0], uniq[-1] * int(df.shape[0] / len(uniq)))
d = {a[k]: False for k in range(len(a))}
d[df.ID[0]] = True  # set the first value in the column to True
The last part of the code is a main for loop with some ifs and another for loop inside; it works as follows:
# 1. Loop over the ID column
# 2. Check if the ID[m] value differs from the previous one (diff != 0)
# 3. Check if the ID[m] value has already appeared in the ID column
# 4. Find the lowest value (the first key == False in the dict) and replace
#    the subgroup in ID with it
# 5. Because of how step 4 is done, it fails when the last value is a
#    subgroup by itself, so a little extra condition handles that case
# 6. Update the dict every time a new value shows up
I am sure there are many ways to shorten this code, but it works and should keep working with larger dataframes under the same conditions.

How to sum all values with index greater than X?

Let's say I have this series:
>>> s = pd.Series({1:10,2:5,3:8,4:12,5:7,6:3})
>>> s
1 10
2 5
3 8
4 12
5 7
6 3
I want to sum all the values for which the index is greater than X. So if e.g. X = 3, I want to get this:
>>> X = 3
>>> s.some_magic(X)
1 10
2 5
3 8
>3 22
I managed to do it in this rather clumsy way:
lt = s[s.index.values <= 3]
gt = s[s.index.values > 3]
gt_s = pd.Series({'>3':sum(gt)})
lt.append(gt_s)
and got the desired result, but I believe there should be an easier and more elegant way... or is there?
s.groupby(np.where(s.index > 3, '>3', s.index)).sum()
Or,
s.groupby(s.index.to_series().mask(s.index > 3, '>3')).sum()
Out:
1 10
2 5
3 8
>3 22
dtype: int64
Here's a possible solution:
import pandas as pd

s = pd.Series({1: 10, 2: 5, 3: 8, 4: 12, 5: 7, 6: 3})
iv = s.index.values
# pd.concat, since Series.append was removed in pandas 2.0
print(pd.concat([s[iv <= 3], pd.Series({'>3': s[iv > 3].sum()})]))
