Pandas: find interval distance from N consecutive to M consecutive - python

TLDR version:
I have a column like below,
[2, 2, 0, 0, 0, 2, 2, 0, 3, 3, 3, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 3, 3]
# The data may also contain longer runs, e.g. of length 4, 5, 6, 7, 8...
I need a function with parameters n and m. If I use
n=2, m=3,
it should measure the distance from each run of 2 consecutive values to the next run of 3 consecutive values, and the final result after grouping would be:
[6, 9]
Detailed version
Here is the test case. I'm writing a function that, given n and m, generates a list of distances between consecutive runs. Currently, the function only works with a single parameter N (the distance from one run of N consecutive values to the next run of N). I want to change it so that it also accepts M.
dummy = [1,1,0,0,0,1,1,0,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1,1]
df = pd.DataFrame({'a': dummy})
What I write currently,
def get_N_seq_stat(df, N=2, M=3):
    df["c1"] = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    df["c2"] = np.where(df.c1.ne(N), 1, 0)
    df["c3"] = df["c2"].ne(df["c2"].shift()).cumsum()
    result = df.loc[df["c2"] == 1].groupby("c3")["c2"].count().tolist()
    # if the last N rows are not a run of N, the last stretch should not be counted
    if not (df["c1"].tail(N) == N).all():
        del result[-1]
    if not (df["c1"].head(N) == N).all():
        del result[0]
    return result
If I set N=2, M=3 (from 2 consecutive to 3 consecutive), the ideal return value would be [6, 9], as marked below.
dummy = [1,1,**0,0,0,1,1,0,**1,1,1,0,0,1,1,**0,0,0,0,1,1,0,0,0,**1,1,1]
Currently, if I set N=2, the returned list is [3, 6, 4] because
dummy = [1,1,**0,0,0,**1,1,**0,1,1,1,0,0,**1,1,**0,0,0,0,**1,1,0,0,0,1,1,1]

I would modify your code this way:
def get_N_seq_stat(df, N=2, M=3, debug=False):
    # get number of consecutive 1s
    c1 = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    # find stretches between N and M
    m1 = c1.eq(N)
    m2 = c1.eq(M)
    c2 = pd.Series(np.select([m1.shift() & ~m1, m2], [True, False], np.nan),
                   index=df.index).ffill().eq(1)
    # debug mode to understand how this works
    if debug:
        return df.assign(c1=c1, c2=c2,
                         length=c2[c2].groupby(c2.ne(c2.shift()).cumsum())
                                      .transform('size')
                         )
    # get the length of the stretches
    return c2[c2].groupby(c2.ne(c2.shift()).cumsum()).size().to_list()
get_N_seq_stat(df, N=2, M=3)
Output: [6, 9]
Intermediate c1, c2, and length:
get_N_seq_stat(df, N=2, M=3, debug=True)
a c1 c2 length
0 1 2 False NaN
1 1 2 False NaN
2 0 0 True 6.0
3 0 0 True 6.0
4 0 0 True 6.0
5 1 2 True 6.0
6 1 2 True 6.0
7 0 0 True 6.0
8 1 3 False NaN
9 1 3 False NaN
10 1 3 False NaN
11 0 0 False NaN
12 0 0 False NaN
13 1 2 False NaN
14 1 2 False NaN
15 0 0 True 9.0
16 0 0 True 9.0
17 0 0 True 9.0
18 0 0 True 9.0
19 1 2 True 9.0
20 1 2 True 9.0
21 0 0 True 9.0
22 0 0 True 9.0
23 0 0 True 9.0
24 1 3 False NaN
25 1 3 False NaN
26 1 3 False NaN
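The core of this answer is the start/stop flag pattern: np.select plants a "start" flag just after a run of N ends and a "stop" flag at runs of M, leaving NaN everywhere else, and ffill then carries the last flag forward. A minimal sketch of just that idiom, with made-up flag data:

```python
import numpy as np
import pandas as pd

# 1.0 opens a stretch, 0.0 closes it, NaN means "keep the previous state".
flags = pd.Series([np.nan, 1.0, np.nan, np.nan, 0.0, np.nan])

# ffill propagates each flag until the next one; eq(1) turns it into a mask.
mask = flags.ffill().eq(1)
print(mask.tolist())
# -> [False, True, True, True, False, False]
```

Note that eq(1) also maps the leading NaN (no flag seen yet) to False, which is why rows before the first stretch are excluded.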

Related

Determine if Values are within range based on pandas DataFrame column

I am trying to determine whether or not a given value in a row of a DataFrame is within two other columns from a separate DataFrame, or if that estimate is zero.
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                  columns=['lo1', 'up1', 'lo2', 'up2'])
lo1 up1 lo2 up2
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
pe1 pe2
0 1 3
1 4 6
2 5 8
3 10 2
To be more clear, is it possible to develop a for-loop or use a function that can look at pe1 and its corresponding values and determine if they are within lo1 and up1, if lo1 and up1 cross zero, and if pe1=0? I am having a hard time coding this in Python.
EDIT: I'd like the output to be something like:
m1 m2
0 0 3
1 4 0
2 0 0
3 0 0
The only pe values that fall within their corresponding lo and up columns (with lo positive) are in the first row, second column, and the second row, first column.
You can concatenate the two dataframes along the horizontal axis and then use np.where. This behaves similarly to the where approach used by RJ Adriaansen.
import pandas as pd
import numpy as np
# Data
df1 = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                   columns=['lo1', 'up1', 'lo2', 'up2'])
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
# concatenate dfs
df = pd.concat([df1, df2], axis=1)
where now df looks like
lo1 up1 lo2 up2 pe1 pe2
0 -1 2 1 3 1 3
1 4 6 7 8 4 6
2 -2 10 11 13 5 8
3 5 6 8 9 10 2
Finally we use np.where and between
for k in [1, 2]:
    df[f"m{k}"] = np.where(
        (df[f"pe{k}"].between(df[f"lo{k}"], df[f"up{k}"]) &
         df[f"lo{k}"].gt(0)),
        df[f"pe{k}"],
        0)
and the result is
lo1 up1 lo2 up2 pe1 pe2 m1 m2
0 -1 2 1 3 1 3 0 3
1 4 6 7 8 4 6 4 0
2 -2 10 11 13 5 8 0 0
3 5 6 8 9 10 2 0 0
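One detail worth knowing here: Series.between is inclusive on both endpoints by default, which is what makes pe1=4 match the range lo1=4, up1=6 above. A quick check:

```python
import pandas as pd

s = pd.Series([1, 5, 10, 11])
# between(left, right) defaults to inclusive='both', so both bounds match
result = s.between(1, 10).tolist()
print(result)
# -> [True, True, True, False]
```

Pass inclusive='neither' (or 'left'/'right') if strict bounds are wanted instead.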
You can create a boolean mask for the required condition. For pe1 that would be:
value in lo1 is smaller or equal to pe1
value in up1 is larger or equal to pe1
value in lo1 is larger than 0
This would make this mask:
(df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0)
which returns:
0 False
1 True
2 False
3 False
dtype: bool
Now you can use where to keep the values that match True and replace those who don't with 0:
df2['pe1'] = df2['pe1'].where((df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0), other=0)
df2['pe2'] = df2['pe2'].where((df['lo2'] <= df2['pe2']) & (df['up2'] >= df2['pe2']) & (df['lo2'] > 0), other=0)
Result:
   pe1  pe2
0    0    3
1    4    0
2    0    0
3    0    0
To loop all columns:
for i in df2.columns:
    nr = i[2:]  # strip the first two characters to get the number, then use it to match the columns in the other df
    df2[i] = df2[i].where((df[f'lo{nr}'] <= df2[i]) & (df[f'up{nr}'] >= df2[i]) & (df[f'lo{nr}'] > 0), other=0)
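If mutating df2 in place is undesirable, the same where condition can build a fresh result instead. A small sketch with just the first pair of columns (column names as in the question, data trimmed to two rows):

```python
import pandas as pd

df = pd.DataFrame({'lo1': [-1, 4], 'up1': [2, 6]})
df2 = pd.DataFrame({'pe1': [1, 4]})

# keep pe1 where it lies in [lo1, up1] and lo1 is positive, else 0
m1 = df2['pe1'].where(
    df2['pe1'].between(df['lo1'], df['up1']) & df['lo1'].gt(0), 0)
print(m1.tolist())
# -> [0, 4]
```

This works because both frames share the same default RangeIndex, so the Series align row by row.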

Pandas replace all but first in consecutive group

The problem description is simple, but I cannot figure out how to make this work in Pandas. Basically, I'm trying to replace consecutive values (except the first) with some replacement value. For example:
data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 2
10 2
11 2
12 3
If I run this through some function foo(df, 2, 0) I would get the following:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
Which replaces all values of 2 with 0, except for the first one. Is this possible?
You can find all the rows where A = 2 and A is also equal to the previous A value and set them to 0:
data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
df[(df.A == 2) & (df.A == df.A.shift(1))] = 0
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
If you have more than one column in the dataframe, use df.loc to just set the A values:
df.loc[(df.A == 2) & (df.A == df.A.shift(1)), 'A'] = 0
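The shift comparison above looks only at adjacent rows, so it already handles the value reappearing in a later run. The run structure can also be made explicit with the ne/shift/cumsum labeling idiom; a sketch with a hypothetical helper (not from the question):

```python
import pandas as pd

# Hypothetical helper: keep only the first element of each *consecutive*
# run of `val`, even if `val` reappears in a later run.
def keep_first_in_run(s, val, repl):
    run_id = s.ne(s.shift()).cumsum()         # label each run of equal values
    later = s.groupby(run_id).cumcount() > 0  # positions after a run's start
    return s.mask(s.eq(val) & later, repl)

out = keep_first_in_run(pd.Series([0, 2, 2, 1, 2, 2, 2]), 2, 0)
print(out.tolist())
# -> [0, 2, 0, 1, 2, 0, 0]
```

Note the second run of 2s also keeps its first element, which the pure "all but first occurrence" approaches below would replace.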
Try this; it assumes 'A' is not duplicated further down the dataframe, i.e. each value's run appears only once (for example, 'A' is monotonically increasing):
def foo(df, val=2, repl=0):
    return df.mask((df.groupby('A').transform('cumcount') > 0) & (df['A'] == val), repl)
foo(df, 2, 0)
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I'm not sure if this is the best way, but I came up with this solution; hope it helps:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replecate(df, number, replacement):
    i = 1
    for column in df.columns:
        for index, value in enumerate(df[column]):
            if i == 1 and value == number:
                i = 0
            elif value == number and i != 1:
                df[column][index] = replacement
        i = 1
    return df

replecate(df, 2, 0)
Output
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I've managed a solution to this problem by shifting the row down by one and checking to see if the values align. Also included a function which can take multiple values to check for (not just 2).
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replace_recurring(df, key, offset=1, values=[2]):
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = 0
    df = df.drop(['offset'], axis=1)
    return df

df = replace_recurring(df, 'A', offset=1, values=[2])
Giving the output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3

How to resolve or skip a particular row of a column where we get "'float' object is not iterable" in str.findall

Hi, I am trying to iterate over a column in pandas.
I tried replacing 'i' with '[i]', but that raised a different error.
I have only a small sample of the input, not the entire input.
Alternatively, is it possible to skip any row in the dataframe where the error "'float' object is not iterable" occurs, and continue iterating over the next rows?
Input:
Name Matches
John [1, 0, 500,], [2, 0, 600,],[70,67,78]
Wall [4, 0, 14], [2, 0, 40]
Austin [1, 0, 5,], [0,2, 7,]
Code:
df['any_value_greater_than_10?'] = (['yes' if any(int(a) > 10 for a in i) else 'no'
                                     for i in df['Matches'].str.findall(r'\d+')])
Error:
for i in df['Matches'].str.findall('\d+')])
'float' object is not iterable
It works fine for me if the values are converted to strings first; I also added a row with an empty list to better test the case where nothing matches:
print (df)
Name Matches
0 John [1, 0, 500,], [2, 0, 600,],[70,67,78]
1 Wall [4, 0, 14], [2, 0, 40]
2 Austin [1, 0, 5,], [0,2, 7,]
3 Josh []
print (df['Matches'].astype(str).str.findall(r'\d+'))
0 [1, 0, 500, 2, 0, 600, 70, 67, 78]
1 [4, 0, 14, 2, 0, 40]
2 [1, 0, 5, 0, 2, 7]
3 []
Name: Matches, dtype: object
df['any_value_greater_than_10?'] = (['yes' if any(int(a) > 10 for a in i) else 'no'
                                     for i in df['Matches'].astype(str).str.findall(r'\d+')])
print (df)
Name Matches any_value_greater_than_10?
0 John [1, 0, 500,], [2, 0, 600,],[70,67,78] yes
1 Wall [4, 0, 14], [2, 0, 40] yes
2 Austin [1, 0, 5,], [0,2, 7,] no
3 Josh [] no
Another solution:
m = (df['Matches'].astype(str)
       .str.extractall(r'(\d+)')[0]
       .astype(float)
       .gt(10)
       .any(level=0)  # in pandas >= 2.0 use .groupby(level=0).any()
       .reindex(df.index, fill_value=False))

df['any_value_greater_than_10?'] = np.where(m, 'yes', 'no')
print (df)
Name Matches any_value_greater_than_10?
0 John [1, 0, 500,], [2, 0, 600,],[70,67,78] yes
1 Wall [4, 0, 14], [2, 0, 40] yes
2 Austin [1, 0, 5,], [0,2, 7,] no
3 Josh [] no
How it works:
After converting to strings, Series.str.extractall extracts all integers into column 0:
print (df['Matches'].astype(str).str.extractall(r'(\d+)'))
0
match
0 0 1
1 0
2 500
3 2
4 0
5 600
6 70
7 67
8 78
1 0 4
1 0
2 14
3 2
4 0
5 40
2 0 1
1 0
2 5
3 0
4 2
5 7
Column 0 is selected to get a Series:
print (df['Matches'].astype(str).str.extractall(r'(\d+)')[0])
match
0 0 1
1 0
2 500
3 2
4 0
5 600
6 70
7 67
8 78
1 0 4
1 0
2 14
3 2
4 0
5 40
2 0 1
1 0
2 5
3 0
4 2
5 7
Name: 0, dtype: object
Convert to floats and then test whether the values are greater than 10:
print (df['Matches'].astype(str)
         .str.extractall(r'(\d+)')[0]
         .astype(float)
         .gt(10)
       )
match
0 0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
8 True
1 0 False
1 False
2 True
3 False
4 False
5 True
2 0 False
1 False
2 False
3 False
4 False
5 False
Name: 0, dtype: bool
Finally, check whether there is at least one True per first level (the original index values; in pandas >= 2.0 use .groupby(level=0).any() instead of .any(level=0)):
print (df['Matches'].astype(str)
         .str.extractall(r'(\d+)')[0]
         .astype(float)
         .gt(10)
         .any(level=0))
0 True
1 True
2 False
Name: 0, dtype: bool
... and reindex to add back rows with no matches (here the last one):
print (df['Matches'].astype(str)
         .str.extractall(r'(\d+)')[0]
         .astype(float)
         .gt(10)
         .any(level=0)
         .reindex(df.index, fill_value=False))
0 True
1 True
2 False
3 False
Name: 0, dtype: bool
Finally, the mask is passed to numpy.where.
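For completeness, the reason astype(str) sidesteps the original error: a missing cell is a float NaN, which str.findall cannot iterate over, but astype(str) turns it into the string "nan", which contains no digits, so findall simply returns an empty list for that row. A minimal check:

```python
import numpy as np
import pandas as pd

s = pd.Series(["[1, 20]", np.nan])
# NaN becomes the literal string "nan"; no digits, so findall yields []
hits = s.astype(str).str.findall(r"\d+").tolist()
print(hits)
# -> [['1', '20'], []]
```

The empty list then falls through the any(...) test as 'no', which matches the "skip such rows" behaviour the question asks for.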

Add a state column when another column is increasing/decreasing

I would like to add a column in a data frame when another column is increasing/decreasing or stays the same with:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1, 2, 3, 4, 7, 9, 3, 3, 3]
I would like the state to be df['state'] = [1, 1, 1, 1, 1, -1, 0, 0]
This should do the trick!
a = [1, 2, 3, 4, 7, 9, 3, 3, 3]
b = []
for x in range(len(a) - 1):
    b.append((a[x + 1] > a[x]) - (a[x + 1] < a[x]))
print(b)
You could use pd.Series.diff method to get the difference between consecutive values, and then assign the necessary state values by using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
# battery state
# 0 1 NaN
# 1 2 1.0
# 2 3 1.0
# 3 4 1.0
# 4 7 1.0
# 5 9 1.0
# 6 3 -1.0
# 7 3 0.0
# 8 3 0.0
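The NaN in the first row (no previous value to diff against) also forces the column to float. If an integer, NaN-free state column is preferred, the sign can be computed directly from the two comparisons, since NaN compares False to everything; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'battery': [1, 2, 3, 4, 7, 9, 3, 3, 3]})
diff = df['battery'].diff()

# NaN > 0 and NaN < 0 are both False, so the first row lands on 0
df['state'] = diff.gt(0).astype(int) - diff.lt(0).astype(int)
print(df['state'].tolist())
# -> [0, 1, 1, 1, 1, 1, -1, 0, 0]
```

This treats the first reading as "same", which may or may not be the desired convention.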
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
# battery state
# 0 1 0
# 1 2 1
# 2 3 1
# 3 4 1
# 4 7 1
# 5 9 1
# 6 3 -1
# 7 3 0
# 8 3 0
So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns = ['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]
Try np.sign (the pd.np alias is deprecated; import numpy as np directly)
np.sign(df.battery.diff().fillna(1))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64

Increment counter the first time a number is reached

This is probably a very silly question. But, I'll still go ahead and ask. How would you increment a counter only the first time a particular value is reached?
For example, if I have step below as a column of the df and would want to add a counter column called 'counter' which increments the first time the 'step' column has a value of 6
You can use .shift() in pandas -
notice that you only want to increment when the value of df['step'] is 6
and the value of df.shift(1)['step'] is not 6.
df['counter'] = ((df['step']==6) & (df.shift(1)['step']!=6 )).cumsum()
print(df)
Output
step counter
0 2 0
1 2 0
2 2 0
3 3 0
4 4 0
5 4 0
6 5 0
7 6 1
8 6 1
9 6 1
10 6 1
11 7 1
12 5 1
13 6 2
14 6 2
15 6 2
16 7 2
17 5 2
18 6 3
19 7 3
20 5 3
Explanation
a. df['step']==6 gives boolean values - True if the step is 6
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 True
9 True
10 True
11 False
12 False
13 True
14 True
15 True
16 False
17 False
18 True
19 False
20 False
Name: step, dtype: bool
b. df.shift(1)['step'] != 6 shifts the data down by 1 row and then checks whether the value is not equal to 6.
When both conditions are satisfied, you want to increment; .cumsum() takes care of that. Hope that helps!
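The same "count run starts" idiom generalizes to any condition, not just == 6. A tiny sketch on made-up data, using shift(fill_value=False) so the first row needs no special-casing:

```python
import pandas as pd

s = pd.Series([5, 6, 6, 7, 6])
is_six = s.eq(6)

# a run starts where the condition holds but did not hold on the previous row
starts = is_six & ~is_six.shift(fill_value=False)
counts = starts.cumsum().tolist()
print(counts)
# -> [0, 1, 1, 1, 2]
```

fill_value=False makes a 6 in the very first row count as a run start, one of the edge cases the answers below handle explicitly.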
P.S - Although it's a good question, going forward please avoid pasting images. You can directly paste data and format as code. Helps the people who are answering to copy-paste
Use:
df = pd.DataFrame({'step':[2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]})
a = df['step'] == 6
b = (~a).shift()
b[0] = a[0]
df['counter1'] = (a & b).cumsum()
print (df)
step counter
0 2 0
1 2 0
2 2 0
3 3 0
4 4 0
5 4 0
6 5 0
7 6 1
8 6 1
9 6 1
10 6 1
11 7 1
12 5 1
13 6 2
14 6 2
15 6 2
16 7 2
17 5 2
18 6 3
19 7 3
20 5 3
Explanation:
Get boolean mask for comparing with 6:
a = df['step'] == 6
Invert Series and shift:
b = (~a).shift()
If the first value is 6, the shift would miss the first group, so the first value of b must be set from the first value of a:
b[0] = a[0]
Chain conditions by bitwise and - &:
c = a & b
Get cumulative sum:
d = c.cumsum()
print (pd.concat([df['step'], a, b, c, d], axis=1, keys=('abcde')))
a b c d e
0 2 False False False 0
1 2 False True False 0
2 2 False True False 0
3 3 False True False 0
4 4 False True False 0
5 4 False True False 0
6 5 False True False 0
7 6 True True True 1
8 6 True False False 1
9 6 True False False 1
10 6 True False False 1
11 7 False False False 1
12 5 False True False 1
13 6 True True True 2
14 6 True False False 2
15 6 True False False 2
16 7 False False False 2
17 5 False True False 2
18 6 True True True 3
19 7 False False False 3
20 5 False True False 3
If performance is important, use numpy solution:
a = (df['step'] == 6).values
b = np.insert((~a)[:-1], 0, a[0])
df['counter1'] = np.cumsum(a & b)
If your DataFrame is called df, one possible way without iteration is
df['counter'] = 0
df.loc[1:, 'counter'] = ((df['step'].values[1:] == 6) & (df['step'].values[:-1] != 6)).cumsum()
This creates two boolean arrays, the conjunction of which is True when the previous row did not contain a 6 and the current row does contain a 6. You can sum this array to obtain the counter.
That's not a silly question. To get the desired output in your counter column, you can try (for example) this:
steps = [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]
counter = [idx for idx in range(len(steps)) if steps[idx] == 6 and (idx==0 or steps[idx-1] != 6)]
print(counter)
results in:
>> [7, 13, 18]
, which are the indices in steps where a first 6 occurred. You can now get the total times that has happened with len(counter), or reproduce the second column the exact way you have given it with
counter_column = [0]
for idx in range(len(steps)):
    counter_column.append(counter_column[-1])
    if idx in counter:
        counter_column[-1] += 1
If your DataFrame is called df, it's
import pandas as pd

q_list = [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]
df = pd.DataFrame(q_list, columns=['step'])
counter = 0
flag = False
for index, row in df.iterrows():
    if row['step'] == 6 and flag == False:
        counter += 1
        flag = True
    elif row['step'] != 6 and flag == True:
        flag = False
    df.at[index, 'counter'] = counter  # set_value was removed from pandas; .at is the equivalent
