Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to cap the length of each run of 1s at some limiting value? Let's say the limit is 2; then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1

One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination[1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & (boolean AND) this new Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
              .astype(int)  # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
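To see what each step produces, here are the intermediates for the example column (a sketch with throwaway names; results shown as comments):
run_id = (df.A != df.A.shift()).cumsum()  # 1 1 1 2 3 3 3 3 4 5 -- a new id at every value change
pos = df.groupby(run_id).cumcount()       # 0 1 2 0 0 1 2 3 0 0 -- position within each run
# (pos <= 1) keeps the first two elements of every run; &-ing with A zeroes the rest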
[1] See the pandas cookbook, the section on grouping: "Grouping like Python’s itertools.groupby".

Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else df['A'][x] * int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1

If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy().copy()  # .as_matrix() is gone from modern pandas; copy so df['A'] is not mutated
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
import numpy
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.array(array)  # numpy.array (not asarray) so the input is copied, never mutated
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
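A quick check on the example column (a sketch, assuming the imports above and trim_runs are in scope):
trim_runs(df['A'], 2)  # -> array([1, 1, 0, 0, 1, 1, 0, 0, 0, 1]), matching the desired B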

Related

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the cumulative sum of continuous non-zero values of column Fn.
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and resetting whenever the value is 0:
df['Fn2'] = df['Fn'].astype(bool)
cs = df['Fn2'].cumsum()  # running count of ones
# at each zero, remember the running count; ffill carries it into the following run
df['Fn2'] = cs - cs.where(~df['Fn2']).ffill().fillna(0).astype(int)
df
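On the sample column this yields 0 1 0 1 2 0 0 1 2 3, the same Sum as the groupby approach above (a quick check).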
You can store the Fn column in a list, then build a new list by iterating over it: whenever the current value is non-zero, add it to the running sum; otherwise reset to zero. Afterwards make a DataFrame from the list and concat it column-wise to the existing dataframe:
fn = df['Fn'].tolist()
sum_list = [fn[0]]
for i in range(1, len(fn)):
    if fn[i] > 0:
        sum_list.append(sum_list[-1] + fn[i])  # extend the running sum
    else:
        sum_list.append(0)                     # reset on zero
dfsum = pd.DataFrame({'Sum': sum_list})
df = pd.concat([df, dfsum], axis=1)
Hope this will help you. The idea is the main thing.
try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df
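For the sample column this produces the same values (0 1 0 1 2 0 0 1 2 3) as the groupby approach above.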

How to set ranges of rows in pandas?

I have the following working code that sets "new_col" to 1 at the rows covered by the intervals given by starts and ends.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1
df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
I want these intervals to come from another dataframe pointer_df.
How to vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
obviously doesn't work and gives
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
It seems all answers use some kind of Python-level for loop.
The question was how to vectorize the operation above.
Is this not doable without for loops/list comprehensions?
You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]
df.loc[indices, 'new_col'] = value
print(df)
Output
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
If you want a method that does not use any for loop or list comprehension, and relies only on numpy, you could do:
def indices(start, end, ma=10):
    # inclusive interval lengths, clipping any end that runs past the last row
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    np.cumsum(lens, out=lens)
    # build an array of ones and patch the value at each interval boundary,
    # so that a final cumsum yields all the covered indices
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
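As a quick check on the sample inputs (starts = [1, 5, 8], ends = [1, 6, 10], ma = 10):
indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df))  # -> array([1, 5, 6, 8, 9])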
for i, j in zip(pointer_df["starts"], pointer_df["ends"]):
    print(i, j)
Apply the same slice assignment as in your working loop, but take the bounds from pointer_df, as sketched below.
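A minimal sketch of that suggestion (it is just the original loop reading its bounds from pointer_df):
for s, e in zip(pointer_df["starts"], pointer_df["ends"]):
    df.loc[s:e, "new_col"] = value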

Get the sum of rows that contain 0 as a value

I want to know how I can write Python code for the following problem.
I have a dataframe that contains this column:
Column X
1
0
0
0
1
1
0
0
1
I want to create a list b counting the number of successive 0 values, to get something like this:
List X
1
3
3
3
1
1
2
2
1
If I understand your question correctly, you want to replace all the zeros with the number of consecutive zeros in the current streak, but leave non-zero numbers untouched. So
1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0
becomes
1 4 4 4 4 1 1 1 1 2 2 1 1 1 5 5 5 5 5
To do that, this should work, assuming your input column (a pandas Series) is called x.
result = []
i = 0
while i < len(x):
    if x[i] != 0:
        result.append(x[i])
        i += 1
    else:
        # See how many times zero occurs in a row
        j = i
        n_zeros = 0
        while j < len(x) and x[j] == 0:
            n_zeros += 1
            j += 1
        result.extend([n_zeros] * n_zeros)
        i += n_zeros
result
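If you would rather stay vectorized, the run-id trick from the first question applies here as well (a sketch, assuming the column is a pandas Series named x):
runs = x.ne(x.shift()).cumsum()             # a new id at every value change
counts = x.groupby(runs).transform('size')  # length of the run each element sits in
result = x.where(x.ne(0), counts)           # keep non-zeros; replace zeros with run length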

Return rows based off the most recent increase in value from other columns python

The title of this question is a little confusing to write out succinctly.
I have a pandas df that contains integers and a relevant key column. When a value is present in the key column, I want to return the most recent increase in the integer columns.
For the df below, the key column is Area. When X is in Area, I want to find the most recent increase in the columns ['ST_A','PG_A','ST_B','PG_B'].
import pandas as pd
d = ({
    'ST_A' : [0,0,0,0,0,1,1,1,1],
    'PG_A' : [0,0,0,1,1,1,2,2,2],
    'ST_B' : [0,1,1,1,1,1,1,1,1],
    'PG_B' : [0,0,0,0,0,0,0,1,1],
    'Area' : ['','','X','','X','','','','X'],
})
df = pd.DataFrame(data = d)
Output:
ST_A PG_A ST_B PG_B Area
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0 X
3 0 1 1 0
4 0 1 1 0 X
5 1 1 1 0
6 1 2 1 0
7 1 2 1 1
8 1 2 1 1 X
I tried to use df = df.loc[(df['Area'] == 'X')] but this returns the rows where X is situated. I need something that uses X to return the most recent row where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B'].
I have also tried:
cols = ['ST_A','PG_A','ST_B','PG_B']
df[cols] = df[cols].diff()
df = df.fillna(0.)
df = df.loc[(df[cols] == 1).any(axis=1)]
This returns all rows where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B']. Not the most recent increase before X in ['Area'].
Intended Output:
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
Does this question make sense or do I need to simplify it?
I believe you can use NumPy here via np.searchsorted:
import numpy as np
increases = np.where(df.iloc[:, :-1].diff().gt(0).max(1))[0]
marks = np.where(df['Area'].eq('X'))[0]
idx = increases[np.searchsorted(increases, marks) - 1]
res = df.iloc[idx]
print(res)
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
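For the sample frame, the intermediate arrays are (a quick trace of each step):
increases  # array([1, 3, 5, 6, 7]) -- rows where some tracked column increased
marks      # array([2, 4, 8])       -- rows where Area == 'X'
idx        # array([1, 3, 7])       -- most recent increase before each mark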
Not efficient, but it works. A big chunk of code which is kinda slow:
indexes = np.where(df['Area'] == 'X')[0].tolist()
indexes2 = list(map((1).__add__, np.where(df[df.columns[:-1]].sum(axis=1) < df[df.columns[:-1]].shift(-1).sum(axis=1).sort_index())[0].tolist()))
l = []
for i in indexes:
    if min(indexes2, key=lambda x: abs(x - i)) in l:
        l.append(min(indexes2, key=lambda x: abs(x - i)) - 2)
    else:
        l.append(min(indexes2, key=lambda x: abs(x - i)))
print(df.iloc[l].sort_index())
Output:
Area PG_A PG_B ST_A ST_B
1 0 0 0 1
3 1 0 0 1
7 2 1 1 1

How to count row values greater than a specific value in pandas?

For example, I have a Pandas DataFrame dff, and I want to count, for each row, the values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am currently using an inefficient nested loop. How can this be done more efficiently?
dff['count'] = 0
for m in range(len(dff)):
    og = 0
    for i in dff.columns:
        if dff[i][m] > 0:
            og += 1
    dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame that is True wherever a value is greater than your threshold (in this case 0), and then sum along axis 1 (across the columns of each row).
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64
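To attach the counts as a column, mirroring the loop above (a one-line sketch):
dff['count'] = dff.gt(0).sum(axis=1)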
