Finding longest continuous run in Pandas Dataframe

I have a pandas dataframe which has the following variables: week, product code, constraint flag (0 or 1 indicating whether the product supply was constrained).
Week Product_Code Constraint_Flag
1 A 1
1 B 0
2 A 0
2 B 1
3 A 0
3 B 0
4 A 0
4 B 0
5 A 1
5 B 0
I want to find the longest time period that the supply was unconstrained for, i.e. the longest run of 0s for each product code. So for product A I would want to know that the longest run started in week 2 and lasted for 3 weeks, and for product B the longest run started in week 3 and lasted for 3 weeks.
How can I do this?

Use this solution to find the longest run of 0s across the whole dataframe, then aggregate the weeks with first and last:
# pad the "constrained" mask with True so every run of 0s has a start and an end
m = np.concatenate(( [True], df['Constraint_Flag'] != 0, [True] ))
# positions where the mask changes mark the boundaries of each run of 0s
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)
# take the (start, end) positions of the longest run
s, e = ss[(ss[:,1] - ss[:,0]).argmax()]
# column position of 'Week' (used in the grouped version below)
pos = df.columns.get_loc('Week')
print (s,e)
4 8
print (df.iloc[s:e])
Week Product_Code Constraint_Flag
4 3 A 0
5 3 B 0
6 4 A 0
7 4 B 0
df = df.iloc[s:e].groupby('Product_Code')['Week'].agg(['first','last'])
print (df)
first last
Product_Code
A 3 4
B 3 4
But if you need to compare per group:
def f(x):
    print (x)   # debug print of each group (output shown below)
    # pad the group's "constrained" mask with True so every run of 0s is bounded
    m = np.concatenate(( [True], x['Constraint_Flag'] != 0, [True] ))
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)
    # longest run of 0s within this group
    s, e = ss[(ss[:,1] - ss[:,0]).argmax()]
    pos = x.columns.get_loc('Week')
    c = ['start','end']
    return pd.Series([x.iat[s, pos], x.iat[e-1, pos]], index=c)
Week Product_Code Constraint_Flag
0 1 A 1
2 2 A 0
4 3 A 0
6 4 A 0
8 5 A 1
Week Product_Code Constraint_Flag
1 1 B 0
3 2 B 1
5 3 B 0
7 4 B 0
9 5 B 0
df = df.groupby('Product_Code').apply(f)
print (df)
start end
Product_Code
A 2 4
B 3 5
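As a side note (not part of the original answer), the same per-product result can be obtained with the common run-labelling idiom: mark where the flag changes within each product, cumsum those changes to get a run id, and then aggregate each run of 0s. A minimal sketch, assuming the dataframe from the question; the run_id column is just a temporary helper:
tmp = df.sort_values(['Product_Code', 'Week']).copy()
# a new run starts whenever the flag changes within a product
tmp['run_id'] = (tmp.groupby('Product_Code')['Constraint_Flag']
                   .transform(lambda s: (s != s.shift()).cumsum()))
# keep only unconstrained rows and measure each run of 0s
runs = (tmp[tmp['Constraint_Flag'] == 0]
          .groupby(['Product_Code', 'run_id'])['Week']
          .agg(start='first', weeks='count')
          .reset_index())
# longest unconstrained run per product
print(runs.sort_values('weeks').groupby('Product_Code').tail(1))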

Related

Find the smallest index to fit multiple conditions in pandas

I want to group my values together in pairs whose sum reaches a certain target value (here 6).
For example, I want to put together (1+5), (3+3), (4+1) and leave the rest by themselves. For this, I need to be able to search for the index that completes such a combination, and to skip a value if no matching number exists. In the "Grouped" column I keep track of whether rows have already been grouped; if so I leave them alone, since an index can only be grouped once.
I have:
df_1 = pd.DataFrame({'Rest_after_division': [1,3,3,4,5,5,1],
                     'Grouped_with_index': ["-","-","-","-","-","-","-"],
                     'Grouped': [0,0,0,0,0,0,0]})
Rest_after_division Grouped_with_index Grouped
0 1 - 0
1 2 - 0
2 3 - 0
3 4 - 0
4 5 - 0
5 5 - 0
6 5 - 0
I want:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 2 3 1
2 3 - 0
3 4 1 1
4 5 0 1
5 5 - 0
6 5 - 0
I have example 2:
df_1 = pd.DataFrame({'Rest_after_division': [1,1,1,4,5,5,5],
                     'Grouped_with_index': ["-","-","-","-","-","-","-"],
                     'Grouped': [0,0,0,0,0,0,0]})
Rest_after_division Grouped_with_index Grouped
0 1 - 0
1 1 - 0
2 1 - 0
3 4 - 0
4 5 - 0
5 5 - 0
6 5 - 0
I want example 2:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 1 5 1
2 1 6 1
3 4 - 0
4 5 0 1
5 5 1 1
6 5 2 1
I have tried (I know I need to loop this eventually, but I can't get the index):
df_1 = df_1.sort_values('Grouped')
index_group_buddy= df_1[df_1['Rest_after_division']==5].head(1).index[0]
print(index_group_buddy)
This almost works, but not when no row meets the condition; how do I skip it in that case? I also think it will be problematic once everything is grouped...
I have also tried:
#index_group_buddy = df_1.loc[((df_1['Rest_after_division'] == 5) & (df_1['Grouped'] != 1)) ].idxmin(axis=1)
#index_group_buddy =df_1.query("Rest_after_division==5 and Grouped!=1")
index_group_buddy = df_1[(df_1['Rest_after_division']==5) & (df_1['Grouped']!=1)].index[0]
df_1.at[index_group_buddy, 'Grouped'] = 1
df_1.at[index_group_buddy, 'Grouped_with_index'] = index_group_buddy
print(index_group_buddy)
I want to find the first index that has the right conditions.
Rework your df_1 to map each unique "Rest_after_division" value to its first index.
Map the complement to 6 through those keys.
Calculate which values should be grouped (the complement must not be the row itself, and only the first occurrence of each value counts).
Insert the values with a mask.
# map each unique value to the index of its first occurrence
keys = (df_1['Rest_after_division']
          .drop_duplicates()
          .reset_index()
          .set_index('Rest_after_division')
          ['index']
        )
# index of the row holding the complement to 6, if any
compl_index = (6 - df_1['Rest_after_division']).map(keys)
# group a row only if its complement is not the row itself
# and it is the first occurrence of its value
df_1['Grouped'] = (compl_index.ne(df_1.index)
                   & df_1.groupby('Rest_after_division').cumcount().eq(0)
                   ).astype(int)
df_1['Grouped_with_index'] = compl_index.where(df_1['Grouped'].eq(1),
                                               df_1['Grouped_with_index'])
output:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 2 3 1
2 3 - 0
3 4 1 1
4 5 0 1
5 5 - 0
6 1 - 0
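For the loop-based attempt in the question, the missing piece is a guard for when no row matches the condition. A rough sketch of that guard (my own illustration, not from the answer above; the helper name is made up):
def pair_first_match(df, idx, target=6):
    # first ungrouped row holding the complement to `target`, if any
    complement = target - df.at[idx, 'Rest_after_division']
    candidates = df[(df['Rest_after_division'] == complement)
                    & (df['Grouped'] == 0)
                    & (df.index != idx)].index
    if len(candidates) == 0:
        return None                   # no partner exists -> skip this row
    buddy = candidates[0]             # first index meeting the conditions
    df.loc[[idx, buddy], 'Grouped'] = 1
    df.at[idx, 'Grouped_with_index'] = buddy
    df.at[buddy, 'Grouped_with_index'] = idx
    return buddy
Looping this over the still-ungrouped indices would reproduce the greedy pairing the question describes.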

How do I calculate the first value in each group from every other value in the group to calculate change over time?

In R I can calculate the change over time for each group in a data set like this:
df %>%
  group_by(z) %>%
  mutate(diff = y - y[x == 0])
What is the equivalent in pandas?
I know that using pandas you can minus the first value of a column like this:
df['diff'] = df.y-df.y.iloc[0]
But how do you group by variable z?
Example data:
x y z
0 2 A
5 4 A
10 6 A
0 1 B
5 5 B
10 9 B
Expected output:
x y z diff
0 2 A 0
5 4 A 2
10 6 A 4
0 1 B 0
5 5 B 4
10 9 B 8
You can try this.
# subtract each group's first y value (the x == 0 row), then align back by the original index
temp = df.groupby('z').\
    apply(lambda g: g.y - g.y.iloc[0]).\
    reset_index().\
    rename(columns={'y': 'diff'}).\
    drop('z', axis=1)
df.merge(temp, how='inner', left_index=True, right_on='level_1').\
    drop('level_1', axis=1)
Result:
x y z diff
0 2 A 0
5 4 A 2
10 6 A 4
0 1 B 0
5 5 B 4
10 9 B 8
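A more direct equivalent of the dplyr code (a sketch, not taken from the answer above) uses GroupBy.transform so the result aligns with the original index; it assumes, as in the example data, that the baseline row of each group is the one where x == 0:
# y value of the x == 0 row, broadcast to every row of its z group
baseline = df['y'].where(df['x'] == 0).groupby(df['z']).transform('first')
df['diff'] = df['y'] - baseline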

python pandas - remove duplicates in a column and keep rows according to a complex criteria

Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
id qual nm
0 1 10 0
1 1 20 0
2 2 10 0
3 2 5 0
4 2 10 1
5 3 7 1
6 3 7 0
7 3 3 2
8 4 10 0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1, 2, 3, 4. The row to keep should be chosen by the following criteria: take the one with the smallest nm; if equal, take the one with the largest qual; if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform to keep the rows with the minimum nm per group, then the rows with the maximum qual, and finally, if there are still duplicates, remove them with DataFrame.drop_duplicates:
#get minimal nm
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
#get maximal qual
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
#if still dupes get first id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
Or use:
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform(min) == df['nm']) &
       (grouper['qual'].transform(max) == df['qual']), :].drop_duplicates(subset=['id'])
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
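Another common idiom for this kind of keep-the-best-row-per-id problem (a sketch, not from either answer above) is to sort by the tie-breaking criteria and keep the first row of each id:
# smallest nm first, then largest qual; drop_duplicates keeps that first row per id
df.sort_values(['id', 'nm', 'qual'], ascending=[True, True, False]).drop_duplicates('id')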

Select rows which have only zeros in columns

I want to select the rows in a dataframe which have zero in every column in a list of columns, e.g. this df:
In:
df = pd.DataFrame([[1,2,3,6], [2,4,6,8], [0,0,3,4],[1,0,3,4],[0,0,0,0]],columns =['a','b','c','d'])
df
Out:
a b c d
0 1 2 3 6
1 2 4 6 8
2 0 0 3 4
3 1 0 3 4
4 0 0 0 0
Then:
In:
mylist = ['a','b']
selection = df.loc[df['mylist']==0]
selection
I would like to see:
Out:
a b c d
2 0 0 3 4
4 0 0 0 0
Should be simple but I'm having a slow day!
You need to determine whether all of the listed columns in a row are zero. Build a boolean mask and reduce it across columns with DataFrame.all(axis=1):
df[df[mylist].eq(0).all(1)]
a b c d
2 0 0 3 4
4 0 0 0 0
Note that if you wanted to find rows with zeros in every column, remove the subsetting step:
df[df.eq(0).all(1)]
a b c d
4 0 0 0 0
Using reduce and NumPy's logical_and
The point of this is to eliminate the need to create new Pandas objects and simply produce the mask we are looking for using the data where it sits.
import numpy as np
from functools import reduce

df[reduce(np.logical_and, (df[c].values == 0 for c in mylist))]
a b c d
2 0 0 3 4
4 0 0 0 0
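As a small aside (my addition, not part of the answer), NumPy ufuncs also expose their own reduce, so the same mask can be built without functools:
# equivalent mask built by reducing logical_and over the column comparisons
df[np.logical_and.reduce([df[c].values == 0 for c in mylist])]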

pandas dataframe apply using additional arguments

With the example below:
df = pd.DataFrame({'signal':[1,0,0,1,0,0,0,0,1,0,0,1,0,0],'product':['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],'price':[1,2,3,4,5,6,7,1,2,3,4,5,6,7],'price2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function fill_price to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, 'Price_B' equals 'price' if 'signal' is 1, and equals the previous row's 'Price_B' if 'signal' is 0. If the subgroup starts with a 'signal' of 0, 'Price_B' is kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    # keep the price only where the signal is 1, forward-fill it, and use 0 before the first 1
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subsets separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much more simply without the extra function:
df['Price_B'] = (df.groupby('product', as_index=False)
                   .apply(lambda x: x['price2'].where(x.signal == 1).ffill().fillna(0))
                   .reset_index(level=0, drop=True))
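To cover both price columns the question mentions, one option (a sketch under the same assumptions as the answers above; the Price_B_... column names are made up for illustration) is to loop over the source columns:
for col in ['price', 'price2']:
    df['Price_B_' + col] = (df.groupby('product')
                              .apply(lambda x, c=col: fill_price(x, 'signal', c))
                              .reset_index(level=0, drop=True))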
