Pandas: count how many rows between two values in a column - python

Let's say I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
'Est': [1.18,1.83,2.08,2.30,2.45,3.21,3.26,3.54,3.87,4.58,4.59,4.98],
'Buy': [0,1,1,1,0,1,1,0,1,0,0,1]
})
Est Buy
0 1.18 0
1 1.83 1
2 2.08 1
3 2.30 1
4 2.45 0
5 3.21 1
6 3.26 1
7 3.54 0
8 3.87 1
9 4.58 0
10 4.59 0
11 4.98 1
I would like to create a new dataframe with two columns and 4 rows, in the following format: the first row contains how many 'Est' values lie between 1 and 2, and how many 1's the 'Buy' column has for those rows; the second row the same for 'Est' values between 2 and 3; the third row between 3 and 4; and so on. So my output should be
A B
0 2 1
1 3 2
2 4 3
3 3 1
I tried to use the where clause in pandas (or np.where) to create new columns with restrictions like (df['Est'] >= 1) & (df['Est'] <= 2) and then count. But is there an easier and cleaner way to do this? Thanks

Sounds like you want to group by the floor of the first column:
g = df.groupby(df['Est'] // 1)
Then count the Est column:
count = g['Est'].count()
And sum the Buy column:
buys = g['Buy'].sum()
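If you want both results assembled into the desired two-column frame in one step, named aggregation works as well; a minimal sketch (the labels A and B are simply taken from your expected output):
out = (df.groupby(df['Est'] // 1)
         .agg(A=('Est', 'count'), B=('Buy', 'sum'))
         .reset_index(drop=True))
print(out)
#    A  B
# 0  2  1
# 1  3  2
# 2  4  3
# 3  3  1
For bin edges that are not whole numbers, pd.cut would be the more general tool.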

Related

How to check if there is a row with same value combinations in a dataframe?

I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, whether there is another row with the same ProjektID, Jahr and Week where the freq is not 0. If this is true, I want a new column "other" to be 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach, can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep="\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If the (ProjektID, Jahr, Week) combination is duplicated and any of the duplicated rows' freq values is larger than zero, then the duplicated rows (keep=False also captures the first occurrence) where freq is zero get other set to 1. Change any() to all() if you need all of those freq values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output:
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
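If you prefer to check within each (ProjektID, Jahr, Week) group directly, a transform-based sketch of the same idea (it flags a freq == 0 row whenever some row in its group has freq > 0):
df['other'] = ((df.groupby(['ProjektID', 'Jahr', 'Week'])['freq'].transform('max') > 0)
               & (df['freq'] == 0)).astype(int)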
I think the best way to solve this is not to use pandas too much :-) converting things to sets and tuples should make it fast enough.
The idea is to build a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0, and then check for every line with freq == 0 whether its triple belongs to that set. In code, I'm creating a dummy dataset with:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1.values, which is not a pandas DataFrame but rather a numpy array, so each row in there can now be converted to a tuple. This is necessary because dataframe rows, numpy arrays, and lists are mutable objects that can't be hashed, so they can't be put in a set. Using a set instead of e.g. a list (which doesn't have this restriction) is for efficiency: membership tests on a set are O(1).
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done; this is the extra column you want, except that we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to get values 0 and 1, as you asked, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
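Adapted to the dataframe from the question (assuming the column names ProjektID, Jahr, Week, freq), the recipe would read:
# Build the set of (ProjektID, Jahr, Week) triples that occur with freq != 0 ...
triplets = {tuple(row) for row in df.loc[df['freq'] != 0, ['ProjektID', 'Jahr', 'Week']].values}
# ... and flag freq == 0 rows whose triple appears in that set.
belongs = df[['ProjektID', 'Jahr', 'Week']].apply(lambda r: tuple(r) in triplets, axis=1)
df['other'] = (belongs & (df['freq'] == 0)).astype(int)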
Looks like I am too late ...:
# Use (ProjektID, Jahr, Week) as the index so rows can be matched by label.
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
# Where freq == 0, check whether the same index also occurs among rows with freq != 0.
df['other'] = df['other'].mask(df['freq'] == 0,
                               df.loc[df['freq'] == 0].index.isin(df.loc[df['freq'] != 0].index))
df['other'] = df['other'].astype('int')
df.reset_index(drop=False, inplace=True)

Pandas rank values in rows of DataFrame

Learning Python. I have a dataframe like this
cand1 cand2 cand3
0 40.0900 39.6700 36.3700
1 44.2800 44.2800 35.4200
2 43.0900 51.2200 46.3500
3 35.7200 55.2700 36.4700
and I want to rank each row according to the value of the columns, so that I get
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 1 3 2
3 3 1 2
I have now
for index, row in df.iterrows():
    df.loc['Rank'] = df.loc[index].rank(ascending=False).astype(int)
    print(df)
However, this keeps on repeating the whole dataframe. Note also the special case in row 2, where two values are the same.
Suggestions appreciated
Use df.rank instead of the Series rank:
df_rank = df.rank(axis=1, ascending=False, method='min').astype(int)
Out[165]:
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 3 1 2
3 3 1 2
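The method parameter controls how ties are handled; 'min' gives both 44.28 values in row 1 the lower rank. A quick sketch of the common alternatives on that tied row, for reference:
import pandas as pd

row = pd.Series([44.28, 44.28, 35.42])
print(row.rank(ascending=False, method='min').tolist())    # [1.0, 1.0, 3.0]
print(row.rank(ascending=False, method='dense').tolist())  # [1.0, 1.0, 2.0]
print(row.rank(ascending=False, method='first').tolist())  # [1.0, 2.0, 3.0]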

expand pandas groupby results to initial dataframe

Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out, so that they are in a new column of the original df, associated with the respective conditions? This will mean there are duplicates, but I will next be using this column for a subsequent calculation, and having the medians in a column will make this possible.
Example data:
import numpy as np
import pandas as pd
data = {'idx':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'condition1':[1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
'condition2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
'values':np.random.normal(0,1,16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
example of desired result (note duplicates corresponding to correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583
I believe you need GroupBy.transform with median for the new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
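Equivalently, you could aggregate first and merge the medians back; transform just avoids the extra step. A sketch of the merge route, reusing the dfg from the question:
out = df.merge(dfg.rename(columns={'values': 'medians'}),
               on=['idx', 'condition2'], how='left')
transform is usually the cleaner choice here, because it preserves df's index and shape directly.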

Checking condition in future rows in pandas with group by

Following is what my dataframe looks like and Expected_Output is my desired column.
Group Signal Value1 Value2 Expected_Output
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 1.0
7 2 0 3 4 1.0
For a given Group, if Signal == 1, then I am attempting to look at the next three rows (and not the current row) and check if Value1 < Value2. If that condition is true, I return a 1 in the Expected_Output column. If, for example, the Value1 < Value2 condition is satisfied for multiple reasons because a row comes within the next three rows of a Signal == 1 in both rows 5 and 6 (Group 2), then I also return a 1 in Expected_Output.
I am assuming the right combination of a groupby object, np.where, any, and shift could be the solution, but I can't quite get there.
N.B.: Alexander pointed out a conflict in the comments. Ideally, a value being set due to a signal in a prior row will supersede the current-row rule conflict in a given row.
If you are going to be checking lots of previous rows, multiple shifts can quickly get messy, but here it's not too bad:
import numpy as np

s = df.groupby('Group').Signal
condition = ((s.shift(1).eq(1) | s.shift(2).eq(1) | s.shift(3).eq(1))
& df.Value1.lt(df.Value2))
df.assign(out=np.where(condition, 1, np.nan))
Group Signal Value1 Value2 out
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 1.0
7 2 0 3 4 1.0
If you're concerned about the performance of using so many shifts, I wouldn't worry too much; here's a sample on roughly a million rows:
In [401]: len(df)
Out[401]: 960000
In [402]: %%timeit
...: s = df.groupby('Group').Signal
...:
...: condition = ((s.shift(1).eq(1) | s.shift(2).eq(1) | s.shift(3).eq(1))
...: & df.Value1.lt(df.Value2))
...:
...: np.where(condition, 1, np.nan)
...:
...:
94.5 ms ± 524 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Alexander identified a conflict in the rules; here is a version using a mask that fits that requirement:
s = (df.Signal.mask(df.Signal.eq(0)).groupby(df.Group)
.ffill(limit=3).mask(df.Signal.eq(1)).fillna(0))
Now you can simply use this column along with your other condition:
np.where((s.eq(1) & df.Value1.lt(df.Value2)).astype(int), 1, np.nan)
array([nan, nan, nan, 1., nan, nan, nan, 1.])
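For reference, on the sample data the intermediate s marks exactly the rows that lie within three rows after a signal, with the signal rows themselves masked out:
print(s.tolist())
# [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0]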
You can create an index that matches your criteria, and then use it to set the expected output to 1.
It is not clear how to treat the expected output when the rules conflict. For example, on row 6, the expected output would be 1 because it satisfies the signal criteria from row 5 and fits 'the subsequent three rows where Value1 < Value2'. However, it possibly conflicts with the rule that the first signal row is ignored.
idx = (df
.assign(
grp=df['Signal'].eq(1).cumsum(),
cond=df.eval('Value1 < Value2'))
.pipe(lambda df: df[df['grp'] > 0]) # Ignore data preceding first signal.
.groupby(['Group', 'grp'], as_index=False)
.apply(lambda df: df.iloc[1:4, :]) # Ignore current row, get rows 1-3.
.pipe(lambda df: df[df['cond']]) # Find rows where condition is met.
.index.get_level_values(1)
)
df['Expected_Output'] = np.nan
df.loc[idx, 'Expected_Output'] = 1
>>> df
Group Signal Value1 Value2 Expected_Output
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 NaN # <<< Intended difference vs. "expected"
7 2 0 3 4 1.0

Creating lambda function with conditions on one df to use in df.apply of another df

Consider df
Index A B C
0 20161001 0 24.5
1 20161001 3 26.5
2 20161001 6 21.5
3 20161001 9 29.5
4 20161001 12 20.5
5 20161002 0 30.5
6 20161002 3 22.5
7 20161002 6 25.5
...
Also consider df2
Index Threshold
0 25
1 27
2 29
3 30
4 25
5 30
..
I want to add a column "Number of Rows" to df2 which contains the number of rows in df where (C > Threshold) & (A >= 20161001) & (A <= 20161002) holds true. The point is that the count involves conditions on more than one column of df.
Index Threshold Number of Rows
0 25 4
1 27 2
2 29 2
3 30 1
4 25 4
5 30 1
..
For Threshold=25 in df2, there are 4 rows in df where the 'C' value exceeds 25.
I tried something like:
def foo(threshold, start, end):
    return len(df[(df['C'] > threshold) & (df['A'] > start) & (df['A'] < end)])

df2['Number of rows'] = df2.apply(lambda row: foo(row['Threshold'], start=20161001, end=20161002), axis=1)
But this is populating the Number of Rows column with 0. Why is this?
You could make use of boolean indexing and the sum() aggregate function. (Your version counts zero rows because df['A'] > start and df['A'] < end are strict comparisons, and no A value lies strictly between 20161001 and 20161002.)
# Create the first dataframe (df)
df = pd.DataFrame([[20161001,0 ,24.5],
[20161001,3 ,26.5],
[20161001,6 ,21.5],
[20161001,9 ,29.5],
[20161001,12,20.5],
[20161002,0 ,30.5],
[20161002,3 ,22.5],
[20161002,6 ,25.5]],columns=['A','B','C'])
# Create the second dataframe (df2)
df2 = pd.DataFrame(data=[25,27,29,30,25,30],columns=['Threshold'])
start = 20161001
end = 20161002
df2['Number of Rows'] = df2['Threshold'].apply(lambda x : ((df.C > x) & (df.A >= start) & (df.A <= end)).sum())
print(df2['Number of Rows'])
Out[]:
0 4
1 2
2 2
3 1
4 4
5 1
Name: Number of Rows, dtype: int64
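As an aside, if df2 had many thresholds, the row-wise apply could be replaced by a single numpy broadcast over the in-range values; a sketch reusing start and end from above:
import numpy as np

# Rows of df inside the date window.
in_range = (df['A'] >= start) & (df['A'] <= end)
c = df.loc[in_range, 'C'].to_numpy()
# Compare every threshold against every in-range C value at once; shape (len(df2), len(c)).
df2['Number of Rows'] = (c[None, :] > df2['Threshold'].to_numpy()[:, None]).sum(axis=1)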
