Imagine you have a dataframe df as follows:
Id Side Volume
2 a 40
2 b 30
1 a 20
2 b 10
You want the following output
Id Side sum
1 a 20
1 all 20
2 a 40
2 b 40
2 all 80
all a 60
all b 40
all all 100
Which would be df.groupby(['Id','Side']).Volume.sum().reset_index() plus the sums over all Sides and all Ids (df.Volume.sum(), df[df.Side == 'a'].Volume.sum(), df[df.Side == 'b'].Volume.sum(), etc.)?
Is there a way to do this without calculating it outside and then merging both results?
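One way to get all of these subtotals in a single pass (a sketch, assuming the 'all' labels may come out spelled 'All') is pivot_table with margins=True, then stacking back to long format:
import pandas as pd

df = pd.DataFrame({'Id': [2, 2, 1, 2],
                   'Side': ['a', 'b', 'a', 'b'],
                   'Volume': [40, 30, 20, 10]})
# margins=True adds the 'All' row and column totals
pt = pd.pivot_table(df, index='Id', columns='Side', values='Volume',
                    aggfunc='sum', margins=True)
out = pt.stack().reset_index(name='sum')  # back to long format; empty cells drop out
print(out)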
Let's consider this Dataframe:
$> df
a b
0 6 50
1 2 20
2 9 60
3 4 40
4 5 20
I want to compute column d based on:
The max value between:
the integer 0
a slice of column b from that row's index to the end
So I have created a column c (all zeroes) in my DataFrame in order to use DataFrame.max(axis=1). However, short of using apply or looping over the DataFrame, I don't know how to slice the input values. The expected result would be:
$> df
a b c d
0 6 50 0 60
1 2 20 0 60
2 9 60 0 60
3 4 40 0 40
4 5 20 0 20
So essentially, d at index 3 is computed (pseudo-code) as max(df[3:, "b"], df[3:, "c"]), and similarly for each row.
Since the input columns (b, c) have already been computed, there has to be a way to slice the input as I calculate each row of d without looping, as looping is slow.
Seems like this could work: reverse b, take the cummax, then reverse it back and assign it to d. Then use where on d to floor the result at 0:
df['d'] = df['b'][::-1].cummax()[::-1]   # running max of b from each row to the end
df['d'] = df['d'].where(df['d'] > 0, 0)  # replace anything not above 0 with 0
We can replace the last line using clip (thanks @Either), and drop the second reversal (the assignment realigns on the index), making it all a one-liner:
df['d'] = df['b'][::-1].cummax().clip(lower=0)
Output:
a b d
0 6 50 60
1 2 20 60
2 9 60 60
3 4 40 40
4 5 20 20
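For reference, a self-contained version of the one-liner (a minimal sketch, with the frame rebuilt from the question's table):
import pandas as pd

df = pd.DataFrame({'a': [6, 2, 9, 4, 5], 'b': [50, 20, 60, 40, 20]})
# reversed cummax gives the running max of b from each row to the end;
# clip then enforces the max with 0
df['d'] = df['b'][::-1].cummax().clip(lower=0)
print(df)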
I am trying to rank a large dataset using Python. I do not want duplicate ranks, and rather than breaking ties with the 'first' method I would like ties to be broken by the value in another column. It should only look at the second column when the first column has duplicates.
Name CountA CountB
Alpha 15 3
Beta 20 52
Delta 20 31
Gamma 45 43
I would like the ranking to end up like this:
Name CountA CountB Rank
Alpha 15 3 4
Beta 20 52 2
Delta 20 31 3
Gamma 45 43 1
Currently, I am using df.rank(ascending=False, method='first')
Maybe use sort and map the sorted positions back through the index:
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C','D'],'CountA':[15,20,20,45],'CountB':[3,52,31,43]})
order = df.sort_values(['CountA','CountB'], ascending=False).index
df['rank'] = pd.Series(range(1, len(df) + 1), index=order)  # the assignment realigns on the original index
Name CountA CountB rank
0 A 15 3 4
1 B 20 52 2
2 C 20 31 3
3 D 45 43 1
You can take the value counts of CountA and build a key column from them: where a CountA value occurs more than once, take CountB, otherwise keep CountA.
df = pd.DataFrame([[15,3],[20,52],[20,31],[45,43]], columns=['CountA','CountB'])
colAcount = df['CountA'].value_counts()
# take the CountA values that occur more than once and swap in CountB for those rows
df['final'] = df['CountA'].where(~df['CountA'].isin(colAcount[colAcount > 1].index), df['CountB'])
df = df.sort_values(by='final', ascending=False).reset_index(drop=True)
# the rank is now the (zero-based) index
CountA CountB final
0 20 52 52
1 45 43 45
2 20 31 31
3 15 3 15
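For reference, here is a numpy-based sketch of the same sort-based idea that yields the dense, tie-free ranking directly (np.lexsort treats its last key as the primary sort key):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alpha','Beta','Delta','Gamma'],
                   'CountA': [15, 20, 20, 45],
                   'CountB': [3, 52, 31, 43]})
# CountA primary, CountB tie-break, both descending
order = np.lexsort((-df['CountB'].to_numpy(), -df['CountA'].to_numpy()))
ranks = np.empty(len(df), dtype=int)
ranks[order] = np.arange(1, len(df) + 1)  # invert the sort permutation: row -> rank
df['Rank'] = ranks  # Alpha 4, Beta 2, Delta 3, Gamma 1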
Given the following pandas dataframe:
Name speed
---------------
0 A 100
1 A 50
2 A 40
4 A 30
5 A 10
6 B 100
7 B 50
8 B 40
9 B 120
10 A 10
I want to select rows for each name where 10 < speed < 100 and the neighbouring rows' speeds are also between 10 and 100; once we find a row with speed <= 10 we stop selecting, and if we find a row with speed >= 100 we skip it and keep iterating over the remaining rows. The result should be like this:
Name speed
---------------
1 A 50
2 A 40
4 A 30
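For reference, the snippets below assume the frame is rebuilt from the table above (note the gap in the index at 3):
import pandas as pd

df = pd.DataFrame({'Name': ['A','A','A','A','A','B','B','B','B','A'],
                   'speed': [100, 50, 40, 30, 10, 100, 50, 40, 120, 10]},
                  index=[0, 1, 2, 4, 5, 6, 7, 8, 9, 10])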
You can use between() and shift() for this:
low = 10
high = 100
# pandas >= 1.3 spells the open interval inclusive='neither' (older releases used inclusive=False)
mask_next = df.speed.between(low, high, inclusive='neither') & df.speed.shift(-1).between(low, high, inclusive='neither')
mask_before = df.speed.between(low, high, inclusive='neither') & df.speed.shift(1).between(low, high, inclusive='neither')
df[mask_next | mask_before]
Output:
Name speed
1 A 50
2 A 40
4 A 30
7 B 50
8 B 40
If you need it per name (so the shifts never pair a row with a row of a different name), recompute the masks per group:
results = []
for name, sub in df.groupby('Name', sort=False):
    m_next = sub.speed.between(low, high, inclusive='neither') & sub.speed.shift(-1).between(low, high, inclusive='neither')
    m_before = sub.speed.between(low, high, inclusive='neither') & sub.speed.shift(1).between(low, high, inclusive='neither')
    results.append(sub[m_next | m_before])
pd.concat(results)
Output:
Name speed
1 A 50
2 A 40
4 A 30
7 B 50
8 B 40
The results are identical for your test data. However, the second approach evaluates ids 5 and 10 only against neighbours of the same name, whereas the first approach's shifts cross name boundaries and will always drop id 10 (the last row has no shift(-1) neighbour).
Let's say that I have the following dataframe:
name number
0 A 100
1 B 200
2 B 30
3 A 20
4 B 30
5 A 40
6 A 50
7 A 100
8 B 10
9 B 20
10 B 30
11 A 40
What I would like to do is to merge all the successive rows where name == 'B', between two rows with name == 'A' and get the corresponding sum. So, I would like my final output to look like that:
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
We can use a little groupby trick here. Create a mask of the A's and then shift each subsequent run of B's into its own group. This answer assumes that your name Series contains just A's and B's.
c = df['name'].eq('A')
m1 = c.cumsum()                 # every A row starts a new group number
m = m1.where(c, m1 + m1.max())  # offset the B runs so they can never collide with an A group
df.groupby(m, sort=False, as_index=False).agg({'name': 'first', 'number': 'sum'})
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
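To see why the grouping key works, here is its trace on the sample frame (values computed from the data above):
name:  A  B  B  A  B  A  A  A  B   B   B   A
c:     T  F  F  T  F  T  T  T  F   F   F   T
m1:    1  1  1  2  2  3  4  5  5   5   5   6
m:     1  7  7  2  8  3  4  5  11  11  11  6    (m1.max() == 6)
Every A row gets a unique key, and each run of B's shares a key that no A row can take, so aggregating by m merges exactly the consecutive B's.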
A clumsier attempt, but since I've done it I might as well post it.
This is just a basic for loop with a nested while:
for i in df.index:
    if i in df.index and df.loc[i, 'name'] == 'B':
        while i + 1 in df.index and df.loc[i + 1, 'name'] == 'B':
            df.loc[i, 'number'] += df.loc[i + 1, 'number']
            df = df.drop(i + 1).reset_index(drop=True)
It's very straightforward (and hence inefficient, I imagine): if a row is B and the next row is also B, add the next row's number to this row and delete the next row.
Consider the below pandas DataFrame:
df = pd.DataFrame({0:['a',1,2,3,'a',1,2,3],1:[10,20,30,40,50,60,70,80],2:[100,200,300,400,500,600,700,800]})
0 1 2
0 a 10 100
1 1 20 200
2 2 30 300
3 3 40 400
4 a 50 500
5 1 60 600
6 2 70 700
7 3 80 800
I want to reshape the DataFrame such that my desired output looks like:
1 2 3 4
a 10 100 50 500
1 20 200 60 600
2 30 300 70 700
3 40 400 80 800
Basically, I have a repetitive and finite set of values in df[0], but the corresponding values in the other columns are unique at each repetition. I want to unstack the table in such a way that I get the desired output. A numpy solution is also acceptable.
You can do something like this: group the rows by column 0 and then convert each group into a Series (shown here for column 1 only):
df.groupby(0)[1].apply(list).apply(pd.Series)
# 0 1
#0
#1 20 60
#2 30 70
#3 40 80
#a 10 50
Use groupby and then convert values to columns:
df.groupby(by=[0])[1].apply(lambda x: pd.Series(x.tolist())).unstack()
Out[37]:
0 1
0
1 20 60
2 30 70
3 40 80
a 10 50
Here's one solution, using a dictionary to store your repetitive values and the corresponding columns, then converting it back to a DataFrame. Keep in mind that plain dicts don't guarantee insertion order before Python 3.7, so on older versions you would need to tweak this a bit to keep the order of your repetitive values.
df = pd.DataFrame({0:['a',1,2,3,'a',1,2,3],1:[10,20,30,40,50,60,70,80]})
unstacked = {}
for index, row in df.iterrows():
    if row.iloc[0] not in unstacked:
        unstacked[row.iloc[0]] = list(row.iloc[1:])
    else:
        unstacked[row.iloc[0]] += list(row.iloc[1:])
unstacked_df = pd.DataFrame.from_dict(unstacked, orient='index')
print(unstacked_df)
0 1
a 10 50
1 20 60
2 30 70
3 40 80
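For completeness, here is a minimal sketch that reproduces the exact four-column target from the question in one pass (my construction from the question's frame; groupby with sort=False keeps the order of first appearance):
import pandas as pd

df = pd.DataFrame({0: ['a',1,2,3,'a',1,2,3],
                   1: [10,20,30,40,50,60,70,80],
                   2: [100,200,300,400,500,600,700,800]})
# flatten columns 1 and 2 of each group row by row, then relabel the columns 1..4
out = df.groupby(0, sort=False)[[1, 2]].apply(lambda g: pd.Series(g.to_numpy().ravel()))
out.columns = range(1, out.shape[1] + 1)
print(out)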