I need to calculate a column based on other rows. Basically, I want my new_column to be the sum of "base_column" over all rows with the same id.
I currently do the following (but it is not really efficient); what is the most efficient way to achieve this?
def calculate(x):
    # select all rows with the same id as the current row
    # (in fact my filter is more complex: basically same id and a date in the last 4 weeks)
    filtered_df = df[df["id"] == df.at[x.name, "id"]]
    df.at[x.name, "new_column"] = filtered_df["base_column"].sum()

df.apply(calculate, axis=1)
You can do it as below:
df['new_column'] = df.groupby('id')['base_column'].transform('sum')
Input:
id base_column
0 1 2
1 1 4
2 2 5
3 3 6
4 5 7
5 7 4
6 7 5
7 7 3
Output:
id base_column new_column
0 1 2 6
1 1 4 6
2 2 5 5
3 3 6 6
4 5 7 7
5 7 4 12
6 7 5 12
7 7 3 12
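For completeness, here is a minimal self-contained sketch of the same approach, with the frame built from the input table above:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 5, 7, 7, 7],
                   'base_column': [2, 4, 5, 6, 7, 4, 5, 3]})

# transform('sum') broadcasts each group's sum back onto every row of that group
df['new_column'] = df.groupby('id')['base_column'].transform('sum')
print(df)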
Another way to do this is to use groupby and merge:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'base_column': [2, 4, 5]})

# compute the sum of base_column by id
sum_base = (df.groupby("id")
              .agg({"base_column": 'sum'})
              .reset_index()
              .rename(columns={'base_column': 'new_column'}))

# join the result back to df
df = pd.merge(df, sum_base, how='left', on='id')
# id base_column new_column
#0 1 2 6
#1 1 4 6
#2 2 5 5
I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to group by B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work; it raises TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result:
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
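For reference, the intermediate counter that factorize re-encodes (values computed from the sample frame above) looks like this:
# df as defined in the question:
# pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
df.groupby('B')['A'].diff().ne(1).cumsum()
# 0    1
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# 6    5
# 7    5
# 8    6
# 9    7
# factorize()[0] relabels these values as consecutive integers starting at 0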
Use DataFrameGroupBy.diff, compare not equal to 1 with Series.ne, apply Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Overview
When creating a conditional count_cumsum column in pandas, I created a temporary Count column and then deleted it after the desired column was created.
Code
df = pd.DataFrame({"Level":[1,2,3,4,5,6,7,8],
"Price":[2,3,4,5,6,7,1,10]})
df["Count"] = np.where((df.Price > df.Level),1,np.NaN)
df['count_cumsum'] = df.Count.groupby(df.Count.isna().cumsum()).cumsum()
del df["Count"]
Level Price count_cumsum
0 1 2 1.0
1 2 3 2.0
2 3 4 3.0
3 4 5 4.0
4 5 6 5.0
5 6 7 6.0
6 7 1 NaN
7 8 10 1.0
Question
How can I use a zero instead of NaN for the df["Count"] column so that count_cumsum stays an int column, and is there a simpler way to produce this output?
Desired output
Level Price count_cumsum
0 1 2 1
1 2 3 2
2 3 4 3
3 4 5 4
4 5 6 5
5 6 7 6
6 7 1 0
7 8 10 1
To use zero instead of NaN, you can replace np.nan with 0 and replace isna() with eq(0) in your code; a sketch of that is shown right below, and after it I will go straight to a simpler way of writing the whole thing.
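A minimal sketch of the direct modification (reusing the data from your question; the only changes from your code are the 0 and the eq(0)):
import pandas as pd
import numpy as np

df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})

# same structure as your code, but with 0 instead of np.nan and eq(0) instead of isna()
df["Count"] = np.where(df.Price > df.Level, 1, 0)
df["count_cumsum"] = df.Count.groupby(df.Count.eq(0).cumsum()).cumsum()
del df["Count"]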
You can simplify the processing logic as follows:
# replace np.where on the boolean condition with astype(int),
# which converts False/True to 0/1 directly
m = (df.Price > df.Level).astype(int)
# use the series m both for grouping and for the cumulative sum
df['count_cumsum'] = m.groupby(m.eq(0).cumsum()).cumsum()
In this way, the code is simplified in two ways: there is no need to define a temporary column df["Count"] and delete it afterwards, and np.where((df.Price > df.Level), 1, 0) is reduced to converting the boolean condition (df.Price > df.Level) to integers, which gives 0 and 1 for False and True respectively.
Result:
print(df)
Level Price count_cumsum
0 1 2 1
1 2 3 2
2 3 4 3
3 4 5 4
4 5 6 5
5 6 7 6
6 7 1 0
7 8 10 1
You can avoid the NaNs altogether with the clean and readable solution below:
df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})
df["Count"] = np.where(df.Price > df.Level, 1, 0)
df['count_cumsum'] = df.Count.groupby((df.Count == 0).cumsum()).cumsum()
del df["Count"]
Level Price count_cumsum
0 1 2 1
1 2 3 2
2 3 4 3
3 4 5 4
4 5 6 5
5 6 7 6
6 7 1 0
7 8 10 1
This leaves everything as an int type as well, which seems to be what you're after.
I have a dataframe with a bunch of Q&A sessions. Each time the speaker changes, the dataframe gets a new row. I'm trying to assign question characteristics to the answers, so I want to create an ID for each question-answer group. In the example below, I want to increment the id each time a new question is asked (speakertype_id == 3 marks a question; speakertype_id == 4 marks an answer). I currently loop through the dataframe like so:
import pandas as pd

Q_A = pd.DataFrame({'qna_id': [9]*10,
                    'qnacomponentid': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                    'speakertype_id': [3, 4, 3, 4, 4, 4, 3, 4, 3, 4]})

group = [0]*len(Q_A)
j = 1
for index, row in enumerate(Q_A.itertuples()):
    if row[3] == 3:  # a new question starts, so bump the group id
        j += 1
    group[index] = j
Q_A['group'] = group
This gives me the desired output and is much faster than I expected, but this post makes me question whether I should ever iterate over a pandas dataframe. Any thoughts on a better method? Thanks.
Edit: Expected output:
qna_id qnacomponentid speakertype_id group
9 3 3 2
9 4 4 2
9 5 3 3
9 6 4 3
9 7 4 3
9 8 4 3
9 9 3 4
9 10 4 4
9 11 3 5
9 12 4 5
You can use eq and cumsum like this:
Q_A['gr2'] = Q_A['speakertype_id'].eq(3).cumsum()
print(Q_A)
qna_id qnacomponentid speakertype_id group gr2
0 9 3 3 2 1
1 9 4 4 2 1
2 9 5 3 3 2
3 9 6 4 3 2
4 9 7 4 3 2
5 9 8 4 3 2
6 9 9 3 4 3
7 9 10 4 4 3
8 9 11 3 5 4
9 9 12 4 5 4
Note: I'm not sure whether you have a reason to start the numbering at 2, but if that is a requirement you can simply add 1 after the cumsum.
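For instance, adding 1 after the cumsum reproduces the group column from your expected output:
Q_A['group'] = Q_A['speakertype_id'].eq(3).cumsum() + 1
# -> 2, 2, 3, 3, 3, 3, 4, 4, 5, 5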
I reproduced your expected output as follows:
Q_A['cumsum'] = Q_A[Q_A.speakertype_id!=Q_A.speakertype_id.shift()].groupby('speakertype_id').cumcount()+2
Q_A['cumsum'] = Q_A['cumsum'].ffill().astype('int')
I have a dataframe consisting of two columns with ids and one column with numerical values. I want to group by the first id column and keep all the rows whose second id equals the smallest value within that group, so multiple rows can be kept per group.
This is my pandas dataframe:
id1 id2 num1
1 1 9
1 1 4
1 2 4
1 2 3
1 3 7
2 6 9
2 6 1
2 6 5
2 9 3
2 9 7
3 2 8
3 4 2
3 4 7
3 4 9
3 4 10
What I want to have is:
id1 id2 num1
1 1 9
1 1 4
2 6 9
2 6 1
2 6 5
3 2 8
I have tried keeping the min value, using idxmin(), and removing duplicates, but all of these end up with only one row per id1 and id2.
firstS.groupby('id1')['id2'].transform(min)
Many thanks in advance!
You are close; you only need to compare the id2 column with the transformed Series and filter by boolean indexing:
df = firstS[firstS['id2'] == firstS.groupby('id1')['id2'].transform(min)]
print (df)
id1 id2 num1
0 1 1 9
1 1 1 4
5 2 6 9
6 2 6 1
7 2 6 5
10 3 2 8
Simplest way:
df = df.merge(df.groupby("id1").id2.min().reset_index())
I have a data frame that looks like the following, and for each file I want the row with the max value of "f0max".
f0max file maxtime
0 9 1 1
1 8 1 2
2 7 1 3
3 6 2 4
4 5 2 5
5 4 2 6
6 3 3 7
7 2 3 8
8 1 3 9
So the result would be:
f0max file maxtime
0 9 1 1
3 6 2 4
6 3 3 7
(In the real data there are no duplicate values for f0max and maxtime.) Is this possible in pandas?
To return the entire row corresponding to the max f0max within each file:
df.sort_values('f0max').groupby('file').tail(1)
Output:
f0max file maxtime
6 3 3 7
3 6 2 4
0 9 1 1
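If you prefer to keep the original row order, you can sort the result back by its index, for example:
df.sort_values('f0max').groupby('file').tail(1).sort_index()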
You can use Boolean indexing with GroupBy + transform. Note this will include duplicate maxima by group.
df = df[df['f0max'] == df.groupby('file')['f0max'].transform('max')]
Or you can sort and then drop duplicates by your grouper. If duplicate maxima exist by group, only one will be kept:
df = df.sort_values('f0max', ascending=False)\
.drop_duplicates('file')
Result:
print(df)
f0max file maxtime
0 9 1 1
3 6 2 4
6 3 3 7
Use groupby and merge:
df1 = df.merge(df.groupby('file', as_index=False)['f0max'].max())
print (df1)
f0max file maxtime
0 9 1 1
1 6 2 4
2 3 3 7