I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply the following condition: if duplicate rows share the same value in a column, keep only one of them (without making the sum). Example: the duplicate rows for A share the value 180 in the rent column, so 180 is kept once.
Otherwise, make the sum. Example: the duplicate rows for A have different values 2 and 5 in the sale column, and the duplicate rows for M have different values in both the rent and sale columns, so those are summed.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code, but it's not working as I want:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', 'sum'),
    sale=('sale', 'sum'))
print(df2)
I got this output:
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    # sum only the distinct values in the group, so repeated values count once
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
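One caveat worth noting (this concerns data not shown in the question): sum_unique collapses equal values even when they come from separate rows that should arguably be summed. A tiny illustration:

# two occurrences of the same value within one group count only once
s = pd.Series([1, 1])
print(sum_unique(s))  # 1, not 2

This matches the question's rule as stated, but it applies per value, not per duplicate row pair.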
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
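For clarity, here is the same pipeline split into its two steps, a sketch using the df defined above:

# step 1: collapse rows that share both Name and rent; sale is summed,
# so the duplicate (A, 180) rows become one row with sale = 2 + 5 = 7
step1 = df.groupby(['Name', 'rent'], as_index=False).sum()
# step 2: sum the remaining rows per Name, combining M's two rent values
df2 = step1.groupby('Name', as_index=False).sum()
print(df2)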
I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
1        a     12          13
2        b     13          14
Dataframe2:
country  type  week  value
1        a     12    1000
1        a     13    900
1        a     14    800
2        b     12    1000
2        b     13    900
2        b     14    800
I want to add to the first dataframe a column with the mean value from the second dataframe, matched on the key (country + type) and restricted to the weeks between start_week and end_week.
The desired output looks like this:
country  type  start_week  end_week  avg
1        a     12          13        950
2        b     13          14        850
Here is one way:
combined = df1.merge(df2, on=['country', 'type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country', 'type', 'start_week', 'end_week'])['value'].mean().reset_index()
Output:
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
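The aggregated column keeps the name value; to match the desired output, it can be renamed as a final step:

output = output.rename(columns={'value': 'avg'})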
You can use pd.melt and comparison of numpy arrays.
# melt df1 so start_week and end_week become rows in a single 'week' column
melted_df1 = df1.melt(id_vars=['country', 'type'], value_name='week')[['country', 'type', 'week']]
# for loop comparing the two dataframes' rows; note this matches only weeks equal
# to start_week or end_week, which suffices here because each range spans
# consecutive weeks
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break
# compute the mean of the matched rows per type
result_df = pd.DataFrame(result, columns=df2.columns).groupby('type')['value'].mean().reset_index()['value']
# assign the result to df1
df1['avg'] = result_df
Output:
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0
I have a sorted dataframe with an ID and a value column, which looks like:
ID value
A 10
A 10
A 10
B 15
B 15
C 10
C 10
...
How can I create a new dataframe that counts the "new" distinct values as a function of the number of different IDs, so that it basically walks over my dataframe and looks like:
Number of ID Number of distinct values
1 1
2 2
3 2
In the case above we have 3 different IDs, but IDs A and C have the same value.
So for the first row in the new dataframe:
Number of ID = 1, because we have 1 different ID so far.
Number of distinct values = 1, because we have one distinct value so far.
Second row:
Number of ID = 2, because we move on to row 4 in the old dataframe (we are only interested in new IDs).
Number of distinct values = 2, because the value changed to 15, which hasn't occurred so far.
I think you need to process a new DataFrame created by DataFrame.drop_duplicates: replace the duplicated values with NaN, forward fill them, and then call pd.factorize:
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
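To see what the chained expression computes, the intermediate steps can be split out, a sketch using df1 from above:

masked = df1['value'].mask(df1['value'].duplicated())  # first occurrence kept, repeats become NaN
filled = masked.ffill()                                # carry the last newly seen value forward
df1['Number of distinct values'] = pd.factorize(filled)[0] + 1  # number the distinct values in order of appearance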
I changed the data for better testing:
print (df)
ID value
0 A 10
1 A 10
2 A 10
3 B 15
4 B 15
5 C 10
6 C 15
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
6 C 15 4 2
This cumsum-based alternative works incorrectly if there are multiple values per ID, because cumsum re-adds a value's factorize code on every repeat instead of counting each value once; note the 3 in the last row of the output below, where the correct count is 2:

import numpy as np

df = pd.DataFrame({'Number of ID': range(1, len(df1) + 1),
                   'Number of distinct values': np.cumsum(pd.factorize(df1['value'])[0]) + 1})
print (df)
Number of ID Number of distinct values
0 1 1
1 2 2
2 3 2
3 4 3
I have the below pandas dataframe:
Input:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
My Expected Output is:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
Total Exp 10 12
The last row is basically the total of row 1 and row 3. This is a very simplified example; I actually have to perform complex calculations on a huge dataframe.
Is there a way in Python to perform such a calculation?
You can select rows by position with DataFrame.iloc, sum them, and then assign the result as a new row (demonstrated here on a fully numeric frame):
df.loc[len(df.index)] = df.iloc[0] + df.iloc[2]
Or:
df.loc[len(df.index)] = df.iloc[[0,2]].sum()
print (df)
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 8 10 12
EDIT: The first idea is to create an index from column A, so you can use loc with a new value of A; the last step converts the index back to a column with reset_index:
df = df.set_index('A')
df.loc['Total Exp'] = df.iloc[[0,2]].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Something similar is possible by selecting with loc by labels, here Expense and Travel:
df = df.set_index('A')
df.loc['Total Exp'] = df.loc[['Expense', 'Travel']].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can exclude the first column with the slice 1: and add the A value back with Series.reindex:
df.loc[len(df.index)] = df.iloc[[0,2], 1:].sum().reindex(df.columns, fill_value='Total Exp')
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can set the value of A separately:
s = df.iloc[[0,2]].sum()
s.loc['A'] = 'Total Exp'
df.loc[len(df.index)] = s
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
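Since the real use case involves complex calculations on a huge dataframe, hard-coded positions may not scale. Here is a sketch that selects the rows to total with a boolean mask on A instead, assuming the same df as above:

# pick the rows to total by label rather than by position
mask = df['A'].isin(['Expense', 'Travel'])
total = df.loc[mask, ['B', 'C']].sum()
total['A'] = 'Total Exp'
df.loc[len(df.index)] = total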
Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
id qual nm
0 1 10 0
1 1 20 0
2 2 10 0
3 2 5 0
4 2 10 1
5 3 7 1
6 3 7 0
7 3 3 2
8 4 10 0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1, 2, 3, 4. The row to keep should be chosen based on the following criteria: take the one with the smallest nm; if equal, take the one with the largest qual; if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform, first to keep the rows with the minimum nm per group, then the rows with the maximum qual; last, if dupes remain, remove them with DataFrame.drop_duplicates:
# keep the rows with the minimal nm per id
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
# of those, keep the rows with the maximal qual per id
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
# if dupes remain, keep the first row per id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
Use:
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform('min') == df['nm']) &
       (grouper['qual'].transform('max') == df['qual']), :].drop_duplicates(subset=['id'])
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
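A possibly simpler alternative, a sketch on the same df: sort by the tie-break criteria, then keep the first row per id:

df2 = (df.sort_values(['nm', 'qual'], ascending=[True, False])
         .drop_duplicates('id')
         .sort_values('id')
         .reset_index(drop=True))
print(df2)

Sorting nm ascending and qual descending puts the preferred row first within each id, so drop_duplicates('id') keeps exactly the row the criteria describe.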
I want to apply a custom operation on a column by grouping the values on another column: group by the column to get the count, then divide the other column's value by this count for all the grouped records.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared:
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be:
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best possible way to do this?
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp')['amount'].apply(lambda g: g/g.size)
does a similar thing.
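An equivalent, fully vectorized variant avoids the Python-level lambda, a sketch on the same frame:

# divide each amount by the size of its opp group
df['amount'] = df['amount'] / df.groupby('opp')['amount'].transform('size')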
You could try something like this:
df2 = df.groupby('opp').amount.count()
df.loc[:, 'calculated'] = df.apply(
    lambda row: row.amount / df2.loc[row.opp], axis=1)  # .loc replaces the removed .ix
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10