I am trying to produce a separate pivot table for each distinct value in one column of my df (effectively the same pivot table, filtered by each value). In the actual file there are several hundred R1 values, so I was trying to find a way to loop over them and produce the tables separately.
If possible, is there then a way to send each pivot to a separate Excel file?
import pandas as pd
df=pd.DataFrame({'Employee':['1','2','3','4','5','6','7','8','9','10','11','12', '13', '14', '15', '16', '17', '18', '19', '20'],
'R1': ['mike', 'mike', 'mike', 'mike', 'mike', 'mike', 'mike', 'mike', 'stacey' , 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey'],
'R2':['bill', 'bill', 'bill', 'bill', 'bill', 'chris', 'chris', 'chris', 'jill', 'jill', 'jill', 'tom', 'tom', 'tom', 'tom', 'pete', 'pete', 'pete', 'pete', 'pete']})
df
So essentially one Excel file for mike's world with a count of employees by R2, and one Excel file for stacey's world with a count of employees by R2 (but in the real data this would be done for the several hundred R1 values).
thanks!
Mike excel
Stacey excel
While there may be prettier ways to prepare the dataframes before writing them to the sheets, this gave me the results you were looking for. It should scale to any number of R1 values, since unique() returns the distinct names within R1; the loop then breaks the data down into the pieces you need and writes each one to its own sheet at the given filepath.
import pandas as pd
data_jobs2 = pd.DataFrame({'Employee': ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'],
                           'L2Name': ['mike','mike','mike','mike','mike','mike','mike','mike','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey'],
                           'L3Name': ['bill','bill','bill','bill','bill','chris','chris','chris','jill','jill','jill','tom','tom','tom','tom','pete','pete','pete','pete','pete']})
values = data_jobs2['L2Name'].unique()
filepath = r'Your\File\Path\Here\File_name.xlsx'  # raw string so the backslashes are not treated as escapes
writer = pd.ExcelWriter(filepath, engine='openpyxl')
for i in values:
    # Count employees for the current L2Name, broken down by L3Name
    series = data_jobs2[data_jobs2['L2Name'] == i].groupby(['L2Name', 'L3Name'])['Employee'].count().to_frame().reset_index()
    # Pivot so each L3Name becomes a column, and relabel the single row as a count row
    df_to_write = series.pivot(index='L2Name', columns='L3Name', values='Employee').reset_index().replace({i: 'Count of Employee'}).rename(columns={'L2Name': ''}).set_index('')
    df_to_write['Grand Total'] = df_to_write.sum(axis=1)
    df_to_write.to_excel(writer, sheet_name=i)
    display(df_to_write)  # display() is Jupyter/IPython only; use print() in a plain script
    display(series)
writer.close()  # close() saves the workbook; a separate save() call is not needed
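The snippet above puts every pivot on a separate sheet of a single workbook. Since the question asked for one Excel file per R1 value, a minimal variation is to write each pivot straight to its own path inside the loop. This is only a sketch, assuming the values are safe to use as filenames and you can write to the current directory:
for i in values:
    counts = data_jobs2[data_jobs2['L2Name'] == i].groupby(['L2Name', 'L3Name'])['Employee'].count().to_frame().reset_index()
    pivot = counts.pivot(index='L2Name', columns='L3Name', values='Employee')
    pivot['Grand Total'] = pivot.sum(axis=1)
    # One workbook per L2Name value, e.g. mike_pivot.xlsx and stacey_pivot.xlsx
    pivot.to_excel(f'{i}_pivot.xlsx', sheet_name=i)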
I get the following error
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
On code
dfp = pd.concat([df, tdf], axis=1)
I am trying to concatenate columns of tdf to the columns of df.
For these print statements
print(df.shape)
print(tdf.shape)
print(df.columns)
print(tdf.columns)
print(df.index)
print(tdf.index)
I get the following output:
(70000, 25)
(70000, 20)
Index(['300', '301', '302', '303', '304', '305', '306', '307', '308', '309',
'310', '311', '312', '313', '314', '315', '316', '317', '318', '319',
'320', '321', '322', '323', '324'],
dtype='object')
Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
'14', '15', '16', '17', '18', '19', '20'],
dtype='object')
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
9990, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999],
dtype='int64', length=70000)
RangeIndex(start=0, stop=70000, step=1)
Any idea what the issue is? Why would indexing be a problem? The indexes should line up, since I am concatenating columns rather than rows, and the column names of the two frames are completely different.
Thanks!
The problem is that df is not uniquely indexed: its Int64Index only runs from 0 to 9999 yet has length 70000, so the labels repeat. You need to either reset the index
pd.concat([df.reset_index(),tdf], axis=1)
or drop it
pd.concat([df.reset_index(drop=True),tdf], axis=1)
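For reference, here is a minimal reproduction of the error with made-up data: when one frame has duplicate index labels and the two indexes differ, concat along axis=1 has to reindex and fails.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[0, 1, 0, 1])  # duplicated index labels
tdf = pd.DataFrame({'b': [5, 6, 7, 8]})                      # default RangeIndex 0..3

# pd.concat([df, tdf], axis=1)  # raises InvalidIndexError: Reindexing only valid with uniquely valued Index objects
print(pd.concat([df.reset_index(drop=True), tdf], axis=1))   # works once the index is unique again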
I have a dataframe as follows
df = pd.DataFrame({'date': ['2013-04-01','2013-04-01','2013-04-01','2013-04-02', '2013-04-02'],
'month': ['1','1','3','3','5'],
'pmonth': ['1', '1', '2', '5', '5'],
'duration': [30, 15, 20, 15, 30],
'user_id': ['10', '20', '30', '40', '50']})
I can calculate the percentage of the user_id count by date, month, and pmonth using
pd.crosstab(index=[df.date, df.month, df.pmonth], columns=df.duration, values=df.user_id, normalize='index', aggfunc='count')
But I want to calculate the percentage of user_id within each date and month combination only. Is that possible using crosstab?
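One possible approach, offered only as a sketch rather than an answer from the original thread: build the raw counts with crosstab first, then divide each cell by the total count of its (date, month) group, so the percentages are relative to that combination only.
# Raw counts of user_id per (date, month, pmonth) row and duration column
counts = pd.crosstab(index=[df.date, df.month, df.pmonth],
                     columns=df.duration,
                     values=df.user_id,
                     aggfunc='count').fillna(0)

# Total count within each (date, month) group, broadcast back to every row
group_totals = counts.groupby(level=['date', 'month']).transform('sum').sum(axis=1)

# Each cell becomes a share of its (date, month) combination rather than of its row
pct = counts.div(group_totals, axis=0)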
How can I convert a pandas dataframe to a dict using unique column values as the keys for the dictionary? In this case I want to use unique username's as the key.
Here is my progress so far based on information found on here and online.
My test dataframe:
import pandas
import pprint
df = pandas.DataFrame({
'username': ['Kevin', 'John', 'Kevin', 'John', 'Leslie', 'John'],
'sport': ['Soccer', 'Football', 'Racing', 'Tennis', 'Baseball', 'Bowling'],
'age': ['51','32','20','19','34','27'],
'team': ['Cowboyws', 'Packers', 'Sonics', 'Raiders', 'Wolves', 'Lakers']
})
I can create a dictionary by doing this:
dct = df.to_dict(orient='records')
pprint.pprint(dct, indent=4)
>>>>[{'age': '51', 'sport': 'Soccer', 'team': 'Cowboyws', 'username': 'Kevin'},
{'age': '32', 'sport': 'Football', 'team': 'Packers', 'username': 'John'},
{'age': '20', 'sport': 'Racing', 'team': 'Sonics', 'username': 'Kevin'},
{'age': '19', 'sport': 'Tennis', 'team': 'Raiders', 'username': 'John'},
{'age': '34', 'sport': 'Baseball', 'team': 'Wolves', 'username': 'Leslie'},
{'age': '27', 'sport': 'Bowling', 'team': 'Lakers', 'username': 'John'}]
I tried using the groupby and apply method, which got me closer, but it converts all the values to lists. I want them to remain as dictionaries so I can retain each value's key:
result = df.groupby('username').apply(lambda x: x.values.tolist()).to_dict()
pprint.pprint(result, indent=4)
{ 'John': [ ['32', 'Football', 'Packers', 'John'],
['19', 'Tennis', 'Raiders', 'John'],
['27', 'Bowling', 'Lakers', 'John']],
'Kevin': [ ['51', 'Soccer', 'Cowboyws', 'Kevin'],
['20', 'Racing', 'Sonics', 'Kevin']],
'Leslie': [['34', 'Baseball', 'Wolves', 'Leslie']]}
This is the desired result I want:
{
'John': [{'age': '32', 'sport': 'Football', 'team': 'Packers', 'username': 'John'},
{'age': '19', 'sport': 'Tennis', 'team': 'Raiders', 'username': 'John'},
{'age': '27', 'sport': 'Bowling', 'team': 'Lakers', 'username': 'John'}],
'Kevin': [{'age': '51', 'sport': 'Soccer', 'team': 'Cowboyws', 'username': 'Kevin'},
{'age': '20', 'sport': 'Racing', 'team': 'Sonics', 'username': 'Kevin'}],
'Leslie': [{'age': '34', 'sport': 'Baseball', 'team': 'Wolves', 'username': 'Leslie'}]
}
Use groupby and apply. Inside the apply, call to_dict with the "records" orient (similar to what you've figured out already).
df.groupby('username').apply(lambda x: x.to_dict(orient='records')).to_dict()
I prefer building the dict with a comprehension over groupby here; you may also want to drop the username column, since it is redundant once it becomes the key.
d = {x: y.drop(columns='username').to_dict('records') for x, y in df.groupby('username')}
d
Out[212]:
{'John': [{'age': '32', 'sport': 'Football', 'team': 'Packers'},
{'age': '19', 'sport': 'Tennis', 'team': 'Raiders'},
{'age': '27', 'sport': 'Bowling', 'team': 'Lakers'}],
'Kevin': [{'age': '51', 'sport': 'Soccer', 'team': 'Cowboyws'},
{'age': '20', 'sport': 'Racing', 'team': 'Sonics'}],
'Leslie': [{'age': '34', 'sport': 'Baseball', 'team': 'Wolves'}]}
I have the following list of dictionaries containing a household id (HID) for each year:
list_of_dicts = [{'HID':'1','year':'2017'},
{'HID':'1','year':'2018'},
{'HID':'2','year':'2017'},
{'HID':'2','year':'2018'},
{'HID':'3','year':'2017'},
{'HID':'3','year':'2018'}]
I want to read in each dict and add a new numeric id. I can get something close using the following code:
i = 0
for line in list_of_dicts:
    line['id'] = i
    i += 1
However, this makes the id increment for every record, like this:
list_of_dicts = [{'HID':'1','year':'2017','id':1},
{'HID':'1','year':'2018','id':2},
{'HID':'2','year':'2017','id':3},
{'HID':'2','year':'2018','id':4},
{'HID':'3','year':'2017','id':5},
{'HID':'3','year':'2018','id':6}]
I want to allocate the same numeric id variable to each HID across each year, like so:
list_of_dicts = [{'HID':'1','year':'2017','id':1},
{'HID':'1','year':'2018','id':1},
{'HID':'2','year':'2017','id':2},
{'HID':'2','year':'2018','id':2},
{'HID':'3','year':'2017','id':3},
{'HID':'3','year':'2018','id':3}]
How do I control id allocation to be the same for each household, for each year?
You need to change it so the id is not based on the running counter i. Instead do this:
for line in list_of_dicts:
    line['id'] = line['HID']
That way each line gets an id derived from its HID.
In Python 3, you can use a list comprehension with dictionary unpacking:
list_of_dicts = [{'HID':'1','year':'2017'},
{'HID':'1','year':'2018'},
{'HID':'2','year':'2017'},
{'HID':'2','year':'2018'},
{'HID':'3','year':'2017'},
{'HID':'3','year':'2018'}]
new_data = [{**i, **{'id':i['HID']}} for i in list_of_dicts]
Output:
[{'HID': '1', 'year': '2017', 'id': '1'}, {'HID': '1', 'year': '2018', 'id': '1'}, {'HID': '2', 'year': '2017', 'id': '2'}, {'HID': '2', 'year': '2018', 'id': '2'}, {'HID': '3', 'year': '2017', 'id': '3'}, {'HID': '3', 'year': '2018', 'id': '3'}]
Is this what you are looking for?
list_of_dicts = [{'HID':'1','year':'2017'},
{'HID':'1','year':'2018'},
{'HID':'2','year':'2017'},
{'HID':'2','year':'2018'},
{'HID':'3','year':'2017'},
{'HID':'3','year':'2018'}]
for line in list_of_dicts:
    line['id'] = line['HID']
print(list_of_dicts)
Outputs:
[{'HID': '1', 'year': '2017', 'id': '1'}, {'HID': '1', 'year': '2018', 'id': '1'}, {'HID': '2', 'year': '2017', 'id': '2'}, {'HID': '2', 'year': '2018', 'id': '2'}, {'HID': '3', 'year': '2017', 'id': '3'}, {'HID': '3', 'year': '2018', 'id': '3'}]
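If the id has to be a numeric value, as in the desired output above (1, 2, 3), rather than the HID string, one option is a small mapping pass. This is just a sketch, not from the original answers:
# Assign a sequential integer to each distinct HID, in order of first appearance
hid_to_id = {}
for line in list_of_dicts:
    if line['HID'] not in hid_to_id:
        hid_to_id[line['HID']] = len(hid_to_id) + 1
    line['id'] = hid_to_id[line['HID']]

print(list_of_dicts)
# [{'HID': '1', 'year': '2017', 'id': 1}, {'HID': '1', 'year': '2018', 'id': 1}, ...]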