I am trying to produce a separate pivot table for each distinct value in one column of my df (effectively the same pivot table, filtered by each value). In the actual file there are several hundred R1 values, so I was trying to find a way to loop over them and produce the tables separately.
If possible, is there then a way to send each pivot to a separate Excel file?
import pandas as pd
df=pd.DataFrame({'Employee':['1','2','3','4','5','6','7','8','9','10','11','12', '13', '14', '15', '16', '17', '18', '19', '20'],
'R1': ['mike', 'mike', 'mike', 'mike', 'mike', 'mike', 'mike', 'mike', 'stacey' , 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey', 'stacey'],
'R2':['bill', 'bill', 'bill', 'bill', 'bill', 'chris', 'chris', 'chris', 'jill', 'jill', 'jill', 'tom', 'tom', 'tom', 'tom', 'pete', 'pete', 'pete', 'pete', 'pete']})
df
So essentially one Excel file for mike's world with a count of employees by R2, and one Excel file for stacey's world with a count of employees by R2 (but in the real data this would be done for the several hundred R1 values).
thanks!
Mike excel
Stacey excel
While there may be prettier ways to prepare the dataframes before writing them to the sheets, this gave me the results you were looking for. It should scale to any number of R1 values, since unique() returns the distinct names within R1; the loop then breaks the data down into the pieces you need and writes each one to its own sheet at the given filepath.
import pandas as pd
data_jobs2 = pd.DataFrame({'Employee': ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'],
                           'L2Name': ['mike','mike','mike','mike','mike','mike','mike','mike','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey','stacey'],
                           'L3Name': ['bill','bill','bill','bill','bill','chris','chris','chris','jill','jill','jill','tom','tom','tom','tom','pete','pete','pete','pete','pete']})
values = data_jobs2['L2Name'].unique()
filepath = r'Your\File\Path\Here\File_name.xlsx'  # raw string so the backslashes are not treated as escapes
writer = pd.ExcelWriter(filepath, engine='openpyxl')
for i in values:
    # Count employees for the current L2Name, broken down by L3Name
    series = data_jobs2[data_jobs2['L2Name'] == i].groupby(['L2Name', 'L3Name'])['Employee'].count().to_frame().reset_index()
    # Pivot so each L3Name becomes a column, and relabel the single row as a count row
    df_to_write = series.pivot(index='L2Name', columns='L3Name', values='Employee').reset_index().replace({i: 'Count of Employee'}).rename(columns={'L2Name': ''}).set_index('')
    df_to_write['Grand Total'] = df_to_write.sum(axis=1)
    df_to_write.to_excel(writer, sheet_name=i)
    display(df_to_write)  # display() is Jupyter/IPython only; use print() in a plain script
    display(series)
writer.close()  # close() saves the workbook; a separate save() call is not needed
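The snippet above puts every pivot on a separate sheet of a single workbook. Since the question asked for one Excel file per R1 value, a minimal variation is to write each pivot straight to its own path inside the loop. This is only a sketch, assuming the values are safe to use as filenames and you can write to the current directory:
for i in values:
    counts = data_jobs2[data_jobs2['L2Name'] == i].groupby(['L2Name', 'L3Name'])['Employee'].count().to_frame().reset_index()
    pivot = counts.pivot(index='L2Name', columns='L3Name', values='Employee')
    pivot['Grand Total'] = pivot.sum(axis=1)
    # One workbook per L2Name value, e.g. mike_pivot.xlsx and stacey_pivot.xlsx
    pivot.to_excel(f'{i}_pivot.xlsx', sheet_name=i)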
I get the following error
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
On code
dfp = pd.concat([df, tdf], axis=1)
I am trying to concatenate columns of tdf to the columns of df.
For these print statements
print(df.shape)
print(tdf.shape)
print(df.columns)
print(tdf.columns)
print(df.index)
print(tdf.index)
I get the following output:
(70000, 25)
(70000, 20)
Index(['300', '301', '302', '303', '304', '305', '306', '307', '308', '309',
'310', '311', '312', '313', '314', '315', '316', '317', '318', '319',
'320', '321', '322', '323', '324'],
dtype='object')
Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
'14', '15', '16', '17', '18', '19', '20'],
dtype='object')
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
9990, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999],
dtype='int64', length=70000)
RangeIndex(start=0, stop=70000, step=1)
Any idea what the issue is? Why would indexing be a problem? The indexes should line up, since I am concatenating columns rather than rows, and the column names of the two frames are completely different.
Thanks!
The problem is that df is not uniquely indexed: its Int64Index only runs from 0 to 9999 yet has length 70000, so the labels repeat. You need to either reset the index
pd.concat([df.reset_index(),tdf], axis=1)
or drop it
pd.concat([df.reset_index(drop=True),tdf], axis=1)
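For reference, here is a minimal reproduction of the error with made-up data: when one frame has duplicate index labels and the two indexes differ, concat along axis=1 has to reindex and fails.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[0, 1, 0, 1])  # duplicated index labels
tdf = pd.DataFrame({'b': [5, 6, 7, 8]})                      # default RangeIndex 0..3

# pd.concat([df, tdf], axis=1)  # raises InvalidIndexError: Reindexing only valid with uniquely valued Index objects
print(pd.concat([df.reset_index(drop=True), tdf], axis=1))   # works once the index is unique again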
I have a dataframe as follows
df = pd.DataFrame({'date': ['2013-04-01','2013-04-01','2013-04-01','2013-04-02', '2013-04-02'],
'month': ['1','1','3','3','5'],
'pmonth': ['1', '1', '2', '5', '5'],
'duration': [30, 15, 20, 15, 30],
'user_id': ['10', '20', '30', '40', '50']})
I can calculate the percentage of the user_id count by date, month, and pmonth using
pd.crosstab(index=[df.date, df.month, df.pmonth], columns=df.duration, values=df.user_id, normalize='index', aggfunc='count')
But I want to calculate the percentage of user_id within each date and month combination only. Is that possible using crosstab?
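One possible approach, offered only as a sketch rather than an answer from the original thread: build the raw counts with crosstab first, then divide each cell by the total count of its (date, month) group, so the percentages are relative to that combination only.
# Raw counts of user_id per (date, month, pmonth) row and duration column
counts = pd.crosstab(index=[df.date, df.month, df.pmonth],
                     columns=df.duration,
                     values=df.user_id,
                     aggfunc='count').fillna(0)

# Total count within each (date, month) group, broadcast back to every row
group_totals = counts.groupby(level=['date', 'month']).transform('sum').sum(axis=1)

# Each cell becomes a share of its (date, month) combination rather than of its row
pct = counts.div(group_totals, axis=0)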
How can I convert a pandas dataframe to a dict using unique column values as the keys for the dictionary? In this case I want to use unique username's as the key.
Here is my progress so far based on information found on here and online.
My test dataframe:
import pandas
import pprint
df = pandas.DataFrame({
'username': ['Kevin', 'John', 'Kevin', 'John', 'Leslie', 'John'],
'sport': ['Soccer', 'Football', 'Racing', 'Tennis', 'Baseball', 'Bowling'],
'age': ['51','32','20','19','34','27'],
'team': ['Cowboyws', 'Packers', 'Sonics', 'Raiders', 'Wolves', 'Lakers']
})
I can create a dictionary by doing this:
dct = df.to_dict(orient='records')
pprint.pprint(dct, indent=4)
>>>>[{'age': '51', 'sport': 'Soccer', 'team': 'Cowboyws', 'username': 'Kevin'},
{'age': '32', 'sport': 'Football', 'team': 'Packers', 'username': 'John'},
{'age': '20', 'sport': 'Racing', 'team': 'Sonics', 'username': 'Kevin'},
{'age': '19', 'sport': 'Tennis', 'team': 'Raiders', 'username': 'John'},
{'age': '34', 'sport': 'Baseball', 'team': 'Wolves', 'username': 'Leslie'},
{'age': '27', 'sport': 'Bowling', 'team': 'Lakers', 'username': 'John'}]
I tried using the groupby and apply method, which got me closer, but it converts all the values to lists. I want them to remain as dictionaries so I can retain each value's key:
result = df.groupby('username').apply(lambda x: x.values.tolist()).to_dict()
pprint.pprint(result, indent=4)
{ 'John': [ ['32', 'Football', 'Packers', 'John'],
['19', 'Tennis', 'Raiders', 'John'],
['27', 'Bowling', 'Lakers', 'John']],
'Kevin': [ ['51', 'Soccer', 'Cowboyws', 'Kevin'],
['20', 'Racing', 'Sonics', 'Kevin']],
'Leslie': [['34', 'Baseball', 'Wolves', 'Leslie']]}
This is the desired result I want:
{
'John': [{'age': '32', 'sport': 'Football', 'team': 'Packers', 'username': 'John'},
{'age': '19', 'sport': 'Tennis', 'team': 'Raiders', 'username': 'John'},
{'age': '27', 'sport': 'Bowling', 'team': 'Lakers', 'username': 'John'}],
'Kevin': [{'age': '51', 'sport': 'Soccer', 'team': 'Cowboyws', 'username': 'Kevin'},
{'age': '20', 'sport': 'Racing', 'team': 'Sonics', 'username': 'Kevin'}],
'Leslie': [{'age': '34', 'sport': 'Baseball', 'team': 'Wolves', 'username': 'Leslie'}]
}
Use groupby and apply. Inside the apply, call to_dict with the "records" orient (similar to what you've figured out already).
df.groupby('username').apply(lambda x: x.to_dict(orient='records')).to_dict()
I prefer building the dict with a comprehension over groupby here; you may also want to drop the username column, since it is redundant once it becomes the key.
d = {x: y.drop(columns='username').to_dict('records') for x, y in df.groupby('username')}
d
Out[212]:
{'John': [{'age': '32', 'sport': 'Football', 'team': 'Packers'},
{'age': '19', 'sport': 'Tennis', 'team': 'Raiders'},
{'age': '27', 'sport': 'Bowling', 'team': 'Lakers'}],
'Kevin': [{'age': '51', 'sport': 'Soccer', 'team': 'Cowboyws'},
{'age': '20', 'sport': 'Racing', 'team': 'Sonics'}],
'Leslie': [{'age': '34', 'sport': 'Baseball', 'team': 'Wolves'}]}
I have the following list of dictionaries containing a household id (HID) for each year:
list_of_dicts = [{'HID':'1','year':'2017'},
{'HID':'1','year':'2018'},
{'HID':'2','year':'2017'},
{'HID':'2','year':'2018'},
{'HID':'3','year':'2017'},
{'HID':'3','year':'2018'}]
I want to read in each dict and add a new numeric id. I can get something close using the following code:
i = 0
for line in list_of_dicts:
    line['id'] = i
    i += 1
However, this makes the id increment for every record, like this:
list_of_dicts = [{'HID':'1','year':'2017','id':1},
{'HID':'1','year':'2018','id':2},
{'HID':'2','year':'2017','id':3},
{'HID':'2','year':'2018','id':4},
{'HID':'3','year':'2017','id':5},
{'HID':'3','year':'2018','id':6}]
I want to allocate the same numeric id variable to each HID across each year, like so:
list_of_dicts = [{'HID':'1','year':'2017','id':1},
{'HID':'1','year':'2018','id':1},
{'HID':'2','year':'2017','id':2},
{'HID':'2','year':'2018','id':2},
{'HID':'3','year':'2017','id':3},
{'HID':'3','year':'2018','id':3}]
How do I control id allocation to be the same for each household, for each year?
You need to change it so the id is not based on the running counter i. Instead do this:
for line in list_of_dicts:
    line['id'] = line['HID']
That way each line gets an id derived from its HID.
In Python 3, you can use a list comprehension with dictionary unpacking:
list_of_dicts = [{'HID':'1','year':'2017'},
{'HID':'1','year':'2018'},
{'HID':'2','year':'2017'},
{'HID':'2','year':'2018'},
{'HID':'3','year':'2017'},
{'HID':'3','year':'2018'}]
new_data = [{**i, **{'id':i['HID']}} for i in list_of_dicts]
Output:
[{'HID': '1', 'year': '2017', 'id': '1'}, {'HID': '1', 'year': '2018', 'id': '1'}, {'HID': '2', 'year': '2017', 'id': '2'}, {'HID': '2', 'year': '2018', 'id': '2'}, {'HID': '3', 'year': '2017', 'id': '3'}, {'HID': '3', 'year': '2018', 'id': '3'}]
Is this what you are looking for?
list_of_dicts = [{'HID':'1','year':'2017'},
{'HID':'1','year':'2018'},
{'HID':'2','year':'2017'},
{'HID':'2','year':'2018'},
{'HID':'3','year':'2017'},
{'HID':'3','year':'2018'}]
for line in list_of_dicts:
    line['id'] = line['HID']
print(list_of_dicts)
Outputs:
[{'HID': '1', 'year': '2017', 'id': '1'}, {'HID': '1', 'year': '2018', 'id': '1'}, {'HID': '2', 'year': '2017', 'id': '2'}, {'HID': '2', 'year': '2018', 'id': '2'}, {'HID': '3', 'year': '2017', 'id': '3'}, {'HID': '3', 'year': '2018', 'id': '3'}]
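If the id has to be a numeric value, as in the desired output above (1, 2, 3), rather than the HID string, one option is a small mapping pass. This is just a sketch, not from the original answers:
# Assign a sequential integer to each distinct HID, in order of first appearance
hid_to_id = {}
for line in list_of_dicts:
    if line['HID'] not in hid_to_id:
        hid_to_id[line['HID']] = len(hid_to_id) + 1
    line['id'] = hid_to_id[line['HID']]

print(list_of_dicts)
# [{'HID': '1', 'year': '2017', 'id': 1}, {'HID': '1', 'year': '2018', 'id': 1}, ...]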