*** Disclaimer: I am a total noob. I am trying to learn Pandas by solving a problem at work. This is a subset of my total problem but I am trying to solve the pieces before I tackle the project. I appreciate your patience! ***
I am trying to find out what percentage each Fund is of its State's total.
Concept: We have funds (departments) that are based in states. The funds have different levels of compensation for different projects. I first need to total (group) the funds so I know the total compensation per fund.
I also need to total (group) the compensation by state so I can later figure out each fund's percentage of its state.
I have converted my data to sample code here:
import pandas as pd
#sample data
data = {'Fund': ['1000', '1000', '2000', '2000', '3000', '3000', '4000', '4000'],
        'State': ['AL', 'AL', 'FL', 'FL', 'AL', 'AL', 'NC', 'NC'],
        'Compensation': [2000, 2500, 1500, 1750, 4000, 3200, 1450, 3000]}
In case the picture doesn't come through, here is what I did:
employees = pd.DataFrame(data)
print(employees)
employees.groupby('Fund').Compensation.sum()
employees.groupby('State').Compensation.sum()
I've spent a good portion of the day on my actual data trying to figure out how to get the:
Fund's compensation is __% of total compensation for State
or..
Fund_1000 is 38% of AL total compensation.
Thanks for your patience and your help!
John
Here is one solution. You can first do a groupby to get to the lowest level of aggregation, and then use groupby transform to divide these values by state totals.
agg = df.groupby(['Fund', 'State'], as_index=False)['Compensation'].sum()
agg['percentage'] = (agg['Compensation'] / agg.groupby('State')['Compensation'].transform('sum')) * 100
agg.to_dict()
{'Fund': {0: '1000', 1: '2000', 2: '3000', 3: '4000'},
'State': {0: 'AL', 1: 'FL', 2: 'AL', 3: 'NC'},
'Compensation': {0: 4500, 1: 3250, 2: 7200, 3: 4450},
'percentage': {0: 38.46153846153847,
1: 100.0,
2: 61.53846153846154,
3: 100.0}}
This should do the trick:
df['total_state_compensation'] = df.groupby('State')['Compensation'].transform('sum')
df['total_state_fund_compensation'] = df.groupby(['State', 'Fund'])['Compensation'].transform('sum')
df['ratio'] = df['total_state_fund_compensation'].div(df['total_state_compensation'])
>>> df.groupby(['State','Fund'])['ratio'].mean().to_dict()
{('AL', '1000'): 0.38461538461538464,
('AL', '3000'): 0.6153846153846154,
('FL', '2000'): 1.0,
('NC', '4000'): 1.0}
You can also calculate and merge data frames...
import pandas as pd
data = {
"Fund": ["1000", "1000", "2000", "2000", "3000", "3000", "4000", "4000"],
"State": ["AL", "AL", "FL", "FL", "AL", "AL", "NC", "NC"],
"Compensation": [2000, 2500, 1500, 1750, 4000, 3200, 1450, 3000],
}
# Create dataframe from dictionary provided
df = pd.DataFrame.from_dict(data)
# first group compensation by state and fund
df_fund = df.groupby(["Fund", "State"]).Compensation.sum().reset_index()
# Calculate Total by state in new df
df_total = df_fund.groupby("State").Compensation.sum().reset_index()
# Merge dataframes with total column
merged = df_fund.merge(df_total, how="outer", left_on="State", right_on="State")
# Add percentage column to the merged dataframe
merged["percentage"] = merged["Compensation_x"] / merged["Compensation_y"] * 100
Related
To give some context to this question:
I have a table of stock market prices of the form
df = {
'datetime': [ 'day1', 'day2', 'day3'],
'ticker1' : ['200.3', '199', '184.5'],
'ticker2' : ['56.3', '55.1', '57.2']
}
I would like to find the number of stocks making new highs and new lows for each day. What I have done is loop through the rows and columns, comparing each value with the min and max of the past year's values.
Is there a more efficient way of doing this?
And how can I append a new column with the New Highs and New Lows for each day?
Thanks!
for i in range(rows - 253):
    NH_NL = {'NH': 0, 'NL': 0}
    for ticker in df.columns:
        if df[ticker][i + 253] > df[ticker][i: i+253].max():
            NH_NL['NH'] += 1
        if df[ticker][i + 253] < df[ticker][i: i+253].min():
            NH_NL['NL'] += 1
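A vectorised alternative would be to let pandas compute the trailing 253-day max/min for every column at once with rolling, then count the comparisons per row. Here is a rough sketch, assuming the price columns are numeric (the string prices above would need converting first) and that 'datetime' is used as the index:
import pandas as pd

prices = df.set_index('datetime').astype(float)

# trailing max/min over the previous 253 rows, excluding the current row
prev_max = prices.rolling(window=253).max().shift(1)
prev_min = prices.rolling(window=253).min().shift(1)

# per-day counts of tickers making new highs / new lows
nh_nl = pd.DataFrame({
    'NH': (prices > prev_max).sum(axis=1),
    'NL': (prices < prev_min).sum(axis=1),
})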
I need to modify my code:
db_profit_platform=db[['Source','Device','Country','Profit']]
db_profit_final=db_profit_platform.groupby(['Source','Device','Country'])['Profit'].apply(sum).reset_index()
Now I need to add Bid and get average bid after group by (different aggregations for different columns):
to get: Source Device Country SumProfit Average Bid
How can I do it? (and maybe I will need more aggregations) Thanks
You can use the agg function; here is a minimal working example:
import numpy as np
import pandas as pd
size = 10
db = pd.DataFrame({
    'Source': np.random.randint(1, 3, size=size),
    'Device': np.random.randint(1, 3, size=size),
    'Country': np.random.randint(1, 3, size=size),
    'Profit': np.random.randn(size),
    'Bid': np.random.randn(size)
})
db.groupby(["Source", "Device", "Country"]).agg(
    sum_profit=("Profit", "sum"),
    avg_bid=("Bid", "mean")
)
See the official documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) as well as this question.
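Since you mention possibly needing more aggregations: named aggregation accepts as many keyword arguments as you like, one per output column. For example (the extra columns here are just illustrations, not part of your data):
db.groupby(["Source", "Device", "Country"]).agg(
    sum_profit=("Profit", "sum"),
    avg_bid=("Bid", "mean"),
    max_bid=("Bid", "max"),     # illustrative extra aggregation
    n_rows=("Profit", "size"),  # illustrative extra aggregation
).reset_index()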
I'm trying to use the Python dedupe library to perform a fuzzy duplicate check on my mock data, shown here:
{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
'Invoice Amount': {0: '56', 1: '56', 2: '100'}}
but I keep getting this error:
IndexError: Cannot choose from an empty sequence
Here's the code that I'm using:
import pandas as pd
import pandas_dedupe
df = pd.read_csv("duptest.csv") df.columns
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'])
Any idea what I'm doing wrong? Thanks.
pandas-dedupe creates a sample of observations that you need to label.
The default sample size is equal to 30% of your dataframe.
In your case you have too few examples in your dataframe to start active learning.
If you set sample_size=1, as follows:
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'], sample_size=1)
you will be able to dedupe your data :)
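Putting it together, the full snippet would look something like this (assuming your CSV really is this small; on a realistic dataset you can drop sample_size entirely):
import pandas as pd
import pandas_dedupe

df = pd.read_csv("duptest.csv")
df = pandas_dedupe.dedupe_dataframe(
    df,
    ['Vendor', 'Invoice Ref Num', 'Invoice Amount'],
    sample_size=1,
)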
I have a list of lists of dictionaries. I managed to access each list element within the outer list and convert the dictionary via pandas into a DataFrame. I then save the DF and later concat them. That's a perfect result, but I need a loop to do that for big data.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
    frame = pd.DataFrame()
    frame = pd.DataFrame.from_dict(mwe[i])
    frames = []
    frames.append(frame)
Addendum: Thanks for all the answers; they work on my MWE, which made me notice that there are some strange entries in my dataset. No solution works for my real dataset, since I have an inner-list element which contains two dictionaries (due to non-unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dicts:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
By bringing the list comprehension into the dataframe declaration, that list is temporary, and you don't have to worry about the cleanup. pop will also consume the dictionaries out of mwe, minimizing the number of copies you are carrying around in memory.
As a note, when doing this, mwe will then look like:
mwe
[[], [], []]
because the contents of the sub-lists have been popped out.
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])
key_check = {}.fromkeys(["name", "population", "area", "gini"])
# the easy way but copies data
df = pd.DataFrame([item for data in mwe
                   for item in data
                   if item.keys() == key_check.keys()])
This copies data, since you'll still have it hanging around in mwe. It might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data:  # when data is empty, the while loop will end
            item = data.pop()  # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item  # will minimize data copying through lazy evaluation

df = pd.DataFrame([x for x in get_filtered_data(mwe)])
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
3 9629091.0 48.0 United States of America 323947000
Again, this is under the assumption that undesired entries have invalid columns, which appears to be the case here. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later.
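If you would rather not mutate mwe, a non-destructive sketch of the same filtering (at the cost of keeping the original dictionaries in memory) is to flatten it with itertools.chain:
from itertools import chain

df = pd.DataFrame([d for d in chain.from_iterable(mwe)
                   if d.keys() == key_check.keys()])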
Create an empty DataFrame and loop over the list, using df.append on each iteration:
>>> import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
... df = df.append(country)
...
>>> df
area gini name population
0 323802.0 25.8 Norway 5223256
0 41284.0 33.7 Switzerland 8341600
0 7692024.0 30.5 Australia 24117360
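Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions the equivalent with pd.concat would be something like:
df = pd.concat([pd.DataFrame(country) for country in mwe], ignore_index=True)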
Try this:
df = pd.DataFrame(columns=['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    df.loc[i] = list(mwe[i][0].values())
Output:
   name  population  area  gini
0 Norway 5223256 323802.0 25.8
1 Switzerland 8341600 41284.0 33.7
2 Australia 24117360 7692024.0 30.5
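One caveat: list(mwe[i][0].values()) relies on the dictionary's key order matching the column order. A slightly safer variant is to assign a Series, which aligns values to columns by name:
for i in range(len(mwe)):
    df.loc[i] = pd.Series(mwe[i][0])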
I am trying to organise some data into groups based on the contents of certain columns. The current code I have is dependent on loops, and I would like to vectorise it instead to improve performance. I know this is the way forward with pandas, and whilst I can vectorise some problems I am really struggling with this one.
What I need to do is group the data by ClientNumber and link the genuine and incomplete rows, so that for each ClientNumber every genuine row gets a different process ID and each incomplete row is given the same process ID as the nearest genuine row whose StartDate is greater than the StartDate of the incomplete row. Essentially, incomplete rows should be connected to a genuine row if one is present, and once a genuine row is found it should close that grouping and treat future rows as separate events. I then need to set a process start date for each row, equal to the lowest StartDate within that process ID group, and mark the last row in the group (the one with the greatest StartDate) with a ProcessCount in a separate column.
Apologies for my lack of descriptive ability here, hopefully the code (written in Python 3.6) I have so far will better explain my desired outcome. The code works but as you can see relies on nested loops which I don't like. I have tried researching around to find out how to vectorise this but I am struggling to get my head around the concept for this problem.
Any help you can provide me with in straightening out the loops in this code would be most appreciated and really help me better understand how to apply this to other tasks going forwards.
Data
import numpy as np
import pandas as pd
from pandas import Timestamp

df_dict = {'ClientNumber': {0: 1234, 1: 1234, 2: 1234, 3: 123, 4: 123, 5: 123, 6: 12, 7: 12, 8: 1}, 'Genuine_Incomplete': {0: 'Incomplete', 1: 'Genuine', 2: 'Genuine', 3: 'Incomplete', 4: 'Incomplete', 5: 'Genuine', 6: 'Incomplete', 7: 'Incomplete', 8: 'Genuine'}, 'StartDate': {0: Timestamp('2018-01-01 00:00:00'), 1: Timestamp('2018-01-05 00:00:00'), 2: Timestamp('2018-03-01 00:00:00'), 3: Timestamp('2018-01-01 00:00:00'), 4: Timestamp('2018-01-03 00:00:00'), 5: Timestamp('2018-01-10 00:00:00'), 6: Timestamp('2018-01-01 00:00:00'), 7: Timestamp('2018-06-02 00:00:00'), 8: Timestamp('2018-01-01 00:00:00')}}
df = pd.DataFrame(data=df_dict)
df["ID"] = df.index
df["Process_Start_Date"] = np.nan
df["ProcessCode"] = np.nan
df["ProcessCount"] = np.nan
grouped_df = df.groupby('ClientNumber')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    c = 1
    for i in newdf.iterrows():
        i = i[0]
        GI = df.loc[i, "Genuine_Incomplete"]
        proc_code = "{}_{}".format(df.loc[i, "ClientNumber"], c)
        df.loc[i, "ProcessCode"] = proc_code
        if GI == "Genuine":
            c += 1

grouped_df = df.groupby('ProcessCode')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    df.loc[newdf.ID.iat[-1], "ProcessCount"] = 1
    for i in newdf.iterrows():
        i = i[0]
        df.loc[i, "Process_Start_Date"] = df.loc[newdf.ID.iat[0], "StartDate"]
Note: you may have noticed my use of df["ID"], which is just a copy of the index. I know this is not good practice, but I couldn't work out how to set values from other columns using the index. Any suggestions for doing this are also very welcome.
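For what it's worth, here is a rough, untested sketch of how the same logic might be expressed without the inner loops, under the rules described above (sort each client by StartDate, a process closes at every genuine row, and the last row of each process gets the ProcessCount flag):
df = df.sort_values(['ClientNumber', 'StartDate'])

# process number within each client: 1 + number of genuine rows seen before the current row
is_genuine = df['Genuine_Incomplete'].eq('Genuine').astype(int)
proc_num = is_genuine.groupby(df['ClientNumber']).cumsum() - is_genuine + 1
df['ProcessCode'] = df['ClientNumber'].astype(str) + '_' + proc_num.astype(str)

# earliest StartDate within each process
df['Process_Start_Date'] = df.groupby('ProcessCode')['StartDate'].transform('min')

# flag the row with the greatest StartDate in each process
is_last = ~df.duplicated(subset='ProcessCode', keep='last')
df['ProcessCount'] = np.where(is_last, 1, np.nan)

df = df.sort_index()  # restore the original row order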