Can Pandas Read Excel's Group Structure into a MultiIndex? - python

I have an Excel file with some (mostly) nicely grouped rows. I built a fake example below.
Is there a way to get read_excel in pandas to produce a MultiIndex preserving this structure?
For this example the MultiIndex would have four levels (Family, Individual, Child (optional), Investment). If the subtotal values were lost, that would be fine, as they can easily be recreated in pandas.

No, pandas can't read such a structure.
An alternative solution is to use pandas to read your data but transform it into an easily accessible dictionary, rather than keeping it in a dataframe with a MultiIndex.
There are 2 sensible requirements to make your data more usable:
Make your investment fund names unique. This is trivial.
Convert your Excel grouping to an additional column which indicates the parent of each row (a sketch of one way to do this follows this list).
In the example below, these 2 requirements are assumed.
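If your grouping currently exists only as Excel outline levels, here is a rough sketch of requirement 2 using openpyxl's row outline levels. The file name families.xlsx is hypothetical; name/Value are assumed to be the first two columns, with each summary row sitting one outline level above its children:
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('families.xlsx')  # hypothetical file name
ws = wb.active

names, values, parents = [], [], []
stack = ['Families']  # chain of currently open parents, one per outline level
for row in ws.iter_rows(min_row=2):  # assumes row 1 is a header
    name, value = row[0].value, row[1].value
    level = ws.row_dimensions[row[0].row].outline_level  # 0 = top level
    stack = stack[:level + 1]   # close any deeper groups
    names.append(name)
    values.append(value)
    parents.append(stack[level])
    stack.append(name)          # this row parents any deeper rows that follow

df = pd.DataFrame({'name': names, 'Value': values, 'parent': parents})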
Setup
from collections import defaultdict
from functools import reduce
import operator
import pandas as pd
df = pd.DataFrame({
    'name': ['Simpson Family', 'Marge Simpson', 'Maggies College Fund',
             'MCF Investment 2', 'MS Investment 1', 'MS Investment 2', 'MS Investment 3',
             'Homer Simpson', 'HS Investment 1', 'HS Investment 3', 'HS Investment 2',
             'Griffin Family', 'Lois Griffin', 'LG Investment 2', 'LG Investment 3',
             'Brian Giffin', 'BG Investment 3'],
    'Value': [600, 450, 100, 100, 100, 200, 50, 150, 100, 50, 0, 200, 150, 100, 50, 50, 50],
    'parent': ['Families', 'Simpson Family', 'Marge Simpson', 'Maggies College Fund',
               'Marge Simpson', 'Marge Simpson', 'Marge Simpson', 'Simpson Family',
               'Homer Simpson', 'Homer Simpson', 'Homer Simpson', 'Families',
               'Griffin Family', 'Lois Griffin', 'Lois Griffin', 'Griffin Family',
               'Brian Giffin']})
Value name parent
0 600 Simpson Family Families
1 450 Marge Simpson Simpson Family
2 100 Maggies College Fund Marge Simpson
3 100 MCF Investment 2 Maggies College Fund
4 100 MS Investment 1 Marge Simpson
5 200 MS Investment 2 Marge Simpson
6 50 MS Investment 3 Marge Simpson
7 150 Homer Simpson Simpson Family
8 100 HS Investment 1 Homer Simpson
9 50 HS Investment 3 Homer Simpson
10 0 HS Investment 2 Homer Simpson
11 200 Griffin Family Families
12 150 Lois Griffin Griffin Family
13 100 LG Investment 2 Lois Griffin
14 50 LG Investment 3 Lois Griffin
15 50 Brian Giffin Griffin Family
16 50 BG Investment 3 Brian Giffin
Step 1
Define a child -> parent dictionary and some utility functions:
child_parent_dict = df.set_index('name')['parent'].to_dict()

# autovivifying nested dictionary: accessing a missing key creates a new level
tree = lambda: defaultdict(tree)
d = tree()

def get_all_parents(child):
    """Yield all parents of a node, walking up to (but excluding) the root."""
    while child != 'Families':
        child = child_parent_dict[child]
        if child != 'Families':
            yield child

def getFromDict(dataDict, mapList):
    """Walk a nested dictionary along the keys in mapList."""
    return reduce(operator.getitem, mapList, dataDict)

def default_to_regular_dict(d):
    """Convert a nested defaultdict to a regular dict of dicts."""
    if isinstance(d, defaultdict):
        d = {k: default_to_regular_dict(v) for k, v in d.items()}
    return d
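For example, get_all_parents walks up the hierarchy encoded in the parent column:
list(get_all_parents('MS Investment 1'))
# ['Marge Simpson', 'Simpson Family']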
Step 2
Apply this to your dataframe. Use it to create a nested dictionary structure which will be more efficient for repeated queries.
df['structure'] = df['name'].apply(lambda x: ['Families'] + list(get_all_parents(x))[::-1])
for idx, row in df.iterrows():
    getFromDict(d, row['structure'])[row['name']]['Value'] = row['Value']
res = default_to_regular_dict(d)
Result
Dataframe
Value name parent \
0 600 Simpson Family Families
1 450 Marge Simpson Simpson Family
2 100 Maggies College Fund Marge Simpson
3 100 MCF Investment 2 Maggies College Fund
4 100 MS Investment 1 Marge Simpson
5 200 MS Investment 2 Marge Simpson
6 50 MS Investment 3 Marge Simpson
7 150 Homer Simpson Simpson Family
8 100 HS Investment 1 Homer Simpson
9 50 HS Investment 3 Homer Simpson
10 0 HS Investment 2 Homer Simpson
11 200 Griffin Family Families
12 150 Lois Griffin Griffin Family
13 100 LG Investment 2 Lois Griffin
14 50 LG Investment 3 Lois Griffin
15 50 Brian Giffin Griffin Family
16 50 BG Investment 3 Brian Giffin
structure
0 [Families]
1 [Families, Simpson Family]
2 [Families, Simpson Family, Marge Simpson]
3 [Families, Simpson Family, Marge Simpson, Magg...
4 [Families, Simpson Family, Marge Simpson]
5 [Families, Simpson Family, Marge Simpson]
6 [Families, Simpson Family, Marge Simpson]
7 [Families, Simpson Family]
8 [Families, Simpson Family, Homer Simpson]
9 [Families, Simpson Family, Homer Simpson]
10 [Families, Simpson Family, Homer Simpson]
11 [Families]
12 [Families, Griffin Family]
13 [Families, Griffin Family, Lois Griffin]
14 [Families, Griffin Family, Lois Griffin]
15 [Families, Griffin Family]
16 [Families, Griffin Family, Brian Giffin]
Dictionary
{'Families': {'Griffin Family': {'Brian Giffin': {'BG Investment 3': {'Value': 50},
'Value': 50},
'Lois Griffin': {'LG Investment 2': {'Value': 100}, 'LG Investment 3': {'Value': 50},
'Value': 150},
'Value': 200},
'Simpson Family': {'Homer Simpson': {'HS Investment 1': {'Value': 100}, 'HS Investment 2': {'Value': 0}, 'HS Investment 3': {'Value': 50},
'Value': 150},
'Marge Simpson': {'MS Investment 1': {'Value': 100}, 'MS Investment 2': {'Value': 200}, 'MS Investment 3': {'Value': 50},
'Maggies College Fund': {'MCF Investment 2': {'Value': 100},
'Value': 100},
'Value': 450},
'Value': 600}}}
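Lookups on the result are then plain dictionary accesses, for example:
res['Families']['Simpson Family']['Marge Simpson']['Value']
# 450
getFromDict(res, ['Families', 'Griffin Family', 'Lois Griffin', 'Value'])
# 150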

I don't think it is possible to implement this using read_excel as-is.
What you can do is add additional columns to your Excel sheet for the four hierarchy levels (Family, Individual, Child (optional), Investment) and then use read_excel() with index_col=[0, 1, 2, 3] to generate the pandas dataframe.
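A minimal sketch of that, assuming the four level columns come first in the sheet (the file name families.xlsx is hypothetical):
import pandas as pd

df = pd.read_excel('families.xlsx', index_col=[0, 1, 2, 3])
# df.index is now a 4-level MultiIndex (Family, Individual, Child, Investment)
df.loc[('Simpson Family', 'Marge Simpson')]  # partial-index selection of one individual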

See the index_col parameter of the read_excel function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

Related

add a new column based on a group without grouping

I have this reproducible data set where I need to add a column based on the 'best usage' source.
import pandas as pd

df_in = pd.DataFrame({
    'year': [5, 5, 5,
             10, 10,
             15, 15,
             30, 30, 30],
    'usage': ['farm', 'best', '',
              'manual', 'best',
              'best', 'city',
              'random', 'best', 'farm'],
    'value': [0.825, 0.83, 0.85,
              0.935, 0.96,
              1.12, 1.305,
              1.34, 1.34, 1.455],
    'source': ['wood', 'metal', 'water',
               'metal', 'water',
               'wood', 'water',
               'wood', 'metal', 'water']})
desired outcome:
print(df)
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Is there a way to do that without grouping? Currently, I'm using:
grouped = df_in.groupby('usage').get_group('best')
grouped = grouped.rename(columns={'source': 'best'})
df = df_in.merge(grouped[['year','best']],how='outer', on='year')
You could just query:
df_in.merge(df_in.query('usage == "best"')[['year', 'source']]
                 .drop_duplicates('year')  # you might not need/want this line if 'best' is unique per year (or doesn't need to be in the output)
                 .rename(columns={'source': 'best'}),
            on='year', how='left')
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Here is a way using .loc and .map():
(df_in.assign(best=df_in['year']
      .map(df_in.loc[df_in['usage'].eq('best'), ['year', 'source']]
                .set_index('year')
                .squeeze())))
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
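Note that this assumes exactly one 'best' row per year: .map() with a Series raises on a non-unique index, and with only one year overall .squeeze() would collapse the lookup Series to a scalar.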

How can I create a dictionary to map individual missing values with aggregate function in Pandas?

I'm not even sure how to describe this, but I am looking for a method to sum the values of others and replace the NaN with a particular value.
My data consists of Organization, Population and Square Miles. I have population and square miles data for each county in the state, however some Organizations span across multiple counties. I merged the two data sets (organization info with pop/square miles data) but am obviously left with NaNs for the organizations that span across multiple counties.
If I create a dictionary like the following:
counties = {'Job a': ['county a', 'county b', 'county c'],
            'Job b': ['county d', 'county e', 'county f'],
            'Job c': ['county g', 'county h']}  # etc.
If I have table 1 like this:
county sq_mile population
0 County a 2500 15000
1 County b 750 400
2 County c 4000 3500
3 County d 4300 4500
4 County e 2000 1500
5 County f 1000 1500
6 County g 4300 3500
7 County h 4400 3200
8 County i 4000 3500
How can I take table 2 (which has just organizations) and fill missing data using the dictionary and associated column values?
organization sq_mile population
0 job a 7250 18900
1 job b 7300 7500
2 job c 8700 6700
3 county i 4000 3500
Try this:
counties_r = {county: job for job, county_list in counties.items() for county in county_list}
(df.groupby(df['County'].str.lower().map(counties_r).fillna(df['County']))
   .sum(numeric_only=True)  # sum only the numeric columns
   .reset_index())
Output:
County Sq. Mile Population
0 County i 4000 3500
1 Job a 7250 18900
2 Job b 7300 7500
3 Job c 8700 6700
Swap the keys and values in your dictionary, map that swapped dictionary onto the (lowercased) county column, then use groupby with sum.
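For reference, a self-contained sketch of this approach, rebuilding the merged table from the question's table 1 (the capitalized column names County, Sq. Mile and Population are assumptions matching the output above):
import pandas as pd

counties = {'Job a': ['county a', 'county b', 'county c'],
            'Job b': ['county d', 'county e', 'county f'],
            'Job c': ['county g', 'county h']}
df = pd.DataFrame({'County': ['County a', 'County b', 'County c', 'County d', 'County e',
                              'County f', 'County g', 'County h', 'County i'],
                   'Sq. Mile': [2500, 750, 4000, 4300, 2000, 1000, 4300, 4400, 4000],
                   'Population': [15000, 400, 3500, 4500, 1500, 1500, 3500, 3200, 3500]})

# reverse the mapping: county -> job
counties_r = {county: job for job, county_list in counties.items() for county in county_list}
out = (df.groupby(df['County'].str.lower().map(counties_r).fillna(df['County']))
         .sum(numeric_only=True)
         .reset_index())
print(out)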

Adding Row values as Columns in Python [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 11 months ago.
I have the following table in pandas (notice how the item repeats for each warehouse)
id  Item  Warehouse       Price  Cost
1   Cake  US: California  30     20
1   Cake  US: Chicago     30     20
2   Meat  US: California  40     10
2   Meat  US: Chicago     40     10
And I need to add each warehouse as a separate column like this:
id  Item  Warehouse 1     Warehouse 2  Price  Cost
1   Cake  US: California  US: Chicago  30     20
2   Meat  US: California  US: Chicago  40     10
Data:
{'id': [1, 1, 2, 2],
 'Item': ['Cake', 'Cake', 'Meat', 'Meat'],
 'Warehouse': ['US: California', 'US: Chicago', 'US: California', 'US: Chicago'],
 'Price': [30, 30, 40, 40],
 'Cost': [20, 20, 10, 10]}
You could assign a number to each warehouse for each id using groupby + cumcount; then pivot:
out = (df.assign(col_idx=df.groupby('Item').cumcount().add(1))
         .pivot(index=['id', 'Item', 'Price', 'Cost'], columns='col_idx', values='Warehouse')
         .add_prefix('Warehouse ').reset_index().rename_axis(columns=[None]))
or you could use groupby + agg(list); then construct a DataFrame with the Warehouse column and join:
out = df.groupby(['id', 'Item', 'Price', 'Cost']).agg(list).reset_index()
out = (out.drop(columns='Warehouse')
          .join(pd.DataFrame(out['Warehouse'].tolist(),
                             columns=['Warehouse 1', 'Warehouse 2'])))
Output:
id Item Price Cost Warehouse 1 Warehouse 2
0 1 Cake 30 20 US: California US: Chicago
1 2 Meat 40 10 US: California US: Chicago
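One note on the second approach: pd.DataFrame(..., columns=['Warehouse 1', 'Warehouse 2']) hard-codes two columns, so it assumes no item has more than two warehouses; the pivot approach generates one column per position automatically.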

Python pandas - look up value in different df using 2 columns' values, then calculate difference

I want to add a column to my df to show the difference between the CurrentScore and the base scores corresponding to the same Date, Sector, and Classification. The base scores are in a separate dataframe called base_score_df with the Dates as its index. If the base_score_df is missing that day's base scores, I want the result to be null.
The main df:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': '2022-2-1 2022-2-1 2022-2-2 2022-2-2 2022-2-2 2022-2-3 2022-2-3 2022-2-3'.split(),
                   'Name': 'Walmart Google Walmart Microsoft Target Walmart Google Microsoft'.split(),
                   'Sector': 'Retail Tech Retail Tech Retail Retail Tech Tech'.split(),
                   'Classification': '3 4 3 5 5 4 4 4'.split(),
                   'CurrentScore': '200 197 202 188 186 193 202 201'.split()})
print(df)
Date Name Sector Classification CurrentScore
0 2022-2-1 Walmart Retail 3 200
1 2022-2-1 Google Tech 4 197
2 2022-2-2 Walmart Retail 3 202
3 2022-2-2 Microsoft Tech 5 188
4 2022-2-2 Target Retail 5 186
5 2022-2-3 Walmart Retail 4 193
6 2022-2-3 Google Tech 4 202
7 2022-2-3 Microsoft Tech 4 201
The base_score_df:
base_score_df = pd.DataFrame({'Date': '2022-2-1 2022-2-3'.split(),
                              'Retail 3': '100 97'.split(),
                              'Retail 4': '102 100'.split(),
                              'Retail 5': '103 101'.split(),
                              'Tech 3': '105 107'.split(),
                              'Tech 4': '110 109'.split(),
                              'Tech 5': '112 113'.split()})
base_score_df.set_index(['Date'], inplace=True)
print(base_score_df)
Retail 3 Retail 4 Retail 5 Tech 3 Tech 4 Tech 5
Date
2022-2-1 100 102 103 105 110 112
2022-2-3 97 100 101 107 109 113
My solution is to (1) concatenate Sector and Classification into a "Sector Classification" column, (2) use apply() with a row-wise lookup function to put the base scores into a new "Base Score" column in the df, and (3) calculate the difference in another column.
Code for (2):
def base_score_lookup(row):
    scoredate = row['Date']
    header = row['Sector Classification']
    return base_score_df.loc[scoredate, header]

df['Base Score'] = df.apply(base_score_lookup, axis=1)
The problem is, if a date is missing in base_score_df, .loc raises a KeyError and the code doesn't run. I just want to use a null value in that case and move on to the next row. I also wonder whether the code can be written differently for faster speed. Thanks in advance.
Here's what you can do, explanation in the comments:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': '2022-2-1 2022-2-1 2022-2-2 2022-2-2 2022-2-2 2022-2-3 2022-2-3 2022-2-3'.split(),
                   'Name': 'Walmart Google Walmart Microsoft Target Walmart Google Microsoft'.split(),
                   'Sector': 'Retail Tech Retail Tech Retail Retail Tech Tech'.split(),
                   'Classification': '3 4 3 5 5 4 4 4'.split(),
                   'CurrentScore': '200 197 202 188 186 193 202 201'.split()})
base_score_df = pd.DataFrame({'Date': '2022-2-1 2022-2-3'.split(),
                              'Retail 3': '100 97'.split(),
                              'Retail 4': '102 100'.split(),
                              'Retail 5': '103 101'.split(),
                              'Tech 3': '105 107'.split(),
                              'Tech 4': '110 109'.split(),
                              'Tech 5': '112 113'.split()})
# ensure date column is in the same format
df['Date'] = pd.to_datetime(df.Date)
base_score_df['Date'] = pd.to_datetime(base_score_df.Date)
# melt the base score df into a long format
base_score_df = pd.melt(base_score_df,
                        id_vars=['Date'],
                        value_vars=[c for c in base_score_df.columns if c != 'Date'])
base_score_df.columns = ['Date', 'category', 'BaseScore']
# split the category into Sector and Classification
base_score_df['Sector'], base_score_df['Classification'] = zip(*base_score_df.category.str.split(' '))
base_score_df.drop('category', axis=1, inplace=True)
# merge back with original dataframe
df = pd.merge(df,
              base_score_df,
              on=['Date', 'Sector', 'Classification'],
              how='left')
# calculate score difference
df['ScoreDiff'] = df['CurrentScore'].astype(float) - df['BaseScore'].astype(float)
# output
df
Date Name Sector Classification CurrentScore BaseScore ScoreDiff
0 2022-02-01 Walmart Retail 3 200 100 100.0
1 2022-02-01 Google Tech 4 197 110 87.0
2 2022-02-02 Walmart Retail 3 202 NaN NaN
3 2022-02-02 Microsoft Tech 5 188 NaN NaN
4 2022-02-02 Target Retail 5 186 NaN NaN
5 2022-02-03 Walmart Retail 4 193 100 93.0
6 2022-02-03 Google Tech 4 202 109 93.0
7 2022-02-03 Microsoft Tech 4 201 109 92.0

How to compare fields from two CSV files with an arithmetic condition?

I have two CSV files. The first file contains the names of all countries with their capital cities.
CSV 1:
Capital Country Country Code
Budapest Hungary HUN
Rome Italy ITA
Dublin Ireland IRL
Paris France FRA
Berlin Germany DEU
...
CSV 2:
The second CSV file contains trip details of a bus:
Trip City Trip Country No. of pax
Budapest HUN 24
Paris FRA 36
Munich DEU 9
Florence ITA 5
Milan ITA 25
Rome ITA 2
Rome ITA 45
I would like to add a new column df["Tourism visit"] with the value of No. of pax, if the Trip City (from CSV 2) is a capital of a country (from CSV 1) and if the number of pax is more than 10.
Thank you.
Try this:
df2['tourism'] = 0
mask = df2['Trip City'].isin(df1['Capital']) & (df2['No. of pax'] > 10)
df2.loc[mask, 'tourism'] = df2.loc[mask, 'No. of pax']
I get:
Trip_City Trip_Country No._of_pax tourism
0 Budapest HUN 24 24
1 Paris FRA 36 36
2 Munich DEU 9 0
3 Florence ITA 5 0
4 Milan ITA 25 0
5 Rome ITA 2 0
6 Rome ITA 45 45
(I had to add _s to get pd.read_clipboard() to work properly)
This might also help.
Import the dataframes:
import numpy as np
import pandas as pd

df1 = pd.read_csv("CSV1.csv")
df2 = pd.read_csv("CSV2.csv")
Make a dictionary out of the two pandas Series:
my_dict = dict(zip(df1["Country Code"], df1["Capital"]))
Define a function that tests your conditions (note: I used np.logical_and() to combine the conditions; a plain and would work just as well here):
def isTourism(country_code, trip_city, no_of_pax):
    if np.logical_and(my_dict[country_code] == trip_city, no_of_pax > 10):
        return "Yes"
    else:
        return "No"
Call the function with map:
df2["Tourism"] = list(map(isTourism, df2["Trip Country"], df2["Trip City"], df2["No. of pax"]))
print(df2)
Trip City Trip Country No. of pax Tourism
0 Budapest HUN 24 Yes
1 Paris FRA 36 Yes
2 Munich DEU 9 No
3 Florence ITA 5 No
4 Milan ITA 25 No
5 Rome ITA 2 No
6 Rome ITA 45 Yes
If you filter your second dataframe to only the values > 10, you could merge and sum as follows:
import pandas as pd
df1 = pd.DataFrame({'Capital': ['Budapest', 'Rome', 'Dublin', 'Paris', 'Berlin'],
                    'Country': ['Hungary', 'Italy', 'Ireland', 'France', 'Germany'],
                    'Country Code': ['HUN', 'ITA', 'IRL', 'FRA', 'DEU']})
df2 = pd.DataFrame({'Trip City': ['Budapest', 'Paris', 'Munich', 'Florence',
                                  'Milan', 'Rome', 'Rome'],
                    'Trip Country': ['HUN', 'FRA', 'DEU', 'ITA', 'ITA', 'ITA', 'ITA'],
                    'No. of pax': [24, 36, 9, 5, 25, 2, 45]})
df2 = df2[df2['No. of pax'] > 10]
combined = (df1.merge(df2,
                      left_on=['Capital', 'Country Code'],
                      right_on=['Trip City', 'Trip Country'],
                      how='left')
               .groupby(['Capital', 'Country Code'], sort=False, as_index=False)['No. of pax']
               .sum(min_count=1))  # min_count=1 keeps capitals with no qualifying trips as NaN
print(combined)
This prints:
Capital Country Code No. of pax
0 Budapest HUN 24.0
1 Rome ITA 45.0
2 Dublin IRL NaN
3 Paris FRA 36.0
4 Berlin DEU NaN
