I am attempting to take string data from one dataframe, substitute it with numerical values from another, and create a cross product of the results, as shown below.
I read the data into dataframes; example input below:
import pandas as pd

shoppers = pd.DataFrame({'name': ['bill', 'bob', 'james', 'jill', 'henry'],
                         'item': ['apple', 'apple', 'orange', 'grapes', 'orange']})
stores = pd.DataFrame({'apple': [0.25, 0.20, 0.18],
                       'orange': [0.30, 0.40, 0.35],
                       'grapes': [1.0, 0.9, 1.1],
                       'store': ['kroger', 'publix', 'walmart']})
Here's the resulting shoppers dataframe:
         item
name
bill    apple
bob     apple
james  orange
jill   grapes
henry  orange
And here's the resulting stores dataframe:
         apple  orange  grapes
store
kroger    0.25    0.30     1.0
publix    0.20    0.40     0.9
walmart   0.18    0.35     1.1
And the desired result is the price each person would pay for their item at each store.
I'm really struggling to find the right way to make such a transformation in Pandas efficiently. I could easily loop over shoppers and stores and build each row in a brute-force manner, but there must be a more efficient way to do this with the pandas API. Thanks for any suggestions.
Here's a solution using a dot product rather than a cross product:
pd.crosstab(shoppers.index, shoppers['item']).dot(stores.T)
Output:
       kroger  publix  walmart
row_0
bill     0.25     0.2     0.18
bob      0.25     0.2     0.18
henry    0.30     0.4     0.35
james    0.30     0.4     0.35
jill     1.00     0.9     1.10
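For .dot to produce the name-by-store table shown above, shoppers and stores need name and store set as their index; a self-contained sketch (the DataFrame construction mirrors the question):

```python
import pandas as pd

shoppers = pd.DataFrame({'name': ['bill', 'bob', 'james', 'jill', 'henry'],
                         'item': ['apple', 'apple', 'orange', 'grapes', 'orange']}).set_index('name')
stores = pd.DataFrame({'apple': [0.25, 0.20, 0.18],
                       'orange': [0.30, 0.40, 0.35],
                       'grapes': [1.0, 0.9, 1.1],
                       'store': ['kroger', 'publix', 'walmart']}).set_index('store')

# crosstab one-hot encodes each shopper's item; dot then multiplies by the
# transposed price table, aligning the shared item labels automatically.
result = pd.crosstab(shoppers.index, shoppers['item']).dot(stores.T)
```

Each row of the one-hot matrix has a single 1, so the dot product simply picks out that item's price at every store.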
I need to group/merge the rows for each city of the same name and calculate its overall percentage, to see which city among them has the lowest % literacy rate.
Code:

import pandas as pd

df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town",
                              "Tokyo", "Mumbai", "Belgium", "Belgium"],
                   'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})
print(df)
For example:
      Cities  LiteracyRate
0  Cape Town          0.05
1  Cape Town          0.35
2  Cape Town          0.20
3      Tokyo          0.11
4  Cape Town          0.15
5      Tokyo          0.20
6     Mumbai          0.65
7    Belgium          0.35
8    Belgium          0.45
I'm expecting this:
      Cities  LiteracyRate  %LiteracyRate
0  Cape Town          0.75             75
1      Tokyo          0.31             31
2     Mumbai          0.65             65
3    Belgium          0.80             80
So I tried the code below, but it's not giving me the desired results: the cities with the same name are still not merged, and the percentages aren't right.
# Calculate the percentage
df["%LiteracyRate"] = (df["LiteracyRate"]/df["LiteracyRate"].sum())*100
# Show the DataFrame
print(df)
You can use groupby() in pandas to join cities with the same names, and sum() to calculate the %:
df = df.groupby('Cities').sum()
Then you can format the results using:
df['%LiteracyRate'] = (df['LiteracyRate']*100).round().astype(int)
df = df.reset_index()
To sort them by literacy rate you can use:
df = df.sort_values(by='%LiteracyRate')
df = df.reset_index()
Hope this helps!
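Putting those steps together into one runnable sketch (sorted ascending, so the lowest literacy rate comes first):

```python
import pandas as pd

df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town",
                              "Tokyo", "Mumbai", "Belgium", "Belgium"],
                   'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})

# Sum the rates per city, then express each sum as a whole percentage.
df = df.groupby('Cities').sum()
df['%LiteracyRate'] = (df['LiteracyRate'] * 100).round().astype(int)
df = df.sort_values(by='%LiteracyRate').reset_index()
```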
I have two dataframes of unequal lengths. I want to combine them with a condition: if two rows of df1 are identical, then they must share the same value from df2 (without changing the order).
import pandas as pd

d = {'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']}
df1 = pd.DataFrame(data=d)
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
df2 = pd.DataFrame(data=I)
dfc = pd.concat([df1, df2], axis=1)
My output:

  country  conc
0  France  0.30
1  France  0.25
2   Japan  0.21
3   China  0.37
4   China  0.15
5  Canada   NaN
6  Canada   NaN
7   India   NaN
Expected output:

  country  conc
0  France  0.30
1  France  0.30
2   Japan  0.25
3   China  0.21
4   China  0.21
5  Canada  0.37
6  Canada  0.37
7   India  0.15
You need to create a link between the values and the countries first.
df2["country"] = df1["country"].unique()
Then you can use it to merge with your original dataframe:
pd.merge(df1, df2, on="country")
But be aware that this only works as long as the number of values is identical to the number of distinct countries, and their order is as expected.
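A runnable sketch of that approach on the question's data; it assumes the number of values equals the number of distinct countries and that their order matches:

```python
import pandas as pd

df1 = pd.DataFrame({'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']})
df2 = pd.DataFrame({'conc': [0.30, 0.25, 0.21, 0.37, 0.15]})

# Pair each conc value with a distinct country, in order of first appearance.
df2['country'] = df1['country'].unique()

# The inner merge broadcasts each country's value to all of its rows,
# preserving the row order of df1.
result = pd.merge(df1, df2, on='country')
```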
I'd construct the dataframe directly, without intermediate dfs.
import pandas as pd

d = {'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']}
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
c = 'country'
dfc = pd.DataFrame(I, index=pd.Index(pd.unique(d[c]), name=c)).reindex(d[c]).reset_index()
Is there a simple way to perform calculations for each fruit in turn, adding the newly created column to the original df?
df

 concatted  score  fruit status        date
apple_bana  0.500  apple   high  2010-02-20
     apple  0.600  apple    low  2010-02-21
    banana  0.530   pear    low  2010-01-12
Expected output:

 concatted  score  fruit status        date  first_diff
apple_bana  0.500  apple   high  2010-02-20
     apple  0.600  apple    low  2010-02-21         0.1
    banana  0.530   pear    low  2010-01-12
I tried:
fruits = ['apple', 'banana', 'pair']
for fruit in fruits:
    selected_rows = df[df['fruit'] == fruit]
    selected_rows['first_diff'] = df.score.diff().dropna()
    df = df.append(selected_rows)
Use groupby('fruit') and apply .diff() to the score column:

df['first_diff'] = df[['concatted', 'score', 'fruit', 'status', 'date']].groupby('fruit')['score'].diff().fillna('')
If you need something more general, try:

df['first_diff'] = df[[x for x in df.columns]].groupby('fruit')['score'].diff().fillna('')
    concatted  score  fruit status        date first_diff
0  apple_bana   0.50  apple   high  2010-02-20
1       apple   0.60  apple    low  2010-02-21        0.1
2      banana   0.53   pear    low  2010-01-12
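A minimal runnable version of this approach, keeping NaN for the first row of each group rather than filling with an empty string (fillna('') makes the column mixed-type):

```python
import pandas as pd

df = pd.DataFrame({'concatted': ['apple_bana', 'apple', 'banana'],
                   'score': [0.500, 0.600, 0.530],
                   'fruit': ['apple', 'apple', 'pear'],
                   'status': ['high', 'low', 'low'],
                   'date': ['2010-02-20', '2010-02-21', '2010-01-12']})

# First difference of score within each fruit group; the first row of a
# group has no predecessor, so it stays NaN.
df['first_diff'] = df.groupby('fruit')['score'].diff()
```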
The two dataframes are as shown:

 Name  Score
 John   0.27
Peter   0.34
David   0.89
Sarah   0.67
  Tom   0.93

 Name  minScore
 John      0.50
Peter      0.20
David      0.90
Sarah      0.50
  Tom      0.90
I want to compare the Score column of the first dataframe with the minScore column of the second dataframe and get a filtered first dataframe. My attempt:
df = dataframe1['score']>dataframe2['minscore']
The final output is as shown:

 Name  Score
Peter   0.34
Sarah   0.67
  Tom   0.93
Thanks in advance.
You need to join the dataframes on the Name field:
df = dataframe1.merge(dataframe2, on='Name')
and filter the result:
df[df.Score > df.minScore]
You can create a series indexed with Name and use map in constructing your Boolean condition. I also recommend you copy explicitly if you wish to guarantee you aren't left with a view.
min_map = df2.set_index('Name')['minScore']
df = df1.loc[df1['Score'] > df1['Name'].map(min_map)].copy()
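A self-contained sketch of the map-based filter on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Peter', 'David', 'Sarah', 'Tom'],
                    'Score': [0.27, 0.34, 0.89, 0.67, 0.93]})
df2 = pd.DataFrame({'Name': ['John', 'Peter', 'David', 'Sarah', 'Tom'],
                    'minScore': [0.50, 0.20, 0.90, 0.50, 0.90]})

# Build a Name -> minScore lookup, map it onto df1's names, and keep only
# the rows whose Score beats their threshold.
min_map = df2.set_index('Name')['minScore']
result = df1.loc[df1['Score'] > df1['Name'].map(min_map)].copy()
```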
I have a dictionary of values:
foodWeight = {'Spanish Omlette': -0.20000000000000284,
'Crumbed Chicken Salad': -1.2999999999999972,
'Chocolate Bomb': 0.0,
'Seed Nut Muesli': -3.8999999999999915,
'Fruit': -1.2999999999999972,
'Frikerdels Salad': -1.2000000000000028,
'Seed Nut Cheese Biscuits': 0.4000000000000057,
'Chorizo Pasta': -2.0,
'No carbs Ice Cream': 0.4000000000000057,
'Veg Stew': 0.4000000000000057,
'Bulgar spinach Salad': 0.10000000000000853,
'Mango Cheese': 0.10000000000000853,
'Crumbed Calamari chips': 0.10000000000000853,
'Slaw Salad': 0.20000000000000284,
'Mango': -1.2000000000000028,
'Rice & Fish': 0.20000000000000284,
'Almonds Cheese': -0.09999999999999432,
'Nectarine': -1.7000000000000028,
'Banana Cheese': 0.7000000000000028,
'Mediteranean Salad': 0.7000000000000028,
'Almonds': -4.099999999999994}
I am trying to get the aggregated sum of the values of each food item from the dictionary using Pandas:
fooddata = (pd.DataFrame(list(foodWeight.items()), columns=['food', 'weight'])
              .groupby('food')['weight']
              .agg(['sum'])
              .sort_values(by='sum', ascending=False))
The above code gives the correct output:
sum
food
Banana Cheese 0.7
Mediteranean Salad 0.7
Seed Nut Cheese Biscuits 0.4
Veg Stew 0.4
No carbs Ice Cream 0.4
Slaw Salad 0.2
Rice & Fish 0.2
Almonds Mango 0.1
Bulgar spinach Salad 0.1
Crumbed Calamari chips 0.1
Frikkadels Salad 0.1
Mango Cheese 0.1
Chocolate Bomb 0.0
Burrito Salad 0.0
Fried Eggs Cheese Avocado 0.0
Burger and Chips -0.1
Traditional Breakfast -0.1
Almonds Cheese -0.1
However, I need to get the output in two columns, not the one column Pandas gives me above.
How do I get the output into a format that I can plot, i.e. label and value as separate columns?
Set as_index=False when calling groupby:

fooddata = (pd.DataFrame(list(foodWeight.items()), columns=['food', 'weight'])
              .groupby('food', as_index=False)
              .agg({"weight": "sum"})
              .sort_values(by='weight', ascending=False))
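A small runnable sketch of this, using a three-item subset of the dictionary for illustration:

```python
import pandas as pd

# Subset of the original dictionary, for illustration only.
foodWeight = {'Banana Cheese': 0.7, 'Chorizo Pasta': -2.0, 'Almonds': -4.1}

# as_index=False keeps 'food' as a regular column instead of the index,
# so label and value end up as two separate, plottable columns.
fooddata = (pd.DataFrame(list(foodWeight.items()), columns=['food', 'weight'])
              .groupby('food', as_index=False)
              .agg({'weight': 'sum'})
              .sort_values(by='weight', ascending=False))
```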
You can use the parameter as_index=False in groupby and aggregate with sum:
fooddata = pd.DataFrame(list(foodWeight.items()), columns=['food','weight'])
print (fooddata.groupby('food', as_index=False)['weight']
.sum()
.sort_values(by='weight', ascending=0))
food weight
2 Banana Cheese 0.7
12 Mediteranean Salad 0.7
20 Veg Stew 0.4
14 No carbs Ice Cream 0.4
16 Seed Nut Cheese Biscuits 0.4
18 Slaw Salad 0.2
15 Rice & Fish 0.2
3 Bulgar spinach Salad 0.1
6 Crumbed Calamari chips 0.1
11 Mango Cheese 0.1
4 Chocolate Bomb 0.0
1 Almonds Cheese -0.1
19 Spanish Omlette -0.2
10 Mango -1.2
8 Frikerdels Salad -1.2
9 Fruit -1.3
7 Crumbed Chicken Salad -1.3
13 Nectarine -1.7
5 Chorizo Pasta -2.0
17 Seed Nut Muesli -3.9
0 Almonds -4.1
Another solution is to add reset_index:
print (fooddata.groupby('food')['weight']
.sum()
.sort_values(ascending=0)
.reset_index(name='sum'))
food sum
0 Banana Cheese 0.7
1 Mediteranean Salad 0.7
2 Veg Stew 0.4
3 Seed Nut Cheese Biscuits 0.4
4 No carbs Ice Cream 0.4
5 Slaw Salad 0.2
6 Rice & Fish 0.2
7 Crumbed Calamari chips 0.1
8 Mango Cheese 0.1
9 Bulgar spinach Salad 0.1
10 Chocolate Bomb 0.0
11 Almonds Cheese -0.1
12 Spanish Omlette -0.2
13 Mango -1.2
14 Frikerdels Salad -1.2
15 Crumbed Chicken Salad -1.3
16 Fruit -1.3
17 Nectarine -1.7
18 Chorizo Pasta -2.0
19 Seed Nut Muesli -3.9
20 Almonds -4.1
For plotting it is better not to reset the index, since the index values then form the x-axis; use plot:
fooddata.groupby('food')['weight'].sum().sort_values(ascending=0).plot()
Or if you need a horizontal bar chart, use plot.barh:
fooddata.groupby('food')['weight'].sum().sort_values(ascending=0).plot.barh()
After the grouping you need to reset the index or use as_index=False when calling groupby. Paraphrasing this post: by default, aggregation functions will not return the groups that you are aggregating over if they are named columns; instead, the grouped columns will be the indices of the returned object. Passing as_index=False, or calling reset_index afterwards, will return the groups that you are aggregating over if they are named columns.
See below my attempt to turn your results in a meaningful graph:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = fooddata.reset_index()
ax = df[['food','sum']].plot(kind='barh', title ="Total Sum per Food Item", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("Sum per Food Item", fontsize=12)
ax.set_ylabel("Food Items", fontsize=12)
ax.set_yticklabels(df['food'])
plt.show()
This results in a horizontal bar chart of the summed weight per food item.