How to plot correlation between two columns - python

The task is the following:
Is there a correlation between the age of an athlete and his result at the Olympics in the entire dataset?
Each athlete has a name, age, medal (gold, silver, bronze or NA).
In my opinion, it is necessary to count the number of all athletes of the same age and calculate the percentage of them who have any kind of medal (data.Medal.notnull()). The graph should show all ages on the x-axis and the percentage of those who have any medal on the y-axis. How can I get this data and create the graph with the help of pandas and matplotlib?
For instance, some data like in table:
Name Age Medal
Name1 20 Silver
Name2 21 NA
Name3 20 NA
Name4 22 Bronze
Name5 22 NA
Name6 21 NA
Name7 20 Gold
Name8 19 Silver
Name9 20 Gold
Name10 20 NA
Name11 21 Silver
The result should be (in the graphic):
19 - 100%
20 - 60%
21 - 33%
22 - 50%

First, turn df.Medal into 1s for a medal and 0s for NaN values using np.where.
import pandas as pd
import numpy as np
data = {'Name': {0: 'Name1', 1: 'Name2', 2: 'Name3', 3: 'Name4', 4: 'Name5',
                 5: 'Name6', 6: 'Name7', 7: 'Name8', 8: 'Name9', 9: 'Name10',
                 10: 'Name11'},
        'Age': {0: 20, 1: 21, 2: 20, 3: 22, 4: 22, 5: 21, 6: 20, 7: 19, 8: 20,
                9: 20, 10: 21},
        'Medal': {0: 'Silver', 1: np.nan, 2: np.nan, 3: 'Bronze', 4: np.nan,
                  5: np.nan, 6: 'Gold', 7: 'Silver', 8: 'Gold', 9: np.nan,
                  10: 'Silver'}}
df = pd.DataFrame(data)
df.Medal = np.where(df.Medal.notna(),1,0)
print(df)
Name Age Medal
0 Name1 20 1
1 Name2 21 0
2 Name3 20 0
3 Name4 22 1
4 Name5 22 0
5 Name6 21 0
6 Name7 20 1
7 Name8 19 1
8 Name9 20 1
9 Name10 20 0
10 Name11 21 1
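Incidentally, since notna() already returns booleans, casting them to int is an equivalent alternative to np.where (a minimal sketch on a toy column):

```python
import pandas as pd
import numpy as np

# toy medal column: strings for medals, NaN for none
df = pd.DataFrame({'Medal': ['Silver', np.nan, 'Gold', np.nan]})

# True/False -> 1/0, same result as np.where(df.Medal.notna(), 1, 0)
df['Medal'] = df['Medal'].notna().astype(int)
print(df['Medal'].tolist())  # [1, 0, 1, 0]
```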
Now, you could plot the data maybe as follows:
import seaborn as sns
import matplotlib.ticker as mtick
sns.set_theme()
ax = sns.barplot(data=df, x=df.Age, y=df.Medal, errorbar=None)
# in versions prior to `seaborn 0.12` use
# `ax = sns.barplot(data=df, x=df.Age, y=df.Medal, ci=None)`
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
# adding labels
ax.bar_label(ax.containers[0],
             labels=[f'{round(v*100,2)}%' for v in ax.containers[0].datavalues])
Result:
Incidentally, if you wanted to calculate these percentages yourself, one option is pd.crosstab:
percentages = pd.crosstab(df.Age,df.Medal, normalize='index')\
.rename(columns={1:'percentages'})['percentages']
print(percentages)
Age
19 1.000000
20 0.600000
21 0.333333
22 0.500000
Name: percentages, dtype: float64
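Since df.Medal is already 0/1 at this point, the same percentages can also be obtained with a plain groupby mean (an equivalent one-liner, sketched on the recoded data):

```python
import pandas as pd

# the recoded data: Age plus the 0/1 medal indicator
df = pd.DataFrame({'Age':   [20, 21, 20, 22, 22, 21, 20, 19, 20, 20, 21],
                   'Medal': [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]})

# the mean of a 0/1 column per group is exactly the fraction with a medal
percentages = df.groupby('Age')['Medal'].mean()
print(percentages)
# Age
# 19    1.000000
# 20    0.600000
# 21    0.333333
# 22    0.500000
```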
So, with matplotlib, you could also do something like:
import matplotlib.pyplot as plt
percentages = pd.crosstab(df.Age, df.Medal, normalize='index')\
                .rename(columns={1: 'percentages'})['percentages'].mul(100)
my_cmap = plt.get_cmap("viridis")
rescale = lambda y: (y - np.min(y)) / (np.max(y) - np.min(y))
fig, ax = plt.subplots()
ax.bar(x=percentages.index.astype(str),
       height=percentages.to_numpy(),
       color=my_cmap(rescale(percentages.to_numpy())))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.bar_label(ax.containers[0], fmt='%.1f%%')
plt.show()
Result:

Related

Reorder rows of Pandas dataframe using custom order over multiple columns

I want to reorder the rows of my dataframe based on a custom order over multiple columns.
Say, I have the following df:
import pandas as pd
df = pd.DataFrame.from_dict({'Name': {0: 'Tim', 1: 'Tim', 2: 'Tim', 3: 'Ari', 4: 'Ari', 5: 'Ari', 6: 'Dan', 7: 'Dan', 8: 'Dan'}, 'Subject': {0: 'Math', 1: 'Science', 2: 'History', 3: 'Math', 4: 'Science', 5: 'History', 6: 'Math', 7: 'Science', 8: 'History'}, 'Test1': {0: 10, 1: 46, 2: 54, 3: 10, 4: 83, 5: 39, 6: 10, 7: 58, 8: 10}, 'Test2': {0: 5, 1: 78, 2: 61, 3: 7, 4: 32, 5: 43, 6: 1, 7: 28, 8: 50}})
which looks like this
Name Subject Test1 Test2
Tim Math 10 5
Tim Science 46 78
Tim History 54 61
Ari Math 10 7
Ari Science 83 32
Ari History 39 43
Dan Math 10 1
Dan Science 58 28
Dan History 10 50
I want to sort it by Name first according to custom order ['Dan','Tim','Ari'] and then sort it by Subject according to custom order ['Science','History','Math'].
So my final df should look like
Name Subject Test1 Test2
Dan Science 58 28
Dan History 10 50
Dan Math 10 1
Tim Science 46 78
Tim History 54 61
Tim Math 10 5
Ari Science 83 32
Ari History 39 43
Ari Math 10 7
It seems like a simple thing, but I can't quite figure out how to do it. The closest solution I could find was how to custom reorder rows according to a single column here. I want to be able to do this for multiple columns simultaneously.
You can represent Name and Subject as categorical variables:
names = ['Dan','Tim','Ari']
subjects = ['Science','History','Math']
df = df.astype({'Name': pd.CategoricalDtype(names, ordered=True),
                'Subject': pd.CategoricalDtype(subjects, ordered=True)})
>>> df.sort_values(['Name', 'Subject'])
Name Subject Test1 Test2
7 Dan Science 58 28
8 Dan History 10 50
6 Dan Math 10 1
1 Tim Science 46 78
2 Tim History 54 61
0 Tim Math 10 5
4 Ari Science 83 32
5 Ari History 39 43
3 Ari Math 10 7
>>> df.sort_values(['Subject', 'Name'])
Name Subject Test1 Test2
7 Dan Science 58 28
1 Tim Science 46 78
4 Ari Science 83 32
8 Dan History 10 50
2 Tim History 54 61
5 Ari History 39 43
6 Dan Math 10 1
0 Tim Math 10 5
3 Ari Math 10 7
You can create 2 temporary columns for sorting and then drop them after you've sorted your df.
(
    df.assign(key1=df.Name.map({'Dan': 0, 'Tim': 1, 'Ari': 2}),
              key2=df.Subject.map({'Science': 0, 'History': 1, 'Math': 2}))
      .sort_values(['key1', 'key2'])
      .drop(['key1', 'key2'], axis=1)
)
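On pandas >= 1.1 there is a third option that leaves the frame untouched: pass a key callable to sort_values. The callable receives each sort column as a Series, so it can look up the right custom order by the column's name (a sketch, using the same custom orders as above):

```python
import pandas as pd

df = pd.DataFrame({'Name':    ['Tim', 'Ari', 'Dan'],
                   'Subject': ['Math', 'Science', 'History']})

order = {'Name':    ['Dan', 'Tim', 'Ari'],
         'Subject': ['Science', 'History', 'Math']}

# map each value to its rank in the custom order for that column
result = df.sort_values(['Name', 'Subject'],
                        key=lambda s: s.map({v: i for i, v in enumerate(order[s.name])}))
print(result['Name'].tolist())  # ['Dan', 'Tim', 'Ari']
```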

Transposing values in df?

Imagine having the following df:
Document type Invoicenumber Invoicedate description quantity unit price line amount
Invoice 123 28-08-2020
0 NaN 17-09-2020 test 1,5 5 20
0 NaN 16-04-2020 test2 1,5 5 20
Invoice 456 02-03-2020
0 NaN NaN test3 21 3 64
0 0 NaN test3 21 3 64
0 0 NaN test3 21 3 64
The rows where there is a 0 belong to the row above and are line items of the same document.
My goal is to transpose the line items so that, for each invoice, they are all on the same line, as such (I've tried to transpose them based on index, but this did not work):
**Document type** **Invoicenumber Invoicedate** description#1 description#2 quantity quantity#2 unit price unit price #2 line amount line amount #2
Invoice 123 28-08-2020 test test2 1,5 1,5 5 5 20 20
and for the second row:
**Document type** **Invoicenumber Invoicedate** description#1 description#2 description #3 quantity quantity#2 quantity #3 unit price unit price #2 unit price #3 line amount line amount #2 line amount #3
Invoice 123 28-08-2020 test3 test3 test3 21 21 21 3 3 3 64 64 64
here is the dictionary code:
df = pd.DataFrame.from_dict({'Document Type': {0: 'IngramMicro.AccountsPayable.Invoice',
1: 0,
2: 0,
3: 'IngramMicro.AccountsPayable.Invoice',
4: 0,
5: 0,
6: 0},
'Factuurnummer': {0: '0.78861803',
1: 'NaN',
2: 'NaN',
3: '202130534',
4: 'NaN',
5: 'NaN',
6: 'NaN'},
'Factuurdatum': {0: '2021-05-03',
1: 'NaN',
2: 'NaN',
3: '2021-09-03',
4: 'NaN',
5: 'NaN',
6: 'NaN'},
'description': {0: 'NaN',
1: 'TM 300 incl onderstel 3058C003 84433210 4549292119381',
2: 'ESP 5Y 36 inch 7950A539 00000000 4960999794266',
3: 'NaN',
4: 'Basistarief A3 Office',
5: 'Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021',
6: 'Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021'},
'quantity': {0: 'NaN', 1: 1.0, 2: 1.0, 3: 'NaN', 4: 1.0, 5: 1.0, 6: 2.0},
'unit price': {0: 'NaN',
1: 1211.63,
2: 742.79,
3: 'NaN',
4: 260.0,
5: 30.0,
6: 30.0},
'line amount': {0: 'NaN',
1: 21.0,
2: 21.0,
3: 'NaN',
4: 260.0,
5: 30.0,
6: 30.0}})
I've tried the following:
df = pd.DataFrame(data=d1)
However, this did not accomplish what I need. Please help!
Here is what you can do. First we enumerate the groups and the line items within each group, and clean up 'Document Type':
import numpy as np
df['g'] = df['Document Type'].ne(0).cumsum()
df['l'] = df.groupby('g').cumcount()
df['Document Type'] = df['Document Type'].replace(0, np.nan).ffill()
df
we get
Document Type Factuurnummer Factuurdatum description quantity unit price line amount g l
-- ----------------------------------- --------------- -------------- ------------------------------------------------------------------------ ---------- ------------ ------------- --- ---
0 IngramMicro.AccountsPayable.Invoice 0.788618 2021-05-03 NaN nan nan nan 1 0
1 IngramMicro.AccountsPayable.Invoice nan NaN TM 300 incl onderstel 3058C003 84433210 4549292119381 1 1211.63 21 1 1
2 IngramMicro.AccountsPayable.Invoice nan NaN ESP 5Y 36 inch 7950A539 00000000 4960999794266 1 742.79 21 1 2
3 IngramMicro.AccountsPayable.Invoice 2.02131e+08 2021-09-03 NaN nan nan nan 2 0
4 IngramMicro.AccountsPayable.Invoice nan NaN Basistarief A3 Office 1 260 260 2 1
5 IngramMicro.AccountsPayable.Invoice nan NaN Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 1 30 30 2 2
6 IngramMicro.AccountsPayable.Invoice nan NaN Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 2 30 30 2 3
Now we can index on 'g' and 'l' and then move 'l' to the columns via unstack. We drop columns that are all NaN:
df2 = df.set_index(['g','Document Type','l']).unstack(level = 2).replace('NaN',np.nan).dropna(axis='columns', how = 'all')
we rename column labels to be single-level:
df2.columns = [tup[0] + '_' + str(tup[1]) for tup in df2.columns.values]
df2.reset_index().drop(columns = 'g')
and we get something that looks like what you are after, I believe
Document Type Factuurnummer_0 Factuurdatum_0 description_1 description_2 description_3 quantity_1 quantity_2 quantity_3 unit price_1 unit price_2 unit price_3 line amount_1 line amount_2 line amount_3
-- ----------------------------------- ----------------- ---------------- ----------------------------------------------------- ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------ ------------ ------------ -------------- -------------- -------------- --------------- --------------- ---------------
0 IngramMicro.AccountsPayable.Invoice 0.788618 2021-05-03 TM 300 incl onderstel 3058C003 84433210 4549292119381 ESP 5Y 36 inch 7950A539 00000000 4960999794266 nan 1 1 nan 1211.63 742.79 nan 21 21 nan
1 IngramMicro.AccountsPayable.Invoice 2.02131e+08 2021-09-03 Basistarief A3 Office Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 1 1 2 260 30 30 260 30 30
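The heart of the trick (ne(0).cumsum() to number the invoices, cumcount() to number the line items, then a reshape to wide) can be sketched on a tiny toy frame:

```python
import pandas as pd

# toy frame: header rows carry the type, line-item rows carry a 0
df = pd.DataFrame({'Type':   ['Invoice', 0, 0, 'Invoice', 0],
                   'amount': [None, 10.0, 20.0, None, 30.0]})

df['g'] = df['Type'].ne(0).cumsum()   # 1,1,1,2,2 -> which invoice each row belongs to
df['l'] = df.groupby('g').cumcount()  # 0,1,2,0,1 -> position within the invoice

# keep only the line items and spread them out into columns
wide = df[df['l'] > 0].pivot(index='g', columns='l', values='amount')
wide.columns = [f'amount_{c}' for c in wide.columns]
print(wide)
#    amount_1  amount_2
# g
# 1      10.0      20.0
# 2      30.0       NaN
```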

Pandas - Create column with difference in values

I have the below dataset. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want: the difference in money at each expiry point for each person. I highlighted the other rows in colors so it is clearer.
Thanks a lot.
Example
import pandas as pd
import numpy as np
example = pd.DataFrame(data={'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
                                     '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
                             'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
                             'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
                             'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})
example_0830 = example[ example['Day']=='2020-08-30' ].reset_index()
example_0829 = example[ example['Day']=='2020-08-29' ].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame( example_0829, columns = ['key','Money'])
example_0830 = pd.merge(example_0830, example_0829, on = 'key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y','index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always taken from the previous date, you could also define a date variable at the beginning to find today (t) and the previous day (t-1) and use it to filter the original dataframe.
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want (newest first);
# note the result must be assigned back
df = df.sort_values('Day', ascending=False)
# group by and take the difference from the next row in each group;
# diff(1) takes the difference from the previous row, so diff(-1) uses the next one
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
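The same result can be spelled out with groupby.shift(-1), which makes "subtract the next (older) row" explicit (an equivalent sketch on the same toy frame):

```python
import pandas as pd

df = pd.DataFrame({
    'Day':    [30, 30, 30, 30, 29, 29, 28, 28],
    'Name':   ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
    'Money':  [100, 950, 200, 1000, 50, 50, 250, 1200],
    'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]})

df = df.sort_values('Day', ascending=False)  # newest first; keep the result
# subtract the next (older) value within each Name/Expiry group
df['Difference'] = df['Money'] - df.groupby(['Name', 'Expiry'])['Money'].shift(-1)
print(df['Difference'].tolist()[:4])  # [50.0, 900.0, -50.0, -200.0]
```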

Adding missing dates to the dataframe

I have a dataframe that has a list of One Piece manga, which currently looks like this:
0 # Title Pages
Date
1997-07-19 1 Romance Dawn - The Dawn of the Adventure 53
1997-07-28 2 That Guy, "Straw Hat Luffy" 23
1997-08-04 3 Introducing "Pirate Hunter Zoro" 21
1997-08-11 4 Marine Captain "Axe-Hand Morgan" 19
1997-08-25 5 Pirate King and Master Swordsman 19
1997-09-01 6 The First Crew Member 23
1997-09-08 7 Friends 20
1997-09-13 8 Introducing Nami 19
Although every episode is supposed to be issued weekly, sometimes they are delayed or on break, resulting in irregular intervals in the dates. What I would like to do is add the missing dates. For example, between 1997-08-11 and 1997-08-25 there should be 1997-08-18 (7 days from 1997-08-11), where the episode was not issued. Could you help me out with how to write this code?
Thank you.
You should use the built-in shift function.
df['day_between'] = df['Date'].shift(-1) - df['Date']
output of print(df[['Date', 'day_between']]) is then:
Date day_between
0 1997-07-19 9 days
1 1997-07-28 7 days
2 1997-08-04 7 days
3 1997-08-11 14 days
4 1997-08-25 7 days
5 1997-09-01 7 days
6 1997-09-08 5 days
7 1997-09-13 NaT
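Building on that gap column, one way to actually insert the missing weeks is to step forward in 7-day increments wherever the gap is 14 days or more (assuming, as in the question, that a gap under 14 days is just a delay rather than a skipped issue; a sketch on the dates alone):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(
    ['1997-07-19', '1997-07-28', '1997-08-04', '1997-08-11',
     '1997-08-25', '1997-09-01', '1997-09-08', '1997-09-13'])})

gap = df['Date'].shift(-1) - df['Date']  # time to the next issue (NaT on the last row)
missing = []
for d, g in zip(df['Date'], gap):
    # a gap of 14+ days means at least one weekly issue was skipped
    while g >= pd.Timedelta(days=14):    # NaT comparisons are False, so the last row is safe
        d += pd.Timedelta(days=7)
        g -= pd.Timedelta(days=7)
        missing.append(d)

out = (pd.concat([df, pd.DataFrame({'Date': missing})])
         .sort_values('Date').reset_index(drop=True))
print(missing)  # [Timestamp('1997-08-18 00:00:00')]
```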
I used relativedelta and a list comprehension to get a 14-day interval per row, .shift(1) to compare with the previous row, and np.where() to flag (with a 1) each row before which we would want to insert a row. Then I looped through the dataframe and appended the relevant rows to another dataframe, used pd.concat to bring the two dataframes together, sorted by date, deleted the helper columns and reset the index.
There may be some gaps left, as others have mentioned, such as 22+ days, but this should get you in the right direction. Perhaps you could turn it into a function and run it multiple times, which is why I added .reset_index(drop=True) at the end. Obviously, you could make this more advanced, but I hope this helps.
from dateutil.relativedelta import relativedelta
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': {0: '1997-07-19',
1: '1997-07-28',
2: '1997-08-04',
3: '1997-08-11',
4: '1997-08-25',
5: '1997-09-01',
6: '1997-09-08',
7: '1997-09-13'},
'#': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Title': {0: 'Romance Dawn - The Dawn of the Adventure',
1: 'That Guy, "Straw Hat Luffy"',
2: 'Introducing "Pirate Hunter Zoro"',
3: 'Marine Captain "Axe-Hand Morgan"',
4: 'Pirate King and Master Swordsman',
5: 'The First Crew Member',
6: 'Friends',
7: 'Introducing Nami'},
'Pages': {0: 53, 1: 23, 2: 21, 3: 19, 4: 19, 5: 23, 6: 20, 7: 19}})
df['Date'] = pd.to_datetime(df['Date'])
df['Date2'] = [d - relativedelta(days=-14) for d in df['Date']]
df['Date3'] = np.where((df['Date'] >= df['Date2'].shift(1)), 1 , 0)
df1 = pd.DataFrame({})
n=0
for j in (df['Date3']):
n+=1
if j == 1:
new_row = pd.DataFrame({"Date": df['Date'][n-1] - relativedelta(days=7)}, index=[n])
df1 = pd.concat([df1, new_row])  # DataFrame.append was removed in pandas 2.0
df = pd.concat([df, df1]).sort_values('Date').drop(['Date2', 'Date3'], axis=1).reset_index(drop=True)
df
Output:
Date # Title Pages
0 1997-07-19 1.0 Romance Dawn - The Dawn of the Adventure 53.0
1 1997-07-28 2.0 That Guy, "Straw Hat Luffy" 23.0
2 1997-08-04 3.0 Introducing "Pirate Hunter Zoro" 21.0
3 1997-08-11 4.0 Marine Captain "Axe-Hand Morgan" 19.0
4 1997-08-18 NaN NaN NaN
5 1997-08-25 5.0 Pirate King and Master Swordsman 19.0
6 1997-09-01 6.0 The First Crew Member 23.0
7 1997-09-08 7.0 Friends 20.0
8 1997-09-13 8.0 Introducing Nami 19.0

How can I create a stacked bar chart in matplotlib where the stacks vary from bar to bar?

So I have a pandas DataFrame that looks something like this:
year country total
0 2010 USA 10
1 2010 CHIN 12
2 2011 USA 8
3 2011 JAPN 12
4 2012 KORR 7
5 2012 USA 10
6 2013 CHIN 9
7 2013 USA 13
I'd like to create a stacked bar chart in matplotlib, where there is one bar for each year and stacks for the two countries in that year with height based on the total column. The color should be based on the country and be represented in the legend.
I can't seem to figure out how to make this happen. I think I could do it using for loops to go through each year and each country, then construct the bar with the color corresponding to values in a dictionary. However, this will create individual legend entries for each individual bar such that there are 8 total values in the legend. This is also a horribly inefficient way to graph in matplotlib as far as I can tell.
Can anyone give some pointers?
You need to transform your df first. It can be done via the below:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': {0: 2010, 1: 2010, 2: 2011, 3: 2011, 4: 2012, 5: 2012, 6: 2013, 7: 2013},
                   'country': {0: 'USA', 1: 'CHIN', 2: 'USA', 3: 'JAPN', 4: 'KORR', 5: 'USA', 6: 'CHIN', 7: 'USA'},
                   'total': {0: 10, 1: 12, 2: 8, 3: 12, 4: 7, 5: 10, 6: 9, 7: 13}})
df2 = df.groupby(['year', 'country'])['total'].sum().unstack('country')
print(df2)
country  CHIN  JAPN  KORR   USA
year
2010     12.0   NaN   NaN  10.0
2011      NaN  12.0   NaN   8.0
2012      NaN   NaN   7.0  10.0
2013      9.0   NaN   NaN  13.0
ax = df2.plot(kind='bar', stacked=True)
plt.show()
Result:
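Since each year/country pair occurs at most once in this data, pivot (no aggregation needed) is a minor alternative to the groupby/unstack step; a sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year':    [2010, 2010, 2011, 2011, 2012, 2012, 2013, 2013],
                   'country': ['USA', 'CHIN', 'USA', 'JAPN', 'KORR', 'USA', 'CHIN', 'USA'],
                   'total':   [10, 12, 8, 12, 7, 10, 9, 13]})

# wide frame: one row per year, one column per country, NaN where absent
df2 = df.pivot(index='year', columns='country', values='total')
ax = df2.plot(kind='bar', stacked=True)  # NaN segments are simply drawn with zero height
plt.show()
```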
