Sample DataFrame:

                continent  avg_count_country  avg_age
Male        0   Asia       55                 5
            1   Africa     65                 10
            2   Europe     75                 8
Female      0   Asia       50                 7
            1   Africa     60                 12
            2   Europe     70                 0
Transgender 0   Asia       30                 6
            1   Africa     40                 11
            2   America    80                 10
For the grouped bar graph:
The x axis will have Male, Female, Transgender.
The y axis will have the total counts.
Each of Male, Female and Transgender gets 3 grouped bars:
Male and Female will have Asia, Africa, Europe; Transgender will have Asia, Africa, America.
4 unique colors / legend entries: [Asia, Africa, Europe, America].
I can do it manually by plotting every bar:
bars1 = ...  # manually giving values
bars2 = ...  # manually giving values
... bars3, bars4
plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='var1')
and plotting each bar like this, but I want to do it in a more optimized, dynamic way.
You can reshape your dataframe and use pandas plot:
df_out = df.reset_index(level=1, drop=True)\
.set_index(['continent'], append=True)['avg_count_country']\
.unstack()
df_out.plot.bar()
Output: a grouped bar chart with one group per gender on the x axis and one colored bar per continent.
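A self-contained sketch of the reshape, with the sample frame rebuilt from the question (the plot call is commented out so the snippet runs without matplotlib):

```python
import pandas as pd

# Rebuild the sample frame: (gender, row-number) MultiIndex
idx = pd.MultiIndex.from_tuples([
    ('Male', 0), ('Male', 1), ('Male', 2),
    ('Female', 0), ('Female', 1), ('Female', 2),
    ('Transgender', 0), ('Transgender', 1), ('Transgender', 2),
])
df = pd.DataFrame({
    'continent': ['Asia', 'Africa', 'Europe',
                  'Asia', 'Africa', 'Europe',
                  'Asia', 'Africa', 'America'],
    'avg_count_country': [55, 65, 75, 50, 60, 70, 30, 40, 80],
    'avg_age': [5, 10, 8, 7, 12, 0, 6, 11, 10],
}, index=idx)

# Drop the inner integer level, move continent into the index,
# then unstack so each continent becomes its own column
df_out = (df.reset_index(level=1, drop=True)
            .set_index('continent', append=True)['avg_count_country']
            .unstack())

# df_out.plot.bar()  # requires matplotlib; one group per gender
```

Genders that lack a continent (e.g. Male has no America row) simply get NaN in that column, so no bar is drawn for them.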
I have a big dataframe (the following is an example):

country   value
portugal  86
germany   20
belgium   21
Uk        81
portugal  77
UK        87
I want to subtract 60 from value whenever the country is portugal or UK; the dataframe should then look like:

country   value
portugal  26
germany   20
belgium   21
Uk        21
portugal  17
UK        27
IIUC, use isin on the lowercased country strings to check whether each value is in a reference list, then slice the dataframe with loc for in-place modification:
df.loc[df['country'].str.lower().isin(['portugal', 'uk']), 'value'] -= 60
Output:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
Use numpy.where:
In [1621]: import numpy as np
In [1622]: df['value'] = np.where(df['country'].str.lower().isin(['portugal', 'uk']), df['value'] - 60, df['value'])
In [1623]: df
Out[1623]:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
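Both one-liners above assume df already exists; a minimal end-to-end sketch of the loc approach, with the frame rebuilt from the example:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['portugal', 'germany', 'belgium', 'Uk', 'portugal', 'UK'],
    'value': [86, 20, 21, 81, 77, 87],
})

# Lower-case before matching so 'Uk' and 'UK' both hit the mask
mask = df['country'].str.lower().isin(['portugal', 'uk'])
df.loc[mask, 'value'] -= 60
```

The str.lower() step is what makes the mixed-case 'Uk'/'UK' rows match a single lowercase reference list.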
I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works, except when 'State' is NaN, 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
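On the NaN issue from the question ('Total' becomes NaN when 'State' is NaN): groupby drops NaN group keys by default, which is why those rows lose their cumulative sum. Since pandas 1.1 you can keep them with dropna=False; a small sketch under that assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical slice: country-level rows have State = NaN
df = pd.DataFrame({
    'State': [np.nan, np.nan, 'California', 'California'],
    'Country': ['Afghanistan', 'Afghanistan', 'USA', 'USA'],
    'Cases': [0, 5, 5, 10],
})

# dropna=False keeps the NaN-State rows in their own group
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()
```

With the default dropna=True, the two Afghanistan rows would get NaN in 'Total Cases'; with dropna=False they accumulate normally.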
I have one dataframe which is below-
0
____________________________________
0 Country| India
60 Delhi
62 Mumbai
68 Chennai
75 Country| Italy
78 Rome
80 Venice
85 Milan
88 Country| Australia
100 Sydney
103 Melbourne
107 Perth
I want to split the data into 2 columns so that one column holds the country and the other the city. I have no idea where to start. I want it like below:
0 1
____________________________________
0 Country| India Delhi
1 Country| India Mumbai
2 Country| India Chennai
3 Country| Italy Rome
4 Country| Italy Venice
5 Country| Italy Milan
6 Country| Australia Sydney
7 Country| Australia Melbourne
8 Country| Australia Perth
Any idea how to do this?
Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
df.rename(columns={"0": "city"})
# this looks for rows that contain '|' and puts them into a
# new column called Country. rows that do not match will be
# null in the new column.
.assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
# fill down on the Country column, this also has the benefit
# of linking the Country with the City,
.ffill()
# here we get rid of duplicate Country entries in city and Country
# this ensures that only Country entries are in the Country column
# and cities are in the City column
.query("city != Country")
# here we reverse the column positions to match your expected output
.iloc[:, ::-1]
)
Country city
60 Country| India Delhi
62 Country| India Mumbai
68 Country| India Chennai
78 Country| Italy Rome
80 Country| Italy Venice
85 Country| Italy Milan
100 Country| Australia Sydney
103 Country| Australia Melbourne
107 Country| Australia Perth
Use DataFrame.insert with Series.where and Series.str.startswith to replace non-matching values with missing values, ffill to forward-fill them, and then remove rows with the same value in both columns using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print (df)
country city
0 Country|India Delhi
1 Country|India Mumbai
2 Country|India Chennai
3 Country|Italy Rome
4 Country|Italy Venice
5 Country|Italy Milan
6 Country|Australia Sydney
7 Country|Australia Melbourne
8 Country|Australia Perth
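A self-contained run of the insert/where/ffill recipe on a cut-down version of the input (two countries, index positions as in the question):

```python
import pandas as pd

df = pd.DataFrame({0: ['Country| India', 'Delhi', 'Mumbai',
                       'Country| Italy', 'Rome', 'Venice']},
                  index=[0, 60, 62, 75, 78, 80])

# Keep only the 'Country|' header rows, forward-fill them down to the city rows
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
# Drop the header rows themselves (there, country equals the original cell)
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0: 'city'})
```

Series.where keeps only the cells that start with 'Country' and sets everything else to NaN, so ffill propagates each header down exactly until the next one.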
df1:
Asia 34
America 74
Australia 92
Africa 44
df2 :
Asia 24
Australia 90
Africa 30
I want the output of df1 - df2 to be
Asia 10
America 74
Australia 2
Africa 14
This is giving me trouble; I am a newbie to pandas. Please help out.
Use Series.sub with the second Series mapped by Series.map:
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print (df1)
A B
0 Asia 10.0
1 America 74.0
2 Australia 2.0
3 Africa 14.0
If a changed ordering of the first column is acceptable, convert both first columns to the index with DataFrame.set_index and subtract:
df2 = df1.set_index('A')['B'].sub(df2.set_index('A')['B'], fill_value=0).reset_index()
print (df2)
A B
0 Africa 14.0
1 America 74.0
2 Asia 10.0
3 Australia 2.0
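Both snippets assume the frames have columns A (continent) and B (value); a runnable sketch of the map-based approach under that assumption:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['Asia', 'America', 'Australia', 'Africa'],
                    'B': [34, 74, 92, 44]})
df2 = pd.DataFrame({'A': ['Asia', 'Australia', 'Africa'],
                    'B': [24, 90, 30]})

# Map df2's values onto df1's keys; fill_value=0 treats keys missing
# from df2 (here: America) as subtracting zero
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
```

Because America is absent from df2, the map yields NaN there, and fill_value=0 turns that into "subtract nothing" instead of propagating NaN; the result column becomes float for the same reason.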
There are a lot of questions out there with similar titles but I'm unable to solve the issues that I'm having with my dataset.
Dataset:
ID Country Type Region Gender IA01_Raw IA01_Class1 IA01_Class2 IA02_Raw IA02_Class1 IA02_Class2 QA_Include QA_Comments
SC1 France A Europe Male 4 8 1 J 4 1 yes N/A
SC2 France A Europe Female 2 7 2 Q 6 4 yes N/A
SC3 France B Europe Male 3 7 2 K 8 2 yes N/A
SC4 France A Europe Male 4 8 2 A 2 1 yes N/A
SC5 France B Europe Male 1 7 1 F 1 3 yes N/A
ID6 France A Europe Male 2 8 1 R 3 7 yes N/A
ID7 France B Europe Male 2 8 1 Q 4 6 yes N/A
UC8 France B Europe Male 4 8 2 P 4 2 yes N/A
Required output:
ID Country Type Region Gender IA Raw Class1 Class2 QA_Include QA_Comments
SC1 France A Europe Male 01 K 8 1 yes N/A
SC1 France A Europe Male 01 L 8 1 yes N/A
SC1 France A Europe Male 01 P 8 1 yes N/A
SC1 France A Europe Male 02 Q 8 1 yes N/A
SC1 France A Europe Male 02 R 8 1 yes N/A
SC1 France A Europe Male 02 T 8 1 yes N/A
SC1 France A Europe Male 03 G 8 1 yes N/A
SC1 France A Europe Male 03 R 8 1 yes N/A
SC1 France A Europe Male 03 G 8 1 yes N/A
SC1 France A Europe Male 04 K 8 1 yes N/A
SC1 France A Europe Male 04 A 8 1 yes N/A
SC1 France A Europe Male 04 P 8 1 yes N/A
SC1 France A Europe Male 05 R 8 1 yes N/A
....
In the Dataset I have columns named IA[X]_NAME where X = 1..9 and NAME = Raw, Class1 and Class2.
What I am trying to do is transpose these columns so the result looks like the table shown in Required output, i.e. IA shows the X value, and likewise Raw and the Class columns show their respective values.
So in order to achieve it I sliced the columns as:
idVars = list(excel_df_final.columns[0:40]) + list(excel_df_final.columns[472:527]) #These contain columns like ID, Country, Type etc
valueVars = excel_df_final.columns[41:472].tolist() #All the IA_ columns
I don't know if this step was necessary, but it gave me the right slices of columns; when I put them into melt, however, it does not work properly. I have tried almost every method available in other questions.
pd.melt(excel_df_final, id_vars=idVars,value_vars=valueVars)
I've also tried this:
excel_df_final.set_index(idVars)[41:472].unstack()
but that didn't work either, and here is the wide_to_long implementation, which also didn't work:
pd.wide_to_long(excel_df_final, stubnames = ['IA', 'Raw', 'Class1', 'Class2'], i=idVars, j=valueVars)
The error I got for wide_to_long is:
ValueError: operands could not be broadcast together with shapes (95,) (431,)
My real dataset has 526 columns; that is why I divided them into two lists: one contains the 95 column names that become i, and the remaining 431 are the ones I need to reshape into rows as shown in the sample dataset.
This will get you started. The essence is using set_index, column conversion to MultiIndex, then stack. Better solutions exist, possibly, but I would do it this way because it is an easy step to your output.
# Set the index with columns that we don't want to "transpose"
df2 = df.set_index([
'ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include', 'QA_Comments'])
# Convert headers to MultiIndex -- this is so we can melt IA values
df2.columns = pd.MultiIndex.from_tuples(map(tuple, df2.columns.str.split('_')))
# Call stack to replicate data, then reset the index
out = df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)
out
ID Country Type Region Gender QA_Include QA_Comments IA Class1 Class2 Raw
0 SC1 France A Europe Male yes NaN IA01 8 1 4
1 SC1 France A Europe Male yes NaN IA02 4 1 J
2 SC2 France A Europe Female yes NaN IA01 7 2 2
3 SC2 France A Europe Female yes NaN IA02 6 4 Q
4 SC3 France B Europe Male yes NaN IA01 7 2 3
5 SC3 France B Europe Male yes NaN IA02 8 2 K
6 SC4 France A Europe Male yes NaN IA01 8 2 4
7 SC4 France A Europe Male yes NaN IA02 2 1 A
8 SC5 France B Europe Male yes NaN IA01 7 1 1
9 SC5 France B Europe Male yes NaN IA02 1 3 F
10 ID6 France A Europe Male yes NaN IA01 8 1 2
11 ID6 France A Europe Male yes NaN IA02 3 7 R
12 ID7 France B Europe Male yes NaN IA01 8 1 2
13 ID7 France B Europe Male yes NaN IA02 4 6 Q
14 UC8 France B Europe Male yes NaN IA01 8 2 4
15 UC8 France B Europe Male yes NaN IA02 4 2 P
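A cut-down, runnable version of the same set_index → MultiIndex columns → stack recipe (two IDs, two IA blocks, fewer id columns for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['SC1', 'SC2'],
    'Country': ['France', 'France'],
    'IA01_Raw': [4, 2], 'IA01_Class1': [8, 7], 'IA01_Class2': [1, 2],
    'IA02_Raw': ['J', 'Q'], 'IA02_Class1': [4, 6], 'IA02_Class2': [1, 4],
})

df2 = df.set_index(['ID', 'Country'])
# Split 'IA01_Raw' into ('IA01', 'Raw') so each IA block is a column level
df2.columns = pd.MultiIndex.from_tuples(
    tuple(c.split('_')) for c in df2.columns)
# Stack the IA level down into rows; it appears as 'level_2' after reset_index
out = df2.stack(level=0).reset_index().rename({'level_2': 'IA'}, axis=1)
```

Each input row is replicated once per IA block, which is exactly the "transpose" the question asks for; with all 9 blocks the same code applies unchanged.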
You can use pd.lreshape:
pd.lreshape(df.assign(IA01=['01']*len(df), IA02=['02']*len(df),IA09=['09']*len(df)),
{'IA': ['IA01', 'IA02','IA09'],
'Raw': ['IA01_Raw','IA02_Raw','IA09_Raw'],
'Class1': ['IA01_Class1','IA02_Class1','IA09_Class1'],
'Class2': ['IA01_Class2', 'IA02_Class2','IA09_Class2']
})
Edit:
pd.lreshape(df.assign(IA01=['01']*len(df), IA02=['02']*len(df),IA09=['09']*len(df)),
{'IA': ['IA01', 'IA02','IA09'],
'Raw': ['IA01_Raw_baseline','IA02_Raw_midline','IA09_Raw_whatever'],
'Class1': ['IA01_Class1_baseline','IA02_Class1_midline','IA09_Class1_whatever'],
'Class2': ['IA01_Class2_baseline', 'IA02_Class2_midline','IA09_Class2_whatever']
})
Edit: just add the names of whichever input columns you want in the Raw/Class1/Class2 columns of the output to the corresponding list inside the dictionary.
Documentation for this is not available; use help(pd.lreshape) or refer here.
Output:
Country Gender ID QA_Comments QA_Include Region Type IA Raw Class1 Class2
0 France Male SC1 NaN yes Europe A 01 4 8 1
1 France Female SC2 NaN yes Europe A 01 2 7 2
2 France Male SC3 NaN yes Europe B 01 3 7 2
3 France Male SC4 NaN yes Europe A 01 4 8 2
4 France Male SC5 NaN yes Europe B 01 1 7 1
5 France Male ID6 NaN yes Europe A 01 2 8 1
6 France Male ID7 NaN yes Europe B 01 2 8 1
7 France Male UC8 NaN yes Europe B 01 4 8 2
8 France Male SC1 NaN yes Europe A 02 J 4 1
9 France Female SC2 NaN yes Europe A 02 Q 6 4
10 France Male SC3 NaN yes Europe B 02 K 8 2
11 France Male SC4 NaN yes Europe A 02 A 2 1
12 France Male SC5 NaN yes Europe B 02 F 1 3
13 France Male ID6 NaN yes Europe A 02 R 3 7
14 France Male ID7 NaN yes Europe B 02 Q 4 6
15 France Male UC8 NaN yes Europe B 02 P 4 2
16 France Male SC1 NaN yes Europe A 09 W 6 3
17 France Female SC2 NaN yes Europe A 09 X 5 2
18 France Male SC3 NaN yes Europe B 09 Y 5 5
19 France Male SC4 NaN yes Europe A 09 P 5 2
20 France Male SC5 NaN yes Europe B 09 T 5 2
21 France Male ID6 NaN yes Europe A 09 I 5 2
22 France Male ID7 NaN yes Europe B 09 A 8 2
23 France Male UC8 NaN yes Europe B 09 K 7 5