My dataset is called df:
year  french  flemish
2014     200      200
2015     170      210
2016     130      220
2017     120      225
2018     210      250
I want to create a histogram in seaborn with french and flemish on the x-axis and year as the hue.
I tried this, but it didn't work:
sns.histplot(data=df, x="french", hue="year", multiple="dodge", shrink=.8)
The height of each bar on the y-axis should be the value in the french or flemish column.
You need to use a bar function, not a histogram function. Histogram functions take raw data and count them, but your data are already counted.
You need to melt the french and flemish columns into "long form." Then x will be the language, and y will be the counts.
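import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(data=df.melt("year", var_name="language", value_name="count"),
            x="language",
            y="count",
            hue="year")
plt.legend(loc=(1.05, 0))  # place the legend outside the axes, to the right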
The melted dataframe for reference:
>>> df.melt("year", var_name="language", value_name="count")
#    year language  count
# 0  2014   french    200
# 1  2015   french    170
# 2  2016   french    130
# 3  2017   french    120
# 4  2018   french    210
# 5  2014  flemish    200
# 6  2015  flemish    210
# 7  2016  flemish    220
# 8  2017  flemish    225
# 9  2018  flemish    250
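For anyone who wants to reproduce the reshaping step on its own, here is a minimal sketch that rebuilds df from the table in the question and melts it. The name long_df is just an illustrative choice, and the plotting call is left as a comment so the snippet runs without seaborn installed.

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    "year": [2014, 2015, 2016, 2017, 2018],
    "french": [200, 170, 130, 120, 210],
    "flemish": [200, 210, 220, 225, 250],
})

# Long form: one row per (year, language) pair, which is what a bar plot needs
long_df = df.melt("year", var_name="language", value_name="count")
print(long_df)

# With seaborn available, the long frame can then be plotted like this:
# sns.barplot(data=long_df, x="language", y="count", hue="year")
```

Each row of long_df becomes exactly one bar, so no counting or averaging takes place.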
I want to replace NA values with the mean of the same column across the other years.
Note: To replace NA values in the Canada rows, I want to use only the mean for Canada, not the mean of the whole dataset.
Here's a sample dataframe filled with made-up numbers, with some NA values placed the way I find them in my real dataframe:
Country  Inhabitants  Year  Area  Cats  Dogs
Canada   38 000 000   2021  4     32    21
Canada   37 000 000   2020  4     NA    21
Canada   36 000 000   2019  3     32    21
Canada   NA           2018  2     32    21
Canada   34 000 000   2017  NA    32    21
Canada   35 000 000   2016  3     32    NA
Brazil   212 000 000  2021  5     32    21
Brazil   211 000 000  2020  4     NA    21
Brazil   210 000 000  2019  NA    32    21
Brazil   209 000 000  2018  4     32    21
Brazil   NA           2017  2     32    21
Brazil   207 000 000  2016  4     32    NA
What's the easiest way in pandas to replace those NA values with the mean of the other years? And is it possible to write code that goes through every NA and replaces it in all columns (Inhabitants, Area, Cats, Dogs) at once?
Note: This example is based on the additional data source you linked in the comments.
To replace the NA values in multiple columns with the mean, you can combine the following three methods:
fillna() (to fill column by column, axis should be 0, which is the default for fillna())
groupby()
transform()
Create a data frame from your example:
import pandas as pd

df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             nan             71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
Call fillna() with the per-country mean of every column, obtained by grouping on the country name:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             0.93547         71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
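The same idea can be tried without downloading the spreadsheet. The sketch below is only an illustration: it uses the Country, Year and Area values from the sample table in the question, and the names sample, value_cols, group_means and filled are made up for this example.

```python
import numpy as np
import pandas as pd

# Area values copied from the question's sample table (NA -> np.nan)
sample = pd.DataFrame({
    "Country": ["Canada"] * 6 + ["Brazil"] * 6,
    "Year": [2021, 2020, 2019, 2018, 2017, 2016] * 2,
    "Area": [4, 4, 3, 2, np.nan, 3, 5, 4, np.nan, 4, 2, 4],
})

value_cols = ["Area"]  # list every column that should be filled

# transform("mean") returns a frame of the same shape as the input,
# holding each row's own country mean (NaN values are skipped)
group_means = sample.groupby("Country")[value_cols].transform("mean")

filled = sample.copy()
filled[value_cols] = sample[value_cols].fillna(group_means)
print(filled)
```

Restricting the fill to value_cols keeps the Year column out of the averaging; with more columns to fill, only the list needs to grow.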
This also works:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   36
Social support                       13
Healthy life expectancy at birth     55
Freedom to make life choices         32
Generosity                           89
Perceptions of corruption           110
Positive affect                      22
Negative affect                      16
dtype: int64
SOLUTION
In [4]:
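# Fill NULL values with the overall mean of each numeric column
# (note: this is the mean over all countries, not the per-country mean)
df.fillna(df.mean(numeric_only=True), inplace=True)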
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                    0
Social support                        0
Healthy life expectancy at birth      0
Freedom to make life choices          0
Generosity                            0
Perceptions of corruption             0
Positive affect                       0
Negative affect                       0
dtype: int64
Background
I have five years of NO2 measurement data in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
         Date Hour Location  NO2_Level
0  01/01/2016   00   Street         18
1  01/01/2016   01   Street         39
2  01/01/2016   02   Street        129
3  01/01/2016   03   Street         76
4  01/01/2016   04   Street         40
Goal
For each dataframe, count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible but unlikely)
- for a year when I know the count should be a positive number, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] > 150].count()
These two approaches have come closest to the output I want.
Example to test on
import pandas as pd

data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'],
        'Hour': ['00', '01', '02', '03', '04'],
        'Location': ['Street', 'Street', 'Street', 'Street', 'Street'],
        'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to output a single line for each dataframe, in the format Location,year,count (of the condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce (no reading in it exceeds 150):
Street,2016,0
Actual
Every location in a given year produces the same result, and for some years (2014) the count doesn't seem to work at all, even though on inspection there should be some exceedances:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02'],
    'Hour': ['00', '01', '02', '03', '04', '02'],
    'Location': ['Street', 'Street', 'Street', 'Street', 'Street', 'Street'],
    'NO2_Level': [19, 39, 129, 76, 40, 151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].dt.strftime('%Y')
# Group by location and year, counting the rows where NO2_Level > 150
df1 = df[df['NO2_Level'] > 150].groupby(['Location', 'Year']).size().reset_index(name='Count')
# Iterate over the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')
# Alternative without f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
        Date Hour Location  NO2_Level
0 2016-01-01   00   Street         19
1 2016-01-01   01   Street         39
2 2016-01-01   02   Street        129
3 2016-01-01   03   Street         76
4 2016-01-01   04   Street         40
5 2016-01-02   02   Street        151
Output:
Street,2016,1
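One thing to be aware of: filtering before grouping drops any location and year with no readings above the threshold, so those never get an output line. If zero counts should be reported too, a possible variant is sketched below. It assumes the per-file dataframes are collected in a list and that dates are in the dd/mm/yyyy text format shown in the question; count_exceedances, street and lane are hypothetical names used only for illustration.

```python
import pandas as pd

def count_exceedances(frames, threshold=150):
    """Count readings above `threshold` per location and year, keeping zeros."""
    combined = pd.concat(frames, ignore_index=True)
    # Assumes day/month/year text dates; strip stray whitespace first
    dates = pd.to_datetime(combined["Date"].str.strip(), format="%d/%m/%Y")
    combined["Year"] = dates.dt.year
    # Summing a boolean column counts the True values, and groups
    # without any exceedance still appear with a count of 0
    combined["Exceeds"] = combined["NO2_Level"] > threshold
    return (combined.groupby(["Location", "Year"])["Exceeds"]
                    .sum()
                    .reset_index(name="Count"))

# Two small made-up frames standing in for two loaded files
street = pd.DataFrame({"Date": ["01/01/2016", "02/01/2016", "03/01/2016"],
                       "Hour": ["00", "01", "02"],
                       "Location": ["Street"] * 3,
                       "NO2_Level": [18, 151, 160]})
lane = pd.DataFrame({"Date": ["01/01/2016", "02/01/2016", "03/01/2016"],
                     "Hour": ["00", "01", "02"],
                     "Location": ["Lane"] * 3,
                     "NO2_Level": [10, 20, 30]})

counts = count_exceedances([street, lane])
for row in counts.itertuples(index=False):
    print(f"{row.Location},{row.Year},{row.Count}")
```

If every location still reports the same count after a change like this, it may be worth checking that the loading loop passes a different dataframe on each iteration.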
Here is a solution using a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    # Draw n random timestamps between start and end (second resolution)
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})
# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)
# Count, per location and year, the readings above 150 (zero counts are kept)
df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3
This is my dataset.
    Country        Type                 Disaster Count
0   CHINA P REP    Industrial Accident  415
1   CHINA P REP    Transport Accident   231
2   CHINA P REP    Flood                175
3   INDIA          Transport Accident   425
4   INDIA          Flood                206
5   INDIA          Storm                121
6   UNITED STATES  Storm                348
7   UNITED STATES  Transport Accident   159
8   UNITED STATES  Flood                92
9   PHILIPPINES    Storm                249
10  PHILIPPINES    Transport Accident   84
11  PHILIPPINES    Flood                71
12  INDONESIA      Transport Accident   136
13  INDONESIA      Flood                110
14  INDONESIA      Seismic Activity     77
I would like to make a grouped (triple) bar chart in which the bars are labelled by the 'Type' column and grouped by the 'Country' column.
I have tried using (with df as the DataFrame object of the pandas library),
df.groupby('Country').plot.bar()
but the result came out as multiple bar charts representing each group in the 'Country' column.
The expected output is similar to this:
What code do I need to run to produce this graph?
There are two ways:
df.set_index('Country').pivot(columns='Type').plot.bar()
df.set_index(['Country','Type']).plot.bar()
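To see what the pivot in the first approach hands to the plotting call, here is a small sketch on the first two countries of the question's data. It spells the pivot out with explicit index and values arguments, which is an equivalent formulation chosen for readability; the name wide is illustrative, and the plotting call is left as a comment.

```python
import pandas as pd

# First two countries from the question's table
df = pd.DataFrame({
    "Country": ["CHINA P REP"] * 3 + ["INDIA"] * 3,
    "Type": ["Industrial Accident", "Transport Accident", "Flood",
             "Transport Accident", "Flood", "Storm"],
    "Disaster Count": [415, 231, 175, 425, 206, 121],
})

# One row per country, one column per disaster type;
# combinations missing from the data become NaN
wide = df.pivot(index="Country", columns="Type", values="Disaster Count")
print(wide)

# wide.plot.bar() would then draw one group of bars per country,
# with one bar (and one legend entry) per disaster type
```

Because each country lists a different set of types, the NaN cells simply show up as missing bars within that country's group.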
I'm trying to sum data from multiple columns in my dataframe by pivoting the table and using aggfunc. My dataframe gives emission data for various regions. I don't want to sum some rows so I make a selection of the rows that I want to sum. The output however is two rows for each column:
one is named True and gives the sum of the rows that I defined (this is the column that I want)
the other is named False and gives the sum of the remainder of the rows that I did not define (this one I would like to drop/omit)
The data is numeric regional data for multiple years so what I want to do is add data from some regions in order to get data for larger regions. The years are listed in columns.
The data looks something like this:
inp = [{'Scenario':'Baseline', 'Region':'CHINA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':5,'1995':10,'2000':15},
{'Scenario':'Baseline', 'Region':'INDIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':6,'1995':11,'2000':16},
{'Scenario':'Baseline', 'Region':'INDONESIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':7,'1995':12,'2000':17},
{'Scenario':'Baseline', 'Region':'KOREA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':8,'1995':13,'2000':18},
{'Scenario':'Baseline', 'Region':'JAPAN', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':9,'1995':14,'2000':19},
{'Scenario':'Baseline', 'Region':'THAILAND', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':10,'1995':15,'2000':20},
{'Scenario':'Baseline', 'Region':'RUSSIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':11,'1995':16,'2000':21}]
dt = pd.DataFrame(inp)
dt
1990 1995 2000 Region Scenario Unit Variable
0 5 10 15 CHINA Baseline MtCO2eq Methane
1 6 11 16 INDIA Baseline MtCO2eq Methane
2 7 12 17 INDONESIA Baseline MtCO2eq Methane
3 8 13 18 KOREA Baseline MtCO2eq Methane
4 9 14 19 JAPAN Baseline MtCO2eq Methane
5 10 15 20 THAILAND Baseline MtCO2eq Methane
6 11 16 21 RUSSIA Baseline MtCO2eq Methane
I run this piece of code:
dt_test = dt.pivot_table(dt, index=['Scenario', 'Variable', 'Unit'],
                         columns=[(dt['Region'] == 'CHINA') |
                                  (dt['Region'] == 'INDIA') |
                                  (dt['Region'] == 'INDONESIA') |
                                  (dt['Region'] == 'KOREA')],
                         aggfunc=np.sum)
And get this as output:
                            1990        1995        2000
Region                     False True  False True  False True
Scenario Variable Unit
Baseline Methane  MtCO2eq     30   26     45   46     60   66
If someone could help me out with either a way to drop this False column for all the years or another nifty way to get the totals that I want that would be amazing.
Use xs:
print (dt_test.xs(True, axis=1, level=1))
1990 1995 2000
Scenario Variable Unit
Baseline Methane MtCO2eq 26 46 66
But it is better to filter first with isin and boolean indexing:
dt = dt[dt['Region'].isin(['CHINA','INDIA','INDONESIA','KOREA'])]
print (dt)
   1990  1995  2000     Region  Scenario     Unit Variable
0     5    10    15      CHINA  Baseline  MtCO2eq  Methane
1     6    11    16      INDIA  Baseline  MtCO2eq  Methane
2     7    12    17  INDONESIA  Baseline  MtCO2eq  Methane
3     8    13    18      KOREA  Baseline  MtCO2eq  Methane
And then aggregate the sum per group (numeric_only=True keeps the text column Region out of the sum):
dt_test = dt.groupby(['Scenario','Variable','Unit']).sum(numeric_only=True)
print (dt_test)
                           1990  1995  2000
Scenario Variable Unit
Baseline Methane  MtCO2eq    26    46    66
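For reference, the filter-then-aggregate route can be run end to end on the sample data from the question. This is only a sketch; year_cols, selected and totals are names introduced here, and selecting the year columns explicitly is one way to keep the Region text column out of the sum.

```python
import pandas as pd

regions = ["CHINA", "INDIA", "INDONESIA", "KOREA", "JAPAN", "THAILAND", "RUSSIA"]
dt = pd.DataFrame({
    "Scenario": ["Baseline"] * 7,
    "Region": regions,
    "Variable": ["Methane"] * 7,
    "Unit": ["MtCO2eq"] * 7,
    "1990": [5, 6, 7, 8, 9, 10, 11],
    "1995": [10, 11, 12, 13, 14, 15, 16],
    "2000": [15, 16, 17, 18, 19, 20, 21],
})

year_cols = ["1990", "1995", "2000"]

# Keep only the regions that make up the larger region
selected = dt[dt["Region"].isin(["CHINA", "INDIA", "INDONESIA", "KOREA"])]

# Sum only the year columns within each Scenario/Variable/Unit group
totals = selected.groupby(["Scenario", "Variable", "Unit"])[year_cols].sum()
print(totals)
```

Since the unwanted rows are removed before aggregating, no False column is ever created.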
I want a grouped bar chart, but the default plot doesn't have the groupings the way I'd like, and I'm struggling to get them rearranged properly.
The dataframe looks like this:
user year cat1 cat2 cat3 cat4 cat5
0 Brad 2014 309 186 119 702 73
1 Brad 2015 280 177 100 625 75
2 Brad 2016 306 148 127 671 74
3 Brian 2014 298 182 131 702 73
4 Brian 2015 295 125 117 607 76
5 Brian 2016 298 137 97 596 75
6 Chris 2014 309 171 111 654 72
7 Chris 2015 251 146 105 559 76
8 Chris 2016 231 130 105 526 75
etc
Elsewhere, the code produces two variables, user1 and user2. I want to produce a bar chart that compares the numbers for those two users over time in cat1, cat2, and cat3. So for example if user1 and user2 were Brian and Chris, I would want a chart that looks something like this:
On an aesthetic note: I'd prefer the year labels be vertical text or a font size that fits on a single line, but it's really the dataframe pivot that's confusing me at the moment.
Select the subset of users you want to compare. Then use pivot_table, followed by a transpose and unstack, to reshape the DF into the layout needed for plotting.
import matplotlib.pyplot as plt

def select_user_plot(user_1, user_2, cats, frame, idx, col):
    # Keep only the rows belonging to the two chosen users
    frame = frame[(frame[idx[0]] == user_1) | (frame[idx[0]] == user_2)]
    # Rows become the categories, columns become (user, year) pairs
    frame_pivot = frame.pivot_table(index=idx, columns=col, values=cats).T.unstack()
    frame_pivot.plot.bar(legend=True, cmap=plt.get_cmap('RdYlGn'), figsize=(8,8), rot=0)
Finally, choose the users and categories to be included in the bar plot:
user_1 = 'Brian'
user_2 = 'Chris'
cats = ['cat1', 'cat2', 'cat3']
select_user_plot(user_1, user_2, cats, frame=df, idx=['user'], col=['year'])
Note: This gives a plot close to the one the OP posted, except that the years appear in the legend instead of as tick labels.
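To make the reshaping step easier to follow, the sketch below runs the same pivot on the rows shown in the question and prints the result instead of plotting it. The names users, subset and wide are illustrative, and isin is used here as an equivalent way to pick the two users.

```python
import pandas as pd

# Rows for cat1..cat3 copied from the question's table
df = pd.DataFrame({
    "user": ["Brad"] * 3 + ["Brian"] * 3 + ["Chris"] * 3,
    "year": [2014, 2015, 2016] * 3,
    "cat1": [309, 280, 306, 298, 295, 298, 309, 251, 231],
    "cat2": [186, 177, 148, 182, 125, 137, 171, 146, 130],
    "cat3": [119, 100, 127, 131, 117, 97, 111, 105, 105],
})

users = ["Brian", "Chris"]
cats = ["cat1", "cat2", "cat3"]

subset = df[df["user"].isin(users)]

# pivot_table gives (category, year) columns per user; transposing and
# unstacking moves the categories to the rows and leaves (user, year)
# pairs as the columns
wide = subset.pivot_table(index="user", columns="year", values=cats).T.unstack()
print(wide)
```

Each row of wide is one group on the x-axis and each (user, year) column is one bar in that group, which is why the years end up in the legend.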