I want to create a bar chart with a focus on two cities. My data set is similar to this.
city rate Bedrooms
Houston 132.768382 0
Dallas 151.981043 1
Dallas 112.897727 3
Houston 132.332665 1
Houston 232.611185 2
Dallas 93.530662 4
I've split the data into two dataframes, one for Dallas (dal) and one for Houston (hou), like this:
dal.groupby('Bedrooms')['rate'].mean().plot(kind='bar')
and
hou.groupby('Bedrooms')['rate'].mean().plot(kind='bar')
How would I go about making a bar chart that shows the average rate of listings by number of bedrooms, with the two cities side by side? Something similar to the image below, which I found in "Python matplotlib multiple bars", with the cities as the labels.
I'd appreciate any help!
Seaborn is your friend in this case. First create a grouped dataframe with the average rate per city and number of bedrooms, and then plot it with seaborn:
import seaborn as sns
dal_group = dal.groupby(['city' , 'Bedrooms']).agg({'rate': 'mean'}).reset_index()
sns.barplot(data=dal_group, x='Bedrooms', y='rate', hue='city')
with the data above, it will produce this plot:
Here's a basic way to do it in matplotlib:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(5) # bedroom counts 0-4, if the max. number of bedrooms is 4
# reindex so both series cover every bedroom count; a missing combination gets a zero-height bar
data_dallas = dal.groupby('Bedrooms')['rate'].mean().reindex(x, fill_value=0)
data_houston = hou.groupby('Bedrooms')['rate'].mean().reindex(x, fill_value=0)
fig, ax = plt.subplots()
width = 0.35 # width of one bar
dal_bars = ax.bar(x, data_dallas, width)
hou_bars = ax.bar(x + width, data_houston, width)
ax.set_xticks(x + width / 2) # centre each tick between the two bars
ax.set_xticklabels(x)
ax.legend((dal_bars[0], hou_bars[0]), ('Dallas', 'Houston'))
plt.show()
There is an easy solution using pandas alone, which plots in one line once you rearrange the data, or using plotly.
Data
import pandas as pd
df = pd.DataFrame({'city': {0: 'Houston',
1: 'Dallas',
2: 'Dallas',
3: 'Houston',
4: 'Houston',
5: 'Dallas'},
'rate': {0: 132.768382,
1: 151.981043,
2: 112.897727,
3: 132.332665,
4: 232.611185,
5: 93.530662},
'Bedrooms': {0: 0, 1: 1, 2: 3, 3: 1, 4: 2, 5: 4}})
# groupby
df = df.groupby(["city", "Bedrooms"])["rate"].mean().reset_index()
Pandas - Matplotlib
With pivot_table we can rearrange our data
pv = pd.pivot_table(df,
index="Bedrooms",
columns="city",
values="rate")
city Dallas Houston
Bedrooms
0 NaN 132.768382
1 151.981043 132.332665
2 NaN 232.611185
3 112.897727 NaN
4 93.530662 NaN
And then plot in one line only.
pv.plot(kind="bar");
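If it helps to see the rearranging step on its own, here is a minimal sketch that builds the same wide table with groupby and unstack instead of pivot_table. It only uses the six sample rows from the question; the variable names raw and wide are mine and not part of the answer above.

```python
import pandas as pd

# the six sample rows from the question
raw = pd.DataFrame({
    "city": ["Houston", "Dallas", "Dallas", "Houston", "Houston", "Dallas"],
    "rate": [132.768382, 151.981043, 112.897727, 132.332665, 232.611185, 93.530662],
    "Bedrooms": [0, 1, 3, 1, 2, 4],
})

# mean rate per (Bedrooms, city), then move the city level into the columns
wide = raw.groupby(["Bedrooms", "city"])["rate"].mean().unstack("city")
print(wide)

# wide.plot(kind="bar") would then draw one group of bars per bedroom count
```

Combinations that do not occur in the data (for example Dallas with 0 bedrooms) come out as NaN, so no bar is drawn for them.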
Using Plotly
import plotly.express as px
px.bar(df, x="Bedrooms", y="rate", color="city",barmode='group')
You can read more here: https://pythonspot.com/matplotlib-bar-chart/
import numpy as np
import matplotlib.pyplot as plt
# data to plot
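n_groups = 4 # number of bedroom categories, matching the four tick labels below
# average rate per bedroom count, computed from the asker's hou and dal dataframes;
# reindex fills bedroom counts a city lacks with 0
mean_rates_houston = hou.groupby('Bedrooms')['rate'].mean().reindex(range(n_groups), fill_value=0)
mean_rates_dallas = dal.groupby('Bedrooms')['rate'].mean().reindex(range(n_groups), fill_value=0)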
# create plot
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8
rects1 = plt.bar(index, mean_rates_dallas, bar_width,
alpha=opacity,
color='b',
label='Dallas')
rects2 = plt.bar(index + bar_width, mean_rates_houston, bar_width,
alpha=opacity,
color='g',
label='Houston')
plt.xlabel('Bedrooms')
plt.ylabel('Rates')
plt.title('Bedroom Rates per City')
# whatever the number of bedrooms in your dataset might be: change plt.xticks
plt.xticks(index + bar_width / 2, ('0', '1', '2', '3'))
plt.legend()
plt.tight_layout()
plt.show()
New here so putting hyperlinks. My dataframe looks like this.
HR ICULOS SepsisLabel PatientID
100.3 1 0 1
117.0 2 0 1
103.9 3 0 1
104.7 4 0 1
102.0 5 0 1
88.1 6 0 1
The whole file is available here. I want to add a marker on the HR graph based on SepsisLabel (see the file). E.g., at ICULOS = 249, SepsisLabel changes from 0 to 1, and I want to show on the graph that the label changed at this point. I was able to calculate the position using this code:
mark = dummy.loc[dummy['SepsisLabel'] == 1, 'ICULOS'].iloc[0]
print("The ICULOS where SepsisLabel changes from 0 to 1 is:", mark)
Output: The ICULOS where SepsisLabel changes from 0 to 1 is: 249
I plotted the graph using this code:
plt.figure(figsize=(15,6))
ax = plt.gca()
ax.set_title("Patient ID = 1")
ax.set_xlabel('ICULOS')
ax.set_ylabel('HR Readings')
sns.lineplot(ax=ax,
x="ICULOS",
y="HR",
data=dummy,
marker = '^',
markersize=5,
markeredgewidth=1,
markeredgecolor='black',
markevery=mark)
plt.show()
This is what I got: Graph. The marker was supposed to be at position 249 only, but it also appears at position 0. Why is this happening? Can someone help me out?
Thanks.
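Working with markevery can be tricky in this case, as it strongly depends on there being exactly one entry for each patient and each ICULOS. Note also that an integer markevery=N marks every N-th point starting from the first one, which is why a marker also shows up at the start of your line.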
Here is an alternative approach, using an explicit scatter plot to draw the marker:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'HR': np.random.randn(200).cumsum() + 60,
'ICULOS': np.tile(np.arange(1, 101), 2),
'SepsisLabel': np.random.binomial(2, 0.05, 200),
'PatientID': np.repeat([1, 2], 100)})
for patient_id in [1, 2]:
    dummy = df[df['PatientID'] == patient_id]
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.set_title(f"Patient ID = {patient_id}")
    ax.set_xlabel('ICULOS')
    ax.set_ylabel('HR Readings')
    sns.lineplot(ax=ax,
                 x="ICULOS",
                 y="HR",
                 data=dummy)
    x = dummy[dummy['SepsisLabel'] == 1]["ICULOS"].values[0]
    y = dummy[dummy['SepsisLabel'] == 1]["HR"].values[0]
    ax.scatter(x=x,
               y=y,
               marker='^',
               s=5,
               linewidth=1,
               edgecolor='black')
    ax.text(x, y, str(x) + '\n', ha='center', va='center', color='red')
    plt.show()
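If you would rather stay with markevery, a list of positional indices marks only those points. Below is a minimal sketch with plain matplotlib, assuming one row per ICULOS; the toy dataframe (300 rows, label switching at ICULOS 249) is made up for illustration and is not the asker's file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy stand-in for one patient: SepsisLabel switches to 1 at ICULOS 249
rng = np.random.default_rng(0)
iculos = np.arange(1, 301)
dummy = pd.DataFrame({"ICULOS": iculos,
                      "HR": 90 + rng.normal(0, 1, iculos.size).cumsum(),
                      "SepsisLabel": (iculos >= 249).astype(int)})

# markevery wants the position in the plotted data, not the ICULOS value
pos = int(np.flatnonzero(dummy["SepsisLabel"].to_numpy() == 1)[0])

fig, ax = plt.subplots(figsize=(15, 6))
(line,) = ax.plot(dummy["ICULOS"], dummy["HR"],
                  marker="^", markersize=5,
                  markeredgewidth=1, markeredgecolor="black",
                  markevery=[pos])  # a list marks only the listed points
ax.set_xlabel("ICULOS")
ax.set_ylabel("HR Readings")
```

Because ICULOS starts at 1, the point with ICULOS 249 sits at position 248, which is another reason passing the ICULOS value straight to markevery does not give the intended result.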
For your new question, here is an example of how to convert the 'ICULOS' column to pandas dates. The example uses the date 20210101 to correspond with ICULOS == 1. You probably have a different starting date for each patient.
df_fb = pd.DataFrame()
df_fb['Y'] = df['HR']
df_fb['DS'] = pd.to_datetime('20210101') + pd.to_timedelta(df['ICULOS'] - 1, unit='D')
I would like to scatter plot a number of stations from a txt file to a map, using cartopy:
def ReadData(FileName,typee,delimee):
    return np.genfromtxt(FileName, dtype=typee, delimiter=delimee, encoding='latin-1')
MyTypes = ("|U11","float","float","float","|U1","|U2","|U1","|U29")
MyDelimiters = [11,9,10,7,1,2,1,29] # station ID, lat, lon (-180 to 180), elevation (m), blank, Country code, blank, Name
RawData = ReadData('stations.txt',MyTypes,MyDelimiters)
stations.txt:
01001099999 70.9330 -8.6670 9.0 NO JAN MAYEN(NOR-NAVY) 0100109
01001599999 61.3830 5.8670 327.0 NO BRINGELAND 0100159
01003099999 77.0000 15.5000 12.0 NO HORNSUND 0100309
01008099999 78.2460 15.4660 26.8 SV LONGYEAR 0100809
01010099999 69.2930 16.1440 13.1 NO ANDOYA 0101009
2nd column represents the latitudes, 3rd column the longitudes, 4th column the elevation.
StationListID = np.array(RawData['f0'])
StationListLat = np.array(RawData['f1'])
StationListLon = np.array(RawData['f2'])
StationListElev = np.array(RawData['f3'])
I use:
import matplotlib.pyplot as plt
import cartopy.crs as crs
plt.scatter(x=StationListLon, y=StationListLat,
color="dodgerblue",
s=1,
alpha=0.5,
transform=crs.PlateCarree())
If the elevation < 0, I would like to have black dots, for > 5 green, for > 10 red and for > 15 blue dots. Where do I set the if conditions or group the lines?
Modules and data:
import numpy as np
import pandas as pd
import io
RawData = pd.read_csv(io.StringIO("""
x f1 f2 f3 stations
01001099999 70.9330 -8.6670 9.0 NOJANMAYEN(NOR-NAVY)
01001599999 61.3830 5.8670 327.0 NOBRINGELAND
01003099999 77.0000 15.5000 12.0 NOHORNSUND
01008099999 78.2460 15.4660 26.8 SVLONGYEAR
01010099999 69.2930 16.1440 13.1 NOANDOYA
"""), sep=r"\s+", engine="python")
StationListLat = np.array(RawData['f1'])
StationListLon = np.array(RawData['f2'])
StationListElev = np.array(RawData['f3'])
You could first make labels that signify the colors using pd.cut.
color_labels = ['black', 'yellow', 'green', 'red', 'blue']
cut_bins = [-500, 0, 5, 10, 15, 500]
RawData['colors'] = pd.cut(RawData['f3'], bins=cut_bins, labels=color_labels)
Then you could use these labels to set the colors of the dots. Note that you did not specify a color for values between 0 and 5, so I just gave them the color yellow.
As you see, I left the crs part out, if I am not mistaken it is not directly relevant for this problem.
import matplotlib.pyplot as plt
plt.scatter(x=StationListLon, y=StationListLat,
color=RawData['colors'],
s=20,
alpha=0.5)
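To check what pd.cut assigns before plotting, here is a small sketch using the five elevations from the question and the same bins and labels as above. The names elev, colors, default_edges and left_closed are mine, and the two boundary values (0.0 and 5.0) are added only to show how the bin edges behave.

```python
import pandas as pd

color_labels = ['black', 'yellow', 'green', 'red', 'blue']
cut_bins = [-500, 0, 5, 10, 15, 500]

# the five elevations from the sample stations
elev = pd.Series([9.0, 327.0, 12.0, 26.8, 13.1], name='f3')
colors = pd.cut(elev, bins=cut_bins, labels=color_labels).astype(str).tolist()
print(colors)

# by default the bins are closed on the right, e.g. (-500, 0] and (0, 5]
edges = pd.Series([0.0, 5.0])
default_edges = pd.cut(edges, bins=cut_bins, labels=color_labels).astype(str).tolist()
# right=False closes them on the left instead, e.g. [0, 5) and [5, 10)
left_closed = pd.cut(edges, bins=cut_bins, labels=color_labels, right=False).astype(str).tolist()
print(default_edges, left_closed)
```

So with the default settings an elevation of exactly 0 is still black; if black should apply strictly below 0, right=False is probably the closer match.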
I want to add color legend like this:
Green - Tier 1
Gold - Tier 2
Silver - Tier 3
Chocolate - Tier 4.
The values 'Tier 1', 'Tier 2', 'Tier 3', 'Tier 4' are based on another column called RFM['Tier'].
plt.style.use('classic')
#Scatter plot monetary and recency
Color = ['green','gold','silver','chocolate']
RFM['Color']= RFM['Cluster'].map(lambda p: Color[p])
ax = RFM.plot(
kind='scatter',
x='Monetary', y='Recency',
figsize=(10,8),
c= RFM['Color']
)
ax.set_title('Monetary and Recency Distribution',color='darkslategray')
Seaborn's scatterplot would automatically create a legend. The hue= tells which column to use for coloring, the palette= parameter tells the colors to use (either a list or a colormap). For non-numeric data, hue_order= can fix a certain ordering.
Here is an example with toy data:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.style.use('classic')
N = 100
RFM = pd.DataFrame({'Tier': np.random.randint(1, 5, N),
'Monetary': np.random.uniform(1, 100, N),
'Recency': np.random.uniform(1, 500, N)})
color_palette = ['green', 'gold', 'silver', 'chocolate']
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(data=RFM, x='Monetary', y='Recency', hue='Tier', palette=color_palette)
ax.set_title('Monetary and Recency Distribution', color='darkslategray')
plt.show()
PS: To change the legend labels, ax.legend can be called explicitly, e.g.:
ax.legend([f'Tier {i}' for i in range(1, 5)])
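If you prefer to keep your own color mapping rather than switch to seaborn, a legend can also be built by hand from proxy handles. This is only a sketch with toy data: it assumes cluster 0 corresponds to Tier 1, cluster 1 to Tier 2 and so on, which is my guess and not something stated in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import numpy as np
import pandas as pd

# toy data standing in for the RFM dataframe
rng = np.random.default_rng(0)
N = 100
RFM = pd.DataFrame({'Cluster': rng.integers(0, 4, N),
                    'Monetary': rng.uniform(1, 100, N),
                    'Recency': rng.uniform(1, 500, N)})
Color = ['green', 'gold', 'silver', 'chocolate']
RFM['Color'] = RFM['Cluster'].map(lambda p: Color[p])

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(RFM['Monetary'], RFM['Recency'], c=RFM['Color'].tolist())

# one invisible-line proxy per color, used only to populate the legend
handles = [Line2D([], [], marker='o', linestyle='', color=c, label=f'Tier {i}')
           for i, c in enumerate(Color, start=1)]
legend = ax.legend(handles=handles, title='Tier')
legend_labels = [t.get_text() for t in legend.get_texts()]
ax.set_title('Monetary and Recency Distribution', color='darkslategray')
```

The proxies carry the color and the label, so the legend stays correct even if one tier happens to have no points.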
I have two issues with a code (see below).
1) I have two bars and I would like each bar to be coloured with a different colormap (GnBu for 'male' and RdPu for 'female'). I tried creating a list 'colors' including the two colormaps, but it didn't work very well.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(([[3.31, 3.19, 10.17, 7.69, 8.00, 20.83, 16.25, 30.00, 10.00, 38.00], [3.34, 5.13, 5.21, 4.56, 8.94, 7.75, 3.82, 1.75, 0, 0]]), columns=['18-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-64','65-70'], index=['male','female'])
color1 = plt.cm.GnBu(np.linspace(0, 1, 10))
color2 = plt.cm.RdPu(np.linspace(0, 1, 10))
colors=(color1,color2)
df.plot(kind='bar', stacked=True, width = 0.99, figsize=(10, 42), color=colors, rot=0)
plt.ylabel('history (years)')
2) If I have a second, third... dataset e.g.
df = pd.DataFrame(np.abs([[3.71,3.29,2.59,3.06,2.57,2.70,5.50,2.25,2.00,2.50], [2.59,2.50,2.38,2.19,3.20,3.00,2.25,7.00,0,0]]), columns=['18-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-64','65-70'], index=['male','female'])
and I want to plot it within the same figure, but with some space to the first two bars..., would be grateful for a hint. Bars should always be paired ('male' and 'female').
Stacked Bar Color
Try the code below:
import numpy as np
import matplotlib.pyplot as plt
N = 3
top_sector = (914, 300, 200) # No. of investments in the top sector for C1/C2/C3
second_sector = (770, 200, 100) # No. of investments in the 2nd highest sector for C1/C2/C3
third_sector = (282, 100, 50) # No. of investments in the 3rd highest sector for C1/C2/C3
# fourth_sector = (0, 0, 0)
# fifth_sector = (0, 0, 0)
ind = np.arange(N)
width = 0.35
p1 = plt.bar(ind, top_sector, width, color=['blue'])
p2 = plt.bar(ind, second_sector, width, color=['red'])
p3 = plt.bar(ind, third_sector, width, color=['green'])
# p4 = plt.bar(ind, top_sector, width, color=['yellow'])
# p5 = plt.bar(ind, top_sector, width, color=['purple'])
plt.xlabel('Top 3 Countries for Investment')
plt.ylabel('No of Investment (Count)')
plt.title('No of Investment in top 3 sectors of top 3 countries')
plt.xticks(ind, ('USA', 'GBR', 'IND'))
plt.yticks(np.arange(0, 1000, 200))
# only p1-p3 are drawn, so the legend lists three entries; add p4/p5 and their labels if you enable them
plt.legend((p1[0], p2[0], p3[0]), ('Others', 'Health', 'Entertainment'))
plt.show()
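Note that the code above draws all three series from the same baseline, so the bars overlap rather than stack. If truly stacked bars are wanted, one option is the bottom argument of bar. A minimal sketch with the same numbers follows; the legend labels are placeholders of mine.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

top_sector = np.array([914, 300, 200])
second_sector = np.array([770, 200, 100])
third_sector = np.array([282, 100, 50])
ind = np.arange(3)
width = 0.35

fig, ax = plt.subplots()
p1 = ax.bar(ind, top_sector, width, color='blue', label='top sector')
# each further series starts where the previous ones end
p2 = ax.bar(ind, second_sector, width, bottom=top_sector,
            color='red', label='second sector')
p3 = ax.bar(ind, third_sector, width, bottom=top_sector + second_sector,
            color='green', label='third sector')
ax.set_xticks(ind)
ax.set_xticklabels(('USA', 'GBR', 'IND'))
ax.legend()

# upper edge of each finished stack
stack_tops = [float(r.get_y() + r.get_height()) for r in p3]
```

Converting the tuples to NumPy arrays makes the running sum for bottom a simple addition.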
I am using Python 3.5. Also, I am a beginner (3 weeks experience) Python attempter and somehow I haven't given up in trying to analyze my data.
Data Description: My data is in a csv file (fev.csv). I've included the full data set here if you want to see it. It has 5 columns:
age (years)
fev (liters)
ht (inches)
sex (female=0, male=1)
smoke (non-smoker=0, smoker=1)
Task: I am trying to write a program to generate a bar graph of average FEVs with error bars indicating standard deviation. I'm trying to get 2 side by side bars (smokers/non-smokers) at 4 different age categories (11-12, 13-14, 15-16, 17 or older).
Code so far (please excuse all my #notes, it helps me know what I'm trying to do):
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('fev.csv')
nonsmokers = data[data.smoke==0]
smokers = data[data.smoke==1]
nonsmokers1 = nonsmokers[(nonsmokers.age==11) | (nonsmokers.age==12)]
nonsmokers2 = nonsmokers[(nonsmokers.age==13) | (nonsmokers.age==14)]
nonsmokers3 = nonsmokers[(nonsmokers.age==15) | (nonsmokers.age==16)]
nonsmokers4 = nonsmokers[(nonsmokers.age>=17)]
smokers1 = smokers[(smokers.age==11) | (smokers.age==12)]
smokers2 = smokers[(smokers.age==13) | (smokers.age==14)]
smokers3 = smokers[(smokers.age==15) | (smokers.age==16)]
smokers4 = smokers[(smokers.age>=17)]
nonsmMean = [nonsmokers1.fev.mean(), nonsmokers2.fev.mean(), nonsmokers3.fev.mean(), nonsmokers4.fev.mean()]
nonsmSd = [nonsmokers1.fev.std(), nonsmokers2.fev.std(), nonsmokers3.fev.std(), nonsmokers4.fev.std()]
smMean = [smokers1.fev.mean(), smokers2.fev.mean(), smokers3.fev.mean(), smokers4.fev.mean()]
smSd = [smokers1.fev.std(), smokers2.fev.std(), smokers3.fev.std(), smokers4.fev.std()]
# data to be plotted
nonsmoker = np.array(nonsmMean)
sdNonsmoker = np.array(nonsmSd)
smoker = np.array(smMean)
sdSmoker = np.array(smSd)
# parameters
bar_width = 0.35
x = np.arange(len(nonsmoker))
# plotting bars
plt.bar(x, nonsmoker, bar_width, yerr=sdNonsmoker, ecolor='k', color='b', label='Nonsmokers')
plt.bar(x+bar_width, smoker, bar_width, yerr=sdSmoker, ecolor='k', color='m', label='Smokers')
# formatting and labeling the axes and title
plt.xlabel('Age')
plt.ylabel('FEV')
plt.title('Mean FEV by Age and Smoking Status')
plt.xticks(x+0.35, ['11 to 12', '13 to 14', '15 to 16', '17+'])
# adding the legend
plt.legend()
plt.axis([-0.5,4.2,0,7])
plt.savefig('FEVgraph.png', dpi=300)
# and we are done!
plt.show()
Is there a more efficient way of doing this?
Thanks!
A possible solution is the following:
# pip install pandas
# pip install matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# read csv file and create pandas dataframe
df = pd.read_csv('https://raw.githubusercontent.com/benkeser/halplus/master/inst/extdata/fev.csv')
# assign age bins to data
bins = [df['age'].min()-1, 10, 12, 14, 16, df['age'].max()]
bins_labels = ['<11', '11 to 12', '13 to 14', '15 to 16', '17+']
df['age_bins'] = pd.cut(df['age'], bins, labels = bins_labels)
# aggregate data
result = df.groupby(['smoke', 'age_bins'], as_index=False).agg({'fev':['mean','std']})
result.columns = ['_'.join(col).strip('_') for col in result.columns.values]
result = result.round(1)
# prepare data for plot
nonsmokers = result[result['smoke'] == 0]
smokers = result[result['smoke'] == 1]
x = np.arange(len(bins_labels))
width = 0.35
# set plot figure size
plt.rcParams["figure.figsize"] = [8,6]
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, nonsmokers['fev_mean'], width, yerr=nonsmokers['fev_std'], color='b', label='Nonsmokers')
rects2 = ax.bar(x + width/2, smokers['fev_mean'], width, yerr=smokers['fev_std'], color='m', label='Smokers')
ax.set_xlabel('Age')
ax.set_ylabel('FEV')
ax.set_title('Mean FEV by Age and Smoking Status')
ax.set_xticks(x, bins_labels)
ax.legend(loc=2)
fig.tight_layout()
plt.savefig('FEVgraph.png', dpi=300)
plt.show()
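The column-flattening line in the aggregation step can look cryptic, so here is a small sketch of what it does. The six rows are made-up toy values, not taken from fev.csv, and observed=True is my addition to keep only the age bins that actually occur.

```python
import pandas as pd

# toy rows, not real FEV measurements
df = pd.DataFrame({'age':   [11, 12, 13, 14, 11, 13],
                   'smoke': [0, 0, 0, 0, 1, 1],
                   'fev':   [2.0, 3.0, 3.0, 4.0, 3.0, 4.0]})
df['age_bins'] = pd.cut(df['age'], [10, 12, 14], labels=['11 to 12', '13 to 14'])

result = df.groupby(['smoke', 'age_bins'], as_index=False, observed=True).agg({'fev': ['mean', 'std']})
# agg with a list of functions yields two-level columns such as ('fev', 'mean')
print(result.columns.tolist())

# join both levels with '_' and strip the trailing '_' left by the grouping keys
result.columns = ['_'.join(col).strip('_') for col in result.columns.values]
print(result.columns.tolist())
```

After flattening, the means and standard deviations can be selected as fev_mean and fev_std, which is what the plotting code relies on.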