Plotting mean value of barchart in same figure with pandas - python

I have a dataframe of different cereals and want to plot their calories as a barplot. Now I also want to plot the mean value of the calorie values as a lineplot into the same figure as the barplot. I had the idea to put the mean value into a 1x1 dataframe by its own but I got the error
"None of [Index(['mean'], dtype='object')] are in the [columns]"
But I'm not committed to that approach.
I was unable to find a solution myself. Is there one?
My code, including the calculation of the mean value but without showing it in the figure:
import pandas as pd
df = pd.read_csv("cereal.csv")
mn = df["calories"].mean()
df.plot.bar(x="name", y="calories")

If I understand correctly, you have a few bars and would like a single horizontal line for the mean? You can try:
import pandas as pd
df = pd.read_csv("cereal.csv")
mn = df["calories"].mean()
ax = df.plot.bar(x="name", y="calories")
ax.axhline(mn, ls=':')
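For readers without cereal.csv, here is a minimal self-contained sketch of the same idea. The cereal names and calorie values are made up for illustration, and the label, colour and legend are optional additions that are not part of the answer above.

```python
import pandas as pd

# Made-up stand-in for cereal.csv (illustrative values only)
df = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D"],
    "calories": [70, 120, 110, 100],
})

mn = df["calories"].mean()

# df.plot.bar returns a matplotlib Axes, so the mean line can be drawn on it
ax = df.plot.bar(x="name", y="calories")
ax.axhline(mn, ls=":", color="black", label=f"mean = {mn:.1f}")
ax.legend()
```

In a script, call matplotlib.pyplot.show() afterwards to display the figure.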

Related

Using Pandas to plot Frequency of X against Y, where X is a 0 or 1 [Python,CSV,Pandas]

I have a CSV file which has been generated and altered to the current form;
Quick snapshot of sample Data
I want to plot a graph with X along the x-axis as normal and, on the y-axis, the frequency of 'True' values (i.e. 1s), so that I can visualise the relationship between time and the frequency of the event occurring.
Thus far I have attempted a melt and value_counts, but they seem to give absolute counts rather than counts relative to the X value. I understand the data will likely also need sorting before plotting, but I'm not sure of the best way to go about this.
Many thanks for any help.
You can either plot a histogram, which would probably work, or you can group by 'x' and sum 'y', as shown below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = np.array([[1988,1988,1988,1989,1990,1990,1991,1991], [0,1,1,0,1,1,0,0]])
df = pd.DataFrame(data.T, columns = ['x', 'y'])
df=df.groupby(['x']).sum()
print(df)
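Since the question asks for a frequency relative to each x value, one possible variation is to take the mean of y per group instead of the sum: for 0/1 data, the mean is the fraction of 1s. This is a sketch on the same toy data, not the asker's actual file, and the bar plot at the end is just one way to display it.

```python
import numpy as np
import pandas as pd

data = np.array([[1988, 1988, 1988, 1989, 1990, 1990, 1991, 1991],
                 [0, 1, 1, 0, 1, 1, 0, 0]])
df = pd.DataFrame(data.T, columns=["x", "y"])

# For 0/1 values, the per-group mean equals the share of 1s in that group
rel_freq = df.groupby("x")["y"].mean()
print(rel_freq)

# Series.plot.bar puts the index (x) on the horizontal axis
ax = rel_freq.plot.bar()
ax.set_ylabel("fraction of 1s")
```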

plot dataframe based on column index/position in python

I'm trying to plot a dataframe based on the column index (position).
It's easy to use the column name, and that shows the correct plot, but since there are duplicated column names, I have to use the column index.
import matplotlib.pyplot as plt
import pandas as pd
# gca stands for 'get current axis'
ax = plt.gca()
#class_report.plot(kind='line',x='description',y= "f1-score",ax=ax) #no error but shows duplicate lines
class_report.plot(kind='line',x='description',y= class_report.iloc[:,[3]],ax=ax) #error
class_report.plot(kind='line',x='description',y= class_report.iloc[:,[7]], color='red', ax=ax)#error
plt.show()
and it shows this error :
ValueError: Boolean array expected for the condition, not object
after using np.array(class_report.iloc[:,[3]]), new error appeared:
KeyError: "None of [Index([ (0.6884596334819217,), (0.16236162361623618,), (0.6314769975786926,),\n (0.625,), (0.7875912408759124,), (0.4711779448621553,),\n (0.593069306930693,), (0.18989898989898987,), (0.5726240286909743,),\n (0.12307692307692307,), (0.03592814371257485,), (0.5991130820399113,),\n (0.4436968029750066,), (0.5754453990621118,), (0.5679548536332456,)],\n dtype='object')] are in the [columns]"
Here's data
Since you have two columns with an identical name, you can't use the notation
my_dataframe.plot(y = some_column_name)
Instead, use matplotlib's plt.plot function and select the columns by position with iloc, as in:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
class_report = pd.DataFrame(zip(range(10), np.random.rand(10), np.random.rand(10)),
                            columns=["description", "f1_score", "f1_score"])
plt.plot(class_report.description, class_report.iloc[:,1])
plt.plot(class_report.description, class_report.iloc[:,2], color = "red")
plt.show()
I'm using random data in this example, with two columns named 'f1_score'.
You can rename the columns using
class_report.columns = ['description','f1-score','f1-score-2',...]
plt.plot(class_report['description'], class_report['f1-score'])
plt.plot(class_report['description'], class_report['f1-score-2'], color='red')
plt.show()
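When there are many columns, typing the new names by hand gets tedious. The sketch below uses a hypothetical helper, make_unique, which is not a pandas function, to append a numeric suffix to repeated names. The small DataFrame is made-up data, and the helper does not guard against a suffixed name colliding with an existing column.

```python
import pandas as pd

def make_unique(names):
    """Hypothetical helper: append -2, -3, ... to repeated names."""
    seen = {}
    result = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        # The first occurrence keeps its name; later ones get a suffix
        result.append(name if seen[name] == 1 else f"{name}-{seen[name]}")
    return result

# Made-up data with a duplicated column name
class_report = pd.DataFrame(
    [[0, 0.5, 0.7], [1, 0.6, 0.8]],
    columns=["description", "f1-score", "f1-score"],
)
class_report.columns = make_unique(class_report.columns)
print(list(class_report.columns))
```

After renaming, each column can be selected by name again, for example class_report['f1-score-2'].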

How can I visualise categorical feature vs date column

In my dataset I have a categorical column named 'Type' (containing e.g. INVOICE, IPC, IP) and a 'Date' column containing dates (e.g. 2014-02-01).
How can I plot these two?
On the x axis I want the date.
On the y axis I want a line for each type (e.g. INVOICE) showing its trend.
I'm not sure what you mean by plot and show trend. One way is to count, as @QuangHoang suggested, and plot the counts as a heatmap, something like below. If you want something different, please expand on your question.
import pandas as pd
import numpy as np
import seaborn as sns
dates = pd.date_range(start='1/1/2018', periods=5, freq='3M')[np.random.randint(0,5,20)]
type = np.random.choice(['INVOICE','IPC','IP'],20)
df = pd.DataFrame({'dates':dates ,'type':type})
tab = pd.crosstab(df['type'],df['dates'].dt.strftime('%d-%m-%Y'))
n = np.unique(tab.values)
cmap = sns.color_palette("BuGn_r",len(n))
sns.heatmap(tab,cmap=cmap)
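If one line per type is what is wanted, a possible sketch is to count rows per date and type with pd.crosstab and plot the resulting table, so that each type becomes one line. The data below is randomly generated, as in the answer above, and the column names 'Date' and 'Type' are taken from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
all_dates = pd.date_range(start="2018-01-01", periods=5, freq="D")
df = pd.DataFrame({
    "Date": all_dates[rng.integers(0, 5, 20)],
    "Type": rng.choice(["INVOICE", "IPC", "IP"], 20),
})

# Rows: dates, columns: types, cells: number of occurrences
counts = pd.crosstab(df["Date"], df["Type"])

# DataFrame.plot draws one line per column, with the date index on the x axis
ax = counts.plot(marker="o")
ax.set_ylabel("count")
```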

Avoid plotting missing values in Seaborn

Problem: I have timeseries data spanning several days, and I use the sns.FacetGrid function of the Seaborn Python library to plot this data in facet form. In several cases, I found that this Seaborn function joins the readings on either side of a run of consecutive missing values (NaN values) with a continuous line, whereas matplotlib shows missing values as a gap, which makes sense. A demo example follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
This results in the following plot.
The Matplotlib plot above shows the missing values as a gap.
Now let us prepare the same data for the seaborn function:
DF['col'] = np.ones(DF.shape[0])# dummy column but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
g = sns.FacetGrid(DF,col='col',col_wrap=1,height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
g.map_dataframe(plt.plot,'timestamp','data_val')
See how the seaborn plot joins the missing data with a line. How can I force seaborn not to plot NaN values with such a line?
Note: This is a dummy example, and I need facet grid in any case to plot my data.
FacetGrid by default removes NaN values from the data. The reason is that some functions inside seaborn would not work properly with NaNs (especially some of the statistical functions, I'd say).
To keep the NaN values in the data, pass the dropna=False argument to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)
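Put together, a minimal sketch of the full example might look as follows. It builds the same kind of three-day series as the question in a more compact way, and it assumes seaborn 0.9 or later, where the size argument is called height.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Three days of hourly data; the second day is entirely NaN
index = pd.date_range("2018-01-01", periods=72, freq="h")
values = np.random.randn(len(index))
values[24:48] = np.nan
DF = pd.DataFrame({"timestamp": index, "data_val": values, "col": 1.0})

# dropna=False keeps the NaN rows, so plt.plot leaves a gap for them
g = sns.FacetGrid(DF, col="col", col_wrap=1, height=2.5, dropna=False)
g.map_dataframe(plt.plot, "timestamp", "data_val")
```

With the default dropna=True, the NaN rows are dropped before plt.plot is called, which is why the line runs straight across the missing day.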

Iterating over columns with for loops in pandas dataframe

I am trying to take a dataframe read in from a CSV file and generate scatter plots for each column within the dataframe. For example, I have read in the following with df=pandas.read_csv()
Sample AMP ADP ATP
1A 239847 239084 987374
1B 245098 241210 988950
2A 238759 200554 921032
2B 230029 215408 899804
I would like to generate a scatter plot using Sample as the x values and the areas in each of the other columns as the y values.
I am using the following code with bokeh.plotting to plot each column manually
import pandas
from bokeh.plotting import figure, show
df = pandas.read_csv("data.csv")
p = figure(x_axis_label='Sample', y_axis_label='Peak Area', x_range=sorted(set(df['Sample'])))
p.scatter(df['Sample'], df['AMP'])
show(p)
This generates scatter plots successfully, but I would like to create a loop to generate a scatter plot for each column. In my full dataset, I have over 500 columns I would like to plot.
I have followed references for using df.iteritems and df.itertuples for iterating through dataframes, but I'm not sure how to get the output I want.
I have tried the following:
for index, row in df.iteritems():
    p = figure()
    p.scatter(df['Sample'], df[row])
    show(p)
I hit an error right away:
raise KeyError('%s not in index' % objarr[mask] KeyError: "['1A' '1B'
'2A' '2B'] not in index
Any guidance? Thanks in advance.
iteritems iterates over columns, not rows. But your real problem is that you index with df[row] instead of df[index]. I'd rename the loop variables to reflect that they are columns and do this:
# on pandas 2.0 and later, use df.items(); iteritems() was removed there
for colname, col in df.iteritems():
    p = figure()
    p.scatter(df['Sample'], df[colname])
    show(p)
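To see what the loop actually visits, here is a small pandas-only sketch using the sample table from the question, parsed from a string rather than a CSV file. It uses items(), the current name for iteritems(), and drops the Sample column first, since that column supplies the x values rather than a series to plot. The plotted list is only there for illustration.

```python
import io
import pandas as pd

# The sample table from the question, parsed from a string
raw = """Sample AMP ADP ATP
1A 239847 239084 987374
1B 245098 241210 988950
2A 238759 200554 921032
2B 230029 215408 899804
"""
df = pd.read_csv(io.StringIO(raw), sep=" ")

plotted = []
# items() yields (column name, column Series) pairs, one per column
for colname, col in df.drop(columns="Sample").items():
    plotted.append(colname)
    print(colname, col.tolist())
```

The figure(), p.scatter(df['Sample'], col) and show(p) calls from the answer would go inside this loop body.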
