Iterating over Dataframe columns to plot Histogram

%matplotlib inline
for column in df.columns:
    if df[column].dtype == "int64":
        df[column].hist(title=column)
    else:
        df[column].plot(kind="bar", title=column)
AttributeError: 'Rectangle' object has no property 'title'
I would like to plot a histogram when the dtype is int and a bar plot when the dtype is object, but the code raises the error above.

Select the numeric columns up front, and use subplot/subplots to draw multiple graphs in one figure:
import matplotlib.pyplot as plt
import seaborn as sns
numeric_columns = df.select_dtypes(include=['int64','float64']).columns
# adjust the grid so that n_rows * n_cols covers the number of columns
n_rows = 2
n_cols = 2
for i, column in enumerate(df.columns, 1):
    plt.subplot(n_rows, n_cols, i)
    if column in numeric_columns:
        df[column].plot(kind="hist", title=column)
    else:
        sns.countplot(x=df[column])
And, as mentioned by Code Different, you need to get rid of the title argument if you are using an older pandas version.
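As a rough sketch of the same idea without seaborn, the loop below sizes the grid from the number of columns and sets each title through the Axes, which sidesteps the title keyword entirely. The sample DataFrame and its column names are made up for illustration, and the Agg backend is only selected so that the snippet runs without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 52],
    "city": ["Oslo", "Rome", "Oslo", "Lima", "Rome"],
    "score": [1.5, 2.0, 2.5, 2.0, 3.5],
})

numeric_columns = df.select_dtypes(include="number").columns

# Size the grid from the number of columns instead of hard-coding 2 x 2.
n_cols = 2
n_rows = math.ceil(len(df.columns) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(8, 3 * n_rows), squeeze=False)
flat_axes = axes.ravel()

for ax, column in zip(flat_axes, df.columns):
    if column in numeric_columns:
        # Histogram of the values for numeric columns.
        df[column].plot(kind="hist", ax=ax)
    else:
        # Bar chart of category counts for everything else.
        df[column].value_counts().plot(kind="bar", ax=ax)
    # Setting the title on the Axes works regardless of plot kind.
    ax.set_title(column)

# Hide any unused cells of the grid.
for ax in flat_axes[len(df.columns):]:
    ax.set_visible(False)

fig.tight_layout()
```

Using value_counts() for the non-numeric columns is a substitution for sns.countplot; it draws one bar per distinct value, ordered by frequency.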

Related

How to use two columns in x-axis

I'm using the below code to get Segment and Year in x-axis and Final_Sales in y-axis but it is throwing me an error.
CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
order = pd.read_excel("Sample.xls", sheet_name = "Orders")
order["Year"] = pd.DatetimeIndex(order["Order Date"]).year
result = order.groupby(["Year", "Segment"]).agg(Final_Sales=("Sales", sum)).reset_index()
bar = plt.bar(x = result["Segment","Year"], height = result["Final_Sales"])
ERROR
Can someone help me to correct my code to see the output as below.
Required Output

Try adding another pair of brackets: result[["Segment","Year"]].
What you wrote tries to retrieve a single column named ("Segment","Year"),
but what you actually want is to retrieve a list of columns, ["Segment","Year"].
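The difference between the two indexing forms can be checked on a small frame. This is only a sketch: the data below is a made-up stand-in for the grouped result from the question.

```python
import pandas as pd

# Hypothetical stand-in for the grouped `result` frame from the question.
result = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 95.0],
})

# Double brackets: the inner list selects several columns and returns a DataFrame.
subset = result[["Segment", "Year"]]
print(type(subset).__name__, list(subset.columns))

# Single brackets with two names: pandas looks for ONE column whose label is
# the tuple ("Segment", "Year"), which does not exist in this frame.
try:
    result["Segment", "Year"]
    lookup_failed = False
except KeyError as exc:
    lookup_failed = True
    print("KeyError:", exc)
```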

There are several problems with your code:
When using several columns to index a dataframe, you need to pass a list of columns to [] (see the docs), as follows:
result[["Segment","Year"]]
From the figure you provide, it looks like you want to use year as hue. Matplotlib's plt.bar doesn't have a hue argument, so you would have to build the grouped bars manually. Instead, you can use the seaborn library, which you are already importing anyway (see https://seaborn.pydata.org/generated/seaborn.barplot.html):
sns.barplot(x = 'Segment', y = 'Final_Sales', hue = 'Year', data = result)
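If you would rather stay with pandas and matplotlib, one possible way to get grouped bars is to pivot the frame so that each year becomes its own column. This is a sketch on invented data, not the approach from the answer; the Agg backend is only chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the grouped `result` frame from the question.
result = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 95.0],
})

# One row per Segment, one column per Year.
wide = result.pivot(index="Segment", columns="Year", values="Final_Sales")

# pandas draws one bar per column within each row, i.e. one bar per year
# inside each segment group, and adds a legend keyed by year.
ax = wide.plot(kind="bar", rot=0)
ax.set_ylabel("Final_Sales")
```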

plot dataframe based on column index/position in python

I'm trying to plot a dataframe based on the column index (position).
It's easy to use the column name, and that shows the correct plot, but since there are duplicated column names, I have to use the column index.
import matplotlib.pyplot as plt
import pandas as pd
# gca stands for 'get current axes'
ax = plt.gca()
#class_report.plot(kind='line',x='description',y= "f1-score",ax=ax) #no error but shows duplicate lines
class_report.plot(kind='line',x='description',y= class_report.iloc[:,[3]],ax=ax) #error
class_report.plot(kind='line',x='description',y= class_report.iloc[:,[7]], color='red', ax=ax)#error
plt.show()
and it shows this error:
ValueError: Boolean array expected for the condition, not object
After using np.array(class_report.iloc[:,[3]]), a new error appeared:
KeyError: "None of [Index([ (0.6884596334819217,), (0.16236162361623618,), (0.6314769975786926,),\n (0.625,), (0.7875912408759124,), (0.4711779448621553,),\n (0.593069306930693,), (0.18989898989898987,), (0.5726240286909743,),\n (0.12307692307692307,), (0.03592814371257485,), (0.5991130820399113,),\n (0.4436968029750066,), (0.5754453990621118,), (0.5679548536332456,)],\n dtype='object')] are in the [columns]"
Here's data

Since you have two columns with an identical name, you can't select one of them by name, as in
my_dataframe.plot(y = some_column_name)
Instead, use matplotlib's plt.plot function and select the columns by position, as in:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
class_report = pd.DataFrame(zip(range(10), np.random.rand(10), np.random.rand(10)),
                            columns=["description", "f1_score", "f1_score"])
plt.plot(class_report.description, class_report.iloc[:,1])
plt.plot(class_report.description, class_report.iloc[:,2], color = "red")
plt.show()
I'm using random data in this example, with two columns named 'f1_score'.
The output is:

You can rename the columns using
class_report.columns = ['description','f1-score','f1-score-2',...]
plt.plot(class_report['description'], class_report['f1-score'])
plt.plot(class_report['description'], class_report['f1-score-2'], color='red')
plt.show()
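To see why the name-based lookup misbehaves and how the two fixes differ, here is a small sketch on random data. The renaming loop is just one possible way to make labels unique; the suffix scheme is an assumption chosen to match the 'f1-score-2' name used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = np.column_stack([np.arange(5), rng.random(5), rng.random(5)])
class_report = pd.DataFrame(data, columns=["description", "f1-score", "f1-score"])

# Label-based selection returns BOTH columns that share the name (a DataFrame).
both = class_report["f1-score"]
print(both.shape)

# Position-based selection returns exactly one column (a Series).
second = class_report.iloc[:, 2]
print(type(second).__name__)

# Make the labels unique by appending a counter to repeated names.
seen = {}
unique_names = []
for name in class_report.columns:
    count = seen.get(name, 0)
    unique_names.append(name if count == 0 else f"{name}-{count + 1}")
    seen[name] = count + 1
class_report.columns = unique_names
print(list(class_report.columns))
```

After the rename, each column can be passed by name to plt.plot or DataFrame.plot without ambiguity.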

Avoid plotting missing values in Seaborn

Problem: I have time series data covering several days, and I use the sns.FacetGrid function of the seaborn Python library to plot this data in facet form. In several cases, I found that this seaborn function joins the readings on either side of a run of consecutive missing values (nan values) with a continuous line, whereas matplotlib shows missing values as a gap, which makes sense. A demo example is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
This results in the following plot.
The Matplotlib plot above shows the missing values as a gap.
Now let us prepare the same data for the seaborn function:
DF['col'] = np.ones(DF.shape[0])  # dummy column, but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
g = sns.FacetGrid(DF,col='col',col_wrap=1,height=2.5)  # `size` was renamed to `height` in seaborn 0.9
g.map_dataframe(plt.plot,'timestamp','data_val')
See how the seaborn plot draws a line across the missing data. How can I force seaborn not to draw a line through the nan values?
Note: This is a dummy example, and I need facet grid in any case to plot my data.

FacetGrid by default removes nan from the data (this was the default when this answer was written; check the signature in your installed seaborn version). The reason is that some functions inside seaborn would not work properly with nans (especially some of the statistical functions, I'd say).
To keep the nan values in the data, pass the dropna=False argument to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)
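The effect of dropping the nan rows can be reproduced with matplotlib alone. The sketch below uses made-up daily data and assumes that dropping rows before plotting is roughly what happens to the facet data when dropna is enabled; the Agg backend is only selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: nine days, the middle three are missing.
timestamps = pd.date_range("2018-01-01", periods=9, freq="D")
values = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, 4.0, 5.0, 6.0])
frame = pd.DataFrame({"timestamp": timestamps, "data_val": values})

fig, (ax_gap, ax_joined) = plt.subplots(2, 1, sharex=True)

# NaN rows kept: matplotlib breaks the line wherever data_val is NaN.
(line_gap,) = ax_gap.plot(frame["timestamp"], frame["data_val"])
ax_gap.set_title("NaN rows kept: gap")

# NaN rows dropped first: the remaining points are neighbours in the data,
# so they are joined by one continuous line.
dropped = frame.dropna()
(line_joined,) = ax_joined.plot(dropped["timestamp"], dropped["data_val"])
ax_joined.set_title("NaN rows dropped: joined")

fig.tight_layout()
```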

Matplotlib Boxplot and pandas dataframe data type

I set up an empty dataframe DF and load data into it according to some conditions, so some of its elements remain empty (nan). I noticed that if I don't specify the datatype as float when I create the empty dataframe, DF.boxplot() gives me an 'Index out of range' error.
As I understand it, pandas' DF.boxplot() uses matplotlib's plt.boxplot() function, so naturally I tried plt.boxplot(DF.iloc[:,0]) to plot the boxplot of the first column. I noticed the reverse behavior: when the dtype of DF is float, it does not work and just shows me an empty plot. In the code below, DF.boxplot() won't work, but plt.boxplot(DF.iloc[:,0]) will plot a boxplot (when I add dtype='float' when first creating the dataframe, plt.boxplot(DF.iloc[:,0]) gives me an empty plot):
import numpy as np
import pandas as pd
DF=pd.DataFrame(index=range(10),columns=range(4))
for i in range(10):
    for j in range(4):
        if i==j:
            continue
        DF.iloc[i,j]=i
I wonder whether this has to do with how plt.boxplot() handles nan for different data types. If so, why doesn't DF.boxplot() work when the dataframe's data type is 'object', if pandas is just using matplotlib's boxplot function?

I think we can agree that neither df.boxplot() nor plt.boxplot can handle dataframes of type "object". Instead they need to be of a numeric datatype.
If the data is numeric, df.boxplot() will work as expected, even with nan values, because they are removed before plotting.
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(index=range(10),columns=range(4), dtype=float)
for i in range(10):
    for j in range(4):
        if i!=j:
            df.iloc[i,j]=i
df.boxplot()
plt.show()
Using plt.boxplot you would need to remove the nans manually, e.g. using df.dropna().
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(index=range(10),columns=range(4), dtype=float)
for i in range(10):
    for j in range(4):
        if i!=j:
            df.iloc[i,j]=i
data = [df[i].dropna() for i in range(4)]
plt.boxplot(data)
plt.show()
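If the frame has already been built without a dtype, as in the question, one option is to convert it afterwards rather than recreate it. The following is a minimal sketch, assuming every stored value is either a number or nan; the Agg backend is only selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same construction as in the question: no dtype given, so columns are object.
DF = pd.DataFrame(index=range(10), columns=range(4))
for i in range(10):
    for j in range(4):
        if i != j:
            DF.iloc[i, j] = i

print(DF.dtypes.tolist())

# Convert every column to float; missing entries stay nan.
numeric = DF.astype(float)
print(numeric.dtypes.tolist())

# With a numeric dtype, DataFrame.boxplot draws one box per column.
ax = numeric.boxplot()
```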
Matplotlib's fill_between doesn't work with plot_date, any alternatives?

I want to create a plot just like this:
The code:
P.fill_between(DF.start.index, DF.lwr, DF.upr, facecolor='blue', alpha=.2)
P.plot(DF.start.index, DF.Rt, '.')
but with dates on the x axis, like this (without bands):
The code:
P.plot_date(DF.start, DF.Rt, '.')
The problem is that fill_between fails when the x values are date_time objects.
Does anyone know of a workaround? DF is a pandas DataFrame.

It would help if you show how df is defined. What does df.info() report? This will show us the dtypes of the columns.
There are many ways that dates can be represented: as strings, ints, floats, datetime.datetime, NumPy datetime64s, Pandas Timestamps, or Pandas DatetimeIndex. The correct way to plot it depends on what you have.
Here is an example showing your code works if df.index is a DatetimeIndex:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
plt.fill_between(df.index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(df.index, df.Rt, '.')
plt.show()
If the index has string representations of dates, then (with Matplotlib version 1.4.2) you would get a TypeError:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'])
index = [item.strftime('%Y-%m-%d') for item in index]
plt.fill_between(index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(index, df.Rt, '.')
plt.show()
yields
File "/home/unutbu/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/ma/core.py", line 2237, in masked_invalid
condition = ~(np.isfinite(a))
TypeError: Not implemented for this type
In this case, the fix is to convert the strings to Timestamps:
index = pd.to_datetime(index)
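Put together, the conversion and the plot might look like the sketch below. The dates and values are invented for illustration, and the Agg backend is only selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: date strings plus lower / central / upper values.
date_strings = ["2000-01-31", "2000-02-29", "2000-03-31", "2000-04-30"]
df = pd.DataFrame({
    "lwr": [8.0, 9.0, 8.5, 9.5],
    "Rt": [10.0, 10.5, 9.8, 10.2],
    "upr": [12.0, 11.5, 11.0, 12.5],
})

# Convert the strings to a DatetimeIndex before handing them to matplotlib.
index = pd.to_datetime(date_strings)

fig, ax = plt.subplots()
band = ax.fill_between(index, df["lwr"], df["upr"], facecolor="blue", alpha=0.2)
ax.plot(index, df["Rt"], ".")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```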

Regarding the error reported by chilliq:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs
could not be safely coerced to any supported types according to the casting
rule ''safe''
This can be produced if the DataFrame columns have "object" dtype when using fill_between. Changing the example column types and then trying to plot, as follows, results in the error above:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
dfo = df.astype(object)
plt.fill_between(dfo.index, dfo.lwr, dfo.upr, facecolor='blue', alpha=.2)
plt.show()
From dfo.info() we see that the column types are "object":
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 180 entries, 2000-01-31 to 2014-12-31
Freq: M
Data columns (total 3 columns):
lwr 180 non-null object
Rt 180 non-null object
upr 180 non-null object
dtypes: object(3)
memory usage: 5.6+ KB
Ensuring that the DataFrame has numeric columns solves the problem. To do this, we can convert with pandas.to_numeric, as follows:
dfn = dfo.apply(pd.to_numeric)  # errors='ignore' is deprecated in recent pandas
plt.fill_between(dfn.index, dfn.lwr, dfn.upr, facecolor='blue', alpha=.2)
plt.show()

I got similar error while using fill_between:
ufunc 'bitwise_and' not supported
However, in my case the cause of the error was rather silly. I was passing the color parameter without an explicit argument name, so it was taken as the fourth positional parameter, which is called where. Simply making sure that keyword parameters are passed by name solved the issue:
ax.fill_between(xdata, highs, lows, color=color, alpha=0.2)
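The parameter order can be checked from the signature itself. This is a sketch that assumes a reasonably recent matplotlib, where `where` directly follows `y2`; the data is made up, and the Agg backend is only selected so the snippet runs without a display.

```python
import inspect

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.axes import Axes

# List the parameter names of fill_between in declaration order.
param_names = list(inspect.signature(Axes.fill_between).parameters)
print(param_names[:5])

# A fourth positional argument after x, y1 and y2 is therefore bound to
# `where`, not to a colour. Passing styling options by keyword avoids that.
xdata = np.arange(5)
highs = xdata + 1.0
lows = xdata - 1.0

fig, ax = plt.subplots()
band = ax.fill_between(xdata, highs, lows, color="tab:blue", alpha=0.2)
```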

I think none of the answers addresses the original question; they all change it a little.
If you want to plot timedeltas, you can use this workaround:
ax = df.Rt.plot()
x = ax.get_lines()[0].get_xdata().astype(float)
ax.fill_between(x, df.lwr, df.upr, color="b", alpha=0.2)
plt.show()
This works in your case. In general, the only caveat is that you always need to plot the index using pandas first and then get the coordinates from the artist. I am sure that by looking at the pandas code, one can find out how it plots timedeltas. One could then apply that to the code, and the first plot would no longer be needed.
