Adjusting data in a dictionary and then plotting it - python

I have
x = collections.Counter(df.f.values.tolist())
if 'nan' in x:
    del x['nan']
plt.bar(range(len(x)), x.values(), align='center')
plt.xticks(range(len(x)), list(x.keys()))
plt.show()
My question is, how can I remove the nan's from the dictionary that is created, and how can I change the order of the bar plot to go from 1-5? The first 3 nan's are empty spots in the data (intentional since it's from a poll), and the last one is the title of the column. I tried manually changing the range part of plt.bar to be 1-5 but it does not seem to work.

You can use .value_counts on a pandas.Series to simply get how many times each value occurs. This makes it simple to then make a barplot.
By default, value_counts will ignore the NaN values, so that takes care of that, and by using .sort_index() we can guarantee the values are plotted in order. It seems we need to use .to_frame() so that it only plots one color for the column (it chooses one color per row for a Series).
Sample Data
import pandas as pd
import numpy as np
# Get your plot settings
import seaborn as sns
sns.set()
np.random.seed(123)
df = pd.DataFrame({'f': np.random.randint(1,6,100)})
# DataFrame.append and np.NaN were removed in pandas 2.0 / NumPy 2.0
df = pd.concat([df, pd.DataFrame({'f': np.repeat(np.nan, 1000)})])
Code
df.f.value_counts().to_frame().sort_index().plot(kind='bar', legend=False)
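If you would rather keep the Counter approach from the question, here is a minimal sketch of both routes side by side. It assumes the empty spots are real float NaN values rather than the string 'nan' (which would explain why the `'nan' in x` check never matches); the Series `s` and its values are made up for illustration.

```python
import collections

import numpy as np
import pandas as pd

# Hypothetical poll answers on a 1-5 scale, with three empty responses.
s = pd.Series([3, 1, np.nan, 5, 2, 2, np.nan, 4, 1, np.nan])

# Counting the raw values keeps the float NaN entries as keys,
# and none of those keys equals the string 'nan'.
raw = collections.Counter(s.tolist())
print('nan' in raw)  # False

# Option 1: stay with Counter, but drop the missing values first.
clean = collections.Counter(s.dropna().astype(int).tolist())
labels = sorted(clean)                # keys in ascending order
heights = [clean[k] for k in labels]  # counts in the same order

# Option 2: let pandas do it; value_counts() skips NaN by default.
counts = s.value_counts().sort_index()

print(labels, heights)
print(counts)
```

With option 1, `plt.bar(labels, heights)` would then draw the bars in 1-5 order.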

Related

Order seaborn countplot by Month

This should be very simple, but I'm trying to order a seaborn countplot by month.
The default is in reverse order (latest months first), so I would like to either simply reverse the order or specify the order - ideally I'd like to understand how to do both.
This is the code I have:
sns.countplot(data=cycling, x=cycling['Date'].dt.strftime('%Y-%m'))
plt.xticks(rotation=45)
plt.show()
I tried adding order = cycling['Date'].dt.strftime('%Y-%m') but it just splits the bars further based on how many entries I had for that month. So it goes from this:
[Barplot image 1: wrong order]
To this:
[Barplot image 2: wrong order + sliced too much]
Any help would be great, thanks!
By default, the order of appearance in the 'Date' column is used. If your dataframe is strictly from newest to oldest, you could just invert the dataframe. If there isn't a strict order, you can sort the dataframe.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
cycling = pd.DataFrame({'Date': np.random.choice(pd.date_range('20210801', '20230123', freq='D'), 500)})
ax = sns.countplot(x=cycling.sort_values('Date')['Date'].dt.strftime('%Y-%m'))
ax.tick_params(axis='x', rotation=45)
ax.set_xlabel('')
plt.tight_layout()
plt.show()
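You can use order = sorted(set(order)) to remove duplicates from your list: set drops the duplicates and sorted returns them as a list in ascending order. (A plain list(set(order)) also removes the duplicates, but a set has no defined order, so the bars could end up in any order.) Since '%Y-%m' strings sort chronologically, this puts the oldest month first.
You can also reverse the resulting order list by using order.reverse().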
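As a rough sketch of building the order list both ways (oldest first and newest first), using a handful of made-up dates in place of the real cycling data:

```python
import pandas as pd

# Hypothetical ride dates, deliberately out of order.
cycling = pd.DataFrame({"Date": pd.to_datetime(
    ["2022-03-15", "2021-11-02", "2022-03-01", "2022-01-20", "2021-11-30"])})

months = cycling["Date"].dt.strftime("%Y-%m")

# One entry per month, oldest first ('%Y-%m' strings sort chronologically).
order = sorted(months.unique())
# The same labels, newest first.
newest_first = order[::-1]

print(order)
print(newest_first)
```

Either list can then be passed on, e.g. sns.countplot(x=months, order=order). The key point is that order should contain each month once, not one entry per row.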

How to align bars with tick labels in plt or pandas histogram (when plotting multiple columns)

I have started using python for lots of data problems at work and the datasets are always slightly different. I'm trying to explore more efficient ways of plotting data using the inbuilt pandas function rather than individually writing out the code for each column and editing the formatting to get a nice result.
Background: I'm using Jupyter notebook and looking at histograms where the values are all unique integers.
Problem: I want the xtick labels to align with the centers of the histogram bars when plotting multiple columns of data with the one function e.g. df.hist() to get histograms of all columns at once.
Does anyone know if this is possible?
Or is it recommended to do each graph on its own vs. using the inbuilt function applied to all columns?
I can modify them individually following this post: Matplotlib xticks not lining up with histogram
which gives me what I would like but only for one graph and with some manual processing of the values.
Desired outcome example for one graph:
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create list of datapoints
data = [[170,30,210],
        [170,50,200],
        [180,50,210],
        [165,35,180],
        [170,30,190],
        [170,70,190],
        [170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
# print dataframe.
df
Code that displays the graphs in the problem statement
df.hist(figsize=(5,5))
plt.show()
Code that displays the graph for weight how I would like it to be for all
df.hist(column='weight',bins=[175,185,195,205,215])
plt.xticks([180,190,200,210])
plt.yticks([0,1,2,3,4,5])
plt.xlim([170, 220])
plt.show()
Any tips or help would be much appreciated!
Thanks
I hope this helps. You take the column and count the frequency of each label (value_counts), then you call sort_index to order the result by the label rather than by the frequency, and then you draw the bar plot.
import pandas as pd
import matplotlib.pyplot as plt
data = [[170,30,210],
        [170,50,200],
        [180,50,210],
        [165,35,180],
        [170,30,190],
        [170,70,190],
        [170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
# Count each weight, order by the value itself, and plot one bar per value
df.weight.value_counts().sort_index().plot(kind = 'bar')
plt.show()
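To get the same result for all columns in one go, one possible sketch is to loop over the columns and draw each bar chart on its own axes. This assumes every column holds only a few distinct integers, as in the sample data; the `matplotlib.use("Agg")` line is only there so the sketch runs without a display and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = [[170, 30, 210],
        [170, 50, 200],
        [180, 50, 210],
        [165, 35, 180],
        [170, 30, 190],
        [170, 70, 190],
        [170, 50, 190]]
df = pd.DataFrame(data, columns=["height", "width", "weight"])

# One axes per column, side by side.
fig, axes = plt.subplots(1, len(df.columns), figsize=(9, 3))
for ax, col in zip(axes, df.columns):
    # One bar per distinct value, ordered by the value itself.
    counts = df[col].value_counts().sort_index()
    counts.plot(kind="bar", ax=ax, title=col, rot=0)
fig.tight_layout()
```

Because a bar chart places each tick under its bar, no manual bin edges are needed. The trade-off is that gaps between values (e.g. no 175 between 170 and 180) are not shown to scale.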

Subplot counts of multiple categorical variables on a single bar chart

I'm trying to create a single barplot from multiple dataframe columns each of which is a categorical variable (all based on the same levels). I want it to show a count of the levels occurring in each column.
The below code achieves what I want, but on 4 different bar plots. I'd like it all to be on one plot, so the bars are side by side (labels/legend would be rad). I'm trying to get a clean, simple solution using matplotlib but so far I can't figure it out. Help?
Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({"A":['cow','pig','horse','goat','cow'], "B":['cow','pig','horse','cow','goat'], "C":['pig','horse','goat','pig','cow'], "D":['cow','pig','horse','horse','goat'], "E":['pig','horse','goat','cow','goat']})
levels = np.sort(df['A'].unique())
df.A.value_counts()[levels].plot(kind='bar')
df.B.value_counts()[levels].plot(kind='bar')
df.C.value_counts()[levels].plot(kind='bar')
df.D.value_counts()[levels].plot(kind='bar')
You can apply pd.Series.value_counts to every column and plot the result as a bar graph, either grouped or stacked.
If you need each column as its own bar:
df.apply(pd.Series.value_counts).plot(kind='bar')
If you need them stacked:
df.apply(pd.Series.value_counts).plot(kind='bar', stacked=True)
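It may help to look at the intermediate table before plotting it. The sketch below uses the DataFrame from the question; the `fillna(0).astype(int)` step is an addition that only matters when some level is missing from a column.

```python
import pandas as pd

df = pd.DataFrame({"A": ['cow', 'pig', 'horse', 'goat', 'cow'],
                   "B": ['cow', 'pig', 'horse', 'cow', 'goat'],
                   "C": ['pig', 'horse', 'goat', 'pig', 'cow'],
                   "D": ['cow', 'pig', 'horse', 'horse', 'goat'],
                   "E": ['pig', 'horse', 'goat', 'cow', 'goat']})

# Rows are the levels, columns are the original columns, cells are counts.
# A level that never occurs in a column would show up as NaN, hence fillna.
counts = df.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```

Calling counts.plot(kind='bar') on this table draws one group of bars per level, with one bar per column and a legend.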

How do I get all seaborn plots into an output png?

I want to plot all columns in my dataframe against one column in the same df: totCost. The following code works fine:
for i in range(0, len(df.columns), 5):
    g=sns.pairplot(data=df,
                   x_vars=df.columns[i:i+5],
                   y_vars=['totCost'])
    g.set(xticklabels=[])
    g.savefig('output.png')
Problem is output.png only contains the last 3 graphs (there are 18 total). Same happens if I de-dent that line. How do I write all 18 as a single graphic?
So, the problem with using pairplot like you do, is that in every iteration of the loop, a new figure is created and assigned to g.
If you take your last line of code g.savefig('output.png'), outside of the loop, only the last version of g is saved to disk, and this is the one with only the last three subplots in it.
If you put that line into your loop, all figures get saved to disk, but under the same name, so each one overwrites the previous one and the last one is of course again the figure with three subplots in it.
A way around this is to create a figure, and assign all subplots to it, as they come, and then save that figure to disk:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
# generate random data, with 18 columns
dic = {str(a): np.random.randint(0,10,10) for a in range(18)}
df = pd.DataFrame(dic)
# rename first column of dataframe
df.rename(columns={'0':'totCost'}, inplace=True)
#instantiate figure
fig = plt.figure()
# loop through all columns, create subplots in 5 by 5 grid along the way,
# and add them to the figure
for i in range(len(df.columns)):
    ax = fig.add_subplot(5,5,i+1)
    ax.scatter(df['totCost'], df[df.columns[i]])
    ax.set_xticklabels([])
plt.tight_layout()
fig.savefig('figurename.png')
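A variation on the same idea, offered only as a sketch: size the grid from the number of columns instead of hard-coding 5 by 5, keep totCost on the y-axis as in the question, and hide the unused axes. The random data, the five-column layout and the temp-directory output path are all assumptions made for illustration.

```python
import math
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data: 18 columns, the first renamed to totCost.
rng = np.random.default_rng(0)
df = pd.DataFrame({str(a): rng.integers(0, 10, 10) for a in range(18)})
df = df.rename(columns={"0": "totCost"})

x_cols = [c for c in df.columns if c != "totCost"]
ncols = 5
nrows = math.ceil(len(x_cols) / ncols)  # enough rows for every column

fig, axes = plt.subplots(nrows, ncols, figsize=(3 * ncols, 3 * nrows),
                         sharey=True)
for ax, col in zip(axes.flat, x_cols):
    ax.scatter(df[col], df["totCost"])
    ax.set_xlabel(col)
    ax.tick_params(labelbottom=False)  # hide the x tick labels

# Hide the axes left over at the end of the grid.
for ax in axes.flat[len(x_cols):]:
    ax.set_visible(False)

fig.tight_layout()
out_path = os.path.join(tempfile.gettempdir(), "output.png")
fig.savefig(out_path)  # one file containing every subplot
```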

How to show some selected rows with FacetGrid

I have a dataframe with a column called "my_row". It has many values. I only want to see some of the data on FacetGrid that belong to specific values of "my_row" on the row. I tried to make a subset of my dataframe and visualize that, but still somehow seaborn "knows" that my original dataframe had more values in the "my_row" column and shows empty plots for the rows that I don't want.
So using the following code still gives me a figure with 2 rows of data that I want and many empty plots after that.
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
g = sns.FacetGrid(X, row='my_row', col='column')
How can I tell python to just plot those 2 rows?
I get plots like this with many empty plots:
I cannot reproduce this. The code from the question seems to work fine. Here we have a dataframe with four different values in the my_row column. Then filtering out two of them creates a FacetGrid with only two rows.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({"my_row" : np.random.choice(list("1234"), size=40),
                   "column" : np.random.choice(list("AB"), size=40),
                   "x" : np.random.rand(40),
                   "y" : np.random.rand(40)})
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
g = sns.FacetGrid(X, row='my_row', col='column')
g.map(plt.scatter, "x", "y")
plt.show()
For anyone encountering this problem: the issue is that my_row is a categorical type, which keeps its full list of categories even after filtering. To solve it, convert the column to str.
i.e.
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
X['my_row']=X['my_row'].astype(str)
g = sns.FacetGrid(X, row='my_row', col='column')
This should now work! :)
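To see why the conversion helps, here is a small sketch with made-up data showing that a filtered categorical column still remembers all of its original categories. That seaborn builds the grid rows from those categories is an assumption here, though it matches the empty plots described in the question. `cat.remove_unused_categories()` is a pandas alternative to `astype(str)` that keeps the categorical dtype.

```python
import pandas as pd

# Hypothetical frame where my_row is categorical with four levels.
df = pd.DataFrame({"my_row": pd.Categorical(list("12341234")),
                   "x": range(8)})

X = df[df["my_row"].isin(["1", "2"])].copy()

# The rows for '3' and '4' are gone, but the categories are not.
before = list(X["my_row"].cat.categories)

# Drop the categories that no longer occur in the data.
X["my_row"] = X["my_row"].cat.remove_unused_categories()
after = list(X["my_row"].cat.categories)

print(before)
print(after)
```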
I got inspired by this link:
Plot lower triangle in a seaborn Pairgrid
and changed my code to this:
g = sns.FacetGrid(df, row='my_row', col='column')
for i in list(range(2,48)):
    for j in list(range(0,12)):
        g.axes[i,j].set_visible(False)
So I had to iterate over each plot individually and make it invisible. But I think there should be an easier way to do this. And in the end I still don't understand how FacetGrid knows anything about the size of my original dataframe df when I use X as its input.
This is an answer that works, but I think there must be better solutions. One problem with my answer is that when I save the figure, I get a big white space in the saved plot (corresponding to the axes that I set their visibility to False) that I do not see in jupyter notebooks when I am running the code. If FacetGrid just plots the dataframe that I am giving it as the input (in this case X), there would have been no problem anymore. There should be a way to do that.
