Create a checkerboard plot with unbalanced rows and columns - python

I have a dataset similar to this format X = [[1,4,5], [34,70,1,5], [43,89,4,11], [22,76,4]] where the element lists have unequal lengths.
I want to create a checkerboard plot of 4 rows and 4 columns in which the color of each unit box corresponds to the value of the number. In this dataset some boxes will be missing (e.g. 4th column, first row).
How would I plot this in python using matplotlib?
Thanks

You can use the seaborn library or matplotlib to generate a heatmap. First, convert the data to a pandas DataFrame to handle the missing values: the shorter rows are padded with NaN, and the heatmap leaves those cells blank.
import pandas as pd
df = pd.DataFrame([[1,4,5],[34,70,1,5], [43,89,4,11],[22,76,4]])
%matplotlib inline  # Jupyter/IPython magic; omit this line in a plain script
from matplotlib import pyplot as plt
import seaborn as sns
sns.heatmap(df)
plt.show()
Result looks something like this.
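Since the question asks for matplotlib specifically, here is a minimal sketch of the same idea without seaborn or pandas. It assumes the rows can simply be right-padded with NaN; the names grid and masked are illustrative and not from the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

X = [[1, 4, 5], [34, 70, 1, 5], [43, 89, 4, 11], [22, 76, 4]]

# Pad every row with NaN up to the length of the longest row.
n_cols = max(len(row) for row in X)
grid = np.full((len(X), n_cols), np.nan)
for i, row in enumerate(X):
    grid[i, :len(row)] = row

# Mask the NaN cells so that no box is drawn for them.
masked = np.ma.masked_invalid(grid)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(masked, edgecolors="white", linewidth=1)
ax.invert_yaxis()       # first row at the top, as in a table
ax.set_aspect("equal")  # square unit boxes
fig.colorbar(mesh, ax=ax)
plt.show()
```

The masked cells (here the last box of the first and fourth rows) stay empty, and the colorbar maps each remaining box to its value.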

Related

Pandas - plot every single column of a Dataframe in a small multiple chart

I have a dataset with 35 features. I want to plot every feature in a small-multiple chart, like this:
Now, I am able to plot features one by one with the following code:
# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Make a data frame
df=pd.read_csv("mydataset.csv")
df.plot(y="feature1")
How can I achieve the result shown in the previous image? I need something similar to:
# read the data in a Dataframe object
# for each feature (column):
# plot the feature in the final small multiple chart
# render the full final chart
Thank you !
To complete the comment provided by Chris, I would suggest:
Transform your dataframe so all the columns you want to plot are contained in one column, using pd.melt. Let's call the list of columns that you want to plot columns_to_plot:
df_for_plot = pd.melt(df, value_vars = columns_to_plot)
Use seaborn FacetGrid to plot it into separate plots:
g = sns.FacetGrid(df_for_plot, col="variable")
g = g.map(plt.plot, "value")
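To make the reshaping step concrete, the sketch below builds a small synthetic frame (a hypothetical stand-in for mydataset.csv, with six features instead of 35), shows what pd.melt returns, and then draws the small multiples with pandas' own subplots option. The pandas route is an alternative to the FacetGrid call above, not what the answer uses.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for mydataset.csv: 50 rows, 6 numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(50, 6)).cumsum(axis=0),
    columns=[f"feature{i}" for i in range(1, 7)],
)
columns_to_plot = list(df.columns)

# Long format: one row per (feature name, value) pair.
# The resulting columns are named "variable" and "value" by default.
df_for_plot = pd.melt(df, value_vars=columns_to_plot)
print(df_for_plot.head())

# One small panel per feature on a 2 x 3 grid.
axes = df[columns_to_plot].plot(
    subplots=True, layout=(2, 3), figsize=(9, 5), sharex=True, legend=False
)
for ax, name in zip(axes.ravel(), columns_to_plot):
    ax.set_title(name)
plt.tight_layout()
plt.show()
```

With 35 features, a single row of panels gets very wide. Passing col_wrap to sns.FacetGrid, or a layout such as (7, 5) to DataFrame.plot, wraps the panels onto several rows.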

Subplot counts of multiple categorical variables on a single bar chart

I'm trying to create a single barplot from multiple dataframe columns each of which is a categorical variable (all based on the same levels). I want it to show a count of the levels occurring in each column.
The code below achieves what I want, but on 4 different bar plots. I'd like it all to be on one plot, so the bars are side by side (labels/legend would be rad). I'm trying to get a clean, simple solution using matplotlib but so far I can't figure it out. Help?
Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({"A":['cow','pig','horse','goat','cow'], "B":['cow','pig','horse','cow','goat'], "C":['pig','horse','goat','pig','cow'], "D":['cow','pig','horse','horse','goat'], "E":['pig','horse','goat','cow','goat']})
levels = np.sort(df['A'].unique())
df.A.value_counts()[levels].plot(kind='bar')
df.B.value_counts()[levels].plot(kind='bar')
df.C.value_counts()[levels].plot(kind='bar')
df.D.value_counts()[levels].plot(kind='bar')
You should apply pd.Series.value_counts to each column and plot the result as a bar graph, stacked or unstacked.
If you need each column as its own bar:
df.apply(pd.Series.value_counts).plot(kind='bar')
If you need them stacked:
df.apply(pd.Series.value_counts).plot(kind='bar', stacked=True)
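It can help to look at the intermediate table before plotting it. The sketch below uses the frame from the question; the fillna(0) step is an added assumption to cover a level that never occurs in some column, which does not happen with this particular data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["cow", "pig", "horse", "goat", "cow"],
    "B": ["cow", "pig", "horse", "cow", "goat"],
    "C": ["pig", "horse", "goat", "pig", "cow"],
    "D": ["cow", "pig", "horse", "horse", "goat"],
    "E": ["pig", "horse", "goat", "cow", "goat"],
})

# Rows are the levels, columns are the original columns, cells are counts.
# A level missing from a column would come back as NaN, hence fillna(0).
counts = df.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```

Calling counts.plot(kind='bar') then draws one group of bars per level with one bar per column, and the legend lists the columns.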

How to plot only one half of a scatter matrix using pandas

I am using pandas scatter_matrix (couldn't get PairGrid in seaborn to work) to plot all combinations of a set of columns in a pandas DataFrame. Each column has 1000 data points and there are nine columns.
I am using the following code:
pandas.plotting.scatter_matrix(df, alpha=0.2, figsize=(8,8))
I get the figure shown below:
This is nice. However, you'll notice that I have a mirror image across the main diagonal. Is it possible to plot only the lower portion, as in the following fake plot I made using Paint:
This is probably not the cleanest way to do it, but it works:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
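# iris is assumed to be a DataFrame of numeric columns loaded beforehand
axes = pd.plotting.scatter_matrix(iris, alpha=0.2, figsize=(8,8))
# hide every panel above the main diagonal
for i in range(np.shape(axes)[0]):
    for j in range(np.shape(axes)[1]):
        if i < j:
            axes[i,j].set_visible(False)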
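A self-contained variant of the same approach, as a sketch: it assumes a synthetic four-column frame in place of iris, and replaces the nested loops with np.triu_indices_from, which yields the (row, column) pairs strictly above the diagonal when k=1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data: 200 rows, 4 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))

# scatter_matrix returns a 2-D array of Axes, one per column pair.
axes = pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(8, 8))

# Hide every panel strictly above the main diagonal.
for i, j in zip(*np.triu_indices_from(axes, k=1)):
    axes[i, j].set_visible(False)

plt.show()
```

The diagonal and the lower triangle stay visible, so each pair of columns is drawn once.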

How to create a scatter plot for each dataframe column

I am trying to write some code in order to create an animation of scatter plot data through time. To do this I have a dataset with multiple columns where each column represents a numbered timestep.
I would like the code to cycle through each timestep column for the y axis and use a constant x axis, so that a separate scatter plot is generated for each timestep. I tried to do this by coding a for loop that specifies an incrementing column number for the y axis.
My current code generates three out of seven scatter plots in my sample data but returns the following error:
IndexError: index 9 is out of bounds for axis 0 with size 9
I have tried other similar solutions on Stack Overflow but they didn't correct my problem.
The data file is here if anyone wants to use what I am using: https://www.dropbox.com/s/7vwa0lud44td2ak/test_splot_anim_noTS.csv?dl=0
Any help or advice would be much appreciated.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
data=pd.read_csv("test_splot_anim_noTS.csv")
for n in range (6, 13):
    data.plot(kind='scatter', x='metres', y=n)
    plt.ylim(-4,4)
    plt.savefig('n.jpeg')
data=pd.read_csv("test_splot_anim_noTS.csv")
for column in data.columns[1:]:
    data.plot(kind='scatter', x='metres',y=column)
    plt.ylim(-4,4)
    plt.savefig('{}.jpeg'.format(column))
I may have done it!
With pandas.DataFrame.plot, single line plot:
data=pd.read_csv("test_splot_anim_noTS.csv")
data.set_index('metres', drop=True, inplace=True)
data.plot()
With matplotlib, single plot with all columns:
import matplotlib.pyplot as plt
plt.plot(data)
plt.show()
Separate scatter plots, files saved:
for col in data.columns:
    plt.scatter(data.index, data[col])
    plt.ylim(-4, 4)
    plt.savefig(f'{col}.jpeg')
    plt.show()
With Seaborn:
import seaborn as sns
for col in data.columns:
    # recent seaborn versions require x and y as keyword arguments
    sns.scatterplot(x=data.index, y=data[col])
    plt.ylim(-4,4)
    plt.savefig(f'{col}.jpeg')
    plt.show()
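The IndexError in the question most likely comes from passing the integer n as y: pandas appears to treat an integer there as a column position, and the file has only nine columns, so positions 6, 7 and 8 work and 9 fails. Looping over column names avoids this. The sketch below assumes a small synthetic frame in place of the CSV (the column names t1 to t3 are hypothetical) and writes PNG files to a temporary directory, creating and closing one figure per timestep.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for test_splot_anim_noTS.csv:
# one "metres" column plus one column per timestep.
rng = np.random.default_rng(0)
metres = np.arange(0, 50, 5)
data = pd.DataFrame({"metres": metres})
for step in range(1, 4):
    data[f"t{step}"] = rng.uniform(-3, 3, size=len(metres))

out_dir = tempfile.mkdtemp()
saved = []
for column in data.columns.drop("metres"):
    fig, ax = plt.subplots()           # fresh figure for every timestep
    ax.scatter(data["metres"], data[column])
    ax.set_ylim(-4, 4)
    ax.set_xlabel("metres")
    ax.set_ylabel(column)
    path = os.path.join(out_dir, f"{column}.png")
    fig.savefig(path)
    plt.close(fig)                     # free the figure before the next one
    saved.append(path)

print(saved)
```

Each file name is built from the column name, so no frame overwrites another, which is what happens with the fixed name 'n.jpeg' in the question.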

Visualizing data for the purpose of data cleaning

I have a column of a data frame with millions of rows (almost 8 million). I want to investigate this column in order to do some data cleaning. The data contained is trip_distance of NYC's yellow taxis.
I tried a simple plotting with sns.distplot() but it doesn't give me a clear plot.
I did try to use a range too: sns.distplot(df['trip_distance']<200, kde=False, bins=10, norm_hist=False), but I got this which again does not look helpful:
Is there a way to understand this column through visualization?
You can try this:
import pandas as pd
import matplotlib.pyplot as plt
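# squeeze=True was removed from read_csv in pandas 2.0; squeeze the one-column frame instead
s = pd.read_csv("name.csv", usecols=['col_name']).squeeze("columns")
s.plot.bar()  # bar graph: one bar per row, so only practical for small data
s.plot.hist()  # histogram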
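For a column with millions of rows and a few extreme values, a plain histogram tends to collapse into a single bar. One common approach, sketched below on synthetic data standing in for trip_distance, is to check summary statistics first and then compare a log-scaled histogram of everything with a histogram restricted to values under a high quantile. The 99.9% cutoff is an arbitrary assumption. Note also that df['trip_distance']<200 in the question is a boolean mask, so plotting it directly shows True/False values rather than distances.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the trip_distance column, with a few absurd outliers.
rng = np.random.default_rng(0)
trip_distance = pd.Series(
    rng.lognormal(mean=1.0, sigma=0.8, size=100_000), name="trip_distance"
)
trip_distance.iloc[:20] = 5000.0

# Summary statistics already reveal the gap between the median and the maximum.
print(trip_distance.describe())

# Keep values at or below the 99.9th percentile (assumed cutoff).
upper = trip_distance.quantile(0.999)
typical = trip_distance[trip_distance <= upper]

fig, (ax_all, ax_typical) = plt.subplots(1, 2, figsize=(10, 4))
ax_all.hist(trip_distance, bins=100)
ax_all.set_yscale("log")  # log counts keep rare outliers visible
ax_all.set_title("all values (log count)")
ax_typical.hist(typical, bins=100)
ax_typical.set_title("values up to the 99.9th percentile")
plt.show()
```

The left panel shows where the outliers sit, and the right panel shows the shape of the bulk of the data, which together help in choosing cleaning thresholds.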
