Visualizing data for the purpose of data cleaning - python

I have a data frame column with almost 8 million rows that I want to inspect before doing some data cleaning. The column is trip_distance from NYC's yellow taxi data.
I tried a simple plot with sns.distplot(), but it doesn't give me a clear picture.
I also tried restricting the range: sns.distplot(df['trip_distance']<200, kde=False, bins=10, norm_hist=False), but the resulting plot does not look helpful either:
Is there a way to understand this column through visualization?

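You can try this:
import pandas as pd
import matplotlib.pyplot as plt
# read only the column of interest; .squeeze("columns") turns the one-column frame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
s = pd.read_csv("name.csv", usecols=['col_name']).squeeze("columns")
s.plot.hist()  # histogram of the values
plt.show()
s.plot.bar()  # bar graph: one bar per row, so only practical on a small sample
plt.show()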
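For a heavily skewed column like trip_distance, it may help to look at quantiles first and then plot both the full range with log-scaled counts and a trimmed range. Note that df['trip_distance'] < 200 is a column of True/False values, so passing it to a plotting function plots the booleans rather than the distances; it has to be used as a mask. The following is only a sketch: the data is synthetic, and the 99.9th-percentile cutoff is an arbitrary assumption, not a recommended cleaning rule.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the taxi data (assumption): mostly short trips plus a few absurd values.
rng = np.random.default_rng(0)
trip_distance = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=100_000),
                          name="trip_distance")
trip_distance.iloc[:5] = [0.0, 900.0, 1500.0, 12000.0, 300000.0]

# Numeric summary first: the extreme quantiles show where the bulk of the data ends.
print(trip_distance.describe(percentiles=[0.01, 0.5, 0.99, 0.999]))

# Use the comparison as a boolean mask to select rows, instead of plotting the mask itself.
cutoff = trip_distance.quantile(0.999)
plausible = trip_distance[(trip_distance > 0) & (trip_distance <= cutoff)]

fig, (ax_all, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))
# Log-scaled counts keep rare outliers visible next to the dominant short trips.
ax_all.hist(trip_distance, bins=100, log=True)
ax_all.set_title("all values (log counts)")
# The trimmed view shows the shape of the bulk of the distribution.
ax_zoom.hist(plausible, bins=100)
ax_zoom.set_title("0 < distance <= 99.9th percentile")
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, add plt.show() or fig.savefig(...) at the end.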

Related

Pandas - plot every single column of a Dataframe in a small multiple chart

I have a dataset with 35 features. I want to plot every feature in a small multiple chart, like this:
Now, I am able to plot features one by one with the following code:
# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Make a data frame
df=pd.read_csv("mydataset.csv")
df.plot(y="feature1")
How can I achieve the result shown in the previous image? I need something similar to:
# read the data in a Dataframe object
# for each feature (column):
# plot the feature in the final small multiple chart
# render the full final chart
Thank you!
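To complete the comment provided by Chris, I would suggest:
Reshape your dataframe with pd.melt so that all the columns you want to plot end up in a single "value" column, with a "variable" column recording which original column each value came from. Let's call the list of columns that you want to plot columns_to_plot:
df_for_plot = pd.melt(df, value_vars=columns_to_plot)
Then use seaborn's FacetGrid (import seaborn as sns) to draw each variable in its own panel:
# col_wrap starts a new row of panels every 5 columns instead of putting all of them in one row
g = sns.FacetGrid(df_for_plot, col="variable", col_wrap=5)
g = g.map(plt.plot, "value")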
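If you would rather not depend on seaborn, pandas can also draw one subplot per column by itself. This is an alternative to the FacetGrid approach above, not the same technique. A minimal sketch, assuming six made-up numeric features in place of mydataset.csv and an arbitrary grid of three panels per row:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for mydataset.csv: six numeric features.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 6)).cumsum(axis=0),
                  columns=[f"feature{i}" for i in range(1, 7)])

n_cols = 3                                       # panels per row (arbitrary choice)
n_rows = int(np.ceil(len(df.columns) / n_cols))  # enough rows to hold every column

# subplots=True gives each column its own axes; layout arranges them in a grid.
axes = df.plot(subplots=True, layout=(n_rows, n_cols), figsize=(9, 5),
               sharex=True, legend=False, title=list(df.columns))
plt.tight_layout()
```

For 35 features, something like n_cols = 5 gives a 7 x 5 grid; the figure size then needs to grow accordingly.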

Plot the counted values in each column of a data frame in a separate plot

I'm very new to Python and am trying to plot all the columns in my data frame in separate plots.
The data frame has 45 columns, named V1_category, V2_category, V3_category and so on up to V45_category.
Each entry has one of the four values: neutral, pleasant, unpleasant, painful. I need to somehow count how often these 4 values occur in each of the 45 columns and then plot these as 45 individual histograms (possibly in one figure?). I want the plots to be nicely formatted so I guess matplotlib would be the most useful?
Any help or suggestions would be much appreciated! :)
I guess what you need is a bar plot. There are many options for visualizing these categories; see the seaborn tutorial for more.
Below I build a DataFrame that looks like yours:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
categories = ['neutral','pleasant','unpleasant','painful']
df = pd.DataFrame(np.random.choice(categories, (100, 45)),
                  columns=["V" + str(i) + "_category" for i in np.arange(1, 46)])
seaborn expects long-format data, so we reshape the DataFrame with melt:
df.melt()
      variable       value
0  V1_category     neutral
1  V1_category  unpleasant
2  V1_category    pleasant
3  V1_category  unpleasant
4  V1_category  unpleasant
And we pass this directly into seaborn (x must be given as a keyword in recent seaborn versions):
sns.catplot(x='value', data=df.melt(), col="variable",
            kind="count", col_wrap=5, height=5, aspect=2)
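If you also want the counts themselves, or prefer plain matplotlib as mentioned in the question, one option is to build a table of counts first and draw one bar chart per column from it. This is only a sketch on the same kind of random stand-in data; the 9 x 5 grid and the figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

categories = ["neutral", "pleasant", "unpleasant", "painful"]

# Random stand-in data (assumption): 100 rows, 45 columns named V1_category ... V45_category.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice(categories, size=(100, 45)),
                  columns=[f"V{i}_category" for i in range(1, 46)])

# One row per category, one column per V*_category.
# reindex fixes the bar order; fillna(0) covers a category that never occurs in some column.
counts = (df.apply(pd.Series.value_counts)
            .reindex(categories)
            .fillna(0)
            .astype(int))

fig, axes = plt.subplots(9, 5, figsize=(15, 20), sharey=True)
for ax, col in zip(axes.ravel(), counts.columns):
    ax.bar(counts.index, counts[col])           # one bar per category
    ax.set_title(col)
    ax.tick_params(axis="x", labelrotation=45)  # keep the category labels readable
fig.tight_layout()
```

The counts table is useful on its own as well, for example counts.T gives one row per column of the original data.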

How to align bars with tick labels in plt or pandas histogram (when plotting multiple columns)

I have started using python for lots of data problems at work and the datasets are always slightly different. I'm trying to explore more efficient ways of plotting data using the inbuilt pandas function rather than individually writing out the code for each column and editing the formatting to get a nice result.
Background: I'm using Jupyter notebook and looking at histograms where the values are all unique integers.
Problem: I want the xtick labels to align with the centers of the histogram bars when plotting multiple columns of data with the one function e.g. df.hist() to get histograms of all columns at once.
Does anyone know if this is possible?
Or is it recommended to do each graph on its own vs. using the inbuilt function applied to all columns?
I can modify them individually following this post: Matplotlib xticks not lining up with histogram
which gives me what I would like but only for one graph and with some manual processing of the values.
Desired outcome example for one graph:
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.* calls below
# create list of datapoints
data = [[170,30,210],
[170,50,200],
[180,50,210],
[165,35,180],
[170,30,190],
[170,70,190],
[170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
# print dataframe.
df
Code that displays the graphs in the problem statement
df.hist(figsize=(5,5))
plt.show()
Code that displays the graph for weight how I would like it to be for all
df.hist(column='weight',bins=[175,185,195,205,215])
plt.xticks([180,190,200,210])
plt.yticks([0,1,2,3,4,5])
plt.xlim([170, 220])
plt.show()
Any tips or help would be much appreciated!
Thanks
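I hope this helps. Take the column and count the frequency of each value with value_counts, then call sort_index so the bars are ordered by value rather than by frequency, and finally draw a bar plot. Each bar is then labelled with its own value, so the tick labels are centered under the bars.
import pandas as pd
import matplotlib.pyplot as plt
data = [[170,30,210],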
[170,50,200],
[180,50,210],
[165,35,180],
[170,30,190],
[170,70,190],
[170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
df.weight.value_counts().sort_index().plot(kind = 'bar')
plt.show()
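To get this for all columns at once, as the question asks, the same value_counts idea can be applied in a loop over the columns, drawing each one on its own axes. A minimal sketch using the data above; the figure size is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import pandas as pd

data = [[170, 30, 210],
        [170, 50, 200],
        [180, 50, 210],
        [165, 35, 180],
        [170, 30, 190],
        [170, 70, 190],
        [170, 50, 190]]
df = pd.DataFrame(data, columns=["height", "width", "weight"])

# One axes per column, side by side.
fig, axes = plt.subplots(1, len(df.columns), figsize=(9, 3))
for ax, col in zip(axes, df.columns):
    # Bars sit at positions 0, 1, 2, ... and are labelled with the sorted values,
    # so every tick label is centered under its bar.
    df[col].value_counts().sort_index().plot(kind="bar", ax=ax, title=col, rot=0)
fig.tight_layout()
```

One caveat: the x axis here is categorical, so the spacing between bars does not reflect the numeric distance between values (165 and 170 are as far apart as 170 and 180).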

Subplot counts of multiple categorical variables on a single bar chart

I'm trying to create a single barplot from multiple dataframe columns each of which is a categorical variable (all based on the same levels). I want it to show a count of the levels occurring in each column.
The code below achieves what I want, but on 4 different bar plots. I'd like it all to be on one plot, so the bars are side by side (labels/legend would be rad). I'm trying to get a clean, simple solution using matplotlib but so far I can't figure it out. Help?
Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({"A":['cow','pig','horse','goat','cow'], "B":['cow','pig','horse','cow','goat'], "C":['pig','horse','goat','pig','cow'], "D":['cow','pig','horse','horse','goat'], "E":['pig','horse','goat','cow','goat']})
levels = np.sort(df['A'].unique())
df.A.value_counts()[levels].plot(kind='bar')
df.B.value_counts()[levels].plot(kind='bar')
df.C.value_counts()[levels].plot(kind='bar')
df.D.value_counts()[levels].plot(kind='bar')
You should apply pd.Series.value_counts to every column and plot the result as a bar graph, either grouped or stacked.
If you want the bars for each column side by side:
df.apply(pd.Series.value_counts).plot(kind='bar')
If you want them stacked:
df.apply(pd.Series.value_counts).plot(kind='bar', stacked=True)
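Here is a self-contained version of the grouped variant with the question's data, in case it helps to see the intermediate count table. The reindex and fillna steps are an addition on my part to guard against a level that is missing from some column; with this particular data every level occurs in every column.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"A": ['cow', 'pig', 'horse', 'goat', 'cow'],
                   "B": ['cow', 'pig', 'horse', 'cow', 'goat'],
                   "C": ['pig', 'horse', 'goat', 'pig', 'cow'],
                   "D": ['cow', 'pig', 'horse', 'horse', 'goat'],
                   "E": ['pig', 'horse', 'goat', 'cow', 'goat']})

levels = sorted(df["A"].unique())

# Rows are the levels, columns are A..E, cells are how often the level occurs in that column.
counts = (df.apply(pd.Series.value_counts)
            .reindex(levels)
            .fillna(0)
            .astype(int))
print(counts)

# One group of bars per level, one bar per column, with a legend naming the columns.
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("level")
ax.set_ylabel("count")
ax.legend(title="column")
```

Plotting counts.T instead groups the bars by column, with one bar per level.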

Create a checkerboard plot with unbalanced rows and columns

I have a dataset in a format similar to X = [[1,4,5], [34,70,1,5], [43,89,4,11], [22,76,4]], where the inner lists do not all have the same length.
I want to create a checkerboard plot with 4 rows and 4 columns, where the color of each unit box corresponds to the value of the number and a colorbar shows the scale. In this dataset some boxes will be missing (e.g. 4th column, first row).
How would I plot this in python using matplotlib?
Thanks
You can use the seaborn library or matplotlib to generate a heatmap. First convert the data to a pandas DataFrame, which pads the shorter rows with NaN so the missing boxes are left blank.
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# in a Jupyter notebook, "%matplotlib inline" displays figures inline; it is not valid in a plain script
df = pd.DataFrame([[1,4,5],[34,70,1,5], [43,89,4,11],[22,76,4]])
sns.heatmap(df)
plt.show()
Result looks something like this.
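The same idea also works with matplotlib alone, as the question asks. The sketch below masks the missing cells explicitly and draws cell borders for a checkerboard look; the white edge color and the equal aspect ratio are just styling assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

X = [[1, 4, 5], [34, 70, 1, 5], [43, 89, 4, 11], [22, 76, 4]]

# The DataFrame constructor pads the shorter rows with NaN, giving a 4 x 4 float array.
grid = pd.DataFrame(X).to_numpy(dtype=float)
masked = np.ma.masked_invalid(grid)  # masked cells are simply not drawn

fig, ax = plt.subplots()
mesh = ax.pcolormesh(masked, edgecolors="white", linewidth=1)
ax.invert_yaxis()        # put the first row of X at the top
ax.set_aspect("equal")   # square boxes
fig.colorbar(mesh, ax=ax, label="value")
```

The two missing boxes (last column of the first and last rows) show up as empty background.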
