Seaborn Distplot goes unresponsive - python

I am trying to plot a simple Distplot using pandas and seaborn to understand the density of the datasets.
Input
#Car,45
#photo,4
#movie,6
#life,1
#Horse,14
#Pets,20
#run,67
#picture,89
The dataset has more than 10K rows and no headers, and I am trying to use the second column (index 1).
code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('keyword.csv', delimiter=',', header=None, usecols=[1])
#print df
sns.distplot(df)
plt.show()
There is no error, and I can print the input column, but the distplot takes ages to compute and freezes my screen. Any suggestions for speeding this up?
Edit 1: As suggested in the comments below, I changed from pandas.read_csv to np.loadtxt, and now I get an error.
Code:
import numpy as np
from numpy import log as log
import matplotlib.pyplot as plt
import seaborn as sns
import pandas
df = np.loadtxt('keyword.csv', delimiter=',', usecols=(1), unpack=True)
sns.kdeplot(df)
sns.distplot(df)
plt.show()
Error:
Traceback (most recent call last):
File "0_distplot_csv.py", line 7, in <module>
df = np.loadtxt('keyword.csv', delimiter=',', usecols=(1), unpack=True)
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 726, in loadtxt
usecols = list(usecols)
TypeError: 'int' object is not iterable
Edit 2: I tried the suggestions from the comment section.
sns.distplot(df[1])
This behaves the same as described initially: the screen freezes for ages.
sns.distplot(df[1].values)
I see strange behavior in this case.
When the input is
Car,45
photo,4
movie,6
life,1
Horse,14
Pets,20
run,67
picture,89
It does plot, but when the input is as below
#Car,45
#photo,4
#movie,6
#life,1
#Horse,14
#Pets,20
#run,67
#picture,89
It again freezes the entire screen and does nothing.
I tried adding comments=None, thinking the rows might be read as comments, but it looks like comments isn't used in pandas.
Thank you

After several trials and a lot of online searching, I finally got what I was looking for. The code loads data by column number when there are no headers. It also reads the rows that start with #.
code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# comments=None keeps the rows that start with '#'
data = np.genfromtxt('keyword.csv', delimiter=',', comments=None)
d0 = data[:, 1]
# Plot a kernel density estimate of the second column
sns.kdeplot(np.array(d0), color='b', bw=0.5, marker='o', label='keyword')
plt.legend(loc='upper right')
plt.xlabel('Freq(x)')
plt.ylabel('pdf(x)')
#plt.gca().set_xscale("log")
#plt.gca().set_yscale("log")
plt.show()
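The TypeError in Edit 1 comes from usecols=(1): without a trailing comma, (1) is the integer 1 rather than a tuple, and the NumPy version in the traceback tries to call list() on it. As a minimal sketch (the in-memory sample below is a stand-in for keyword.csv, built from the rows shown in the question), np.loadtxt can also read the column once usecols is a real tuple and comments=None is passed so that lines starting with # are not skipped:

```python
import io
import numpy as np

# Hypothetical stand-in for keyword.csv, using the sample rows from the question.
sample = io.StringIO(
    "#Car,45\n#photo,4\n#movie,6\n#life,1\n"
    "#Horse,14\n#Pets,20\n#run,67\n#picture,89\n"
)

# usecols=(1,) is a one-element tuple; (1) is just the integer 1.
# comments=None stops loadtxt from skipping lines that start with '#'.
counts = np.loadtxt(sample, delimiter=",", usecols=(1,), comments=None)

print(counts)
```

The resulting one-dimensional array can be passed to sns.kdeplot in the same way as d0 above. Newer NumPy releases also accept a bare integer for usecols.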

Related

Plot Correlation Table imported from excel with Python

I am trying to plot a correlation matrix (already calculated) in Python. The table looks like this:
And I would like it to look like this:
I am using the following code in Python:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
df = pd.DataFrame(data)
print (df)
corrMatrix = data.corr()
print (corrMatrix)
sn.heatmap(corrMatrix, annot=True)
plt.show()
Note that the matrix is ready and I don't want to calculate the correlation again, but I have not managed to avoid that. Any suggestions?
You are recalculating the correlation with the following line:
corrMatrix = data.corr()
You then go on to utilize this recalculated variable in the heatmap here:
sn.heatmap(corrMatrix, annot=True)
plt.show()
To resolve this, instead of passing corrMatrix, which is the recalculated value, pass the raw Excel data: data or df (df is just a copy of data). Thus, all the code you should need is:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
sn.heatmap(data, annot=True)
plt.show()
Note, however, that this assumes your data is ready for the heatmap, as you suggest. Since we do not have access to your data, we cannot confirm that.
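If the first column of the sheet holds the row names, one way to keep it out of the numeric body is to load it as the index. The sketch below is only an illustration: the 3x3 table and its values are made up, and a CSV string stands in for the Excel file.

```python
import io
import pandas as pd

# Hypothetical 3x3 correlation table; the values are made up for illustration.
raw = io.StringIO(
    "name,CLC,GIEMS,GLWD\n"
    "CLC,1.0,0.8,0.6\n"
    "GIEMS,0.8,1.0,0.7\n"
    "GLWD,0.6,0.7,1.0\n"
)

# index_col=0 turns the first column (the names) into row labels,
# so only numeric cells remain in the body of the frame.
matrix = pd.read_csv(raw, index_col=0)

print(matrix.dtypes)
print(list(matrix.index) == list(matrix.columns))
```

pd.read_excel takes the same index_col argument, and sn.heatmap(matrix, annot=True) then labels both axes from the frame.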
I deleted the first column (names) and added the names back later, so the code is as below:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Users/yousefalbuhaisi/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
fig, ax = plt.subplots(dpi=150)
y_axis_labels = ['CLC','GIEMS','GLWD','LPX_BERN','LPJ_WSL','LPJ_WHyME','SDGVM','DLEM','ORCHIDEE','CLM4ME']
sn.heatmap(data,yticklabels=y_axis_labels, annot=True)
plt.show()
and the results are:

CSV file matplotlib.pyplot graphing error

I am using pandas to import a CSV into my notebook, and I replaced every blank value with a blank space. When I use plt.plot to graph the data, I get a bunch of black lines on the x and y axes. Below is my code and graph:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
apo2data = pd.read_csv('/Users/lilyloyer/Desktop/Apo2excel.csv')
apo2data.isnull()
data = apo2data.fillna(" ")
teff=data['Teff (K)']
grav=data['logg_seis']
plt.plot(teff, grav, 'ro')
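A likely cause, not confirmed in the thread, is that fillna(" ") mixes strings into otherwise numeric columns, so matplotlib no longer sees plain numbers. The sketch below uses made-up values (only the column names come from the question) to show the effect and one alternative, dropping the incomplete rows:

```python
import numpy as np
import pandas as pd

# Made-up values standing in for Apo2excel.csv; column names are from the question.
apo2data = pd.DataFrame({
    "Teff (K)": [4800.0, np.nan, 5100.0, 4950.0],
    "logg_seis": [2.4, 2.9, np.nan, 3.1],
})

# Filling gaps with a string leaves a mix of floats and strings in each column.
filled = apo2data.fillna(" ")
has_strings = filled["Teff (K)"].map(lambda v: isinstance(v, str)).any()

# Dropping incomplete rows keeps both columns as float64.
clean = apo2data.dropna(subset=["Teff (K)", "logg_seis"])

print(has_strings)
print(clean.dtypes)
```

Plotting clean['Teff (K)'] against clean['logg_seis'] with plt.plot should then give ordinary numeric axes.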

Make histogram from CSV file with python

I have written this code to draw a histogram from a .csv file; however, I do not get a histogram but what you see in the image.
How can I fix it?
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('test.csv', header=None)
plt.hist(data)
plt.show()
The first lines of the .csv file are:
-95.725
-78.477
-77.976
-77.01
-73.161
-72.505
-71.794
-71.036
-70.653
-70.476
-69.32
-68.787
-68.234
-67.968
-67.742
-67.611
-67.577
-66.69
-66.381
-66.172
-66.072
-65.773
-64.969
-64.897
-64.603
I'm not sure if this will work, but try adding the keyword parameters bins='auto', density=True and histtype='step' to the plt.hist function.
For example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('test.csv', header=None)
plt.hist(data, bins='auto', density=True, histtype='step')
plt.show()
What they each do is:
bins='auto': Lets numpy automatically decide on the best bin edges;
density=True: Sets the area within the histogram to equal 1.0;
histtype='step': Draws the histogram as an unfilled outline instead of bars.
This and more can all be found in the matplotlib API.
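To see what bins='auto' and density=True do without drawing anything, the same binning can be run through numpy.histogram, which plt.hist uses internally. This is a small sketch using only the first ten values from the listing above:

```python
import numpy as np

# First values from the question's test.csv listing.
values = np.array([-95.725, -78.477, -77.976, -77.01, -73.161, -72.505,
                   -71.794, -71.036, -70.653, -70.476])

# bins='auto' lets NumPy choose the bin edges;
# density=True scales the heights so the total area is 1.
heights, edges = np.histogram(values, bins="auto", density=True)
area = np.sum(heights * np.diff(edges))

print(len(edges) - 1, "bins, area =", area)
```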

How do I make one line in this graph a different color from the rest?

I have a graph, and I would like to make one of my lines a different color.
I tried the matplotlib recommendation, but it just printed two graphs.
import numpy as np
import pandas as pd
import seaborn as sns
data = pd.read_csv("C:\\Users\\Nathan\\Downloads\\markouts_changed_maskedNEW.csv")
data.columns = ["Yeet","Yeet1","Yeet 2","Yeet 3","Yeet 4","Yeet 7","Exchange 5","Yeet Average","Intelligent Yeet"]
mpg = data[data.columns]
mpg.plot(color='green', linewidth=2.5)
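One possible approach, not taken from the original thread, is to pass DataFrame.plot a list with one color per column, so every line is drawn in a single call on a single Axes. The data below is made up, and the choice of "Intelligent Yeet" as the highlighted column is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import numpy as np
import pandas as pd

# Made-up data; only the column names come from the question.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((20, 3)),
                    columns=["Yeet", "Yeet Average", "Intelligent Yeet"])

highlight = "Intelligent Yeet"  # assumed: the line that should stand out
colors = ["red" if col == highlight else "green" for col in data.columns]

# One color per column, all drawn on a single Axes.
ax = data.plot(color=colors, linewidth=2.5)
line_colors = [line.get_color() for line in ax.get_lines()]
print(line_colors)
```

Because there is only one plot call, there is no second figure.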

How to create box plot from pandas object using matplotlib?

I read data into a pandas object and then want to create a box plot using matplotlib (not pandas.boxplot()). This is just for learning purposes. This is my code, in which myData['MyColumn'] fails.
import matplotlib.pyplot as plt
import pandas as pd
myData = pd.read_csv('data/myData.csv')
plt.boxplot(myData['MyColumn'])
plt.show()
Your code works fine with fake data. Check the type of the data you're trying to plot.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
myData = pd.DataFrame(np.random.rand(10, 2), columns=['MyColumn', 'blah'])
plt.boxplot(myData['MyColumn'])
plt.show()
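A quick way to do that check is sketched below, assuming the column was read as text because of a stray non-numeric cell (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical column in which one cell was read as text.
myData = pd.DataFrame({"MyColumn": ["1.5", "2.0", "n/a", "3.5"]})

print(myData["MyColumn"].dtype)  # a non-numeric dtype here points to the problem

# errors="coerce" turns unparseable cells into NaN; dropna then removes them.
values = pd.to_numeric(myData["MyColumn"], errors="coerce").dropna()
print(values.tolist())
```

plt.boxplot(values) can then be called on the cleaned series.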
