How do I find covariance and correlation?


I have two data sets saved in a CSV file, in columns named "avg" and "hu". I want to find the covariance and correlation of these two data sets. I tried the two simple scripts below, but each one gives me an error. What am I doing wrong?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data=pd.read_csv("80hucov.csv")
avg=data["avg"]
hu=data["hu"]
data = np.array(["avg, hu"])
covMatrix = np.cov(data,bias=True)
print (covMatrix)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data=pd.read_csv("80hucov.csv")
data = {'A': ["avg"],
'B': ["hu"],}
df = pd.DataFrame(data,columns=['A','B'])
covMatrix = pd.DataFrame.cov(df)
sn.heatmap(covMatrix, annot=True, fmt='g')
plt.show()

The problem is how you define the array. np.array(["avg, hu"]) creates an array that holds the single string "avg, hu", not the values of the two columns.
Currently you have:
data = np.array(["avg, hu"])
You can do:
data_array = data[['avg', 'hu']].to_numpy()
I recommend using different names for different objects in your code. In your example you use "data" for both your DataFrame and your array.
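One thing to watch for: np.cov treats each row as a variable by default, so a two-column array needs rowvar=False (or a transpose). Here is a minimal sketch of the whole calculation. The sample values are made up and stand in for the CSV, which I do not have; swap the DataFrame construction for your pd.read_csv call.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values standing in for the contents of "80hucov.csv".
df = pd.DataFrame({
    "avg": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hu": [2.1, 3.9, 6.2, 8.1, 9.8],
})

# Two-column array: one row per observation, one column per variable.
data_array = df[["avg", "hu"]].to_numpy()

# rowvar=False tells NumPy that the columns, not the rows, are the variables.
# bias=True normalizes by N (population covariance), as in the question.
cov_matrix = np.cov(data_array, rowvar=False, bias=True)
corr_matrix = np.corrcoef(data_array, rowvar=False)

# pandas equivalents; DataFrame.cov normalizes by N - 1 by default.
cov_pandas = df[["avg", "hu"]].cov()
corr_pandas = df[["avg", "hu"]].corr()

print(cov_matrix)
print(corr_matrix)
```

The off-diagonal entries of each 2x2 matrix are the covariance and the correlation between avg and hu. The correlation does not depend on the N versus N - 1 choice, so the NumPy and pandas correlation matrices agree.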

Related

Plot Correlation Table imported from excel with Python

I am trying to plot a correlation matrix (already calculated) in Python. The table looks like this:
And I would like it to look like this:
I am using the following code in Python:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
df = pd.DataFrame(data)
print (df)
corrMatrix = data.corr()
print (corrMatrix)
sn.heatmap(corrMatrix, annot=True)
plt.show()
Note that the matrix is already computed and I don't want to calculate the correlation again, but I have not managed to avoid that. Any suggestions?

You are recalculating the correlation with the following line:
corrMatrix = data.corr()
You then go on to utilize this recalculated variable in the heatmap here:
sn.heatmap(corrMatrix, annot=True)
plt.show()
To resolve this, do not pass corrMatrix, which holds the recalculated values. Pass the data read from Excel instead, either data or df (df is just a DataFrame built from data). All the code you should need is:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
sn.heatmap(data, annot=True)
plt.show()
Note that this assumes your data really is ready for the heatmap, as you suggest. Since we do not have access to your data, we cannot confirm that.
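If the first column of the spreadsheet holds the row names, one option is to load it as the index so the labels end up on both axes. A rough sketch, assuming that layout: the small table below is invented and read from a string with pd.read_csv so the example is self-contained; pd.read_excel accepts the same index_col argument.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn

# Hypothetical precomputed correlation table; the first column holds the row labels.
csv_text = """name,CLC,GIEMS,GLWD
CLC,1.00,0.62,0.48
GIEMS,0.62,1.00,0.55
GLWD,0.48,0.55,1.00
"""

# index_col=0 turns the label column into the index instead of a data column.
corr = pd.read_csv(io.StringIO(csv_text), index_col=0)

fig, ax = plt.subplots(dpi=150)
# The values are plotted as they are; nothing is recalculated.
sn.heatmap(corr, annot=True, vmin=-1, vmax=1, ax=ax)
# Call plt.show() to display the figure in an interactive session.
```

Fixing vmin and vmax to -1 and 1 is a design choice, not a requirement; it keeps the colour scale comparable between plots.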

I deleted the first column (the names) and added the labels back afterwards, so the code is now:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_excel('/Users/yousefalbuhaisi/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
fig, ax = plt.subplots(dpi=150)
y_axis_labels = ['CLC','GIEMS','GLWD','LPX_BERN','LPJ_WSL','LPJ_WHyME','SDGVM','DLEM','ORCHIDEE','CLM4ME']
sn.heatmap(data,yticklabels=y_axis_labels, annot=True)
plt.show()
and the results are:

Grouped Histogram in Python

Is there a simple way of creating histograms for a continuous variable (mpg) that is filtered by a categorical variable (cyl=4,8)? So essentially I need two histograms for mpg grouped by cyl, one for cyl=4 and one for cyl=8.
Here is an example from a different dataset:

import numpy as np
import pandas as pd
import seaborn as sns
data = pd.DataFrame()
data[4] = np.random.normal(0,10,300)
data[8] = np.random.normal(20,11,300)
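# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a similar figure.
sns.histplot(data[4], color="skyblue", kde=True)
sns.histplot(data[8], color="orange", kde=True)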
I just used random sample data.
I am being a little lazy here, but all you need is the seaborn package.
There are many more options available; you can read about them at https://python-graph-gallery.com/
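For data that sits in one table with a category column, as in the question, a possible approach is to filter and group before plotting. This is only a sketch: the cars DataFrame is randomly generated and merely borrows the mpg and cyl column names from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for a cars table with "mpg" and "cyl" columns.
rng = np.random.default_rng(0)
cars = pd.DataFrame({
    "mpg": np.concatenate([rng.normal(27, 4, 50), rng.normal(15, 3, 50)]),
    "cyl": [4] * 50 + [8] * 50,
})

fig, ax = plt.subplots()
# Keep only cyl 4 and 8, then draw one histogram per cylinder count.
for cyl_value, group in cars[cars["cyl"].isin([4, 8])].groupby("cyl"):
    ax.hist(group["mpg"], bins=15, alpha=0.6, label=f"cyl={cyl_value}")
ax.set_xlabel("mpg")
ax.set_ylabel("count")
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```

Both histograms share one axis here so they can be compared directly; using plt.subplots(1, 2) and one axis per group would give two separate panels instead.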

Taking data from specific columns in a dataset

I need to take data from only 3 columns in my dataset. How do I do this? I am trying to make a correlation graph. This is my code:
import matplotlib.pyplot as plt
import pandas as pd
crimedata = pd.read_csv('MasterFileCSV.csv')
crime_df = pd.DataFrame(crimedata)
plt.matshow(crime_df.corr())
plt.show()
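A possible way to restrict the correlation to three columns is to index the DataFrame with a list of column names before calling corr(). The sketch below uses random numbers and invented column names, since the contents of MasterFileCSV.csv are not shown.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for MasterFileCSV.csv; the column names are made up.
rng = np.random.default_rng(1)
crime_df = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=["burglary", "robbery", "assault", "population", "year"],
)

# Double brackets select several columns and return a DataFrame.
wanted = ["burglary", "robbery", "assault"]
subset = crime_df[wanted]
corr = subset.corr()

fig, ax = plt.subplots()
im = ax.matshow(corr)
# Label the ticks so the cells can be matched to columns.
ax.set_xticks(range(len(wanted)))
ax.set_xticklabels(wanted)
ax.set_yticks(range(len(wanted)))
ax.set_yticklabels(wanted)
fig.colorbar(im)
# Call plt.show() to display the figure in an interactive session.
```

When reading from a file, pd.read_csv also accepts a usecols list, which loads only the named columns in the first place.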

Python: Iris Data Set, include the species

I have the below code and it returns me the min and max values of the chosen column, however, I would also like to include the species that this value relates to. I have also included the column names in the csv file.
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib as mpl
import seaborn as sns
df = pd.read_csv("iris head.csv")
print(min(df['Sepal Length']))
print(max(df['Sepal Length']))

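If I understand correctly, you want the max and min per species. With the species stored in a column named class, that is: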
df.groupby(['class'])['Sepal Length'].agg(['max','min'])
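If what you want instead is the species that belongs to the overall minimum and maximum, idxmin and idxmax can look up the matching rows. A small sketch with invented rows; the column names "Sepal Length" and "class" are taken from the code above and may differ in your file.

```python
import pandas as pd

# Hypothetical rows standing in for "iris head.csv".
df = pd.DataFrame({
    "Sepal Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# idxmin/idxmax return the index label of the first smallest/largest value.
min_row = df.loc[df["Sepal Length"].idxmin(), ["class", "Sepal Length"]]
max_row = df.loc[df["Sepal Length"].idxmax(), ["class", "Sepal Length"]]

print(min_row["class"], min_row["Sepal Length"])
print(max_row["class"], max_row["Sepal Length"])
```

If several rows tie for the minimum or maximum, only the first one is returned.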

Histogram for multiple dataframes with different sizes in Pandas?

I am trying to generate a histogram with multiple legend entries (example). The problem is that the DataFrames have different lengths. The following code would have worked if the sizes (30 and 10 in this example) were the same. Is there a way to still generate a histogram that lets me compare multiple data series?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
orig = pd.DataFrame(np.random.random(30))
short = pd.DataFrame(np.random.random(10))
combine = pd.DataFrame({'orig' : orig, 'short' : short})
plt.figure()
h = combine.plot(kind='hist', logy=True)
f = h.get_figure()
f.savefig('figures/combined.png')
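One approach that may work is to build the combined frame from Series rather than DataFrames. pandas aligns Series on their index and pads the shorter one with NaN, and as far as I know the histogram plot skips those missing values. A sketch under that assumption, with saving to figures/combined.png left out so it does not depend on the directory existing:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
orig = pd.Series(rng.random(30))
short = pd.Series(rng.random(10))

# Series of different lengths are aligned on their index;
# rows 10..29 of "short" become NaN.
combine = pd.DataFrame({"orig": orig, "short": short})

# One histogram per column, drawn on shared bins with a log-scaled y axis.
ax = combine.plot(kind="hist", logy=True, alpha=0.5, bins=10)
fig = ax.get_figure()
# Call plt.show() or fig.savefig(...) to display or save the figure.
```

Because the two series have different sizes, the raw counts are not directly comparable; passing density=True to plot is one way to normalize them.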
