I have the following pandas DataFrame. alfa_value and beta_value are random; ndcg is the parameter that should decide the color.
The question is: how do I draw a heatmap of the pandas DataFrame?
You can use the code below to generate a heatmap. You will need to adjust the bins to suit your data (look at the mean, the standard deviation, and so on).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(2022)
df = pd.DataFrame({'alfa_value': rng.integers(1000, 10000, 1000),
'beta_value': rng.random(1000),
'ndcg': rng.random(1000)})
out = df.pivot_table('ndcg', pd.cut(df['alfa_value'], bins=10),
pd.cut(df['beta_value'], bins=10), aggfunc='mean')
sns.heatmap(out)
plt.tight_layout()
plt.show()
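If the default bins=10 does not suit your data, here is a minimal sketch of one way to take control of the binning. It reuses the same random data; the edge arrays alfa_edges and beta_edges are my own assumption (5 equal-width bins per axis), not something from the question. The count table is only there to spot sparsely populated cells whose mean would be unreliable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)
df = pd.DataFrame({
    "alfa_value": rng.integers(1000, 10000, 1000),
    "beta_value": rng.random(1000),
    "ndcg": rng.random(1000),
})

# Hypothetical explicit bin edges: 5 equal-width bins on each axis.
alfa_edges = np.linspace(1000, 10000, 6)
beta_edges = np.linspace(0.0, 1.0, 6)

# include_lowest=True keeps values that sit exactly on the first edge.
alfa_bins = pd.cut(df["alfa_value"], bins=alfa_edges, include_lowest=True)
beta_bins = pd.cut(df["beta_value"], bins=beta_edges, include_lowest=True)

# Mean ndcg per cell: this is the table you would pass to sns.heatmap.
grid = df.pivot_table(values="ndcg", index=alfa_bins, columns=beta_bins,
                      aggfunc="mean", observed=False)

# Number of rows behind each cell, to judge how trustworthy each mean is.
counts = df.pivot_table(values="ndcg", index=alfa_bins, columns=beta_bins,
                        aggfunc="count", observed=False)
```

Passing grid to sns.heatmap works exactly like out in the code above; fewer bins give smoother colors, more bins give finer detail but noisier means.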
In general, Seaborn's heatmap function is a nice way to color pandas' DataFrames based on their values. Good examples and descriptions can be found here.
Since you seem to want to color the row based on a different column, you are probably looking for something more like these answers.
I am trying to plot a correlation matrix (already calculated) in Python. The table looks like this:
And I would like it to look like this:
I am using the following code in Python:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
df = pd.DataFrame(data)
print(df)
corrMatrix = data.corr()
print(corrMatrix)
sn.heatmap(corrMatrix, annot=True)
plt.show()
Note that the matrix is already computed and I don't want to calculate the correlation again, but I have not managed to do that. Any suggestions?
You are recalculating the correlation with the following line:
corrMatrix = data.corr()
You then go on to utilize this recalculated variable in the heatmap here:
sn.heatmap(corrMatrix, annot=True)
plt.show()
To resolve this, instead of passing in corrMatrix, which holds the recalculated values, pass the raw Excel data, i.e. data or df (df is just a copy of data). So all the code you should need is:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_excel('/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
sn.heatmap(data, annot=True)
plt.show()
Note that this assumes, however, that your data IS ready for the heatmap, as you suggest. Since we do not have access to your data, we cannot confirm that.
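One way to check whether the sheet is ready is sketched below. Since the Excel file is not available here, raw is a made-up stand-in with invented numbers, and the column name "name" for the label column is an assumption. With the real file, pd.read_excel(..., index_col=0) should have the same effect as the set_index call.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Excel sheet: the first column holds the row labels.
raw = pd.DataFrame({
    "name": ["CLC", "GIEMS", "GLWD"],
    "CLC": [1.0, 0.4, 0.2],
    "GIEMS": [0.4, 1.0, 0.6],
    "GLWD": [0.2, 0.6, 1.0],
})

# Move the label column into the index so only numbers remain in the body.
corr = raw.set_index("name")

# Basic sanity checks for a precomputed correlation matrix.
values = corr.to_numpy()
is_square = corr.shape[0] == corr.shape[1]
is_symmetric = np.allclose(values, values.T)
has_unit_diagonal = np.allclose(np.diag(values), 1.0)
```

If all three checks pass, sn.heatmap(corr, annot=True) should pick up the row and column labels from the index and the columns without any further work.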
I deleted the first column (the names) and added the names back afterwards as tick labels, so the code now looks like this:
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_excel('/Users/yousefalbuhaisi/Desktop/wetchimp_global/corr/correlation_matrix.xlsx')
fig, ax = plt.subplots(dpi=150)
y_axis_labels = ['CLC','GIEMS','GLWD','LPX_BERN','LPJ_WSL','LPJ_WHyME','SDGVM','DLEM','ORCHIDEE','CLM4ME']
sn.heatmap(data, yticklabels=y_axis_labels, annot=True)
plt.show()
and the results are:
Is there a simple way of creating histograms for a continuous variable (mpg) that is filtered by a categorical variable (cyl=4,8)? So essentially I need two histograms for mpg grouped by cyl, one for cyl=4 and one for cyl=8.
Here is an example from a different dataset:
import numpy as np
import pandas as pd
import seaborn as sns
data = pd.DataFrame()
data[4] = np.random.normal(0,10,300)
data[8] = np.random.normal(20,11,300)
# distplot is deprecated in recent seaborn versions; histplot with kde=True replaces it
sns.histplot(data[4], color="skyblue", kde=True, stat="density")
sns.histplot(data[8], color="orange", kde=True, stat="density")
I just used a random sample of my own.
I am being a little lazy here, but all you need is the seaborn package.
There are many more options available; you can read more at https://python-graph-gallery.com/
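For data that is stored in long format (one mpg column and one cyl column, as in the question), a minimal sketch of the filtering and grouping step could look like this. The cars frame is randomly generated for illustration and is not the real dataset; the choice of 15 shared bins is also an assumption.

```python
import numpy as np
import pandas as pd

# Made-up long-format data standing in for the real mpg/cyl table.
rng = np.random.default_rng(0)
cars = pd.DataFrame({
    "cyl": rng.choice([4, 6, 8], size=300),
    "mpg": rng.normal(20, 5, size=300),
})

# Keep only the two categories of interest.
subset = cars[cars["cyl"].isin([4, 8])]

# Shared bin edges so that the two histograms are directly comparable.
edges = np.histogram_bin_edges(subset["mpg"], bins=15)

# One array of bin counts per cyl value.
hists = {cyl: np.histogram(group["mpg"], bins=edges)[0]
         for cyl, group in subset.groupby("cyl")}
```

To draw this directly, sns.histplot(data=subset, x="mpg", hue="cyl") should give one histogram per cyl value on the same axes.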
I've simulated a Brownian motion with 1000 steps, repeated 10000 times, and taken the last value of each run. I was able to plot the histogram with the following code:
import seaborn as sns
import matplotlib.pyplot as plt
# BM is the array of values generated by the Brownian motion
fig, (ax1,ax2) = plt.subplots(2)
ax1.hist(BM[:,-1],12)
I've also been able to draw the KDE as follows, but I am unable to merge the two diagrams. Can someone please help me?
sns.kdeplot(data=BM[:,-1])
Try with sns.kdeplot(BM['col1']) where 'col1' is the name of the column you want to plot.
I'll give you a reproducible example that works for me.
import seaborn as sns
import pandas as pd
import numpy as np
BM = pd.DataFrame(np.array([-0.00871515, -0.0001227, -0.01449098, 0.01808527, 0.00074193, 0.01145541]),
                  columns=['col1'])
BM.head(2)
col1
0 -0.008715
1 -0.000123
sns.kdeplot(BM['col1'])
Edit based on your additional question:
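To get the histogram and the KDE in one plot, use this (distplot is deprecated in recent seaborn versions; histplot with kde=True replaces it):
sns.histplot(BM['col1'], kde=True)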
I want to graph the distribution of my data, but I'm not sure how to do that.
I tried using plt.hist but it failed: I only got one column!
Here is my code:
import pymssql
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
conn = pymssql.connect(server='MyServer', database='MyDB')
df = pd.read_sql('EXEC [Stat_EDFlow] [2018-03-01], [2019-02-28]', conn)
conn.close()
plt.hist(df['MyColumn'])
plt.show()
The reason for this is the way bins are calculated.
You have some outliers in your data, which is causing the plot to "zoom out" in an effort to show all of them.
One way you can resolve this issue is to remove the outliers (say, everything past the 95th percentile) and specify the number of bins:
df.loc[df['MyColumn'] < df['MyColumn'].quantile(0.95), 'MyColumn'].plot.hist(bins=25)
If this doesn't work, decrease the threshold from 0.95.
Another way is to specify the bins directly:
df['MyColumn'].plot.hist(bins=np.linspace(0, 100, 25))
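To see the effect without access to your database, here is a small sketch on synthetic data. The numbers are invented: 990 "normal" values around 50 plus 10 outliers around 5000. It shows how the outliers squeeze almost everything into one bin, and how trimming at the 95th percentile spreads the counts out again.

```python
import numpy as np
import pandas as pd

# Made-up data: a bulk of ordinary values plus a handful of extreme outliers.
rng = np.random.default_rng(1)
values = pd.Series(
    np.concatenate([rng.normal(50, 10, 990), rng.normal(5000, 100, 10)]),
    name="MyColumn",
)

# Default-style histogram: the bin range is stretched to cover the outliers,
# so the whole bulk of the data falls into the first bin.
raw_counts, raw_edges = np.histogram(values, bins=10)

# Drop everything at or above the 95th percentile, then bin again.
trimmed = values[values < values.quantile(0.95)]
trimmed_counts, trimmed_edges = np.histogram(trimmed, bins=25)
```

Comparing raw_counts with trimmed_counts makes the "single column" symptom visible in numbers before you plot anything.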
I think you're looking for the bins= keyword.
You can either provide an integer number of bins, or explicit bin edges such as np.arange(min, max, dist).
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
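A quick sketch of the two forms, using np.histogram (which takes the same bins argument as plt.hist). The range -4 to 4 and the width 0.5 are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=100)

# Integer: 20 equal-width bins between the smallest and largest value.
counts_int, edges_int = np.histogram(sample, bins=20)

# Explicit edges: bins of width 0.5 from -4 to 4; values outside are ignored.
fixed_edges = np.arange(-4.0, 4.5, 0.5)
counts_fixed, edges_fixed = np.histogram(sample, bins=fixed_edges)
```

The integer form adapts to the data range, while explicit edges keep the bins identical between different datasets, which helps when comparing plots.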
EDIT:
To have a line plot, you can use something like:
import matplotlib.pyplot as plt
import numpy as np
synthetic=np.random.normal(size=100)
fig=plt.figure(figsize=(5,5))
y,binEdges=np.histogram(synthetic,bins=20) #we want 20 bins
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
plt.plot(bincenters,y,c='k')
I'm trying to present a data table collected from firewall logs as a histogram, so that I have one bar for each date in the file, with the number of occurrences of each value in a certain column stacked in the bar.
I looked at several examples here, but they all seemed to assume that I know in advance which values appear in that column. What I'm trying to achieve is a way to draw the histogram without needing to know all possible values.
In the example I have used protocol as the column:
#!/usr/bin/python
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
csvs = glob.glob("*log-export.csv")
dfs = [pd.read_csv(csv, sep="\xff", engine="python") for csv in csvs]
df_merged = pd.concat(dfs).fillna("")
data = df_merged[['date', 'proto']]
np_data = np.array(data)
plt.hist(np_data, stacked=True)
plt.show()
But this shows the following diagram:
[image: histogram]
and I would like to accomplish something like this:
[image: stacked]
Any suggestions how to achieve this?
Setup
I had to make up data because you didn't provide any.
df = pd.DataFrame(dict(
Date=pd.date_range(end=pd.to_datetime('now'), periods=100, freq='H'),
Proto=np.random.choice('UDP TCP ICMP'.split(), 100, p=(.3, .5, .2))
))
Solution
Use pd.crosstab then plot
pd.crosstab(df.Date.dt.date, df.Proto).plot.bar(stacked=True)
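To show what crosstab hands to the plot, here is a tiny sketch with five hand-written rows. The column names date and proto follow the question's df_merged; the values themselves are made up. Note that the protocol names are never listed anywhere in the code: crosstab discovers them from the data.

```python
import pandas as pd

# Five made-up log rows; the protocols are not declared anywhere in advance.
logs = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
    "proto": ["TCP", "UDP", "TCP", "ICMP", "TCP"],
})

# One row per date, one column per protocol found in the data, cells are counts.
counts = pd.crosstab(logs["date"], logs["proto"])

# counts.plot.bar(stacked=True) would then draw one stacked bar per date.
```

Combinations that never occur (for example ICMP on 2019-01-01 here) get a count of 0, so every bar has the same set of segments.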