I'm trying to create a heatmap, following the approach from this question:
Making heatmap from pandas DataFrame
My dataframe looks like the following picture:
I tried the following code:
years = ["1860","1870", "1880","1890","1900","1910","1920","1930","1940","1950","1960","1970","1980","1990","2000"]
kantons = ["AG","AI","AR","BE","BL","BS","FR","GE","GL","GR","JU","LU","NE","NW","OW","SG","SH","SO","SZ","TG","TI","UR","VD","VS","ZG","ZH"]
df = pd.DataFrame(abs(dfYears), index=years, columns=kantons)
which raises an exception saying that:
"AG" can not be used as float
So I wondered whether I need to drop the index column, which does not seem to be possible.
Any suggestions?
When replicating similar data, you can do:
import pandas as pd
import numpy as np
years = ["1860","1870", "1880","1890","1900","1910","1920","1930","1940","1950","1960","1970","1980","1990","2000"]
kantons = ["AG","AI","AR","BE","BL","BS","FR","GE","GL","GR","JU","LU","NE","NW","OW","SG","SH","SO","SZ","TG","TI","UR","VD","VS","ZG","ZH"]
df = pd.DataFrame(np.random.randint(low=10000, high=200000, size=(15, 26)), index=years, columns=kantons)
df.style.background_gradient(cmap='Reds')
Pandas has some built-in styles for the most common visualization needs. The .background_gradient method is a simple way to highlight cells based on their values. Its cmap parameter selects the color map, chosen from the matplotlib colormaps.
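As for the exception in the question, one possible cause (an assumption, since the original dataframe is only shown as a picture) is that the canton codes sit in an ordinary column, so abs() runs into strings. A minimal sketch with made-up values, where the column name kanton is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for dfYears: canton codes stored as a regular column.
dfYears = pd.DataFrame({
    "kanton": ["AG", "AI", "AR"],
    "1860": [-1.5, 2.0, -3.25],
    "1870": [4.0, -5.5, 6.0],
})

# Move the labels into the index so that only numeric columns remain,
# then take the absolute value of every cell.
numeric = dfYears.set_index("kanton").abs()

# Transpose so that years are the rows and cantons are the columns.
heat = numeric.T
print(heat)
```

The resulting heat frame can then be styled with heat.style.background_gradient(cmap='Reds') as shown above.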
I want to graph 3 plots horizontally side by side
Three graphs are generated using the code below:
df.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.bar()
df1.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.bar()
df2.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.bar()
I'm not sure where to set axes in this case. Any help would be appreciated.
You may simply use pandas' barh function.
df.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.barh()
This is an example, using this approach to create a dataframe with random samples:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.groupby(pd.cut(df.A, [0,10,20,30,40,50,60,70,80,90,100])).A.mean().plot.barh()
This snippet outputs a horizontal bar chart showing the mean of A in each bin.
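If the goal is instead to place three separate plots next to each other, one common approach is to create the axes first with plt.subplots and pass each one to pandas through the ax argument. A minimal sketch, assuming three frames with columns col1 and col2 (the random data below is made up for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Three hypothetical frames standing in for df, df1 and df2.
frames = [
    pd.DataFrame({"col1": rng.uniform(0, 2, 50), "col2": rng.normal(size=50)})
    for _ in range(3)
]

# One row, three columns of axes; sharey keeps the bar heights comparable.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4), sharey=True)

for ax, frame in zip(axes, frames):
    bins = pd.cut(frame.col1, [0, 1, 2])
    means = frame.groupby(bins, observed=False).col2.mean()
    means.plot.bar(ax=ax)  # draw on the given axes instead of a new figure

fig.tight_layout()
```

Calling plt.show() afterwards displays the figure when running as a script.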
I'm learning Python, specifically Pandas and Matplotlib at the moment. I have a dataset of Premier League hat-trick scorers and have been using pandas to do some basic analysis. I then want to produce a bar chart based on this data extract. I have been able to create a bar chart, but the X axis shows 'nan' instead of the player names.
My code to extract the data...
import matplotlib.pyplot as plt
import pandas as pd
top10 = df.groupby(['Player'])[['Goals']].count().sort_values(['Goals'],ascending=False).head(10)
I know the result is a Pandas DataFrame, because printing the type of 'top10' gives:
<class 'pandas.core.frame.DataFrame'>
This produces the following if printed out...
I tried to create a chart directly from this DataFrame, but was given the error message 'KeyError: Player'.
So I made a new dataframe and plotted that, which was partly successful, but it displayed 'nan' on the X axis:
top10df = pd.DataFrame(top10,columns=['Player','Goals'])
top10df.plot(x ='Player', y='Goals', kind='bar')
plt.show()
When I created a dataframe manually, it worked, so I'm unsure where to go from here. I tried googling and searching Stack Overflow with no success. Any ideas, please?
You could plot directly using the groupby results in the following way:
top10.plot(kind='bar', title='Your Title', ylabel='Goals',
xlabel='Player', figsize=(6, 5))
A dummy example, since you did not supply your data (next time it's best to do so):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'category': list('XYZXY'),
'sex': list('mfmff'),
'ThisColumnIsNotUsed': range(5,10)})
x = df.groupby('sex').count()
x.plot(kind='bar', ylabel='category', xlabel='sex')
This gives a bar chart with one group of bars per value of sex.
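As for why the original attempt showed 'nan': after the groupby, Player is the index rather than a column, so pd.DataFrame(top10, columns=['Player','Goals']) creates a new, empty Player column. One way around this is reset_index(). A minimal sketch with made-up data (the player names are placeholders):

```python
import pandas as pd

# Hypothetical data standing in for the hat-trick dataset.
df = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "A", "B"],
    "Goals": [3, 3, 3, 4, 3, 3],
})

top10 = (df.groupby(["Player"])[["Goals"]].count()
           .sort_values(["Goals"], ascending=False)
           .head(10))

# 'Player' is the index here, so asking for it as a column only reindexes:
# the new 'Player' column is filled with NaN.
rebuilt = pd.DataFrame(top10, columns=["Player", "Goals"])
print(rebuilt["Player"].isna().all())

# reset_index() turns the index into a real column that plot() can use.
top10df = top10.reset_index()
ax = top10df.plot(x="Player", y="Goals", kind="bar")
```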
I'm using the code below to get Segment and Year on the x-axis and Final_Sales on the y-axis, but it throws an error.
CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
order = pd.read_excel("Sample.xls", sheet_name = "Orders")
order["Year"] = pd.DatetimeIndex(order["Order Date"]).year
result = order.groupby(["Year", "Segment"]).agg(Final_Sales=("Sales", sum)).reset_index()
bar = plt.bar(x = result["Segment","Year"], height = result["Final_Sales"])
ERROR
Can someone help me correct my code so that I get the output shown below?
Required Output
Try adding another pair of brackets: result[["Segment","Year"]].
What you wrote tries to retrieve a single column named ("Segment","Year"),
whereas what you actually want is to retrieve a list of columns, ["Segment","Year"].
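The difference can be checked on a small frame. This is a minimal sketch; the values are made up and only the column names come from the question:

```python
import pandas as pd

# Hypothetical stand-in for the aggregated result frame.
result = pd.DataFrame({
    "Year": [2014, 2014, 2015],
    "Segment": ["Consumer", "Corporate", "Consumer"],
    "Final_Sales": [100.0, 80.0, 120.0],
})

# Single brackets with two names look up ONE column keyed by a tuple.
try:
    result["Segment", "Year"]
    raised = False
except KeyError:
    raised = True
print(raised)

# Double brackets pass a list and return a two-column DataFrame.
subset = result[["Segment", "Year"]]
print(subset.shape)
```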
There are several problems with your code:
When using several columns to index a dataframe you want to pass a list of columns to [] (see the docs) as follows :
result[["Segment","Year"]]
From the figure you provide, it looks like you want to use year as hue. matplotlib's plt.bar doesn't have a hue argument, so you would have to build the grouping manually, as described here. Instead, you can use the seaborn library, which you are already importing anyway (see https://seaborn.pydata.org/generated/seaborn.barplot.html):
sns.barplot(x = 'Segment', y = 'Final_Sales', hue = 'Year', data = result)
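If seaborn is not an option, a similar grouped bar chart can be sketched with pandas alone by pivoting the years into columns. This is an alternative to the seaborn call above, not part of the original answer, and the sales figures are made up:

```python
import pandas as pd

# Hypothetical stand-in for the aggregated result frame.
result = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2015],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 90.0],
})

# One row per segment, one column per year.
wide = result.pivot(index="Segment", columns="Year", values="Final_Sales")

# Each column becomes one bar colour, grouped by segment on the x-axis.
ax = wide.plot.bar()
ax.set_ylabel("Final_Sales")
```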
I'm attempting to put a MongoDB database that I've imported with PyMongo into a pandas dataframe and then plot it by time with a "date" column of type datetime64 with matplotlib. However, I'm getting randomly connected dates. Does anyone know how I might fix this problem?
The date column seems to be unsorted. To reproduce consider e.g.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(15,5), columns=list("ABCDE"))
a = np.arange("2018-05-05", "2018-05-20", dtype="datetime64[D]")
np.random.shuffle(a)
df["date"] = a
plt.plot("date", "C", data=df)
plt.show()
If we sort the dataframe by the date column now,
df.sort_values(by="date", inplace=True)
the result looks much nicer.
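A variant that avoids inplace=True and checks the ordering explicitly might look like the following. It is a sketch on made-up data; is_monotonic_increasing is only used here as a quick check of whether sorting is needed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Shuffled dates with random values, mimicking an unsorted import.
dates = np.arange("2018-05-05", "2018-05-20", dtype="datetime64[D]")
df = pd.DataFrame({"date": rng.permutation(dates), "C": rng.random(15)})

# Quick check: is the date column already in ascending order?
print(df["date"].is_monotonic_increasing)

# sort_values returns a new, sorted frame; the original is left untouched.
df_sorted = df.sort_values(by="date").reset_index(drop=True)
ax = df_sorted.plot(x="date", y="C")
```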
A tangential remark here: I would recommend settling on one style, either
plt.plot("date", "C", data=df)
or
plt.plot(df["date"], df["C"])
and not mix the two by supplying the x argument as Series and the y as string.
I'm trying to present a data table collected from firewall logs as a histogram, so that I have one bar for each date in the file, with the number of occurrences of each value in a certain column stacked in the bar.
I looked into several examples here, but they all seemed to assume that I know which values occur in that column. What I'm trying to achieve is a way to plot the histogram without needing to know all possible values in advance.
In the example I have used protocol as the column:
#!/usr/bin/python
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
csvs = glob.glob("*log-export.csv")
dfs = [pd.read_csv(csv, sep="\xff", engine="python") for csv in csvs]
df_merged = pd.concat(dfs).fillna("")
data = df_merged[['date', 'proto']]
np_data = np.array(data)
plt.hist(np_data, stacked=True)
plt.show()
But this produces the following diagram:
[image: histogram]
and I would like to achieve something like this:
[image: stacked]
Any suggestions how to achieve this?
Setup
I had to make up data because you didn't provide any.
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(
Date=pd.date_range(end=pd.to_datetime('now'), periods=100, freq='H'),
Proto=np.random.choice('UDP TCP ICMP'.split(), 100, p=(.3, .5, .2))
))
Solution
Use pd.crosstab then plot
pd.crosstab(df.Date.dt.date, df.Proto).plot.bar(stacked=True)
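To see what pd.crosstab produces before plotting, here is a small sketch on a handful of made-up log rows. The protocol values are discovered from the data, so they do not need to be known in advance:

```python
import pandas as pd

# Hypothetical log entries: two days, three protocols.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 12:00",
        "2024-01-02 09:00", "2024-01-02 10:00",
    ]),
    "Proto": ["TCP", "UDP", "TCP", "ICMP", "TCP"],
})

# Rows are calendar dates, columns are the protocols found in the data,
# and each cell counts how often that protocol occurred on that date.
counts = pd.crosstab(df.Date.dt.date, df.Proto)
print(counts)

# One bar per date, with the per-protocol counts stacked inside it.
ax = counts.plot.bar(stacked=True)
```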