Pyplot stacked histogram - number of occurrences in a column - python

I'm trying to present a data table collected from firewall logs in a histogram so that I have one bar for each date in the file, with the number of occurrences of each value in a certain column stacked in the bar.
I looked into several examples here, but they all seemed to assume that I know what values there are in the particular column. What I'm trying to achieve is a way to draw the histogram without needing to know all possible values in advance.
In the example I have used protocol as the column:
#!/usr/bin/python
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
csvs = glob.glob("*log-export.csv")
dfs = [pd.read_csv(csv, sep="\xff", engine="python") for csv in csvs]
df_merged = pd.concat(dfs).fillna("")
data = df_merged[['date', 'proto']]
np_data = np.array(data)
plt.hist(np_data, stacked=True)
plt.show()
But this shows the following diagram:
[image: histogram]
and I would like to accomplish something like this:
[image: stacked]
Any suggestions on how to achieve this?

Setup
I had to make up data because you didn't provide any.
df = pd.DataFrame(dict(
    Date=pd.date_range(end=pd.to_datetime('now'), periods=100, freq='H'),
    Proto=np.random.choice('UDP TCP ICMP'.split(), 100, p=(.3, .5, .2))
))
Solution
Use pd.crosstab then plot
pd.crosstab(df.Date.dt.date, df.Proto).plot.bar(stacked=True)
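To see why this works without knowing the protocol values up front, here is a minimal sketch. The `logs` frame, its dates, and its protocols are made up for illustration, and it assumes the date column can be parsed as datetimes. `pd.crosstab` discovers the distinct values itself and produces one row per date and one column per value, which is the shape a stacked bar plot needs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged firewall log (made-up data).
rng = np.random.default_rng(0)
logs = pd.DataFrame({
    "date": pd.to_datetime(
        rng.choice(["2023-01-01", "2023-01-02", "2023-01-03"], 60)
    ),
    "proto": rng.choice(["UDP", "TCP", "ICMP"], 60),
})

# One row per calendar date, one column per protocol found in the data.
counts = pd.crosstab(logs["date"].dt.date, logs["proto"])
print(counts)

# Each row becomes one bar, each column one stacked segment.
ax = counts.plot.bar(stacked=True)
ax.set_ylabel("number of log lines")
```

Every log line lands in exactly one cell of `counts`, so the bar heights add up to the number of rows per date.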

Related

Plot heatmap from pandas Dataframe

I have the following pandas Dataframe. alfa_value and beta_value are random, ndcg shall be the parameter deciding the color.
The question is: how do I do a heatmap of the pandas Dataframe?
You can use the code below to generate a heatmap. You have to adjust the bins to group your data (analyze the mean, the std, ...)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(2022)
df = pd.DataFrame({'alfa_value': rng.integers(1000, 10000, 1000),
                   'beta_value': rng.random(1000),
                   'ndcg': rng.random(1000)})
out = df.pivot_table('ndcg', pd.cut(df['alfa_value'], bins=10),
                     pd.cut(df['beta_value'], bins=10), aggfunc='mean')
sns.heatmap(out)
plt.tight_layout()
plt.show()
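The binning step is the part that usually needs tuning, so it can help to look at the intermediate table before coloring it. The sketch below rebuilds the same random data and only inspects the pivoted grid. The names `alfa_bins`, `beta_bins`, and `grid` are illustrative, and passing `observed=False` explicitly is an assumption made here to keep empty bin combinations in the grouping.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)
df = pd.DataFrame({"alfa_value": rng.integers(1000, 10000, 1000),
                   "beta_value": rng.random(1000),
                   "ndcg": rng.random(1000)})

# Cut each axis into 10 equal-width intervals.
alfa_bins = pd.cut(df["alfa_value"], bins=10)
beta_bins = pd.cut(df["beta_value"], bins=10)

# Mean ndcg per (alfa interval, beta interval) cell.
grid = df.pivot_table(values="ndcg", index=alfa_bins, columns=beta_bins,
                      aggfunc="mean", observed=False)
print(grid.shape)
print(grid.iloc[:3, :3])
```

Each cell of `grid` is the average `ndcg` of the rows that fall into that pair of intervals, and this is the table that the heatmap call colors.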
In general, Seaborn's heatmap function is a nice way to color pandas' DataFrames based on their values. Good examples and descriptions can be found here.
Since you seem to want to color the row based on a different column, you are probably looking for something more like these answers.

Grouped Histogram in Python

Is there a simple way of creating histograms for a continuous variable (mpg) that is filtered by a categorical variable (cyl=4,8)? So essentially I need two histograms for mpg grouped by cyl, one for cyl=4 and one for cyl=8.
Here is an example from a different dataset:
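import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.DataFrame()
data[4] = np.random.normal(0, 10, 300)
data[8] = np.random.normal(20, 11, 300)
# sns.distplot is deprecated; histplot with kde=True draws a comparable
# density histogram with a smoothed curve on top
sns.histplot(data[4], color="skyblue", kde=True, stat="density")
sns.histplot(data[8], color="orange", kde=True, stat="density")
plt.show()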
I just used a random sample.
I am being a little lazy here, but all you need is the seaborn package.
There are many more options you can use, so please read more here: https://python-graph-gallery.com/
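If the data is in long form, with one `mpg` column and one `cyl` column as in the question, a plain pandas and matplotlib version might look like the sketch below. The `cars` frame is made-up data, not the real mtcars values, and sharing the bin edges between groups is a design choice made here so the two histograms are directly comparable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical long-form data: one row per car (made-up values).
rng = np.random.default_rng(0)
cars = pd.DataFrame({
    "cyl": np.repeat([4, 8], 150),
    "mpg": np.concatenate([rng.normal(27, 4, 150), rng.normal(15, 3, 150)]),
})

# Shared bin edges so both groups use the same x-axis intervals.
bins = np.linspace(cars["mpg"].min(), cars["mpg"].max(), 21)

fig, ax = plt.subplots()
for cyl, group in cars.groupby("cyl"):
    ax.hist(group["mpg"], bins=bins, alpha=0.6, label=f"cyl={cyl}")
ax.set_xlabel("mpg")
ax.set_ylabel("count")
ax.legend()
```

To restrict the plot to particular groups, filter first, for example with `cars[cars["cyl"].isin([4, 8])]`.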

How to plot multiple data one after another in the same graph using Python Pandas DataFrame

I was trying to visualize a facebook stock dataset, where the data for 2014 to 2018 is stored. The dataset looks like this: dataset screenshot
My goal is to visualize the closing column, but by year. That is, year 2014, then 2015 and so on, but they should be in one figure, and one after another. Something like this: expected graph image
But whatever I try, all the graph parts start from index 0, instead of continuing from the end of the previous one. Here's what I got: the graph I generated
Please help me to solve this problem. Thanks!
The most straightforward way is simply to create separate dataframes with empty
values for the non-needed dates.
Here I use an example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.random.randint(0, 100, size=100),
    index=pd.date_range(start="2020-01-01", periods=100, freq="D"),
)
Then you can create and select the data to plot
df1 = df.copy()
df2 = df.copy()
df1[df.index > pd.to_datetime('2020-02-01')] = np.nan
df2[df.index < pd.to_datetime('2020-02-01')] = np.nan
And then simply plot these on the same axis.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(18, 8))
ax.plot(df1)
ax.plot(df2)
The result
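For the yearly case in the question, an alternative sketch is to keep the dates as the index and plot one segment per year on a shared axis. Because every segment uses its real dates as x-values, each year starts where the previous one ended. The `stock` frame below is a made-up random walk standing in for the real closing prices, and the `Close` column name is an assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical closing prices on business days, 2014 through 2018.
rng = np.random.default_rng(42)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="B")
stock = pd.DataFrame(
    {"Close": 50 + rng.normal(0, 1, len(dates)).cumsum()}, index=dates
)

fig, ax = plt.subplots(figsize=(12, 5))
# One line per year; the datetime index places each line on the time axis.
for year, part in stock.groupby(stock.index.year):
    ax.plot(part.index, part["Close"], label=str(year))
ax.set_ylabel("Close")
ax.legend()
```

If the dates are stored in a regular column rather than the index, converting them with `pd.to_datetime` and calling `set_index` first gives the same layout.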

How to plot time series graph in jupyter?

I have tried to plot the data in order to achieve something like this:
But I could not and I just achieved this graph with plotly:
Here is the small sample of my data
Does anyone know how to achieve that graph?
Thanks in advance
You'll find a lot of good stuff on timeseries on plotly.ly/python. Still, I'd like to share some practical details that I find very useful:
1. Organize your data in a pandas dataframe.
2. Set up a basic plotly structure using fig=go.Figure(go.Scatter()).
3. Make your desired additions to that structure using fig.add_traces(go.Scatter()).
Plot:
Code:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
# random data or other data sources
np.random.seed(123)
observations = 200
timestep = np.arange(0, observations/10, 0.1)
dates = pd.date_range('1/1/2020', periods=observations)
val1 = np.sin(timestep)
val2 = val1 + np.random.uniform(low=-1, high=1, size=observations)
# organize data in a pandas dataframe
df = pd.DataFrame({'Timestep': timestep, 'Date': dates,
                   'Value_1': val1,
                   'Value_2': val2})
# Main plotly figure structure (the noisy series)
fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value_2'],
                            marker_color='black',
                            opacity=0.6,
                            name='Value 2')])
# One of many possible additions (the underlying sine wave)
fig.add_traces([go.Scatter(x=df['Date'], y=df['Value_1'],
                           marker_color='blue',
                           name='Value 1')])
# plot figure
fig.show()

Make line chart with multiple series and error bars

I'm hoping to create a line graph which shows the changes to flowering and fruiting times (phenophases) from year to year. For each phenophase I'd like to plot the average Day of Year and, if possible, show the min and max for each year as an error bar. I've filtered down all the data I need in a few data frames, grouped it all in a sensible way, but I can't figure out how to get it all to plot. Here's a screen grab of where I'm at: Imgur
All the examples I've found for adding error bars are based on formulas or other equal amounts over and under, but in my case the max and min will differ, so I'm not sure how to integrate that. Possibly just create a list of each column's data and feed that to plot? I'm playing with that now but not getting far.
Also, if anyone has general suggestions as to better ways to present this data I'm all ears. I've looked into Gantt plots but didn't get far with them, as this seems a bit more straight-forward just using matplotlib. I'm happy to put some demo data or the rest of my notebook up if anyone thinks that would help.
Edit: Here's some sample data and the code from my notebook: Gist
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
pd.set_option('display.max_columns', 40)
tick_spacing = 1
dfClean = df[['Site_Cluster', 'Species', 'Phenophase_Name',
              'Phenophase_Status', 'Observation_Year', 'Day_of_Year']]
dfClean = dfClean[dfClean.Phenophase_Status == 1]
PhenoNames = ['Open flowers', 'Ripe fruits']
dfLakes = dfClean[(dfClean.Phenophase_Name.isin(PhenoNames))
                  & (dfClean.Site_Cluster == 'Lakes')
                  & (dfClean.Species == 'lapponica')]
dfLakesGrouped = dfLakes.groupby(['Observation_Year', 'Phenophase_Name'])
dfLakesReady = dfLakesGrouped.Day_of_Year.agg([np.min, np.mean, np.max]).round(0)
dfLakesReady = dfLakesReady.unstack()
print(dfLakesReady['mean'].plot())
Here's another answer:
from pandas import DataFrame, date_range
import numpy as np
from matplotlib import pyplot as plt
rng = date_range(start='2015-01-01', periods=5, freq='24H')
df = DataFrame({'y': np.random.normal(size=len(rng))}, index=rng)
y1 = df['y']
y2 = y1 * 3
# error bar lengths must be non-negative, so take absolute values
sd1 = (y1 * 2).abs()
sd2 = (y1 * 2).abs()
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
_ = y1.plot(yerr=sd1, ax=ax1)
_ = y2.plot(yerr=sd2, ax=ax2)
Output:
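For the unequal min and max case asked about in the question, matplotlib's `errorbar` accepts `yerr` as a 2 x N array, where the first row holds the distances below each point and the second row the distances above. The sketch below uses made-up observations for a single phenophase. The column names follow the question, while the values and the `obs` and `stats` names are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical observations for one phenophase (made-up data).
rng = np.random.default_rng(1)
obs = pd.DataFrame({
    "Observation_Year": np.repeat([2012, 2013, 2014, 2015], 20),
    "Day_of_Year": rng.integers(150, 200, 80),
})

stats = obs.groupby("Observation_Year")["Day_of_Year"].agg(["min", "mean", "max"])

# Row 0: distance from the mean down to the min; row 1: up to the max.
yerr = np.vstack([stats["mean"] - stats["min"],
                  stats["max"] - stats["mean"]])

fig, ax = plt.subplots()
ax.errorbar(stats.index, stats["mean"], yerr=yerr,
            marker="o", capsize=4, label="Open flowers")
ax.set_xticks(stats.index)
ax.set_xlabel("Observation_Year")
ax.set_ylabel("Day_of_Year")
ax.legend()
```

Repeating the `errorbar` call once per phenophase on the same axis would give one line with its own min and max whiskers for each series.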
