Month,Cluster,Count
7,Linux,42
7,Linux,56
7,Pct,6
7,Pct(C),11
7,Memory,28
10,Latency,73
10,Linux,47
10,Pct,21
10,Pct(C),18
10,Swap,41
10,Protection ,509
I need to compare months 7 and 10 and plot the clusters for each month. How can I visualize this data in Python? I need to distinguish the cluster counts for the two months.
You only have one observation each for Latency, Memory and Swap.
So you can't plot a line for the change in these, but you could combine a scatter plot with a line plot, like so:
import pandas as pd
import io
strdata = '''
Month,Cluster,Count
7,Linux,42
7,Linux,56
7,Pct,6
7,Pct(C),11
7,Memory,28
10,Latency,73
10,Linux,47
10,Pct,21
10,Pct(C),18
10,Swap,41
'''
df = pd.read_csv(io.StringIO(strdata),sep=",")
df.drop_duplicates(subset=['Month','Cluster']).set_index("Month").groupby("Cluster")["Count"].plot(legend=True, marker=".")
df.set_index("Month").groupby("Cluster")["Count"].plot(legend=True, style=".")
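If the goal is to compare the counts side by side rather than draw lines, a grouped bar chart is another option. The following is a minimal sketch, not the answer's method: it assumes that duplicate (Month, Cluster) rows, such as the two `7,Linux` entries, should be summed, and that a cluster missing from a month counts as 0.

```python
import io

import pandas as pd

csv_text = """Month,Cluster,Count
7,Linux,42
7,Linux,56
7,Pct,6
7,Pct(C),11
7,Memory,28
10,Latency,73
10,Linux,47
10,Pct,21
10,Pct(C),18
10,Swap,41
"""

df = pd.read_csv(io.StringIO(csv_text))

# One row per cluster, one column per month. Duplicate (Month, Cluster)
# rows are summed and missing combinations become 0 (both are assumptions).
counts = df.pivot_table(index="Cluster", columns="Month", values="Count",
                        aggfunc="sum", fill_value=0)

# Grouped bars: each cluster gets one bar per month.
ax = counts.plot.bar(rot=0)
ax.set_ylabel("Count")
ax.legend(title="Month")
```

Call `ax.figure.savefig(...)` or `matplotlib.pyplot.show()` afterwards to render the chart. Swapping `aggfunc="sum"` for `"mean"` or `"first"` changes how the duplicate Linux rows are treated.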
Is there a simple way of creating histograms for a continuous variable (mpg) that is filtered by a categorical variable (cyl=4,8)? So essentially I need two histograms for mpg grouped by cyl, one for cyl=4 and one for cyl=8.
Here is an example from a different dataset:
import numpy as np
import pandas as pd
import seaborn as sns
data = pd.DataFrame()
data[4] = np.random.normal(0,10,300)
data[8] = np.random.normal(20,11,300)
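# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(data[4], color="skyblue", kde=True, stat="density")
sns.histplot(data[8], color="orange", kde=True, stat="density")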
I just used a random sample here.
I am being a little lazy, but all you need is the seaborn package.
There are many more options available, so please read more here [https://python-graph-gallery.com/]
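When the values live in one long table with a category column, as in the mpg/cyl question, the filtering and grouping can be done with pandas before plotting. This is a rough sketch using plain matplotlib instead of seaborn; the data are synthetic and only the column names `mpg` and `cyl` come from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a cars table; the values are made up.
rng = np.random.default_rng(0)
cars = pd.DataFrame({
    "cyl": rng.choice([4, 6, 8], size=300),
    "mpg": rng.normal(25, 5, size=300),
})

# Keep only the two groups of interest.
subset = cars[cars["cyl"].isin([4, 8])]

# Draw one semi-transparent histogram per remaining cyl value on shared axes.
fig, ax = plt.subplots()
for cyl, group in subset.groupby("cyl"):
    ax.hist(group["mpg"], bins=20, alpha=0.5, label=f"cyl={cyl}")
ax.set_xlabel("mpg")
ax.legend()
```

Using `plt.subplots(1, 2)` and one axes per group would give two separate histograms instead of an overlay.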
I am new to Python and am currently working on a set of real estate data from Redfin.
Currently my data looks like this:
There are many different neighborhoods in the dataset. I would like to:
get the average homes_sold per month (the date field was cut out of the screenshot) per neighborhood
graph the above using only the neighborhoods I wish to use (about 4).
Any help is greatly appreciated.
As I understand it, you have several values of homes sold per month for each neighborhood and you want to average them. If so, try this code (substitute your own data):
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
data = pd.DataFrame({'neighborhood':['n1','n1','n2','n3','n3','n4','n5'],'homes_sold per month':[5,7,2,6,4,1,5],'something_else':[5,3,3,5,5,5,5]})
neighborhoods_to_plot = ['n1','n2','n4','n5'] #provide here a list you want to plot
plot = pd.DataFrame()
for n in neighborhoods_to_plot:
    # average over all rows that belong to neighborhood n
    plot.at[n,'homes_sold per month'] = data.loc[data['neighborhood']==n, 'homes_sold per month'].mean()
plot.index.name = 'neighborhood'
plt.figure(figsize=(4,3),dpi=300,tight_layout=True)
sns.barplot(x=plot.index,y=plot['homes_sold per month'],data=plot)
plt.savefig('graph.png', bbox_inches='tight')
I am going to assume that you are using pandas and Matplotlib to handle this data. To get the average number of homes sold per neighborhood, you just need to do:
import pandas as pd
# groupby is a method, so it is called with parentheses rather than indexed with brackets
mean_number_of_homes_sold = data[['neighborhood','homes_sold']].groupby('neighborhood').agg('mean')
To plot only the neighborhoods you want, you will need something like this:
import pandas as pd
import matplotlib.pyplot as plt
#fill this list with strings representing the names of the data you need plotted
neighborhoods_to_plot = ['Albany Park', 'Tinley Park']
data_to_graph = data[data.neighborhood.isin(neighborhoods_to_plot)]
fig, ax = plt.subplots()
# pass ax=ax so the scatter is drawn on the figure that is saved below
data_to_graph.plot(kind='scatter', x='avg_sale_to_list', y ='inventory_mom', ax=ax)
ax.set(title='Relationship between time to sale from listing and inventory momentum for selected neighborhoods')
fig.savefig('neighborhood.png', transparent=False, dpi=300, bbox_inches="tight")
You can obviously change which data is graphed or the type of graph but this should give you a decent starting point.
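Neither snippet above groups by month, which the question also asks for. A possible sketch follows; the rows are invented, and the column name `date` is an assumption because the date field was cut out of the screenshot.

```python
import pandas as pd

# Made-up rows; "date" is an assumed column name, homes_sold and
# neighborhood follow the question.
data = pd.DataFrame({
    "neighborhood": ["Albany Park", "Albany Park", "Albany Park",
                     "Tinley Park", "Tinley Park", "Loop"],
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10",
                            "2020-01-15", "2020-02-15", "2020-01-01"]),
    "homes_sold": [10, 20, 30, 4, 6, 99],
})

neighborhoods_to_plot = ["Albany Park", "Tinley Park"]
selected = data[data["neighborhood"].isin(neighborhoods_to_plot)]

# Average homes_sold for each (month, neighborhood) pair, then spread the
# neighborhoods into columns so each one becomes its own line.
monthly_mean = (
    selected
    .groupby([selected["date"].dt.to_period("M"), "neighborhood"])["homes_sold"]
    .mean()
    .unstack("neighborhood")
)

ax = monthly_mean.plot(marker="o")
ax.set_ylabel("average homes_sold")
```

`monthly_mean` has one row per month and one column per selected neighborhood, so the same table can also be passed to `plot.bar()` if bars are preferred over lines.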
I have tried to plot the data in order to achieve something like this:
But I could not and I just achieved this graph with plotly:
Here is the small sample of my data
Does anyone know how to achieve that graph?
Thanks in advance
You'll find a lot of good stuff on timeseries on plotly.ly/python. Still, I'd like to share some practical details that I find very useful:
Organize your data in a pandas dataframe
Set up a basic plotly structure using fig=go.Figure(go.Scatter())
Make your desired additions to that structure using fig.add_traces(go.Scatter())
Code:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
# random data or other data sources
np.random.seed(123)
observations = 200
timestep = np.arange(0, observations/10, 0.1)
dates = pd.date_range('1/1/2020', periods=observations)
val1 = np.sin(timestep)
val2=val1+np.random.uniform(low=-1, high=1, size=observations)#.tolist()
# organize data in a pandas dataframe
df= pd.DataFrame({'Timestep':timestep, 'Date':dates,
'Value_1':val1,
'Value_2':val2})
# Main plotly figure structure
fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value_2'],
                            marker_color='black',
                            opacity=0.6,
                            name='Value 2')])  # legend name matches the plotted column
# One of many possible additions
fig.add_traces([go.Scatter(x=df['Date'], y=df['Value_1'],
                           marker_color='blue',
                           name='Value 1')])
# plot figure
fig.show()
I wrote a Python script to read in a distance matrix that was provided via a CSV text file. This distance matrix shows the difference between different animal species, and I'm trying to sort them in different ways (diet, family, genus, etc.) using data from another CSV file that has just one row of ordering information. The code used is here:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mp
dietCols = pd.read_csv("label_diet.txt", header=None)
df = pd.read_csv("distance_matrix.txt", header=None)
ax = sns.heatmap(df)
fig = ax.get_figure()
fig.savefig("fig1.png")
mp.clf()
dfDiet = pd.read_csv("distance_matrix.txt", header=None, names=dietCols)
ax2 = sns.heatmap(dfDiet, linewidths=0)
fig2 = ax2.get_figure()
fig2.savefig("fig2.png")
mp.clf()
When plotting the distance matrix, the original graph looks like this:
However, when the additional naming information is read from the text file, the graph produced only has one column and looks like this:
You can see the matrix data is being used as row labeling, and I'm not sure why that would be. Some of the rows provided have no values so they're listed as "NaN", so I'm not sure if that would be causing a problem. Is there any easy way to order this distance matrix using an exterior file? Any help would be appreciated!
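One possible reading of the symptom is that `names=` expects a flat list with one label per column, while `dietCols` is a whole DataFrame; that is a guess, since the input files are not shown. A hedged sketch of one way to label and reorder the matrix is below. It uses tiny in-memory stand-ins for the two files, with invented values and diet labels.

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-ins for distance_matrix.txt and label_diet.txt;
# the numbers and labels are invented for illustration.
matrix_text = "0,5,2\n5,0,7\n2,7,0\n"
labels_text = "herbivore,carnivore,herbivore\n"

df = pd.read_csv(io.StringIO(matrix_text), header=None)

# Flatten the single row of labels into a plain list of strings.
labels = pd.read_csv(io.StringIO(labels_text), header=None).iloc[0].tolist()

# Stable sort of positions by label, applied to rows and columns alike
# so the matrix stays symmetric.
order = np.argsort(np.array(labels), kind="stable")
ordered = df.iloc[order, order]
ordered.index = [labels[i] for i in order]
ordered.columns = [labels[i] for i in order]
```

`sns.heatmap(ordered)` should then draw the full square matrix with species grouped by diet label.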
I'm trying to present a data table collected from firewall logs as a histogram, so that I would have one bar for each date in the file, with the number of occurrences of each value in a certain column stacked in the bar.
I looked into several examples here, but they all seemed to assume that I would know what values there are in the particular column. What I'm trying to achieve is a way to draw the histogram without needing to know all possible values.
In the example I have used protocol as the column:
#!/usr/bin/python
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
csvs = glob.glob("*log-export.csv")
dfs = [pd.read_csv(csv, sep="\xff", engine="python") for csv in csvs]
df_merged = pd.concat(dfs).fillna("")
data = df_merged[['date', 'proto']]
np_data = np.array(data)
plt.hist(np_data, stacked=True)
plt.show()
But this shows the following diagram:
(figure: histogram)
and I would like to accomplish something like this:
(figure: stacked)
Any suggestions on how to achieve this?
Setup
I had to make up data because you didn't provide any.
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(
    Date=pd.date_range(end=pd.to_datetime('now'), periods=100, freq='H'),
    Proto=np.random.choice('UDP TCP ICMP'.split(), 100, p=(.3, .5, .2))
))
Solution
Use pd.crosstab then plot
pd.crosstab(df.Date.dt.date, df.Proto).plot.bar(stacked=True)
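To see why this works without listing the protocols up front, here is a small deterministic sketch. The log rows are invented; only the column names `date` and `proto` follow the question.

```python
import pandas as pd

# Small invented log with two days and three protocols.
log = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01 08:00", "2020-03-01 09:30",
                            "2020-03-01 10:15", "2020-03-02 11:00",
                            "2020-03-02 12:45"]),
    "proto": ["TCP", "UDP", "TCP", "ICMP", "TCP"],
})

# Rows are calendar days; columns are whatever protocols occur in the data.
counts = pd.crosstab(log["date"].dt.date, log["proto"])

# Each column becomes one segment of the stacked bar for its day.
ax = counts.plot.bar(stacked=True, rot=0)
ax.set_ylabel("events")
```

`pd.crosstab` discovers the distinct values of `proto` on its own, so a new protocol appearing in the logs simply adds another column and another bar segment.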