Multiple datasets and fit line

I have different datasets:
Df1
X Y
1 1
2 5
3 14
4 36
5 90
Df2
X Y
1 1
2 5
3 21
4 38
5 67
Df3
X Y
1 1
2 5
3 10
4 50
5 78
I would like to determine a line which fits this data and plot all data in one chart (like a regression).
On the x axis I have the time; on the y axis I have the frequency of an event that occurs.
Any help on the approach on how to determine the line and plot the results keeping the different legends (would be ok with seaborn or matplotlib) would be helpful.
What I have done so far is plotting the three lines as follows:
plot_df = pd.DataFrame(list(zip(dataset_list, x_lists, y_lists)),
                       columns=['Dataset', 'X', 'Y']).set_index('Dataset', inplace=False)
plot_df = plot_df.apply(pd.Series.explode).reset_index()  # this step should transpose the resulting df and explode the values
# plot
fig, ax = plt.subplots(figsize=(10, 8))
for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)
Please note that the three lists at the beginning contain information on the three different df.

I recommend using linregress from scipy.stats, as it gives very readable code. You just need to add the fit to your loop:
from scipy.stats import linregress
for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)
    # explode() leaves the columns with object dtype, so cast to float before fitting
    gx = group.X.astype(float)
    gy = group.Y.astype(float)
    # fit a line to the data
    fit = linregress(gx, gy)
    ax.plot(gx, gx * fit.slope + fit.intercept, label=f'{name} fit')
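If what you want is a single line through all three datasets rather than one line per dataset, one option is to run linregress on the pooled points. The following is only a minimal sketch under that assumption: it rebuilds plot_df directly from the values in the question (the data dict and the variable names are illustrative, not part of the original code) and selects the non-interactive Agg backend so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import linregress

# Values copied from the question; all three datasets share the same X.
x = [1, 2, 3, 4, 5]
data = {
    "Df1": [1, 5, 14, 36, 90],
    "Df2": [1, 5, 21, 38, 67],
    "Df3": [1, 5, 10, 50, 78],
}

# Long format: one row per (Dataset, X, Y) observation.
plot_df = pd.concat(
    [pd.DataFrame({"Dataset": name, "X": x, "Y": y}) for name, y in data.items()],
    ignore_index=True,
)

# One least-squares line through the pooled points of all datasets.
pooled = linregress(plot_df["X"], plot_df["Y"])

fig, ax = plt.subplots(figsize=(10, 8))
for name, group in plot_df.groupby("Dataset"):
    ax.scatter(group["X"], group["Y"], label=name)  # one legend entry per dataset

xs = plot_df["X"].drop_duplicates().sort_values()
ax.plot(xs, pooled.slope * xs + pooled.intercept, color="black", label="pooled fit")
ax.legend()
```

Call fig.savefig(...) or, with an interactive backend, plt.show() to see the chart. Note that the Y values grow faster than linearly, so a straight line is only a rough summary of this data.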

Related

Plotting dataframe where headers are 24h timestamps

I have a CSV which looks something like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Red 2151.0 1966.0 1889.0 2148.0 2112.0 2351.0 1813.0 2008.0 1841.0 1901.0 2373.0 2643.0 2322.0 1901.0 1849.0 2132.0 1877.0 1963.0 1861.0 1973.0 2468.0 3434.0 3159.0 3413.0
Blue 2122.0 2059.0 2274.0 2647.0 2136.0 2271.0 2107.0 2192.0 2403.0 2148.0 2008.0 2111.0 2061.0 2196.0 2165.0 2354.0 1931.0 2195.0 1985.0 2025.0 2463.0 2943.0 3302.0 3424.0
I need to plot this data as a scatter plot, where Red/Blue is on the X-axis, and time on the Y-axis. Here 1->24 are hourly timestamps. I am confused because usually there is one timestamp per row, but in my case I have to plot each row across all timestamps.
I then have multiple such CSVs, and will need to plot all of them to one graph.
So my question is, what's the best way to plot all values of Red/Blue for each given timestamp?
When I try to do:
x = sorted_df.index
y = list(sorted_df.columns)
plt.scatter(x, y)
plt.show()
I get a ValueError: x and y must be the same size - and this is my main source of confusion because x and y will never be the same. In the above example, x will always be 2 and y will always be 24!
Any help is much appreciated!

You could transpose your dataframe (.T is short for .transpose()), iterate over what are now its columns, and add the scatter plot of each column ("Red", "Blue", etc.) to the same plot:
# Assuming that your index is ["Red", "Blue", ...]
# Assuming that your columns are [1, 2, ...]
sorted_df = sorted_df.T
# Now your index is [1, 2, ...] and columns are ["Red", "Blue", ...]
for column in sorted_df.columns:
    # x: [1, 2, ...] (always the same)
    # y: Values for each column (first "Red", then "Blue", and so on)
    plt.scatter(sorted_df.index, sorted_df[column])
# Display plot
plt.show()
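For plotting several such CSVs together, it may help to go one step further and reshape the transposed frame into long format, with one row per (hour, series) pair. The sketch below assumes the column labels are the hours 1 to 24 (the question's header row is ambiguous on this point) and uses the Red/Blue values from the question; the names hours, long_df, "hour", "series" and "value" are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

hours = list(range(1, 25))  # assumption: the columns are the hours 1..24
sorted_df = pd.DataFrame(
    [
        [2151, 1966, 1889, 2148, 2112, 2351, 1813, 2008, 1841, 1901, 2373, 2643,
         2322, 1901, 1849, 2132, 1877, 1963, 1861, 1973, 2468, 3434, 3159, 3413],
        [2122, 2059, 2274, 2647, 2136, 2271, 2107, 2192, 2403, 2148, 2008, 2111,
         2061, 2196, 2165, 2354, 1931, 2195, 1985, 2025, 2463, 2943, 3302, 3424],
    ],
    index=["Red", "Blue"],
    columns=hours,
)

# Transpose so hours become rows, then melt to one row per (hour, series).
long_df = (
    sorted_df.T
    .rename_axis("hour")
    .reset_index()
    .melt(id_vars="hour", var_name="series", value_name="value")
)

fig, ax = plt.subplots()
for name, group in long_df.groupby("series"):
    ax.scatter(group["hour"], group["value"], label=name)
ax.set_xlabel("hour")
ax.set_ylabel("value")
ax.legend()
```

Long frames built this way from several CSVs can be combined with pd.concat before plotting.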

Generating an histogram with Matplotlib using a dataframe for x and y

I'm generating a simple line chart with Matplotlib, here is my code:
fig = plt.figure(facecolor='#131722', dpi=155, figsize=(8, 4))
ax1 = plt.subplot2grid((1, 2), (0, 0), facecolor='#131722')
for x in OrderedList:
    rate_buy = []
    total_buy = []
    for y in x['data']['bids']:
        rate_buy.append(y[0])
        total_buy.append(y[1])
    rBuys = pd.DataFrame({'buy': rate_buy})
    tBuys = pd.DataFrame({'total': total_buy})
    ax1.plot(rBuys.buy, tBuys.total, color='#0400ff', linewidth=0.5, alpha=1)
    ax1.fill_between(rBuys.buy, 0, tBuys.total, facecolor='#0400ff', alpha=1)
Which gives me the following output:
And here is the data I used in the dataframes:
buy
0 9611
1 9610
2 9609
3 9608
4 9607
5 9606
6 9605
7 9604
8 9603
9 9602
10 9601
11 9600
12 9599
total
0 3.033661
1 3.295753
2 3.599813
3 22.305765
4 22.987476
5 30.975145
6 39.492845
7 42.828580
8 46.677708
9 49.533740
10 50.925840
11 61.396243
12 61.921523
I want to get the same output as in the image, but as a histogram-style chart or something similar, where the height of each column on the y axis comes from the total dataframe and its x axis position comes from the buy dataframe. So the first element will have position x=9611 and y=3.033661.
Is it possible to do that with Matplotlib? I tried to use hist, but it doesn't let me set both the x and the y axis.

Pandas plots with matplotlib under the hood, and the API is very easy once you have the dataframe.
Here is an example.
import pandas as pd
import matplotlib.pyplot as plt
d = {
    'buy': [9611, 9610, 9609, 9608, 9607, 9606, 9605,
            9604, 9603, 9602, 9601, 9600, 9599],
    'total': [3.033661, 3.295753, 3.599813, 22.305765, 22.987476,
              30.975145, 39.492845, 42.828580, 46.677708, 49.533740,
              50.925840, 61.396243, 61.921523],
}
df = pd.DataFrame(d)
df = df.sort_values(by=['buy'])  # remember to sort your x values!
df.plot(kind='bar', x='buy', y='total', width=1)
plt.show()
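Note that df.plot(kind='bar') treats the x values as categories, so the bars are evenly spaced whatever the numeric gaps between them. If you would rather place each bar at its actual numeric x position, matplotlib's Axes.bar takes the positions and heights directly. A minimal sketch with the data from the question (the colour is borrowed from the question's code, the axis labels are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "buy": [9611, 9610, 9609, 9608, 9607, 9606, 9605,
            9604, 9603, 9602, 9601, 9600, 9599],
    "total": [3.033661, 3.295753, 3.599813, 22.305765, 22.987476,
              30.975145, 39.492845, 42.828580, 46.677708, 49.533740,
              50.925840, 61.396243, 61.921523],
})

fig, ax = plt.subplots()
# Each bar is centred on its 'buy' value and is as tall as its 'total' value.
bars = ax.bar(df["buy"], df["total"], width=1.0, color="#0400ff")
ax.set_xlabel("buy")
ax.set_ylabel("total")
```

With width=1.0 and consecutive integer prices, neighbouring bars touch, which gives the histogram-like look asked for.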

scatter plot with multiple X features and single Y in Python

Data in form:
       x1    x2
data = 2104, 3
       1600, 3
       2400, 3
       1416, 2
       3000, 4
       1985, 4
y = 399900
    329900
    369000
    232000
    539900
    299900
I want to draw a scatter plot with two X features (x1 and x2) and a single Y,
but when I try
y=data.loc[:'y']
px=data.loc[:,['x1','x2']]
plt.scatter(px,y)
I get:
'ValueError: x and y must be the same size'.
So I tried this:
data=pd.read_csv('ex1data2.txt',names=['x1','x2','y'])
px=data.loc[:,['x1','x2']]
x1=px['x1']
x2=px['x2']
y=data.loc[:'y']
plt.scatter(x1,x2,y)
This time I got a blank graph filled entirely with blue.
I would be grateful for some guidance.

You can only plot with one x and several y's. You could plot the different x's in a twiny axis:
fig, ax = plt.subplots()
ay = ax.twiny()
ax.scatter(df['x1'], df['y'])
ay.scatter(df['x2'], df['y'], color='r')
plt.show()
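A self-contained version of the twin-axis idea, using the values from the question, might look like the sketch below. The colours, axis labels and the combined legend are illustrative additions, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "x1": [2104, 1600, 2400, 1416, 3000, 1985],
    "x2": [3, 3, 3, 2, 4, 4],
    "y": [399900, 329900, 369000, 232000, 539900, 299900],
})

fig, ax = plt.subplots()
ax_top = ax.twiny()  # second x axis on top, sharing the same y axis

ax.scatter(df["x1"], df["y"], color="C0", label="x1")
ax_top.scatter(df["x2"], df["y"], color="C1", label="x2")

ax.set_xlabel("x1")
ax_top.set_xlabel("x2")
ax.set_ylabel("y")

# Merge the legend entries of both axes into a single legend.
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax_top.get_legend_handles_labels()
ax.legend(h1 + h2, l1 + l2)
```

Because x1 is in the thousands and x2 is in single digits, giving each its own x scale keeps both point clouds readable.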

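You can also look at the pandas functions for plotting dataframe content; they are very powerful.
If you want to use matplotlib directly, the documentation (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html) says that x and y must be array-like and of the same size. In your first attempt x is a two-column DataFrame, so it cannot match a single y column. Note also that data.loc[:'y'] is a row slice, not a column selection (use data['y']), and that in plt.scatter(x1, x2, y) the third positional argument is the marker size s, not a second set of values.
Working code looks like this: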
data = pd.read_csv("test.txt", header=None)
data
0 1 2
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900
5 1985 4 299900
data.columns = ["x1", "x2", "y"]
data
x1 x2 y
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900
5 1985 4 299900
# If you call scatter many times and then plt.show() a single image is created
plt.scatter(data["x1"], data["y"])
plt.scatter(data["x2"], data["y"])
plt.show()
Note that if you want to have data in an array format you can do data["x1"].values and it will return an ndarray.
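To see why the two attempts in the question behave as they do, it can help to inspect the call signature of plt.scatter and the shapes of the selections. This is only an illustrative check on the question's data, not part of the original answer:

```python
import inspect

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is needed here
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "x1": [2104, 1600, 2400, 1416, 3000, 1985],
    "x2": [3, 3, 3, 2, 4, 4],
    "y": [399900, 329900, 369000, 232000, 539900, 299900],
})

# Order of the positional parameters of plt.scatter: the third one is the
# marker size, so plt.scatter(x1, x2, y) draws x2 against x1 with huge markers.
params = list(inspect.signature(plt.scatter).parameters)
print(params[:4])

# Column selections: both axes of .loc are given explicitly.
y = data.loc[:, "y"]            # one column, 6 values
px = data.loc[:, ["x1", "x2"]]  # two columns, 12 values in total
print(px.shape, y.shape)
```

A two-column selection holds twice as many values as y, which is what the "x and y must be the same size" error is complaining about.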

You could use seaborn with a melted dataframe. seaborn.scatterplot has a hue argument, which lets you include multiple data series.
import seaborn as sns
ax = sns.scatterplot(x='value', hue='series', y='y',
                     data=data.melt(value_vars=['x1', 'x2'],
                                    id_vars='y',
                                    var_name='series'))
However, if your x values are that different, you might want to use twin axes, as in @Quang Hoang's answer.
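If it is unclear what the melted dataframe looks like, the reshaping step can be run on its own with pandas. A small sketch with the data from the question (long_df is an illustrative name):

```python
import pandas as pd

data = pd.DataFrame({
    "x1": [2104, 1600, 2400, 1416, 3000, 1985],
    "x2": [3, 3, 3, 2, 4, 4],
    "y": [399900, 329900, 369000, 232000, 539900, 299900],
})

# Stack x1 and x2 into one 'value' column; 'series' records which column
# each value came from, and 'y' is repeated for every stacked row.
long_df = data.melt(id_vars="y", value_vars=["x1", "x2"], var_name="series")
print(long_df)
```

The result has one row per (y, series) pair, which is the shape that the hue argument expects.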

Filtering outliers within each category of categorical data in pandas

I'm new to pandas/seaborn/etc and attempting to graph a subset of my data in a different style (using seaborn), using something like the example here https://seaborn.pydata.org/generated/seaborn.stripplot.html :
>>> ax = sns.stripplot(x="day", y="total_bill", hue="smoker",
... data=tips, jitter=True,
... palette="Set2", dodge=True)
My goal is to plot only the outliers within each x/hue dimension, i.e. for the example shown I'd be using 8 different percentile cutoffs for the 8 different columns of points displayed.
I have a dataframe like:
Cat RPS latency_ns
0 X 100 909423.0
1 X 100 14747385.0
2 X 1000 14425058.0
3 Y 100 7107907.0
4 Y 1000 21466101.0
... ... ... ...
And I want to filter this data, leaving only the upper 99.9th percentile outliers.
I've found I can do:
df.groupby([dim1_label, dim2_label]).quantile(0.999)
To get something like:
latency_ns
Cat RPS
X 10 RPS 6.463337e+07
100 RPS 4.400980e+07
1000 RPS 6.075070e+07
Y 100 RPS 3.958944e+07
Z 10 RPS 5.621427e+07
100 RPS 4.436208e+07
1000 RPS 6.658783e+07
But I'm not sure where to go from here with a merge/filter operation.

Here is a small example I created to guide you. I hope it is helpful.
Code
import numpy as np
import pandas as pd
import seaborn as sns
#create a sample data frame
n = 1000
prng = np.random.RandomState(123)
x = prng.uniform(low=1, high=5, size=(n,)).astype('int')
#print(x[:10])
#[3 2 1 3 3 2 4 3 2 2]
y = prng.normal(size=(n,))
#print(y[:10])
#[ 1.32327371 -0.00315484 -0.43065984 -0.14641577 1.16017595 -0.64151234
#-0.3002324 -0.63226078 -0.20431653 0.2136956 ]
z = prng.binomial(n=1,p=2/3,size=(n,))
#print(z[:10])
#[1 0 1 1 1 1 0 1 1 1]
# analogously to the smoking example: x maps to day,
# y maps to total_bill, and z maps to smoker (or not)
df = pd.DataFrame(data={'x':x,'y':y,'z':z})
#df.head()
df_filtered = pd.DataFrame()
# df.groupby(...).quantile(0.9) returns a single value per group, which is
# only useful if you want to plot a single point.
# To plot the values that lie between a lower and an upper bound, some
# conditional filtering is required, as in the loop below.
for i, j in df.groupby(['x', 'z']):
    b = j.quantile([0, 0.9])  # use [0.999, 1] in your case
    lb = b['y'].iloc[0]
    ub = b['y'].iloc[1]
    df_temp = j[(j['y'] >= lb) & (j['y'] <= ub)]
    df_filtered = pd.concat([df_filtered, df_temp])
#print(df_filtered.count())
#x 897
#y 897
#z 897
#dtype: int64
Output
import matplotlib.pyplot as plt
ax = sns.stripplot(x='x', y='y', hue='z',
data=df_filtered, jitter=True,
palette="Set2", dodge=True)
plt.show()
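An alternative that avoids the explicit loop, and is closer to the merge/filter step the question asks about, is to broadcast each group's cutoff back onto its rows with groupby(...).transform and compare row by row. The following is a sketch on a small made-up frame that only borrows the column names from the question; q is set to 0.75 so that the toy data leaves something to show, and would be 0.999 for the real data.

```python
import pandas as pd

# Toy data (made up for illustration): two groups of four latencies each.
df = pd.DataFrame({
    "Cat": ["X"] * 4 + ["Y"] * 4,
    "RPS": [100] * 8,
    "latency_ns": [1.0, 2.0, 3.0, 100.0, 10.0, 20.0, 30.0, 40.0],
})

q = 0.75  # use 0.999 for the 99.9th percentile

# For every row, the quantile of the group that row belongs to.
cutoff = (
    df.groupby(["Cat", "RPS"])["latency_ns"]
    .transform(lambda s: s.quantile(q))
)

# Keep only the rows at or above their own group's cutoff.
outliers = df[df["latency_ns"] >= cutoff]
print(outliers)
```

The filtered frame keeps the original columns, so it can be passed straight to sns.stripplot with the same x, y and hue arguments.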

How to plot a FacetGrid scatter plot with multiple data frames?

I have 2 dataframes, 1 has training data and the other has labels. There are 6 features/columns in the training data and 1 column in the labels data frame. I want 6 plots in my facet grid - all of them to be a scatter plot. So feature 1 vs label, feature 2 vs label, feature 3 vs label, feature 4 vs label.
Can someone show me how to do this?
for instance, using these sample data frames
In [15]: training
Out[15]:
feature1 feature2 feature3 feature4 feature5 feature6
0 2 3 4 5 2 5
1 5 4 2 5 6 2
In [16]: labels
Out[16]:
label
0 34
1 2
This should make 6 separate scatter plots, each with 2 data points.

Seaborn has a nice FacetGrid class. You can merge your two dataframes and wrap the seaborn FacetGrid around a normal matplotlib.pyplot.scatter():
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
#make a test dataframe
features = {}
for i in range(7):
    features['feature%s'%i] = [random.random() for j in range(10)]
f = pd.DataFrame(features)
labels = pd.DataFrame({'label':[random.random() for j in range(10)]})
#unstack it so feature labels are now in a single column
unstacked = pd.DataFrame(f.unstack()).reset_index()
unstacked.columns = ['feature', 'feature_index', 'feature_value']
#merge them together to get the label value for each feature value
plot_data = pd.merge(unstacked, labels, left_on = 'feature_index', right_index = True)
#wrap a seaborn facetgrid
kws = dict(s=50, linewidth=.5, edgecolor="w")
g = sns.FacetGrid(plot_data, col="feature")
g = (g.map(plt.scatter, "feature_value", "label", **kws))
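The reshaping step can also be written with join and melt. The sketch below applies that variant to the sample frames from the question; it assumes the two frames share the same row index, and plot_data, "feature" and "feature_value" simply mirror the names used above.

```python
import pandas as pd

# Sample frames copied from the question.
training = pd.DataFrame({
    "feature1": [2, 5], "feature2": [3, 4], "feature3": [4, 2],
    "feature4": [5, 5], "feature5": [2, 6], "feature6": [5, 2],
})
labels = pd.DataFrame({"label": [34, 2]})

# Attach the label to each training row (aligned on the index), then stack
# the six feature columns into one (feature, feature_value) pair per row.
plot_data = training.join(labels).melt(
    id_vars="label", var_name="feature", value_name="feature_value"
)
print(plot_data)
```

The resulting frame has one row per (feature, sample) pair and can be passed to sns.FacetGrid(plot_data, col="feature") and mapped with plt.scatter on "feature_value" and "label", exactly as in the code above.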
