I have a data set with 8 columns and several rows. The columns contain measurements for different variables (6 in total) under 2 different conditions, with each condition consisting of 4 columns of repeated measurements.
Using Seaborn, I would like to generate a bar chart displaying the mean and standard deviation of every group of 4 columns, grouped by index key (i.e. measured variable). The dataframe structure is as follows:
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({
    'S1_1': np.random.randn(6),
    'S1_2': np.random.randn(6),
    'S1_3': np.random.randn(6),
    'S1_4': np.random.randn(6),
    'S2_1': np.random.randn(6),
    'S2_2': np.random.randn(6),
    'S2_3': np.random.randn(6),
    'S2_4': np.random.randn(6),
}, index=['var1', 'var2', 'var3', 'var4', 'var5', 'var6'])
How do I tell seaborn that I would like only 2 bars: 1 for the first 4 columns and 1 for the second 4? Each bar should display the mean (and the standard deviation, or some other measure of dispersion) across its 4 columns.
I was thinking of using multi-indexing, adding a second column level to group the columns into 2 conditions,
df.columns = pd.MultiIndex.from_arrays([['Condition 1'] * 4 + ['Condition 2'] * 4,df.columns])
but I can't figure out what I should pass to Seaborn to generate the plot I want.
If anyone could point me in the right direction, that would be a great help!
Update Based on Comment
Plotting is all about reshaping the dataframe for the plot API
import pandas as pd
import seaborn as sns

# still create the groups of 4 columns
l = df.columns
n = 4
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)

# stack each group and add an id column identifying the condition
data_list = list()
for group in groups:
    id_ = group[0][1]  # second character of the first column name, e.g. 'S2_1' -> '2'
    data = df[group].copy().T
    data['id_'] = id_
    data_list.append(data)

df2 = pd.concat(data_list, axis=0).reset_index()
df2.rename({'index': 'sample'}, axis=1, inplace=True)

# melt df2 into a long form
dfm = df2.melt(id_vars=['sample', 'id_'])

# plot; the bar height is the mean and ci='sd' draws the standard deviation
# (in seaborn >= 0.12 use errorbar='sd' instead of ci='sd')
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', hue='id_', ci='sd', aspect=3)
df2.head()
sample YAL001C YAL002W YAL004W YAL005C YAL007C YAL008W YAL011W YAL012W YAL013W YAL014C id_
0 S2_1 -13.062716 -8.084685 2.360795 -0.740357 3.086768 -0.117259 -5.678183 2.527573 -17.326287 -1.319402 2
1 S2_2 -5.431474 -12.676807 0.070569 -4.214761 -4.318011 -4.489010 -10.268632 0.691448 -24.189106 -2.343884 2
2 S2_3 -9.365509 -12.281169 0.497772 -3.228236 0.212941 -2.287206 -10.250004 1.111842 -27.811564 -4.329987 2
3 S2_4 -7.582111 -15.587219 -1.286167 -4.531494 -3.090265 -4.718281 -8.933496 2.079757 -21.580854 -2.834441 2
4 S3_1 -12.618254 -20.010779 -2.530541 -3.203072 -2.436503 -2.922565 -15.972632 3.551605 -35.618485 -4.925495 3
dfm.head()
sample id_ variable value
0 S2_1 2 YAL001C -13.062716
1 S2_2 2 YAL001C -5.431474
2 S2_3 2 YAL001C -9.365509
3 S2_4 2 YAL001C -7.582111
4 S3_1 3 YAL001C -12.618254
Plot Result
kind='box'
A box plot might be a better way to convey the distribution:
p = sns.catplot(kind='box', data=dfm, y='variable', x='value', hue='id_', height=12)
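For the small example frame posted in the question (S1_*/S2_* columns, var1-var6 index), a more direct reshape could produce the two-bars-per-variable layout. This is only a sketch, assuming that naming convention and treating the column prefix as the condition:
import pandas as pd
import seaborn as sns

# melt the question's example df: one row per (variable, sample) pair
dfm = (df.rename_axis('variable')
         .reset_index()
         .melt(id_vars='variable', var_name='sample', value_name='value'))
# derive the condition from the column prefix, e.g. 'S1_3' -> 'S1'
dfm['condition'] = dfm['sample'].str.split('_').str[0]

# bar height is the mean across the 4 replicate columns; ci='sd' adds the
# standard deviation (errorbar='sd' in seaborn >= 0.12)
sns.barplot(data=dfm, x='variable', y='value', hue='condition', ci='sd')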
Original Answer
Use a list comprehension to chunk the columns into groups of 4.
This uses the original, more comprehensive data that was posted; it can be found in revision 4 of the question.
Create a figure with subplots and zip each group to an ax from axes.
Use each group to select data from df and transpose the data with .T.
With sns.barplot the default estimator is the mean, so the bar length is the mean; set ci='sd' so the error bar shows the standard deviation.
sns.barplot(data=data, ci='sd', ax=ax) can easily be replaced with sns.boxplot(data=data, ax=ax).
import matplotlib.pyplot as plt
import seaborn as sns

# using the first comma separated data that was posted, create groups of 4
l = df.columns
n = 4  # chunk size for groups
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)

# plot one subplot per group of 4 columns
fig, axes = plt.subplots(num_gps, 1, figsize=(12, 6*num_gps))
for ax, group in zip(axes, groups):
    data = df[group].T
    sns.barplot(data=data, ci='sd', ax=ax)
    ax.set_title(f'{group.to_list()}')

fig.tight_layout()
fig.savefig('test.png')
Example of data
The bar is the mean of each column, and the line is the standard deviation
YAL001C YAL002W YAL004W YAL005C YAL007C YAL008W YAL011W YAL012W YAL013W YAL014C
S8_1 -1.731388 -17.215712 -3.518643 -2.358103 0.418170 -1.529747 -12.630343 2.435674 -27.471971 -4.021264
S8_2 -1.325524 -24.056632 -0.984390 -2.119338 -1.770665 -1.447103 -10.618954 2.156420 -30.362998 -4.735058
S8_3 -2.024020 -29.094027 -6.146880 -2.101090 -0.732322 -2.773949 -12.642857 -0.009749 -28.486835 -4.783863
S8_4 2.541671 -13.599049 -2.688125 -2.329332 -0.694555 -2.820627 -8.498677 3.321018 -31.741916 -2.104281
Plot Result
I have a dataset with 7 columns: level, Time_30, Time_60, Time_90, Time_120, Time_150, and Time_180.
My main goal is to do a time-series anomaly detection using cell count in a 30-minute interval.
I want to do the following data preparation steps:
(I) melt/reshape the df into the appropriate time-series format (from wide to long): consolidate the columns Time_30, Time_60, ..., Time_180 into one column time with 6 levels (30, 60, ..., 180)
(II) since the result from (I) comes out as 30, 60, ..., 180, I want to set the time column to an appropriate time or date format for a time series (something like '%H:%M:%S')
(III) use a for-loop to plot the time series for each level (A, B, ..., F) for comparison purposes
(IV) Anomaly detection
# generate/import dataset
import pandas as pd
df = pd.DataFrame({'level':[A,B,C,D,E,F],
'Time_30':[1993.05,1999.45, 2001.11, 2007.39, 2219.77],
'Time_60':[2123.15,2299.59, 2339.19, 2443.37, 2553.15],
'Time_90':[2323.56,2495.99,2499.13, 2548.71, 2656.0],
'Time_120':[2355.52,2491.19,2519.92,2611.81, 2753.11],
'Time_150':[2425.31,2599.51, 2539.9, 2713.77, 2893.58],
'Time_180':[2443.35,2609.92, 2632.49, 2774.03, 2901.25]} )
Desired outcome
# first series
level, time, count
A, 30, 1993.05
B, 60, 2123.15
C, 90, 2323.56
D, 120, 2355.52
E, 150, 2425.31
F, 180, 2443.35
# 2nd series
level,time,count
A,30,1999.45
B,60,2299.59
C,90,2495.99
D,120,2491.19
E,150,2599.51
F,180,2609.92
.
.
.
.
# up until the last series
See below for my attempt
# (I)
df1 = pd.melt(df,id_vars = ['level'],var_name = 'time',value_name = 'count') #
# (II)
df1['time'] = pd.to_datetime(df1['time'],format= '%H:%M:%S' ).dt.time
OR
df1['time'] = pd.to_timedelta(df1['time'], unit='m')
# (III)
plt.figure(figsize=(10,5))
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.3)
# Perform STL Decomp
stl = STL(df1)
result = stl.fit()
seasonal, trend, resid = result.seasonal, result.trend, result.resid
plt.figure(figsize=(8,6))
plt.subplot(4,1,1)
plt.plot(df1)
plt.title('Original Series', fontsize=16)
plt.subplot(4,1,2)
plt.plot(trend)
plt.title('Trend', fontsize=16)
plt.subplot(4,1,3)
plt.plot(seasonal)
plt.title('Seasonal', fontsize=16)
plt.subplot(4,1,4)
plt.plot(resid)
plt.title('Residual', fontsize=16)
plt.tight_layout()
estimated = trend + seasonal
plt.figure(figsize=(12,4))
plt.plot(df1)
plt.plot(estimated)
plt.figure(figsize=(10,4))
plt.plot(resid)
# Anomaly detection
resid_mu = resid.mean()
resid_dev = resid.std()
lower = resid_mu - 3*resid_dev
upper = resid_mu + 3*resid_dev
anomalies = df1[(resid < lower) | (resid > upper)] # returns the datapoints with the anomalies
anomalies
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.6)
plt.scatter(anomalies.index, anomalies.count, color='r', marker='D')
Please note: if you can only attempt (I) and/or (II), that would be much appreciated.
I made a few small edits to your sample dataframe based on my comment above:
import pandas as pd
df = pd.DataFrame({'level':['A','B','C','D','E'],
'Time_30':[1993.05,1999.45, 2001.11, 2007.39, 2219.77],
'Time_60':[2123.15,2299.59, 2339.19, 2443.37, 2553.15],
'Time_90':[2323.56,2495.99,2499.13, 2548.71, 2656.0],
'Time_120':[2355.52,2491.19,2519.92,2611.81, 2753.11],
'Time_150':[2425.31,2599.51, 2539.9, 2713.77, 2893.58],
'Time_180':[2443.35,2609.92, 2632.49, 2774.03, 2901.25]} )
First, manipulate the Time_* column names to be integer values:
timecols = [int(c.replace("Time_","")) for c in df.columns if c != 'level']
df.columns = ['level'] + timecols
After that you can pd.melt() like you were thinking, yielding a dataframe with all those "series" you mentioned above concatenated together:
df1 = df.melt(id_vars=['level'], value_vars=timecols, var_name='time', value_name='count').sort_values(['level','time']).reset_index(drop=True)
print(df1.head(10))
level time count
0 A 30 1993.05
1 A 60 2123.15
2 A 90 2323.56
3 A 120 2355.52
4 A 150 2425.31
5 A 180 2443.35
6 B 30 1999.45
7 B 60 2299.59
8 B 90 2495.99
9 B 120 2491.19
If you want to loop over the levels, select them with:
for level in df1['level'].unique():
tmp = df1[df1['level']==level]
or
for level in df1['level'].unique():
tmp = df1[df1['level']==level].copy()
...if you intend to modify/add data to the tmp dataframe.
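For step (III), one way to build on that loop is to draw one line per level on a shared axis. A minimal sketch, assuming the time column is still integer minutes at this point:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
for level in df1['level'].unique():
    tmp = df1[df1['level'] == level]
    plt.plot(tmp['time'], tmp['count'], marker='o', label=level)
plt.xlabel('time (minutes)')
plt.ylabel('cell count')
plt.legend(title='level')
plt.show()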
As for making timestamps, you could do:
df1['time'] = pd.to_timedelta(df1['time'], unit='min')
...like you were attempting, but it depends on how you're using it. If you just want strings that look like "00:30:00", etc, you can try something like:
df1['time'] = pd.to_timedelta(df1['time'], unit='min').apply(lambda x:str(x)[-8:])
Anyway, hope that gets you on track for what you need.
I have two DataFrames. Both have X and Y coordinates. But DF1 is much denser than DF2. I want to downsample DF1 according to the X Y coordinates in DF2. Specifically, for each X/Y pairs in DF2, I select DF1 data between X +/-delta and Y +/-delta, and calculate the average value of Z. New_DF1 will have the same X Y coordinate as DF2, but with the average value of Z by downsampling.
Here are some examples and a function I made for this purpose. My problem is that it is too slow for a large dataset. Any ideas for vectorizing the operation instead of crude looping would be highly appreciated.
Create data examples:
import pandas as pd

DF1 = pd.DataFrame({'X': [0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.1, 2.8, 2.9, 3.0, 3.3, 3.5],
                    'Y': [0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.1, 2.8, 2.9, 3.0, 3.3, 3.5],
                    'Z': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
DF2 = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 2, 3], 'Z': [10, 20, 30]})
Function:
def DF1_match_DF2_target(half_range, DF2, DF1):
    ### half_range, scalar, defines the area around each DF2 target
    ### DF2: target data
    ### DF1: raw pixel map
    DF2_X = DF2.loc[:, ['X']]
    DF2_Y = DF2.loc[:, ['Y']]
    results = list()
    for i in DF2.index:
        # Select target XY from DF2
        x = DF2_X.at[i, 'X']
        y = DF2_Y.at[i, 'Y']
        # Select X,Y range for DF1
        upper_lmt_X = x + half_range
        lower_lmt_X = x - half_range
        upper_lmt_Y = y + half_range
        lower_lmt_Y = y - half_range
        # Select data from DF1 according to X,Y range, calculate average Z
        subset_X = DF1.loc[(DF1['X'] > lower_lmt_X) & (DF1['X'] < upper_lmt_X)]
        subset_XY = subset_X.loc[(subset_X['Y'] > lower_lmt_Y) & (subset_X['Y'] < upper_lmt_Y)]
        result = subset_XY.mean(axis=0, skipna=True)
        result['X'] = x  # set X,Y in new_DF1 the same as the X,Y in DF2
        result['Y'] = y  # set X,Y in new_DF1 the same as the X,Y in DF2
        results.append(result)
    results = pd.DataFrame(results)
    return results
Test and Result:
new_DF1 = DF1_match_DF2_target(0.5,DF2,DF1)
new_DF1
Test and Result
How about using the pandas.cut() function to aggregate using the boundary values?
half_range = 0.5

# create bin edges centred on DF2's X (and Y) coordinates
x_bins = [0] + list(DF2.X)
y_bins = [0] + list(DF2.Y)
tmp = [half_range] * (len(DF2) + 1)
x_bins = [a + b for a, b in zip(x_bins, tmp)]
y_bins = [a + b for a, b in zip(y_bins, tmp)]

# bin DF1 by X and average Z within each bin; the bins line up with DF2's rows
key = pd.cut(DF1.X, bins=x_bins, right=False, precision=1)
DF2['Z'] = DF1.groupby(key)['Z'].mean().values
DF2
   X  Y    Z
0  1  1  3.0
1  2  2  6.5
2  3  3  9.5
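If the X and Y grids ever differ, the same idea could be applied on both axes. A rough sketch, reusing the x_bins and y_bins built above:
# bin on X and Y together and average Z per 2D cell
x_key = pd.cut(DF1.X, bins=x_bins, right=False)
y_key = pd.cut(DF1.Y, bins=y_bins, right=False)
z_means = DF1.groupby([x_key, y_key], observed=True)['Z'].mean()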
Below is Youtuber Sentdex's machine learning code, and I couldn't understand some parts.
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing, model_selection
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('titanic.xls')
original_df = pd.DataFrame.copy(df)
df.drop(['body', 'name'], 1, inplace=True)
df.fillna(0, inplace=True)

def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}

        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x += 1
            df[column] = list(map(convert_to_int, df[column]))
    return df

df = handle_non_numerical_data(df)
df.drop(['ticket', 'home.dest'], 1, inplace=True)

X = np.array(df.drop(['survived'], 1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = MeanShift()
clf.fit(X)

labels = clf.labels_  ### Can't understand ###
cluster_centers = clf.cluster_centers_

original_df['cluster_group'] = np.nan
for i in range(len(X)):
    original_df['cluster_group'].iloc[i] = labels[i]

n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[(original_df['cluster_group'] == float(i))]
    # print(temp_df.head())
    survival_cluster = temp_df[(temp_df['survived'] == 1)]
    survival_rate = len(survival_cluster) / len(temp_df)
    # print(i, survival_rate)
    survival_rates[i] = survival_rate

print(survival_rates)
Supposedly, in labels = clf.labels_, the labels range over [0:5] (those are the numbers I got when I ran the program). But here's the question: where do those numbers come from, and why 0, 1, 2? Why not bigger numbers?
scikit-learn's documentation on MeanShift provides an explanation of the labels_ attribute that you seem confused about. Taken directly from the documentation:
labels_ :
Labels of each point.
If you're more confused about what these labels represent, a brief explanation would be that the number refers to which bin (cluster) that specific point was assigned to. So all the points with a value of 0 belong to the same cluster, all the points with a value of 1 belong to another cluster, and so on. The actual values of these labels don't matter; they're just there to identify which cluster each data point belongs to.
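If it helps to see that concretely, you could inspect the fitted labels directly. A small sketch using the labels array from the question's code:
import numpy as np

# the label values are arbitrary cluster ids starting at 0;
# return_counts shows how many points fell into each cluster
unique_labels, counts = np.unique(labels, return_counts=True)
print(unique_labels)  # e.g. [0 1 2 3 4 5]
print(counts)         # points per cluster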
I'd recommend reading more about clustering if you're still confused about why you would want to label the data.
I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided some sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 2018")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return(df)

df = sample(rSeed=123, periodLength=50, colNames=['X1', 'X2'])
print(df)
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least matches the structure of your output: you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for other examples using the same function, online and in the statsmodels docs, but I was unable to find any that actually worked. What I did find were a few discussions about how this functionality was deprecated a while ago. So I then tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    and runs a series of regressions with a specified window length, and
    returns a dataframe with BETA or R^2 for each window split of the data.

    Parameters:
    ===========
    df: pandas dataframe
    subset: integer - has to be smaller than the size of the df
    dependent: string that specifies name of dependent variable
    independent: LIST of strings that specifies names of independent variables
    const: boolean - whether or not to include a constant term
    win: integer - window length of each model
    parameters: string that specifies which model parameters to return:
                BETA or R^2

    Example:
    ========
    RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
                   const = True, parameters = 'beta', win = 30)
    """
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df

    # Loop info
    end = df.shape[0]
    win = win
    rng = np.arange(start=win, stop=end, step=1)

    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        # print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent

        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()

        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])

        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]

        df_results = pd.concat([df_results, df_temp], axis=0)

    return(df_results)
df_rolling = RegressionRoll(df=df, subset=50, dependent='X1', independent=['X2'], const=True, parameters='beta', win=30)
Output: A dataframe with beta estimates for the OLS of X1 on X2 for each 30-period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
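As a side note, newer statsmodels releases (0.11 and later, if I recall correctly) ship a dedicated rolling estimator that may cover this use case directly. A minimal sketch, assuming the synthetic df from above:
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# one row of coefficient estimates per 30-observation window;
# the first window-1 rows of params are NaN
rolling_res = RollingOLS(df['X2'], sm.add_constant(df['X1']), window=30).fit()
print(rolling_res.params.dropna().head())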