groupby operation in pandas.DataFrame without outliers - python

For a pandas.Series, I know how to remove outliers with something like this:

import numpy as np
import pandas as pd

x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
I would like to do this over the different Series/columns of a DataFrame:
import string
import random
df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])
df = df.set_index('index')  # set_index returns a new frame, so assign it back
I usually do stuff like
df = df.groupby('index').mean()
However, in this case it would also average the outliers, which I would like to exclude from the averaging.
Notice that with random data the outliers fall in different positions in each column, so an outlier should be ignored only in its own column/Series.
The result should be a DataFrame with 26 rows (one for each letter in index) and 3 columns, with the values averaged without outliers.
I can loop over the columns of df and apply the first block of code, but is there a nicer way?
Suggestions are welcome; any approach is accepted.

Use the following code.
def mean_without_outlier(x):  # x: the series of one column within one group
    # Keep only values inside the 1.5*IQR fences, then average the rest.
    iqr = x.quantile(.75) - x.quantile(.25)
    y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
    return y.mean()

df.groupby("index")[['A', 'B', 'C']].agg(mean_without_outlier)

Related

changing frequency in a pandas SeriesGroupBy

I'm struggling to find a simple way to change the frequency of a pd.Series that is grouped on some level of a pd.MultiIndex (so it's a pd.core.groupby.generic.SeriesGroupBy).
Here's a simple example of how this can be done with a standard pd.Series:
import pandas as pd

dates = pd.date_range('1990', '2000', freq='M')
values = range(len(dates))
vec = pd.Series(values, index=dates)
vec.asfreq('D')
Here's what I'd hope would work for a grouped series but it doesn't:
idx = [x // 12 for x in values]
midx = pd.MultiIndex.from_arrays([idx, dates], names=['level0', 'level1'])
vec = pd.Series(values, index=midx)
vec.groupby('level0').asfreq('D')
I know that I could change the index to level1, change the frequency and group it again but I wonder if there is a better way to do it.
Edit
Here's the approach I mentioned above (very crude, I know):

vec_new = vec.groupby('level0').obj\
    .reset_index()\
    .set_index('level1')\
    .asfreq('D', method='ffill')\
    .reset_index()
midx = pd.MultiIndex.from_arrays([vec_new['level0'], vec_new['level1']], names=['level0', 'level1'])
vec_new.index = midx
vec_new.drop(columns=['level1'], inplace=True)
vec_new.groupby('level0')
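For what it's worth, a more direct route is a sketch like the following (my addition, not from the original post): drop the grouping level inside groupby().apply() so that asfreq sees a plain DatetimeIndex; apply() then prepends the group key again, rebuilding the MultiIndex.

import pandas as pd

dates = pd.date_range('1990', '2000', freq='M')
values = range(len(dates))
idx = [x // 12 for x in values]
midx = pd.MultiIndex.from_arrays([idx, dates], names=['level0', 'level1'])
vec = pd.Series(values, index=midx)

# Per group: drop 'level0' so the remaining index is a DatetimeIndex,
# resample with asfreq, and let apply() restore the group level.
out = vec.groupby(level='level0').apply(
    lambda s: s.droplevel('level0').asfreq('D', method='ffill'))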

Normalizing a huge python dataframe

I have a huge CSV file (~2 GB) that I have imported using Dask. Now I want to normalize this dataframe. The dataframe contains about 70k columns. I have written this Python function to calculate this:
from tqdm import tqdm

def normalize(df):
    result = df.copy()
    for col in tqdm(df.columns):
        if col != 'name':  # basically, don't normalize the "name" column
            max_value = df[col].max()
            min_value = df[col].min()
            result[col] = (df[col] - min_value) / (max_value - min_value)
    return result
It works okay but takes a lot of time. When I ran it, the progress bar estimated approximately 88 hours to completion. I tried switching to sklearn's MinMaxScaler, but it doesn't show any progress of the normalization and I am afraid that it will also take quite a lot of time. Is there any other way to normalize all the columns (and ignore a few, like I did with that if condition)?
You don't need to loop through this. If all columns other than name hold numerical values, you can just do something along the following lines:
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
Here is a minimal code sample:
import pandas as pd
df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
print(df)
"I am afraid that it will also take quite a lot of time"
Then, considering that you just need numerical operations, I suggest using NumPy for the actual number crunching and pandas only for extracting the columns to process. A simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A','B','C'], 'x1': [1,2,3], 'x2': [4,8,6], 'x3': [10,15,30]})
num_arr = df[['x1','x2','x3']].to_numpy()
mins = np.min(num_arr, axis=0)
maxs = np.max(num_arr, axis=0)
result_arr = (num_arr - mins) / (maxs - mins)
result_df = pd.concat([df[['name']], pd.DataFrame(result_arr, columns=['x1','x2','x3'])], axis=1)
print(result_df)
Output:
name x1 x2 x3
0 A 0.0 0.0 0.00
1 B 0.5 1.0 0.25
2 C 1.0 0.5 1.00
Disclaimer: this solution assumes that df has a default index (0, 1, 2, ...).
If you need a further speed increase, consider parallelization, which is applicable here because the values in each column are computed independently of the other columns; a minimal sketch follows.
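As a rough illustration of that idea (my addition, untested on the 70k-column data; the chunk count, worker count, and random stand-in array are arbitrary choices):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def min_max_scale(chunk):
    # Each column of the chunk is normalized independently.
    mins = chunk.min(axis=0)
    maxs = chunk.max(axis=0)
    return (chunk - mins) / (maxs - mins)

if __name__ == '__main__':
    num_arr = np.random.rand(1000, 12)           # stand-in for the real data
    chunks = np.array_split(num_arr, 4, axis=1)  # 4 groups of columns
    with ProcessPoolExecutor(max_workers=4) as pool:
        result_arr = np.hstack(list(pool.map(min_max_scale, chunks)))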

pandas groupby column with rolling mean, limited between datetimes, without iterating over each row

I have data in a dataframe as follows:
import numpy
import pandas

ROWS = 1000
df = pandas.DataFrame()
df['DaT'] = pandas.date_range('2000-1-1', periods=ROWS, freq='H')
df['cat'] = numpy.random.choice(['a','b','c'], size=ROWS)
df['val'] = numpy.random.randint(2, size=ROWS)
df['r10'] = df.groupby(['cat'])['val'].apply(lambda x: x.rolling(10).mean())
I need to calculate a column that is grouped by category 'cat' and is a rolling (10-period) mean of the 'val' column, except that the rolling mean for a given row must not include values from the day that row occurs on.
The desired result ('wanted') can be generated as follows:
df['wanted'] = numpy.nan
for idx, row in df.iterrows():
    Rdate = row['DaT'].normalize()
    Rcat = row['cat']
    try:
        df.loc[idx, 'wanted'] = df[(df['DaT'] < Rdate) & (df['cat'] == Rcat)]['val'].rolling(10).mean().iloc[-1]
    except IndexError:  # no rows before this day for this category
        df.loc[idx, 'wanted'] = numpy.nan
The above is an awful solution, but it gets the result. It is very slow for the 100000+ rows that need to go through. Is there a more elegant solution?
I have tried using combinations of shift and even quantize to get a more efficient solution, but no success yet.
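One vectorized possibility is a sketch along these lines (my addition, not from the original post; it assumes, like the loop above, that a row should see the rolling mean as of the last row of the previous day with data in its category): compute the rolling mean once per category, keep each day's last value, shift it one day within the category, and merge it back onto the rows.

df = df.sort_values('DaT')
df['day'] = df['DaT'].dt.normalize()
df['r10'] = df.groupby('cat')['val'].transform(lambda s: s.rolling(10).mean())
# Last rolling value per (cat, day), shifted so a row sees only prior days.
daily_last = df.groupby(['cat', 'day'])['r10'].agg(lambda s: s.iloc[-1]) \
               .groupby(level='cat').shift()
df = df.merge(daily_last.rename('wanted_fast').reset_index(),
              on=['cat', 'day'], how='left')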

What's fastest way to perform calculation based on values of other columns in a dataframe?

I have df, and I have to apply this formula to every row, then add the new series as a new column. (The formula was posted as an image; as the loop below shows, it is 100 * log10(sum(ATR) / (max(high) - min(low))) / log10(n), computed over the previous n rows.)
Right now my code is :
from collections import deque
import numpy as np
import pandas as pd

new_col = deque()
for i in range(len(df)):
    if i < n:
        new_col.append(0)
    else:
        x = np.log10(np.sum(ATR[i-n:i]) / (max(high[i-n:i]) - min(low[i-n:i])))
        y = np.log10(n)
        new_col.append(100 * x / y)
df['new_col'] = pd.DataFrame({"new_col": new_col})
ATR, high, low are obtained from columns of my existing df. But this method is very slow. Is there a faster way to perform the task? Thanks.
Without sample data, I can't test the following, but it should work:
tmp_df = df.rolling(n).agg({'High': 'max', 'Low': 'min', 'ATR': 'sum'})
# note: log10 must apply to the whole ratio, matching the loop above
df['new_col'] = 100 * np.log10(tmp_df['ATR'] / (tmp_df['High'] - tmp_df['Low'])) / np.log10(n)
df['new_col'] = df['new_col'].shift().fillna(0)
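For anyone who wants to check the equivalence against the original loop, here is a quick self-test with made-up data (my addition; the column names High/Low/ATR and n = 14 are assumptions):

import numpy as np
import pandas as pd

n = 14
df = pd.DataFrame({'High': np.random.rand(100) + 1.0,
                   'Low': np.random.rand(100),
                   'ATR': np.random.rand(100)})

tmp_df = df.rolling(n).agg({'High': 'max', 'Low': 'min', 'ATR': 'sum'})
vec = (100 * np.log10(tmp_df['ATR'] / (tmp_df['High'] - tmp_df['Low']))
       / np.log10(n)).shift().fillna(0)

loop = [0.0] * n  # the question's loop appends 0 for the first n rows
for i in range(n, len(df)):
    x = np.log10(df['ATR'].iloc[i-n:i].sum()
                 / (df['High'].iloc[i-n:i].max() - df['Low'].iloc[i-n:i].min()))
    loop.append(100 * x / np.log10(n))

assert np.allclose(vec, loop)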

Iterate and change value based on function in Python pandas

Please help. It seems easy; I just can't figure it out.
A DataFrame (df) contains numbers. For each column:
* compute the mean and std
* compute a new value for each value in each row of the column
* replace each value with the new value
Method 1
import numpy as np
import pandas as pd

n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col) / len(col)
    std = np.std(col, axis=0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n + 1
Method 2
labels = df.columns.values.tolist()
df1 = df.ix[:, 0]
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col) / len(col)
    std = np.std(col, axis=0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]: str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n + 1
Error: ValueError: If using all scalar values, you must pass an index
I also tried the .apply method, but the new DataFrame doesn't change the values.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
You are standard-normalizing each column by removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
A simple, first-pass improvement might be:
# import your data
import json
df = pd.DataFrame(json.loads(json_text))
# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
That is vectorized by column. It retains some explicit looping, but is straightforward and relatively beginner-friendly.
If you're comfortable with pandas, it can be done more compactly:
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
In your revised DataFrame, there are no string columns. But the earlier DataFrame had string columns, causing problems when they were computed upon, so let's be careful. This is a generic way to select numeric columns. If it's too much, you can simplify at the cost of generality by listing them explicitly:
numeric_cols = ['col1', 'col2', 'col3', 'col4']
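One detail worth noting when comparing the two answers (my addition): scikit-learn's StandardScaler divides by the population standard deviation (ddof=0), while pandas' Series.std() defaults to the sample standard deviation (ddof=1), so the results differ slightly on small data. To make the pandas one-liner match the scaler:

# use the population std (ddof=0), as StandardScaler does
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(ddof=0), axis=0)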
