Normalizing a huge python dataframe - python

I have a huge CSV file (~2 GB) that I have imported using Dask. Now I want to normalize this dataframe. The dataframe contains about 70k columns. I have written this Python function to do it:
from tqdm import tqdm

def normalize(df):
    result = df.copy()
    for col in tqdm(df.columns):
        if col != 'name':  # basically, don't normalize the column named "name"
            max_value = df[col].max()
            min_value = df[col].min()
            result[col] = (df[col] - min_value) / (max_value - min_value)
    return result
It works okay but takes a lot of time. I put it into execution and it's showing that it will take approximately 88 hours to complete. I tried switching to sklearn's MinMaxScaler, but it doesn't show any progress of the normalization and I am afraid that it will also take quite a lot of time. Is there any other way to normalize all the columns (while ignoring a few, like I did with that if condition)?

You don't need to loop through this. If all columns other than name hold numerical values, you can just do something along the following lines:
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
Here is a minimal code sample:
import pandas as pd
df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
print(df)

I am afraid that it will also take quite a lot of time
Then, considering that you just need numerical operations, I suggest using numpy for the actual number crunching and pandas only to extract the columns to process. A simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'x1': [1, 2, 3], 'x2': [4, 8, 6], 'x3': [10, 15, 30]})
num_arr = df[['x1', 'x2', 'x3']].to_numpy()
mins = np.min(num_arr, axis=0)
maxs = np.max(num_arr, axis=0)
result_arr = (num_arr - mins) / (maxs - mins)
result_df = pd.concat([df[['name']], pd.DataFrame(result_arr, columns=['x1', 'x2', 'x3'])], axis=1)
print(result_df)
output
name x1 x2 x3
0 A 0.0 0.0 0.00
1 B 0.5 1.0 0.25
2 C 1.0 0.5 1.00
Disclaimer: this solution assumes that df has a default index (0, 1, 2, ...).
If you need a further speed increase, consider parallelization; it applies here because each column's values are computed independently of the other columns.
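Since the question already loads the CSV with Dask, here is a rough sketch (assuming a dask.dataframe called ddf and a placeholder path data.csv) of the same min-max normalization done lazily, so the 2 GB file is processed in chunks rather than held in memory at once:
import dask
import dask.dataframe as dd

ddf = dd.read_csv("data.csv")  # placeholder path
num_cols = [col for col in ddf.columns if col != "name"]

# build both reductions lazily, then evaluate them in a single pass over the data
mins, maxs = dask.compute(ddf[num_cols].min(), ddf[num_cols].max())

def scale(partition):
    # mins/maxs are pandas Series indexed by column name, so this aligns per column
    partition[num_cols] = (partition[num_cols] - mins) / (maxs - mins)
    return partition

normalized = ddf.map_partitions(scale)
normalized.to_parquet("normalized/")  # or .compute() to get a pandas DataFrame
If progress feedback is needed, Dask's ProgressBar (from dask.diagnostics) can be used to wrap the compute call.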

Related

groupby operation in pandas.DataFrame without outliers

For a pandas.Series, I know how to remove outliers with something like this:
import numpy as np
import pandas as pd

x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
I would like to do this over the different Series/columns of a DataFrame:
import string
import random
df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])
df.set_index('index')
I usually do stuff like
df = df.groupby('index').mean()
However, in this case, it would also average the outliers, which I would like to exclude from the average.
Note that with random data the outliers end up in different positions in each column, so an outlier should be ignored only in its own column/Series.
The result should be a DataFrame with 26 rows (one for each letter in index) and 3 columns, with the values averaged without outliers.
I can loop over the columns of df and apply the first block of code, but is there a nicer way?
Suggestions are welcome; any approach is accepted.
Use the following code.
def mean_without_outlier(x):  # x: Series
    iqr = x.quantile(.75) - x.quantile(.25)
    y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
    return y.mean()

df.groupby("index")[['A', 'B', 'C']].agg(mean_without_outlier)

Mean of every 15 rows of a dataframe in python

I have a dataframe of shape (1500, 11). I have to take the mean of every 15 rows, for each of the 11 columns separately, so my final dataframe should be of dimension 100x11. How do I do this in Python?
The following should work:
dfnew = df[:0]
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]
    x = pd.Series(dict(df2.mean()))
    dfnew = dfnew.append(x, ignore_index=True)
print(dfnew)
I don't know much about pandas, so I've coded my solution in pure numpy, without any Python loops, which makes it very efficient; the result is then converted back to a pandas DataFrame:
import pandas as pd, numpy as np
df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))
a = np.mean(a, axis = 1)
df = pd.DataFrame(a)
print(df)
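For completeness, the same chunked mean can also be done with a groupby on a computed group key; a short sketch, assuming as above that the row count is an exact multiple of 15:
import numpy as np
import pandas as pd

df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])

# integer-divide the positional row number by 15 to label each block of 15 rows
result = df.groupby(np.arange(len(df)) // 15).mean()
print(result.shape)  # (100, 11)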
You can use pandas.DataFrame for this.
Use a for loop to compute the means, with a counter that is reset every 15 entries. Something along these lines:
columns = [col1, col2, ..., col12]
for col, values in df.items():
    # compute the mean
    # every 15 entries, save it and reset the counter
Then, using pd.DataFrame(), you can create the new dataframe from the saved means.
I'd recommend reading the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
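One way to make that counter idea concrete (a sketch with hypothetical variable names, assuming df is the 1500x11 frame from the question):
import pandas as pd

chunk_means = {}
for col, values in df.items():
    col_means = []
    running = []
    for v in values:
        running.append(v)
        if len(running) == 15:  # every 15 entries, save the mean and reset
            col_means.append(sum(running) / 15)
            running = []
    chunk_means[col] = col_means

result = pd.DataFrame(chunk_means)  # shape (100, 11)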

What's the fastest way to perform a calculation based on values of other columns in a dataframe?

I have a df, and I have to apply this formula to every row, then add the resulting series as a new column:
new_col[i] = 100 * log10( sum(ATR[i-n:i]) / (max(high[i-n:i]) - min(low[i-n:i])) ) / log10(n)
Right now my code is:
from collections import deque

new_col = deque()
for i in range(len(df)):
    if i < n:
        new_col.append(0)
    else:
        x = np.log10(np.sum(ATR[i-n:i]) / (max(high[i-n:i]) - min(low[i-n:i])))
        y = np.log10(n)
        new_col.append(100 * x / y)
df['new_col'] = pd.DataFrame({"new_col": new_col})
ATR, high, low are obtained from columns of my existing df. But this method is very slow. Is there a faster way to perform the task? Thanks.
Without sample data, I can't test the following, but it should work:
tmp_df = df.rolling(n).agg({'High': 'max', 'Low': 'min', 'ATR': 'sum'})
df['new_col'] = 100 * np.log10(tmp_df['ATR'] / (tmp_df['High'] - tmp_df['Low'])) / np.log10(n)
df['new_col'] = df['new_col'].shift().fillna(0)
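Since no sample data was posted, here is a small self-contained sketch (with made-up column values and n) that compares the original loop against the rolling version to check that they agree:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 14
df = pd.DataFrame({
    'High': rng.uniform(10, 20, 200),
    'Low': rng.uniform(1, 10, 200),
    'ATR': rng.uniform(0.5, 2.0, 200),
})

# loop version, following the question's formula
loop_col = []
for i in range(len(df)):
    if i < n:
        loop_col.append(0)
    else:
        x = np.log10(df['ATR'][i-n:i].sum() / (df['High'][i-n:i].max() - df['Low'][i-n:i].min()))
        loop_col.append(100 * x / np.log10(n))

# rolling version
tmp = df.rolling(n).agg({'High': 'max', 'Low': 'min', 'ATR': 'sum'})
rolled = (100 * np.log10(tmp['ATR'] / (tmp['High'] - tmp['Low'])) / np.log10(n)).shift().fillna(0)

print(np.allclose(loop_col, rolled))  # expect True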

How to randomly delete 10% of attribute values from a df in pandas

I have an example dataset. It has 2000 rows and 15 columns. The last column will be needed as the decision class for classification.
I need to randomly delete 10% of the attribute values, so 10% of the values in columns 0-13 should become NA.
I wrote a for loop. It draws a random colNumber (0-13) and rowNumber (0-2000) and replaces that value with NA. But I think (and I can see) that it's not a fast solution. I tried to find something else in pandas, rather than core Python, but couldn't find anything.
Maybe someone has a better idea? A more pandas-like solution? Or maybe something completely different?
You can make use of pandas' sample method.
Imports and set up data
import string

import numpy as np
import pandas as pd

n = 100
data = {
    'a': np.random.random(size=n),
    'b': np.random.choice(list(string.ascii_lowercase), size=n),
    'c': np.random.random(size=n),
}
df = pd.DataFrame(data)
Solution
for col in df.columns:
    df.loc[df.sample(frac=0.1).index, col] = np.nan
Solution without for loop:
def delete_10(col):
    col.loc[col.sample(frac=0.1).index] = np.nan
    return col

df.apply(delete_10, axis=0)
Check
Check to see proportion of NaN values:
df.isnull().sum() / len(df)
Output:
a 0.1
b 0.1
c 0.1
dtype: float64
Maybe this works: create an array of random numbers with the same shape as the dataframe and see where it is less than 0.1:
mask = np.random.random(df.shape) < 0.1
mask[:, 13:] = False  # leave the class column(s) untouched
df[mask] = np.nan
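A hedged one-liner variant of the same idea using DataFrame.mask, assuming the attribute columns are everything except the last (decision-class) column:
import numpy as np

attr_cols = df.columns[:-1]  # everything except the decision-class column
rand = np.random.random(df[attr_cols].shape)
df[attr_cols] = df[attr_cols].mask(rand < 0.1)  # mask() writes NaN where the condition is True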

Iterate and change value based on function in Python pandas

Please help. It seems easy, but I just can't figure it out.
The DataFrame (df) contains numbers. For each column:
* compute the mean and std
* compute a new value for each value in each row of that column
* replace the original value with the new value
Method 1
import numpy as np
import pandas as pd

n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col) / len(col)
    std = np.std(col, axis=0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n + 1
Method 2
labels = df.columns.values.tolist()
df2 = df.ix[:, 0]
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col) / len(col)
    std = np.std(col, axis=0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]: str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n + 1
Error: ValueError: If using all scalar values, you must pass an index
I also tried the .apply method, but the new DataFrame doesn't change the values.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
You are standard-normalizing each column by removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
Here is the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
A simple, first-pass improvement might be:
# import your data
import json

import numpy as np
import pandas as pd

df = pd.DataFrame(json.loads(json_text))

# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
That is vectorized by column. It retains some explicit looping, but is straightforward and relatively beginner-friendly.
If you're comfortable with Pandas, it can be done more compactly:
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
In your revised DataFrame there are no string columns, but the earlier DataFrame had string columns, and those caused problems when they were computed upon, so let's be careful. select_dtypes is a generic way to pick out the numeric columns. If that feels like too much, you can simplify at the cost of generality by listing them explicitly:
numeric_cols = ['col1', 'col2', 'col3', 'col4']
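As a quick check (a sketch, assuming the df built from the JSON above and the numeric_cols list from either variant), the normalized columns should end up with mean roughly 0 and standard deviation roughly 1. Note that pandas' .std() is the sample standard deviation (ddof=1) while scikit-learn's StandardScaler divides by the population standard deviation (ddof=0), so the StandardScaler answer above gives slightly different scales:
print(df[numeric_cols].mean())  # approximately 0 for every column
print(df[numeric_cols].std())   # approximately 1 with the pandas (ddof=1) version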
