Please help. It seems easy, but I just can't figure it out.
My DataFrame (df) contains numbers. For each column:
* compute the mean and std
* compute a new value for each value in the column
* replace the original value with the new value
Method 1
import numpy as np
import pandas as pd

n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col) / len(col)
    std = np.std(col, axis=0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n + 1
Method 2
labels = df.columns.values.tolist()
df1 = df.ix[:, 0]
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col) / len(col)
    std = np.std(col, axis=0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]: str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n + 1
Error: ValueError: If using all scalar values, you must pass an index
I also tried the .apply method, but the values in the new DataFrame don't change.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
You are standard normalizing each column by removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
See the scikit-learn documentation for StandardScaler for the details.
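As a quick sanity check (my own addition, using the df posted above): every column of new_df should now have a mean of roughly 0 and unit population standard deviation. Note that StandardScaler scales with ddof=0, whereas DataFrame.std() defaults to ddof=1:
print(new_df.mean().round(6))        # approximately 0 for every column
print(new_df.std(ddof=0).round(6))   # approximately 1 for every column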
It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
A simple, first-pass improvement might be:
# import your data
import json
import numpy as np
import pandas as pd

df = pd.DataFrame(json.loads(json_text))

# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
That is vectorized by column. It retains some explicit looping, but is straightforward and relatively beginner-friendly.
If you're comfortable with Pandas, it can be done more compactly:
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
Your revised DataFrame has no string columns, but your earlier one did, and those caused problems when they were computed on, so it pays to be careful. select_dtypes is a generic way to pick out only the numeric columns; if it feels like too much, you can simplify at the cost of generality by listing them explicitly:
numeric_cols = ['col1', 'col2', 'col3', 'col4']
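For completeness (my own addition, not part of the original answer): the apply isn't strictly needed either, because pandas broadcasts the column-wise mean and std automatically, so with numeric_cols defined either way you can write:
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()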
Related
I have about 88 columns in a pandas dataframe. I'm trying to apply a formula that calculates a single value for each column. How do I switch out the name of each column and then build a new single-row dataframe from the equation?
Below is the equation (linear mixed model) which results in a single value for each column.
B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum())/Area_sum) *
(gdf.groupby(['Benthic_Mo'])['W8_629044'].mean())).sum()
Below is a sample of the names of the columns
['OBJECTID', 'Benthic_Mo', 'SHAPE_Leng', 'SHAPE_Area', 'geometry', 'tmp', 'Species','W8_629044', 'W8_642938', 'W8_656877', 'W8_670861', 'W8_684891', 'W8_698965', 'W8_713086', 'W8_72726',...]
The W8_## columns need to be switched out in the formula, but there are about 80 of them. The output I need is a new dataframe with a single row. I would also like to calculate the variance or standard deviation of the values calculated with the formula.
thank you!
You can loop through the dataframe columns. I think the below code should work.
collist = list(gdf.columns)
emptylist = []
emptydict = {}
for i in collist[7:]:
    B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum()) / Area_sum) *
          (gdf.groupby(['Benthic_Mo'])[i].mean())).sum()
    emptydict[i] = B1
emptylist.append(emptydict)
resdf = pd.DataFrame(emptylist)
To create a new df with the results in one row (one column per result), you can do something similar to the following:
W8_cols = [col for col in gdf.columns if 'W8_' in col]
df_out = pd.DataFrame()
for col in W8_cols:
    B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum()) / Area_sum) *
          (gdf.groupby(['Benthic_Mo'])[col].mean())).sum()
    t_data = [{col: B1}]
    df_temp = pd.DataFrame(t_data)
    data = [df_out, df_temp]
    df_out = pd.concat(data, axis=1)
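A more compact variant (my own sketch, under the same assumptions: gdf, Area_sum and the W8_ columns exist) lets pandas broadcast the per-group area share across all W8_ columns at once and builds the single-row frame in one step:
area_share = gdf.groupby('Benthic_Mo')['SHAPE_Area'].sum() / Area_sum
col_means = gdf.groupby('Benthic_Mo')[W8_cols].mean()
df_out = col_means.mul(area_share, axis=0).sum().to_frame().T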
I have a huge csv file (~2GB) that I have imported using Dask. Now I want to normalize this dataframe. The dataframe contains about 70k columns. I have written this python function to calculate this:
from tqdm import tqdm

def normalize(df):
    result = df.copy()
    for col in tqdm(df.columns):
        if col != 'name':  # basically don't normalize the column named "name"
            max_value = df[col].max()
            min_value = df[col].min()
            result[col] = (df[col] - min_value) / (max_value - min_value)
    return result
It works okay but takes a lot of time. I started it running and it shows it will take approximately 88 hours to complete. I tried switching to sklearn's MinMaxScaler(), but it doesn't show any progress of the normalization and I am afraid that it will also take quite a lot of time. Is there any other way to normalize all the columns (and ignore a few, like I did with that if condition)?
You don't need to loop at all. If every column other than name is numeric, you can do something along the following lines:
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
Here is a minimal code sample:
import pandas as pd
df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
print(df)
"I am afraid that it will also take quite a lot of time"
Considering that you only need numerical operations, I suggest using numpy for the actual number crunching and pandas only for extracting the columns to process. Simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name':['A','B','C'],'x1':[1,2,3],'x2':[4,8,6],'x3':[10,15,30]})
num_arr = df[['x1','x2','x3']].to_numpy()
mins = np.min(num_arr,0)
maxs = np.max(num_arr,0)
result_arr = (num_arr - mins) / (maxs - mins)
result_df = pd.concat([df[['name']],pd.DataFrame(result_arr,columns=['x1','x2','x3'])],axis=1)
print(result_df)
output
name x1 x2 x3
0 A 0.0 0.0 0.00
1 B 0.5 1.0 0.25
2 C 1.0 0.5 1.00
Disclaimer: this solution assumes that df has a default index like 0, 1, 2, ...
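If your df doesn't have such an index (a note of mine, not part of the original answer), you can sidestep the alignment issue by giving the numeric part the original index before concatenating:
result_df = pd.concat(
    [df[['name']], pd.DataFrame(result_arr, columns=['x1', 'x2', 'x3'], index=df.index)],
    axis=1,
)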
If you need a further speed increase, consider parallelization, which is applicable here because the values in each column are computed independently of the other columns.
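As a rough illustration only (my own sketch; normalize_block, normalize_parallel and n_jobs are made-up names, and how much threads actually help depends on whether the underlying numpy operations release the GIL for your data), parallelizing over blocks of columns could look something like this:

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def normalize_block(block):
    # block: 2-D numpy array holding a contiguous group of numeric columns
    mins = block.min(axis=0)
    maxs = block.max(axis=0)
    return (block - mins) / (maxs - mins)

def normalize_parallel(df, exclude=('name',), n_jobs=4):
    num_cols = [c for c in df.columns if c not in exclude]
    arr = df[num_cols].to_numpy()
    # split the columns into n_jobs roughly equal blocks and scale them concurrently
    blocks = np.array_split(arr, n_jobs, axis=1)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(normalize_block, blocks))
    out = df.copy()
    out[num_cols] = np.hstack(results)
    return out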
For a pandas.Series, I know how to remove outliers with something like this:
x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
I would like to do this over the different Series/columns of a DataFrame:
import string
import random
import numpy as np
import pandas as pd

df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])
df = df.set_index('index')
I usually do stuff like
df = df.groupby('index').mean()
However, in this case it would also average the outliers, which I would like to exclude from the averaging.
Notice that with random data the outliers end up in different positions in each column, so an outlier should be ignored only in its own column/Series.
The result should be a DataFrame with 26 rows (one for each letter of the index) and 3 columns, with the values averaged without the outliers.
I can loop over the columns of df and do the first block of code. But is there a nicer way?
Suggestions are welcome. Any approach is accepted.
Use the following code.
def mean_without_outlier(x):  # x: series
    iqr = x.quantile(.75) - x.quantile(.25)
    y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
    return y.mean()

df.groupby("index")[['A', 'B', 'C']].agg(mean_without_outlier)
I have df, and I have to apply this formula to every row, then add the new series as a new column (the original formula image is gone; reconstructed from the code below):
new_col[i] = 100 * log10( sum(ATR[i-n:i]) / (max(high[i-n:i]) - min(low[i-n:i])) ) / log10(n)
Right now my code is:
from collections import deque

new_col = deque()
for i in range(len(df)):
    if i < n:
        new_col.append(0)
    else:
        x = np.log10(np.sum(ATR[i-n:i]) / (max(high[i-n:i]) - min(low[i-n:i])))
        y = np.log10(n)
        new_col.append(100 * x / y)
df['new_col'] = pd.DataFrame({"new_col": new_col})
ATR, high, low are obtained from columns of my existing df. But this method is very slow. Is there a faster way to perform the task? Thanks.
Without sample data, I can't test the following, but it should work:
tmp_df = df.rolling(n).agg({'High': 'max', 'Low': 'min', 'ATR': 'sum'})
df['new_col'] = 100 * np.log10(tmp_df['ATR'] / (tmp_df['High'] - tmp_df['Low'])) / np.log10(n)
df['new_col'] = df['new_col'].shift().fillna(0)
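As a quick sanity check (my own sketch on made-up data, assuming the column names 'High', 'Low' and 'ATR' as above), the vectorized version can be compared against the loop from the question:

import numpy as np
import pandas as pd

n = 14
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'High': rng.uniform(10, 20, 100),
    'Low': rng.uniform(0, 10, 100),
    'ATR': rng.uniform(1, 5, 100),
})

# loop version from the question
loop_col = []
for i in range(len(df)):
    if i < n:
        loop_col.append(0)
    else:
        x = np.log10(df['ATR'][i-n:i].sum() / (df['High'][i-n:i].max() - df['Low'][i-n:i].min()))
        loop_col.append(100 * x / np.log10(n))

# vectorized version from above
tmp = df.rolling(n).agg({'High': 'max', 'Low': 'min', 'ATR': 'sum'})
vec = (100 * np.log10(tmp['ATR'] / (tmp['High'] - tmp['Low'])) / np.log10(n)).shift().fillna(0)

assert np.allclose(loop_col, vec)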
I am new to Python. I am trying to do the following loop and wonder if I am doing it the correct way or if there is a better (faster) way to do it. Briefly, I want to compute a series of conditional means of a variable y. The conditions are built from the x variables. For example, if the df contains y, x1, x2, x3, x4, the first set of conditions would be x1>x2 and x1<x3, the next x1>x2 and x1<x4, and so on for every combination of three x columns.
import pandas as pd
import numpy as np
import itertools

dates = pd.date_range('20130101', periods=100)
df = pd.DataFrame(np.random.randn(100, 10), index=dates,
                  columns=list('ABCDEFGHIJ'))
df['y'] = np.random.randn(100)

cols = list(df)
cols.insert(0, cols.pop(cols.index('y')))
df = df.loc[:, cols]

xlist = np.asarray(list(df.iloc[:, 1:]))
xlist = pd.DataFrame(xlist, columns=['x'])
xcombo = pd.DataFrame(np.asarray(list(itertools.combinations(xlist['x'], 3))), columns=['x1', 'x2', 'x3'])
xcombo['stat'] = ""

for i, row in xcombo.iterrows():
    x1 = xcombo['x1'][i]
    x2 = xcombo['x2'][i]
    x3 = xcombo['x3'][i]
    # the following two lines (intend to) select the subset of df meeting the condition x1>x2 and x1<x3
    dfx = df[df[x1] > df[x2]]
    dfx = dfx[dfx[x1] < dfx[x3]]  # df[df[x1] > df[x2] and df[x1] < df[x3]] doesn't work
    xcombo['stat'][i] = dfx['y'].mean()  # store the mean value of y in the corresponding row
You can use the itertuples() method of a pandas DataFrame. It is much faster than iteritems() or iterrows().
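For the loop above, that could look roughly like this (my own adaptation, reusing the df and xcombo built in the question):

stats = []
for row in xcombo.itertuples(index=False):
    # row.x1, row.x2, row.x3 are column names; build the condition x1 > x2 and x1 < x3
    mask = (df[row.x1] > df[row.x2]) & (df[row.x1] < df[row.x3])
    stats.append(df.loc[mask, 'y'].mean())
xcombo['stat'] = stats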