I have a column in the dataframe on which I apply many functions. For example,
df[col_name] = df[col_name].apply(lambda x: fun1(x))
df[col_name] = df[col_name].apply(lambda x: fun2(x))
df[col_name] = df[col_name].apply(lambda x: fun3(x))
I have 10 functions that I apply to this column for preprocessing and cleaning. Is there a way I can refactor this code to make the block smaller?
How about
def fun(x):
    for f in (fun1, fun2, fun3):
        x = f(x)
    return x
df[col_name] = df[col_name].apply(fun)
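If the list grows to ten functions, `functools.reduce` can thread the value through them without writing the loop by hand. A minimal sketch, with made-up cleaning steps standing in for the question's fun1..fun10:

```python
from functools import reduce

import pandas as pd

# Hypothetical cleaning steps standing in for fun1, fun2, ...
def strip_ws(x):
    return x.strip()

def lower(x):
    return x.lower()

def drop_digits(x):
    return ''.join(c for c in x if not c.isdigit())

PIPELINE = [strip_ws, lower, drop_digits]

def clean(x):
    # reduce threads x through each step in order: drop_digits(lower(strip_ws(x)))
    return reduce(lambda acc, f: f(acc), PIPELINE, x)

df = pd.DataFrame({'col': ['  Foo1 ', 'BAR22']})
df['col'] = df['col'].apply(clean)
print(df['col'].tolist())  # -> ['foo', 'bar']
```

Adding an eleventh step is then just appending to the PIPELINE list.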
I have a function which calls another one.
The objective is, by calling the function get_substr, to extract a substring based on the position of the nth occurrence of a character.
def find_nth(string, char, n):
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    if n == 1:
        return string[0:find_nth(string, char, n)]
    else:
        return string[find_nth(string, char, n-1) + len(char):find_nth(string, char, n)]
The function works.
Now I want to apply it on a dataframe by doing this.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'],'-',1))
I get an error:
KeyError: 'EQ'
I don't understand it as df_g['EQ'] exists.
Can you help me?
Thanks
You forgot about axis=1; without it, the function is applied to each column rather than each row. Consider a simple example:
import pandas as pd
df = pd.DataFrame({'A':[1,2],'B':[3,4]})
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
print(df)
output
A B Z
0 1 3 100
1 2 4 200
As a side note, if you are working with values from a single column, you might use pandas.Series.apply rather than pandas.DataFrame.apply. In the above example that would mean
df['Z'] = df['A'].apply(lambda x:x*100)
in place of
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
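Applied to the question's code, the fix is just the added axis=1; a self-contained check with made-up sample data:

```python
import pandas as pd

def find_nth(string, char, n):
    # Walk forward to the position of the nth occurrence of char.
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    if n == 1:
        return string[0:find_nth(string, char, n)]
    return string[find_nth(string, char, n - 1) + len(char):find_nth(string, char, n)]

df_g = pd.DataFrame({'EQ': ['AB-CD-EF', '12-34-56']})
# axis=1 makes apply pass each row as a Series, so x['EQ'] is valid.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'], '-', 1), axis=1)
print(df_g['F'].tolist())  # -> ['AB', '12']
```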
I have a dataframe with 3 million rows (df1) and another with 10k rows (df2). What is the fastest way to filter df1 for each row in df2?
Here is exactly what I need to do in the loop:
for i in list(range(len(df2))):  # For each row
    x = df1[(df1['column1'].isin([df2['info1'][i]]))
            & (df1['column2'].isin([df2['info2'][i]]))
            & (df1['column3'].isin([df2['info3'][i]]))]
    # ..... More code using x variable every time ......
This code is not fast enough to be viable.
Note that I used the .isin function, but inside it there's always only one item. I found that using .isin(), df1['column1'].isin([df2['info1'][i]]), was faster than using df1['column1'] == df2['info1'][i].
import pandas as pd
import numpy as np

def make_filter(x, y, match_dict, unique=False):
    filter = None
    for x_key in x.columns:
        if x_key in match_dict:
            y_key = match_dict[x_key]
            y_col = y[y_key]
            if unique:
                # De-duplicating the lookup column first can speed up isin.
                y_col = y_col.unique()
            col_filter = x[x_key].isin(y_col)
            if filter is None:
                filter = col_filter
            else:
                filter = filter & col_filter
    return filter

def main():
    n_rows = 100
    x = np.random.randint(4, size=(n_rows, 2))
    x = pd.DataFrame(x, columns=["col1", "col2"])
    y = np.random.randint(2, 4, size=(n_rows, 2))
    y = pd.DataFrame(y, columns=["info1", "info2"])
    match_dict = {"col1": "info1", "col2": "info2"}
    z = make_filter(x, y, match_dict, unique=True)
    print(x[z])

main()
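As a design note: when each df2 row has to match df1 on plain equality of a few columns, a single merge can often replace the per-row loop entirely, doing one hash join instead of 10k boolean scans over 3 million rows. A hedged sketch with made-up data and only two of the question's columns:

```python
import pandas as pd

# Toy stand-ins for the question's df1 (large) and df2 (small).
df1 = pd.DataFrame({'column1': [1, 1, 2], 'column2': [3, 4, 3], 'extra': [10, 20, 30]})
df2 = pd.DataFrame({'info1': [1, 2], 'info2': [3, 3]})

# Rename df2's key columns to match df1, then inner-join on them.
matched = df1.merge(
    df2.rename(columns={'info1': 'column1', 'info2': 'column2'}),
    on=['column1', 'column2'],
    how='inner',
)
print(sorted(matched['extra'].tolist()))  # -> [10, 30]
```

Caveat: if df2 contains duplicate key pairs, the inner join duplicates the matching df1 rows, so deduplicate df2 first if that matters. Whether this fits depends on what the "more code using x" does with each per-row slice.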
I have this txt:
1989MaiteyCarlos
2015mamasypadres
And I have code to split the words into different DataFrame columns.
The code is:
txt1=pd.read_table(r'C:\Users\TOSHIBA\Desktop\prueba.txt',header=None)
txt1['anno'] = txt1[0].apply(lambda x: x[:4])
txt1['chica'] = txt1[0].apply(lambda x: x[4:9])
txt1['chico'] = txt1[0].apply(lambda x: x[10:])
I need to write a general function to solve the problem. I tried this code:
def read_txt(df, columnas, rangos):
    for i, j in zip(columnas, rangos):
        for k in j:
            df[i] = df[0].apply(lambda x: x[k])
    return df
But it failed.
How can I write this function?
I solved the problem.
The function that I used is:
def read_txt(df, columnas, rangos):
    for i, j in zip(columnas, rangos):
        df[i] = df[0].apply(lambda x: x[j[0]:j[1]])
    return df

data = read_txt(txt1, ['anno', 'chica', 'chico'], [[0, 4], [4, 9], [10, 16]])
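A quick self-contained check of the fixed function, using the sample lines from the question in place of the file on disk:

```python
import pandas as pd

def read_txt(df, columnas, rangos):
    # Slice column 0 once per (name, [start, stop]) pair.
    for i, j in zip(columnas, rangos):
        df[i] = df[0].apply(lambda x: x[j[0]:j[1]])
    return df

# Stand-in for pd.read_table on prueba.txt.
txt1 = pd.DataFrame(['1989MaiteyCarlos', '2015mamasypadres'])
data = read_txt(txt1, ['anno', 'chica', 'chico'], [[0, 4], [4, 9], [10, 16]])
print(data[['anno', 'chica', 'chico']])
```

The earlier attempt failed because `for k in j` passed single indices (`x[0]`, `x[4]`, ...) instead of a slice; taking `j[0]:j[1]` in one step produces the intended substring.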
How can I calculate the 1% and 99% percentiles as a floor and cap for each column? If a value >= the 99% percentile, redefine it as the value of the 99% percentile; similarly, if a value <= the 1% percentile, redefine it as the value of the 1% percentile.
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0])>=0.7, 'learning', 'validation')
If we have hundreds of columns, can we use apply instead of a loop?
Based on Abdou's answer, the following might save you some time (iterating only the numeric columns, since lrnval holds strings, and using .loc to avoid chained assignment):

for col in df.select_dtypes('number').columns:
    percentiles = df[col].quantile([0.01, 0.99]).values
    df.loc[df[col] <= percentiles[0], col] = percentiles[0]
    df.loc[df[col] >= percentiles[1], col] = percentiles[1]
or use numpy.clip:
import numpy as np
for col in df.select_dtypes('number').columns:
    percentiles = df[col].quantile([0.01, 0.99]).values
    df[col] = np.clip(df[col], percentiles[0], percentiles[1])
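With hundreds of columns, even the loop can go: DataFrame.quantile returns per-column bounds as Series, and DataFrame.clip with axis=1 aligns those bounds column-wise in one call. A sketch, assuming the clipping should only touch the numeric columns:

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

num = df.select_dtypes('number')
low = num.quantile(0.01)    # Series: 1st percentile per column
high = num.quantile(0.99)   # Series: 99th percentile per column
# axis=1 aligns the bound Series with the column axis.
df[num.columns] = num.clip(lower=low, upper=high, axis=1)
```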
You can first define a helper function that takes in as arguments a series and a value and changes that value according to the conditions mentioned above:
def scale_val(s, val):
    percentiles = s.quantile([0.01, 0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val
Then you can use pd.DataFrame.apply and pd.Series.apply:
df.apply(lambda s: s.apply(lambda v: scale_val(s,v)))
Please note that this may be somewhat slow if you are dealing with a large amount of data, but I would suggest you give it a shot and see if it solves your problem within a reasonable time.
Edit:
If you only want to get the percentiles for rows of df where the column lrnval is equal to "learning", you can modify the function to calculate the percentiles for only rows where that condition is true:
def scale_val2(s, val):
    percentiles = s[df.lrnval.eq('learning')].quantile([0.01, 0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val
Since there is a column that contains strings, I assume that you won't be doing any calculations on it, so I would exclude it explicitly:

df.drop(columns=['lrnval']).apply(lambda s: s.apply(lambda v: scale_val2(s, v)))
I hope this proves useful.
I can't figure out how to apply a simple function to every row of a column in a pandas DataFrame.
Example:
def delLastThree(x):
    x = x.strip()
    x = x[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pandas.DataFrame(arr)
arrDF.columns = ['colOne']
arrDF['colOne'].apply(delLastThree)
print(arrDF)
I would expect the code above to return 'test' for every row. Instead it prints the original values.
How do I apply the delLastThree function to every row in the DF?
You are creating a pd.Series when selecting with single brackets, df['colOne'].
Either use .apply(func, axis=1) on a DataFrame, i.e. either when selecting with [['colOne']] or without selecting any columns. However, with .apply(axis=1) each x passed to the function is a pd.Series (a row), so the function needs the .str accessor for string methods.
With the pd.Series resulting from selecting with ['colOne'], you can use either .apply() or .map().
def delLastThree_series(x):
    x = x.strip()
    x = x[:-3]
    return x

def delLastThree_df(x):
    x = x.str.strip()
    x = x.str[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pd.DataFrame(arr)
arrDF.columns = ['colOne']
Now use either
arrDF.apply(delLastThree_df, axis=1)
arrDF[['colOne']].apply(delLastThree_df, axis=1)
or
arrDF['colOne'].apply(delLastThree_series)
arrDF['colOne'].map(delLastThree_series)
to get:
colOne
0 test
1 test
2 test
You could of course also just:
arrDF['colOne'].str.strip().str[:-3]
Use the map() function for a Series (single column):
In [15]: arrDF['colOne'].map(delLastThree)
Out[15]:
0 test
1 test
2 test
Name: colOne, dtype: object
or if you want to change it:
In [16]: arrDF['colOne'] = arrDF['colOne'].map(delLastThree)
In [17]: arrDF
Out[17]:
colOne
0 test
1 test
2 test
but as #Stefan said, this will be much faster, more efficient, and more idiomatic pandas:
arrDF['colOne'] = arrDF['colOne'].str.strip().str[:-3]
or if you want to strip all trailing spaces and numbers:
arrDF['colOne'] = arrDF['colOne'].str.replace(r'[\s\d]+$', '')
test:
In [21]: arrDF['colOne'].str.replace(r'[\s\d]+$', '')
Out[21]:
0 test
1 test
2 test
Name: colOne, dtype: object