Apply custom function to entire dataframe - python

I have a function that calls another one.
The goal of get_substr is to extract a substring based on the position of the nth occurrence of a character:
def find_nth(string, char, n):
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start+len(char))
        n -= 1
    return start

def get_substr(string,char,n):
    if n == 1:
        return string[0:find_nth(string,char,n)]
    else:
        return string[find_nth(string,char,n-1)+len(char):find_nth(string,char,n)]
The function works.
Now I want to apply it on a dataframe by doing this.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'],'-',1))
I get an error:
KeyError: 'EQ'
I don't understand it as df_g['EQ'] exists.
Can you help me?
Thanks

You forgot axis=1. Without it, the function is applied to each column rather than each row. Consider a simple example:
import pandas as pd
df = pd.DataFrame({'A':[1,2],'B':[3,4]})
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
print(df)
Output:
   A  B    Z
0  1  3  100
1  2  4  200
As a side note, if you are working with values from a single column, you can use pandas.Series.apply rather than pandas.DataFrame.apply. In the example above that would mean
df['Z'] = df['A'].apply(lambda x:x*100)
in place of
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
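Applied to the question's case, the same idea might look like the sketch below. The contents of df_g are made up for illustration; only the column name EQ and the two helper functions come from the question.

```python
import pandas as pd

def find_nth(string, char, n):
    """Return the index of the n-th occurrence of char in string, or -1."""
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    """Return the text between the (n-1)-th and the n-th occurrence of char."""
    if n == 1:
        return string[0:find_nth(string, char, n)]
    return string[find_nth(string, char, n - 1) + len(char):find_nth(string, char, n)]

# Hypothetical data: the real df_g is not shown in the question.
df_g = pd.DataFrame({"EQ": ["AB-12-X", "CD-34-Y"]})

# Single column: Series.apply passes each cell value to the function.
df_g["F"] = df_g["EQ"].apply(lambda s: get_substr(s, "-", 1))

# Whole rows: DataFrame.apply needs axis=1 so that x is a row, not a column.
df_g["G"] = df_g.apply(lambda x: get_substr(x["EQ"], "-", 2), axis=1)

print(df_g)
```

Without axis=1 the lambda receives each column as a Series indexed by the row labels, which is why looking up 'EQ' in it raises KeyError.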

Related

Ignoring an invalid filter among multiple filters on a DataFrame

Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional: if the user enters an invalid value for a certain condition, that condition is skipped completely and the DataFrame is returned as if the condition had not been given.
While I can implement this quite easily with multiple if-conditions that modify the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One':['a','a','a','b'],
'Two':['x','y','y','y'],
'Three':['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are values that don't occur in the respective column. So for column 'One', every value other than 'a' and 'b' is invalid. If the user inputs 'a', I should be able to filter the DataFrame with df[df['One']=='a']; if the user inputs an invalid value, no such filter should be applied and the original DataFrame df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One']==inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two']==inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three']==inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With few invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T:
df.loc[(df.eq(inp)|~df.eq(inp).any(0)).all(1)]
Here is one way: create a Boolean DataFrame that compares each value of inp with the corresponding column. Then use any down each column to find the columns with at least one True, and all across each row, restricted to those columns, to select the rows.
def valid_filtering(df, inp):
    # check where the inp values match the values in df
    # (inp is repeated once per row so the shapes agree)
    m = (df==pd.DataFrame(data=[inp]*len(df), index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True across the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]
Note that if you have fewer filters than columns in your original df, you can use a dictionary instead. In the example below, column Three is ignored because its mask is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
The key is that if a column's mask is all False (~b.any(), i.e. an invalid filter), it is replaced with True so that all values of that column are accepted (this assumes numpy is imported as np):
mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with few invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with no valid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no results)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
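For reference, the mask logic above can be wrapped into a small self-contained function. This is only a sketch: it assumes inp holds exactly one value per column, in column order, and the names matches, unused and keep are illustrative.

```python
import pandas as pd

def valid_filtering(df, inp):
    """Filter df by inp (one value per column), skipping values absent from their column."""
    matches = df.eq(inp)                    # True where a cell equals its column's input
    unused = ~matches.any(axis=0)           # columns whose input matched no row at all
    keep = (matches | unused).all(axis=1)   # unused columns count as True for every row
    return df.loc[keep]

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

print(valid_filtering(df, ['a', 'y', 'm']))
print(valid_filtering(df, ['a', 'NA', 'NA']))
```

Unlike the sequential if-based version in the question, validity here is checked against the full DataFrame, so a valid combination that matches no row returns an empty result instead of silently dropping a filter.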
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns) if len(df[df[column] == inp[i]].values) > 0]
    for s in series:
        df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
Related: applying lambda row on multiple columns pandas

Replacing all entries in one column by according elements in the rows using a function

I have a pandas dataframe which looks like this:
A B
x 5.9027.5276
y 656.344872.0
z 78.954.23
What I want is to replace the string entries in column B with floats built from the first four digits of each entry, with the decimal point placed after the first digit.
Therefore, I wrote the following code:
for entry in df['B']:
    entry = re.search(r'((\d\.?){1,4})', entry).group().replace(".","")
    df['B'] = entry[:1] + '.' + entry[1:]
df['B'] = df['B'].astype(float)
It almost does what I want but it replaces all the entries in B with the float value of the first row. Instead, I would like to replace the entries with the according float value of each row.
How could I do this?
Thanks a lot!
You can use the relevant pandas string functions:
df['B'] = df['B'].str.extract(r'((\d\.?){1,4})')[0].str.replace('.', '', regex=False)
df['B'] = df['B'].str[:1] + '.' + df['B'].str[1:]
df['B'] = df['B'].astype(float)
print(df)
A B
0 x 5.902
1 y 6.563
2 z 7.895
You can wrap your operation in a function and then use .apply, i.e.:
import re
import pandas as pd
df = pd.DataFrame({'A':['x','y','z'],'B':['5.9027.5276','656.344872.0','78.954.23']})
def func(entry):
    entry = re.search(r'((\d\.?){1,4})', entry).group().replace(".","")
    return entry[:1] + '.' + entry[1:]
df['B'] = df['B'].apply(func)
df['B'] = df['B'].astype(float)
print(df)
output:
A B
0 x 5.902
1 y 6.563
2 z 7.895

CASE statement in Python based on Regex

So I have a data frame like this:
FileName
01011RT0TU7
11041NT4TU8
51391RST0U2
01011645RT0TU9
11311455TX0TU8
51041545ST3TU9
What I want is another column in the DataFrame like this:
FileName |RdwyId
01011RT0TU7 |01011000
11041NT4TU8 |11041000
51391RST0U2 |51391000
01011645RT0TU9|01011645
11311455TX0TU8|11311455
51041545ST3TU9|51041545
Essentially, if the first 5 characters are digits then concat with "000", if the first 8 characters are digits then simply move them to the RdwyId column
I am noob so I have been playing with this:
Test 1:
rdwyre1=re.compile(r'\d\d\d\d\d')
rdwyre2=re.compile(r'\d\d\d\d\d\d\d\d')
rdwy1=rdwyre1.findall(str(thous["FileName"]))
rdwy2=rdwyre2.findall(str(thous["FileName"]))
thous["RdwyId"]=re.sub(r'\d\d\d\d\d', str(thous["FileName"].loc[:4])+"000",thous["FileName"])
Test 2:
thous["RdwyId"]=np.select(
[
re.search(r'\d\d\d\d\d',thous["FileName"])!="None",
rdwyre2.findall(str(thous["FileName"]))!="None"
],
[
rdwyre1.findall(str(thous["FileName"]))+"000",
rdwyre2.findall(str(thous["FileName"])),
],
default="Unknown"
)
Test 3:
thous=thous.assign(RdwyID=lambda x: str(rdwyre1.search(x).group())+"000" if bool(rdwyre1.search(x))==True else str(rdwyre2.search(x).group()))
None of the above have worked. Could anyone help me figure out where I am going wrong? and how to fix it?
You can use numpy select, which replicates CASE WHEN for multiple conditions, and Pandas' str.isnumeric method:
cond1 = df.FileName.str[:8].str.isnumeric() # first condition
choice1 = df.FileName.str[:8] # result if first condition is met
cond2 = df.FileName.str[:5].str.isnumeric() # second condition
choice2 = df.FileName.str[:5] + "000" # result if second condition is met
condlist = [cond1, cond2]
choicelist = [choice1, choice2]
df.loc[:, "RdwyId"] = np.select(condlist, choicelist)
df
FileName RdwyId
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
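A self-contained variant of the np.select approach is sketched below. It swaps in str.isdigit for str.isnumeric (isnumeric also accepts characters such as fractions), adds an explicit default, and uses a shortened sample with one made-up non-matching name to show the fallback.

```python
import numpy as np
import pandas as pd

# Sample file names; "AB123" is a made-up value that matches neither rule.
df = pd.DataFrame({"FileName": ["01011RT0TU7", "11041NT4TU8",
                                "51041545ST3TU9", "AB123"]})

first8 = df["FileName"].str[:8]
first5 = df["FileName"].str[:5]

# Conditions are checked in order, so the more specific rule goes first.
conditions = [first8.str.isdigit(), first5.str.isdigit()]
choices = [first8, first5 + "000"]

df["RdwyId"] = np.select(conditions, choices, default="Unknown")
print(df)
```

The order of the conditions matters: a name that starts with eight digits also starts with five, so the eight-digit test has to come first.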
def filt(list1):
    for i in list1:
        if i[:8].isdigit():
            print(i[:8])
        else:
            print(i[:5]+"000")
# output
01011000
11041000
51391000
01011645
11311455
51041545
If your case is this specific, you can tweak the function so that it returns the value and apply it to your DataFrame:
def filt(i):
    if i[:8].isdigit():
        return i[:8]
    else:
        return i[:5]+"000"
d = pd.DataFrame({"names": list_1})
d["filtered"] = d.names.apply(lambda x: filt(x)) #.apply(filt) also works im used to lambdas
#output
names filtered
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
Using regex:
c1 = re.compile(r'\d{5}')
c2 = re.compile(r'\d{8}')
rdwyId = []
for f in thous['FileName']:
    m = re.match(c2, f)
    if m:
        rdwyId.append(m[0])
        continue
    m = re.match(c1, f)
    if m:
        rdwyId.append(m[0] + "000")
thous['RdwyId'] = rdwyId
Edit: replaced re.search with re.match as it's more efficient, since we are only looking for matches at the beginning of the string.
Let us try findall with ljust:
df['new'] = df.FileName.str.findall(r"(\d+)[A-Za-z]").str[0].str.ljust(8,'0')
Out[226]:
0 01011000
1 11041000
2 51391000
3 01011645
4 11311455
5 51041545
Name: FileName, dtype: object

How to apply my function to the first row of a dataframe?

def calcScore(p):
    if p[0] > p[1]:
        x = 3
        y = 0
    elif p[0] == p[1]:
        x = 1
        y = 1
    else:
        x = 0
        y = 3
    return x,y
How would I apply this function to the first row of my dataframe?
I know how to apply it to the whole dataframe but can't seem to apply it to the first row only? Below is what I did with the whole dataframe. I am new to python so please forgive silly or stupid mistakes. Thank you. :)
result =(prem[['FTHG','FTAG']].apply(calcScore, axis = 1))
print(result)
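apply is for applying a function to all rows or all columns. If you just want one row, select it with iloc and call the function directly (tolist() turns the row into a plain list, so p[0] and p[1] are simple positional lookups):
result = calcScore(prem[['FTHG', 'FTAG']].iloc[0].tolist())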
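As a runnable illustration, here is a minimal sketch with made-up match results. Only the column names FTHG and FTAG and the scoring rules come from the question; calcScore is condensed slightly but behaves the same.

```python
import pandas as pd

def calcScore(p):
    """Return (home points, away points) for a pair of goal counts."""
    if p[0] > p[1]:
        return 3, 0
    if p[0] == p[1]:
        return 1, 1
    return 0, 3

# Hypothetical results: full-time home goals and away goals.
prem = pd.DataFrame({"FTHG": [2, 1, 0], "FTAG": [1, 1, 3]})

# Select the first row by position and pass its values as a plain list.
first_row = prem[["FTHG", "FTAG"]].iloc[0].tolist()
result = calcScore(first_row)
print(result)  # (3, 0)
```

Converting the row to a list avoids relying on integer keys being treated as positions on a Series whose index holds the column names.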

Pandas Apply Syntax

I can't figure out how to apply a simple function to every row of a column in a pandas DataFrame.
Example:
def delLastThree(x):
    x = x.strip()
    x = x[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pandas.DataFrame(arr)
arrDF.columns = ['colOne']
arrDF['colOne'].apply(delLastThree)
print(arrDF)
I would expect the code above to return 'test' for every row. Instead it prints the original values.
How do I apply the delLastThree function to every row in the DF?
Selecting with single brackets, as in df['colOne'], gives you a pd.Series.
You can use .apply(func, axis=1) on a DataFrame, i.e. either after selecting with [['colOne']] or without selecting any columns. With axis=1, however, each row is passed to the function as a pd.Series, so the function has to use the .str accessor for string methods.
With the pd.Series you get from selecting with ['colOne'], you can use either .apply() or .map().
def delLastThree_series(x):
    x = x.strip()
    x = x[:-3]
    return x

def delLastThree_df(x):
    x = x.str.strip()
    x = x.str[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pd.DataFrame(arr)
arrDF.columns = ['colOne']
Now use either
arrDF.apply(delLastThree_df, axis=1)
arrDF[['colOne']].apply(delLastThree_df, axis=1)
or
arrDF['colOne'].apply(delLastThree_series)
arrDF['colOne'].map(delLastThree_series)
to get:
colOne
0 test
1 test
2 test
You could of course also just:
arrDF['colOne'].str.strip().str[:-3]
use map() function for series (single column):
In [15]: arrDF['colOne'].map(delLastThree)
Out[15]:
0 test
1 test
2 test
Name: colOne, dtype: object
or if you want to change it:
In [16]: arrDF['colOne'] = arrDF['colOne'].map(delLastThree)
In [17]: arrDF
Out[17]:
colOne
0 test
1 test
2 test
but as @Stefan said, this will be much faster, more efficient and more "Pandonic":
arrDF['colOne'] = arrDF['colOne'].str.strip().str[:-3]
or if you want to strip all trailing spaces and numbers:
arrDF['colOne'] = arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)
test:
In [21]: arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)
Out[21]:
0 test
1 test
2 test
Name: colOne, dtype: object
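The point that trips up the original code is that apply, map and the .str methods all return a new Series and leave the DataFrame untouched. A short sketch, using sample values with one made-up trailing space:

```python
import pandas as pd

def delLastThree(x):
    """Strip surrounding whitespace, then drop the last three characters."""
    return x.strip()[:-3]

arrDF = pd.DataFrame({"colOne": ["test123", "test234 ", "test453"]})

# The result is computed but discarded, so arrDF is unchanged.
arrDF["colOne"].apply(delLastThree)
unchanged = arrDF["colOne"].tolist()

# Assigning the result back is what actually updates the column.
arrDF["colOne"] = arrDF["colOne"].apply(delLastThree)

# Regex variant: regex=True is passed explicitly because recent pandas
# versions treat the pattern as a literal string by default.
cleaned = pd.Series(["test123 ", "test9"]).str.replace(r"[\s\d]+$", "", regex=True)

print(arrDF)
print(cleaned)
```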
