I am mapping specific keywords to text data using applymap in Python. Say I want to count how often the keyword "hello" appears in each row of the text data. applymap gives me the desired matrix layout, but each cell holds only True or False instead of the number of occurrences.
I tried to connect count() with my applymap function, but I could not make it work.
The minimal working example is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'text': ['hello hello', 'yes no hello', 'good morning']})
keys = ['hello']
keyword = pd.DataFrame({0:keys})
res = []
for a in df['text']:
    res.append(keyword.applymap(lambda x: x in a))
map = pd.concat(res, axis=1).T
map.index = np.arange(len(map))
#Output
map
0
0 True
1 True
2 False
#Desired Output with 'hello' appearing twice in the first row, once in the second and zero in the third of df.
0
0 2
1 1
2 0
I am looking for a way to keep my applymap function to obtain the matrix form, but replace the True (1) and False (0) with the number of appearances, such as the desired output shows above.
Instead of testing whether the keyword occurs in the text:
res.append(keyword.applymap(lambda x: x in a)) # True if x is a substring of a
count its occurrences:
res.append(keyword.applymap(lambda x: a.count(x))) # number of non-overlapping occurrences of x in a
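As a rough sketch of the same idea without the explicit loop, the vectorized Series.str.count can build the whole count matrix in one step. The second keyword 'no' and the name counts are additions for illustration only. re.escape is used because str.count treats its argument as a regular expression.

```python
import re

import pandas as pd

df = pd.DataFrame({'text': ['hello hello', 'yes no hello', 'good morning']})
keys = ['hello', 'no']  # 'no' is a hypothetical extra keyword

# One column per keyword, one row per text; each cell is an occurrence count.
counts = pd.DataFrame({k: df['text'].str.count(re.escape(k)) for k in keys})
print(counts)
```

Rows line up with df['text'], so the result has the same matrix shape as the applymap version, with integers in place of booleans.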
Related
I have a function which call another one.
The objective is to call get_substr to extract a substring based on the position of the nth occurrence of a character.
def find_nth(string, char, n):
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start+len(char))
        n -= 1
    return start
def get_substr(string,char,n):
    if n == 1:
        return string[0:find_nth(string,char,n)]
    else:
        return string[find_nth(string,char,n-1)+len(char):find_nth(string,char,n)]
The function works.
Now I want to apply it on a dataframe by doing this.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'],'-',1))
I get an error:
KeyError: 'EQ'
I don't understand it, since df_g['EQ'] exists.
Can you help me?
Thanks
You forgot axis=1. Without it, the function is applied to each column rather than to each row. Consider a simple example:
import pandas as pd
df = pd.DataFrame({'A':[1,2],'B':[3,4]})
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
print(df)
output
A B Z
0 1 3 100
1 2 4 200
As a side note, if you are working with values from a single column, you can use pandas.Series.apply rather than pandas.DataFrame.apply. In the example above that would mean
df['Z'] = df['A'].apply(lambda x:x*100)
in place of
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
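Applied to the original question, both fixes could look like the sketch below. The contents of df_g are made up for illustration (the question does not show them), and the columns F_rowwise and F_series are hypothetical names used only to compare the two variants.

```python
import pandas as pd

def find_nth(string, char, n):
    # Index of the nth occurrence of char in string, or -1 if there is none.
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    # Text between the (n-1)th and nth occurrence of char.
    if n == 1:
        return string[0:find_nth(string, char, n)]
    return string[find_nth(string, char, n - 1) + len(char):find_nth(string, char, n)]

# Hypothetical sample data; only the column name 'EQ' comes from the question.
df_g = pd.DataFrame({'EQ': ['AB-CD-EF', 'X-Y']})

# Variant 1: row-wise DataFrame.apply, which needs axis=1.
df_g['F_rowwise'] = df_g.apply(lambda row: get_substr(row['EQ'], '-', 1), axis=1)

# Variant 2: Series.apply on the single column, no axis needed.
df_g['F_series'] = df_g['EQ'].apply(lambda s: get_substr(s, '-', 1))

print(df_g)
```

Both variants give the same column here. The Series version is usually the simpler choice when only one column is involved.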
Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional: if the user enters an invalid value for a condition, that condition is skipped entirely and the DataFrame is returned without that filter applied.
While I can implement this quite easily with multiple if-conditions that modify the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One':['a','a','a','b'],
'Two':['x','y','y','y'],
'Three':['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are the values that don't belong to the respective column. So, for column 'One', all values other than 'a' and 'b' are invalid. If the user inputs 'a', then I should be able to filter the DataFrame with df[df['One']=='a']. However, if the user inputs any invalid value, no such filter should be applied, and the original dataframe df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One']==inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two']==inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three']==inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With few invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T
df.loc[(df.eq(inp)|~df.eq(inp).any(0)).all(1)]
Here is one way: create a Boolean DataFrame by comparing each column with its value in inp. Then use any down each column to find the columns with at least one True, and all across the selected columns to keep the rows that match every one of them.
def valid_filtering(df, inp):
    # check where the inp values are the same as in df
    m = (df==pd.DataFrame(data=[inp] , index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True among the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]
Note that if you have fewer filters than columns in your original df, you can use a dictionary instead. In the example below, column Three is ignored because its comparison is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
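For comparison, here is a minimal sketch of dictionary-driven filtering written as a plain loop over the filters. It is not the approach from this answer, just an alternative that makes the "skip invalid filters" rule explicit. The function name filter_by_dict is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

def filter_by_dict(df, filters):
    # Start with every row selected.
    mask = pd.Series(True, index=df.index)
    for col, val in filters.items():
        hits = df[col].eq(val)
        # Apply the filter only if the value occurs in the column at all.
        if hits.any():
            mask &= hits
    return df[mask]

print(filter_by_dict(df, {'One': 'a', 'Two': 'y'}))
print(filter_by_dict(df, {'One': 'a', 'Two': 'NA'}))
```

Columns that are absent from the dictionary are simply never tested, so nothing needs to be reindexed.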
The key idea: if a column's comparison is all False (~b.any(), an invalid filter), replace it with True so that every value in that column is accepted:
mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with few invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with no valid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no results)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
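The cases above can be reproduced with a small wrapper. This sketch uses a boolean OR in place of np.where to express the same rule (an all-False column counts as all-True). The function name filter_valid is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

def filter_valid(df, inp):
    # Compare each column with its input value (inp must have one entry per column).
    eq = df.eq(inp)
    # A column with no match at all is an invalid filter: accept all of its rows.
    mask = eq | ~eq.any(axis=0)
    # Keep the rows that pass every column.
    return df.loc[mask.all(axis=1)]

for inp in (['a', 'y', 'm'], ['a', 'NA', 'NA'], ['NA', 'NA', 'NA'], ['b', 'x', 'm']):
    print(inp)
    print(filter_valid(df, inp))
```

Because the function takes df and inp as arguments, adding a column only requires a longer inp list.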
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns) if len(df[df[column] == inp[i]].values) > 0]
    for s in series: df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
Related: applying lambda row on multiple columns pandas
I have a dataframe like below
df
A          B          C
0          1          TRANSIT_1
TRANSIT_3
0          TRANSIT_5
And I want to change it to below:
Resulting DF
A          B          C          D
0          1          TRANSIT_1  TRANSIT_1
TRANSIT_3                        TRANSIT_3
0          TRANSIT_5             TRANSIT_5
So I tried to use str.contains, and once I had the Series of True/False values, I planned to pass it to eval to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought of using the eval function to write it to a separate column, like
df = eval(df[result]=the index) # I dont know, But eval function does evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a,b)
Output
A B C ans
0 0 1 TRANSIT_1 TRANSIT_1
1 TRANSIT_3 None None TRANSIT_3
2 0 TRANSIT_5 None TRANSIT_5
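DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the snippet above may not run on a current install. A minimal alternative sketch, assuming each row holds at most one value of interest, is a row-wise apply. The helper name first_transit is hypothetical, and the frame is rebuilt from the output shown above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [0, 'TRANSIT_3', 0],
    'B': [1, None, 'TRANSIT_5'],
    'C': ['TRANSIT_1', None, None],
})

def first_transit(row):
    # Return the first string in the row that starts with 'TRANSIT_', else None.
    for value in row:
        if isinstance(value, str) and value.startswith('TRANSIT_'):
            return value
    return None

df1['ans'] = df1.apply(first_transit, axis=1)
print(df1)
```

If a row could contain several matching values, this keeps only the leftmost one, which is an assumption about the desired behavior.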
I can't figure out how to apply a simple function to every row of a column in a pandas DataFrame.
Example:
def delLastThree(x):
    x = x.strip()
    x = x[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pandas.DataFrame(arr)
arrDF.columns = ['colOne']
arrDF['colOne'].apply(delLastThree)
print arrDF
I would expect the code above to return 'test' for every row. Instead it prints the original values.
How do I apply the delLastThree function to every row in the DF?
Selecting with single brackets, df['colOne'], gives you a pd.Series.
You can call .apply(func, axis=1) on a DataFrame, i.e. either after selecting with [['colOne']] or without selecting any columns. With axis=1, however, each row is passed to the function as a pd.Series, so the function has to go through the .str accessor for string methods.
On the pd.Series you get from ['colOne'], you can use plain .apply() or .map(), and the function receives each string directly.
def delLastThree_series(x):
    x = x.strip()
    x = x[:-3]
    return x
def delLastThree_df(x):
    x = x.str.strip()
    x = x.str[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pd.DataFrame(arr)
arrDF.columns = ['colOne']
Now use either
arrDF.apply(delLastThree_df, axis=1)
arrDF[['colOne']].apply(delLastThree_df, axis=1)
or
arrDF['colOne'].apply(delLastThree_series)
arrDF['colOne'].map(delLastThree_series)
to get:
colOne
0 test
1 test
2 test
You could of course also just:
arrDF['colOne'].str.strip().str[:-3]
Use the map() function for a Series (single column):
In [15]: arrDF['colOne'].map(delLastThree)
Out[15]:
0 test
1 test
2 test
Name: colOne, dtype: object
or, if you want to store the result back in the column:
In [16]: arrDF['colOne'] = arrDF['colOne'].map(delLastThree)
In [17]: arrDF
Out[17]:
colOne
0 test
1 test
2 test
but as @Stefan said, this will be much faster and more "Pandonic":
arrDF['colOne'] = arrDF['colOne'].str.strip().str[:-3]
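or if you want to strip all trailing spaces and numbers (regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
arrDF['colOne'] = arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)
test:
In [21]: arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)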
Out[21]:
0 test
1 test
2 test
Name: colOne, dtype: object
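The underlying reason the original code printed unchanged values is that apply and map return a new Series and leave the DataFrame alone. A small sketch of that behavior follows. The trailing space in the second value and the names before and discarded are additions for illustration.

```python
import pandas as pd

def delLastThree(x):
    return x.strip()[:-3]

arrDF = pd.DataFrame({'colOne': ['test123', 'test234 ', 'test453']})
before = arrDF['colOne'].tolist()

# The result is computed but not stored, so arrDF is untouched.
discarded = arrDF['colOne'].map(delLastThree)
unchanged = arrDF['colOne'].tolist() == before

# Assigning back is what actually updates the column.
arrDF['colOne'] = arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)
print(unchanged)
print(arrDF)
```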
I have a dataframe with a column containing comma-separated strings. What I want to do is split them on the comma, count the parts, and append the count to a new data frame. If the column contains a list with only one element, I want to differentiate whether it is a string or an integer. If it is an integer, I want to append the value 0 in that row to the new df.
My code looks as follows:
def decide(dataframe):
    df=pd.DataFrame()
    for liste in DataFrameX['Column']:
        x=liste.split(',')
        if len(x) > 1:
            df.append(pd.Series([len(x)]), ignore_index=True)
        else:
            #check if element in list is int
            for i in x:
                try:
                    int(i)
                    print i
                    x = []
                    df.append(pd.Series([int(len(x))]), ignore_index=True)
                except:
                    print i
                    x = [1]
                    df.append(pd.Series([len(x)]), ignore_index=True)
    return df
The input data looks like this:
C1
0 a,b,c
1 0
2 a
3 ab,x,j
If I now run the function with my original dataframe as input, it returns an empty dataframe. Through the print statement in the try/except statements I could see that everything works. The problem is appending the resulting values to the new dataframe. What do I have to change in my code? If possible, please do not give an entire different solution, but tell me what I am doing wrong in my code so I can learn.
******************UPDATE************************************
I edited the code so that it can be called as lambda function. It looks like this now:
def decide(x):
    for liste in DataFrameX['Column']:
        x=liste.split(',')
        if len(x) > 1:
            x = len(x)
            print x
        else:
            #check if element in list is int
            for i in x:
                try:
                    int(i)
                    x = []
                    x = len(x)
                    print x
                except:
                    x = [1]
                    x = len(x)
                    print x
And I call it like this:
df['Count']=df['C1'].apply(lambda x: decide(x))
It prints the right values, but the new column only contains None.
Any ideas why?
This is a good start. It could be simplified, but I think it works as expected.
#I have a dataframe with a column containing comma separated strings.
df = pd.DataFrame({'data': ['apple, peach', 'banana, peach, peach, cherry','peach','0']})
# What I want to do is separate them by comma, count them and append the counted number to a new data frame.
df['data'] = df['data'].str.split(',')
df['count'] = df['data'].apply(lambda row: len(row))
# If the column contains a list with only one element
df['first'] = df['data'].apply(lambda row: row[0])
# I want to differentiate whether it is a string or an integer
df['first'] = pd.to_numeric(df['first'], errors='coerce')
# if the element in x is an integer, len(x) should be set to zero
df.loc[pd.notnull(df['first']), 'count'] = 0
# Dropping temp column
df.drop(columns='first', inplace=True)
df
data count
0 [apple, peach] 2
1 [banana, peach, peach, cherry] 4
2 [peach] 1
3 [0] 0
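To address the "what am I doing wrong" part directly: the updated decide prints its results but never returns them, and a Python function without a return statement yields None, which is what apply then stores. In the first version, DataFrame.append returned a new frame rather than modifying df in place (and the method was removed in pandas 2.0), so df stayed empty. A minimal corrected sketch, keeping the original logic and working on one cell at a time, could look like this. The parameter name cell is an assumption.

```python
import pandas as pd

def decide(cell):
    # apply() passes one cell at a time, so no loop over the column is needed.
    parts = cell.split(',')
    if len(parts) > 1:
        return len(parts)
    # Single element: count it as 0 if it is an integer, otherwise as 1.
    try:
        int(parts[0])
        return 0
    except ValueError:
        return 1

df = pd.DataFrame({'C1': ['a,b,c', '0', 'a', 'ab,x,j']})
df['Count'] = df['C1'].apply(decide)
print(df)
```

Catching ValueError instead of using a bare except keeps unrelated errors visible.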