Pandas dataframe column value case insensitive replace where <condition> - python

Is there a case insensitive version for pandas.DataFrame.replace? https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.replace.html
I need to replace string values in a column subject to a case-insensitive condition of the form "where label == a or label == b or label == c".

The issue with some of the other answers is that they don't work with all Dataframes, only with Series, or Dataframes that can be implicitly converted to a Series. I understand this is because the .str construct exists in the Series class, but not in the Dataframe class.
To work with Dataframes, you can make your regular expression case insensitive with the (?i) extension. I don't believe this is available in all flavors of RegEx but it works with Pandas.
import pandas as pd
d = {'a':['test', 'Test', 'cat'], 'b':['CAT', 'dog', 'Cat']}
df = pd.DataFrame(data=d)
a b
0 test CAT
1 Test dog
2 cat Cat
Then use replace as you normally would but with the (?i) extension:
df.replace('(?i)cat', 'MONKEY', regex=True)
a b
0 test MONKEY
1 Test dog
2 MONKEY MONKEY
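As a rough sketch of how this could cover the "label == a or label == b" condition from the question: the alternation, the anchors, and the 'Catalog' value below are illustrative assumptions, not part of the original answer. Anchoring the pattern with ^ and $ restricts the replacement to whole-cell matches, while the unanchored pattern also rewrites substrings.

```python
import pandas as pd

# 'Catalog' is a made-up value added to show the effect of anchoring
df = pd.DataFrame({'a': ['test', 'Test', 'cat'], 'b': ['CAT', 'dog', 'Catalog']})

# Unanchored: any occurrence of "cat" is substituted, even inside longer words
partial = df.replace('(?i)cat', 'MONKEY', regex=True)

# Anchored alternation: only cells equal to "cat" or "test" (ignoring case) change
whole = df.replace(r'(?i)^(?:cat|test)$', 'MONKEY', regex=True)

print(partial)
print(whole)
```

Note that (?i) has to stay at the very start of the pattern; recent Python versions reject global flags placed elsewhere.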

I think you need to convert the values to lowercase and then replace by condition with isin:
d = {'a':['test', 'Test', 'cat', 'CAT', 'dog', 'Cat']}
df = pd.DataFrame(data=d)
m = df['a'].str.lower().isin(['cat','test'])
df.loc[m, 'a'] = 'baby'
print (df)
a
0 baby
1 baby
2 baby
3 baby
4 dog
5 baby
Another solution:
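import re
# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal by default
df['b'] = df['a'].str.replace('test', 'baby', flags=re.I, regex=True)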
print (df)
a b
0 test baby
1 Test baby
2 cat cat
3 CAT CAT
4 dog dog
5 Cat Cat
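A possible variation on the str.replace approach, as a sketch only: Series.str.replace also accepts case=False, which avoids importing re. The 'Testing' value and the ^...$ anchors are assumptions added here to show that only whole values are replaced.

```python
import pandas as pd

# 'Testing' is a hypothetical extra value, not in the original data
s = pd.Series(['test', 'Test', 'cat', 'CAT', 'dog', 'Cat', 'Testing'])

# case=False makes the regex match case-insensitively;
# the anchors keep longer strings such as 'Testing' untouched
replaced = s.str.replace(r'^test$', 'baby', case=False, regex=True)
print(replaced.tolist())
```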

Related

String manipulation within a column (pandas): split, replace, join

I'd like to create a new column based on the following conditions:
if the row contains dogs/dog/chien/chiens, then add -00
if the row contains cats/cat/chat/chats, then add 00-
A sample of data is as follows:
Animal
22 dogs
1 dog
1 cat
3 dogs
32 chats
and so on.
I'd like as output a column with only numbers (numerical):
Animal New column
22 dogs 22-00
1 dog 1-00
1 cat 00-1
3 dogs 3-00
32 chats 00-32
I think I should use an if condition to check the words, then .split and .join . It's about string manipulation but I'm having trouble breaking down this problem.
PRES = set(("cats", "cat", "chat", "chats"))
POSTS = set(("dogs", "dog", "chien", "chiens"))
def fun(words):
    # words will come as e.g. "22 dogs"
    num, ani = words.split()
    if ani in PRES:
        return "00-" + num
    elif ani in POSTS:
        return num + "-00"
    else:
        # you might want to handle this..
        return "unexpected"
df["New Column"] = df["Animal"].apply(fun)
where df is your dataframe. For a fast lookup, we turn the condition lists into sets. Then we apply a function to values of the Animal column of df and act accordingly.
You could do this, first extract the number, then use np.where to conditionally add characters to the string:
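import numpy as np
# expand=False returns a Series rather than a one-column DataFrame
df['New Col'] = df['Animal'].str.extract(r'([0-9]*)', expand=False)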
df['New Col'] = np.where(df['Animal'].str.contains('dogs|dog|chiens|chien'), df['New Col']+'-00', df['New Col'])
df['New Col'] = np.where(df['Animal'].str.contains('cats|cat|chat|chats'), '00-'+df['New Col'], df['New Col'])
print(df)
Animal New Col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Since your data is well-formatted, you can use a basic substitution and apply it to the row:
import pandas as pd
import re
def replacer(s):
    return re.sub(r" (chiens?|dogs?)", "-00",
                  re.sub(r"(\d+) ch?ats?", r"00-\1", s))
df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(replacer)
Output:
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Using re:
import re
list1 = ['dogs', 'dog', 'chien', 'chiens']
list2 = ['cats', 'cat', 'chat', 'chats']
df['New_col'] = [(re.search(r'(\w+)', val).group(1).strip()+"-00") if re.search(r'([a-zA-Z]+)', val).group(1).strip() in list1 else ("00-" + re.search(r'(\w+)', val).group(1).strip()) if re.search(r'([a-zA-Z]+)', val).group(1).strip() in list2 else val for val in list(df['Animal'])]
print(df)
Output:
Animal New_col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Create tuples of search words:
dog = ('dogs', 'dog', 'chien', 'chiens')
cat = ('cats', 'cat', 'chat', 'chats')
Create conditions for each tuple created with corresponding replacements and apply the conditions to the column, using numpy select :
num = df.Animal.str.split().str[0] #the numbers
#conditions
cond1 = df.Animal.str.endswith(dog)
cond2 = df.Animal.str.endswith(cat)
condlist = [cond1,cond2]
#what should be returned for each successful condition
choicelist = [num+"-00","00-"+num]
df['New Column'] = np.select(condlist,choicelist)
df
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
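One thing to watch with np.select: rows that match none of the conditions get the default value, which is 0 unless you pass default. A minimal sketch, assuming a made-up "5 birds" row and an arbitrary "unexpected" label for unmatched rows:

```python
import numpy as np
import pandas as pd

# "5 birds" is a hypothetical row that matches neither tuple
df = pd.DataFrame({"Animal": ["22 dogs", "1 cat", "32 chats", "5 birds"]})
dog = ("dogs", "dog", "chien", "chiens")
cat = ("cats", "cat", "chat", "chats")

num = df["Animal"].str.split().str[0]  # the leading number, kept as a string
condlist = [df["Animal"].str.endswith(dog), df["Animal"].str.endswith(cat)]
choicelist = [num + "-00", "00-" + num]

# default fills rows where no condition is True
df["New Column"] = np.select(condlist, choicelist, default="unexpected")
print(df)
```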

Padding spaces to strings in a series with variable length

I am trying to pad "_" on both sides of each string in a dataframe series.
Here is the dataframe.
A
cat
dog
rat
So I used this:
A.str.pad(5, side='both', fillchar="_")
Output
A
_cat_
_dog_
_rat_
But now I have a series with strings of variable length:
A
cat
dog
rat
crocodile
moose
expected output
A
_cat_
_dog_
_rat_
_crocodile_
_moose_
One way is to iterate over the entire dataframe, but I need a pandas way to do it. I am using pandas and Python 3.
Basic pandas operations will give you what you want
'_' + df['A'].astype(str) + '_'
Output:
0 _cat_
1 _dog_
2 _rat_
3 _crocodile_
4 _moose_
Here is my version, using apply and modern (Python >= 3.6) string formatting
import pandas as pd
s = pd.Series(['A', 'cat', 'dog', 'rat', 'crocodile', 'moose'])
print(s)
s = s.apply(lambda x: f'_{x}_')
print(s)
Output:
0 A
1 cat
2 dog
3 rat
4 crocodile
5 moose
dtype: object
0 _A_
1 _cat_
2 _dog_
3 _rat_
4 _crocodile_
5 _moose_
dtype: object
You can also use apply:
s = pd.Series(['A', 'cat', 'dog', 'rat', 'crocodile', 'moose'])
s.apply(lambda x: "_" + x + "_")
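To illustrate why str.pad does not work here, a small sketch using a subset of the question's values: pad fills each string up to a fixed total width and leaves strings that are already that long unchanged, whereas concatenation always adds exactly one character per side.

```python
import pandas as pd

s = pd.Series(['cat', 'crocodile', 'moose'])

# pad() targets a total width of 5, so only 'cat' gets underscores
padded = s.str.pad(5, side='both', fillchar='_')

# concatenation adds one underscore on each side regardless of length
wrapped = '_' + s + '_'

print(padded.tolist())
print(wrapped.tolist())
```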

Filter dataframe rows containing a set of string in python

I have a dataframe df like -
A B
12 A cat
24 The dog
54 An elephant
I have to keep the rows where column B contains any string from a list. I can do that for a single string "cat" as follows:
df[df["B"].str.contains("cat", case=False, na=False)]
This will return me
A B
12 A cat
But now I want to filter it for a list of strings, i.e. ['cat', 'dog',.....], and get:
A B
12 A cat
24 The dog
I can do that using a for loop, but I am looking for a pandas way of doing this. I am using python3 and pandas and have been searching Stack Overflow for a solution for the past 2 days.
Use join with | for regex OR with \b for word boundary:
L = ['cat', 'dog']
pat = r'\b(?:{})\b'.format('|'.join(L))
df[df["B"].str.contains(pat, case=False, na=False)]
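A self-contained sketch of the word-boundary approach, with two assumptions added for illustration: the "A catalog" row is made up to show what \b prevents, and the words are passed through re.escape in case they contain regex metacharacters. The non-capturing group (?:...) makes \b apply to every alternative, not just the first and last.

```python
import re
import pandas as pd

# The last row is hypothetical; it contains "cat" only as part of a longer word
df = pd.DataFrame({"A": [12, 24, 54, 70],
                   "B": ["A cat", "The dog", "An elephant", "A catalog"]})
words = ["cat", "dog"]

# \b(?:cat|dog)\b : whole-word match for any of the listed words
pat = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
result = df[df["B"].str.contains(pat, case=False, na=False)]
print(result)
```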

Filter all rows that do not contain letters (alpha) in pandas

I am trying to filter a pandas dataframe using regular expressions.
I want to delete those rows that do not contain any letters. For example:
Col A.
50000
$927848
dog
cat 583
rabbit 444
My desired result is:
Col A.
dog
cat 583
rabbit 444
I have been trying, unsuccessfully, to solve this problem with regex and the pandas filter options. See below. I am specifically running into problems when I try to merge two conditions for the filter. How can I achieve this?
Option 1:
df['Col A.'] = ~df['Col A.'].filter(regex='\d+')
Option 2
df['Col A.'] = df['Col A.'].filter(regex=\w+)
Option 3
from string import digits, letters
df['Col A.'] = (df['Col A.'].filter(regex='|'.join(letters)))
OR
df['Col A.'] = ~(df['Col A.'].filter(regex='|'.join(digits)))
OR
df['Col A.'] = df[~(df['Col A.'].filter(regex='|'.join(digits))) & (df['Col A.'].filter(regex='|'.join(letters)))]
I think you'd need str.contains to filter values which contain letters by the means of boolean indexing:
df = df[df['Col A.'].str.contains('[A-Za-z]')]
print (df)
Col A.
2 dog
3 cat 583
4 rabbit 444
If there are NaN values, you can pass the na parameter:
df = df[df['Col A.'].str.contains('[A-Za-z]', na=False)]
print (df)
Col A.
3 dog
4 cat 583
5 rabbit 444
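Have you tried str.contains with one of these patterns? (Series.filter(regex=...) matches index labels rather than values, so it does not help here.)
df[df['Col A.'].str.contains(r'\D', na=False)] # Keeps only if there's a non-digit character (this also keeps '$927848')
or:
df[df['Col A.'].str.contains(r'[A-Za-z]', na=False)] # Keeps only if there's a letter (alpha)
or:
df[df['Col A.'].str.contains(r'[^\W\d_]', na=False)] # More info in the link below...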
Explanation: https://stackoverflow.com/a/2039476/8933502
df['Col A.'].str.contains(r'^\d+$', na=True) # True for strings made only of digits; int/float values give NaN, which na=True converts to True
eg: [50000, '$927848', 'dog', 'cat 583', 'rabbit 444', '3 e 3', 'e 3', '33', '3 e']
will give :
[True,False,False,False,False,False,False, True,False]
You can use ^.*[a-zA-Z].*$
https://regex101.com/r/b84ji1/1
Details
^: Start of the line
.*: Match any character, zero or more times
[a-zA-Z]: Match a single letter
$: End of the line
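For what it's worth, here is a small sketch of how that pattern relates to the shorter [a-zA-Z] used in other answers. Since str.contains searches anywhere in the string, the surrounding ^.* and .*$ are only needed with a method that must match the whole string, such as str.fullmatch. The data is taken from the question.

```python
import pandas as pd

s = pd.Series(["50000", "$927848", "dog", "cat 583", "rabbit 444"])

# contains() searches anywhere in the string, so the bare character class is enough
by_search = s.str.contains(r"[a-zA-Z]")

# fullmatch() must consume the whole string, hence the .* on both sides
by_fullmatch = s.str.fullmatch(r".*[a-zA-Z].*")

kept = s[by_search]
print(kept.tolist())
```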

How to test if a string contains one of the substrings in a list, in pandas?

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at'], I would want to get everything but 'pet'.
I have a solution, but it's rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
Is there a better way to do this?
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
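A short sketch of what the escaping buys you. The Series contents ('honey', 'xy') are made up for illustration; the matches list is the one from above.

```python
import re
import pandas as pd

s = pd.Series(['$money', 'x^y', 'honey', 'xy'])
matches = ['$money', 'x^y']

# Unescaped: $ and ^ act as anchors, so neither alternative can ever match
raw_mask = s.str.contains('|'.join(matches))

# Escaped: every character is matched literally
safe_pattern = '|'.join(re.escape(m) for m in matches)
safe_mask = s.str.contains(safe_pattern)

print(s[safe_mask].tolist())
```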
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
Here is a one line lambda that also works:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Input:
searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
col1 col2
0 cat 1000.0
1 hat 2000000.0
2 dog 1000.0
3 fog 330000.0
4 pet 330000.0
Apply Lambda:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Output:
col1 col2 TrueFalse
0 cat 1000.0 1
1 hat 2000000.0 1
2 dog 1000.0 1
3 fog 330000.0 1
4 pet 330000.0 0
I had the same issue. Without making it too complex, you can put | between the entries; for example, fieldname.str.contains("cat|dog") works.
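If you would rather avoid regular expressions altogether, one option (a sketch, not taken from any of the answers above) is to build one literal mask per substring with regex=False and combine them, which is roughly a tidied version of the approach in the question:

```python
import pandas as pd

s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# One boolean Series per substring; regex=False treats each one literally
masks = [s.str.contains(sub, regex=False) for sub in searchfor]

# Put the masks side by side and keep rows where any of them is True
mask = pd.concat(masks, axis=1).any(axis=1)
result = s[mask]
print(result.tolist())
```

Because nothing is interpreted as a pattern, substrings such as '$money' need no escaping here.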
