I am trying to pad "_" on both sides of the strings in a dataframe series.
Here is the dataframe.
A
cat
dog
rat
So I used this:
A.str.pad(5, side='both', fillchar="_")
Output
A
_cat_
_dog_
_rat_
But now I have a series with strings of variable length, so padding to a fixed width no longer works:
A
cat
dog
rat
crocodile
moose
Expected output:
A
_cat_
_dog_
_rat_
_crocodile_
_moose_
One way I could do this is to iterate through the entire dataframe, but I am looking for a pandas way to do it. I am using pandas and Python 3.
Basic pandas operations will give you what you want
'_' + df['A'].astype(str) + '_'
Output:
0 _cat_
1 _dog_
2 _rat_
3 _crocodile_
4 _moose_
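For completeness, a minimal self-contained sketch using the sample values from the question (plain string concatenation is vectorized and works for any string length):
import pandas as pd

df = pd.DataFrame({'A': ['cat', 'dog', 'rat', 'crocodile', 'moose']})
# pad every value with underscores regardless of its length
df['A'] = '_' + df['A'].astype(str) + '_'
print(df)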
Here is my version, using apply and modern (Python >= 3.6) f-string formatting:
import pandas as pd
s = pd.Series(['A', 'cat', 'dog', 'rat', 'crocodile', 'moose'])
print(s)
s = s.apply(lambda x: f'_{x}_')
print(s)
Output:
0 A
1 cat
2 dog
3 rat
4 crocodile
5 moose
dtype: object
0 _A_
1 _cat_
2 _dog_
3 _rat_
4 _crocodile_
5 _moose_
dtype: object
You can also use apply:
s = pd.Series(['A', 'cat', 'dog', 'rat', 'crocodile', 'moose'])
s.apply(lambda x: "_" + x + "_")
Related
I'd like to create a new column based on the following conditions:
if the row contains dogs/dog/chien/chiens, then add -00
if the row contains cats/cat/chat/chats, then add 00-
A sample of data is as follows:
Animal
22 dogs
1 dog
1 cat
3 dogs
32 chats
and so on.
As output, I'd like a column with only the numbers (numerical):
Animal New column
22 dogs 22-00
1 dog 1-00
1 cat 00-1
3 dogs 3-00
32 chats 00-32
I think I should use an if condition to check the words, then .split and .join. It's a string-manipulation problem, but I'm having trouble breaking it down.
PRES = set(("cats", "cat", "chat", "chats"))
POSTS = set(("dogs", "dog", "chien", "chiens"))

def fun(words):
    # words will come as e.g. "22 dogs"
    num, ani = words.split()
    if ani in PRES:
        return "00-" + num
    elif ani in POSTS:
        return num + "-00"
    else:
        # you might want to handle this..
        return "unexpected"

df["New Column"] = df["Animal"].apply(fun)
where df is your dataframe. For a fast lookup, we turn the condition lists into sets. Then we apply the function to the values of the Animal column of df and act accordingly.
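For reference, a quick way to try it on the sample data from the question (assuming pandas is imported as pd and fun is defined as above):
df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(fun)
print(df)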
You could do this: first extract the number, then use np.where to conditionally add characters to the string:
import numpy as np

df['New Col'] = df['Animal'].str.extract(r'([0-9]*)')
df['New Col'] = np.where(df['Animal'].str.contains('dogs|dog|chiens|chien'), df['New Col']+'-00', df['New Col'])
df['New Col'] = np.where(df['Animal'].str.contains('cats|cat|chat|chats'), '00-'+df['New Col'], df['New Col'])
print(df)
Animal New Col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Since your data is well-formatted, you can use a basic substitution and apply it to each row:
import pandas as pd
import re
def replacer(s):
    return re.sub(r" (chiens?|dogs?)", "-00",
                  re.sub(r"(\d+) ch?ats?", r"00-\1", s))
df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(replacer)
Output:
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Using re:
import re
list1 = ['dogs', 'dog', 'chien', 'chiens']
list2 = ['cats', 'cat', 'chat', 'chats']
df['New_col'] = [(re.search(r'(\w+)', val).group(1).strip() + "-00")
                 if re.search(r'([a-zA-Z]+)', val).group(1).strip() in list1
                 else ("00-" + re.search(r'(\w+)', val).group(1).strip())
                 if re.search(r'([a-zA-Z]+)', val).group(1).strip() in list2
                 else val
                 for val in list(df['Animal'])]
print(df)
Output:
Animal New_col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Create tuples of search words:
dog = ('dogs', 'dog', 'chien', 'chiens')
cat = ('cats', 'cat', 'chat', 'chats')
Create a condition for each tuple, with a corresponding replacement, and apply the conditions to the column using numpy.select:
import numpy as np

num = df.Animal.str.split().str[0]  # the numbers

# conditions (str.endswith accepts a tuple of suffixes)
cond1 = df.Animal.str.endswith(dog)
cond2 = df.Animal.str.endswith(cat)
condlist = [cond1, cond2]

# what should be returned for each successful condition
choicelist = [num + "-00", "00-" + num]

df['New Column'] = np.select(condlist, choicelist)
df
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
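One detail worth noting: np.select falls back to its default value (0) for rows that match neither condition, so if your data can contain other animals you may want to pass an explicit default, for example to keep the original value:
df['New Column'] = np.select(condlist, choicelist, default=df.Animal)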
I have a dataframe df like this:
A B
12 A cat
24 The dog
54 An elephant
I have to filter rows based on whether the values in column B contain any string from a list. I can do that for a single string "cat" as follows:
df[df["B"].str.contains("cat", case=False, na=False)]
This will return:
A B
12 A cat
But now I want to filter it for a list of strings, i.e. ['cat', 'dog', ...].
A B
12 A cat
24 The dog
I can do that using a for loop, but I am looking for a pandas way of doing this. I am using Python 3 and pandas, and have searched for solutions on Stack Overflow for the past 2 days.
Use join with | for regex OR, and \b for word boundaries:
L = ['cat', 'dog']
pat = r'\b(?:{})\b'.format('|'.join(L))
df[df["B"].str.contains(pat, case=False, na=False)]
Is there a case-insensitive version of pandas.DataFrame.replace? https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.replace.html
I need to replace string values in a column subject to a case-insensitive condition of the form "where label == a or label == b or label == c".
The issue with some of the other answers is that they don't work with all DataFrames, only with Series, or with DataFrames that can be implicitly converted to a Series. I understand this is because the .str construct exists in the Series class, but not in the DataFrame class.
To work with DataFrames, you can make your regular expression case-insensitive with the (?i) extension. I don't believe this is available in all flavors of regex, but it works with pandas.
d = {'a':['test', 'Test', 'cat'], 'b':['CAT', 'dog', 'Cat']}
df = pd.DataFrame(data=d)
a b
0 test CAT
1 Test dog
2 cat Cat
Then use replace as you normally would but with the (?i) extension:
df.replace('(?i)cat', 'MONKEY', regex=True)
a b
0 test MONKEY
1 Test dog
2 MONKEY MONKEY
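If you need the condition to behave like the equality check in the question (label == a or label == b or label == c) rather than a substring match, one option is to anchor the pattern; a sketch along those lines:
# replace only cells whose entire value is 'cat', ignoring case
# (add more labels with |, e.g. ^(cat|dog)$)
df.replace(r'(?i)^(cat)$', 'MONKEY', regex=True)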
I think you need to convert to lowercase and then replace by condition with isin:
d = {'a':['test', 'Test', 'cat', 'CAT', 'dog', 'Cat']}
df = pd.DataFrame(data=d)
m = df['a'].str.lower().isin(['cat','test'])
df.loc[m, 'a'] = 'baby'
print (df)
a
0 baby
1 baby
2 baby
3 baby
4 dog
5 baby
Another solution:
import re

df['b'] = df['a'].str.replace('test', 'baby', flags=re.I, regex=True)
print (df)
a b
0 test baby
1 Test baby
2 cat cat
3 CAT CAT
4 dog dog
5 Cat Cat
my_list=["one","is"]
df
Out[6]:
Name Story
0 Kumar Kumar is one of the great player in his team
1 Ravi Ravi is a good poet
2 Ram Ram drives well
If any of the items in my_list is present in the "Story" column, I need to get the number of occurrences for all the items.
My desired output (new_df):
word count
one 1
is 2
I managed to extract the rows that contain any of the items in my_list using
mask = df1["Story"].str.contains('|'.join(my_list), na=False)
but now I am trying to get the counts of each word in my_list.
You can use str.split with expand=True and stack to get a Series of words first:
a = df['Story'].str.split(expand=True).stack()
print (a)
0 0 Kumar
1 is
2 one
3 of
4 the
5 great
6 player
7 in
8 his
9 team
1 0 Ravi
1 is
2 a
3 good
4 poet
2 0 Ram
1 drives
2 well
dtype: object
Then filter by boolean indexing with isin, get value_counts, and to turn it into a DataFrame add rename_axis and reset_index:
df = a[a.isin(my_list)].value_counts().rename_axis('word').reset_index(name='count')
print (df)
word count
0 is 2
1 one 1
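For reference, the whole thing end to end, assuming the sample Name/Story dataframe from the question:
import pandas as pd

my_list = ["one", "is"]
df = pd.DataFrame({'Name': ['Kumar', 'Ravi', 'Ram'],
                   'Story': ['Kumar is one of the great player in his team',
                             'Ravi is a good poet',
                             'Ram drives well']})

# split each story into words, stack into one Series, keep only words from my_list, count them
a = df['Story'].str.split(expand=True).stack()
counts = a[a.isin(my_list)].value_counts().rename_axis('word').reset_index(name='count')
print(counts)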
Another solution: create a list of all the words with str.split, flatten it with chain.from_iterable, use Counter, and finally create the DataFrame with the constructor:
from collections import Counter
from itertools import chain
my_list=["one","is"]
a = list(chain.from_iterable(df['Story'].str.split().values.tolist()))
print (a)
['Kumar', 'is', 'one', 'of', 'the', 'great', 'player',
'in', 'his', 'team', 'Ravi', 'is', 'a', 'good', 'poet', 'Ram', 'drives', 'well']
b = Counter([x for x in a if x in my_list])
print (b)
Counter({'is': 2, 'one': 1})
df = pd.DataFrame({'word':list(b.keys()),'count':list(b.values())}, columns=['word','count'])
print (df)
word count
0 one 1
1 is 2
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all the places where s contains any of ['og', 'at']; I would want to get everything but 'pet'.
I have a solution, but it's rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
Is there a better way to do this?
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
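Putting the two together, a small illustrative sketch (the series s2 here is made up for the example; safe_matches is the escaped list from above):
>>> s2 = pd.Series(['$money talks', 'no match here', 'x^y = z'])
>>> s2[s2.str.contains('|'.join(safe_matches))]   # keeps rows 0 and 2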
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
Here is a one-line lambda that also works:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Input:
searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
col1 col2
0 cat 1000.0
1 hat 2000000.0
2 dog 1000.0
3 fog 330000.0
4 pet 330000.0
Apply Lambda:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Output:
col1 col2 TrueFalse
0 cat 1000.0 1
1 hat 2000000.0 1
2 dog 1000.0 1
3 fog 330000.0 1
4 pet 330000.0 0
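If you prefer to stay vectorized instead of using apply, a roughly equivalent sketch (re.escape keeps the substrings literal, as discussed in the answer above):
import re

# boolean mask of rows containing any of the substrings, cast to 1/0
df["TrueFalse"] = df['col1'].str.contains('|'.join(map(re.escape, searchfor))).astype(int)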
I had the same issue. Without making it too complex, you can add | between each entry; for example, fieldname.str.contains("cat|dog") works.