Check if a substring is in a string in Python

I have two DataFrames. I need to check whether each word from the first df appears as a substring in any string in the second df, and get a list of the words that occur in the second df.
First df (word):
word
apples
dog
cat
cheese
Second df (sentence):
sentence
apples grow on a tree
...
I love cheese
I tried this one:
tru = []
for i in word['word']:
    if i in sentence['sentence'].values:
        tru.append(i)
And this one:
tru = []
for i in word['word']:
    if sentence['sentence'].str.contains(i):
        tru.append(i)
I expect to get a list like ['apples',..., 'cheese']

One possible way is to use Series.str.extractall:
import re
import pandas as pd
df_word = pd.Series(["apples", "dog", "cat", "cheese"])
df_sentence = pd.Series(["apples grow on a tree", "i love cheese"])
pattern = f"({'|'.join(df_word.apply(re.escape))})"
matches = df_sentence.str.extractall(pattern)
matches
Output:
              0
  match
0 0      apples
1 0      cheese
You can then convert the results to a list:
matches[0].unique().tolist()
Output:
['apples', 'cheese']
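As for the original attempts: the first loop fails because i in sentence['sentence'].values checks for exact equality with a whole cell, not substring containment, and the second fails because str.contains(i) returns a boolean Series, which is ambiguous in an if statement. A minimal fix of the second loop, assuming the same word and sentence DataFrames, is to reduce the Series with .any():
tru = []
for w in word['word']:
    # .any() is True if at least one sentence contains the word
    if sentence['sentence'].str.contains(w, regex=False).any():
        tru.append(w)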

Related

How to test if a string contains all of the substrings in a list: pandas? [duplicate]

I have a df (Pandas Dataframe) with three rows:
some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
The function df.col_name.str.contains("apple|banana") will catch all of the rows:
"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".
How do I apply AND operator to the str.contains() method, so that it only grabs strings that contain BOTH "apple" & "banana"?
"apple and banana both are delicious"
I'd like to grab strings that contains 10-20 different words (grape, watermelon, berry, orange, ..., etc.)
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
You can also do it in regex expression style:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then, build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
Then you can do your stuff dynamically.
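For example, the generated pattern can be passed straight to str.contains (a sketch, assuming the column is named col_name as above):
pattern = base.format(''.join(expr.format(w) for w in words))
df[df['col_name'].str.contains(pattern, regex=True)]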
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']
# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
This works
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:
import numpy as np

targets = ['apple', 'banana', 'strawberry']
fruit_masks = [df['col'].str.contains(string) for string in targets]  # a list, so np.vstack accepts it
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]
Try this regex
apple.*banana|banana.*apple
Code is:
import pandas as pd
df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))
print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])
Output
ID String_Col
2 3 apple and banana both are delicious
If you want to require at least two words in the sentence, maybe this will work (taking the tip from @Alexander):
target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]
output:
col
2 apple and banana both are delicious
If you have more than two words to catch which are separated by a comma ',', then add the comma to connector_list and modify the second condition from all to any:
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]
output:
col
2 apple and banana both are delicious
3 orange,banana and apple all are delicious
Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).
For example, consider the following DataFrame:
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})
# col
#0 apple is delicious
#1 banana is delicious
#2 apple and banana both are delicious
#3 i love apple, banana, and strawberry
Suppose we wanted to search for all of the following:
targets = ['apple', 'banana', 'strawberry']
We can do:
#from functools import reduce # needed for python3
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])
# col
#3 i love apple, banana, and strawberry
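An equivalent formulation that stays in pandas, if you would rather avoid reduce, is to concatenate the individual masks and call all along the columns (a sketch using the same df and targets):
masks = pd.concat([df['col'].str.contains(s) for s in targets], axis=1)
print(df[masks.all(axis=1)])
# col
#3 i love apple, banana, and strawberry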
You can create masks:
apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]
From @Anzel's answer, I wrote a function since I'm going to be applying this a lot:
def regify(words, base=r'^{}', expr='(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))
So if you have words defined:
words = ['apple', 'banana']
And then call it with something like:
dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]
then you should get what you're after.
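For example, with the two words above, regify produces the same lookahead pattern built earlier:
regify(['apple', 'banana'])  # '^(?=.*apple)(?=.*banana)'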

Python: Replace multiple words in a column with multiple words from a list and write them to a different column in Pandas

I have a dataframe and a list as follows:
data = {"text": ["I have one apple and two bananas", "this is my apple", "she has three apples",
                 "My friend has five apples but she only has one banana"]}
df = pd.DataFrame(data=data, columns=['text'])
my_list = ['one','two','three','four','five']
What I would like as output is an extra column 'new_text' in which each sentence containing words from the list has those words replaced by each and every word from my_list, so the output would look like this:
output:
text new_text
0 I have one apple and two bananas I have two apple and three bananas, I have three apple and four bananas,I have four apple and five bananas,I have five apple and one bananas,...
1 this is my apple this is my apple
2 she has three apples she has two apples,she has four apples,she has five apples,...
and so on...
Repetition of the same sentence and plural forms do not matter; the only important thing is that all the words from the list appear in the sentences in the 'new_text' column.
I have tried the code here: Python: Replace one word in a sentence with a list of words and put the new sentences in another column in pandas
with an exception in step 1, but it only finds the first word:
data1 = data['text'].str.extract(
    r"(?i)(?P<before>.*)\s(?P<clock>\(?=\bone\b | \btwo\b | \bthree\b | \bfour\b | \bfive\b))\s(?P<after>.*)")
Thank you in advance
You can use this:
a = ["I have one apple and two bananas", "this is my apple",
     "she has three apples", "My friend has five apples but she only has one banana"]
b = ['one', 'two', 'three', 'four', 'five', 'six']
new = []
for sentence in a:
    newSen = []
    for word in sentence.split():
        if word in b:
            newSen.append(b[b.index(word) + 1])
        else:
            newSen.append(word)
    new.append(' '.join(newSen))
print(new)  # output: ['I have two apple and three bananas', 'this is my apple', 'she has four apples', 'My friend has six apples but she only has two banana']
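The loop above only shifts each number word to the next one in the list. If you also want the result in a new_text column with every word from my_list substituted in (as in the expected output), here is a rough sketch that satisfies the stated requirement, assuming the df and my_list from the question:
def variants(sentence):
    words = sentence.split()
    # leave sentences untouched if they contain no word from my_list
    if not any(w in my_list for w in words):
        return sentence
    # one rewritten sentence per replacement word
    rewritten = (' '.join(repl if w in my_list else w for w in words)
                 for repl in my_list)
    return ', '.join(rewritten)

df['new_text'] = df['text'].apply(variants)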

Remove values from one column, that are equal to the value in another

I currently have two columns:
Word Sentence
apple [this, fruit, is, an, apple]
orange [orange, is, this, fruit]
grape [this, is, grape]
strawberry [strawberry, is, nice]
How would I go about removing the value that appears in df['Word'] from df['Sentence'] so that the output would be:
Word Sentence
apple [this, fruit, is, an]
orange [is, this, fruit]
grape [this, is]
strawberry [is, nice]
I am currently trying to use this while loop, which is not very pythonic.
count_row = df.shape[0]
i = 0
while i < count_row:
    mylist = df.iloc[i]["Sentence"]
    mykeyword = df.iloc[i]["Word"]
    mylist = mylist.split()
    for word in mylist:
        if word == mykeyword:
            df.iloc[i]["Sentence"] = df.iloc[i]["Sentence"].replace(word, '')
    print(i)
    i = i + 1
However, the loop is not removing the values. What is the best way to achieve the desired output?
How about something like...
def remove_name(r):
    r['Sentence'] = [w for w in r['Sentence'] if w != r['Word']]
    return r

df.apply(remove_name, axis=1)
Apply lets us perform operations like this all at once, no iterations required.
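If you would rather skip apply, a plain comprehension over the two columns does the same thing (a sketch, assuming Sentence already holds lists of words as shown above):
df['Sentence'] = [[w for w in sent if w != word]
                  for word, sent in zip(df['Word'], df['Sentence'])]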
You can use the remove function to remove an element from a list.
Syntax: list.remove(element)
where list is your sentence list and element is the fruit name to be removed.
To learn more about the remove function, see the Python docs or this link: https://www.programiz.com/python-programming/methods/list/remove
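A small illustration on a single row's list; note that remove deletes only the first matching element, in place:
sentence = ['this', 'fruit', 'is', 'an', 'apple']
sentence.remove('apple')
print(sentence)  # ['this', 'fruit', 'is', 'an']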

Counting the number of times a word appears in n tweets

I have a data frame of about 118,000 tweets. Here's a made up sample:
Tweets
1 The apple is red
2 The grape is purple
3 The tree is green
I have also used the 'set' function to arrive at a list of every unique word found in my data frame of tweets. For the example above it looks like this (in no particular order):
Words
1 The
2 is
3 apple
4 grape
....so on
Basically I need to find out how many tweets contain a given word. For example, "The" is found in 3 tweets, "apple" is found in 1 tweet, "is" is found in 3 tweets, and so on.
I have tried using a nested for loop that looks like:
number_words = [0] * len(words)
for i in range(len(words)):
    for j in range(len(tweets)):
        if words[i] in tweets[j]:
            number_words[i] += 1
number_words
This creates a new list and counts the number of tweets that contain each word down the list. However, I have found that this is incredibly inefficient; the code block takes forever to run.
What is a better way to do this?
You could use str.count:
df.Tweets.str.count(word).sum()
For example, assuming Words is the list:
for word in Words:
    print(f'{word} count: {df.Tweets.str.count(word).sum()}')
Full sample:
import pandas as pd
from io import StringIO

data = """
Tweets
The apple is red
The grape is purple
The tree is green
"""
datb = """
Words
The
is
apple
grape
"""
dfa = pd.read_csv(StringIO(data), sep=';')
dfb = pd.read_csv(StringIO(datb), sep=';')
Words = dfb['Words'].values
dico = {}
for word in Words:
    dico[word] = dfa.Tweets.str.count(word).sum()
print(dico)
output:
{'The': 3, 'is': 3, 'apple': 1, 'grape ': 1}
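Note that str.count sums every occurrence, so a tweet that mentions a word twice is counted twice. If you need the number of tweets that contain the word at least once (as asked), str.contains plus sum is a closer fit (a sketch reusing dfa and Words from above):
dico = {word: int(dfa.Tweets.str.contains(word, regex=False).sum()) for word in Words}
print(dico)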
You can use a default dictionary for this to store all word counts like this:
from collections import defaultdict

word_counts = defaultdict(int)
for tweet in tweets:
    # split each tweet into words; iterating the string itself would count characters
    for word in tweet.split():
        word_counts[word] += 1
# print(word_counts['some_word']) will output the occurrences of some_word
This will take your list of words and turn it into a dictionary of counts:
import collections

words = tweets.split()  # assumes tweets is a single string containing all the tweets
counter = collections.Counter(words)
for key, value in sorted(counter.items()):
    print("`{}` is repeated {} time".format(key, value))

Writing an If condition to filter out the first word

I have a string:
Father ate a banana and slept on a feather
Part of my code shown below:
...
if word.endswith(('ther')):
    print word
This prints feather and also Father
But I want to modify this if condition so it doesn't apply the search to the first word of the sentence. So the result should only print feather.
I tried using and, but it didn't work:
...
if word.endswith(('ther')) and word[0].endswith(('ther')):
    print word
This doesn't work. HELP
You can use a slice to skip the first word and apply the endswith() function to the rest of the words, like:
s = 'Father ate a banana and slept on a feather'
[w for w in s.split()[1:] if w.endswith('ther')]
You can build a regex:
import re
re.findall(r'(\w*ther)',s)[1:]
['feather']
If I understand your question correctly, you don't want it to print the word if it's the first word in the string. So, you could copy the string and drop the first word.
I'll walk you through it. Say you have this string:
s = "Father ate a banana and slept on a feather"
You can split it by running s.split() and catching that output:
['Father', 'ate', 'a', 'banana', 'and', 'slept', 'on', 'a', 'feather']
So if you want all the words except the first, you can use the slice [1:]. And you can combine the list of words by joining with a space.
s1 = "Father ate a banana and slept on a feather"
s2 = " ".join(s1.split()[1:])
The string s2 will now be the following:
ate a banana and slept on a feather
You can use that string and iterate over the words like you did above.
If you want to avoid making a temporary string
[w for i, w in enumerate(s.split()) if w.endswith('ther') and i]
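Applied to the sample sentence, this gives the same result; Father is skipped because its index 0 is falsy:
s = 'Father ate a banana and slept on a feather'
print([w for i, w in enumerate(s.split()) if w.endswith('ther') and i])
# ['feather']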
