I want to print 1 if the word is in the paragraph, and 0 if it is not. The first line contains the word bestselling, yet my lambda is printing 0.
A good way to do that is to use the any() function and cast the result to int:
text = "this is a text used for an example."
first_list = ["word", "second_word", "example"]
second_list = ["word", "second_word", "third_word"]
is_in = int(any(k in text for k in first_list))
print(is_in)  # prints 1
not_in = int(any(k in text for k in second_list))
print(not_in)  # prints 0
A way to search in a DataFrame is the contains method of the str accessor (see the pandas documentation). In your case you want to check whether any of several words occur in the text, so a regular expression can be used:
df["sun"].str.contains("brilliant|bestselling|best|best-selling|loved|great|amazing", regex=True)
If you also want to match the words regardless of case, you can add the re.IGNORECASE flag:
import re
df["sun"].str.contains("brilliant|bestselling|best|best-selling|loved|great|amazing", flags=re.IGNORECASE, regex=True)
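For a quick check outside pandas, the same alternation works on a plain string with re.search; a minimal sketch (the sample text here is made up):

```python
import re

pattern = "brilliant|bestselling|best|best-selling|loved|great|amazing"
text = "The Bestselling novel everyone loved."

# 1 if any alternative occurs (ignoring case), else 0
flag = int(bool(re.search(pattern, text, flags=re.IGNORECASE)))
print(flag)  # 1
```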
Related
I want to search for and blank out sentences that contain words like "masked 111" or "My Add no" from other sentences like "XYZ masked 111" or "Hello My Add" in Python. How can I do that?
I was trying to make changes to the code below, but it was not working because of the spaces.
def garbagefin(x):
    k = " ".join(re.findall("[a-zA-Z0-9]+", x))
    print(k)
    t = re.split(r'\s', k)
    print(t)
    Glist = {'masked 111', 'DATA', "My Add no", 'MASKEDDATA'}
    for n, m in enumerate(t):  # to remove entire ID
        if m in Glist:
            return ''
        else:
            return x
The output that I am expecting is:
garbagefin("I am masked 111") - blank
garbagefin("I am My Add No") - blank
garbagefin("I am My add") - I am My add
garbagefin("I am My MASKEDDATA") - blank
You can also use a regex approach like this:
import re
Glist = {'masked 111', 'DATA', "My Add no", 'MASKEDDATA'}
glst_rx = r"\b(?:{})\b".format("|".join(Glist))

def garbagefin(x):
    if re.search(glst_rx, x, re.I):
        return ''
    else:
        return x
The glst_rx = r"\b(?:{})\b".format("|".join(Glist)) code generates a regex like \b(?:My Add no|DATA|MASKEDDATA|masked 111)\b (the order of the alternatives depends on set iteration order).
It matches the strings from Glist as whole words, in a case-insensitive way (note the re.I flag in re.search(glst_rx, x, re.I)). Once a match is found, an empty string is returned; otherwise the input string is returned.
If there are very many items in Glist, you could leverage a regex trie (the trieregex library can generate such tries).
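As a defensive variant (my own addition, not part of the answer above): escaping each item with re.escape guards against regex metacharacters in the list, and sorting longest-first keeps a shorter alternative from shadowing a longer one that starts the same way:

```python
import re

Glist = {'masked 111', 'DATA', "My Add no", 'MASKEDDATA'}

# Escape every item; order alternatives longest-first
glst_rx = r"\b(?:{})\b".format(
    "|".join(map(re.escape, sorted(Glist, key=len, reverse=True))))

def garbagefin(x):
    return '' if re.search(glst_rx, x, re.I) else x

print(repr(garbagefin("I am masked 111")))  # ''
print(repr(garbagefin("I am My add")))      # 'I am My add'
```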
Seems like you don't actually need regex. Just the usual in operator.
def garbagefin(x):
    return "" if any(text in x for text in Glist) else x
If your matching should be case-insensitive, compare against casefolded text.
Glist = set(map(lambda text: text.casefold(), Glist))
...

def garbagefin(x):
    x_lower = x.casefold()
    return "" if any(text in x_lower for text in Glist) else x
Output
1.
2.
3. I am My add
4.
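Putting it together, a minimal end-to-end run of the case-insensitive version (the test strings are the ones from the question):

```python
Glist = {'masked 111', 'DATA', "My Add no", 'MASKEDDATA'}
Glist = set(map(str.casefold, Glist))  # normalize once, up front

def garbagefin(x):
    x_lower = x.casefold()
    return "" if any(text in x_lower for text in Glist) else x

for s in ["I am masked 111", "I am My Add No", "I am My add", "I am My MASKEDDATA"]:
    print(repr(garbagefin(s)))  # '', '', 'I am My add', ''
```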
If you're just trying to find one string inside another, I don't think you need such convoluted code, and you can store the key strings in a list.
Simply use the in operator and return:
def garbagefin(x):
    L = ["masked 111", "DATA", "My Add no", "MASKEDDATA"]
    for i in L:
        if i in x:
            print("Blank")
            return
    print(x)
I want to make multiple substitutions to a string using multiple regular expressions. I also want to make the substitutions in a single pass to avoid creating multiple instances of the string.
Let's say for argument that I want to make the substitutions below, while avoiding multiple use of re.sub(), whether explicitly or with a loop:
import re
text = "local foals drink cola"
text = re.sub("(?<=o)a", "w", text)
text = re.sub("l(?=a)", "co", text)
print(text) # "local fowls drink cocoa"
The closest solution I have found for this is to compile a regular expression from a dictionary of substitution targets and then to use a lambda function to replace each matched target with its value in the dictionary. However, this approach does not work when using metacharacters, thus removing the functionality needed from regular expressions in this example.
Let me demonstrate first with an example that works without metacharacters:
import re
text = "local foals drink cola"
subs_dict = {"a":"w", "l":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)
print(text) # "coocwco fowcos drink cocow"
Now observe that adding the desired metacharacters to the dictionary keys results in a KeyError:
import re
text = "local foals drink cola"
subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)
>>> KeyError: 'a'
The reason for this is that the sub() function correctly finds a match for the expression "(?<=o)a", so this must now be found in the dictionary to return its substitution, but the value submitted for dictionary lookup by match.group(0) is the corresponding matched string "a". It also does not work to search for match.re in the dictionary (i.e. the expression that produced the match) because the value of that is the whole disjoint expression that was compiled from the dictionary keys (i.e. "(?<=o)a|l(?=a)").
EDIT: In case anyone would benefit from seeing thejonny's solution implemented with a lambda function as close to my originals as possible, it would work like this:
import re
text = "local foals drink cola"
subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join("("+key+")" for key in subs_dict))
group_index = 1
indexed_subs = {}
for target, sub in subs_dict.items():
    indexed_subs[group_index] = sub
    group_index += re.compile(target).groups + 1
text = re.sub(subs_regex, lambda match: indexed_subs[match.lastindex], text)
print(text) # "local fowls drink cocoa"
If no expression you want to use matches an empty string (which is a valid assumption if you want to replace), you can use groups before |ing the expressions, and then check which group found a match:
(exp1)|(exp2)|(exp3)
Or maybe named groups so you don't have to count the subgroups inside the subexpressions.
The replacement function can then look up which group matched and choose the replacement from a list.
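A sketch of the named-group variant (the g0, g1 group names are my own invention; this assumes none of the subexpressions matches an empty string and none contains its own capturing groups):

```python
import re

subs_dict = {"(?<=o)a": "w", "l(?=a)": "co"}

# Wrap each subexpression in a named group so match.lastgroup
# tells us which alternative matched
pattern = re.compile("|".join(
    "(?P<g{}>{})".format(i, p) for i, p in enumerate(subs_dict)))
repl_by_name = {"g{}".format(i): r for i, r in enumerate(subs_dict.values())}

text = re.sub(pattern, lambda m: repl_by_name[m.lastgroup],
              "local foals drink cola")
print(text)  # local fowls drink cocoa
```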
I came up with this implementation:
import re
def dictsub(replacements, string):
    """replacements has the form {"regex1": "replacement1", "regex2": "replacement2", ...}"""
    exprall = re.compile("|".join("(" + x + ")" for x in replacements))
    gi = 1
    replacements_by_gi = {}
    for (expr, replacement) in replacements.items():
        replacements_by_gi[gi] = replacement
        gi += re.compile(expr).groups + 1
    def choose(match):
        return replacements_by_gi[match.lastindex]
    return re.sub(exprall, choose, string)
text = "local foals drink cola"
print(dictsub({"(?<=o)a":"w", "l(?=a)":"co"}, text))
that prints local fowls drink cocoa
You could do this by keeping your key as the expected match and storing both your replace and regex in a nested dict. Given you're looking to match specific chars, this definition should work.
subs_dict = {"a": {'replace': 'w', 'regex': '(?<=o)a'}, 'l': {'replace': 'co', 'regex': 'l(?=a)'}}
subs_regex = re.compile("|".join([subs_dict[k]['regex'] for k in subs_dict.keys()]))
re.sub(subs_regex, lambda match: subs_dict[match.group(0)]['replace'], text)
'local fowls drink cocoa'
I am new to Python and I have an issue with replacing. I have a list and a string, and I want to replace a word in the string wherever it matches a word in the list. I have tried, but I am not getting the expected output:
str = "'hi'+'bikes'-'cars'>=20+'rangers'"
list = [df['hi'],df['bikes'],df['cars'],df['rangers']]
for i in list:
    if i in str:
        z = x.replace(j, i)
But getting a wrong answer
Expected output:
z = "df['hi']+df['bikes']-df['cars']>=20+df['rangers']"
If you are sure that the column names are always alphabetic, then simply use:
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
s2 = re.sub(r"('[A-Za-z]+')", r'df[\1]', s)
print(s2)  # df['hi']+df['bikes']-df['cars']>=20+df['rangers']
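If the column names may also contain digits or underscores (an assumption beyond the original question), the character class can be widened; a sketch:

```python
import re

# 'bikes_2' is a made-up column name to exercise the wider pattern
s = "'hi'+'bikes_2'-'cars'>=20+'rangers'"
s2 = re.sub(r"('[A-Za-z_][A-Za-z0-9_]*')", r'df[\1]', s)
print(s2)  # df['hi']+df['bikes_2']-df['cars']>=20+df['rangers']
```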
I want to match a string with a list of values. These can be overlapping, so for example string = "test1 test2" and values = ["test1", "test1 test2"].
EDIT: Below is my entire code for a simple example
import regex
string = "This is a test string"
values = ["test", "word", "string", "test string"]
pattern = r'\b({})\b'.format('|'.join(map(regex.escape, values)))
matches = set(map(str.lower, regex.findall(pattern, string, regex.IGNORECASE)))
output = ([x.upper() for x in values if x.lower() in matches])
print(output) # ['TEST', 'STRING']
# Expected output: ['TEST', 'STRING', 'TEST STRING']
As Wiktor commented, if you want to find all matches, you cannot use alternatives, because the regex engine tries the alternatives in order and returns only the first one that matches at a given position.
So your program has to use a separate pattern for each value to test, but for performance reasons you can compile all of them in advance.
Another difference I spotted between your installation and mine is import regex. That is a third-party module from PyPI; the standard library module, which I use below and which is sufficient here, is imported with import re.
The script can look like below:
import re
def mtch(pat, str):
    s = pat.search(str)
    return s.group().upper() if s else None
# Strings to look for
values = ["test", "word", "string", "test string"]
# Compile patterns
patterns = [re.compile(r'\b({})\b'.format(re.escape(v)), re.IGNORECASE)
            for v in values]
# The string to check
string = "This is a test string"
# What has been found
print(list(filter(None, [mtch(pat, string) for pat in patterns])))
The mtch function returns the text found by pat (the compiled pattern)
in str (the source string), or None if the match failed.
patterns contains a list of compiled patterns.
Then [mtch(pat, string) for pat in patterns] is a list comprehension
generating the list of match results (with None values where the match
attempt failed).
To filter out None values I used filter function.
And finally list gathers all filtered strings and prints:
['TEST', 'STRING', 'TEST STRING']
If you want to perform this search for multiple source strings,
run only the last statement for each source string, probably adding
the result (and some indication of what string has been searched)
to some result list.
If your source list is very long, you should not attempt to read them all.
Instead, you should read them one by one in a loop and run the check
only for the current input string.
Edit concerning comment as of 2019-02-18 10:00Z
As I read from your comment, the code reading strings is as follows:
with open("Test_data.csv") as f:
    for entry in f:
        entry = entry.split(',')
        string = entry[2] + " " + entry[3] + " " + entry[6]
Note that you overwrite string in every loop, so after the loop completed,
you have there the result from the last row (only).
Or maybe just after reading you run the search for patterns for the current
string?
A few more hints for changing the code:
Avoid combinations where, e.g., the entry variable first holds
the whole string and then a list (the product of splitting).
Maybe a more readable variant is:
for row in f:
    entry = row.split(',')
After you read a row and before doing anything else, check whether the row
just read is not empty. If the row is empty, omit it.
A quick way to test it is just to use the string in an if (an empty string
evaluates to False); note that lines read from a file keep their trailing
newline, so strip it first.
for row in f:
    if row.strip():
        entry = row.split(',')
        ...
Before string = entry[2] + " " + entry[3] + " " + entry[6], check
whether the entry list has at least 7 items (numbering starts from 0).
Maybe some of your input rows contain fewer fragments,
and hence your program attempts to read a non-existing element of
this list?
To be sure, what strings you are checking, write a short program
which only splits the input and prints resulting strings. Then look at them, maybe you find something wrong.
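A sketch of such a check, with made-up sample data standing in for Test_data.csv (the csv module also handles quoted fields with embedded commas, which a plain split(',') would break on):

```python
import csv
import io

# Made-up sample: second row is empty, third row has too few fields
sample = "a,b,ID1,name one,x,y,desc one\n\na,b,ID2,name two,x,y\n"

strings = []
for row in csv.reader(io.StringIO(sample)):
    if not row:          # skip empty rows
        continue
    if len(row) < 7:     # skip rows with fewer than 7 fields
        continue
    strings.append(row[2] + " " + row[3] + " " + row[6])

print(strings)  # ['ID1 name one desc one']
```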
If you determine that foobar is in the text, you don't need to search the text separately for foo and bar: you know the answer already.
First group your searches:
searches = ['test', 'word', 'string', 'test string', 'wo', 'wordy']
unique = set(searches)
ordered = sorted(unique, key=len)
grouped = {}
while unique:
    s1 = ordered.pop()
    if s1 in unique:
        unique.remove(s1)
        grouped[s1] = [s1]
        redundant = [s2 for s2 in unique if s2 in s1]
        for s2 in redundant:
            unique.remove(s2)
            grouped[s1].append(s2)

for s, dups in grouped.items():
    print(s, dups)
# Output:
# test string ['test string', 'string', 'test']
# wordy ['wordy', 'word', 'wo']
Once you have things grouped, you can confine the searching to just the top-level searches (the keys of grouped).
Also, if scale and performance are concerns, do you really need regular expressions? Your current examples could be handled with ordinary in tests, which are faster. If you do indeed need regular expressions, the idea of grouping the searches is harder -- but perhaps not impossible under some conditions.
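A sketch of consuming the grouped structure with plain in tests (find_all is my own name): a hit on a top-level key implies hits on all of its substrings, but a miss still requires testing each duplicate individually, since e.g. "string" can occur without "test string".

```python
def find_all(grouped, text):
    found = set()
    for key, dups in grouped.items():
        if key in text:
            found.update(dups)  # key present => every substring of it is too
        else:
            found.update(d for d in dups if d in text)
    return found

grouped = {"test string": ["test string", "string", "test"],
           "wordy": ["wordy", "word", "wo"]}
print(sorted(find_all(grouped, "This is a test string")))
# ['string', 'test', 'test string']
```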
I have a regular expression like this:
findthe = re.compile(r" the ")
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
What I am trying to do is to replace each occurrence with an associated replacement word from a list so that the end sentence would look like this:
>>> print sentence
This is firstthe first sentence in secondthe whole universe
I tried using re.sub inside a for loop enumerating over replacement but it looks like re.sub returns all occurrences. Can someone tell me how to do this efficiently?
If it is not required to use regex, then you can try the following code:
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
words = sentence.split()
counter = 0
for i, word in enumerate(words):
    if word == 'the':
        words[i] = replacement[counter]
        counter += 1
sentence = ' '.join(words)
Or something like this will work too:
import re
findthe = re.compile(r"\b(the)\b")
print(re.sub(findthe, replacement[1], re.sub(findthe, replacement[0], sentence, 1), 1))
And finally:
re.sub(findthe, lambda matchObj: replacement.pop(0),sentence)
Artsiom's last answer is destructive: it empties the replacement list. Here's a way to do it without consuming replacement:
re.sub(findthe, lambda m, r=iter(replacement): next(r), sentence)
You can use a callback function as the replace parameter, see how at:
http://docs.python.org/library/re.html#re.sub
Then use some counter and replace depending on the counter value.
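The counter idea can be sketched like this (Python 3; itertools.count supplies the running index):

```python
import itertools
import re

findthe = re.compile(r"\bthe\b")
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"

# The callback receives each match object; the counter picks
# the next replacement word for each successive occurrence
counter = itertools.count()
result = findthe.sub(lambda m: replacement[next(counter)], sentence)
print(result)  # This is firstthe first sentence in secondthe whole universe!
```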