Extracting and replacing a particular string from a sentence in Python

Say I have a string,
s1="Hey Siri open call up duty"
and another string
s2="call up duty".
Now I know that "call up duty" should be replaced by "call of duty".
Say s3="call of duty".
I want to remove s2 from s1 and put s3 in its place, but I am not sure how to do this. Can anyone guide me? I am new to Python. The result should be
"Hey Siri open call of duty"
Note: s2 can appear anywhere within s1, not necessarily at the end.

In Python, strings have a replace() method that you can use to replace the substring s2 with s3.
s1 = "Hey Siri open call up duty"
s2 = "call up duty"
s3 = "call of duty"
s1 = s1.replace(s2, s3)
print(s1)
This should do it for you. For more complex substitutions the re module can be of help.
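As a rough sketch of what a "more complex" substitution might look like, the example below assumes the phrase can show up with different capitalization (that input is an assumption, not part of the question). re.escape keeps any regex metacharacters in s2 literal, and re.IGNORECASE makes the search case-insensitive.

```python
import re

s1 = "Hey Siri open Call Up Duty"   # hypothetical input with different capitalization
s2 = "call up duty"
s3 = "call of duty"

# re.escape treats s2 as plain text; IGNORECASE lets it match "Call Up Duty"
fixed = re.sub(re.escape(s2), s3, s1, flags=re.IGNORECASE)
print(fixed)  # Hey Siri open call of duty
```

For an exact, case-sensitive match, plain str.replace as shown above is simpler and is usually enough.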

You can use an f-string to build the string from separate pieces.
s2= "call up duty"
s3= "call of duty"
s1= f"Hey Siri open {s2}"

Related

Python regex to find either one or the other

I have the following regex that checks for the word "Read", but I'd like it to check for either "Read" or "Deleted":
len(re.findall("Read", phrase))
How can I make the regex so it's looking for either Read or Deleted?
You can use alternatives (separated by a pipe |) to search for either "Read" or "Deleted":
len(re.findall("Read|Deleted", phrase))
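One thing to watch with a bare alternation is that it also matches inside longer words. The sketch below uses a made-up sample phrase to show the difference that word boundaries (\b) make; whether you want whole-word matching is an assumption about your data.

```python
import re

phrase = "Read, Deleted, Readme"  # hypothetical sample text

# Bare alternation: also matches the "Read" inside "Readme"
loose = re.findall(r"Read|Deleted", phrase)

# Word boundaries restrict the match to whole words
strict = re.findall(r"\b(?:Read|Deleted)\b", phrase)

print(len(loose))   # 3
print(len(strict))  # 2
```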
If the match must be independent of the order in which the search words appear:
import re
phrase = " hello world Deleted,Undeleted,Read in a sentence"
results = re.search("(Read|Deleted|Undeleted).*(Read|Deleted|Undeleted).*(Read|Deleted|Undeleted).*", phrase).groups()
for group in results:
    print(group)
Output:
Deleted
Undeleted
Read
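A possibly simpler variant, sketched here as an alternative rather than as the answer's own method: re.findall with the same alternation returns every keyword in the order it occurs, and it returns an empty list when nothing matches (whereas calling .groups() on a failed re.search raises AttributeError, because search returns None).

```python
import re

phrase = " hello world Deleted,Undeleted,Read in a sentence"

# Each keyword is reported in the order it appears in the text
found = re.findall(r"Read|Deleted|Undeleted", phrase)
print(found)  # ['Deleted', 'Undeleted', 'Read']

# No keyword present: an empty list instead of an exception
missing = re.findall(r"Read|Deleted|Undeleted", "nothing relevant here")
print(missing)  # []
```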

How to get the string outside two markers using Python regular expressions?

For example, let's say I have the following string:
str1 = "There are !27 papers in! the book !right! now. !Also a marker.!"
I would like it to return a list of the words outside the '!' markers.
So for this particular question: ["There", "are", "the", "book", "now"]
I've tried python regular expression using re.findall('!(.+?)!', str1) but that's returning what's inside the '!' not the ones outside.
This should be close enough, though it does not use re:
[s for w in str1.split('!')[::2] for s in w.split()]
Even with re, you would still need some effort to strip punctuation from the result.
The simplest way would be to substitute the marked sections with an empty string:
import re
str1 = "There are !27 papers in! the book !right! now. !Also a marker.!"
re.sub(r'(!.+?!)', '', str1)
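The substitution leaves a string, not the list the question asks for. A minimal sketch of the remaining step, assuming that a "word" means a run of ASCII letters (so the period after "now" is dropped):

```python
import re

str1 = "There are !27 papers in! the book !right! now. !Also a marker.!"

# Step 1: remove everything between pairs of '!' markers (non-greedy)
outside = re.sub(r"!.+?!", "", str1)

# Step 2: pull out the words, which also discards punctuation
words = re.findall(r"[A-Za-z]+", outside)
print(words)  # ['There', 'are', 'the', 'book', 'now']
```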

Python reading from file vs directly assigning literal

I asked a Python question a few minutes ago about how Python's newlines work, only to have it closed because of another question that is not even similar and is not about Python.
I have text with a '\n' character and '\t' in it, in a file. I read it using
open().read()
I then stored the result in an identifier. My expectation was that text such as
I\nlove\tCoding
read from a file and assigned to an identifier would be the same as the string literal
"I\nlove\tCoding"
assigned directly to an identifier.
My assumption was wrong:
word = "I\nlove\tCoding"
ends up being different from
word = open(*.txt).read()
where the content of *.txt is exactly the string "I\nlove\tCoding".
Edit:
I did make a typo; I meant \t and \n. Searching for \t with the re module's search() returns None, even though \t is there. Why is this?
You need to differentiate between newlines/tabs and their corresponding escape sequences:
import re
for filename in ('test1.txt', 'test2.txt'):
    print(f"\n{filename} contains:")
    with open(filename, 'r') as f:
        fileData = f.read()
    print(fileData)
    for pattern in (r'\\n', r'\n'):
        # the first pattern matches the two-character escape sequence, the second a real newline
        m = re.search(pattern, fileData)
        if m:
            print(f"found {pattern}")
Out:
test1.txt contains:
I\nlove\tCoding
found \\n
test2.txt contains:
I
love Coding
found \n
The string you get after reading from the file is "I\\nlove\\tCoding" when written as a Python literal: each backslash is a real character. If you want a literal that equals the string read from the file, use the r prefix, for example: word = r"I\nlove\tCoding"
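The difference can be checked without any files. In this sketch a raw string stands in for what read() would return from a file containing the backslash characters (an assumption for illustration), and the last step shows one way to turn the escape sequences into real control characters, assuming the text is plain ASCII.

```python
import re

from_literal = "I\nlove\tCoding"   # real newline and real tab
from_file = r"I\nlove\tCoding"     # stands in for open(...).read() on such a file

print(len(from_literal), len(from_file))         # 13 15

# A regex for a real tab finds nothing: the text holds a backslash and a 't'
print(re.search(r"\t", from_file))               # None
print(re.search(r"\\t", from_file) is not None)  # True

# Interpret the escape sequences (assumes ASCII-only text)
decoded = from_file.encode("ascii").decode("unicode_escape")
print(decoded == from_literal)                   # True
```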

How do I exclude a string from re.findall?

This might be a silly question, but I'm just trying to learn!
I'm trying to build a simple email search tool to learn more about python. I'm modifying some open source code to parse the email address:
emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
Then I'm writing the results into a spreadsheet using the csv module.
Since I'd like to keep the domain extension open to almost anything, my results include image file names that look like email addresses, for example:
forbes@2x-302019213j32.png
How can I exclude the "png" string from re.findall?
Code:
def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception, e:
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails:  # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)
You are already filtering with an if, so just add the check to that condition. That will be much easier than trying to exclude it in the regex:
if email not in self.emails and not email.endswith("png"):  # not a duplicate and not a png
    self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
    self.emails.append(email)
I know Joran already gave you a response, but here's another way to do it with Python regex that I found cool.
There is a (?!...) matching pattern that essentially says: "Wherever you place this matching pattern, if at that point in the string this pattern is checked and a match is found, then that match occurrence fails."
If that was a bad explanation, the Python document does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions
Also, here is a working example:
import re
y = r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = 'forbes@2x-302019213j32.png'
re.findall(y, s) # Will return an empty list
s2 = 'myname@email2018529391230.net'
re.findall(y, s2) # Will return a list with s2 string
s3 = s + ' ' + s2 # Concatenates the two e-mail-formatted strings
re.findall(y, s3) # Will only return s2 string in list
Lots of ways to do this, but my favorite is:
pat = re.compile(r'''
    [A-Za-z0-9\.\+_-]+    # 1+ letters, digits, or . + _ -
    @[A-Za-z0-9\._-]+     # literal @ followed by same
    \.png                 # if png, DON'T CAPTURE
    |([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)
                          # if not png, CAPTURE''', flags=re.X)
Since regexes are evaluated left-to-right, if a string starts to match then it will match the left side of the | first. If the string ends in .png, then it will consume that string but NOT capture it. If it DOESN'T end in .png, the right side of the | will begin to consume it and WILL capture it. For a more in-depth conversation of this trick, see here. To use these do:
matches = filter(None,pat.findall(html))
Any string matched by the left side (e.g. all the png files that are matched but NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
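Here is a small self-contained sketch of that trick for Python 3, where filter returns an iterator, so a list comprehension is used instead. It assumes the addresses use "@" as the separator, and the sample text is made up from the two strings used earlier in this thread.

```python
import re

pat = re.compile(r'''
    [A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.png             # image-like name: match, but do not capture
    |([A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.[a-zA-Z]*)    # anything else: capture it
''', flags=re.X)

# Hypothetical page text containing one image name and one real address
sample_html = "logo forbes@2x-302019213j32.png contact myname@email2018529391230.net"

raw = pat.findall(sample_html)
print(raw)     # ['', 'myname@email2018529391230.net']

# Drop the empty strings left behind by the non-capturing branch
emails = [m for m in raw if m]
print(emails)  # ['myname@email2018529391230.net']
```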
Alternatively, you can filter after you grab everything
pat = re.compile(r'''[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = filter(lambda x: not x.endswith('png'), pat.findall(html))
Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.

"ReplaceWith" & - but only part of it

I need to perform a search/replace on text which contains a comma which is NOT followed by a space, to change to a comma+space.
So I can find this using:
,[^\s]
But I am struggling with the replacement; I can't just use:
, (space, comma)
Or
& ,
As the match originally matches two characters.
Is there a way of saying '&' - 1 ? or '&[0]' or something which means; 'The Matched String, but only part of it' in the replacement argument ?
Another way of trying to ask this:
Can I use Regex to IDENTIFY one part of my string.
But REPLACE a (slightly different,but related) part of my string.
I could just probably replace every comma with a comma+space, but this is a little more controlled and less likely to make a change I do not need....
For example:
Original:
Hello,World.
Should become:
Hello, World.
But:
Hello, World.
Should remain as :
Hello, World.
And currently, using my (bad) pattern I have:
Original:
Hello,World
After (wrong):
Hello, orld
I'm actually using Python's (2.6) 're' module for this as it happens.
Using parentheses to capture part of the string is one way to do it. Another possibility is to use a lookahead assertion:
,(?=\S)
This pattern matches a comma only if it is followed by a non-whitespace character. The character after the comma is not part of the match; it is only used to decide whether or not to match the comma.
For example:
>>> re.sub(r",(?=\S)", ", ", "Hello,World! Hello, World!")
'Hello, World! Hello, World!'
Yes, use parentheses to "capture" part of the string that matches your expression. I'm not up to speed on Python's implementation, but it should give you some kind of array called match[] whose elements correspond to the captures.
Yes, you could. But why would you, in this simple case?
def insertspaceaftercomma(s):
    """Insert a space after every comma, then collapse the doubled space after a comma (if any)."""
    return s.replace(",", ", ").replace(",  ", ", ")
seems to work:
>>> insertspaceaftercomma("Hello, World")
'Hello, World'
>>> insertspaceaftercomma("Hello,World")
'Hello, World'
>>>
You can look for a comma + non-space character and then stick a space in between them:
re.sub(r',([^\s])', r', \1', string)
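To see how the capture-group substitution differs from a lookahead, here is a short side-by-side sketch; the function names are made up for illustration. Both give the same result on the example from the question, but they diverge on consecutive commas, because the capture group consumes the character after the comma while the lookahead only inspects it.

```python
import re

def with_capture(text):
    # Consumes the comma AND the next character, then writes that character back via \1
    return re.sub(r",([^\s])", r", \1", text)

def with_lookahead(text):
    # Consumes only the comma; the next character is inspected but not matched
    return re.sub(r",(?=\S)", ", ", text)

print(with_capture("Hello,World."))     # Hello, World.
print(with_lookahead("Hello,World."))   # Hello, World.
print(with_lookahead("Hello, World."))  # Hello, World.

# Consecutive commas: the second comma was already consumed by the capture group
print(with_capture("a,,b"))             # a, ,b
print(with_lookahead("a,,b"))           # a, , b
```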
Try this:
import re
s1 = 'Hello,World.'
re.sub(r',([^\s])', r', \g<1>', s1)
> Hello, World.
s2 = 'Hello, World.'
re.sub(r',([^\s])', r', \g<1>', s2)
> Hello, World.
