Easiest way to replace a substring - python

What would be the easiest way to replace a substring within a string when I don't know the exact substring I am looking for and only know the delimiting strings? For example, if I have the following:
mystr = 'wordone wordtwo "In quotes"."Another word"'
I basically want to delete the first quoted words (including the quotes) and the period (.) following so the resulting string is:
'wordone wordtwo "Another word"'
Basically I want to delete the first quoted words and the quotes and the following period.

You are looking for regular expressions here, using the re module:
import re
quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
result = quoted_plus_fullstop.sub('', mystr)
The pattern matches a literal quote, followed by 1 or more characters that are not quotes, followed by another quote and a full stop.
Demo:
>>> import re
>>> mystr = 'wordone wordtwo "In quotes"."Another word"'
>>> quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
>>> quoted_plus_fullstop.sub('', mystr)
'wordone wordtwo "Another word"'
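If the delimiters are not always quotes, the same idea can be generalised. The following is only a minimal sketch: remove_delimited is a hypothetical helper name (not part of the answer above), and it assumes you want to drop just the first delimited span. re.escape keeps delimiters such as [ or ( from being read as regex syntax.

```python
import re

def remove_delimited(text, start, end, trailing=""):
    """Remove the first span that begins with `start`, ends with `end`
    and is immediately followed by `trailing` (hypothetical helper)."""
    # .+? is non-greedy, so the span stops at the first usable `end`.
    pattern = re.escape(start) + r".+?" + re.escape(end) + re.escape(trailing)
    return re.sub(pattern, "", text, count=1)

mystr = 'wordone wordtwo "In quotes"."Another word"'
print(remove_delimited(mystr, '"', '"', trailing="."))
```

Note that . does not match newlines by default, so a span that crosses lines would need flags=re.DOTALL.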

Related

Substring regex from characters to end of word

I am looking for a regex that will capture a subset of a string beginning with a certain sequence of characters (http in my case) and running up to the next whitespace.
I am doing the problem in python, working over a list of strings and replacing the 'bad' substring with ''.
The difficulty stems from the characters not necessarily beginning the words within the substring. Example below; the parts I am looking to capture are the two runs that begin with http (shown in bold in the original post):
"Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous "
Thank you
Use findall:
>>> text = '''Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous '''
>>> import re
>>> re.findall(r'http\S+', text)
['httpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php', 'httpswwwgooglecomsilvous']
For substitution (if memory is not an issue):
>>> rep = re.compile(r'http\S+')
>>> rep.sub('', text)
You can try this:
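import re
strings = []  # your list of strings goes here
# Match from "http" (which also covers "https") up to "php" or the end of the string, and replace it with ''
new_strings = [re.sub(r"http.*?php|http.*?$", '', i) for i in strings]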
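To tie the http\S+ pattern back to the original task (cleaning a list of strings), here is a small sketch. The sample strings and the names url_like and cleaned are made up for illustration.

```python
import re

# Matches "http" and everything after it up to the next whitespace.
url_like = re.compile(r"http\S+")

# Hypothetical sample data; the first entry is a shortened version of the example.
strings = [
    "Pasforcémenthttpwwwsudouestfrphp merci httpswwwgooglecomsilvous ",
    "no link here",
]

# Compile once, then reuse the pattern for every string in the list.
cleaned = [url_like.sub("", s) for s in strings]
print(cleaned)
```

The whitespace around the removed parts is left in place, so you may want to call .strip() or collapse repeated spaces afterwards.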

How to find a non-alphanumeric character and move it to the end of a string in Python

I have the following string:
"string.isnotimportant"
I want to find the dot (it could be any non-alphanumeric character), and move it to the end of the string.
The result should look like:
"stringisnotimportant."
I am looking for a regular expression to do this job.
import re
inp = "string.isnotimportant"
re.sub(r'(\w*)(\W+)(\w*)', r'\1\3\2', inp)
>>> import re
>>> string = "string.isnotimportant"
>>> # I explain a bit about this at the end.
>>> # The parentheses in the regex mean that the item, if matched, will be stored as a group.
>>> regex = r'\w*(\W+)\w*'
>>> # To understand the re module properly, I think your best bet is to read some docs; I link them at the end of the post.
>>> x = re.search(regex, string)
>>> x.groups()  # Remember the stored group above? This accesses it. More groups would mean more items in the tuple.
('.',)
>>> # Here I reassign the variable string to a modified version where the matched character is replaced with '' (nothing).
>>> string = string.replace(x.groups()[0], '')
>>> string += x.groups()[0]  # Here I append the matched character to the end of string.
The += operator appends a character to the end of a string. Since strings don't have an .append method like lists do, this is a handy feature. x.groups()[0] refers to the first item (the only item in this case) of the tuple above.
>>> print(string)
stringisnotimportant.
about the regex:
"\w" Matches any alphanumeric character and the underscore: a through z, A through Z, 0 through 9, and '_'.
"\W" Matches any non-alphanumeric character. Examples for this include '&', '$', '#', etc.
https://developers.google.com/edu/python/regular-expressions?csw=1
http://python.about.com/od/regularexpressions/a/regexprimer.htm
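Both answers above handle a single run of non-alphanumeric characters. If the string may contain several, one possible approach is sketched below. move_nonword_to_end is a hypothetical helper name, and it assumes that every such run should be moved to the end in its original order.

```python
import re

def move_nonword_to_end(text):
    """Move every run of non-word characters to the end of `text`."""
    # Collect the runs first, in the order they appear.
    moved = "".join(re.findall(r"\W+", text))
    # Then strip them out and append them at the end.
    return re.sub(r"\W+", "", text) + moved

print(move_nonword_to_end("string.isnotimportant"))  # stringisnotimportant.
```

Keep in mind that \W does not match the underscore, so '_' stays where it is.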

Why does this Python RegEx pipe not pick out both unicode ranges?

A sample string containing both hiragana and katakana unicode characters:
myString = u"Eliminate ひらがな non-alphabetic カタカナ characters"
A pattern to match both ranges, according to:
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
myPattern = u"[\u3041-\u309f]*|[\u30a0-\u30ff]*"
Simple Python regex replace function:
import re
print(re.sub(myPattern, "", myString))
Returns:
Eliminate  non-alphabetic カタカナ characters
The only way I can get it to work is if I use the two ranges separately, one after the other. What is stopping this RegEx from simply picking both sides of the |-pipe?
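You'll need to combine the ranges into one character class. With the alternation, the first branch [\u3041-\u309f]* can match the empty string, so at a katakana character it succeeds with an empty match and the second branch is never tried: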
myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
Demo:
>>> myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
>>> print(re.sub(myPattern, "", u"Eliminate ひらがな non-alphabetic カタカナ characters"))
Eliminate  non-alphabetic  characters
>>> myPattern = u"[\u3041-\u309f]|[\u30a0-\u30ff]"
>>> print(re.sub(myPattern, "", myString))
Eliminate  non-alphabetic  characters
>>>
EDIT: you can also combine the two character classes with the | (OR) operator, as long as you drop the * quantifiers, as in the second pattern of the demo above.

Python split string by start and end characters

Say you have a string like this: "(hello) (yes) (yo diddly)".
You want a list like this: ["hello", "yes", "yo diddly"]
How would you do this with Python?
import re
pattern = re.compile(r'\(([^)]*)\)')
The pattern matches the parentheses in your string (\(...\)) and these need to be escaped.
Then it defines a subgroup ((...)) - these parentheses are part of the regex-syntax.
The subgroup matches all characters except a right parenthesis ([^)]*)
s = "(hello) (yes) (yo diddly)"
pattern.findall(s)
gives
['hello', 'yes', 'yo diddly']
UPDATE:
It is probably better to use [^)]+ instead of [^)]*. The latter would also match an empty string.
Using the non-greedy modifier, as DSM suggested, probably makes the pattern easier to read: pattern = re.compile(r'\((.+?)\)')
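The difference between [^)]* and [^)]+ only shows up when the input contains empty parentheses. A quick sketch with a made-up input string:

```python
import re

s = "(hello) () (yes)"  # hypothetical input with an empty pair

# * allows zero characters between the parentheses, so "()" yields ''.
star = re.findall(r"\(([^)]*)\)", s)
# + requires at least one character, so "()" is skipped.
plus = re.findall(r"\(([^)]+)\)", s)

print(star)  # ['hello', '', 'yes']
print(plus)  # ['hello', 'yes']
```

Which one is right depends on whether an empty pair should show up as an empty string in your result.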
I would do it like this:
"(hello) (yes) (yo diddly)"[1:-1].split(") (")
First, we cut off the first and last characters (since they should be removed anyway). Next, we split the resulting string using ") (" as the delimiter, giving the desired list.
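This will give you the words from any string:
>>> s="(hello) (yes) (yo diddly)"
>>> import re
>>> words = re.findall(r'\((.*?)\)',s)
>>> words
['hello', 'yes', 'yo diddly']
As DSM said, the ? in the regex makes the match non-greedy. Note that the closing \) must sit outside the capturing group, otherwise each result would end with a ')'.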

Period stops multiline regex substitute in Python?

I have a multi-line string in which I'd like to replace a block, but I don't understand why it's not working. For some reason, a period in the string stops the regular expression from matching.
My string:
s = """
[some_previous_text]
<start>
one_period .
<end>
[some_text_after]
"""
What I'd like to end up with:
s = """
[some_previous_text]
foo
[some_text_after]
"""
What I initially tried, but it doesn't match anything:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print(re.sub("<start>[^.]*<end>", "foo", s))
<start>
one_period .
<end>
However, when I took the period out, it worked fine:
>>> import re
>>> s = "<start>\nno_period\n<end>"
>>> print(re.sub("<start>[^.]*<end>", "foo", s))
foo
Also, when I put an <end> tag before the period, it matched the first <end> tag:
>>> import re
>>> s = "<start>\n<end>\none_period .\n<end>"
>>> print(re.sub("<start>[^.]*<end>", "foo", s))
foo
one_period .
<end>
So what's going on here? Why does the period stop the [^.]* matching?
EDIT:
SOLVED
I mistakenly thought that the caret ^ was for new-line matching. What I needed was the re.DOTALL flag (as indicated by Amber). Here's the expression I'm now using:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print(re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL))
foo
Why wouldn't it? [^.] is "the set of all characters that is not a ." and thus doesn't match periods.
Perhaps you instead meant to just put .* (any number of any characters) instead of [^.]*?
For matching across newlines, specify re.DOTALL:
re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)
That's because [^.]* is a negated character class that matches any character except a period.
You probably want something like <start>.*?<end> together with the re.S modifier, which makes the dot match newline characters as well.
re.sub("<start>.*?<end>", "foo", s, flags=re.S)
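The answers above differ in one detail: .* versus .*?. With a single block both give the same result, but with more than one block they do not. A sketch, using a made-up string with two <start>...<end> blocks:

```python
import re

# Hypothetical input with two blocks and a line in between that should survive.
s = "<start>\none_period .\n<end>\nkeep me\n<start>\nsecond\n<end>"

# Greedy: .* runs to the LAST <end>, swallowing everything in between.
greedy = re.sub(r"<start>.*<end>", "foo", s, flags=re.DOTALL)
# Non-greedy: .*? stops at the FIRST <end> after each <start>.
lazy = re.sub(r"<start>.*?<end>", "foo", s, flags=re.DOTALL)

print(repr(greedy))  # 'foo'
print(repr(lazy))    # 'foo\nkeep me\nfoo'
```

re.S is just the short alias for re.DOTALL, so either spelling works.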
