Substring regex from characters to end of word - python

I am looking for a regex that will capture a substring beginning with a certain sequence of characters (http in my case) and running up to the next whitespace.
I am doing the problem in python, working over a list of strings and replacing the 'bad' substring with ''.
The difficulty stems from the fact that the characters do not necessarily begin a word within the string. Example below; the parts I am looking to capture are the runs that start with http:
"Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous "
Thank you

Use findall:
>>> text = '''Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous '''
>>> import re
>>> re.findall(r'http\S+', text)
['httpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php', 'httpswwwgooglecomsilvous']
For substitution (if memory is not an issue):
>>> rep = re.compile(r'http\S+')
>>> rep.sub('', text)
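Since the question works over a list of strings, a minimal sketch of applying the same compiled pattern to every element might look like the following. The names `URL_RUN`, `strings` and `cleaned`, and the shortened sample data, are illustrative assumptions rather than anything from the question.

```python
import re

# Matches "http" plus every following non-whitespace character.
URL_RUN = re.compile(r'http\S+')

# Hypothetical input list standing in for the asker's data.
strings = [
    'Pasforcémenthttpwwwsudouestfr20101129php merci httpswwwgooglecomsilvous ',
    'nothing to remove here',
]

# Replace each matched run with the empty string.
cleaned = [URL_RUN.sub('', s) for s in strings]
print(cleaned)
```

Note that the whitespace around a removed run is left in place, so a follow-up `' '.join(s.split())` may be wanted if doubled spaces matter.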

You can try this:
import re
strings = []  # your list of strings goes here
# match from "http" lazily up to "php", or else up to the end of the string
new_strings = [re.sub(r"http.*?php|http.*?$", '', i) for i in strings]

Related

Remove a character in string if it doesn't belong to a group of matching pattern in Python

I have a string that contains many words. I want to remove the closing parenthesis from a word if the word doesn't start with _.
Example input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that without splitting the words using re.sub?
re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub(r'(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'
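The same idea may be easier to follow with a named function in place of the lambda. This is only a sketch; the function name `keep_or_strip` is made up for illustration.

```python
import re

def keep_or_strip(match):
    # group(0) is the whole match, e.g. "certain)"; group(1) is the word alone.
    whole, word = match.group(0), match.group(1)
    # Keep the parenthesis only for words that start with an underscore.
    return whole if word.startswith('_') else word

s = 'this is an example to _remove) brackets under certain) conditions.'
result = re.sub(r'(\w+)\)', keep_or_strip, s)
print(result)  # this is an example to _remove) brackets under certain conditions.
```

re.sub calls the function once per match and inserts whatever string it returns.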
I wouldn't use regex here when a list comprehension can do it.
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
                   for word in words.split(" ")])
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
                   for word in words.split(" ")])
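To make the difference between the two variants concrete, here is a small side-by-side sketch. The function names `strip_all` and `strip_one` and the sample string are hypothetical.

```python
def strip_all(words):
    # rstrip(")") removes every trailing ")" from words not starting with "_".
    return ' '.join(w.rstrip(')') if not w.startswith('_') else w
                    for w in words.split(' '))

def strip_one(words):
    # Slicing with [:-1] drops only the final ")" from such words.
    return ' '.join(w[:-1] if w.endswith(')') and not w.startswith('_') else w
                    for w in words.split(' '))

sample = 'someword)) _keep)) plain)'
print(strip_all(sample))  # someword _keep)) plain
print(strip_one(sample))  # someword) _keep)) plain
```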

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that what is called "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or string limits, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
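The pattern is dense, so here is the same expression spelled out with re.VERBOSE and a comment per piece. This is a sketch for reading purposes; the constant name `REPEATED_WORDS` and the second sample sentence are made up.

```python
import re

REPEATED_WORDS = re.compile(r'''
    (?<!\S)          # not preceded by a non-whitespace character
    (                # group 1: the two copies to keep
        (\S+)        # group 2: the word itself
        (?:\s+\2)    # whitespace, then the same word again
    )
    (?:\s+\2)+       # one or more further copies, which get dropped
    (?!\S)           # not followed by a non-whitespace character
''', re.VERBOSE)

print(REPEATED_WORDS.sub(r'\1', 'bye! bye! bye!'))          # bye! bye!
print(REPEATED_WORDS.sub(r'\1', 'so so so good, bye bye'))  # so so good, bye bye
```

Because the replacement is group 1, the original whitespace between the two kept copies is preserved.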
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
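Sample code (this uses the third-party regex module; the standard re module rejects the variable-width look-behind (?<= |^)):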
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
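For comparison, the same loop idea can be sketched with itertools.groupby, which groups consecutive equal items. This is a swapped-in alternative rather than the code above, and the name `cap_repeats` is hypothetical. Like the loop, it collapses runs of whitespace.

```python
from itertools import groupby, islice

def cap_repeats(text, limit=2):
    # groupby yields one run per stretch of identical consecutive words;
    # islice keeps at most `limit` words from each run.
    kept = []
    for _, run in groupby(text.split()):
        kept.extend(islice(run, limit))
    return ' '.join(kept)

print(cap_repeats('bye! bye! bye!'))                               # bye! bye!
print(cap_repeats('hi hi hi hi some words words words', limit=1))  # hi some words
```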
Try the following:
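import re
s = 'bye! bye! bye!'  # your string
# the look-arounds keep the match on whole words and leave the following space in place
s = re.sub(r'(?<!\S)(\S+)(?: \1){2,}(?!\S)', r'\1 \1', s)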
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)

How to find a non-alphanumeric character and move it to the end of a string in Python

I have the following string:
"string.isnotimportant"
I want to find the dot (it could be any non-alphanumeric character), and move it to the end of the string.
The result should look like:
"stringisnotimportant."
I am looking for a regular expression to do this job.
import re
inp = "string.isnotimportant"
re.sub(r'(\w*)(\W+)(\w*)', r'\1\3\2', inp)
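The substitution above moves each run of non-word characters behind the word that follows it, which is enough for the single dot in the question. If a string may contain several such characters and all of them should end up at the very end, one possible sketch (the function name `move_symbols_to_end` is made up) is to collect them first. Note that \W treats the underscore as a word character, so underscores stay where they are.

```python
import re

def move_symbols_to_end(text):
    # Collect every non-word character in order of appearance.
    symbols = ''.join(re.findall(r'\W', text))
    # Remove them from the text, then append them all at the end.
    return re.sub(r'\W', '', text) + symbols

print(move_symbols_to_end('string.isnotimportant'))  # stringisnotimportant.
print(move_symbols_to_end('a.b-c'))                  # abc.-
```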
>>> import re
>>> string = "string.isnotimportant"
>>> regex = r'\w*(\W+)\w*'  # the parentheses mean that whatever they match is stored as a group
>>> x = re.search(regex, string)
>>> x.groups()  # accesses the stored group; with more groups there would be more items in the tuple
('.',)
>>> string = string.replace(x.groups()[0], '')  # remove the matched character from the string
>>> string += x.groups()[0]  # append the matched character to the end of the string
>>> print(string)
stringisnotimportant.
The += operator appends to the end of a string. Since strings don't have an .append method like lists do, this is a handy feature. x.groups()[0] refers to the first item (the only item in this case) of the tuple above.
To understand the re module properly, your best bet is to read some docs; links are at the end of the post.
About the regex:
"\w" matches any alphanumeric character or the underscore: a through z, A through Z, 0 through 9, and '_'.
"\W" matches any character that \w does not, i.e. anything that is neither alphanumeric nor an underscore. Examples include '&', '$', '#', etc.
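A quick way to see the two classes side by side is to run both over one made-up sample string. Whitespace counts as a \W character as well.

```python
import re

sample = 'ab_1 &$#'

word_chars = re.findall(r'\w', sample)
non_word_chars = re.findall(r'\W', sample)

print(word_chars)      # ['a', 'b', '_', '1']
print(non_word_chars)  # [' ', '&', '$', '#']
```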
https://developers.google.com/edu/python/regular-expressions?csw=1
http://python.about.com/od/regularexpressions/a/regexprimer.htm

Easiest way to replace a substring

What would be the easiest way to replace a substring within a string when I don't know the exact substring I am looking for and only know the delimiting strings? For example, if I have the following:
mystr = 'wordone wordtwo "In quotes"."Another word"'
I basically want to delete the first quoted words (including the quotes) and the period (.) following so the resulting string is:
'wordone wordtwo "Another word"'
Basically I want to delete the first quoted words and the quotes and the following period.
You are looking for regular expressions here, using the re module:
import re
quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
result = quoted_plus_fullstop.sub('', mystr)
The pattern matches a literal quote, followed by 1 or more characters that are not quotes, followed by another quote and a full stop.
Demo:
>>> import re
>>> mystr = 'wordone wordtwo "In quotes"."Another word"'
>>> quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
>>> quoted_plus_fullstop.sub('', mystr)
'wordone wordtwo "Another word"'
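In the example only one quoted phrase is followed by a full stop, so a plain sub removes exactly that one. If the input could contain several such phrases and only the first should go, the count argument of sub limits the number of replacements. A small sketch with a made-up input:

```python
import re

quoted_plus_fullstop = re.compile(r'"[^"]+"\.')

# Hypothetical input with two quoted phrases that are each followed by a full stop.
mystr = '"a"."b"."c"'

first_only = quoted_plus_fullstop.sub('', mystr, count=1)
every_match = quoted_plus_fullstop.sub('', mystr)

print(first_only)   # "b"."c"
print(every_match)  # "c"
```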

python regex find all words in text

This sounds very simple, I know, but for some reason I can't get all the results I need
A word in this case is any run of non-whitespace characters that is separated by whitespace.
For example, in the following string: "Hello there stackoverflow."
the result should be: ['Hello','there','stackoverflow.']
My code:
import re
word_pattern = "^\S*\s|\s\S*\s|\s\S*$"
result = re.findall(word_pattern,text)
print result
But after using this pattern on a string like the one I've shown, it only puts the first and the last words in the list and not the words separated by two spaces.
What is the problem with this pattern?
Use the \b boundary test instead:
r'\b\S+\b'
Result:
>>> import re
>>> re.findall(r'\b\S+\b', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow']
Or don't use a regular expression at all and just use .split(); the latter would include the punctuation in a sentence (the regex above did not match the . in the sentence).
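As for why the original pattern skips words: findall returns non-overlapping matches, and every alternative in that pattern consumes the whitespace next to the word. The space after one word is therefore already used up when the next word needs it as its leading \s. The sketch below reproduces this with the sample string from the question and shows that a plain \S+ is enough when punctuation should be kept.

```python
import re

text = 'Hello there stackoverflow.'

# The pattern from the question: each branch swallows neighbouring whitespace.
word_pattern = r'^\S*\s|\s\S*\s|\s\S*$'
print(re.findall(word_pattern, text))  # ['Hello ', ' stackoverflow.']

# Matching only the non-whitespace runs leaves the separators untouched.
print(re.findall(r'\S+', text))        # ['Hello', 'there', 'stackoverflow.']
```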
To find all words in a string, it is best to use split:
>>> "Hello there stackoverflow.".split()
['Hello', 'there', 'stackoverflow.']
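But if you must use regular expressions, then you should change your regex to something simpler and faster: r'\b\S+\b'.
The r prefix makes the string a 'raw' string, meaning Python leaves the backslashes alone instead of treating them as escape sequences.
\b means a word boundary: a zero-width position between a word character (letter, digit, or underscore) and a non-word character such as a space, newline, or punctuation, or the start or end of the string.
\S, as you should know, is any non-whitespace character.
+ means one or more of the previous.
So together it means: find all runs of visible characters that start and end at a word boundary (words/numbers).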
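To see what the boundary test does to punctuation, here is a short comparison against split on a made-up sentence. Leading and trailing punctuation is dropped by the regex and kept by split.

```python
import re

sentence = 'Hello, world! (test)'

with_boundaries = re.findall(r'\b\S+\b', sentence)
with_split = sentence.split()

print(with_boundaries)  # ['Hello', 'world', 'test']
print(with_split)       # ['Hello,', 'world!', '(test)']
```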
How about simply using -
>>> s = "Hello there stackoverflow."
>>> s.split()
['Hello', 'there', 'stackoverflow.']
The other answers are good. Depending on what you want (eg. include/exclude punctuation or other non-word characters) an alternative could be to use a regex to split by one or more whitespace characters:
>>> re.split(r'\s+', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow.']
