Regex match inner '.' - python

I would like to remove an inner sentence based on one word it contains. In the example below, instead of just 'start', I would like the regex to return 'start.stop.'.
>>> import re
>>> s = 'start.stop.do nice.'
>>> re.sub(r'\..*nice.*', '', s)
'start'

You need a negated character class instead of .* so that the pattern refuses to match the dots of other sentences. To preserve the last dot, you can use a positive lookahead for it, which makes the regex engine check that the dot exists without consuming it.
>>> re.sub(r'\.[^.]*nice[^.]*(?=\.)', '', s)
'start.stop.'
Another good example by @bfontaine:
>>> s = "foo.bar.nice.qux"
>>> re.sub(r'\.[^.]*nice[^.]*(?=\.)', '', s)
'foo.bar.qux'
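A minimal sketch of the same idea wrapped in a helper, assuming the word may contain regex metacharacters and therefore needs escaping. The function name remove_sentences_containing is made up for illustration and is not part of the answer above.

```python
import re

def remove_sentences_containing(word, text):
    """Drop each dot-delimited segment containing `word`, keeping the dot after it."""
    # \.      the dot that opens the segment
    # [^.]*   a run of non-dot characters, so the match stays inside one segment
    # (?=\.)  the closing dot must exist, but it is not consumed
    pattern = r'\.[^.]*' + re.escape(word) + r'[^.]*(?=\.)'
    return re.sub(pattern, '', text)

first = remove_sentences_containing('nice', 'start.stop.do nice.')
second = remove_sentences_containing('nice', 'foo.bar.nice.qux')
print(first)   # start.stop.
print(second)  # foo.bar.qux
```

Because the pattern requires a leading dot, a matching segment at the very start of the string is left alone; that case would need a separate alternative in the pattern.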

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note that you used an empty string as the replacement argument. In the snippet above, the parts before and after .*? are captured; in the replacement pattern, \g<1> refers to the first group's value and \2 to the second's. The unambiguous form \g<1> is needed because a digit follows the backreference immediately: \1 followed directly by 222.6 would no longer be parsed as a reference to group 1.
Since the replacement string contains no backslashes, there is no need to preprocess (escape) anything in it.
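If the replacement text might contain backslashes or digits, one way to sidestep escaping altogether is to pass a function as the replacement, since its return value is inserted literally. The following is a sketch under that assumption; the helper name replace_between is hypothetical.

```python
import re

def replace_between(text, left, right, replacement):
    """Replace whatever sits between `left` and `right` with `replacement`, taken literally."""
    pattern = '(' + re.escape(left) + ').*?(' + re.escape(right) + ')'
    # The return value of a function replacement is used as-is, so backslashes
    # or digits in `replacement` are never read as group references.
    return re.sub(pattern,
                  lambda m: m.group(1) + replacement + m.group(2),
                  text, flags=re.DOTALL)

url = 'https://homepage.com/home/page1/222.6 a_type-A/go'
plain = replace_between(url, 'page1/', '_type-A', '222.6')
tricky = replace_between(url, 'page1/', '_type-A', r'a\1b')
print(plain)
print(tricky)
```

The second call shows that a replacement such as a\1b ends up in the output unchanged, which a plain string replacement would not allow without escaping.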
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the string is preceded and followed by the text in the assertions, but the assertions themselves are not consumed. This means that re.sub looks for text with page1/ before it and _type after it, but only replaces the part in between.
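To see that the assertions are not part of the match, it can help to inspect the match object directly. This is a small illustrative check that reuses the l string and pattern from the answer; the variable matched_text is introduced here only for the demonstration.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
m = re.search(r'(?<=page1/).*?(?=_type)', l)

# Only the text between the two assertions belongs to the match.
matched_text = m.group(0)
print(repr(matched_text))                 # '222.6 a'

# The asserted text sits just outside the match boundaries.
print(l[:m.start()].endswith('page1/'))   # True
print(l[m.end():].startswith('_type'))    # True
```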

Match any word in string except those preceded by a curly brace in python

I have a string like
line = u'I need to match the whole line except for {thisword for example'
I'm having difficulty doing this. Here is what I've tried, which doesn't work:
# in general case there will be Unicode characters in the pattern
matchobj = re.search(ur'[^\{].+', line)
matchobj = re.search(ur'(?!\{).+', line)
Could you please help me figure out what's wrong and how to do it right?
P.S. I don't think I need to substitute "{thisword" with empty string
I am not exactly clear about what you need. From your question title, it looks like you want to find all words in a string (e.g. line) that don't start with {, but you are using the re.search() function, which confuses me.
re.search() and re.findall()
The function re.search() returns a corresponding MatchObject instance. re.search is usually used to find and return the first occurrence of a pattern in a long string; it doesn't return all possible matches. See a simple example below:
>>> re.search('a', 'aaa').group(0) # only first match
'a'
>>> re.search('a', 'aaa').group(1) # group(1) asks for a capture group, not a second match; there is none
Traceback (most recent call last):
File "<console>", line 1, in <module>
IndexError: no such group
With the regex 'a', search returns only one match, 'a', in the string 'aaa'; it doesn't return all possible matches.
If your objective is to find all words in a string that don't start with {, you should use the re.findall() function, which matches all occurrences of a pattern, not just the first one as re.search() does. See the example:
>>> re.findall('a', 'aaa')
['a', 'a', 'a']
Edit: Based on a comment, here is one more example to demonstrate the use of re.search and re.findall:
>>> re.search('a+', 'not itnot baaal laaaaaaall ').group()
'aaa' # returns only the first run of a's; the later run is not returned
>>> re.findall('a+', 'not itnot baaal laaaaaaall ')
['aaa', 'aaaaaaa'] # matches both runs
Here is a good tutorial for Python re module: re – Regular Expressions
Additionally, there is the concept of a group in Python regex: "a matching pattern within parentheses". If one or more groups are present in your regex pattern, then re.findall() returns a list of groups; this will be a list of tuples if the pattern has more than one group. See below:
>>> re.findall('(a(b))', 'abab') # 2 groups according to 2 pair of ( )
[('ab', 'b'), ('ab', 'b')] # list of tuples of groups captured
In Python, the regex (a(b)) contains two groups, one per pair of parentheses (this is unlike regular expressions in formal languages; regexes are not exactly the same as regular expressions in formal languages, but that is a different matter).
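The effect of groups on the return value of re.findall can be checked side by side. This sketch uses the same 'abab' input as above; the variable names are chosen only for illustration.

```python
import re

text = 'abab'

no_groups = re.findall(r'ab', text)            # no group: whole matches
one_group = re.findall(r'a(b)', text)          # one group: that group's text
two_groups = re.findall(r'(a(b))', text)       # two groups: tuples of groups
non_capturing = re.findall(r'(?:a(b))', text)  # (?:...) does not count as a group

print(no_groups)      # ['ab', 'ab']
print(one_group)      # ['b', 'b']
print(two_groups)     # [('ab', 'b'), ('ab', 'b')]
print(non_capturing)  # ['b', 'b']
```

Using (?:...) is a common way to keep grouping for alternation or repetition without changing what findall returns.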
Answer: The words in the sentence line are separated by spaces (or are at the start of the string), so the regex should be:
ur"(^|\s)(\w+)"
Regex description:
(^|\s) means: the word is either at the start of the string or follows a whitespace character.
(\w+): Matches one or more alphanumeric characters, including "_".
On applying regex r to your line:
>>> import pprint # for pretty-printing; you can ignore these two lines
>>> pp = pprint.PrettyPrinter(indent=4)
>>> r = ur"(^|\s)(\w+)"
>>> L = re.findall(r, line)
>>> pp.pprint(L)
[ (u'', u'I'),
(u' ', u'need'),
(u' ', u'to'),
(u' ', u'match'),
(u' ', u'the'),
(u' ', u'whole'),
(u' ', u'line'),
(u' ', u'except'),
(u' ', u'for'), # note that 'for' appears twice in a row,
(u' ', u'for'), # because '{thisword' is not included
(u' ', u'example')]
>>>
To get just the words in a single list, use:
>>> [t[1] for t in re.findall(r, line)]
Note: this skips { and any other special character in line, because \w only matches alphanumerics and '_'.
If you specifically want to skip only words that start with { (a { in the middle of a word is allowed), then use the regex: r = ur"(^|\s+)(?P<word>[^{]\S*)".
To understand the difference between this regex and the previous one, check this example:
>>> r = ur"(^|\s+)(?P<word>[^{]\S*)"
>>> [t[1] for t in re.findall(r, "I am {not yes{ what")]
['I', 'am', 'yes{', 'what']
Without Regex:
You could achieve the same thing simply, without any regex, as follows:
>>> [w for w in line.split() if w[0] != '{']
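For comparison, here is a Python 3 sketch that runs the regex route and the split route on the question's line. It is not the answer's exact pattern: the character class [^{\s] is an adjustment made here so that a word's first character cannot be whitespace, and a non-capturing group replaces (^|\s+) so only the named group is reported.

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# Regex route: a word starts at the beginning of the string or after whitespace,
# and its first character is neither '{' nor whitespace.
words_regex = [m.group('word')
               for m in re.finditer(r'(?:^|\s)(?P<word>[^{\s]\S*)', line)]

# Split route: str.split() never yields empty strings, so startswith is safe.
words_split = [w for w in line.split() if not w.startswith('{')]

print(words_regex)
print(words_split)
```

Both lists contain the same eleven words, with '{thisword' left out.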
re.sub() to replace pattern
If you just want to remove one (or more) words that start with {, you should use re.sub() to replace the patterns starting with { with the empty string "". Check the following code:
>>> r = ur"{\w+"
>>> re.findall(r, line)
[u'{thisword']
>>> re.sub(r, "", line)
u'I need to match the whole line except for  for example'
Edit: adding a reply to a comment:
(?P<name>...) is a Python regex extension (it has a special meaning in Python). It is similar to regular parentheses in that it creates a group, but a named one: the group is accessible via its symbolic group name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. example-1:
>>> r = "(?P<capture_all_A>A+)"
>>> mo = re.search(r, "aaaAAAAAAbbbaaaaa")
>>> mo.group('capture_all_A')
'AAAAAA'
example-2: suppose you want to extract the name from a line that may also contain a title, e.g. mr. Use the regex: name_re = "(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)"
We can read the name from the input string using group('name'):
>>> re.search(name_re, "mr grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "ms. xyz").group('name')
'xyz'
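Named groups can also be read all at once with the match object's groupdict() method. The sketch below reuses the name_re pattern from example-2; the variable names parts and no_title are made up for the demonstration.

```python
import re

name_re = re.compile(r'(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)')

# groupdict() returns only the named groups, keyed by group name.
parts = name_re.search('ms. xyz').groupdict()
print(parts)     # {'title': 'ms.', 'name': 'xyz'}

# Groups that did not take part in the match get the given default
# (None when no default is passed).
no_title = name_re.search('grijesh chauhan').groupdict(default='')
print(no_title)  # {'title': '', 'name': 'grijesh chauhan'}
```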
You can simply do:
(?<!{)(\b\w+\b) with the g flag enabled (all matches; in Python, re.findall plays that role)
Demo: http://regex101.com/r/zA0sL6
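Python's re module has no g flag, so a rough Python 3 equivalent of the pattern above would go through re.findall. In this sketch the brace is escaped as \{ for readability and the capturing parentheses are dropped, which are choices made here rather than part of the original suggestion.

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# (?<!\{) rejects a word that directly follows '{'; \b keeps the match
# from restarting in the middle of that word.
words = re.findall(r'(?<!\{)\b\w+\b', line)
print(words)
```

The result lists every word except thisword, with 'for' appearing twice.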
Try this pattern:
(.*)(?:\{\w+)\s(.*)
Code:
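import re
p = re.compile(r'(.*)(?:\{\w+)\s(.*)')
line = "I need to match the whole line except for {thisword for example"  # renamed from str, which shadows the built-in
m = p.match(line)
print(m.group(1) + m.group(2))  # the text before and after '{thisword '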
Example:
http://regex101.com/r/wR8eP6

Python: strip a wildcard word

I have strings with words separated by dots.
Example:
string1 = 'one.two.three.four.five.six.eight'
string2 = 'one.two.hello.four.five.six.seven'
How do I use such a string in a Python method, treating one word as a wildcard (because in this case, for example, the third word varies)? I am thinking of regular expressions, but I do not know whether the approach I have in mind is possible in Python.
For example:
string1.lstrip("one.two.[wildcard].four.")
or
string2.lstrip("one.two.'/.*/'.four.")
(I know that I can extract this by split('.')[-3:], but I am looking for a general way, lstrip is just an example)
Use re.sub(pattern, '', original_string) to remove matching part from original_string:
>>> import re
>>> string1 = 'one.two.three.four.five.six.eight'
>>> string2 = 'one.two.hello.four.five.six.seven'
>>> re.sub(r'^one\.two\.\w+\.four', '', string1)
'.five.six.eight'
>>> re.sub(r'^one\.two\.\w+\.four', '', string2)
'.five.six.seven'
BTW, you are misunderstanding str.lstrip: its argument is treated as a set of characters to strip from the left, not as a prefix:
>>> 'abcddcbaabcd'.lstrip('abcd')
''
str.replace is more appropriate (of course, re.sub, too):
>>> 'abcddcbaabcd'.replace('abcd', '')
'dcba'
>>> 'abcddcbaabcd'.replace('abcd', '', 1)
'dcbaabcd'
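One way to generalize the re.sub call above is to build the pattern from a dotted template in which a marker stands for "any one word". This is only a sketch: the helper strip_prefix and the '*' wildcard marker are assumptions introduced for illustration, not an existing API.

```python
import re

def strip_prefix(text, template, wildcard='*'):
    """Remove a leading dotted prefix; each `wildcard` segment matches any single word."""
    # Escape the literal segments and turn wildcard segments into \w+.
    parts = [r'\w+' if seg == wildcard else re.escape(seg)
             for seg in template.split('.')]
    pattern = '^' + r'\.'.join(parts)
    return re.sub(pattern, '', text)

string1 = 'one.two.three.four.five.six.eight'
string2 = 'one.two.hello.four.five.six.seven'
rest1 = strip_prefix(string1, 'one.two.*.four.')
rest2 = strip_prefix(string2, 'one.two.*.four.')
print(rest1)  # five.six.eight
print(rest2)  # five.six.seven
```

If the prefix does not match, the text is returned unchanged, which differs from str.lstrip's character-set behavior shown above.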

How do I extract certain parts of strings in Python?

Say I have three strings:
abc534loif
tvd645kgjf
tv96fjbd_gfgf
and three lists:
beginning captures just the first part of the string "the name"
middle captures just the number
end contains only the rest of the characters that are after the number portion
How do I accomplish this in the most efficient way?
Use regular expressions?
>>> import re
>>> strings = 'abc534loif tvd645kgjf tv96fjbd_gfgf'.split()
>>> for s in strings:
...     for match in re.finditer(r'\b([a-z]+)(\d+)(.+?)\b', s):
...         print(match.groups())
...
('abc', '534', 'loif')
('tvd', '645', 'kgjf')
('tv', '96', 'fjbd_gfgf')
This is a language-agnostic approach that aims at higher efficiency:
find first digit in the string and save its position p0
find last digit in the string and save its position p1
extract substring from 0 to p0-1 into beginning
extract substring from p0 to p1 into middle
extract substring from p1+1 to length-1 into end
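The steps above could be written in Python roughly as follows. This is a sketch, with the function name split_on_digits invented for illustration; it assumes the digits form one contiguous run, since otherwise the middle part would also contain the characters between the first and last digit.

```python
def split_on_digits(s):
    """Split s into (beginning, middle, end) around its first and last digit."""
    digit_positions = [i for i, ch in enumerate(s) if ch.isdigit()]
    if not digit_positions:
        # No digits at all: everything counts as the beginning.
        return s, '', ''
    p0, p1 = digit_positions[0], digit_positions[-1]
    # Python slices exclude the end index, hence p1 + 1.
    return s[:p0], s[p0:p1 + 1], s[p1 + 1:]

strings = ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']
beginning, middle, end = zip(*(split_on_digits(s) for s in strings))
print(beginning)  # ('abc', 'tvd', 'tv')
print(middle)     # ('534', '645', '96')
print(end)        # ('loif', 'kgjf', 'fjbd_gfgf')
```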
I guess you're looking for re.findall:
strs = """
abc534loif
tvd645kgjf
tv96fjbd_gfgf
"""
import re
print(re.findall(r'\b(\w+?)(\d+)(\w+)', strs))
# prints [('abc', '534', 'loif'), ('tvd', '645', 'kgjf'), ('tv', '96', 'fjbd_gfgf')]
>>> import itertools as it
>>> s="abc534loif"
>>> [''.join(j) for i,j in it.groupby(s, key=str.isdigit)]
['abc', '534', 'loif']
I'd do something like this:
>>> import re
>>> l = ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']
>>> regex = re.compile(r'([a-z_]+)(\d+)([a-z_]+)')
>>> beginning, middle, end = zip(*[regex.match(s).groups() for s in l])
>>> beginning
('abc', 'tvd', 'tv')
>>> middle
('534', '645', '96')
>>> end
('loif', 'kgjf', 'fjbd_gfgf')
I would use a regular expression like:
(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)
and pull out the three matching sections.
import re
m = re.match(r"(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)", "abc534loif")
m.group('beginning')
m.group('middle')
m.group('end')
import re  # you want to match a string against a pattern, so import the regular expressions module 're'
mystring = "abc1234def"  # just a string to test with
# Our regular expression. Everything between parentheses is 'captured', meaning that it is accessible as one of the 'groups' in the returned match object. The ^ sign matches at the beginning of the string, while $ matches at the end. The characters between square brackets are a character range, so [0-9] matches any digit character; \D is any non-digit character.
match = re.match(r"^(\D+)([0-9]+)(\D+)$", mystring)
if match:  # match will be None if the string didn't match the pattern, so we need to check for that, as None has no group method
    beginning = match.group(1)
    middle = match.group(2)
    end = match.group(3)

Python remove anything that is not a letter or number

I'm having a little trouble with Python regular expressions.
What is a good way to remove all characters in a string that are not letters or numbers?
Thanks!
[\w] matches (alphanumeric or underscore).
[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore)
You need [\W_] to remove ALL non-alphanumerics.
When using re.sub(), it will be much more efficient if you reduce the number of substitutions (expensive) by matching using [\W_]+ instead of doing it one at a time.
Now all you need is to define alphanumerics:
str object, only ASCII A-Za-z0-9:
re.sub(r'[\W_]+', '', s)
str object, only locale-defined alphanumerics:
re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
unicode object, all alphanumerics:
re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
Examples for str object:
>>> import re, locale
>>> sall = ''.join(chr(i) for i in xrange(256))
>>> len(sall)
256
>>> re.sub('[\W_]+', '', sall)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
# above output wrapped at column 80
Unicode example:
>>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
u'abAZ\xff\u0404'
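The examples above are written for Python 2. As a hedged sketch of the Python 3 counterpart: str patterns are Unicode-aware by default there, and the re.ASCII flag restricts \w to [a-zA-Z0-9_]. The variable names are chosen here for illustration.

```python
import re

s = 'a_b A_Z \x80\xff \u0404'

# Default in Python 3: \w covers Unicode letters and digits.
unicode_alnum = re.sub(r'[\W_]+', '', s)

# re.ASCII: only A-Z, a-z and 0-9 survive.
ascii_alnum = re.sub(r'[\W_]+', '', s, flags=re.ASCII)

print(repr(unicode_alnum))
print(repr(ascii_alnum))  # 'abAZ'
```

Here unicode_alnum keeps the two non-ASCII letters (U+00FF and U+0404) while dropping the control character U+0080, matching the Python 2 Unicode example.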
In the char set matching rule [...] you can specify ^ as first char to mean "not in"
import re
re.sub("[^0-9a-zA-Z]", # Anything except 0..9, a..z and A..Z
"", # replaced with nothing
"this is a test!!") # in this string
--> 'thisisatest'
'\W' matches the same characters as [^A-Za-z0-9_], except that, depending on the flags and locale in effect, accented letters also count as word characters and are then not matched.
>>> re.sub(r'\W', '', 'text 1, 2, 3...')
'text123'
Maybe you want to keep the spaces or have all the words (and numbers):
>>> re.findall(r'\w+', 'my. text, --without-- (punctuation) 123')
['my', 'text', 'without', 'punctuation', '123']
You can also try the isalpha and isnumeric methods in the following way:
text = 'base, sample test;'
getVals = lambda word: (c for c in word if c.isalpha() or c.isnumeric())
list(map(lambda word: ''.join(getVals(word)), text.split(' ')))  # ['base', 'sample', 'test']
There are other ways you may also consider, e.g. simply loop through the string and skip unwanted characters. For example, assuming you want to delete all ASCII characters that are not letters or digits (Python 2):
>>> import string
>>> newstring = [c for c in "a!1#b$2c%3\t\nx" if c in string.letters + string.digits]
>>> "".join(newstring)
'a1b2c3x'
Or use string.translate to map one character to another or to delete some characters, e.g.:
>>> todelete = [ chr(i) for i in range(256) if chr(i) not in string.letters + string.digits ]
>>> todelete = "".join(todelete)
>>> "a!1#b$2c%3\t\nx".translate(None, todelete)
'a1b2c3x'
This way you need to compute the todelete string only once (or hard-code it) and can reuse it everywhere you need to convert a string.
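The translate(None, todelete) form above is specific to Python 2 byte strings. A rough Python 3 counterpart, offered as a sketch, passes str.translate a dictionary that maps code points to None; the names keep and delete_table are introduced here for illustration, and the table only covers code points below 256, as in the original.

```python
import string

keep = set(string.ascii_letters + string.digits)

# Build the table once; mapping a code point to None deletes that character.
delete_table = {i: None for i in range(256) if chr(i) not in keep}

cleaned = "a!1#b$2c%3\t\nx".translate(delete_table)
print(cleaned)  # a1b2c3x
```

Characters with code points of 256 or above are not in the table and therefore pass through unchanged.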
You can use a predefined character class in Python regexes: \W corresponds to the set [^a-zA-Z0-9_]. Then,
import re
s = 'Hello dutrow 123'
re.sub(r'\W', '', s)
--> 'Hellodutrow123'
You need to be more specific:
What about Unicode "letters"? ie, those with diacriticals.
What about white space? (I assume this is what you DO want to delete along with punctuation)
When you say "letters" do you mean A-Z and a-z in ASCII only?
When you say "numbers" do you mean 0-9 only? What about decimals, separators and exponents?
It gets complex quickly...
A great place to start is an interactive regex site, such as RegExr.
You can also use a Python-specific tool, such as Python Regex Tool.
