Regexp python - finding substring

Regexp python - finding substring - python

How could I find all instances of a substring in a string?
For example I have the string ("%1 is going to the %2 with %3"). I need to extract all placeholders in this string (%1, %2, %3)
The current code could only find the first two because the ending is not a white space.
import re
string = "%1 is going to the %2 with %3"
r = re.compile('%(.*?) ')
m = r.finditer(string)
for y in m:
print (y.group())

Don't match on whitespace, match on a word boundary instead using \b:
r = re.compile(r'%(.*?)\b')
You may want to restrict your characters to word characters only instead of the . wildcard, and match at least one character:
r = re.compile(r'%(\w+)\b')
You don't appear to be using the capturing group either, so you could just omit that:
r = re.compile(r'%\w+\b')

Related

Match a whole string variable using python regex

I am new to python.
I want to write a regex to find a whole string using re.match.
For example,
word1=I
word2=U
str_to_match = word1+"[There could be some space in between]"+ word2[this is the end of string I want to match]"
In this case, it should match I[0 or more spaces in between]U.
It shouldn't match abI[0 or more spaces in between]U, OR abI[0 or more spaces in between]Ucd, OR
I[0 or more spaces in between]Ucd.
I know you could use boundary \b to set the word boundary but since there could be a number of spaces in between word1 and word2, the whole string is like a variable, not a fixed string, it won't work for me:
ret=re.match(r"\b"+word1+r"\s+" +word2+r"\b")
not working
Does anyone know how could I find the correct way to match this case?
Thanks!

match only matches at the beginning of the string. Which since you are using \b should not be what you want. findall will find all matching strings.
If you want to find the first occurrence and just test if it exists in the string, and not find all matches you can use search:
word1 = 'I'
word2 = 'U'
ret = re.findall(rf"\b{word1}\s*{word2}\b"," I U IU I U aI U I Ub")
print(ret)
ret = re.search(rf"\b{word1}\s*{word2}\b"," IUb .I U ")
print(ret)

How to find the substring if the substring has random characters replaced?

Let's say we have a string in Python:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
And we are interested in finding the beginning coordinates of the substring substring ="ChristmasWhen". This is very straightforward in Python, i.e.
>>> substring ="ChristmasWhen"
>>> original_string.find(substring)
18
and this checks out
>>> "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"[18]
'C'
If we tried to look for a string which didn't exist, find() will return -1.
Here is my problem:
I have a substring which is guaranteed to be from the original string. However, characters in this substring have been randomly replaced with another character.
How could I algorithmically find the beginning coordinate of the substring (or at least, check if it's possible) if the substring has random characters '-' replacing certain letters?
Here's a concrete example:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
substring = '-hri-t-asW-en'
Naturally, if I try original_string.find('-hri-t-asW-en'), but it would be possible to find hri begins at 19, and therefore with the prefix -, the substring original_string.find('-hri-t-asW-en') must be 18.

This is typically what regular expressions are for : find patterns. You can then try:
import re # use regexp
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
r = re.compile(".hri.t.asW.en") # constructs the search machinery
res = r.search(original_string) # search
print (res.group(0)) # get results
result will be:
ChristmasWhen
Now if your input (the search string) must use '-' as a wildcard you can then translate it to obtain the right regular expression:
import re
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = ".hri.t.asW.en" # supposedly inputed by user
s = s.replace('-','.') # translate to regexp syntax
r = re.compile(s)
res = r.search(original_string)
print (res.group(0))

perhaps use a regular expression? For instance, you can use the . (dot character) to match any character (other than a newline, by default). So if you modify your substring to use dots instead of dashes for the erased letters in the string, you can use re.search to locate those patterns:
text = 'TwasTheNightBeforeChristmasWhenAllThroughTheHouse';
re.search('.hri.t.asW.en', text)

You can use regular expresions to find both the match and the possition
import re
p = re.compile(".hri.t.asW.en")
for m in p.finditer('TwasTheNightBeforeChristmasWhenAllThroughTheHouse'):
print(m.start(), m.group())
out: (18 ChristmasWhen)

A non-regex approach, less efficient than the latter, but still a possibility:
o = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = '-hri-t-asW-en'
r = next(i for i in range(len(o)-len(s)) if all(a == b or b == '-' for a, b in zip(o[i:i+len(s)], s)))
Output
18

Python .replace() function, removing backslash in certain way

I have a huge string which contains emotions like "\u201d", AS WELL AS "\advance\"
all that I need is to remove back slashed so that:
- \u201d = \u201d
- \united\ = united
(as it breaks the process of uploading it to BigQuery database)
I know it should be somehow this way:
string.replace('\','') But not sure how to keep \u201d emotions.
ADDITIONAL:
Example of Unicode emotions
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f

You can split on all '\' and then use a regex to replace your emotions with adding leading '\'
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emotions are 'u' and 4 hexa numbers, 'u[a-f0-9]{4}' will match them all, and you just have to add leading backslashes
First of all, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\')
And then we match every "emotion" with the regex u[a-f0-9]{4} (Which is u with 4 hex letters behind)
And with the regex sub, you replace every match with a leading \\

You could simply add the backslash in front of your string after replacement if your string starts with \u and have at least one digit.
import re
def clean(s):
re1='(\\\\)' # Any Single Character "\"
re2='(u)' # Any Single Character "u"
re3='.*?' # Non-greedy match on filler
re4='(\\d)' # Any Single Digit
rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(s)
if m:
r = '\\'+s.replace('\\','')
else:
r = s.replace('\\','')
return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your sting if multiple entries are in the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])

You can use this simple method to replace the last occurence of your character backslash:
Check the code and use this method.
def replace_character(s, old, new):
return (s[::-1].replace(old[::-1],new[::-1], 1))[::-1]
replace_character('\advance\', '\','')
replace_character('\u201d', '\','')
Ooutput:
\advance
\u201d

You can do it as simple as this
text = text.replace(text[-1],'')
Here you just replace the last character with nothing

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?

EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.

Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns

Try:
import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML
See Demo

Regex to get list of all words with specific letters (unicode graphemes)

I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).
I need to draw out the subset of those words that can be spelled using just those letters.
An English example:
words = ["cat", "dog", "tack", "coat"]
get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]
A Tamil example:
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words([u'ம', u'ப', u'ட', u'ம்') should return [u"மடம்", u"படம்")
get_words([u'ப', u'ம்', u'ட') should return [u"படம்"]
The order that the words are returned in, or the order that the letters are entered in should not make a difference.
Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.
In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).

To support characters that can span several Unicode codepoints:
# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial
NFKD = partial(unicodedata.normalize, 'NFKD')
def match(word, letters):
word, letters = NFKD(word), map(NFKD, letters) # normalize
return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]
print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்
It assumes that the same character can be used zero or more times in a word.
If you want only words that contain exactly given characters:
import regex # $ pip install regex
chars = regex.compile(r"\X").findall # get all characters
def match(word, letters):
return sorted(chars(word)) == sorted(letters)
words = ["cat", "dog", "tack", "coat"]
print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack
Note: there is no cat in the output in this case because cat doesn't use all given characters.
What does it mean to normalize? And could you please explain the syntax of the re.match() regex?
>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç
Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.

EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then #Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.
Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.
To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)
So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)
Here is the pattern for end of word. White space or end of line: `(?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
You can build this dynamically. You don't need to hard-code the whole thing.
import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars is now set to the string: "ocat"
s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
Now, this doesn't in any way check for valid words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.
So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.
This has a confusing problem: re.findall() doesn't return all possible matches.
EDIT: Okay, I figured out what was confusing me.
What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping patterns. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:
Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".
Here is that pattern using ocat:
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm very sorry but I really do think this is the best we can do and still work with re.findall()!
It's actually less confusing in Python code though:
import re
NMGROUP_BEGIN = r'(?:' # begin non-matching group
NMGROUP_END = r')' # end non-matching group
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
PAT_OR = r'|'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
s_pat = (NMGROUP_BEGIN +
BOL + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + EOL + PAT_OR +
BOL + CCS + EOL +
NMGROUP_END)
pat = re.compile(s_pat)
text = "This is sensible. This not: occo cttc"
pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]
So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.
EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
lst_s_pat = [
BOL + CCS + WS_AFTER,
WS_BEFORE + CCS + WS_AFTER,
WS_BEFORE + CCS + EOL,
BOL + CCS
]
lst_pat = [re.compile(s) for s in lst_s_pat]
text = "This is sensible. This not: occo cttc"
result = []
for pat in lst_pat:
result.extend(pat.findall(text))
# result set to: ['occo', 'cttc']
EDIT: Okay, here is a very different approach. I like this one best.
First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.
import re
import string
# Create a pattern that matches all characters not part of a word.
#
# Note that '-' has a special meaning inside a character class, but it
# is valid punctuation that we want to match, so put in a backslash in
# front of it to disable the special meaning and just match it.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.
WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')
WORD = r'[^' + WORD_BOUNDARY + r']+'
# Create a pattern that matches only the words we want.
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
# build up character class string pattern
CCS = r'[' + set_of_chars + r']+'
pat_word = re.compile(WORD)
pat = re.compile(CCS)
text = "This is sensible.  This not: occo cttc"
# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]
# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))
# force the expression to expand out to a list
result = list(result_genexp)
# result set to: ['occo', 'cttc']
EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.

It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
\b[ocat]+\b' Now it only matches on word boundaries. (Thank you very much #Mark Tolonen for educating me about\b`.)
So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().
import re
words = ["cat", "dog", "tack", "coat"]
def get_words(chars_seq, words_seq=words):
s_chars = ''.join(chars_seq)
s_pat = r'\b[' + s_chars + r']+\b'
pat = re.compile(s_pat)
return [word for word in words_seq if pat.match(word)]
assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]

I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:
>>> from collections import Counter
>>> def get_words(word_list, letter_string):
return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']
This solution should also work for other languages. Counter(word) & Counter(letter_string) finds the intersection between the two counters, or the min(c[x], f[x]). If this intersection is equivalent to your word, then you want to return the word as a match.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regexp python - finding substring - python

Related

Match a whole string variable using python regex

How to find the substring if the substring has random characters replaced?

Python .replace() function, removing backslash in certain way

python regex - replace newline (\n) to something else

Regex to get list of all words with specific letters (unicode graphemes)

Categories

Resources