So, I use NVDA, a free screen reader for the blind that many people use, along with a speech synthesizer. I am building a library of modified versions of the add-ons it takes, plus dictionaries that can contain Python-compatible regular expressions as well as standard word replacements.
My problem is that I do not know how to design a regular expression that will place a space between capital letters, such as in ANM, which the synth says as one word rather than spelling it out like it should.
I do not know enough Python to manually code an add-on for this thing; I only use regexps for this kind of thing. I do know the basics of regular expressions and the general implementation, the kind of thing you can find by googling "regular expressions in about 55 minutes".
I want it to do something like this.
Input: ANM
Output: A N M
Also, with the way this speech synth works, I may have to replace A with eh, which would make it:
Input: ANM
Output: Eh N M
Could any of you provide me a regular expression to do this, if it is possible? And no, I don't think I can compile them in loops, because I didn't write the Python.
This should do the trick for the capital letters. It uses a lookahead, (?=...), to look ahead for the next capital letter without 'eating it up':
>>> import re
>>> re.sub("([A-Z])(?=[A-Z])", r"\1 ", "ABC thIs iS XYZ a Test")
'A B C thIs iS X Y Z a Test'
If you have a lot of replacements to make, it might be easiest to put them into a single variable:
import re

replacements = [("A", "eh"), ("B", "bee"), ("X", "ex")]
result = re.sub("([A-Z])(?=[A-Z])", r"\1 ", "ABC thIs iS XYZX. A Xylophone")
for source, dest in replacements:
    result = re.sub("(" + source + r")(?=\W)", dest, result)
print(result)
Output:
eh bee C thIs iS ex Y Z ex. eh Xylophone
I built a regex in the 'replacements' code to handle capitalised words and standalone capitals at the end of sentences correctly. If you want to avoid replacing e.g. the standalone 'A' with 'eh', then the more advanced regex replacement function mentioned in @fjarri's answer is the way to go.
While @Galax's solution certainly works, it may be easier to perform further processing of abbreviations if you use callbacks on matches (this way you won't replace any standalone capitals):
import re

s = "This is a normal sentence featuring an abbreviation ANM. One, two, three."

def process_abbreviation(match_object):
    spaced = ' '.join(match_object.group(1))
    return spaced.replace('A', 'Eh')

print(re.sub("([A-Z]{2,})", process_abbreviation, s))
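For reference, with the sample sentence above this should print:
This is a normal sentence featuring an abbreviation Eh N M. One, two, three.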
Okay, found the answer. Using a sequence of regexes in a certain order, I got it to work. Thanks, you guys; you helped me form the basis and you are appreciated.
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python's standard re module, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width one. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", oAcctNum).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
You need to use capture groups in the case you described:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(re.search(p, test_str).group(1))
Unless you need overlapping matches, using capture groups instead of a look-behind is rather straightforward.
print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str))
You can use findall directly; it will return all the groups in the regex if they are present.
Here is the list of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
I want to form a regex which checks if any of these emoticons exists in a sentence, for example "hey there I am good :)" or "I am angry and sad :(". But there are a lot of emoticons in the list on Wikipedia, so I am wondering how I can achieve this task.
I am new to regex and Python.
>>> s = "hey there I am good :)"
>>> import re
>>> q = re.findall(":",s)
>>> q
[':']
I see two approaches to your problem:
Either, you can create a regular expression for a "generic smiley" and try to match as many as possible without making it overly complicated and insane. For example, you could say that each smiley has some sort of eyes, a nose (optional), and a mouth.
Or, if you want to match each and every smiley from that list (and none else) you can just take those smileys, escape any regular-expression specific special characters, and build a huge disjunction from those.
Here is some code that should get you started for both approaches:
import re

# approach 1: pattern for "generic smiley"
eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
pattern1 = "[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))
# approach 2: disjunction of a list of smileys
smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()
pattern2 = "|".join(map(re.escape, smileys))
text = "bla bla bla :-/ more text 8^P and another smiley =-D even more text"
print(re.findall(pattern1, text))
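For completeness, the disjunction pattern from approach 2 works the same way; against the same text it should only report the smileys that appear verbatim in the list:
print(re.findall(pattern2, text))  # ['=-D'] with the sample list above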
Both approaches have pros, cons, and some general limitations. You will always have false positives, such as in a mathematical term like 18^P. It might help to put spaces around the expression, but then you can't match smileys followed by punctuation. The first approach is more powerful and catches smileys the second approach won't match, but only as long as they follow a certain schema. You could use the same approach for "eastern" smileys, but it won't work for strictly symmetric ones, like =^_^=, as this is not a regular language. The second approach, on the other hand, is easier to extend with new smileys, as you just have to add them to the list.
I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Let's say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
new = "".join(l)
new = new.replace(".", " ")
print(new)
Get the index where "E" occurs and check if there are 2 integers after the "E".
Then delete everything after the second integer.
However, this will not work if there is an "E" anyplace else.
At the moment, the result I get is:
this is tEst
because it finds the index of the first "E" in the list and deletes everything after index+3.
I guess my question is: how do I get the index in the list where a combination of characters exists? I can't seem to find how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than just copying what others write :)
this is what I came up with:
for i in l:
    if i == "E" and isinstance((i+1), int) is True:
        p = l.index(i)
        print(p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two digits" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods str.replace to swap the periods for spaces, and str.title to put the words in Title Case.
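A minimal sketch of that recipe, using the sample name from the question (variable names are mine):
import re

name = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
m = re.search(r"E\d\d", name)  # "E" followed by two digits
if m:
    kept = name[:m.end()]  # slice notation: keep everything up to the end of the match
    print(kept.replace(".", " ").title())  # This Is Test3 E00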
An easy way is to use a regex to match everything up to and including the "E followed by 2 digits" criterion, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re

filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # end(0) gives the index just past group 0, i.e. the whole match
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i == "E" and isinstance((i+1), int) is True:
        p = l.index(i)
        print(p)
...is because 'i' contains a character from the list 'l', not an integer. Comparing it with 'E' works, but then the code tries to add 1 to it, which errors out.
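If you want the index-based check that loop was aiming for, here is one sketch, using enumerate() so there is a real integer position to work with, and str.isdigit() for the digit test:
for p, ch in enumerate(l):
    if ch == "E" and p + 2 < len(l) and l[p+1].isdigit() and l[p+2].isdigit():
        print(p)
        break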
I have tokenized names (strings), with the tokens separated by underscores, which will always contain a "side" token by the value of M, L, or R.
The presence of that value is guaranteed to be unique (no repetitions or dangers that other tokens might get similar values).
In example:
foo_M_bar_type
foo_R_bar_type
foo_L_bar_type
I'd like, in a single regex, to swap L for R and vice versa wherever it is found, and to leave M untouched.
I.e., the above would become:
foo_M_bar_type
foo_L_bar_type
foo_R_bar_type
when pushed through this ideal expression.
This was what I thought would be a 10-minute exercise while writing some simple stuff, but I couldn't quite crack it as concisely as I wanted to.
The problem itself was of course trivial to solve with one condition that changes the pattern, but I'd love some help doing it within a single re.sub()
Of course any food for thought is always welcome, but this being an intellectual exercise that a couple of colleagues and I failed at, I'd love to see it cracked that way.
And yes, I'm fully aware it might not be considered very Pythonic, nor ideal, to solve the problem with a regex, but humour me please :)
Thanks in advance
This answer [ab]uses the replacement function:
>>> s = "foo_M_bar_type foo_R_bar_type foo_L_bar_type"
>>> import re
>>> re.sub("_[LR]_", lambda m: {'_L_':'_R_','_R_':'_L_'}[m.group()], s)
'foo_M_bar_type foo_L_bar_type foo_R_bar_type'
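The same trick reads a little better with the mapping pulled out into a name; the behavior is identical:
import re

swap = {"_L_": "_R_", "_R_": "_L_"}
s = "foo_M_bar_type foo_R_bar_type foo_L_bar_type"
print(re.sub("_[LR]_", lambda m: swap[m.group()], s))
# foo_M_bar_type foo_L_bar_type foo_R_bar_type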
I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.
I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.
My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?
Edit: So, I guess an example of my question would be: how expensive would it be to compile and search with one regular expression versus, say, iterating 'if "word" in string' 5 times?
You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
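A minimal sketch of that idea (the words here are made up for illustration):
import re

pattern = re.compile(r"(foo)|(bar baz)|(qux)")
m = pattern.search("some text with bar baz in it")
if m:
    # m.lastindex is the number of the alternative (group) that matched
    print(m.lastindex, m.group(m.lastindex))  # 2 bar baz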
If speed is of the essence, you are better off running some tests before you decide how to code your production application.
First of all, you said that you are searching for words which suggests that you may be able to do this using split() to break up the string on whitespace. And then use simple string comparisons to do your search.
Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the string class for a full list.
Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.
The classic algorithm for this task is Aho-Corasick, for which there is a Python extension (written in C). This should beat the socks off any alternative that uses the re module.
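For example, with the third-party pyahocorasick package (a sketch from memory; check its docs for the exact API):
import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for index, phrase in enumerate(["broad range", "words", "phrases"]):
    automaton.add_word(phrase, (index, phrase))
automaton.make_automaton()

text = "search it for instances of a fairly broad range of words and phrases"
for end_index, (index, phrase) in automaton.iter(text):
    print(phrase, "ends at", end_index)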
If you want to know how fast compiling regex patterns is, you need to benchmark it.
Here is how I do that: it compiles each pattern 1 million times.
import time, re

def taken(f):
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print(t2 - t1, "s taken")
        return r
    return wrap

@taken
def regex_compile_test(x):
    for i in range(1000000):
        re.compile(x)
    print("for", x, end=" ")

# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}")
It took around 5 seconds for each pattern on my computer.
for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken
The real bottleneck is not in compiling patterns; it is in extracting text, as with re.findall, or replacing it, as with re.sub. If you use those against several MB of text, it's quite slow.
If your text is fixed, use normal str.find; it's faster than regex.
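A quick illustration of the plain-string route:
text = "some fixed text"
print(text.find("fixed"))  # 5: index of the first occurrence, -1 if absent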
Actually, if you give us samples of your text and your regex patterns, we could give you a better idea; there are many great regex and Python folks out there.
Hope this helps, and sorry if my answer doesn't address your exact question.
When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.
There are other string searching approaches, such as the Boyer-Moore algorithm. But I'd wager the complexity of searching for multiple separate strings is much higher than that of a single regexp that can branch on each successive character.
This is a question that can readily be answered by just trying it.
>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
...     words = f.readlines()
>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758
It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.
If you're just searching for a particular substring, use str.find() instead.
Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.
However, when it comes to short pieces of text, regexes have incredibly good performance. Personally, I only remember running into problems when text sizes became ridiculous, like 100k words or so.
Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes then goes through all the pieces of text in a big loop or runs as a service. This way you will only have to compile the regexes once.
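Even without a daemon, the usual pattern is to compile once at import time and reuse the compiled objects; a sketch:
import re

# Compiled once when the module is imported, reused for every text:
PATTERNS = [re.compile(p) for p in (r"\bfoo\b", r"\bbar\b")]

def matching_patterns(text):
    return [p.pattern for p in PATTERNS if p.search(text)]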
In the general case, you can use the "in" keyword:
for line in open("file"):
    if "word" in line:
        print(line.rstrip())
regex is usually not needed when you use Python :)