Generate random string from regex character set - python

I assume there's some beautiful Pythonic way to do this, but I haven't quite figured it out yet. Basically I'm looking to create a testing module and would like a nice simple way for users to define a character set to pull from. I could potentially concatenate a list of the various charsets associated with string, but that strikes me as a very unclean solution. Is there any way to get the charset that the regex represents?
Example:
def foo(regex_set):
    re.something(re.compile(regex_set))

foo("[a-z]")
>>> abcdefghijklmnopqrstuvwxyz
The compile is of course optional, but in my mind that's what this function would look like.

Paul McGuire, author of Pyparsing, has written an inverse regex parser, with which you could do this:
import invRegex
print(''.join(invRegex.invert('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
If you do not want to install Pyparsing, there is also a regex inverter that uses only modules from the standard library with which you could write:
import inverse_regex
print(''.join(inverse_regex.ipermute('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
Note: neither module can invert all regex patterns.
And there are differences between the two modules:
import invRegex
import inverse_regex
print(repr(''.join(invRegex.invert('.'))))
print(repr(''.join(inverse_regex.ipermute('.'))))
yields
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
Here is another difference; this time pyparsing enumerates a larger set of matches:
x = list(invRegex.invert('[a-z][0-9]?.'))
y = list(inverse_regex.ipermute('[a-z][0-9]?.'))
print(len(x))
# 26884
print(len(y))
# 1100
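Whichever inverter you use, once the pattern's matches are enumerated you can sample from them to build random test strings. Here is a minimal sketch using invRegex; the helper name and the length parameter are my own, and it assumes the pattern only matches a finite set:

import random
import invRegex

def random_string_from(regex_set, length=10):
    # Enumerate every string the pattern can match, then sample with replacement.
    pool = list(invRegex.invert(regex_set))
    return ''.join(random.choice(pool) for _ in range(length))

print(random_string_from('[a-z]', 8))
# e.g. 'qkzpmbta'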

A regex is not needed here. If you want to have users select a character set, let them just pick characters. As I said in my comment, simply listing all the characters and putting checkboxes by them would be sufficient. If you want something that is more compact, or just looks cooler, you could do something like one of these:
Of course, if you actually use this, what you come up with will undoubtedly look better than these (and they will also actually have all the letters in them, not just "A").
If you need, you could include a button to invert the selection, select all, clear selection, save selection, or anything else you need to do.

If it's just simple ranges, you could manually parse it:
def range_parse(rng):
    min, max = rng.split("-")
    return "".join(chr(i) for i in range(ord(min), ord(max) + 1))

print(range_parse("a-z") + range_parse('A-Z'))
but it's gross ...

Another solution I thought of to simplify the problem:
Stick your own [ and ] on the line as part of the prompt, and disallow those characters in the input. After you scan the input and verify it doesn't contain anything matching [\[\]], you can prepend [ and append ] to the string, and use it like a regex against a string of all the characters needed ("abcdefghijklmnopqrstuvwxyz", for instance).
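A rough sketch of that idea (the function name, error handling, and candidate universe are my own choices):

import re
import string

def charset_from_input(user_input, universe=string.printable):
    # Reject anything that could break out of the wrapping brackets.
    if re.search(r'[\[\]\\]', user_input):
        raise ValueError("brackets and backslashes are not allowed")
    charset_re = re.compile('[' + user_input + ']')
    # Keep every candidate character that the user's class matches.
    return ''.join(c for c in universe if charset_re.match(c))

print(charset_from_input('a-z'))
# abcdefghijklmnopqrstuvwxyz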

Related

How can I speed up an email-finding regular expression when searching through a massive string?

I have a massive string. It looks something like this:
hej34g934gj93gh398gie foo@bar.com e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug
Except that it's much longer (1,000,000+ characters).
My goal is to find all the email addresses in this string.
I've tried a number of solutions, including this one:
# matches foo@bar.com and bar@foo.com
re.findall(r'[\w\.-]{1,100}@[\w\.-]{1,100}', line)
Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.
I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:
# matches @bar.com and @foo.com
re.findall(r'@[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)
It executes in just a few milliseconds.
I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the @x.x part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.
You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:
The internal engine no longer interprets a form of bytecode but
instead follows a linked set of nodes, and it can work breadth-wise as
well as depth-first, which makes it perform much better when faced
with one of those 'pathological' regexes.
However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}@[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}@[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.
Python test:
import re, regex, timeit
text='your_long_string'
re_pattern=re.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
regex_pattern=regex.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
Don't use regex on the whole string. Regexes are slow. Avoiding them is your best bet for better overall performance.
My first approach would look like this (a sketch follows below):
Split the string on spaces.
Filter the result down to the parts that contain @.
Create a pre-compiled regex.
Use the regex on the remaining parts only to remove false positives.
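A minimal sketch of that first approach, using the pattern from the question (the helper name is mine):

import re

# Pre-compiled pattern from the question.
email_re = re.compile(r'[\w.-]{1,100}@[\w.-]{1,100}')

def find_emails(text):
    # Cheap pre-filter: only whitespace-separated tokens containing '@' are candidates.
    candidates = (tok for tok in text.split() if '@' in tok)
    hits = []
    for tok in candidates:
        m = email_re.search(tok)  # the regex now only runs on short tokens
        if m:
            hits.append(m.group(0))
    return hits

print(find_emails('hej34g foo@bar.com e34y bar@foo.com hge98'))
# ['foo@bar.com', 'bar@foo.com']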
Another idea (also sketched below):
in a loop...
use .index("@") to find the position of the next candidate,
extend e.g. 100 characters to the left and 50 to the right to cover name and domain,
adapt the range depending on the last email address you found so you don't overlap,
check the range with a regex; if it matches, yield the match.
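A rough sketch of that loop; the window sizes, names, and pattern reuse are illustrative only:

import re

email_re = re.compile(r'[\w.-]{1,100}@[\w.-]{1,100}')

def iter_emails(text, left=100, right=50):
    pos = 0
    last_end = 0
    while True:
        at = text.find('@', pos)  # position of the next candidate '@'
        if at == -1:
            return
        # Window around the '@', clipped so it never overlaps the previous hit.
        start = max(at - left, last_end)
        end = min(at + 1 + right, len(text))
        m = email_re.search(text[start:end])
        if m:
            yield m.group(0)
            last_end = start + m.end()
        pos = at + 1

print(list(iter_emails('hej34g foo@bar.com e34y bar@foo.com hge98')))
# ['foo@bar.com', 'bar@foo.com']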

Python Regex back reference a named group

I'm attempting to parse phone numbers that can come through in different ways. For example:
(321) 123-4567
(321) 1234567
321-123-4567
321123-4567
I then want to grab each of the three parts separately. My thought is to use named groups and some and/or situation like so:
(^\s*(?P<area>[0-9]{3})\-?(?P<fst>[0-9]{3})\-(?P<lst>[0-9]{4}))|(^\s*\(\area\)\s*(\fst)\-?(\lst))
Problem with that, I believe, is that I am not calling the named groups properly. I'm trying to use https://regex101.com/ to help but am still getting stuck. Because the parentheses around the area code should either both be there or neither should be there, I don't want to use the "?" character like:
\(?(?P<area>[0-9]{3})\)?
Can anyone help me with this? Thank you so much.
I'm using python 3.6 and the re package.
There were a few issues with your regex. You didn't make the brackets optional, and you didn't allow optional spaces between the area code and the first part. Without seeing your Python code it's not easy to know how you were doing things, but I did this with a compiled regex, which I then applied to the list of numbers.
from __future__ import print_function
import re

phone_numbers = [
    '(321) 123-4567',
    '(321) 1234567',
    '321-123-4567',
    '321123-4567',
]

regex = re.compile(r'^\s*\(?(?P<area>[0-9]{3})[) -]*(?P<fst>[0-9]{3})-?(?P<sec>[0-9]{4})')

for p in phone_numbers:
    print(regex.sub(r'(\g<area>) \g<fst>-\g<sec>', p))
This isn't perfect as it will allow things that aren't valid syntax (according to your list) to be parsed, but this shouldn't be a problem. For example '(321))- - )) 123-4567' would be parsed correctly.
I'd use group testing: ^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$.
In there:
(\()? captures an opening parenthesis in group 1 when it exists.
(?(1)\)) tests whether group 1 captured something and, if so, matches a closing parenthesis.
The rest is pretty straightforward.
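A quick check of that conditional-group pattern against the sample numbers (the surrounding code is mine):

import re

pattern = re.compile(r'^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$')

for number in ['(321) 123-4567', '(321) 1234567', '321-123-4567', '321123-4567']:
    m = pattern.match(number)
    if m:
        # Each sample prints: 321 123 4567
        print(m.group('area'), m.group('fst'), m.group('lst'))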

Think you know Python RE? Here's a challenge

Here's the skinny: how do you make a character set match NOT a previously captured character?
r'(.)[^\1]' # doesn't work
Here's the uh... fat? It's part of a (simple) cryptography program. Suppose "hobo" got coded to "fxgx". The program only gets the encoded text and has to figure what it could be, so it generates the pattern:
r'(.)(.)(.)\2' # 1st and 3rd letters *should* be different!
Now it (correctly) matches "hobo", but also matches "hoho" (think about it!). I've tried stuff like:
r'(.)([^\1])([^\1\2])\2' # also doesn't work
and MANY variations but alas! Alack...
Please help!
P.S. The work-around (which I had to implement) is to just retrieve the "hobo"s as well the "hoho"s, and then just filter the results (discarding the "hoho"s), if you catch my drift ;)
P.P.S Now I want a hoho
VVVVV THE ANSWER VVVVV
Yes, I re-re-read the documentation and it does say:
Inside the '[' and ']' of a character class, all numeric escapes are
treated as characters.
As well as:
Special characters lose their special meaning inside sets.
Which pretty much means (I think) NO, you can't do anything like:
re.compile(r'(.)[\1]') # Well you can, but it kills the back-reference!
Thanks for the help!
1st and 3rd letters should be different!
This cannot be detected using a regular expression (not just python's implementation). More specifically, it can't be detected using automata without memory. You'll have to use a different kind of automata.
The kind of grammar you're trying to discover (reduplication) is not regular. Moreover, it is not context-free.
Finite automata are the mechanism that allows regular expression matching to be so efficient.
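As a concrete version of the filtering workaround mentioned in the question, you can match the looser pattern and discard the false positives afterwards. A small sketch (the \b\w variant and names are my own):

import re

# Loose pattern for four-letter words whose 2nd and 4th letters agree (hobo, hoho, ...).
loose = re.compile(r'\b(\w)(\w)(\w)\2\b')

def hobo_like(text):
    # Keep only matches whose 1st and 3rd letters really differ, dropping 'hoho'.
    return [m.group(0) for m in loose.finditer(text) if m.group(1) != m.group(3)]

print(hobo_like('hobo hoho'))
# ['hobo']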

A simple regexp in python

My program is a simple calculator, so I need to parse the expression which the user types, to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need transform a input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be clearer, I want to add single quotes after every occurrence of "MM(" and before the ")" which comes after it, and after every occurrence of "func(" and before the "," which comes after it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answer to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
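For reference, chaining the two substitutions on the example input reproduces the quoted string from the question:

import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
step1 = re.sub(r'MM\(([^)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", step1)
print(output)
# 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))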

Split string with caret character in python

I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") has white spaces, in the second part (a subtitle) each word is separated with the "_" character, and finally there is a number (a page number). I want to split each line into 3 (obvious) parts, because I want to create some sort of directory in Python.
I was trying with the re module, but as the caret character has a strong meaning in that module, I couldn't figure out how to do it.
Could someone please help me????
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want the three non-empty pieces you can accomplish this with a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that consecutive caret characters will produce empty strings. You could protect against this by either collapsing consecutive carets down into a single one, or detecting empty strings in the resulting list.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
Also, for the re module, any special characters can be escaped with \, like r'\^'. I'd suggest before jumping to use re to 1) learn how to write regular expressions, 2) first look for a solution to your problem instead of jumping to regular expressions - «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. »
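If the file does turn out to be csv-like, a sketch of reading it with the csv module and "^" as the delimiter (the file name is made up, and it assumes exactly four "^"-separated fields per line):

import csv

with open('menus.txt', newline='') as f:
    for row in csv.reader(f, delimiter='^'):
        # The sample line yields: ['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
        menu, title, _, page = row
        print(menu, title, page)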
