I have a need to recover 2 results of a regular expression in Python: what is searched and all else.
For example, in:
"boofums",3,4
I'd like to find what is in the quotes and what isn't:
boofums
,3,4
What I have so far is:
bobbles = '"boofums",3,4'
pickles = re.split(r'\".*\"', bobbles)
morton = re.match(r'\".*\"', bobbles)
print(pickles[1])
print(morton[0])
,3,4
"boofums"
This seems to me insanely inefficient and not Python-esque. Is there a better way to do this? (Sorry for the "is there a better way" construct on StackOverflow, but... I need to do this better! 😂)
...and if you can help me extract just what's in the quotes, something that I'd easily do in Perl or Ruby, all the better!
You're probably best off with regex groupings:
So for your example I'd use something like
regex = re.compile("\"(.*)\"(.*)")
bobble_groups = regex.match(bobbles)
you can then use bobble_groups.group(1) to just get the quotation marks.
See named groups if you don't want to depend on an index number.
a, b = re.match('"(.*)"(.*)', bobbles).groups()
Brackets determine groups that are "saved" to the match object
Related
I'm new to regular expressions, help me extract the necessary information from the text:
salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"
&salespackquantity=1&itemCode=2313441","quantity_box
I need to take the numbers 3760041 and 2313441 respectively. What should a regular expression look like?
If we're dealing with just line-based data as you show it could be as easy as:
.*itemCode=([0-9]+).*
Which is brutal but would do the work. You'd extract the first matching group.
Although your example seems inconsistent and truncated, so this may vary. Please provide more details if there are other conditions.
Example
>>> import re
>>> oneline = "salespackquantity=1&itemCode=3760041\",\"quantity_box_sales_uom\""
>>> match = re.search('.*itemCode=([0-9]+).*', oneline)
>>> match.group(0)
'salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"'
>>> match.group(1)
'3760041'
Do you really need regex?
Arguably, a regex seems an easy way to get what you want here, but it might be grossly inefficient, depending on your use case and input data.
Several other strategies might be easier:
remove unnecessary data first,
use a proper parser for your specific content (here this looks like a mix of a CSV and URL query strings),
don't even bother and cut on appropriate boundaries, if the format is fixed.
Regex are powerful, and can be overly powerful for simple scenarios. Totally fair if it's to run a one-off data extraction script, though, or if the cost/benefit analysis of the development effort is worth it.
a = "example is the int and string 223576"
ext = []
b = "1234567890"
for i in a:
if i in b:
ext.append(i)
print(ext)
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "")
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9])", textBlock[indexVal]).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 12), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
You need to use capture groups in this case you described:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print re.search(p, test_str).group(1)
IDEONE demo
Unless you need overlapping matches, capturing groups usage instead of a look-behind is rather straightforward.
print re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)",test_str)
You can directly use findall which will return all the groups in the regex if present.
I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed between parentheses, and a string that contains outside of them. The problem is, parentheses may be embedded into each other:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is it does not match the example 2, because it has a ( string after it.
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
These type of regexes are not always easy, but sometimes it's possible to come up with a way provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
p = re.compile(ur'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1
For example:
Characters to match: 'czk'
string1: 'zack' Matches
string2: 'zak' Does not match
I tried (c)+(k)+(z) and [ckz] which are obviously wrong. I feel this is a simple task, but i am unable to find an answer
The most natural way would probably to use sets rather than regex, like so
set('czk').issubset(s)
Code is very often simpler and easier to maintain without using regex much.
Basically you have to sort the string first so you get "ackz" and then you can use a regex like /.*c.*k.*z.*/ to match against.
If I am finding & replacing some text how can I get it to replace some text that will change each day so ie anything between (( & )) whatever it is?
Cheers!
Use regular expressions (http://docs.python.org/library/re.html)?
Could you please be more specific, I don't think I fully understand what you are trying to accomplish.
EDIT:
Ok, now I see. This may be done even easier, but here goes:
>>> import re
>>> s = "foo(bar)whatever"
>>> r = re.compile(r"(\()(.+?)(\))")
>>> r.sub(r"\1baz\3",s)
'foo(baz)whatever'
For multiple levels of parentheses this will not work, or rather it WILL work, but will do something you probably don't want it to do.
Oh hey, as a bonus here's the same regular expression, only now it will replace the string in the innermost parentheses:
r1 = re.compile(r"(\()([^)^(]+?)(\))")