Best way to parse string with delimiters (thinking regex)?

I'm currently trying to parse a python string for some specific text inside of it. It should actually be really straightforward.
But more importantly, I want to know if regex is a "tool set" type thing, where you know a certain number of tricks? Some people are very, very proficient with them, and I want to attain that proficiency.
So while I am asking how to match this string, I'd also like an explanation of the thought process that led you to your solution.
I basically want text A, text-B, and text_C, delimited only by commas.
The desired output string:
"text A,text-B,text_C"
The original text is as follows:
"(1, u'text A', u'text-B', u'text_C')"
In my limited understanding, the main thing separating each expression is a single quote, so I would start with that. But ultimately I might have strings such as text-'A, and I want to make sure that I don't run into errors because I parse the string incorrectly.
Thanks for your time. Remember: thought process.

Since the string you're dealing with is a repr version of a Python tuple, the most Pythonic way is to use ast.literal_eval which can take that object and safely convert back to a Python object retaining the correct types:
import ast
text = "(1, u'text A', u'text-B', u'text_C')"
tup = ast.literal_eval(text)
Then if you only wish to join each item that's a string together:
joined = ', '.join(el for el in tup if isinstance(el, basestring))
# text A, text-B, text_C
Otherwise just slice the tuple tup[1:] and join the items in that...
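On Python 3, basestring no longer exists, so the same idea can be sketched with str instead. This is only a minimal sketch under that assumption; it joins on a bare comma to match the output string asked for in the question.

```python
import ast

text = "(1, u'text A', u'text-B', u'text_C')"

# literal_eval only evaluates Python literals; it does not run arbitrary code
tup = ast.literal_eval(text)

# Keep only the string items; str stands in for Python 2's basestring
joined = ",".join(el for el in tup if isinstance(el, str))
print(joined)
```

The isinstance filter skips the leading integer without relying on its position in the tuple.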
In terms of a regex, a quick and dirty, non-robust method, that will break easily and possibly even provide incorrect matches under some circumstances is to use:
import re
string_vals = re.findall("'(.*?)'", text)
This finds anything after a ' up until the very next '... Again, using ast.literal_eval is much nicer here...
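To see how the quick regex goes wrong on the text-'A case from the question, here is a small sketch. The input string is an assumption: it is what the repr would look like if one item contained a single quote, since Python then switches that item to double quotes.

```python
import ast
import re

# Hypothetical input: an item containing ' is repr'd with double quotes
text = "(1, u\"text-'A\", u'text-B')"

# The quick regex pairs the stray quote with the next opening quote
naive = re.findall("'(.*?)'", text)
print(naive)

# literal_eval follows Python's own quoting rules
parsed = ast.literal_eval(text)
print(parsed[1:])
```

The regex returns a fragment that spans two items, while literal_eval recovers both strings intact.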

Must it be regex? :(
a_str = "(1, u'text A', u'text-B', u'text_C')"
print(",".join(a_str[1:-1].split(",")[1:]).replace('u','').replace("'",''))
Yields:
text A, text-B, text_C
EDIT: well if it must be regex .. don't mind this post, it doesn't work for many cases.

Related

How to extract a numeric value from a line of text with a regular expression?

I'm new to regular expressions, help me extract the necessary information from the text:
salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"
&salespackquantity=1&itemCode=2313441","quantity_box
I need to take the numbers 3760041 and 2313441 respectively. What should a regular expression look like?

If we're dealing with just line-based data as you show it could be as easy as:
.*itemCode=([0-9]+).*
Which is brutal but would do the work. You'd extract the first matching group.
Although your example seems inconsistent and truncated, so this may vary. Please provide more details if there are other conditions.
Example
>>> import re
>>> oneline = "salespackquantity=1&itemCode=3760041\",\"quantity_box_sales_uom\""
>>> match = re.search('.*itemCode=([0-9]+).*', oneline)
>>> match.group(0)
'salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"'
>>> match.group(1)
'3760041'
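A slightly tighter variant, sketched here as one option rather than the only way, drops the leading and trailing .* and searches for the key directly. It is run over both sample lines from the question.

```python
import re

# The two sample lines from the question
lines = [
    'salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"',
    '&salespackquantity=1&itemCode=2313441","quantity_box',
]

# Anchor on the literal key and capture only the digits that follow it
pattern = re.compile(r"itemCode=(\d+)")

codes = []
for line in lines:
    m = pattern.search(line)
    if m:  # skip lines that do not contain the key
        codes.append(m.group(1))
print(codes)
```

Because search scans the string for the first match, the surrounding .* is not needed.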
Do you really need regex?
Arguably, a regex seems an easy way to get what you want here, but it might be grossly inefficient, depending on your use case and input data.
Several other strategies might be easier:
remove unnecessary data first,
use a proper parser for your specific content (here this looks like a mix of a CSV and URL query strings),
don't even bother and cut on appropriate boundaries, if the format is fixed.
Regex are powerful, and can be overly powerful for simple scenarios. Totally fair if it's to run a one-off data extraction script, though, or if the cost/benefit analysis of the development effort is worth it.
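As a sketch of the "proper parser" strategy, the query-string part can be handed to urllib.parse.parse_qs. The helper name item_code and the assumption that the query string ends at the first double quote are mine, based only on the two sample lines.

```python
from urllib.parse import parse_qs

def item_code(line):
    """Hypothetical helper: return the itemCode value, or None if absent."""
    # Assumption: the query string ends at the first double quote
    query = line.split('"', 1)[0]
    # parse_qs maps each key to a list of its values
    values = parse_qs(query).get("itemCode", [])
    return values[0] if values else None

print(item_code('salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"'))
print(item_code('&salespackquantity=1&itemCode=2313441","quantity_box'))
```

If the real data is CSV with embedded URLs, a CSV reader would be the first step and this helper the second.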

a = "example is the int and string 223576"
ext = []
b = "1234567890"
for i in a:
    if i in b:
        ext.append(i)
print(ext)
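The loop above collects the digits one character at a time. If a whole number is wanted instead, either of the following would do it; this is a sketch, and note that str.isdigit accepts more than the ten ASCII digits, which makes no difference for this input.

```python
import re

a = "example is the int and string 223576"

# Join the per-character hits back into a single string
digits = "".join(ch for ch in a if ch.isdigit())

# Or let a regex grab each run of consecutive digits
runs = re.findall(r"\d+", a)

print(digits)
print(runs)
```

The regex version keeps separate numbers apart when a line contains more than one.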

Regular expression for 'b' not preceded by an odd number of 'a's [duplicate]

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.

Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'

For the case you described, use a capture group instead of the lookbehind:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(re.search(p, test_str).group(1))
Unless you need overlapping matches, capturing groups usage instead of a look-behind is rather straightforward.
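To make the limitation concrete, the sketch below tries to compile the lookbehind from the question with Python's re module and then falls back to the capture-group form. The sample string is the one used earlier in this thread.

```python
import re

# The lookbehind from the question: \s? makes its width variable
variable_width = r"(?<=ORIG\s?:\s?/\s?)[A-Z0-9]+"
try:
    re.compile(variable_width)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("re rejected the pattern:", exc)

# Same intent, expressed with a capture group
with_group = re.compile(r"ORIG\s?:\s?/\s?([A-Z0-9]+)")
match = with_group.search("ORIG : / AB123")
account = match.group(1) if match else None
print(account)
```

The third-party regex package is often suggested when variable-width lookbehind is really needed; that is outside this sketch.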

print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str))
You can directly use findall which will return all the groups in the regex if present.
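A short sketch of that behaviour, using a made-up block of text with two entries: with exactly one group in the pattern, findall returns a list of the captured strings (with several groups it returns a list of tuples).

```python
import re

# Hypothetical text containing two ORIG entries with different spacing
block = "ORIG:/AB123 some text ORIG : / ZX9"

accounts = re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", block)
print(accounts)
```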

Python split with regular expression to divide string

I have a need to recover 2 results of a regular expression in Python: what is searched and all else.
For example, in:
"boofums",3,4
I'd like to find what is in the quotes and what isn't:
boofums
,3,4
What I have so far is:
bobbles = '"boofums",3,4'
pickles = re.split(r'\".*\"', bobbles)
morton = re.match(r'\".*\"', bobbles)
print(pickles[1])
print(morton[0])
,3,4
"boofums"
This seems to me insanely inefficient and not Python-esque. Is there a better way to do this? (Sorry for the "is there a better way" construct on StackOverflow, but... I need to do this better! 😂)
...and if you can help me extract just what's in the quotes, something that I'd easily do in Perl or Ruby, all the better!

You're probably best off with regex groupings:
So for your example I'd use something like
regex = re.compile("\"(.*)\"(.*)")
bobble_groups = regex.match(bobbles)
You can then use bobble_groups.group(1) to get just the text inside the quotation marks, and bobble_groups.group(2) to get everything after them.
See named groups if you don't want to depend on an index number.
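A minimal sketch of the named-group variant follows. The group names quoted and rest are my own choice for illustration.

```python
import re

bobbles = '"boofums",3,4'

# (?P<name>...) gives each group a name in addition to its index
regex = re.compile(r'"(?P<quoted>.*)"(?P<rest>.*)')

m = regex.match(bobbles)
if m:
    quoted = m.group("quoted")
    rest = m.group("rest")
    print(quoted)
    print(rest)
```

m.groupdict() would return both values as a dictionary keyed by group name.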

a, b = re.match('"(.*)"(.*)', bobbles).groups()
Parentheses define groups that are "saved" to the match object.

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
The idea is to get the index where "E" is and check whether the two characters after "E" are integers.
Then delete everything after the second integer.
However this will not work if there is an "E" anyplace else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
thanks for everyone answers.
I was going in other direction but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much

Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case

An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
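Since re.match returns None when nothing matches, calling .group(1) directly would raise an AttributeError on a file name without the marker. A sketch that wraps the two steps in a function and handles that case; the name tidy_name is hypothetical.

```python
import re

def tidy_name(s):
    """Hypothetical helper: keep everything up to the first E + two digits."""
    m = re.match(r"(.*?E\d{2})", s)
    if m is None:
        return None  # no "E" followed by two digits in this name
    # Swap the dots for spaces, then title-case each word
    return m.group(1).replace(".", " ").title()

print(tidy_name("this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"))
print(tidy_name("no.marker.here"))
```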

The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # Group 0 is the entire match, so end(0) is the index just past 'Exx'
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)

The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' holds a single character from the list 'l', not an integer index. Comparing it with 'E' works, but i+1 then tries to add 1 to a string, which raises a TypeError.
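For comparison, here is a sketch of what the loop was trying to do, written without a regex: walk the positions rather than the characters, and test the next two characters with str.isdigit. The function name index_of_marker is hypothetical.

```python
def index_of_marker(name):
    """Hypothetical helper: index of the first 'E' followed by two digits, else -1."""
    # Stop two short of the end so idx + 2 is always a valid position
    for idx in range(len(name) - 2):
        if name[idx] == "E" and name[idx + 1].isdigit() and name[idx + 2].isdigit():
            return idx
    return -1

filename = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
pos = index_of_marker(filename)
print(pos, filename[:pos + 3])
```

Iterating over indexes is what makes "the character after this one" available; iterating over the characters themselves does not.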

Split string with caret character in python

I have a huge text file, and each line looks like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part (the "general menu") contains white space, the second part (a subtitle) has its words separated by the "_" character, and the last part is a number (a page number). I want to split each line into its 3 (obvious) parts, because I want to create some sort of directory in Python.
I tried the re module, but since the caret character has a special meaning there, I couldn't figure out how to do it.
Could someone please help me????

>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']

If you only want three pieces, you can get them with a list comprehension that drops the empty strings:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']

What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
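If the re module is still preferred, a sketch of the escaped form: \^ matches a literal caret, and adding + treats a run of carets as one separator, so no dummy field appears.

```python
import re

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"

# \^ is a literal caret; + collapses consecutive carets into one split point
parts = re.split(r"\^+", line)
print(parts)

# re.escape produces the escaped form of a literal delimiter
print(re.escape("^"))
```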

You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that consecutive caret characters produce an empty string between them. You could protect against this either by collapsing consecutive carets down into a single one, or by detecting empty strings in the resulting list.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?

It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
Also, for the re module, any special characters can be escaped with \, like r'\^'. I'd suggest before jumping to use re to 1) learn how to write regular expressions, 2) first look for a solution to your problem instead of jumping to regular expressions - «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. »
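A minimal sketch of the csv route, assuming the file really is caret-delimited. An in-memory io.StringIO stands in for the real file here; with an actual file you would pass the open file object to csv.reader instead.

```python
import csv
import io

# Stand-in for the real file: one sample line from the question
data = "Some sort of general menu^a_sub_menu_title^^pagNumber\n"

# delimiter tells the reader to split fields on the caret
reader = csv.reader(io.StringIO(data), delimiter="^")
rows = list(reader)
print(rows)
```

As with str.split, the doubled caret yields an empty field, but quoted fields containing a caret would be kept together.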
