Better way to parse from regex? - python

I am doing the following to get the movieID:
>>> x.split('content')
['movieID" ', '="770672122">']
>>> [item for item in x.split('content')[1] if item.isdigit()]
['7', '7', '0', '6', '7', '2', '1', '2', '2']
>>> ''.join([item for item in x.split('content')[1] if item.isdigit()])
'770672122'
What would be a better way to do this?

Without using a regular expression, you could just split by the double quotes and take the next to last field.
u="""movieID" content="7706">"""
u.split('"')[-2] # returns: '7706'
This trick is definitely the most readable option if you don't know regular expressions yet.
Your string is a bit strange, though, as it contains three double quotes. I assume it comes from an HTML file and you're only showing a small substring. In that case, you might make your code more robust with a regular expression such as:
import re
s = re.search(r'(\d+)', u)  # looks for one or more consecutive digits
s.groups()  # returns: ('7706',)
You could make it even more robust (but you'll need to read more) by using a DOM-parser such as BeautifulSoup.
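To sketch what that DOM-style approach looks like without pulling in an external dependency, here is a minimal version using the standard library's html.parser; the snippet and attribute layout below are made up for illustration, and BeautifulSoup would do the same job with a friendlier API:

```python
from html.parser import HTMLParser

# Hypothetical snippet; the real page's tags and attributes may differ.
html = '<meta name="movieID" content="770672122">'

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Collect every content="..." attribute we encounter
        for name, value in attrs:
            if name == 'content':
                self.values.append(value)

parser = ContentExtractor()
parser.feed(html)
print(parser.values)  # ['770672122']
```

Unlike a bare regex, a parser keeps working if the attribute order changes or extra attributes appear in the tag.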

I assume x looks like this:
x = 'movieID content="770672122">'
Regex is definitely one way to extract the content. For example:
>>> re.search(r'content="(\d+)', x).group(1)
'770672122'
The above fetches one or more consecutive digits which follow the string content=".

It seems you could do something like the following if your string is like the one below:
>>> import re
>>> x = 'movieID content="770672122">'
>>> re.search(r'\d+', x).group()
'770672122'

Related

Separate number from the string

Separate the numbers from the string; but when there are consecutive '1's, separate them individually.
I think there must be a smart way to solve this question.
s = 'NNNN1234N11N1N123'
expected result is:
['1234','1','1','1','123']
I think what you want can be solved using the re module:
>>> import re
>>> re.findall('(?:1[2-90]+)|1', 'NNNN1234N11N1N123')
EDIT: As suggested in the comments by @CrafterKolyan, the regular expression can be reduced to 1[2-90]*.
Outputs
['1234', '1', '1', '1', '123']
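For reference, the reduced pattern behaves the same way; note that the character class [2-90] is simply every digit except 1 (the same set as [02-9]):

```python
import re

s = 'NNNN1234N11N1N123'
# A 1 followed by zero or more digits other than 1,
# so consecutive 1s start separate matches.
print(re.findall('1[2-90]*', s))  # ['1234', '1', '1', '1', '123']
```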
I would also use regular expressions (the re module), but a different function, namely re.split, in the following way:
import re
s = 'NNNN1234N11N1N123'
output = re.split(r'[^\d]+|(?<=1)(?=1)',s)
print(output) # ['', '1234', '1', '1', '1', '123']
output = [i for i in output if i] # jettison empty strs
print(output) # ['1234', '1', '1', '1', '123']
Explanation: you want to split a str to get a list of strs, which is what re.split is for. The first argument of re.split tells it where the splits should happen, and everything it matches is removed (similar to the str method split) unless capturing groups are used. I needed to specify two kinds of places to cut, so I used | (alternation) and told re.split to cut at:
[^\d]+, that is, 1 or more non-digits
(?<=1)(?=1), that is, an empty str preceded by 1 and followed by 1; here I used a feature called zero-length assertions (twice)
Note that re.split produced '' (an empty str) before your desired output. This means the first cut (NNNN in this case) started at the very beginning of the str. This is expected behavior of re.split; since we do not need that information here, we can jettison any empty strs, for which I used a list comprehension.

Python re.findall returning only first character

Working in Python 3.6, I have a list of html files with date prefixes. I'd like to return all dates, so I join the list and use some regex, like so:
import re
snapshots = ['20180614_SII.html', '20180615_SII.html']
p = re.compile("(\d|^)\d*(?=_)")
snapshot_dates = p.findall(' '.join(snapshots))
snapshot_dates is a list, ['2', '2'], but I'm expecting ['20180614', '20180615']. Demonstration here: https://regexr.com/3r44o. What am I missing?
You can simplify your pattern to use \d+ instead of (\d|^)\d*. The reason you only got the first character is that when a pattern contains a capturing group, re.findall returns what the group matched rather than the whole match, and your group (\d|^) captures only a single digit:
p = re.compile(r"\d+(?=_)")
print(p.findall(' '.join(snapshots)))
#['20180614', '20180615']
However, in this case you may not need regex to achieve the desired result. You can instead just split the string on _:
print([x.split("_")[0] for x in snapshots])
#['20180614', '20180615']
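To see the capture-group behavior directly, compare findall with and without a group on the same input:

```python
import re

s = '20180614_SII.html 20180615_SII.html'

# With a capturing group, findall returns only the group's text:
print(re.findall(r'(\d)\d*(?=_)', s))  # ['2', '2']

# Without one, it returns the whole match:
print(re.findall(r'\d+(?=_)', s))      # ['20180614', '20180615']
```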

Split string using capture groups

I have a two strings
/some/path/to/sequence2.1001.tif
and
/some/path/to/sequence_another_u1_v2.tif
I want to write a function so that both strings can be split up into a list by some regex and joined back together, without losing any characters.
so
def split_by_group(path, re_compile):
    # ...
    return ['the', 'parts', 'here']

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'\.(\d+)\.'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'_[uv](\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']
It's less important that the regex be exactly what I wrote above (but ideally, I'd like the accepted answer to use both). My only criteria are that the split string must be recombinable without losing any characters, and that each of the groups splits in the way I showed above (where the split occurs right at the start/end of the capture group, not the full match).
I made something with finditer but it's horribly hacky and I'm looking for a cleaner way. Can anyone help me out?
Changed your regex a little bit if you don't mind. Not sure if this works with your other cases.
def split_by_group(path, re_compile):
    l = [s for s in re_compile.split(path) if s]
    l[0:2] = [''.join(l[0:2])]
    return l

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'(\.)(\d+)'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'(_[uv])(\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']

Separate the string in Python, excluding some elements which contain separator

I have a really ugly string like this:
# ugly string follows:
ugly_string1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
# which may also look like this (the part within quotes differs):
ugly_string2 = 'SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME'
and I'd like to separate it to get this list in Python:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
# or from the second string:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"kWh"', '0', 'ENDTIME']
The first element (SVEF/XX:1) will always be the same, but the fourth element might or might not have the separator character in it (/).
I came up with regex which isolates the 1st and the 4th element (example here):
(?=(SVEF/XX:1))|(?=("(.*?)"))
but I just cannot figure out how to separate the rest of the string by / character, while excluding those two isolated elements?
I can do it with more "manual" approach, with regex like this (example here):
([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)
but when I try this out in Python, I get extra empty elements for some reason:
['', 'SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME', '']
I could sanitize this result afterwards, but it would be great if I separate those strings without extra interventions.
In Python, this can be done more easily (and with more room to generalize or adapt the approach in the future) with successive uses of split() and rsplit():
ugly_string = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
temp = ugly_string.split("/", maxsplit=4)
result = [ temp[0]+"/"+temp[1] ] + temp[2:-1] + temp[-1].rsplit("/", maxsplit=2)
print(result)
Prints:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
I use the second argument of split/rsplit to limit how many slashes are split. I first split off as many parts from the left as possible (i.e., 4) and rejoin parts 0 and 1 (the SVEF and XX:1). I then use rsplit() to make the remaining splits from the right. What's left in the middle is the quoted field, regardless of what it contains.
Rejoining the first two parts isn't too elegant, but neither is a format that allows / to appear both as a field separator and inside an unquoted field.
You can use re.findall testing first the quoted parts and making the beginning optional in the second branch:
re.findall(r'(?:^|/)("[^"]*"|(?:^[^/]*/)?[^/"]*)', s)
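Applied to the string from the question, that pattern yields the desired fields:

```python
import re

s = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
# Each field starts at the beginning of the string or after a slash;
# a field is either a quoted chunk or a run of non-slash characters,
# with the first alternative absorbing the fixed SVEF/XX:1 prefix.
print(re.findall(r'(?:^|/)("[^"]*"|(?:^[^/]*/)?[^/"]*)', s))
# ['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
```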
Python's csv module can handle multiple different delimiters, if you're ok with reinserting the " in the field where it seems to always exist, and reassembling the first field.
If you have a string, and want to treat it as a csv file, you can do this to prepare:
>>> from io import StringIO
>>> import csv
>>> ugly_string1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
>>> f = StringIO(ugly_string1)
Otherwise, assuming f is an open file, or the object we just created above:
>>> reader = csv.reader(f, delimiter='/')
>>> for row in reader:
...     print(row)
...
['SVEF', 'XX:1', '60', '24.02.16 07:30:00', 'isk/kWh', '0', 'ENDTIME']
>>> first = "/".join(row[0:2])
Thank you all for your answers; they are all good and very helpful! However, after trying to test the performance of each one, I came up with surprising results. You can take a look here,
but essentially, the timeit module ended up every time with results similar to this:
============================================================
example from my question:
0.21345195919275284
============================================================
Tushar's comment on my question:
0.21896087005734444
============================================================
alexis' answer (although not a completely correct answer):
0.2645496800541878
============================================================
Casimir et Hippolyte's answer:
0.3663317859172821
============================================================
Simon Fraser's csv answer:
1.398559506982565
So, I decided to stick with my own example:
([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)
but I'll reward your efforts nevertheless!
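For reference, comparisons like the above can be reproduced with the timeit module; the sketch below times the question's own pattern, and the iteration count is arbitrary rather than the original benchmark's:

```python
import timeit

setup = '''
import re
s = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
pattern = re.compile(r'([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)')
'''
stmt = 'pattern.split(s)'

# Total seconds for 10,000 runs; absolute numbers depend on the machine.
print(timeit.timeit(stmt, setup=setup, number=10_000))
```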

Transform comma separated string into a list but ignore comma in quotes

How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand where these extra empty strings are coming from, and why the two commas are even being printed at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why, shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.
Instead of a regular expression, you might be better off using the csv module since what you are dealing with is a CSV string:
from io import StringIO
from csv import reader

file_like_object = StringIO("1,,2,'3,4'")
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print(row)
This results in the following output:
['1', '', '2', '3,4']
pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', '2', "'3,4'"]
