Separate numbers from a string - Python

I want to separate the numbers from the string, but when there are successive '1's, I want those separated too.
I think there must be a smart way to solve this question.
s = 'NNNN1234N11N1N123'
expected result is:
['1234','1','1','1','123']

I think what you want can be solved by using the re module:
>>> import re
>>> re.findall('(?:1[2-90]+)|1', 'NNNN1234N11N1N123')
['1234', '1', '1', '1', '123']
EDIT: As suggested in the comments by @CrafterKolyan, the regular expression can be reduced to 1[2-90]*.
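The reduced pattern can be checked quickly (a sketch, using the same sample string as above):

```python
import re

s = 'NNNN1234N11N1N123'
# 1[2-90]* : a '1' followed by any run of digits other than '1',
# so consecutive '1's are matched (and returned) separately.
print(re.findall(r'1[2-90]*', s))  # ['1234', '1', '1', '1', '123']
```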

I would also use regular expressions (the re module), but a different function, namely re.split, in the following way:
import re
s = 'NNNN1234N11N1N123'
output = re.split(r'[^\d]+|(?<=1)(?=1)',s)
print(output) # ['', '1234', '1', '1', '1', '123']
output = [i for i in output if i] # jettison empty strs
print(output) # ['1234', '1', '1', '1', '123']
Explanation: you want to split a str to get a list of strs, which is what re.split is for. The first argument of re.split tells it where the cuts should happen; everything that matches is removed, as long as no capturing groups are used (similar to the str method split). I needed two kinds of places where a cut happens, so I used | (alternation) to tell re.split to cut at either:
[^\d]+ - that is, 1 or more non-digits
(?<=1)(?=1) - that is, an empty str preceded by 1 and followed by 1; here I used a feature named zero-length assertion (twice)
Note that re.split produced '' (an empty str) before your desired output - this means the first cut (NNNN in this case) started at the very beginning of the str. This is expected behavior of re.split. We do not need that information in this case, so we can jettison any empty strs, for which I used a list comprehension. (Also note that re.split only accepts patterns that can match an empty string in Python 3.7 and later.)
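The zero-length assertion pair is the part doing the interesting work; a minimal demonstration (Python 3.7+):

```python
import re

# (?<=1)(?=1) matches the empty position between two adjacent '1's,
# so splitting on it separates consecutive '1's without consuming them.
print(re.split(r'(?<=1)(?=1)', '111'))  # ['1', '1', '1']
```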


Python re.findall returning only first character

Working in Python 3.6, I have a list of html files with date prefixes. I'd like to return all dates, so I join the list and use some regex, like so:
import re
snapshots = ['20180614_SII.html', '20180615_SII.html']
p = re.compile("(\d|^)\d*(?=_)")
snapshot_dates = p.findall(' '.join(snapshots))
snapshot_dates is a list, ['2', '2'], but I'm expecting ['20180614', '20180615']. Demonstration here: https://regexr.com/3r44o. What am I missing?
When a pattern contains a capturing group, re.findall returns what the group matched rather than the whole match, which is why you only get the first digit of each date. You can simplify your pattern to use \d+ instead of (\d|^)\d*:
p = re.compile(r"\d+(?=_)")
print(p.findall(' '.join(snapshots)))
#['20180614', '20180615']
However, in this case you may not need regex to achieve the desired result. You can instead just split the string on _:
print([x.split("_")[0] for x in snapshots])
#['20180614', '20180615']
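A side-by-side check of the two group styles shows the difference (a sketch):

```python
import re

snapshots = ['20180614_SII.html', '20180615_SII.html']
joined = ' '.join(snapshots)

# Capturing group: findall returns only what the group matched.
print(re.findall(r"(\d|^)\d*(?=_)", joined))    # ['2', '2']

# Non-capturing group (?:...): findall returns the whole match again.
print(re.findall(r"(?:\d|^)\d*(?=_)", joined))  # ['20180614', '20180615']
```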

Split string using capture groups

I have two strings
/some/path/to/sequence2.1001.tif
and
/some/path/to/sequence_another_u1_v2.tif
I want to write a function so that both strings can be split up into a list by some regex and joined back together, without losing any characters.
so
def split_by_group(path, re_compile):
    # ...
    return ['the', 'parts', 'here']

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'\.(\d+)\.'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'_[uv](\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']
It's less important that the regex be exactly what I wrote above (but ideally, I'd like the accepted answer to use both). My only criteria are that the split string must be recombinable without losing any characters, and that each of the groups is split in the way I showed above (where the split occurs right at the start/end of the capture group, not of the full match).
I made something with finditer but it's horribly hacky and I'm looking for a cleaner way. Can anyone help me out?
Changed your regex a little bit if you don't mind. Not sure if this works with your other cases.
def split_by_group(path, re_compile):
    l = [s for s in re_compile.split(path) if s]
    l[0:2] = [''.join(l[0:2])]
    return l

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'(\.)(\d+)'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'(_[uv])(\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']
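This works because re.split keeps the text matched by capturing groups in the result list, so nothing is lost and the pieces rejoin to the original path (a quick check):

```python
import re

path = '/some/path/to/sequence_another_u1_v2.tif'
# Captured groups ('_u', '1', ...) appear in the split result.
parts = re.compile(r'(_[uv])(\d+)').split(path)
print(parts)
# ['/some/path/to/sequence_another', '_u', '1', '', '_v', '2', '.tif']
print(''.join(parts) == path)  # True
```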

How to remove falsy values when splitting a string with a non-whitespace separator

According to the docs:
str.split(sep=None, maxsplit=-1)
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']).
Splitting an empty string with a specified separator returns
[''].
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
So to use the keyword argument sep=, is the following the pythonic way to remove the falsy items?
[w for w in s.strip().split(' ') if w]
If it's just whitespace (\s, \t, \n), str.split() will suffice, but say we are splitting on some other character/substring; then the if-condition in the list comprehension is necessary. Is that right?
If you want to be obtuse, you could use filter(None, x) to remove falsey items:
>>> list(filter(None, '1,2,,3,'.split(',')))
['1', '2', '3']
Probably less Pythonic. It might be clearer to iterate over the items specifically:
for w in '1,2,,3,'.split(','):
    if w:
        …
This makes it clear that you're skipping the empty items and not relying on the fact that str.split sometimes skips empty items.
I'd just as soon use a regex, either to skip consecutive runs of the separator (but watch out for the end):
>>> re.split(r',+', '1,2,,3,')
['1', '2', '3', '']
or to find everything that's not a separator:
>>> re.findall(r'[^,]+', '1,2,,3,')
['1', '2', '3']
If you want to go way back in Python's history, there were two separate functions in the string module, split and splitfields. I think the names explain their purpose: the first splits on any whitespace, useful for arbitrary text input, and the second behaves predictably on delimited input. They were implemented in pure Python before v1.6.
Well, I think you might just need a hand in understanding the documentation. Your example pretty much demonstrates the difference between the two algorithms mentioned in the documentation. Not using the sep argument is more or less like using sep=' ' and then throwing out the empty strings. When you have multiple spaces in a row, an explicit sep=' ' treats every single space as a delimiter, so each pair of adjacent spaces delimits an empty string. Keeping those empty strings is good practice on the function's part, because it keeps the return value (a list of strings) predictable: the pieces always rejoin to the original string.
Below shows how a string of 4 spaces is treated differently...
>>> empty = '    '
>>> s = 'this  is an   irritating string with  random   spacing .'
>>> empty.split()
[]
>>> empty.split(' ')
['', '', '', '', '']
For your question, just use split() with no sep argument.
Well, your string s contains more than one whitespace character in a row; that's why s.split(' ') returns empty values. You would have to remove the extra whitespace from the string s to get the desired result.
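One way to "remove extra white space" first, as this answer suggests, is to round-trip through the no-argument split (a sketch, with hypothetical irregular spacing):

```python
s = 'this  is an   irritating string with  random   spacing .'

# split() collapses runs of whitespace; rejoining gives single spaces,
# after which split(' ') produces no empty strings.
normalized = ' '.join(s.split())
print(normalized.split(' '))
# ['this', 'is', 'an', 'irritating', 'string', 'with', 'random', 'spacing', '.']
```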

Better way to parse from regex?

I am doing the following to get the movieID:
>>> x.split('content')
['movieID" ', '="770672122">']
>>> [item for item in x.split('content')[1] if item.isdigit()]
['7', '7', '0', '6', '7', '2', '1', '2', '2']
>>> ''.join([item for item in x.split('content')[1] if item.isdigit()])
'770672122'
What would be a better way to do this?
Without using a regular expression, you could just split by the double quotes and take the next to last field.
u="""movieID" content="7706">"""
u.split('"')[-2] # returns: '7706'
This trick is definitely the most readable, if you don't know about regular expressions yet.
Your string is a bit strange though as there are 3 double quotes. I assume it comes from an HTML file and you're only showing a small substring. In that case, you might make your code more robust by using a regular expression such as:
import re
s = re.search(r'(\d+)', u) # looks for multiple consecutive digits
s.groups() # returns: ('7706',)
You could make it even more robust (but you'll need to read more) by using a DOM-parser such as BeautifulSoup.
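The parsing route can also be sketched with the standard library's html.parser, assuming the substring really comes from a tag attribute such as content="..." (the <meta> tag here is hypothetical markup):

```python
from html.parser import HTMLParser

class ContentGrabber(HTMLParser):
    """Collect every 'content' attribute seen while parsing."""
    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        value = dict(attrs).get('content')
        if value is not None:
            self.values.append(value)

grabber = ContentGrabber()
grabber.feed('<meta name="movieID" content="770672122">')
print(grabber.values)  # ['770672122']
```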
I assume x looks like this:
x = 'movieID content="770672122">'
Regex is definitely one way to extract the content. For example:
>>> re.search(r'content="(\d+)', x).group(1)
'770672122'
The above fetches one or more consecutive digits which follow the string content=".
It seems you could do something like the following if your string is like the one below:
>>> import re
>>> x = 'movieID content="770672122">'
>>> re.search(r'\d+', x).group()
'770672122'

Transform comma separated string into a list but ignore comma in quotes

How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand, where these extra empty strings are coming from, and why the two commas are even being printed at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why, shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.
Instead of a regular expression, you might be better off using the csv module since what you are dealing with is a CSV string:
from cStringIO import StringIO
from csv import reader
file_like_object = StringIO("1,,2,'3,4'")
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print row
This results in the following output:
['1', '', '2', '3,4']
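The snippet above is Python 2; in Python 3 the same approach uses io.StringIO and the print function (a sketch):

```python
from csv import reader
from io import StringIO

# quotechar="'" makes the csv module treat '3,4' as one quoted field.
file_like_object = StringIO("1,,2,'3,4'")
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print(row)  # ['1', '', '2', '3,4']
```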
pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print commaSeparatedList.parseString(s).asList()
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print commaSeparatedList.parseString(s).asList()
['1', '', '2', "'3,4'"]
