Separate a string in Python, excluding some elements which contain the separator

I have a really ugly string like this:
# ugly string follows:
ugly_string1 = SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME
# which also may look like this (part within quotes is different):
ugly_string2 = SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME
and I'd like to separate it to get this list in Python:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
# or from the second string:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"kWh"', '0', 'ENDTIME']
The first element (SVEF/XX:1) will always be the same, but the fourth element might or might not have the separator character in it (/).
I came up with regex which isolates the 1st and the 4th element (example here):
(?=(SVEF/XX:1))|(?=("(.*?)"))
but I just cannot figure out how to separate the rest of the string by the / character while excluding those two isolated elements?
I can do it with more "manual" approach, with regex like this (example here):
([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)
but when I try this out in Python, I get extra empty elements for some reason:
['', 'SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME', '']
I could sanitize this result afterwards, but it would be great if I separate those strings without extra interventions.
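For what it's worth, the extra empty elements are what re.split() produces when the pattern matches the entire string: the (empty) text before and after the match is kept alongside the captured groups. Using re.match() and taking the groups directly avoids them:

```python
import re

s = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
pattern = r'([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)'

# re.split() keeps the (empty) text before and after the match, hence the '' elements:
print(re.split(pattern, s))
# Matching and taking the groups directly avoids them:
print(list(re.match(pattern, s).groups()))
```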

In Python, this can be done more easily (and with more room to generalize or adapt the approach in the future) with successive uses of split() and rsplit().
ugly_string = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
temp = ugly_string.split("/", maxsplit=4)
result = [ temp[0]+"/"+temp[1] ] + temp[2:-1] + temp[-1].rsplit("/", maxsplit=2)
print(result)
Prints:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
I use the maxsplit argument of split/rsplit to limit how many slashes are split on;
I first split four parts off the left, and rejoin parts 0 and 1
(the SVEF and XX:1). I then use rsplit() to make the remaining splits from the right. What's left in the middle is the quoted field, regardless of what it contains.
Rejoining the first two parts isn't too elegant, but neither is a format that allows / to appear both as a field separator and inside an unquoted field.
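Wrapped as a function (the name split_ugly is mine), the same approach handles both sample strings:

```python
def split_ugly(s):
    """Split on '/' while keeping the SVEF/XX:1 prefix and the quoted field intact."""
    head = s.split("/", maxsplit=4)          # peel four fields off the left
    tail = head[-1].rsplit("/", maxsplit=2)  # peel two fields off the right
    return [head[0] + "/" + head[1]] + head[2:-1] + tail

print(split_ugly('SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'))
print(split_ugly('SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME'))
```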

You can use re.findall testing first the quoted parts and making the beginning optional in the second branch:
re.findall(r'(?:^|/)("[^"]*"|(?:^[^/]*/)?[^/"]*)', s)
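A quick check of that pattern against both sample strings:

```python
import re

pattern = r'(?:^|/)("[^"]*"|(?:^[^/]*/)?[^/"]*)'
s1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
s2 = 'SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME'

# The first branch grabs quoted fields whole; the optional ^[^/]*/ in the
# second branch lets the very first field keep its internal slash.
print(re.findall(pattern, s1))
print(re.findall(pattern, s2))
```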

Python's csv module can handle a non-standard delimiter like /, if you're ok with reinserting the " in the field where it seems to always exist, and reassembling the first field.
If you have a string, and want to treat it as a csv file, you can do this to prepare:
>>> import io
>>> import csv
>>> ugly_string1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
>>> f = io.StringIO(ugly_string1)
Otherwise, assuming f is an open file, or the object we just created above:
>>> reader = csv.reader(f, delimiter='/')
>>> for row in reader:
...     print(row)
...
['SVEF', 'XX:1', '60', '24.02.16 07:30:00', 'isk/kWh', '0', 'ENDTIME']
>>> first = "/".join(row[0:2])
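The reassembly and quote reinsertion mentioned above can be sketched like this (my sketch of the post-processing, not the answer's own code):

```python
import csv
import io

ugly_string1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
row = next(csv.reader(io.StringIO(ugly_string1), delimiter='/'))

# Reassemble the first field and put back the quotes csv stripped from field 3
row[0:2] = ["/".join(row[0:2])]
row[3] = '"%s"' % row[3]
print(row)
```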

Thank you all for your answers; they are all good and very helpful! However, after testing the performance of each one I came up with surprising results. You can take a look here,
but essentially, the timeit module ended up every time with results similar to this:
============================================================
example from my question:
0.21345195919275284
============================================================
Tushar's comment on my question:
0.21896087005734444
============================================================
alexis' answer (although not completely correct):
0.2645496800541878
============================================================
Casimir et Hippolyte's answer:
0.3663317859172821
============================================================
Simon Fraser's csv answer:
1.398559506982565
So, I decided to stick with my own example:
([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)
but I'll reward your efforts nevertheless!
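For reference, timings along these lines can be reproduced with the timeit module; the setup below is my reconstruction, not the exact benchmark script:

```python
import timeit

setup = '''
import re
s = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
pattern = re.compile(r'([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)')
'''

# Time 100,000 runs of the precompiled-regex approach
t = timeit.timeit('pattern.match(s).groups()', setup=setup, number=100000)
print(t)
```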

Related

Splitting a string every 2 digits

I have a column existing of rows with different strings (Python). ex.
5456656352
435365
46765432
...
I want to separate the strings every 2 digits with a comma, so I have the following result:
54,56,65,63,52
43,53,65
46,76,54,32
...
Can someone help me please.
Try:
text = "5456656352"
print(",".join(text[i:i + 2] for i in range(0, len(text), 2)))
output:
54,56,65,63,52
You can wrap it into a function if you want to apply it to a DF or ...
note: This will separate from left, so if the length is odd, there will be a single number at the end.
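Such a wrapper (pair_up is a name I made up) could look like this; for a DataFrame column you would pass it to .apply():

```python
def pair_up(text):
    """Insert a comma after every two characters, left to right."""
    return ",".join(text[i:i + 2] for i in range(0, len(text), 2))

print([pair_up(t) for t in ["5456656352", "435365", "46765432"]])
```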
Not sure about the structure of desired output (pandas and dataframes, pure strings, etc.). But, you can always use a regex pattern like:
import re
re.findall(r"\d{2}", "5456656352")
Output
['54', '56', '65', '63', '52']
You can have this output as a string too:
",".join(re.findall(r"\d{2}", "5456656352"))
Output
54,56,65,63,52
Explanation
\d{2} is a regex pattern that matches a part of a string containing exactly 2 digits. Using the findall function, this pattern divides each string into elements of two digits each.
Edit
Based on your comment, you want to APPLY this on a column. In this case, you should do something like:
df["my_column"] = df["my_column"].apply(split_it)
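split_it here is a helper referred to in the comments but not shown; a plausible definition (my assumption, not the original) would be the findall approach from above, which .apply() then runs on each cell:

```python
import re

def split_it(text):
    # Hypothetical helper: join every two-digit run with commas
    return ",".join(re.findall(r"\d{2}", str(text)))

# On a plain list; df["my_column"].apply(split_it) works the same way per cell
print([split_it(s) for s in ["5456656352", "435365"]])
```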

Split string using capture groups

I have two strings
/some/path/to/sequence2.1001.tif
and
/some/path/to/sequence_another_u1_v2.tif
I want to write a function so that both strings can be split up into a list by some regex and joined back together, without losing any characters.
so
def split_by_group(path, re_compile):
    # ...
    return ['the', 'parts', 'here']

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'\.(\d+)\.'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'_[uv](\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']
It's less important that the regex be exactly what I wrote above (but ideally, I'd like the accepted answer to use both). My only criteria are that the split string must be combinable without losing any digits and that each of the groups split in the way that I showed above (where the split occurs right at the start/end of the capture group and not the full string.
I made something with finditer but it's horribly hacky and I'm looking for a cleaner way. Can anyone help me out?
Changed your regex a little bit if you don't mind. Not sure if this works with your other cases.
import re

def split_by_group(path, re_compile):
    l = [s for s in re_compile.split(path) if s]
    l[0:2] = [''.join(l[0:2])]
    return l

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'(\.)(\d+)'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'(_[uv])(\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']

Better way to parse from regex?

I am doing the following to get the movieID:
>>> x.split('content')
['movieID" ', '="770672122">']
>>> [item for item in x.split('content')[1] if item.isdigit()]
['7', '7', '0', '6', '7', '2', '1', '2', '2']
>>> ''.join([item for item in x.split('content')[1] if item.isdigit()])
'770672122'
What would be a better way to do this?
Without using a regular expression, you could just split by the double quotes and take the next to last field.
u="""movieID" content="7706">"""
u.split('"')[-2] # returns: '7706'
This trick is definitely the most readable, if you don't know about regular expressions yet.
Your string is a bit strange though as there are 3 double quotes. I assume it comes from an HTML file and you're only showing a small substring. In that case, you might make your code more robust by using a regular expression such as:
import re
s = re.search(r'(\d+)', u) # looks for multiple consecutive digits
s.groups() # returns: ('7706',)
You could make it even more robust (but you'll need to read more) by using a DOM-parser such as BeautifulSoup.
I assume x looks like this:
x = 'movieID content="770672122">'
Regex is definitely one way to extract the content. For example:
>>> re.search(r'content="(\d+)', x).group(1)
'770672122'
The above fetches one or more consecutive digits which follow the string content=".
It seems you could do something like the following if your string is like the below:
>>> import re
>>> x = 'movieID content="770672122">'
>>> re.search(r'\d+', x).group()
'770672122'

Python: Checking a list with regex, filling in blanks

I've tried to find ways to do this and searched online here, but cannot find examples to help me figure this out.
I'm reading in rows from a large csv and changing each row to a list. The problem is that the data source isn't very clean. It has empty strings or bad data sometimes, and I need to fill in default values when that happens. For example:
list_ex1 = ['apple','9','','2012-03-05','455.6']
list_ex2 = ['pear','0','45','wrong_entry','565.11']
Here, list_ex1 has a blank third entry and list_ex2 has erroneous data where a date should be. To be clear, I can create a regex that limits what each of the five entries should be:
reg_ex_check = ['[A-Za-z]+','[0-9]','[0-9]','[0-9]{4}-[0-1][0-9]-[0-3][0-9]','[0-9.]+']
That is:
1st entry: A string, no numbers
2nd entry: Exactly one digit between 0 and 9
3rd entry: Exactly one digit as well.
4th entry: Date in standard format (allowing any four digit ints for year)
5th entry: Float
If an entry is blank OR does not match the regular expression, then it should be filled in/replaced with the following defaults:
default_fill = ['empty','0','0','2000-01-01','0']
I'm not sure of the best way to go about this. I think I could write a complicated loop, but it doesn't feel very 'pythonic' to me to do such things.
Any better ideas?
Use zip and a conditional expression in a list comprehension:
[x if re.match(r,x) else d for x,r,d in zip(list_ex2,reg_ex_check,default_fill)]
Out[14]: ['pear', '0', '45', '2000-01-01', '565.11']
You don't really need to explicitly check for blank strings since your various regexen (plural of regex) will all fail on blank strings.
Other note: you probably still want to add an anchor for the end of your string to each regex. Using re.match ensures that it tries to match from the start, but still provides no guarantee that there is not illegal stuff after your match. Consider:
['pear and a pear tree', '0blah', '4 4 4', '2000-01-0000', '192.168.0.bananas']
The above entire list is "acceptable" if you don't add a $ anchor to the end of each regex :-)
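To illustrate, here's the check with and without the $ anchor on that list (the anchored variant is my addition):

```python
import re

bad = ['pear and a pear tree', '0blah', '4 4 4', '2000-01-0000', '192.168.0.bananas']
patterns = ['[A-Za-z]+', '[0-9]', '[0-9]', '[0-9]{4}-[0-1][0-9]-[0-3][0-9]', '[0-9.]+']
defaults = ['empty', '0', '0', '2000-01-01', '0']

# Unanchored: every bad value matches a prefix and slips through
loose = [x if re.match(r, x) else d for x, r, d in zip(bad, patterns, defaults)]
# Anchored with $: every bad value is replaced by its default
strict = [x if re.match(r + '$', x) else d for x, r, d in zip(bad, patterns, defaults)]
print(loose)
print(strict)
```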
What about something like this? (Note: this uses a Python 2 tuple-unpacking lambda, which was removed in Python 3.)
map(lambda (x, y, z): re.search(y, x) and x or z, zip(list_ex1, reg_ex_check, default_fill))

Transform comma separated string into a list but ignore comma in quotes

How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand where these extra empty strings are coming from, or why the two commas are being printed at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why, shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.
Instead of a regular expression, you might be better off using the csv module since what you are dealing with is a CSV string:
from io import StringIO
from csv import reader

file_like_object = StringIO("1,,2,'3,4'")
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print(row)
This results in the following output:
['1', '', '2', '3,4']
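If you'd rather stay with a regex, a pattern along these lines (my sketch; the quote-stripping afterwards is naive) produces the same list:

```python
import re

s = "1,,2,'3,4'"
# Each field is either a quoted run (commas allowed inside) or a comma-free run
parts = [p.strip("'") for p in re.findall(r"(?:^|,)('[^']*'|[^,]*)", s)]
print(parts)
```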
pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', '2', "'3,4'"]
