Python Split With Delimiter In Field Value

I have a "CSV" in which some of the data fields happen to contain the comma delimiter, as in the second row of the following sample data.
"1","stuff","and","things"
"2","black,white","more","stuff"
I can't change the source data and I don't know how to str.split() and not split in the value "black,white".
Ways I've approached my problem:
I looked at partition() and don't see how that would benefit me.
I'm sure a regex would capture data properly but I'm not sure how to tie one into splitting.
Since every row in the source will always have the same number of fields, I thought maybe setting maxsplit would help. I talked myself out of that, thinking it would still split within "black,white" and I would end up losing the last value (which would be "stuff" in this case).
Certainly this is easy to overcome so I'm looking forward to learning something new!
Your help is greatly appreciated.

Using csv and StringIO (shown here for Python 3, where StringIO lives in the io module):
>>> import csv, io
>>> data = """"1","stuff","and","things"
... "2","black,white","more","stuff"
... """
>>> reader = csv.reader(io.StringIO(data))
>>> for row in reader:
...     print(row)
...
['1', 'stuff', 'and', 'things']
['2', 'black,white', 'more', 'stuff']

If your source is not CSV and you just want to balance the quotes in your string, you may try the shlex module:
import shlex
lex = shlex.shlex('"2","black,white","more","stuff"')
# Each token is either a quoted field (quotes included) or a bare comma.
for i in lex:
    print(i)

Commas outside the strings are always followed by double-quotes. Just split on ," instead of just , (or even ",")
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> x.split(',"')
['"2"', 'black,white"', 'more"', 'stuff"']
>>> [y.strip('"') for y in x.split(',"')]
['2', 'black,white', 'more', 'stuff']
Of course, adapt this for efficiency as needed.
YevgenYampolskiy's suggestion of shlex is also an alternative.
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> import shlex
>>> y = shlex.shlex(x)
>>> [i.strip('"') for i in y if i != ',']
['2', 'black,white', 'more', 'stuff']
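
A caveat on the ," shortcut: it seems to rely on every field being wrapped in double quotes. The sketch below compares it with csv.reader on two made-up rows. The rows and the helper names (split_on_comma_quote, split_with_csv) are illustrative assumptions, not part of the question's data.

```python
import csv

def split_on_comma_quote(line):
    # The shortcut from the answer above: split on ," and strip the quotes.
    return [field.strip('"') for field in line.split(',"')]

def split_with_csv(line):
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line]))

# Hypothetical rows, not taken from the question's data.
fully_quoted = '"2","black,white","more","stuff"'
mixed_quoting = '2,"black,white",more,stuff'

print(split_on_comma_quote(fully_quoted))   # both approaches agree here
print(split_with_csv(fully_quoted))
print(split_on_comma_quote(mixed_quoting))  # the shortcut leaves fields glued together
print(split_with_csv(mixed_quoting))
```

If the source is guaranteed to quote every field, the shortcut is fine; otherwise the csv module looks like the safer choice.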

Related

How to split the content of an existing array element in Python

The code looks like this
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"
x = str(txt.split(","))
And the output is
['ID:2020', 'Sugar:3']
Now I want to perform split again such that the output should look like:
['ID', '2020', 'Sugar', '3']

I like the answers provided by people by using the re module. Here is another way to do that:
txt.replace(":", ",").split(",")
Essentially, you're treating ":" as a ",", so why not first replace the ":" with a "," and then split?

Another solution is to use regex
>>> import re
>>> re.split(",|:", txt)
['ID', '2020', 'Sugar', '3', 'cost_sugar', '30', 'ID', '2021', 'Sugar', '5', 'cost_sugar', '50']

Python's re.split accepts a regular expression, so you can split on several delimiters at once:
import re
re.split(':|,',txt)

You can use either snippet below, depending on your project; both are also good examples if you are learning this for the first time. The first one uses nested loops over lists. The second one uses re.
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"
returnlist = []
# Split on commas first, then split each piece on the colon.
for x in txt.split(","):
    for fullkey in x.split(":"):
        returnlist.append(fullkey)
print(returnlist)
OR
import re
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"
# One pass: split on either delimiter.
returnlist = re.split(':|,', txt)
print(returnlist)

You can first collect the keys and values in a list and then print the list.
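
One possible reading of this suggestion, as a minimal sketch: keep each key with its value in an intermediate list, then flatten it. The variable names pairs and flat are assumptions made for illustration.

```python
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"

# Each comma-separated item becomes a [key, value] pair.
pairs = [item.split(":", 1) for item in txt.split(",")]

# Flatten the pairs into the single list the question asks for.
flat = [part for pair in pairs for part in pair]

print(pairs)
print(flat)
```

Keeping the intermediate pairs around can be handy if the keys and values need to be processed separately later.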

Separating Strings and other values with comma as a delimiter

I'm working with a project, where there will be variable holding any data types just separated with a comma.
I need to separate all these things and I also need to define which type it is.
For e.g:
data='"Hello, Hey",123,10.04'
I used the split() function to separate them, but it also splits at the comma inside "Hello, Hey", outputting:
['"Hello', ' Hey"', '123', '10.04']
I don't want this. All I need is to separate the values by commas, but not at the commas inside quotes. The output should be like this:
['"Hello, Hey"','123','10.04']
I killed my brain, but it is still a problem for me. Because I'm a beginner.
Thanks in Advance

I'm struggling to understand your question - it seems you have a string with data inside the string, separated by commas:
data='"Hello, Hey",123,10.04'
You can use the shlex module to split it respecting the quotes
>>> import shlex
>>> s = shlex.shlex(data)
>>> s.whitespace = ','
>>> s.wordchars += '.'
>>> print(list(s))
['"Hello, Hey"', '123', '10.04']

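You may use the re module like so:
import re
# Note: the pattern also matches the empty string at the end of the input,
# so the resulting list ends with an extra '' that you may want to drop.
[m.group(1) or m.group(2) for m in re.finditer(r'"([^"]*)",?|([^,]*),?', '"Hello, Hey",123,10.04')]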

You can use re.findall with regex pattern "[^"]+"|[^,]+:
import re
print(re.findall(r'"[^"]+"|[^,]+', '"Hello, Hey",123,10.04'))
This outputs:
['"Hello, Hey"', '123', '10.04']

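Just use the shlex module. Note that a bare shlex.split(data) splits on whitespace rather than commas, so the lexer has to be told to treat the comma as the separator:
import shlex
data = '"Hello, Hey",123,10.04'
lexer = shlex.shlex(data, posix=True)
lexer.whitespace = ','         # split on commas only
lexer.whitespace_split = True  # everything else belongs to a field
print(list(lexer))
Output:
['Hello, Hey', '123', '10.04']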

You can use re.split to split on a combination of either a double quote before a comma or a comma followed by a digit
import re
data='"Hello, Hey",123,10.04'
re.split(r'(?<="),|,(?=\d)', data)
['"Hello, Hey"', '123', '10.04']
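
None of the answers above touch the second half of the question, working out which type each value is. A minimal sketch, assuming it is acceptable to treat anything that parses as an int or float as a number and everything else as a string. The helpers parse_value and split_typed are hypothetical names, and the csv module is swapped in for the splitting step.

```python
import csv

def parse_value(text):
    # Try the narrowest numeric type first, then fall back to the raw string.
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text

def split_typed(data):
    # csv.reader handles the quoted comma; the quotes themselves are removed.
    fields = next(csv.reader([data]))
    return [parse_value(field) for field in fields]

values = split_typed('"Hello, Hey",123,10.04')
for value in values:
    print(type(value).__name__, repr(value))
```

One limitation of this sketch: because csv strips the quotes, a quoted number such as "123" would also be converted to an int.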

splitting a string based on tab in the file

I have a file that contains values separated by tabs ("\t"). I am trying to create a list and store all the values from the file in it, but I ran into a problem. Here is my code:
line = "abc\tdef\tghi"
values = line.split("\t")
It works fine as long as there is only one tab between each value. But if there is more than one tab, then it copies the tab to values as well. In my case the extra tab will mostly be after the last value in the file.

You can use regex here:
>>> import re
>>> strs = "foo\tbar\t\tspam"
>>> re.split(r'\t+', strs)
['foo', 'bar', 'spam']
update:
You can use str.rstrip to get rid of trailing '\t' and then apply regex.
>>> yas = "yas\t\tbs\tcda\t\t"
>>> re.split(r'\t+', yas.rstrip('\t'))
['yas', 'bs', 'cda']

Split on tab, but then remove all blank matches.
text = "hi\tthere\t\t\tmy main man"
print([splits for splits in text.split("\t") if splits])
Outputs:
['hi', 'there', 'my main man']

You can use regexp to do this:
import re
patt = re.compile("[^\t]+")
s = "a\t\tbcde\t\tef"
patt.findall(s)
['a', 'bcde', 'ef']

Another regex-based solution:
>>> strs = "foo\tbar\t\tspam"
>>> r = re.compile(r'([^\t]*)\t*')
>>> r.findall(strs)[:-1]
['foo', 'bar', 'spam']

Python supports CSV files through the eponymous csv module. The name is something of a misnomer, since the module supports much more than just comma-separated values.
If you need to go beyond basic word splitting, for example because you have to deal with quoted values, you should take a look at it.
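
A minimal sketch of what that could look like for tab-separated data, using the csv module's delimiter parameter. The sample string is made up for illustration; with a real file you would pass the open file object to csv.reader instead of io.StringIO.

```python
import csv
import io

# Hypothetical file content: a doubled tab in the middle and a trailing tab.
sample = "abc\tdef\t\tghi\t\n"

rows = []
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    # Consecutive tabs produce empty fields; drop them if they carry no meaning.
    rows.append([field for field in row if field])

print(rows)
```

Filtering out the empty fields is a design choice: if an empty column is meaningful in your data, keep the row as csv returns it.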

Transform comma separated string into a list but ignore comma in quotes

How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand, where these extra empty strings are coming from, and why the two commas are even being printed at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why, shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.

Instead of a regular expression, you might be better off using the csv module, since what you are dealing with is a CSV string (shown here for Python 3):
from io import StringIO
from csv import reader
file_like_object = StringIO("1,,2,'3,4'")
# quotechar tells the reader that single quotes delimit quoted fields.
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print(row)
This results in the following output:
['1', '', '2', '3,4']

pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', '2', "'3,4'"]
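
On the question of why the asker's own regex detected nothing: the spaces around each | in that pattern are literal characters, so the alternatives only match a comma followed by a space, or a quoted string with a space next to it. A small sketch of the difference, using re.VERBOSE to make the regex engine ignore that whitespace. The variable names are assumptions for illustration.

```python
import re

s = "1,,2'3,4'"

# The pattern exactly as written in the question: the spaces must match literally.
spaced = re.compile(r'''(, | "[^"]*" | '[^']*')''')

# The same pattern with re.VERBOSE, which ignores whitespace outside character classes.
verbose = re.compile(r'''(, | "[^"]*" | '[^']*')''', re.VERBOSE)

print(spaced.split(s))   # nothing matches, so the input comes back as one item
print(verbose.split(s))  # commas and the quoted part are now found (and kept, due to the group)
```

Because the pattern is wrapped in a capturing group, re.split keeps the separators in the result, which also explains the stray commas in the output of the first regex in the question.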

Protect commas on consecutive string.join() and string.split()

Suppose the following code (notice the commas inside the strings):
>>> a = ['1',",2","3,"]
I need to concatenate the values into a single string. Naive example:
>>> b = ",".join(a)
>>> b
'1,,2,3,'
And later I need to split the resulting object again:
>>> b.split(',')
['1', '', '2', '3', '']
However, the result I am looking for is the original list:
['1', ',2', '3,']
What's the simplest way to protect the commas in this process? The best solution I came up with looks rather ugly.
Note: the comma is just an example. The strings can contain any character. And I can choose other characters as separators.

The strings can contain any character.
If every candidate delimiter might also appear inside an item, use the csv module, which quotes such items for you:
import csv
class PseudoFile(object):
    # http://stackoverflow.com/a/8712426/190597
    # A minimal file-like object: write() hands the formatted line back.
    def write(self, string):
        return string
writer = csv.writer(PseudoFile())
This concatenates the items in a using commas:
a = ['1',",2","3,"]
line = writer.writerow(a)
print(line)
# 1,",2","3,"
This recovers a from line:
print(next(csv.reader([line])))
# ['1', ',2', '3,']

Do you have to use commas to separate the items? If not, you could use another symbol that does not occur in the items of the list.
In [1]: '|'.join(['1', ',2', '3,']).split('|')
Out[1]: ['1', ',2', '3,']
Edit: The string may apparently contain any character. Is it an option to use the json module? You could just dump and load the list.
In [2]: import json
In [3]: json.dumps(['1', ',2', '3,'])
Out[3]: '["1", ",2", "3,"]'
In [4]: json.loads('["1", ",2", "3,"]')
Out[4]: [u'1', u',2', u'3,']
Edit #2: If you may not use it, you could use str.encode('string-escape') (Python 2 only; this codec was removed in Python 3) to escape the characters in your string, then enclose the encoded version in single quotes and separate those with commas:
In [10]: print "'example'".encode('string-escape')
\'example\'
In [11]: print r"\'example\'".decode('string-escape')
'example'
Edit #3: Running example of str.encode('string-escape') (Python 2):
import re
def list_to_str(list):
    return ','.join("'{}'".format(s.encode('string-escape')) for s in list)
def str_to_list(str):
    return re.findall(r"'([^']*)'", str)
if __name__ == '__main__':
    a = ['1', ',2', '3,']
    b = list_to_str(a)
    print 'It is {} that this works.'.format(str_to_list(b) == a)

When you serialize a list to a string, you need to choose a separator character that doesn't appear in the list items. Can't you just replace the comma with another character?
b = ";".join(a)
b.split(';')

Does the delimiter need to be only a single character? If not, you can use a delimiter made up of a sequence of characters that definitely won't appear in your strings, like |#| or something similar.

You need to escape the comma and probably also escape the escape sequence. Here's one way:
>>> a = ['1',",2","3,"]
>>> b = ','.join(s.replace('%', '%%').replace(',', '%2c') for s in a)
>>> [s.replace('%2c', ',').replace('%%', '%') for s in b.split(',')]
['1', ',2', '3,']
>>> b
'1,%2c2,3%2c'
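
One edge case in the snippet above: an item that itself contains the text %2c does not survive the round trip, because the decoder turns it into a comma before the doubled percent sign is undone. A sketch of the same escaping idea that leans on the standard library's percent-encoding instead (urllib.parse.quote and unquote). The helper names join_items and split_items are assumptions for illustration.

```python
from urllib.parse import quote, unquote

def join_items(items):
    # safe="" makes quote() escape every reserved character, including ',' and '%'.
    return ",".join(quote(item, safe="") for item in items)

def split_items(joined):
    # After escaping, a literal comma can only be a separator.
    return [unquote(part) for part in joined.split(",")]

a = ['1', ',2', '3,', '%2c', '100%']
b = join_items(a)
print(b)
print(split_items(b) == a)
```

Note that an empty list and a list holding one empty string both serialize to the empty string, so that case would need separate handling if it matters.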

I would join and split using another character than ",", e.g. ";":
>>> b = ";".join(a)
>>> b.split(';')
['1', ',2', '3,']
