Python csv module splitting strings, not just fields


When I run this input (saved as variable 'line'):
xsc_i,202,"House of Night",21,"/21_202"
through a csv reader:
for row in csv.reader(line):
    print row
it splits the strings, not just the fields
['x']
['s']
['c']
['_']
['i']
['', '']
['2']
['0']
['2']
['', '']
etc.
It exhibits this behavior even if I explicitly set the delimiter:
csv.reader(line, delimiter=",")
It's treating even strings as arrays, but I can't figure out why, and I can't just split on commas because many commas are inside "" strings in the input.
Python 2.7, if it matters.

The first argument to csv.reader() is expected to be an iterable object containing CSV rows. In your case the input is a string (which is also iterable) containing a single row, so the reader treats every character as a separate row. You need to wrap the line in a list:
for row in csv.reader([line]):
    print row
Demo:
>>> import csv
>>> line = 'xsc_i,202,"House of Night",21,"/21_202"'
>>> for row in csv.reader([line]):
...     print row
...
['xsc_i', '202', 'House of Night', '21', '/21_202']
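For readers on Python 3, the same fix can be sketched as follows. The helper name parse_csv_line is hypothetical and introduced only for illustration; the only library call involved is csv.reader applied to a one-element list.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV-formatted string into a list of fields (hypothetical helper)."""
    # Wrapping the string in a list makes csv.reader treat it as a single row
    # instead of iterating over it character by character.
    return next(csv.reader([line]))

line = 'xsc_i,202,"House of Night",21,"/21_202"'
fields = parse_csv_line(line)
print(fields)
```

Using next() here assumes the input holds exactly one row; for several lines, iterate over the reader instead.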

Just in case you want to see re in action.
import re
line='xsc_i,202,"House of Night",21,"/21_202"'
print map(lambda x:x.strip('"'),re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)',line))
Output: ['xsc_i', '202', 'House of Night', '21', '/21_202']
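A Python 3 rendering of the same regular expression might look like the sketch below. The names COMMA_OUTSIDE_QUOTES and split_outside_quotes are made up for illustration. The pattern assumes that quotes are balanced and that fields never contain escaped or doubled quotes; when that is not guaranteed, the csv module is probably the safer choice.

```python
import re

# Match a comma only when an even number of double quotes follows it,
# which means the comma sits outside any quoted field.
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_outside_quotes(line):
    """Split on unquoted commas and strip the surrounding double quotes."""
    return [field.strip('"') for field in COMMA_OUTSIDE_QUOTES.split(line)]

result = split_outside_quotes('xsc_i,202,"House of Night",21,"/21_202"')
print(result)
```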

This is because csv.reader expects "any object which supports the iterator protocol and returns a string each time its next() method is called".
You have passed a string to the reader, and iterating over a string yields one character at a time.
If you say:
line = ['xsc_i,202,"House of Night",21,"/21_202"',]
your code should work as expected.
Please see the docs for csv.reader.
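To see the difference concretely, here is a small sketch in Python 3 syntax. The sample string 'ab,c' is made up for illustration; it is passed to csv.reader once as a bare string and once wrapped in a list.

```python
import csv

sample = 'ab,c'

# A bare string is iterated character by character, so each character
# is parsed as its own one-line "file".
rows_from_string = list(csv.reader(sample))

# A list with one string is iterated once, yielding the whole line.
rows_from_list = list(csv.reader([sample]))

print(rows_from_string)
print(rows_from_list)
```

The lone ',' character parses as a row of two empty fields, which is where the ['', ''] entries in the question's output come from.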

Related

I need double quotes around datas from CSV [duplicate]

I have a list:
my_list = ['"3"', '"45"','"12"','"6"']
This list has single and double quotes around each item value. How can I remove either the single or the double quotes from each item? I tried the code below, but the result is the same:
my_list = [i.replace("''", " ") for i in my_list]

Your list doesn't contain any strings with single quotes. I think you are confusing the repr() representation of the strings with their values.
When you print a Python standard library container such as a list (or a tuple, set, dictionary, etc.), the contents of the container are shown using their repr() representation; this is great when debugging because it makes it clear what type of objects you have. For strings, the representation uses valid Python string literal syntax; you can copy the output and paste it into another Python script or the interactive interpreter and you'll get the exact same value.
For example, s here is a string that contains some text, some quote characters, and a newline character. When I print the string, the newline character causes an extra blank line to be printed, but when I use repr(), you get the string value in Python syntax form, where the single quotes are part of the syntax, not the value. Note that the newline character also is shown with the \n syntax, exactly the same as when I created the s string in the first place:
>>> s = 'They heard him say "Hello world!".\n'
>>> print(s)
They heard him say "Hello world!".
>>> print(repr(s))
'They heard him say "Hello world!".\n'
>>> s
'They heard him say "Hello world!".\n'
And when I echoed the s value at the end, the interactive interpreter also shows me the value using the repr() output.
So in your list, your strings do not have the ' characters as part of the value. They are part of the string syntax. You only need to replace the " characters, they are part of the value, because they are inside the outermost '...' string literal syntax. You could use str.replace('"', '') to remove them:
[value.replace('"', '') for value in my_list]
or, you could use the str.strip() method to only remove quotes that are at the start or end of the value:
[value.strip('"') for value in my_list]
Both work just fine for your sample list:
>>> my_list = ['"3"', '"45"','"12"','"6"']
>>> [value.replace('"', '') for value in my_list]
['3', '45', '12', '6']
>>> [value.strip('"') for value in my_list]
['3', '45', '12', '6']
Again, the ' characters are not part of the value:
>>> first = my_list[0].strip('"')
>>> first # echo, uses repr()
'3'
>>> print(first) # printing, the actual value written out
3
>>> len(first) # there is just a single character in the string
1
However, I have seen that you are reading your data from a tab-separated file that you hand-parse. You can avoid having to deal with the " quotes altogether if you instead use a csv.reader() object, configured to handle tabs as the delimiter. The reader handles quoted columns automatically:
import csv
with open(inputfile, 'r', newline='') as datafile:
    reader = csv.reader(datafile, delimiter='\t')
    for row in reader:
        # row is a list with strings, *but no quotes*
        # e.g. ['3', '45', '12', '6']
        print(row)
Demo showing how csv.reader() handles quotes:
>>> import csv
>>> lines = '''\
... "3"\t"45"\t"12"\t"6"
... "42"\t"81"\t"99"\t"11"
... '''.splitlines()
>>> reader = csv.reader(lines, delimiter='\t')
>>> for row in reader:
...     print(row)
...
['3', '45', '12', '6']
['42', '81', '99', '11']

As suggested by @MartijnPieters in the comments, you can use replace on the strings to get the desired output.
The change I would suggest is using .replace('"', '') instead of .replace('"', ' '). Otherwise the resulting strings will have leading and trailing whitespace.
You can use a list comprehension to process the list you have, like this:
my_list = ['"3"', '"45"','"12"','"6"']
new_list = [x.replace('"', '') for x in my_list]
print(new_list) # ['3', '45', '12', '6']

You can use split:
[x.split('"')[1] for x in my_list]
or you can use:
[x.strip('"') for x in my_list]

Remove blank string value from a list of strings

I am reading string information as input from a text file and placing them into lists, and one of the lines is like this:
30121,long,Mehtab,10,20,,30
I want to remove the empty value in between the ,, portion from this list, but have had zero results. I've tried .remove() and filter(). Python reads it as a 'str' value.

>>> import re
>>> re.sub(',,+', ',', '30121,long,Mehtab,10,20,,30')
'30121,long,Mehtab,10,20,30'

Use split() and remove()
In [11]: s = '30121,long,Mehtab,10,20,,30'
In [14]: l = s.split(',')
In [15]: l.remove('')
In [16]: l
Out[16]: ['30121', 'long', 'Mehtab', '10', '20', '30']
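One thing to keep in mind is that list.remove() deletes only the first matching element. If a line may contain several empty fields, a list comprehension is one way to drop all of them. The sketch below uses a made-up variant of the sample line with two empty fields.

```python
# Hypothetical variant of the sample line with two empty fields.
s = '30121,long,Mehtab,10,,20,,30'
fields = s.split(',')

# list.remove() drops only the first '' it finds.
after_remove = list(fields)
after_remove.remove('')

# A comprehension keeps every field that is not the empty string.
non_empty = [field for field in fields if field != '']

print(after_remove)
print(non_empty)
```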

filter() should work. First put the data in a list, then use filter() to keep only the items that are not empty:
data = ["30121", "long", "Mehtab", 10, 20, "", 30]
filtered_data = list(filter(lambda item: item != '', data))
print(filtered_data)

You can split the string on your separator ("," here) and then use a list comprehension to keep only the non-blank elements before joining them again.
",".join([element for element in string.split(",") if element])
Use element.strip() as the if condition instead if you also want to filter out strings that contain only spaces.
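As a rough sketch of the strip() variant, the function below drops fields that are empty or contain only whitespace. The name drop_blank_fields and its sep parameter are assumptions made for this example.

```python
def drop_blank_fields(text, sep=","):
    """Rejoin the fields of text, skipping empty or whitespace-only ones."""
    # field.strip() is '' (falsy) for both '' and '   ', so both are dropped.
    return sep.join(field for field in text.split(sep) if field.strip())

cleaned = drop_blank_fields("30121,long,Mehtab,10,20,,30")
cleaned_spaces = drop_blank_fields("a,  ,b,,c")
print(cleaned)
print(cleaned_spaces)
```

The fields that are kept are not modified, so any whitespace around real values is preserved.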

Python Split With Delimiter In Field Value

I have a "CSV" in which some of the data fields happen to contain the comma delimiter, as in the second row of the following sample data.
"1","stuff","and","things"
"2","black,white","more","stuff"
I can't change the source data and I don't know how to str.split() and not split in the value "black,white".
Ways I've approached my problem:
I looked at partition() and don't see how that would benefit me.
I'm sure a regex would capture data properly but I'm not sure how to tie one into splitting.
Since every row in the source will always have the same number of fields, I thought maybe setting maxsplit would help, but I talked myself out of that: it would still split within "black,white" and I would end up losing the last value (which would be "stuff" in this case).
Certainly this is easy to overcome so I'm looking forward to learning something new!
Your help is greatly appreciated.

Using csv and StringIO:
>>> import csv, StringIO
>>> data = """"1","stuff","and","things"
... "2","black,white","more","stuff"
... """
>>> reader = csv.reader(StringIO.StringIO(data))
>>> for row in reader:
...     print row
...
['1', 'stuff', 'and', 'things']
['2', 'black,white', 'more', 'stuff']
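The StringIO module used above exists only in Python 2. On Python 3, an equivalent sketch would use io.StringIO, as shown below with the same sample data.

```python
import csv
import io

data = '"1","stuff","and","things"\n"2","black,white","more","stuff"\n'

# io.StringIO wraps the text in a file-like object; newline='' leaves
# line endings untouched so the csv module can handle them itself.
rows = list(csv.reader(io.StringIO(data, newline='')))

for row in rows:
    print(row)
```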

If your source is not CSV, and you just want to balance quotes in your string you may try using shlex module:
import shlex
lex = shlex.shlex('"2","black,white","more","stuff"')
for i in lex:
    print i

Commas outside the strings are always followed by double-quotes. Just split on ," instead of just , (or even ",")
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> x.split(',"')
['"2"', 'black,white"', 'more"', 'stuff"']
>>> [y.strip('"') for y in x.split(',"')]
['2', 'black,white', 'more', 'stuff']
Of course, adapt this for efficiency as needed.
YevgenYampolskiy's suggestion of shlex is also an alternative.
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> import shlex
>>> y = shlex.shlex(x)
>>> [i.strip('"') for i in y if i != ',']
['2', 'black,white', 'more', 'stuff']

Transform comma separated string into a list but ignore comma in quotes

How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand where these extra empty strings are coming from, or why the two commas are being returned at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why, shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.

Instead of a regular expression, you might be better off using the csv module since what you are dealing with is a CSV string:
from cStringIO import StringIO
from csv import reader
file_like_object = StringIO("1,,2,'3,4'")
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print row
This results in the following output:
['1', '', '2', '3,4']
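cStringIO is also Python 2 only. For a single line on Python 3, one possible sketch skips the file-like object and passes a one-element list to csv.reader, still with quotechar set to a single quote.

```python
import csv

s = "1,,2,'3,4'"

# quotechar="'" tells the reader that single quotes delimit quoted fields,
# so the comma inside '3,4' is kept as part of the field.
row = next(csv.reader([s], quotechar="'"))
print(row)
```

Like the answer above, this assumes there is a comma between 2 and the quoted item.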

pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print commaSeparatedList.parseString(s).asList()
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print commaSeparatedList.parseString(s).asList()
['1', '', '2', "'3,4'"]

Protect commas on consecutive string.join() and string.split()

Suppose the following code (notice the commas inside the strings):
>>> a = ['1',",2","3,"]
I need to concatenate the values into a single string. Naive example:
>>> b = ",".join(a)
>>> b
'1,,2,3,'
And later I need to split the resulting object again:
>>> b.split(',')
['1', '', '2', '3', '']
However, the result I am looking for is the original list:
['1', ',2', '3,']
What's the simplest way to protect the commas in this process? The best solution I came up with looks rather ugly.
Note: the comma is just an example. The strings can contain any character. And I can choose other characters as separators.

The strings can contain any character.
If no matter what you use as a delimiter, there is a chance that the item itself contains the delimiter character, then use the csv module:
import csv
class PseudoFile(object):
    # http://stackoverflow.com/a/8712426/190597
    def write(self, string):
        return string
writer = csv.writer(PseudoFile())
This concatenates the items in a using commas:
a = ['1',",2","3,"]
line = writer.writerow(a)
print(line)
# 1,",2","3,"
This recovers a from line:
print(next(csv.reader([line])))
# ['1', ',2', '3,']

Do you have to use commas to separate the items? If not, you could use another symbol that does not occur in the items of the list.
In [1]: '|'.join(['1', ',2', '3,']).split('|')
Out[1]: ['1', ',2', '3,']
Edit: The string may apparently contain any character. Is it an option to use the json module? You could just dump and load the list.
In [3]: json.dumps(['1', ',2', '3,'])
Out[3]: '["1", ",2", "3,"]'
In [4]: json.loads('["1", ",2", "3,"]')
Out[4]: [u'1', u',2', u'3,']
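As a self-contained sketch of the json idea in Python 3 syntax (where the loaded strings are plain str objects rather than u'' literals):

```python
import json

a = ['1', ',2', '3,']

# dumps() produces a single string; loads() turns it back into a list.
encoded = json.dumps(a)
decoded = json.loads(encoded)

print(encoded)
print(decoded == a)
```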
Edit #2: If you may not use it, you could use str.encode('string-escape') to escape the characters in your string, then enclose the encoded version in single quotes and separate those with commas:
In [10]: print "'example'".encode('string-escape')
\'example\'
In [11]: print r"\'example\'".decode('string-escape')
'example'
Edit #3: Running example of str.encode('string-escape'):
import re
def list_to_str(items):
    return ','.join("'{}'".format(s.encode('string-escape')) for s in items)
def str_to_list(text):
    return re.findall(r"'([^']*)'", text)
if __name__ == '__main__':
    a = ['1', ',2', '3,']
    b = list_to_str(a)
    print 'It is {} that this works.'.format(str_to_list(b) == a)

When you are serializing a list to a string, you need to choose as a separator a character that doesn't appear in the list items. Can't you just replace the comma with another character?
b = ";".join(a)
b.split(';')

Does the delimiter need to be only a single character? If not, you can use a delimiter made up of a sequence of characters that definitely won't appear in your strings, like |#| or something similar.

You need to escape the comma and probably also escape the escape sequence. Here's one way:
>>> a = ['1',",2","3,"]
>>> b = ','.join(s.replace('%', '%%').replace(',', '%2c') for s in a)
>>> [s.replace('%2c', ',').replace('%%', '%') for s in b.split(',')]
['1', ',2', '3,']
>>> b
'1,%2c2,3%2c'
>>>
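A variant of this escaping idea, written as reusable functions, is sketched below. It encodes '%' as '%25' and ',' as '%2c' (borrowing the URL percent-encoding convention) so that an item which itself contains the text '%2c' still round-trips. All four function names are hypothetical, and the sketch assumes a non-empty list of strings.

```python
def escape(item):
    # Escape the escape character first, then the delimiter.
    return item.replace('%', '%25').replace(',', '%2c')

def unescape(item):
    # Undo the replacements in the opposite order.
    return item.replace('%2c', ',').replace('%25', '%')

def join_escaped(items):
    return ','.join(escape(s) for s in items)

def split_escaped(joined):
    return [unescape(s) for s in joined.split(',')]

a = ['1', ',2', '3,', '100%', '%2c']
b = join_escaped(a)
restored = split_escaped(b)
print(b)
print(restored == a)
```

After escaping, every '%' in the joined string is followed by either '25' or '2c', which is why decoding in reverse order is unambiguous.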

I would join and split using another character than ",", e.g. ";":
>>> b = ";".join(a)
>>> b.split(';')
['1', ',2', '3,']
