Protect commas on consecutive string.join() and string.split() - python

Suppose the following code (notice the commas inside the strings):
>>> a = ['1',",2","3,"]
I need to concatenate the values into a single string. Naive example:
>>> b = ",".join(a)
>>> b
'1,,2,3,'
And later I need to split the resulting object again:
>>> b.split(',')
['1', '', '2', '3', '']
However, the result I am looking for is the original list:
['1', ',2', '3,']
What's the simplest way to protect the commas in this process? The best solution I came up with looks rather ugly.
Note: the comma is just an example. The strings can contain any character. And I can choose other characters as separators.

The strings can contain any character.
If any delimiter you choose might also occur inside an item, use the csv module, which quotes such items for you:
import csv
class PseudoFile(object):
    # http://stackoverflow.com/a/8712426/190597
    def write(self, string):
        return string
writer = csv.writer(PseudoFile())
This concatenates the items in a using commas:
a = ['1',",2","3,"]
line = writer.writerow(a)
print(line)
# 1,",2","3,"
This recovers a from line:
print(next(csv.reader([line])))
# ['1', ',2', '3,']
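On Python 3 the same round trip can be sketched with io.StringIO instead of a pseudo-file. The helper names join_csv and split_csv are made up for this illustration; they are not part of the csv module.

```python
import csv
import io

def join_csv(items):
    """Serialize a list of strings into one CSV line (no line terminator)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(items)
    return buffer.getvalue()

def split_csv(line):
    """Parse one CSV line back into a list of strings."""
    return next(csv.reader([line]))

a = ['1', ',2', '3,']
line = join_csv(a)
print(line)             # 1,",2","3,"
print(split_csv(line))  # ['1', ',2', '3,']
```

Items that contain the quote character itself are handled too, because the writer doubles embedded quotes and the reader undoes that.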

Do you have to use commas to separate the items? If not, you could use another symbol that does not occur in any item of the list.
In [1]: '|'.join(['1', ',2', '3,']).split('|')
Out[1]: ['1', ',2', '3,']
Edit: The string may apparently contain any character. Is it an option to use the json module? You could just dump and load the list.
In [3]: json.dumps(['1', ',2', '3,'])
Out[3]: '["1", ",2", "3,"]'
In [4]: json.loads('["1", ",2", "3,"]')
Out[4]: [u'1', u',2', u'3,']
Edit #2: If json is not an option, you could use str.encode('string-escape') (Python 2 only) to escape the characters in each string, then enclose each escaped string in single quotes and separate them with commas:
In [10]: print "'example'".encode('string-escape')
\'example\'
In [11]: print r"\'example\'".decode('string-escape')
'example'
Edit #3: Running example of str.encode('string-escape'):
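import re

def list_to_str(items):
    # Escape each item (quotes and backslashes included), then quote it.
    return ','.join("'{}'".format(s.encode('string-escape')) for s in items)

def str_to_list(joined):
    # Match quoted items, allowing backslash-escaped characters, and unescape.
    return [m.decode('string-escape')
            for m in re.findall(r"'((?:[^'\\]|\\.)*)'", joined)]

if __name__ == '__main__':
    a = ['1', ',2', '3,']
    b = list_to_str(a)
    print 'It is {} that this works.'.format(str_to_list(b) == a)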
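The snippets in this answer use Python 2 syntax. For Python 3, where to my knowledge the string-escape codec is not available, the json route is probably the simplest. A minimal sketch, with sample items chosen here only to exercise awkward characters:

```python
import json

# Items containing the separator, quotes, a backslash and a newline.
a = ['1', ',2', '3,', 'quote " and backslash \\', 'line\nbreak']

encoded = json.dumps(a)        # a single string that can be stored or sent
decoded = json.loads(encoded)  # back to a list of str

print(decoded == a)  # True
```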

When you serialize a list to a string, you need to choose a separator character that doesn't appear in any of the list items. Can't you just replace the comma with another character?
b = ";".join(a)
b.split(';')

Does the delimiter need to be only a single character? If not, you can use a delimiter made up of a sequence of characters that definitely won't appear in your strings, like |#| or something similar.
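A small sketch of that idea, assuming |#| as the delimiter. Because a delimiter can also be formed across the boundary between two items, the hypothetical helper join_checked verifies the round trip instead of only checking each item:

```python
DELIMITER = '|#|'  # assumed delimiter; pick whatever suits your data

def join_checked(items, delimiter=DELIMITER):
    """Join items and fail loudly if splitting would not restore them."""
    joined = delimiter.join(items)
    if joined.split(delimiter) != list(items):
        raise ValueError("delimiter %r collides with the data" % delimiter)
    return joined

a = ['1', ',2', '3,']
b = join_checked(a)
print(b)                   # 1|#|,2|#|3,
print(b.split(DELIMITER))  # ['1', ',2', '3,']
```

With input such as ['a|#', 'b'] no single item contains the delimiter, yet the joined string splits differently, so join_checked raises ValueError.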

You need to escape the comma and also escape the escape character itself. Here's one way, using %25 for a literal % and %2c for a comma:
>>> a = ['1',",2","3,"]
>>> b = ','.join(s.replace('%', '%25').replace(',', '%2c') for s in a)
>>> [s.replace('%2c', ',').replace('%25', '%') for s in b.split(',')]
['1', ',2', '3,']
>>> b
'1,%2c2,3%2c'
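The escaping idea can be wrapped in two helper functions. This is only a sketch: the names join_escaped and split_escaped are hypothetical, and it uses percent-encoding style escapes (%25 for a literal % and %2c for a comma) so that items which already look like an escape sequence survive the round trip.

```python
def join_escaped(items):
    """Escape '%' first, then ',', and join with commas."""
    return ','.join(s.replace('%', '%25').replace(',', '%2c') for s in items)

def split_escaped(joined):
    """Split on commas, then undo the escapes in the reverse order."""
    return [s.replace('%2c', ',').replace('%25', '%') for s in joined.split(',')]

tricky = ['1', ',2', '3,', '100%', '%2c', '%25', '']
encoded = join_escaped(tricky)
print(split_escaped(encoded) == tricky)  # True
```

The order matters: the escape character is escaped first when encoding and restored last when decoding.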

I would join and split using another character than ",", e.g. ";":
>>> b = ";".join(a)
>>> b.split(';')
['1', ',2', '3,']

Related

I need double quotes around datas from CSV [duplicate]

I have a list:
my_list = ['"3"', '"45"','"12"','"6"']
This list shows both single and double quotes around each item value. How can I remove the single or double quotes from each item? I tried the following, but the result is the same:
my_list = [i.replace("''", " ") for i in my_list]
Your list doesn't contain any strings with single quotes. I think you are confusing the repr() representation of the strings with their values.
When you print a Python standard library container such as a list (or a tuple, set, dictionary, etc.), the contents of the container are shown using their repr() output; this is great when debugging because it makes it clear what type of objects you have. For strings, the representation uses valid Python string literal syntax; you can copy the output and paste it into another Python script or the interactive interpreter and you'll get the exact same value.
For example, s here is a string that contains some text, some quote characters, and a newline character. When I print the string, the newline character causes an extra blank line to be printed, but when I use repr(), you get the string value in Python syntax form, where the single quotes are part of the syntax, not the value. Note that the newline character also is shown with the \n syntax, exactly the same as when I created the s string in the first place:
>>> s = 'They heard him say "Hello world!".\n'
>>> print(s)
They heard him say "Hello world!".
>>> print(repr(s))
'They heard him say "Hello world!".\n'
>>> s
'They heard him say "Hello world!".\n'
And when I echoed the s value at the end, the interactive interpreter also shows me the value using the repr() output.
So in your list, your strings do not have the ' characters as part of the value. They are part of the string syntax. You only need to replace the " characters, they are part of the value, because they are inside the outermost '...' string literal syntax. You could use str.replace('"', '') to remove them:
[value.replace('"', '') for value in my_list]
or, you could use the str.strip() method to only remove quotes that are at the start or end of the value:
[value.strip('"') for value in my_list]
Both work just fine for your sample list:
>>> my_list = ['"3"', '"45"','"12"','"6"']
>>> [value.replace('"', '') for value in my_list]
['3', '45', '12', '6']
>>> [value.strip('"') for value in my_list]
['3', '45', '12', '6']
Again, the ' characters are not part of the value:
>>> first = my_list[0].strip('"')
>>> first # echo, uses repr()
'3'
>>> print(first) # printing, the actual value written out
3
>>> len(first) # there is just a single character in the string
1
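The two approaches only differ when a quote also appears inside the value. A short sketch with a made-up value (not from the question's data):

```python
value = '"3"x"5"'  # hypothetical value with quotes at the ends and inside

removed_everywhere = value.replace('"', '')  # drops every double quote
removed_at_ends = value.strip('"')           # drops only leading/trailing ones

print(removed_everywhere)  # 3x5
print(removed_at_ends)     # 3"x"5
```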
However, I have seen that you are reading your data from a tab-separated file that you hand-parse. You can avoid having to deal with the " quotes altogether if you instead used the csv.reader() object, configured to handle tabs as the delimiter. That class automatically will handle quoted columns:
import csv
with open(inputfile, 'r', newline='') as datafile:
    reader = csv.reader(datafile, delimiter='\t')
    for row in reader:
        # row is a list with strings, *but no quotes*
        # e.g. ['3', '45', '12', '6']
        print(row)
Demo showing how csv.reader() handles quotes:
>>> import csv
>>> lines = '''\
... "3"\t"45"\t"12"\t"6"
... "42"\t"81"\t"99"\t"11"
... '''.splitlines()
>>> reader = csv.reader(lines, delimiter='\t')
>>> for row in reader:
...     print(row)
...
['3', '45', '12', '6']
['42', '81', '99', '11']
As suggested by @MartijnPieters in the comments, you can use replace on the strings to get the desired output.
The change I would suggest is to use .replace('"', '') instead of .replace('"', ' '); otherwise the resulting strings will have leading and trailing whitespace.
You can use a list comprehension to process the list, like this:
my_list = ['"3"', '"45"','"12"','"6"']
new_list = [x.replace('"', '') for x in my_list]
print(new_list) # ['3', '45', '12', '6']
You can use split:
[x.split('"')[1] for x in my_list]
or you can use:
[x.strip('"') for x in my_list]

read a file with single quote data and store it in a list in python

When I try to read a file and store its contents in a list, the string inside single quotes is not stored as a single value in the list.
sample file:
12 3 'dsf dsf'
the list should contain
listname = [12, 3, 'dsf dsf']
What I am able to get so far is:
listname = [12, 3, 'dsf', 'dsf']
Please help
Use the csv module.
Demo:
>>> import csv
>>> with open('input.txt') as inp:
...     print(list(csv.reader(inp, delimiter=' ', quotechar="'"))[0])
...
['12', '3', 'dsf dsf']
input.txt is the file containing your data in the example.
You can use the shlex module to split your data in a simple way.
import shlex
with open("sample file", 'r') as data:
    print(shlex.split(data.read()))
Try it:)
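Since the desired list holds integers rather than strings, the shlex tokens can be converted afterwards. A sketch, assuming every token that looks like an integer should become one; parse_line is a hypothetical helper name:

```python
import shlex

def parse_line(line):
    """Split a line shell-style and convert integer-looking tokens to int."""
    tokens = []
    for token in shlex.split(line):
        try:
            tokens.append(int(token))
        except ValueError:
            tokens.append(token)
    return tokens

print(parse_line("12 3 'dsf dsf'"))  # [12, 3, 'dsf dsf']
```

Note that shlex discards the quotes before the conversion, so a quoted '12' would also end up as an integer with this approach.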
You can use regular expressions:
import re
my_regex = re.compile(r"(?<=')[\w\s]+(?=')|\w+")
with open("filename.txt") as my_file:
    my_list = my_regex.findall(my_file.read())
print(my_list)
Output for file content 12 3 'dsf dsf':
['12', '3', 'dsf dsf']
RegEx explanation:
(?<=') # matches if there's a single quote *before* the matched pattern
[\w\s]+ # matches one or more alphanumeric characters and spaces
(?=') # matches if there's a single quote *after* the matched pattern
| # match either the pattern above or below
\w+ # matches one or more alphanumeric characters
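One caveat worth checking: because the lookarounds do not consume the quotes, text between two quoted strings can also sit between two quote characters and match as if it were quoted. The sketch below shows this on a made-up line and compares it with a variant that consumes the quotes; the second pattern and the tokens helper are my own illustration, not part of the answer.

```python
import re

lookaround = re.compile(r"(?<=')[\w\s]+(?=')|\w+")
# Variant: consume the quotes and capture what is inside them.
consuming = re.compile(r"'([^']*)'|([^\s']+)")

def tokens(text):
    """Return quoted contents or bare words, whichever group matched."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in consuming.finditer(text)]

sample = "'a b' c 'd e'"
lookaround_result = lookaround.findall(sample)
consuming_result = tokens(sample)

print(lookaround_result)  # ['a b', ' c ', 'd e']
print(consuming_result)   # ['a b', 'c', 'd e']
```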
You can use:
>>> l = ['12', '3', 'dsf', 'dsf']
>>> l[2:] = [' '.join(l[2:])]
>>> l
['12', '3', 'dsf dsf']
Basically, you need to parse the data. That is:
- split it into tokens
- interpret the resulting sequence
  - in your case, each token can be interpreted separately
For the first task:
- each token is:
  - a run of non-space characters, or
  - a quote, then anything until another quote.
- the separator is a single space (you didn't specify if runs of spaces/other whitespace characters are valid)
Interpretation:
- quoted: take the enclosed text, discarding the quotes
- non-quoted: convert to an integer if possible (you didn't specify if it always is/should be an integer)
- (you also didn't specify if it's always 2 integers + quoted string, i.e. if this combination should be enforced)
Since the syntax is very simple, the two tasks can be done at the same time:
import re

line = "12 3 'dsf dsf'"  # sample input from the question

# A chunk is a quoted string or a run of non-space, non-quote characters,
# followed by one whitespace separator or the end of the line.
re_chunk = re.compile(r"(?:'(?P<enclosed>[^']*)'|(?P<term>[^\s']+))(?:\s|$)")

tokens = []
i = 0
while i < len(line):
    m = re_chunk.match(line, i)  # compiled patterns accept a start position
    if not m:
        raise ValueError("invalid syntax at char %d" % i)
    token = m.group('term')
    if token is not None:
        try:
            token = int(token)  # convert to an integer if possible
        except ValueError:
            pass
    else:
        token = m.group('enclosed')
    tokens.append(token)
    i = m.end()  # m.end() is an absolute index into line
print(tokens)  # [12, 3, 'dsf dsf']
This is an example of how it can be done by hand. You can, however, reuse any existing parsing algorithm that can process the same syntax. csv and shlex suggested in other answers are examples. Do note though that they likely accept other syntax, too, which you may or may not want. E.g.:
shlex also accepts double quotes and constructs like "asd"fgh and 'asd'\''fgh'
csv allows multiple consecutive separators (producing an empty element) and things like 'asd'fgh (stripping the quotes) and asd'def' (leaving the quotes intact)

How can split string in python and get result with delimiter?

I have code like
a = "*abc*bbc"
a.split("*")#['','abc','bbc']
#i need ["*","abc","*","bbc"]
a = "abc*bbc"
a.split("*")#['abc','bbc']
#i need ["abc","*","bbc"]
How can I get a list that includes the delimiter, using Python's split function, a regex, or partition?
I am using Python 2.7 on Windows.
You need to use RegEx with the delimiter as a group and ignore the empty string, like this
>>> [item for item in re.split(r"(\*)", "abc*bbc") if item]
['abc', '*', 'bbc']
>>> [item for item in re.split(r"(\*)", "*abc*bbc") if item]
['*', 'abc', '*', 'bbc']
Note 1: You need to escape * with \, because RegEx has special meaning for *. So, you need to tell RegEx engine that * should be treated as the normal character.
Note 2: You'll get an empty string when the delimiter is at the beginning or at the end of the string you are splitting. Check this question to understand the reason behind it.
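For an arbitrary delimiter, re.escape can take care of the escaping described in Note 1. A sketch, with split_keep as a hypothetical helper name:

```python
import re

def split_keep(text, delimiter):
    """Split text on delimiter, keeping the delimiter as its own item."""
    pattern = '(' + re.escape(delimiter) + ')'  # the group keeps the delimiter
    return [part for part in re.split(pattern, text) if part]

print(split_keep('*abc*bbc', '*'))  # ['*', 'abc', '*', 'bbc']
print(split_keep('abc*bbc', '*'))   # ['abc', '*', 'bbc']
```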
You have to use re.split and group the delimiter:
import re
x = "*abc*bbc"
print([item for item in re.split(r"(\*)", x) if item])
Or use re.findall:
x = "*abc*bbc"
print(re.findall(r"[^*]+|\*", x))
Use partition():
a = "abc*bbc"
print(a.partition("*"))
('abc', '*', 'bbc')
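Keep in mind that partition() splits only at the first occurrence of the separator, so the question's first input needs a loop. A sketch; partition_all is a hypothetical helper, not a built-in:

```python
a = "*abc*bbc"
print(a.partition("*"))  # ('', '*', 'abc*bbc')

def partition_all(text, sep):
    """Apply partition repeatedly, keeping separators and dropping empties."""
    parts = []
    while text:
        head, found, text = text.partition(sep)
        parts.extend(p for p in (head, found) if p)
    return parts

print(partition_all("*abc*bbc", "*"))  # ['*', 'abc', '*', 'bbc']
print(partition_all("abc*bbc", "*"))   # ['abc', '*', 'bbc']
```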

Transform comma separated string into a list but ignore comma in quotes

How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand, where these extra empty strings are coming from, and why the two commas are even being printed at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why, shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.
Instead of a regular expression, you might be better off using the csv module since what you are dealing with is a CSV string:
from cStringIO import StringIO
from csv import reader
file_like_object = StringIO("1,,2,'3,4'")
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print row
This results in the following output:
['1', '', '2', '3,4']
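The snippet above is Python 2 (cStringIO, print statement). On Python 3, a sketch of the same idea can pass the string to csv.reader directly, since the reader accepts any iterable of lines:

```python
import csv

text = "1,,2,'3,4'"  # the question's input with the missing comma added
row = next(csv.reader([text], quotechar="'"))
print(row)  # ['1', '', '2', '3,4']
```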
pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print commaSeparatedList.parseString(s).asList()
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print commaSeparatedList.parseString(s).asList()
['1', '', '2', "'3,4'"]

How to quickly parse a list of strings

If I want to split a list of words separated by a delimiter character, I can use
>>> 'abc,foo,bar'.split(',')
['abc', 'foo', 'bar']
But how to easily and quickly do the same thing if I also want to handle quoted-strings which can contain the delimiter character ?
In: 'abc,"a string, with a comma","another, one"'
Out: ['abc', 'a string, with a comma', 'another, one']
Related question: How can i parse a comma delimited string into a list (caveat)?
import csv
input = ['abc,"a string, with a comma","another, one"']
parser = csv.reader(input)
for fields in parser:
    for i,f in enumerate(fields):
        print i,f # in Python 3 and up, print is a function; use: print(i,f)
Result:
0 abc
1 a string, with a comma
2 another, one
The CSV module should be able to do that for you
