How to quickly parse a list of strings - python

If I want to split a list of words separated by a delimiter character, I can use
>>> 'abc,foo,bar'.split(',')
['abc', 'foo', 'bar']
But how can I easily and quickly do the same thing if I also want to handle quoted strings, which can contain the delimiter character?
In: 'abc,"a string, with a comma","another, one"'
Out: ['abc', 'a string, with a comma', 'another, one']
Related question: How can i parse a comma delimited string into a list (caveat)?

import csv
lines = ['abc,"a string, with a comma","another, one"']
parser = csv.reader(lines)
for fields in parser:
    for i, f in enumerate(fields):
        print(i, f)  # Python 3; in Python 2 use: print i, f
Result:
0 abc
1 a string, with a comma
2 another, one

The csv module should be able to do that for you.
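For a single string rather than a list of lines, the same idea can be wrapped in a small helper. This is only a sketch: the name split_quoted and its delimiter parameter are made up for illustration, and it assumes the default csv dialect (double quotes as the quote character).

```python
import csv

def split_quoted(line, delimiter=','):
    """Split one line on `delimiter`, honouring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the string in a list
    # and take the first (and only) parsed row.
    return next(csv.reader([line], delimiter=delimiter))

fields = split_quoted('abc,"a string, with a comma","another, one"')
print(fields)  # ['abc', 'a string, with a comma', 'another, one']
```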

Related

Python String manipulation to store result in a string after removing comma delimiter

I have a string that I want to split on the ',' delimiter and store the result in a new string. Currently the split function stores the result in a list. How can I store the result in a string without the ',' delimiter? I also want to rearrange the positions of the string's contents. Are there ways to do this in Python?
code
string_in = "a,bcd,e1,20"
print (string_in.split())
output
['a,bcd,e1,20']
I want the below result to be stored in a string without the comma delimiter and manipulate the position of the string content as below.
string_out = a bcd 20 e1
You want to pass your delimiter as an argument to split, like so:
>>> split = string_in.split(",")
>>> split
['a', 'bcd', 'e1', '20']
That will give you a list of elements that you can manipulate as you wish. When you want to put them back into a space delimited string, you use join like so:
>>> " ".join(split)
'a bcd e1 20'
Take a look at the python documentation for split and join:
https://docs.python.org/3.8/library/stdtypes.html#str.split
https://docs.python.org/3.8/library/stdtypes.html#str.join
You are reinventing the wheel. In your case, an ordinary search and replace on the source string suffices:
string_in = "a,bcd,e1,20"
result = string_in.replace(',', ' ')
If you prefer split/join:
string_in = "a,bcd,e1,20"
result = ' '.join(string_in.split(','))
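The question also asks about rearranging the pieces. One possible sketch: split into a list, reorder the list, then join. Which positions get swapped is an assumption read off the desired output "a bcd 20 e1".

```python
string_in = "a,bcd,e1,20"

parts = string_in.split(",")  # ['a', 'bcd', 'e1', '20']
# Swap the last two items; the positions to move are an assumption
# based on the desired output "a bcd 20 e1".
parts[-1], parts[-2] = parts[-2], parts[-1]
string_out = " ".join(parts)
print(string_out)  # a bcd 20 e1
```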

read a file with single quote data and store it in a list in python

When I read a file and store its contents in a list, a string inside single quotes is not stored as a single value in the list.
sample file:
12 3 'dsf dsf'
the list should contain
listname = [12, 3, 'dsf dsf']
So far I am only able to get this:
listname = [12, 3, 'dsf', 'dsf']
Please help
Use the csv module.
Demo:
>>> import csv
>>> with open('input.txt') as inp:
...     print(list(csv.reader(inp, delimiter=' ', quotechar="'"))[0])
...
['12', '3', 'dsf dsf']
input.txt is the file containing your data in the example.
You can use the shlex module to split your data in a simple way.
import shlex
with open("sample file") as data:
    print(shlex.split(data.read()))
Try it :)
You can use regular expressions:
import re
my_regex = re.compile(r"(?<=')[\w\s]+(?=')|\w+")
with open("filename.txt") as my_file:
    my_list = my_regex.findall(my_file.read())
print(my_list)
Output for file content 12 3 'dsf dsf':
['12', '3', 'dsf dsf']
RegEx explanation:
(?<=')   # matches if there's a single quote *before* the matched pattern
[\w\s]+  # matches one or more word characters (letters, digits, underscore) or whitespace
(?=')    # matches if there's a single quote *after* the matched pattern
|        # match either the pattern above or below
\w+      # matches one or more word characters
You can use:
>>> l = ['12', '3', 'dsf', 'dsf']
>>> l[2:] = [' '.join(l[2:])]
>>> l
['12', '3', 'dsf dsf']
Basically, you need to parse the data, which means:
  split it into tokens
  interpret the resulting sequence
    (in your case, each token can be interpreted separately)
For the 1st task:
  each token is:
    a run of non-space characters, or
    a quote, then anything until another quote.
  the separator is a single space (you didn't specify if runs of spaces/other whitespace characters are valid)
Interpretation:
  quoted: take the enclosed text, discarding the quotes
  non-quoted: convert to integer if possible (you didn't specify if it always is/should be an integer)
  (you also didn't specify if it's always 2 integers + quoted string, i.e. if this combination should be enforced)
Since the syntax is very simple, the two tasks can be done at the same time:
import re

line = "12 3 'dsf dsf'"  # example input; in practice, a line read from the file

re_sep = r"\s"
re_term = r"\S+"
re_quoted = r"'(?P<enclosed>[^']*)'"
# Try the quoted form first; otherwise \S+ would swallow the opening quote.
re_chunk = re.compile(
    "(?:(?P<quoted>%s)|(?P<term>%s))(?:%s|$)" % (re_quoted, re_term, re_sep))

tokens = []
i = 0
while i < len(line):
    m = re_chunk.match(line, i)  # match starting at position i
    if not m:
        raise ValueError("invalid syntax at char %d" % i)
    if m.group('quoted') is not None:
        token = m.group('enclosed')  # drop the surrounding quotes
    else:
        token = m.group('term')
        try:
            token = int(token)  # convert to an integer if possible
        except ValueError:
            pass
    tokens.append(token)
    i = m.end()  # m.end() is an absolute index into line
This is an example of how it can be done by hand. You can, however, reuse any existing parsing algorithm that can process the same syntax. csv and shlex suggested in other answers are examples. Do note though that they likely accept other syntax, too, which you may or may not want. E.g.:
shlex also accepts double quotes and constructs like "asd"fgh and 'asd'\''fgh'
csv allows multiple consecutive separators (producing an empty element) and things like 'asd'fgh (stripping the quotes) and asd'def' (leaving the quotes intact)
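The differences listed above can be checked with a short experiment. This is a sketch; csv_split is a hypothetical helper that mirrors the space-delimited, single-quote configuration from the csv answer.

```python
import csv
import shlex

# shlex follows POSIX shell quoting rules, so it accepts more than plain single quotes.
print(shlex.split("12 3 'dsf dsf'"))  # ['12', '3', 'dsf dsf']
print(shlex.split('"asd"fgh'))        # ['asdfgh']
print(shlex.split("'asd'\\''fgh'"))   # ["asd'fgh"]

def csv_split(line):
    # Space-delimited, single-quote-quoted, as in the csv answer (hypothetical helper).
    return next(csv.reader([line], delimiter=' ', quotechar="'"))

print(csv_split("12  3"))     # ['12', '', '3']  two spaces give an empty element
print(csv_split("'asd'fgh"))  # ['asdfgh']       quotes stripped
print(csv_split("asd'def'"))  # ["asd'def'"]     quotes kept
```

If any of these inputs should be rejected rather than accepted, a hand-written tokenizer like the one above gives you that control.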

splitting a string based on tab in the file

I have a file that contains values separated by tabs ("\t"). I am trying to create a list and store all the values from the file in it, but I have run into a problem. Here is my code.
line = "abc\tdef\tghi"
values = line.split("\t")
It works fine as long as there is only one tab between each value. But if there is more than one tab, the extra tab ends up in values as well. In my case the extra tab will mostly be after the last value in the file.
You can use regex here:
>>> import re
>>> strs = "foo\tbar\t\tspam"
>>> re.split(r'\t+', strs)
['foo', 'bar', 'spam']
update:
You can use str.rstrip to get rid of trailing '\t' and then apply regex.
>>> yas = "yas\t\tbs\tcda\t\t"
>>> re.split(r'\t+', yas.rstrip('\t'))
['yas', 'bs', 'cda']
Split on tab, but then remove all blank matches.
text = "hi\tthere\t\t\tmy main man"
print([splits for splits in text.split("\t") if splits])
Outputs:
['hi', 'there', 'my main man']
You can use a regexp to do this:
>>> import re
>>> patt = re.compile("[^\t]+")
>>> s = "a\t\tbcde\t\tef"
>>> patt.findall(s)
['a', 'bcde', 'ef']
Another regex-based solution:
>>> strs = "foo\tbar\t\tspam"
>>> r = re.compile(r'([^\t]*)\t*')
>>> r.findall(strs)[:-1]
['foo', 'bar', 'spam']
Python has support for CSV files in the eponymous csv module. It is somewhat misnamed, since it supports much more than just comma-separated values.
If you need to go beyond basic word splitting, you should take a look at it, for example because you need to deal with quoted values.
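A minimal sketch of the csv approach for tab-separated data follows. The sample rows are invented for illustration; a real program would pass an open file object to csv.reader instead. Note that csv does not collapse consecutive tabs, so empty fields are filtered out afterwards.

```python
import csv

# Hypothetical sample rows; a real program would pass an open file object instead.
rows = ['foo\tbar\t"spam\teggs"', 'yas\t\tbs\tcda\t\t']

parsed = list(csv.reader(rows, delimiter='\t'))
# The quoted field keeps its embedded tab: ['foo', 'bar', 'spam\teggs']
print(parsed[0])

# Consecutive and trailing tabs produce empty strings, so drop them explicitly.
cleaned = [[field for field in row if field] for row in parsed]
print(cleaned[1])  # ['yas', 'bs', 'cda']
```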

How to delete some characters from a string by matching certain character in python

I am trying to delete a certain portion of a string if a match is found in it, as below:
string = 'Newyork, NY'
I want to delete the comma and all the characters after it, if a comma is present in the string.
Can anyone let me know how to do this?
Use .split():
string = string.split(',', 1)[0]
We split the string on the comma once, to save python the work of splitting on more commas.
Alternatively, you can use .partition():
string = string.partition(',')[0]
Demo:
>>> 'Newyork, NY'.split(',', 1)[0]
'Newyork'
>>> 'Newyork, NY'.partition(',')[0]
'Newyork'
.partition() is the faster method:
>>> import timeit
>>> timeit.timeit("'one, two'.split(',', 1)[0]")
0.52929401397705078
>>> timeit.timeit("'one, two'.partition(',')[0]")
0.26499605178833008
You can split the string with the delimiter ",":
string.split(",")[0]
Example:
'Newyork, NY'.split(",") # ['Newyork', ' NY']
'Newyork, NY'.split(",")[0] # 'Newyork'
Try this :
s = "this, is"
m = s.index(',')
l = s[:m]
A few options:
string[:string.index(",")]
This will raise a ValueError if , cannot be found in the string. Here, we find the position of the character with .index then use slicing.
string.split(",")[0]
The split function will give you a list of the substrings that were separated by ,, and you just take the first element of the list. This will work even if , is not present in the string (as there'd be nothing to split in that case, we'd have string.split(...) == [string])
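The difference in behaviour when no comma is present can be sketched as follows. The helper name before_comma is made up for illustration.

```python
def before_comma(text):
    """Return the part of `text` before the first comma (hypothetical helper)."""
    # partition() returns (head, sep, tail); if there is no comma,
    # head is the whole string and sep/tail are empty.
    head, _sep, _tail = text.partition(',')
    return head

print(before_comma('Newyork, NY'))  # Newyork
print(before_comma('Newyork'))      # Newyork

# By contrast, index() raises ValueError when the comma is missing.
try:
    'Newyork'[:'Newyork'.index(',')]
except ValueError:
    print('no comma found')
```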

Protect commas on consecutive string.join() and string.split()

Suppose the following code (notice the commas inside the strings):
>>> a = ['1',",2","3,"]
I need to concatenate the values into a single string. Naive example:
>>> b = ",".join(a)
>>> b
'1,,2,3,'
And later I need to split the resulting object again:
>>> b.split(',')
['1', '', '2', '3', '']
However, the result I am looking for is the original list:
['1', ',2', '3,']
What's the simplest way to protect the commas in this process? The best solution I came up with looks rather ugly.
Note: the comma is just an example. The strings can contain any character. And I can choose other characters as separators.
You wrote: "The strings can contain any character."
If, no matter what you use as a delimiter, there is a chance that an item itself contains the delimiter character, then use the csv module:
import csv

class PseudoFile(object):
    # http://stackoverflow.com/a/8712426/190597
    def write(self, string):
        return string

writer = csv.writer(PseudoFile())
This concatenates the items in a using commas:
a = ['1',",2","3,"]
line = writer.writerow(a)
print(line)
# 1,",2","3,"
This recovers a from line:
print(next(csv.reader([line])))
# ['1', ',2', '3,']
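In Python 3, a similar round trip can be sketched with io.StringIO in place of the pseudo-file. The helper names join_csv and split_csv are assumptions for illustration, not part of the csv module.

```python
import csv
import io

def join_csv(items):
    """Serialize a list of strings into one CSV line (hypothetical helper)."""
    buffer = io.StringIO()
    # lineterminator='' keeps the result free of a trailing '\r\n'.
    csv.writer(buffer, lineterminator='').writerow(items)
    return buffer.getvalue()

def split_csv(line):
    """Recover the list of strings from a line produced by join_csv."""
    return next(csv.reader([line]))

a = ['1', ',2', '3,']
line = join_csv(a)
print(line)             # 1,",2","3,"
print(split_csv(line))  # ['1', ',2', '3,']
```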
Do you have to use commas to separate the items? If not, you could use another symbol that does not appear in the items of the list.
In [1]: '|'.join(['1', ',2', '3,']).split('|')
Out[1]: ['1', ',2', '3,']
Edit: The string may apparently contain any character. Is it an option to use the json module? You could just dump and load the list.
In [3]: json.dumps(['1', ',2', '3,'])
Out[3]: '["1", ",2", "3,"]'
In [4]: json.loads('["1", ",2", "3,"]')
Out[4]: [u'1', u',2', u'3,']
Edit #2: If you cannot use json, in Python 2 you could use str.encode('string-escape') to escape the characters in your string, then enclose the encoded version in single quotes and separate those with commas:
In [10]: print "'example'".encode('string-escape')
\'example\'
In [11]: print r"\'example\'".decode('string-escape')
'example'
Edit #3: Running example of str.encode('string-escape') (Python 2 only):
# Python 2 only: the 'string-escape' codec does not exist in Python 3.
import re

def list_to_str(items):
    return ','.join("'{}'".format(s.encode('string-escape')) for s in items)

def str_to_list(text):
    return re.findall(r"'([^']*)'", text)

if __name__ == '__main__':
    a = ['1', ',2', '3,']
    b = list_to_str(a)
    print 'It is {} that this works.'.format(str_to_list(b) == a)
When you are serializing a list to a String, then you need to choose as a separator a character that doesn't appear in the list items. Can't you just replace the comma with another character?
b = ";".join(a)
b.split(';')
Does the delimiter need to be only a single character? If not, you can use a delimiter made up of a sequence of characters that definitely won't appear in your strings, like |#| or something similar.
You need to escape the comma and probably also escape the escape sequence. Here's one way:
>>> a = ['1',",2","3,"]
>>> b = ','.join(s.replace('%', '%%').replace(',', '%2c') for s in a)
>>> [s.replace('%2c', ',').replace('%%', '%') for s in b.split(',')]
['1', ',2', '3,']
>>> b
'1,%2c2,3%2c'
>>>
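One caveat with the replace-based escaping above: an item that literally contains the text %2c is encoded as %%2c, and decoding then turns it into "%," instead of the original. A possible alternative, sketched here with the standard library's percent-encoding, avoids hand-ordering the replacements. The helper names are hypothetical, and the sketch assumes the separator is a comma.

```python
from urllib.parse import quote, unquote

def join_escaped(items):
    """Join strings with ',' after percent-encoding each one (hypothetical helper)."""
    # safe='' makes quote() encode every reserved character, including ',' and '%',
    # so a literal comma can never appear inside an encoded item.
    return ','.join(quote(item, safe='') for item in items)

def split_escaped(text):
    """Inverse of join_escaped for non-empty lists."""
    return [unquote(part) for part in text.split(',')]

a = ['1', ',2', '3,', '100%', '%2c']
b = join_escaped(a)
print(b)                      # 1,%2C2,3%2C,100%25,%252c
print(split_escaped(b) == a)  # True
```

An empty list and a list holding one empty string both encode to the empty string, so that case would need separate handling if it matters.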
I would join and split using another character than ",", e.g. ";":
>>> b = ";".join(a)
>>> b.split(';')
['1', ',2', '3,']
