Splitting a string on commas while respecting quotation marks - Python

Imagine this string:
"a","b","hi, this is Mboyle"
I would like to split it on commas, unless the comma is inside quotation marks,
i.e.:
["a","b","hi, this is Mboyle"]
How do I achieve this? Using split, the "hi, this is Mboyle" gets split as well!

You can split your string not by commas, but by ",":
In [1]: '"a","b","hi, this is Mboyle"'.strip('"').split('","')
Out[1]: ['a', 'b', 'hi, this is Mboyle']

My take on the problem (use with caution!)
s = '"a","b","hi, this is Mboyle"'
new_s = eval(f'[{s}]')
print(new_s)
Output:
['a', 'b', 'hi, this is Mboyle']
EDIT (safer version):
import ast
s = '"a","b","hi, this is Mboyle"'
new_s = ast.literal_eval(f'[{s}]')

Solved, using the csv module on the gzip-compressed file:
with gzip.open(file, 'rt') as handler:
    for row in csv.reader(handler, delimiter=","):
        ...  # process each parsed row here
This does the trick! Thank you all.
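For the in-memory string from the question, the same csv.reader approach can be sketched without a file. This is a minimal example, assuming the input follows ordinary CSV quoting rules; the names `fields` and `quoted` are only illustrative.

```python
import csv
import io

s = '"a","b","hi, this is Mboyle"'

# io.StringIO makes the string look like an open text file.
fields = next(csv.reader(io.StringIO(s), delimiter=",", quotechar='"'))
print(fields)  # ['a', 'b', 'hi, this is Mboyle']

# csv.reader accepts any iterable of lines, so a one-element list works too.
# A doubled quote inside a quoted field is read as one literal quote.
quoted = next(csv.reader(['"a","say ""hi"", ok"']))
print(quoted)  # ['a', 'say "hi", ok']
```

Unlike splitting on '","', this also copes with unquoted fields and with quotes embedded in a field.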

You could include the quotation marks in the separator, i.e. .split('","'), and then strip the leftover quotation marks from the first and last list items.

You can use re.split:
import re
s = '"a","b","hi, this is Mboyle"'
new_s = list(map(lambda x:x[1:-1], re.split('(?<="),(?=")', s)))
Output:
['a', 'b', 'hi, this is Mboyle']
However, re.findall is much cleaner:
new_result = re.findall('"(.*?)"', s)
Output:
['a', 'b', 'hi, this is Mboyle']

Related

How to convert a string to a list if the string has wild characters for a group of characters like [] or {}, ()

I have a string of this sort
s = 'a,s,[c,f],[f,t]'
I want to convert this to a list
S = ['a','s',['c','f'],['f','t']]
I tried using strip()
d = s.strip('][').split(',')
But it is not giving me the desired output:
output = ['a', 's', '[c', 'f]', '[f', 't']
You could use ast.literal_eval(), having first enclosed each element in quotes:
>>> import ast
>>> import re
>>> qs = re.sub(r'(\w+)', r'"\1"', s) # add quotes
>>> ast.literal_eval('[' + qs + ']') # enclose in brackets & safely eval
['a', 's', ['c', 'f'], ['f', 't']]
You may need to tweak the regex if your elements can contain non-word characters.
This only works if your input string follows Python expression syntax or is sufficiently close to be mechanically converted to Python syntax (as we did above by adding quotes and brackets). If this assumption does not hold, you might need to look into using a parsing library. (You could also hand-code a recursive descent parser, but that'll probably be more work to do correctly than just using a parsing library.)
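To give an idea of what the hand-coded route looks like, here is a rough sketch of a recursive descent parser for this particular format. It assumes that only square brackets nest and that commas separate items; the function names `parse_nested` and `parse_list` are made up for this example, and empty brackets are not treated specially.

```python
def parse_nested(text):
    """Parse 'a,s,[c,f],[f,t]' into a nested list of strings."""
    pos = 0  # current read position in text

    def parse_list(closer):
        nonlocal pos
        items = []
        while True:
            if pos < len(text) and text[pos] == "[":
                pos += 1  # consume '[' and parse the nested list
                items.append(parse_list("]"))
            else:
                start = pos
                while pos < len(text) and text[pos] not in ",[]":
                    pos += 1
                items.append(text[start:pos].strip())
            # After an item, a comma means another item follows.
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        if closer:
            # A nested list must end with its closing bracket.
            if pos >= len(text) or text[pos] != closer:
                raise ValueError(f"expected {closer!r} at position {pos}")
            pos += 1
        elif pos != len(text):
            # The top level must consume the whole input.
            raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
        return items

    return parse_list("")


print(parse_nested("a,s,[c,f],[f,t]"))  # ['a', 's', ['c', 'f'], ['f', 't']]
```

Unbalanced input such as 'a,[b' raises ValueError instead of silently returning a partial result.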
As an alternative to ast.literal_eval, you can use the json package, with more or less the same restrictions as in NPE's answer:
import re
import json
qs = re.sub(r'(\w+)', r'"\1"', s) # add quotes
ls = json.loads('[' + qs + ']')
print(ls)
# ['a', 's', ['c', 'f'], ['f', 't']]

Is there any function in Python to join the comma-separated values of a list into a single element?

I want to convert a list of strings into a list with a single concatenated element, like so:
my_list=['A','B','B','C','C','A']
I want the output to be:
my_list=['ABBCCA']
Use join:
my_list = ["".join(my_list)]
print(my_list)
Output:
['ABBCCA']
Use str.join:
>>> my_list= ['A','B','B','C','C','A']
>>> "".join(my_list)
'ABBCCA'
So in your case, enclose it in a list:
>>> ["".join(my_list)]
['ABBCCA']
You can concatenate strings with join:
my_list = ['A', 'B', 'B', 'C', 'C', 'A']
print(''.join(my_list)) # 'ABBCCA'
If you mean removing the commas from a single string, like:
s = 'A,B,B,C,C,A'
print(''.join(s.split(','))) # 'ABBCCA'
You can use a loop too:
result = ''  # avoid naming this variable str, which would shadow the built-in
for i in my_list:
    result += i
print(result)

Find all strings in nested brackets

How do I find strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all strings in brackets (so wh, x, yz).
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz
Can I modify this regex to find all matches?
To get all properly parenthesized text:
import re
def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)',  # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '', text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result due to the unbalanced paren.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']
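The same idea can also be written as a single pass with an explicit stack instead of repeated regex substitution. This is only a sketch; the name `texts_in_parens` is invented here, and the choice to ignore a stray ')' is an assumption. For the three examples above it gives the same results.

```python
def texts_in_parens(text):
    stack = []  # one character buffer per currently open '('
    found = []
    for ch in text:
        if ch == "(":
            stack.append([])  # start collecting a new nested group
        elif ch == ")":
            if stack:  # ignore a ')' that has no matching '('
                found.append("".join(stack.pop()))
        elif stack:
            stack[-1].append(ch)  # character belongs to the innermost open group
    # Anything still on the stack was never closed and is dropped.
    return found


print(texts_in_parens("uuv(wh(x(yz)))"))  # ['yz', 'x', 'wh']
print(texts_in_parens("a(b(c)de)"))       # ['c', 'bde']
```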
Would a string split work instead of a regex?
>>> s = 'uv(wh(x(yz))'
>>> match = [''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>> print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
'uv'
You can pop off the first element: it is whatever precedes the first opening parenthesis (an empty string if the input starts with one), so it is never inside parentheses.
Since that wasn't flexible enough, something like this would work:
def match(string):
    unrefined_match = re.findall(r'\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']
A regex pattern such as this might work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
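Your current pattern requires a literal ( followed only by word characters and then a literal ), so it matches only the innermost group. The pattern above drops the closing parenthesis and captures the word after each (.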
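If you know how to convert a PHP (PCRE) regex to Python, you can use the pattern below. Note that Python's built-in re module does not support the recursive (?R) construct; the third-party regex module does.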
\(((?>[^()]+)|(?R))*\)

How can I avoid those empty strings caused by preceding or trailing whitespaces?

>>> import re
>>> re.split(r'[ "]+', ' a n" "c ')
['', 'a', 'n', 'c', '']
When there is preceding or trailing whitespace, there will be empty strings after splitting.
How can I avoid those empty strings? Thanks.
The empty values are the things between the splits. re.split() is not the right tool for the job.
I recommend matching what you want instead.
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
If you must use split, you could use a list comprehension and filter it directly.
>>> [x for x in re.split(r'[ "]+', ' a n" "c ') if x != '']
['a', 'n', 'c']
That's what re.split is supposed to do. You're asking it to split the string on any runs of whitespace or quotes; if it didn't return an empty string at the start, you wouldn't be able to distinguish that case from the case with no preceding whitespace.
If what you're actually asking for is to find all runs of non-whitespace-or-quote characters, just write that:
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
I like abarnert's solution.
However, you can also call this (maybe not the most Pythonic way):
myString.strip()
before your split (or similar).
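A small sketch of that strip-then-split idea, assuming the separators are spaces and double quotes as in the question. Note that strip() with no argument removes only whitespace, so the quote character is passed explicitly here; the variable names are illustrative.

```python
import re

s = ' a n" "c '

# Remove leading/trailing separator characters first, then split.
parts = re.split(r'[ "]+', s.strip(' "'))
print(parts)  # ['a', 'n', 'c']

# Caveat: a string made only of separators still yields one empty string.
only_separators = re.split(r'[ "]+', ' " '.strip(' "'))
print(only_separators)  # ['']
```

Because of that last case, re.findall remains the more robust choice.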

split string on a number of different characters

I'd like to split a string using one or more separator characters.
E.g. "a b.c", split on " " and "." would give the list ["a", "b", "c"].
At the moment, I can't see anything in the standard library to do this, and my own attempts are a bit clumsy. E.g.
def my_split(string_L, split_chars):
    # Accept a single string on the first call, a list of pieces when recursing.
    if isinstance(string_L, str):
        string_L = [string_L]
    try:
        split_char = split_chars[0]
    except IndexError:
        return string_L
    res = []
    for s in string_L:
        res.extend(s.split(split_char))
    return my_split(res, split_chars[1:])
print(my_split("a b.c", [' ', '.']))
Horrible! Any better suggestions?
>>> import re
>>> re.split('[ .]', 'a b.c')
['a', 'b', 'c']
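If the separators are not known in advance, the pattern can be built from a list. The helper below is a sketch and the name `split_on_any` is made up; it uses re.escape so that characters such as '.' or '|' are treated literally, and tries longer separators first in case one is a prefix of another.

```python
import re


def split_on_any(text, separators):
    if not separators:
        return [text]  # nothing to split on
    # Longest first, so '--' wins over '-' when both are given.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, text)


print(split_on_any("a b.c", [" ", "."]))    # ['a', 'b', 'c']
print(split_on_any("a--b-c", ["-", "--"]))  # ['a', 'b', 'c']
```

Adjacent separators still produce empty strings, e.g. "a. b" gives ['a', '', 'b']; filter them out if they are unwanted.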
This one replaces all of the separators with the first separator in the list, and then "splits" using that character.
def split(string, divs):
    for d in divs[1:]:
        string = string.replace(d, divs[0])
    return string.split(divs[0])
output:
>>> split("a b.c", " .")
['a', 'b', 'c']
>>> split("a b.c", ".")
['a b', 'c']
I do like that 're' solution though.
Solution without re:
from itertools import groupby
sep = ' .,'
s = 'a b.c,d'
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])
An explanation is here https://stackoverflow.com/a/19211729/2468006
Not very fast but does the job:
def my_split(text, seps):
    for sep in seps:
        text = text.replace(sep, seps[0])
    return text.split(seps[0])
