split string on a number of different characters

I'd like to split a string using one or more separator characters.
E.g. "a b.c", split on " " and "." would give the list ["a", "b", "c"].
At the moment, I can't see anything in the standard library to do this, and my own attempts are a bit clumsy. E.g.
def my_split(string_L, split_chars):
    if isinstance(string_L, basestring):
        string_L = [string_L]
    try:
        split_char = split_chars[0]
    except IndexError:
        return string_L
    res = []
    for s in string_L:
        res.extend(s.split(split_char))
    return my_split(res, split_chars[1:])
print my_split("a b.c", [' ', '.'])
Horrible! Any better suggestions?

>>> import re
>>> re.split('[ .]', 'a b.c')
['a', 'b', 'c']
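The character class in the pattern above can also be built from an arbitrary string of separators. A minimal sketch, assuming a helper named split_on_any (a hypothetical name, not part of the standard library) and a non-empty separator string; re.escape keeps characters such as "." or "]" literal inside the class:

```python
import re

def split_on_any(text, separators):
    """Split text on any single character in separators (hypothetical helper)."""
    # re.escape makes characters like '.', ']' or '^' literal inside the class.
    pattern = "[" + re.escape(separators) + "]"
    return re.split(pattern, text)

print(split_on_any("a b.c", " ."))   # ['a', 'b', 'c']
print(split_on_any("a  b", " "))     # ['a', '', 'b']
```

As the second call shows, consecutive separators produce empty strings, just as str.split(sep) does with an explicit separator.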

This one replaces all of the separators with the first separator in the list, and then "splits" using that character.
def split(string, divs):
    for d in divs[1:]:
        string = string.replace(d, divs[0])
    return string.split(divs[0])
output:
>>> split("a b.c", " .")
['a', 'b', 'c']
>>> split("a b.c", ".")
['a b', 'c']
I do like that 're' solution though.

Solution without re:
from itertools import groupby
sep = ' .,'
s = 'a b.c,d'
print [''.join(g) for k, g in groupby(s, sep.__contains__) if not k]
An explanation is here https://stackoverflow.com/a/19211729/2468006
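For readers on Python 3, the same groupby idea can be sketched as follows. The function name split_runs is an assumption made for illustration; the technique is the one from the answer above:

```python
from itertools import groupby

def split_runs(text, separators):
    """Split on runs of separator characters, dropping empty pieces."""
    is_sep = separators.__contains__  # True for characters that are separators
    # groupby yields one (key, group) pair per run of consecutive characters
    # that share the same key; keep only the runs whose key is False.
    return ["".join(group) for key, group in groupby(text, is_sep) if not key]

print(split_runs("a b.c,d", " .,"))   # ['a', 'b', 'c', 'd']
print(split_runs("a, b", " ,"))       # ['a', 'b']
```

Unlike re.split with a single-character class, this treats a run of adjacent separators as one split point, so no empty strings appear in the result.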

Not very fast but does the job:
def my_split(text, seps):
    for sep in seps:
        text = text.replace(sep, seps[0])
    return text.split(seps[0])

Related

Python join not giving back comma separated string [duplicate]

I was wondering what the simplest way is to convert a string representation of a list like the following to a list:
x = '[ "A","B","C" , " D"]'
Even in cases where the user puts spaces in between the commas, and spaces inside of the quotes, I need to handle that as well and convert it to:
x = ["A", "B", "C", "D"]
I know I can strip spaces with strip() and split() and check for non-letter characters. But the code was getting very kludgy. Is there a quick function that I'm not aware of?

>>> import ast
>>> x = '[ "A","B","C" , " D"]'
>>> x = ast.literal_eval(x)
>>> x
['A', 'B', 'C', ' D']
>>> x = [n.strip() for n in x]
>>> x
['A', 'B', 'C', 'D']
ast.literal_eval:
With ast.literal_eval you can safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, booleans, and None.
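To illustrate the restriction described above, here is a small sketch. The helper is_plain_literal is a hypothetical name introduced only for this example; it reports whether ast.literal_eval accepts a string:

```python
import ast

def is_plain_literal(text):
    """Return True if text parses as a plain Python literal (hypothetical helper)."""
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # ValueError: the expression is not a literal (e.g. a function call).
        # SyntaxError: the text is not valid Python at all.
        return False
    return True

print(is_plain_literal('[ "A","B","C" , " D"]'))       # True
print(is_plain_literal("__import__('os').getcwd()"))   # False
print(is_plain_literal("[1, 2"))                       # False
```

The function call in the second example is rejected without being executed, which is what makes literal_eval preferable to eval for untrusted input.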

The json module is a better solution whenever there is a stringified list of dictionaries. The json.loads(your_data) function can be used to convert it to a list.
>>> import json
>>> x = '[ "A","B","C" , " D"]'
>>> json.loads(x)
['A', 'B', 'C', ' D']
Similarly
>>> x = '[ "A","B","C" , {"D":"E"}]'
>>> json.loads(x)
['A', 'B', 'C', {'D': 'E'}]
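One caveat is that JSON only accepts double-quoted strings, so a list written with single quotes fails in json.loads. A possible sketch, assuming a hypothetical helper parse_list_string that tries JSON first and falls back to ast.literal_eval:

```python
import ast
import json

def parse_list_string(text):
    """Parse a stringified list, trying JSON first (hypothetical helper)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Single-quoted strings, tuples and None/True/False are Python
        # literals rather than JSON, so fall back to literal_eval.
        return ast.literal_eval(text)

print(parse_list_string('[ "A","B","C" , " D"]'))   # ['A', 'B', 'C', ' D']
print(parse_list_string("[ 'A','B','C' , ' D']"))   # ['A', 'B', 'C', ' D']
```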

eval is dangerous: you shouldn't execute user input.
If you have Python 2.6 or newer, use ast.literal_eval instead of eval:
>>> import ast
>>> ast.literal_eval('["A","B" ,"C" ," D"]')
['A', 'B', 'C', ' D']
Once you have that, strip the strings.
If you're on an older version of Python, you can get very close to what you want with a simple regular expression:
>>> import re
>>> x='[ "A", " B", "C","D "]'
>>> re.findall(r'"\s*([^"]*?)\s*"', x)
['A', 'B', 'C', 'D']
This isn't as good as the ast solution, for example it doesn't correctly handle escaped quotes in strings. But it's simple, doesn't involve a dangerous eval, and might be good enough for your purpose if you're on an older Python without ast.

There is a quick solution:
x = eval('[ "A","B","C" , " D"]')
Unwanted whitespace in the list elements can be removed this way:
x = [x.strip() for x in eval('[ "A","B","C" , " D"]')]

Inspired by some of the answers above that work with base Python packages, I compared the performance of a few (using Python 3.7.3):
Method 1: ast
import ast
list(map(str.strip, ast.literal_eval(u'[ "A","B","C" , " D"]')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, ast.literal_eval(u'[ \"A\",\"B\",\"C\" , \" D\"]')))", setup='import ast', number=100000)
# 1.292875313000195
Method 2: json
import json
list(map(str.strip, json.loads(u'[ "A","B","C" , " D"]')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, json.loads(u'[ \"A\",\"B\",\"C\" , \" D\"]')))", setup='import json', number=100000)
# 0.27833264000014424
Method 3: no import
list(map(str.strip, u'[ "A","B","C" , " D"]'.strip('][').replace('"', '').split(',')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, u'[ \"A\",\"B\",\"C\" , \" D\"]'.strip('][').replace('\"', '').split(',')))", number=100000)
# 0.12935059100027502
I was disappointed to see what I considered the method with the worst readability was the method with the best performance... there are trade-offs to consider when going with the most readable option... for the type of workloads I use Python for I usually value readability over a slightly more performant option, but as usual it depends.

import ast
l = ast.literal_eval('[ "A","B","C" , " D"]')
l = [i.strip() for i in l]

If it's only a one dimensional list, this can be done without importing anything:
>>> x = u'[ "A","B","C" , " D"]'
>>> ls = x.strip('[]').replace('"', '').replace(' ', '').split(',')
>>> ls
['A', 'B', 'C', 'D']

You can do this:
x = '[ "A","B","C" , " D"]'
print(list(eval(x)))
However, this is not a safe way to do it; the accepted answer is the better choice. I wasn't aware of the danger of eval when I posted this answer.

There isn't any need to import anything or to evaluate. You can do this in one line for most basic use cases, including the one given in the original question.
One liner
l_x = [i.strip() for i in x[1:-1].replace('"',"").split(',')]
Explanation
x = '[ "A","B","C" , " D"]'
# String indexing to eliminate the brackets.
# Replace, as split will otherwise retain the quotes in the returned list
# Split to convert to a list
l_x = x[1:-1].replace('"',"").split(',')
Outputs:
for i in range(0, len(l_x)):
    print(l_x[i])
# vvvv output vvvvv
'''
A
B
C
D
'''
print(type(l_x)) # out: class 'list'
print(len(l_x)) # out: 4
You can parse and clean up this list as needed using list comprehension.
l_x = [i.strip() for i in l_x] # list comprehension to clean up
for i in range(0, len(l_x)):
    print(l_x[i])
# vvvvv output vvvvv
'''
A
B
C
D
'''
Nested lists
If you have nested lists, it does get a bit more annoying. Without using regex (which would simplify the replace), and assuming you want to return a flattened list (and the zen of python says flat is better than nested):
x = '[ "A","B","C" , " D", ["E","F","G"]]'
l_x = x[1:-1].split(',')
l_x = [i
       .replace(']', '')
       .replace('[', '')
       .replace('"', '')
       .strip() for i in l_x
      ]
# returns ['A', 'B', 'C', 'D', 'E', 'F', 'G']
If you need to retain the nested list it gets a bit uglier, but it can still be done just with regular expressions and list comprehension:
import re
x = '[ "A","B","C" , " D", "["E","F","G"]","Z", "Y", "["H","I","J"]", "K", "L"]'
# Clean it up so the regular expression is simpler
x = x.replace('"', '').replace(' ', '')
# Look ahead for the bracketed text that signifies nested list
l_x = re.split(r',(?=\[[A-Za-z0-9\',]+\])|(?<=\]),', x[1:-1])
print(l_x)
# Flatten and split the non nested list items
l_x0 = [item for items in l_x for item in items.split(',') if '[' not in items]
# Convert the nested lists to lists
l_x1 = [
    i[1:-1].split(',') for i in l_x if '[' in i
]
# Add the two lists
l_x = l_x0 + l_x1
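This last solution works on flat lists and on lists with one level of nesting, as long as the items are simple alphanumeric strings like those above. Note that the nested lists end up at the end of the result rather than in their original positions.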
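For comparison, when the string is a well-formed Python literal (unlike the second example above, where the nested lists are themselves wrapped in quotes), ast.literal_eval keeps the nesting and the order without any regex. A minimal sketch, with strip_nested as a hypothetical helper name:

```python
import ast

def strip_nested(value):
    """Recursively strip strings inside (possibly nested) lists."""
    if isinstance(value, list):
        return [strip_nested(item) for item in value]
    if isinstance(value, str):
        return value.strip()
    return value  # leave numbers, None, etc. untouched

x = '[ "A","B","C" , " D", ["E","F","G"]]'
print(strip_nested(ast.literal_eval(x)))
# ['A', 'B', 'C', 'D', ['E', 'F', 'G']]
```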

Assuming that all your inputs are lists and that the double quotes in the input actually don't matter, this can be done with a simple regexp replace. It is a bit perl-y, but it works like a charm. Note also that the output is now a list of Unicode strings, you didn't specify that you needed that, but it seems to make sense given Unicode input.
import re
x = u'[ "A","B","C" , " D"]'
junkers = re.compile(r'[\[\]" ]')
result = junkers.sub('', x).split(',')
print result
---> [u'A', u'B', u'C', u'D']
The junkers variable contains a compiled regexp (for speed) of all characters we don't want, using ] as a character required some backslash trickery.
The re.sub replaces all these characters with nothing, and we split the resulting string at the commas.
Note that this also removes spaces from inside entries u'["oh no"]' ---> [u'ohno']. If this is not what you wanted, the regexp needs to be souped up a bit.

If you know that your lists only contain quoted strings, this pyparsing example will give you your list of stripped strings (even preserving the original Unicode-ness).
>>> from pyparsing import *
>>> x =u'[ "A","B","C" , " D"]'
>>> LBR,RBR = map(Suppress,"[]")
>>> qs = quotedString.setParseAction(removeQuotes, lambda t: t[0].strip())
>>> qsList = LBR + delimitedList(qs) + RBR
>>> print qsList.parseString(x).asList()
[u'A', u'B', u'C', u'D']
If your lists can have more datatypes, or even contain lists within lists, then you will need a more complete grammar - like this one in the pyparsing examples directory, which will handle tuples, lists, ints, floats, and quoted strings.

You may run into this problem when dealing with scraped data stored in a Pandas DataFrame.
This solution works like a charm if the list of values is present as text.
def textToList(hashtags):
    return hashtags.strip('[]').replace('\'', '').replace(' ', '').split(',')
hashtags = "[ 'A','B','C' , ' D']"
hashtags = textToList(hashtags)
Output: ['A', 'B', 'C', 'D']
No external library required.

This usually happens when you save a list to a CSV file as a string.
If your list is stored in a CSV file in the form the OP asked about:
x = '[ "A","B","C" , " D"]'
Here is how you can load it back into a list:
import csv
with open('YourCSVFile.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    rows = list(reader)
    listItems = rows[0]
listItems is now a list.

To further complete Ryan's answer using JSON, one very convenient function to convert Unicode is in this answer.
Example with double or single quotes:
print byteify(json.loads(u'[ "A","B","C" , " D"]'))
print byteify(json.loads(u"[ 'A','B','C' , ' D']".replace('\'','"')))
['A', 'B', 'C', ' D']
['A', 'B', 'C', ' D']

I would like to provide a more intuitive pattern-based solution with regex.
The function below takes as input a stringified list containing arbitrary strings.
Stepwise explanation:
You remove all whitespace, brackets and value separators (provided they are not part of the values you want to extract; otherwise make the regex more complex). Then you split the cleaned string on single or double quotes and take the non-empty values (or the odd-indexed values, whichever you prefer).
def parse_strlist(sl):
    import re
    clean = re.sub(r"[\[\],\s]", "", sl)
    splitted = re.split(r"['\"]", clean)
    values_only = [s for s in splitted if s != '']
    return values_only
testsample: "['21',"foo" '6', '0', " A"]"

You can save yourself the .strip() function by just slicing off the first and last characters from the string representation of the list (see the third line below):
>>> mylist=[1,2,3,4,5,'baloney','alfalfa']
>>> strlist=str(mylist)
>>> mylistfromstring=(strlist[1:-1].split(', '))
>>> mylistfromstring[3]
'4'
>>> for entry in mylistfromstring:
...     print(entry)
...     type(entry)
...
1
<class 'str'>
2
<class 'str'>
3
<class 'str'>
4
<class 'str'>
5
<class 'str'>
'baloney'
<class 'str'>
'alfalfa'
<class 'str'>

And with pure Python - not importing any libraries:
[x for x in x.split('[')[1].split(']')[0].split('"')[1:-1] if x not in[',',' , ',', ']]

So, following all the answers I decided to time the most common methods:
from time import time
import ast
import re
import json
my_str = str(list(range(19)))
print(my_str)
reps = 100000
start = time()
for i in range(0, reps):
    re.findall(r"\w+", my_str)
print("Regex method:\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
    json.loads(my_str)
print("JSON method:\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
    ast.literal_eval(my_str)
print("AST method:\t\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
    [n.strip() for n in my_str]
print("strip method:\t", (time() - start) / reps)
regex method: 6.391477584838867e-07
json method: 2.535374164581299e-06
ast method: 2.4425282478332518e-05
strip method: 4.983267784118653e-06
So in the end regex wins!
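One thing to keep in mind when reading these numbers is that the timed calls do not all return the same thing: re.findall returns strings, and the strip loop iterates over the individual characters of my_str. A hedged sketch of a like-for-like comparison using timeit, where the candidates dictionary and the int conversion in the regex variant are assumptions added for this example:

```python
import ast
import json
import re
import timeit

my_str = str(list(range(19)))
expected = list(range(19))

# Each candidate must return the same list of ints to be comparable.
candidates = {
    "regex": lambda: [int(tok) for tok in re.findall(r"\d+", my_str)],
    "json": lambda: json.loads(my_str),
    "ast": lambda: ast.literal_eval(my_str),
}

reps = 10000
for name, func in candidates.items():
    assert func() == expected  # check correctness before timing
    seconds = timeit.timeit(func, number=reps)
    print(f"{name}: {seconds / reps:.2e} s per call")
```

The absolute timings depend on the machine and Python version, so no expected output is shown.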

This solution is simpler than some I read in the previous answers, but it requires you to account for every feature of the list format (quotes, commas and brackets).
x = '[ "A","B","C" , " D"]'
[i.strip() for i in x.split('"') if len(i.strip().strip(',').strip(']').strip('['))>0]
Output:
['A', 'B', 'C', 'D']

string split considering quotation

Imagine this string:
"a","b","hi, this is Mboyle"
I would like to split it on commas, unless the comma is between two quotations:
i.e:
["a","b","hi, this is Mboyle"]
How do I achieve this? Using split, the "hi, this is Mboyle" gets split as well!

You can split your string not by commas, but by ",":
In [1]: '"a","b","hi, this is Mboyle"'.strip('"').split('","')
Out[1]: ['a', 'b', 'hi, this is Mboyle']

My take on the problem (use with caution!)
s = '"a","b","hi, this is Mboyle"'
new_s = eval(f'[{s}]')
print(new_s)
Output:
['a', 'b', 'hi, this is Mboyle']
EDIT (safer version):
import ast
s = '"a","b","hi, this is Mboyle"'
new_s = ast.literal_eval(f'[{s}]')

Solved.
with gzip.open(file, 'rt') as handler:
    for row in csv.reader(handler, delimiter=","):
        ...
This does the trick! Thank you all.
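The same csv.reader approach also works on a string that is already in memory, since the reader accepts any iterable of lines. A minimal sketch using the string from the question:

```python
import csv

s = '"a","b","hi, this is Mboyle"'
# csv.reader takes any iterable of lines, so a one-element list is enough.
fields = next(csv.reader([s]))
print(fields)   # ['a', 'b', 'hi, this is Mboyle']
```

The reader handles the quoting rules itself, so commas inside double quotes stay part of the field.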

You could include the quotations in the split, so with .split('","'). Then remove the quotations on the end list items as needed.

You can use re.split:
import re
s = '"a","b","hi, this is Mboyle"'
new_s = list(map(lambda x:x[1:-1], re.split('(?<="),(?=")', s)))
Output:
['a', 'b', 'hi, this is Mboyle']
However, re.findall is much cleaner:
new_result = re.findall('"(.*?)"', s)
Output:
['a', 'b', 'hi, this is Mboyle']

Replace empty string with space

Hello, I've started learning Python and I've come across a problem. I want to replace each empty string in a list with a space (" "). For example, if I call the function with function(['', 'x', 'x', '', '', 'y', 'y', '', 'a']) I would like to return a string ' xx yy a'.
def function(a):
    for i in a:
        if i == None:
            a[i] = " "
    string = "".join(a)
    return string

Use a generator expression instead with a short-circuiting or:
def function(a):
    return ''.join(char or ' ' for char in a)
If the character is a non-empty string, it'll be used as is. Otherwise, a space will be used.
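As a quick check of the short-circuiting behaviour, here is the same expression applied to the list from the question. The name fill_blanks and the filler parameter are assumptions made for this sketch:

```python
def fill_blanks(items, filler=" "):
    """Join items, substituting filler for empty strings and None."""
    # `item or filler` evaluates to filler whenever item is falsy.
    return "".join(item or filler for item in items)

result = fill_blanks(['', 'x', 'x', '', '', 'y', 'y', '', 'a'])
print(repr(result))   # ' xx  yy a'
```

Each empty string becomes exactly one space, so the two adjacent empty strings in the middle produce two spaces.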

To replace None, False and empty strings:
>>> a = ['','11',None,'22',False]
>>> b = [elem if elem else " " for elem in a]
>>> ''.join(b)
' 11 22 '
Note that your code would not replace '' because '' is not None:
>>> '' == None
False

Python list to string spacing

I have a list such as this
list = ['Hi', ',', 'my', 'name', 'is', 'Bob', '!']
I wanted to convert this to a string, and originally, I found on stackoverflow that .join() could be used. So i did:
x = ' '.join(list)
print(x)
which prints
"Hi , my name is Bob !"
when what I want printed is:
"Hi, my name is Bob!"
How do I not add spaces before periods and exclamation points? I want a more general case so that I can for example read in a text file as a list, and convert it to a string.
Thanks!

To solve it in a general case, use the nltk's "moses" detokenizer:
In [1]: l = ["Hi", ",", "my", "name", "is", "Bob", "!"]
In [2]: from nltk.tokenize.moses import MosesDetokenizer
In [3]: detokenizer = MosesDetokenizer()
In [4]: detokenizer.detokenize(l, return_str=True)
Out[4]: u'Hi, my name is Bob!'
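At the time of writing, the detokenizer was not yet part of a stable nltk package; to use it, you had to install nltk directly from GitHub. In later NLTK releases the Moses tokenizer and detokenizer were removed from nltk and are maintained in the separate sacremoses package.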

How about this, using simple regex?
import re
list = ['Hi', ',', 'my', 'name', 'is', 'Bob', '!']
x = re.sub(r' (\W)',r'\1',' '.join(list))
print(x)
>>> Hi, my name is Bob!
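Note that the pattern r' (\W)' removes the space before any non-word character, including an opening bracket. A possible variant, sketched under the assumption that only a fixed set of closing punctuation marks should attach to the preceding word (join_tokens is a hypothetical name):

```python
import re

def join_tokens(tokens):
    """Join tokens with spaces, then attach closing punctuation to the left."""
    text = " ".join(tokens)
    # Remove whitespace only before these punctuation marks.
    return re.sub(r"\s+([,.!?;:])", r"\1", text)

print(join_tokens(['Hi', ',', 'my', 'name', 'is', 'Bob', '!']))
# Hi, my name is Bob!
print(join_tokens(['He', 'said', '(', 'quietly', ')', '.']))
# He said ( quietly ).
```

Brackets and quotes still keep their surrounding spaces in this sketch; handling them properly is where a real detokenizer becomes useful.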

A little different solution:
>>> from string import punctuation
>>> lis = ["Hi", ",", "my", "name", "is", "Bob", "!"]
>>> string = ''
>>> for i, x in enumerate(lis):
...     if x not in punctuation and i != 0:
...         string += ' ' + x
...     elif x not in punctuation and i == 0:
...         string += x
...     else:
...         string += x
...
>>> print(string)
Hi, my name is Bob!

How to sort the letters in a string alphabetically in Python

Is there an easy way to sort the letters in a string alphabetically in Python?
So for:
a = 'ZENOVW'
I would like to return:
'ENOVWZ'

You can do:
>>> a = 'ZENOVW'
>>> ''.join(sorted(a))
'ENOVWZ'

>>> a = 'ZENOVW'
>>> b = sorted(a)
>>> print b
['E', 'N', 'O', 'V', 'W', 'Z']
sorted returns a list, so you can make it a string again using join:
>>> c = ''.join(b)
which joins the items of b together with an empty string '' in between each item.
>>> print c
ENOVWZ

Sorted() solution can give you some unexpected results with other strings.
List of other solutions:
Sort letters and make them distinct:
>>> s = "Bubble Bobble"
>>> ''.join(sorted(set(s.lower())))
' belou'
Sort letters and make them distinct while keeping caps:
>>> s = "Bubble Bobble"
>>> ''.join(sorted(set(s)))
' Bbelou'
Sort letters and keep duplicates:
>>> s = "Bubble Bobble"
>>> ''.join(sorted(s))
' BBbbbbeellou'
If you want to get rid of the space in the result, add strip() function in any of those mentioned cases:
>>> s = "Bubble Bobble"
>>> ''.join(sorted(set(s.lower()))).strip()
'belou'

Python's sorted function orders the characters of a string by their code points (ASCII values).
INCORRECT: In the example below, e and d come after H and W because of their ASCII values.
>>> a = "Hello World!"
>>> "".join(sorted(a))
' !HWdellloor'
CORRECT: To sort the string alphabetically without changing the case of the letters, use:
>>> a = "Hello World!"
>>> "".join(sorted(a,key=lambda x:x.lower()))
' !deHllloorW'
OR (Ref: https://docs.python.org/3/library/functions.html#sorted)
>>> a = "Hello World!"
>>> "".join(sorted(a,key=str.lower))
' !deHllloorW'
If you want to remove all punctuation and numbers, use:
>>> a = "Hello World!"
>>> "".join(filter(lambda x:x.isalpha(), sorted(a,key=lambda x:x.lower())))
'deHllloorW'

You can use reduce (in Python 3, import it from functools first):
>>> from functools import reduce
>>> a = 'ZENOVW'
>>> reduce(lambda x,y: x+y, sorted(a))
'ENOVWZ'

This code sorts a string in alphabetical order without using Python's built-in sorting functions:
k = input("Enter any string again ")
li = []
x = len(k)
for i in range (0,x):
    li.append(k[i])
print("List is : ",li)
for i in range(0,x):
    for j in range(0,x):
        if li[i]<li[j]:
            temp = li[i]
            li[i]=li[j]
            li[j]=temp
j=""
for i in range(0,x):
    j = j+li[i]
print("After sorting String is : ",j)

Really liked the answer with the reduce() function. Here's another way to sort the string using accumulate().
from itertools import accumulate
s = 'mississippi'
print(tuple(accumulate(sorted(s)))[-1])
sorted(s) -> ['i', 'i', 'i', 'i', 'm', 'p', 'p', 's', 's', 's', 's']
tuple(accumulate(sorted(s))) -> ('i', 'ii', 'iii', 'iiii', 'iiiim', 'iiiimp', 'iiiimpp', 'iiiimpps', 'iiiimppss', 'iiiimppsss', 'iiiimppssss')
We are selecting the last index (-1) of the tuple
