Split a string by a delimiter in python - python

How to split this string where __ is the delimiter
MATCHES__STRING
To get an output of ['MATCHES', 'STRING']?
For splitting specifically on whitespace, see How do I split a string into a list of words?.
To extract everything before the first delimiter, see Splitting on first occurrence.
To extract everything before the last delimiter, see partition string in python and get value of last segment after colon.

You can use the str.split method: string.split('__')
>>> "MATCHES__STRING".split("__")
['MATCHES', 'STRING']
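A few edge cases of str.split are worth seeing side by side. The following is only a small sketch using built-in string methods; the sample strings other than MATCHES__STRING are made up for illustration.

```python
# A multi-character delimiter is matched as a whole, not character by character.
parts = "MATCHES__STRING".split("__")

# Adjacent delimiters produce empty strings in the result.
with_empty = "a____b".split("__")

# maxsplit limits the number of splits, counting from the left.
limited = "a__b__c".split("__", 1)

# A missing delimiter returns the whole string as a single-element list.
missing = "MATCHES".split("__")

print(parts, with_empty, limited, missing)
```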

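You may be interested in the csv module, which is designed for comma-separated files but can easily be configured to use a custom delimiter. Note that csv only accepts a single-character delimiter, so "__" would raise a TypeError; the example here uses "#" instead.
import csv
csv.register_dialect("myDialect", delimiter="#")
lines = ["MATCHES#STRING"]
for row in csv.reader(lines, dialect="myDialect"):
    print(row)  # ['MATCHES', 'STRING']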

When the string contains a known number of parts (three in the example below), you can unpack the result of split directly into comma-separated variables:
date, time, event_name = ev.get_text(separator='#').split("#")
After this line, the three variables hold the three parts of the text extracted from ev.
So, if ev contains the following text, with # as the separator:
Sa., 23. März#19:00#Klavier + Orchester: SPEZIAL
then after the split:
date will have the value Sa., 23. März
time will have the value 19:00
event_name will have the value Klavier + Orchester: SPEZIAL
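Unpacking fails with a ValueError when the number of parts does not match the number of variables. One possible guard, sketched here on a plain string rather than on the ev object from the answer, is to pass maxsplit so that any extra separators stay in the last part.

```python
line = "Sa., 23. März#19:00#Klavier + Orchester: SPEZIAL"

# maxsplit=2 yields at most three parts, so a '#' inside the event name
# would stay in event_name instead of breaking the unpacking.
date, time, event_name = line.split("#", 2)

# Made-up input with an extra '#' to show the effect of maxsplit.
extra = "a#b#c#d".split("#", 2)

print(date, time, event_name, extra)
```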

Besides split and rsplit, there is partition/rpartition. It splits the string only once, at the first (or, for rpartition, the last) occurrence of the separator, which may be all the question needs.
Example:
>>> "MATCHES__STRING".partition("__")
('MATCHES', '__', 'STRING')
>>> "MATCHES__STRING".partition("__")[::2]
('MATCHES', 'STRING')
It is also a bit faster than split("_", 1):
$ python -m timeit "'validate_field_name'.split('_', 1)[-1]"
2000000 loops, best of 5: 136 nsec per loop
$ python -m timeit "'validate_field_name'.partition('_')[-1]"
2000000 loops, best of 5: 108 nsec per loop
Timeit lines are based on this answer
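One practical difference from split is what happens when the separator is missing. A short sketch of that behaviour, using only built-in string methods:

```python
s = "MATCHES__STRING"

# partition always returns a 3-tuple: (head, separator, tail).
head, sep, tail = s.partition("__")

# When the separator is absent, the last two slots are empty strings,
# so three-way unpacking never fails.
no_sep = "MATCHES".partition("__")

# rpartition searches from the right; with a missing separator the
# whole string ends up in the last slot instead.
no_sep_r = "MATCHES".rpartition("__")

print(head, sep, tail, no_sep, no_sep_r)
```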

If ev is already a plain str (for example in Python 3.8), you don't need the get_text method at all; you can just use ev.split("#"). In fact, str has no get_text method, so calling it raises an AttributeError.
So if you have a string variable, for example:
filename = 'file/foo/bar/fox'
you can split it into separate variables with commas, as suggested in the answer above, but using the delimiter that actually appears in the string:
W, X, Y, Z = filename.split('/')
# W == 'file'
# X == 'foo'
# Y == 'bar'
# Z == 'fox'

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example:
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is:
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',', '.'), but that replaces every comma with a dot, including the delimiters.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
print(new_str)
# Arun,Mishra,108.23,34,45,56,Mumbai
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. It has three capture groups, which can be referenced in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34,45).
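The same substitution can also be written with lookarounds instead of capture groups. This is only an alternative sketch, not the answerer's method; it assumes, as the answer does, that the first comma flanked by digits is the one to change.

```python
import re

old_str = "Arun,Mishra,108,23,34,45,56,Mumbai"

# (?<=\d) and (?=\d) are zero-width assertions: they require a digit on each
# side of the comma without consuming it, so no groups need to be re-emitted.
# count=1 limits the substitution to the first match.
new_str = re.sub(r"(?<=\d),(?=\d)", ".", old_str, count=1)
print(new_str)
```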
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two runs of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and code it into your python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string with s.split(','), then join the third and fourth elements with a dot, remove the leftover fourth element, and join the list back together.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
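The split-and-rejoin idea can be wrapped in a small function. The name merge_fields and its parameters are hypothetical, chosen only for this sketch; it assumes a non-negative index and a line with no quoted fields.

```python
def merge_fields(line, index, sep=",", joiner="."):
    """Join field `index` and the field after it with `joiner` (hypothetical helper)."""
    fields = line.split(sep)
    if index < 0 or index + 1 >= len(fields):
        raise ValueError("index must point at a field that has a successor")
    # Replace the two-field slice with a single merged field.
    fields[index:index + 2] = [joiner.join(fields[index:index + 2])]
    return sep.join(fields)

merged = merge_fields("Arun,Mishra,108,23,34,45,56,Mumbai", 2)
print(merged)
```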
It changes every comma to a dot if the comma has digits immediately before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # only replace a comma that has a digit on both sides
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)

How to partial split and take the first portion of string in Python?

I have a scenario where I want to split a string partially and pick up one portion of it.
The string could be like aloha_maui_d0_b0 or new_york_d9_b10. Note: what follows d is numeric and can be any number of digits.
I want to strip everything before _d*, i.e. keep only _d0_b0 or _d9_b10.
I tried the code below, but obviously it removes the split term as well.
print(("aloha_maui_d0_b0").split("_d"))
#Output is : ['aloha_maui', '0_b0']
#But Wanted : _d0_b0
Is there any other way to get the partial portion? Do I need to try out in regexp?
How about just
stArr = "aloha_maui_d0_b0".split("_d")
st2 = '_d' + stArr[1]
This should do the trick if the string always has a '_d' in it
You can use index() to split in 2 parts:
s = 'aloha_maui_d0_b0'
idx = s.index('_d')
l = [s[:idx], s[idx:]]
# l = ['aloha_maui', '_d0_b0']
Edit: You can also use this if you have multiple _d in your string:
s = 'aloha_maui_d0_b0_d1_b1_d2_b2'
idxs = [n for n in range(len(s)) if n == 0 or s.find('_d', n) == n]
parts = [s[i:j] for i,j in zip(idxs, idxs[1:]+[None])]
# parts = ['aloha_maui', '_d0_b0', '_d1_b1', '_d2_b2']
I have two suggestions.
partition()
Use the method partition() to get a tuple containing the delimiter as one of the elements and use the + operator to get the String you want:
teste1 = 'aloha_maui_d0_b0'
partitiontest = teste1.partition('_d')
print(partitiontest)
print(partitiontest[1] + partitiontest[2])
Output:
('aloha_maui', '_d', '0_b0')
_d0_b0
The partition() method returns a tuple whose first element is what comes before the delimiter, the second is the delimiter itself, and the third is what comes after it.
The method splits only at the first occurrence of the delimiter, so you can't use it to split into more than three parts without extra code. For that, my second suggestion would be better.
replace()
Use the method replace() to insert an extra character (or characters) right before your delimiter (_d) and use these as the delimiter on the split() method.
teste2 = 'new_york_d9_b10'
replacetest = teste2.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
new_york|_d9_b10
['new_york', '_d9_b10']
Since it replaces every occurrence of _d in the string with |_d, it can also be used to split into more than two parts.
Problem?
You need to be careful about unwanted splits when _d appears in more places than anticipated.
Following the apparent logic of your examples with city names and numericals, you might have something like this:
teste3 = 'rio_de_janeiro_d3_b32'
replacetest = teste3.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
rio|_de_janeiro|_d3_b32
['rio', '_de_janeiro', '_d3_b32']
Assuming the numeric part is always at the end of the string and _d won't occur inside it, rpartition() could be a solution:
rpartitiontest = teste3.rpartition('_d')
print(rpartitiontest)
print(rpartitiontest[1] + rpartitiontest[2])
Output:
('rio_de_janeiro', '_d', '3_b32')
_d3_b32
Since rpartition() searches from the end of the string and splits only at the first match it finds there, you don't have to worry about the first part (the city's name?) causing unexpected splits.
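The rpartition() approach can be packaged so that a missing marker is handled explicitly. The function name suffix_from_last is hypothetical, and returning an empty string for a missing marker is just one possible choice.

```python
def suffix_from_last(s, marker="_d"):
    """Return the part of `s` starting at the last `marker` (hypothetical helper)."""
    head, sep, tail = s.rpartition(marker)
    # sep is empty when the marker is absent; return '' rather than the whole string.
    return sep + tail if sep else ""

print(suffix_from_last("aloha_maui_d0_b0"))
print(suffix_from_last("rio_de_janeiro_d3_b32"))
```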
Use the regex split's ability to keep delimiters:
import re
# note the parentheses around the pattern: the capture group is what makes split keep the delimiter
patre = re.compile(r"(_d\d)")
for line in """aloha_maui_d0_b0 new_york_d9_b10""".split():
    parts = patre.split(line)
    print("\n", line)
    print(parts)
    p1, p2 = parts[0], "".join(parts[1:])
    print(p1, p2)
Output:
aloha_maui_d0_b0
['aloha_maui', '_d0', '_b0']
aloha_maui _d0_b0
new_york_d9_b10
['new_york', '_d9', '_b10']
new_york _d9_b10
credit due: https://stackoverflow.com/a/15668433
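If only the trailing portion is needed, a single re.search may be enough. This sketch assumes the wanted part starts at the first _d that is followed by at least one digit; the name numeric_suffix is hypothetical.

```python
import re

# _d followed by one or more digits, then everything up to the end of the string.
pattern = re.compile(r"_d\d+.*")

def numeric_suffix(s):
    """Return the text from the first '_d<digits>' onward, or None if absent."""
    match = pattern.search(s)
    return match.group(0) if match else None

print(numeric_suffix("aloha_maui_d0_b0"))
print(numeric_suffix("rio_de_janeiro_d3_b32"))
```

Because the pattern requires a digit after _d, the _de in rio_de_janeiro is skipped.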

Python Split if there is only one element

I am trying to do a Python Split but there seems to be a problem with my logic.
I have some data, separated with a semicolon. Some example of my data would be like:
89;50;20
40
I only want to retrieve one value from each row. For example, in row 1 I only want the last value, which is 20, and I want 40 from the second row.
I tried using the following code:
fields = fields.split(";")[-1]
It works for the first row: I get 20. But I am unable to get the data from the second row, as it has only one element in the split.
Then I tried using an if-else condition like the one below, but the code does not run.
if (len(fields.split(";")) > 0):
    fields = fields.split(";")[-1]
else:
    pass
Does anybody know how to deal with this problem? What I am trying to achieve is that if there is only one value in the row, I read it; if there is more than one value, I split the row and take the last value.
Use strip to normalize the input. The problem is that there is an extra ; in the single-number case, so we should remove it first.
In [1]: def lnum(s):
   ...:     return s.strip(';').split(';')[-1]
   ...:
In [2]: lnum('89;50;20')
Out[2]: '20'
In [3]: lnum('89;')
Out[3]: '89'
In [5]: lnum('10;')
Out[5]: '10'
So, if you see when you split the string - '40;' using semicolon (;), you get a list of two strings - ['40', '']. So, fields.split(";")[-1] returns an empty string for the input '40;'.
So, either you strip the last semicolon ; before splitting as follows.
print('40;'.rstrip(';').split(';')[-1])
OR, you can do:
fields = '40;'.split(';')
if fields[-1]:
    print(fields[-1])
else:
    print(fields[-2])
I prefer the first approach to the if/else approach. Also, have a look at the .strip(), .lstrip(), and .rstrip() methods.
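The rstrip approach can be combined with rsplit so that only one split is performed from the right. The helper name last_field is hypothetical, and the sketch assumes a single-character separator.

```python
def last_field(row, sep=";"):
    """Return the last non-empty field of `row` (hypothetical helper)."""
    # rstrip drops any trailing separators; rsplit with maxsplit=1 then
    # splits once from the right, and [-1] picks the final field.
    return row.rstrip(sep).rsplit(sep, 1)[-1]

for row in ["89;50;20", "40", "40;"]:
    print(repr(row), "->", last_field(row))
```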
Another way is to use the re module.
from re import findall
s1 = '80;778;20'
s2 = '40'
res1 = findall(r'\d+', s1)
res2 = findall(r'\d+', s2)
print(res1[-1])
print(res2[-1])

Python split a string at an underscore

How do I split a string at the second underscore in Python so that I get something like this?
name = 'this_is_my_name_and_its_cool'
Split name so that I get ["this_is", "my_name_and_its_cool"].
The following statement splits name into a list of strings:
a = name.split("_")
You can combine whichever strings you want using join, in this case the first two words and the rest:
b = "_".join(a[:2])
c = "_".join(a[2:])
You could also write a small function that takes as an argument the number of words (n) after which you want to split:
def func(name, n):
    a = name.split("_")
    b = "_".join(a[:n])
    c = "_".join(a[n:])
    return [b, c]
This assumes that you have a string with multiple instances of the same delimiter and you want to split at the nth delimiter, ignoring the others.
Here's a solution using just split and join, without complicated regular expressions. This might be a bit easier to adapt to other delimiters and particularly other values of n.
def split_at(s, c, n):
    words = s.split(c)
    return c.join(words[:n]), c.join(words[n:])
Example:
>>> split_at('this_is_my_name_and_its_cool', '_', 2)
('this_is', 'my_name_and_its_cool')
I think you're trying to split the string at the second underscore. If so, you could use the findall function.
>>> import re
>>> s = "this_is_my_name_and_its_cool"
>>> re.findall(r'^[^_]*_[^_]*|[^_].*$', s)
['this_is', 'my_name_and_its_cool']
>>> [i for i in re.findall(r'^[^_]*_[^_]*|(?!_).*$', s) if i]
['this_is', 'my_name_and_its_cool']
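Try this; note that the result starts with an empty string, because the match begins at the start of the input:
import re
print(re.split(r"(^[^_]+_[^_]+)_", "this_is_my_name_and_its_cool"))
# ['', 'this_is', 'my_name_and_its_cool']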
Here's a quick & dirty way to do it:
s = 'this_is_my_name_and_its_cool'
i = s.find('_')
i = s.find('_', i + 1)
print([s[:i], s[i+1:]])
output
['this_is', 'my_name_and_its_cool']
You could generalize this approach to split on the nth separator by putting the find() into a loop.
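A possible version of that generalization is sketched here. The function name split_at_nth, its 1-based n, and the choice to return the whole string when there are fewer than n separators are all assumptions made for illustration.

```python
def split_at_nth(s, sep, n):
    """Split `s` at the n-th occurrence of `sep` (1-based); hypothetical helper."""
    if n < 1 or not sep:
        raise ValueError("n must be >= 1 and sep must be non-empty")
    pos = -len(sep)
    for _ in range(n):
        # Continue searching just past the previous match.
        pos = s.find(sep, pos + len(sep))
        if pos == -1:
            # Fewer than n separators: nothing to split on.
            return [s]
    return [s[:pos], s[pos + len(sep):]]

print(split_at_nth("this_is_my_name_and_its_cool", "_", 2))
```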

Fastest way to split a concatenated string into a tuple and ignore empty strings

I have a concatenated string like this:
my_str = 'str1;str2;str3;'
I would like to split it, convert the resulting list to a tuple, and get rid of any empty strings produced by the split (notice the trailing ';').
So far, I am doing this:
tuple(filter(None, my_str.split(';')))
Is there any more efficient (in terms of speed and space) way to do it?
How about this?
tuple(my_str.split(';')[:-1])
('str1', 'str2', 'str3')
You split the string at the ; character, and pass all of the substrings (except the last one, the empty string) to tuple to create the result tuple.
That is a very reasonable way to do it. Some alternatives:
foo.strip(";").split(";") (if there won't be any empty slices inside the string)
[ x.strip() for x in foo.split(";") if x.strip() ] (to strip whitespace from each slice)
The "fastest" way to do this will depend on a lot of things… but you can easily experiment with ipython's %timeit:
In [1]: foo = "1;2;3;4;"
In [2]: %timeit foo.strip(";").split(";")
1000000 loops, best of 3: 1.03 us per loop
In [3]: %timeit filter(None, foo.split(';'))
1000000 loops, best of 3: 1.55 us per loop
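Outside IPython, the standard timeit module can run the same kind of comparison. The script below is only a sketch: the dictionary keys are made-up labels, and each candidate is wrapped in tuple() because filter returns a lazy iterator in Python 3, which would otherwise make it look artificially fast. No timings are shown because they depend on the machine and Python version.

```python
import timeit

my_str = "str1;str2;str3;"

# Each candidate builds the same tuple from my_str.
candidates = {
    "filter": lambda: tuple(filter(None, my_str.split(";"))),
    "genexpr": lambda: tuple(part for part in my_str.split(";") if part),
    "slice": lambda: tuple(my_str.split(";")[:-1]),
    "strip": lambda: tuple(my_str.strip(";").split(";")),
}

for name, func in candidates.items():
    # number is kept small so the script finishes quickly; raise it for steadier results.
    seconds = timeit.timeit(func, number=100_000)
    print(f"{name:8s} {seconds:.4f} s per 100000 calls")
```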
If you only expect an empty string at the end, you can do:
a = 'str1;str2;str3;'
tuple(a.split(';')[:-1])
or
tuple(a[:-1].split(';'))
Try tuple(my_str.split(';')[:-1])
Yes, that is quite a Pythonic way to do it. If you have a love for generator expressions, you could also replace the filter() with:
tuple(part for part in my_str.split(';') if part)
This has the benefit of allowing further processing on each part in-line.
It's interesting to note that the documentation for str.split() says:
... If sep is not specified or is None, any whitespace string is a
separator and empty strings are removed from the result.
I wonder why this special case was done, without allowing it for other separators...
use split and then slicing:
my_str.split(';')[:-1]
or :
lis=[x for x in my_str.split(';') if x]
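If the number of items in your string is fixed, you could also destructure inline. With the trailing ';' in my_str, strip it first so that the number of parts matches the number of names:
(str1, str2, str3) = my_str.strip(";").split(";")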
more on that here:
https://blog.teclado.com/destructuring-in-python/
I know this is an old question, but I just came upon this and saw that the top answer (David) doesn't return a tuple like OP requested. Although the solution works for the one example OP gave, the highest voted answer (Levon) strips the trailing semicolon with a substring, which would error on an empty string.
The most robust and pythonic solution is voithos' answer:
tuple(part for part in my_str.split(';') if part)
Here's my solution:
tuple(my_str.strip(';').split(';'))
It returns this when run against an empty string though:
('',)
So I'll be replacing mine with voithos' answer. Thanks voithos!
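The difference between the two approaches shows up on edge-case inputs. A small sketch comparing them, with a hypothetical helper name split_nonempty and made-up sample strings besides the one from the question:

```python
def split_nonempty(s, sep=";"):
    """Tuple of the non-empty fields of `s` (hypothetical helper)."""
    return tuple(part for part in s.split(sep) if part)

def split_stripped(s, sep=";"):
    """Tuple built by stripping outer separators first, then splitting."""
    return tuple(s.strip(sep).split(sep))

for sample in ["str1;str2;str3;", "", ";a;;b;"]:
    print(repr(sample), split_nonempty(sample), split_stripped(sample))
```

The generator version drops empty fields wherever they occur, while the strip version only removes them at the ends.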
