Split text after the second occurrence of character - python

I need to split text before the second occurrence of the '-' character. What I have now is producing inconsistent results. I've tried various combinations of rsplit and read through and tried other solutions on SO, with no results.
Sample file name to split: 'some-sample-filename-to-split' returned in data.filename. In this case, I would only like to have 'some-sample' returned.
fname, extname = os.path.splitext(data.filename)
file_label = fname.rsplit('/',1)[-1]
file_label2 = file_label.rsplit('-',maxsplit=3)
print(file_label2,'\n','---------------','\n')
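One likely source of the inconsistent results is that rsplit counts separators from the right, so the left-most piece depends on how many hyphens the name contains. A minimal sketch of the difference, using the sample name from the question (the variable names are only illustrative):

```python
name = 'some-sample-filename-to-split'

# rsplit counts separators from the right, so what ends up in the
# first element depends on the total number of hyphens in the name.
from_right = name.rsplit('-', maxsplit=3)

# split counts from the left, so the first two pieces are always
# the ones that precede the second hyphen.
from_left = name.split('-', maxsplit=2)

print(from_right)               # ['some-sample', 'filename', 'to', 'split']
print(from_left)                # ['some', 'sample', 'filename-to-split']
print('-'.join(from_left[:2]))  # some-sample
```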

You can do something like this:
>>> a = "some-sample-filename-to-split"
>>> "-".join(a.split("-", 2)[:2])
'some-sample'
a.split("-", 2) splits the string on the first two occurrences of -, producing at most three parts.
a.split("-", 2)[:2] will give the first 2 elements in the list. Then simply join the first 2 elements.
OR
You could use regular expression : ^([\w]+-[\w]+)
>>> import re
>>> reg = r'^([\w]+-[\w]+)'
>>> re.match(reg, a).group()
'some-sample'
EDIT: As discussed in the comments, here is what you need:
def hyphen_split(a):
    if a.count("-") == 1:
        return a.split("-")[0]
    return "-".join(a.split("-", 2)[:2])
>>> hyphen_split("some-sample-filename-to-split")
'some-sample'
>>> hyphen_split("some-sample")
'some'

A generic form to split a string into halves on the nth occurrence of the separator would be:
def split(strng, sep, pos):
    strng = strng.split(sep)
    return sep.join(strng[:pos]), sep.join(strng[pos:])
If pos is negative it will count the occurrences from the end of string.
>>> strng = 'some-sample-filename-to-split'
>>> split(strng, '-', 3)
('some-sample-filename', 'to-split')
>>> split(strng, '-', -4)
('some', 'sample-filename-to-split')
>>> split(strng, '-', 1000)
('some-sample-filename-to-split', '')
>>> split(strng, '-', -1000)
('', 'some-sample-filename-to-split')

You can use str.index():
def hyphen_split(s):
    pos = s.index('-')
    try:
        return s[:s.index('-', pos + 1)]
    except ValueError:
        return s[:pos]
test:
>>> hyphen_split("some-sample-filename-to-split")
'some-sample'
>>> hyphen_split("some-sample")
'some'

You could use regular expressions:
import re
file_label = re.search('(.*?-.*?)-', fname).group(1)
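Note that re.search returns None when the name has fewer than two hyphens, so calling .group(1) directly would raise AttributeError. A hedged sketch of a guard follows; the helper name label_before_second_hyphen is hypothetical, and returning the name unchanged in that case is an assumption about the desired behavior:

```python
import re

def label_before_second_hyphen(fname):
    # Hypothetical helper: return the text before the second hyphen.
    # Assumption: when there are fewer than two hyphens, return the
    # name unchanged instead of failing on a None match.
    match = re.search(r'(.*?-.*?)-', fname)
    return match.group(1) if match else fname

print(label_before_second_hyphen('some-sample-filename-to-split'))  # some-sample
print(label_before_second_hyphen('some-sample'))                    # some-sample
```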

When working with a DataFrame where every value in a column needs the split, applying a lambda is simpler than using a regex:
df['column_name'].apply(lambda x: "-".join(x.split('-',2)[:2]))

Here's a somewhat cryptic implementation avoiding the use of join():
from functools import reduce  # reduce is not a builtin in Python 3

def split(string, sep, n):
    """Split `string` at the `n`th occurrence of `sep` (counting from 0)"""
    pos = reduce(lambda x, _: string.index(sep, x + 1), range(n + 1), -1)
    return string[:pos], string[pos + len(sep):]
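For readers who find the reduce() call hard to follow, the same idea can be sketched with an explicit loop. The name split_at_nth is hypothetical, and returning the unsplit string when there are too few separators is an assumption (the reduce() version raises ValueError instead):

```python
def split_at_nth(string, sep, n):
    # Split `string` at the occurrence of `sep` with zero-based index `n`
    # (n=0 is the first occurrence), mirroring the reduce() version.
    pos = -1
    for _ in range(n + 1):
        pos = string.find(sep, pos + 1)
        if pos == -1:
            # Assumption: too few separators, so return the string unsplit.
            return string, ''
    return string[:pos], string[pos + len(sep):]

print(split_at_nth('some-sample-filename-to-split', '-', 1))
# ('some-sample', 'filename-to-split')
```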

Related

Python Split String at First Non-Alpha Character

Say I have strings such as 'ABC)D.' or 'AB:CD/'. How can I split them at the first non-alphabetic character to end up with ['ABC', 'D.'] and ['AB', 'CD/']? Is there a way to do this without regex?
You can use a loop
a = 'AB$FDWRE'
i = 0
while i<len(a) and a[i].isalpha():
    i += 1
>>> a[:i]
'AB'
>>> a[i:]
'$FDWRE'
One option would be to find the location of the first non-alphabetic character:
def split_at_non_alpha(s):
    try:
        split_at = next(i for i, x in enumerate(s) if not x.isalpha())
        return s[:split_at], s[split_at+1:]
    except StopIteration: # if not found
        return (s,)
print(split_at_non_alpha('ABC)D.')) # ('ABC', 'D.')
print(split_at_non_alpha('AB:CD/')) # ('AB', 'CD/')
print(split_at_non_alpha('.ABCD')) # ('', 'ABCD')
print(split_at_non_alpha('ABCD.')) # ('ABCD', '')
print(split_at_non_alpha('ABCD')) # ('ABCD',)
With a for loop, enumerate, and string slicing:
def first_non_alpha_splitter(word):
    for index, char in enumerate(word):
        if not char.isalpha():
            return [word[:index], word[index+1:]]
    return [word]  # no non-alphabetic character found
The result
first_non_alpha_splitter('ABC)D.')
# Output: ['ABC', 'D.']
first_non_alpha_splitter('AB:CD/')
# Output: ['AB', 'CD/']
Barmar's suggestion worked best for me. The other answers had nearly the same execution time, but I chose this one for readability.
from itertools import takewhile
text = 'ABC)D.'  # named text rather than str, which would shadow the built-in
alphStr = ''.join(takewhile(lambda x: x.isalpha(), text))
print(alphStr) # Outputs 'ABC'
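The takewhile call only yields the alphabetic prefix. To get both parts that the question asks for, the prefix length can be used as the split position. A minimal sketch, where split_with_takewhile is a hypothetical helper name:

```python
from itertools import takewhile

def split_with_takewhile(s):
    # The alphabetic prefix comes from takewhile; its length is the
    # index of the first non-alphabetic character, which is dropped.
    head = ''.join(takewhile(str.isalpha, s))
    if len(head) == len(s):
        return [s]  # no non-alphabetic character found
    return [head, s[len(head) + 1:]]

print(split_with_takewhile('ABC)D.'))  # ['ABC', 'D.']
print(split_with_takewhile('AB:CD/'))  # ['AB', 'CD/']
```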

How to replace characters in a spliced list?

My code right now is very simple
sentence = input("Input a sentence: ")
print(sentence[::2])
My goal is that, instead of the slice dropping the skipped characters, they are replaced with another character such as 'A'.
Some things I've tried are
print(sentence[::2].replace("A", "B"))
print(sentence[::2], sep = "A")
print(sentence[::2], "A")
One solution with print:
s = 'test'
print(*s[::2], sep='A', end='\n' if len(s) % 2 else 'A\n')
Prints:
tAsA
If s='tes':
tAs
You have the right idea with sep, but the wrong function.
extract alternate characters with [::2]
convert to a list of individual chars
join them with the desired separator, A
Step by step:
>>> s = "Hello, world"
>>> s[::2]
'Hlo ol'
>>> list(s[::2])
['H', 'l', 'o', ' ', 'o', 'l']
>>> 'A'.join(list(s[::2]))
'HAlAoA AoAl'
Q.E.D.
Thanks to kaya3 for the bug catching. Kludge solution:
>>> new = 'A'.join(list(s[::2]))
>>> if len(new) < len(s):
... new += 'A'
...
>>> new
'HAlAoA AoAlA'
>>>
The join method almost does this, but fails on even-length strings:
>>> 'A'.join('hello'[::2])
'hAlAo'
>>> 'A'.join('test'[::2])
'tAs'
To solve this we can add on an extra A if the length is even:
def replace_alternating(s, sep):
    result = sep.join(s[::2])
    if len(s) % 2 == 0:
        result += sep
    return result
Here's an alternative solution using a regex to replace pairs of characters:
>>> import re
>>> re.sub('(.).', r'\1A', 'hello')
'hAlAo'
>>> re.sub('(.).', r'\1A', 'test')
'tAsA'
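Another way to get the same result, not taken from the answers above, is to overwrite the odd positions with slice assignment. A minimal sketch, assuming the filler is a single character (replace_odd_positions is a hypothetical name):

```python
def replace_odd_positions(s, filler):
    # Overwrite every character at an odd index with `filler`.
    # Assumption: `filler` is exactly one character, so the replacement
    # has the same length as the slice being assigned.
    chars = list(s)
    chars[1::2] = filler * len(chars[1::2])
    return ''.join(chars)

print(replace_odd_positions('hello', 'A'))  # hAlAo
print(replace_odd_positions('test', 'A'))   # tAsA
```

Because the length of the slice is computed from the string itself, odd and even lengths need no special case.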

Splitting a string and appending

I have a string for example:
"112233445566778899"
How can I split it into the following pattern:
"\x11\x22\x33\x44\x55\x66\x77\x88\x99"
I could split the string with the following command, but I couldn't find a way to add "\x" to each piece:
s = "112233445566778899"
[s[i:i + 2] for i in range(0, len(s), 2)]
Assuming your string will always have an even length, you always want to split the string into pairs, and that your string is already ordered:
>>> string = "112233445566778899"
>>> joined = ''.join(r'\x{}'.format(s + s) for s in string[1::2])
>>> print(joined)
\x11\x22\x33\x44\x55\x66\x77\x88\x99
>>>
You can do the following edit to your code:
...
[r"\x"+s[i:i + 2] for i in range(0, len(s), 2)]
...
Notice that the list is displayed with two backslashes per element:
['\\x11', '\\x22', '\\x33', '\\x44', '\\x55', '\\x66', '\\x77', '\\x88', '\\x99']
This is because the repr of a string escapes each literal \ as \\.
When you print a string, only one \ appears:
>>> x = ['\\x11', '\\x22', '\\x33', '\\x44', '\\x55', '\\x66', '\\x77', '\\x88', '\\x99']
>>> print(x[0])
\x11
s = "112233445566778899"
a = [r'\x' + s[i:i + 2] for i in range(0, len(s), 2)]
print(''.join(a))
I think regular expressions suit this best, because they can find doubled characters anywhere in the string.
>>> import re
>>> string = "112233445566778899"
>>> x = ''.join(r'\x{}'.format(m.group()) for m in re.finditer(r'(\w)\1', string))
>>>x
'\\x11\\x22\\x33\\x44\\x55\\x66\\x77\\x88\\x99'
>>> print(x)
\x11\x22\x33\x44\x55\x66\x77\x88\x99
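The answers above all build literal text containing a backslash and an x. If the goal is instead the actual byte values 0x11, 0x22, and so on (an assumption, since the question does not say), bytes.fromhex converts the digits directly. A short sketch of both readings:

```python
s = "112233445566778899"

# Literal text: a backslash, an 'x', and two hex digits per pair.
as_text = ''.join(r'\x' + s[i:i + 2] for i in range(0, len(s), 2))

# Actual bytes: bytes.fromhex expects an even number of hex digits.
as_bytes = bytes.fromhex(s)

print(as_text)        # \x11\x22\x33\x44\x55\x66\x77\x88\x99
print(len(as_text))   # 36 characters of text
print(len(as_bytes))  # 9 bytes
```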

Split string into strings by length?

Is there a way to take a string that is 4*x characters long, and cut it into 4 strings, each x characters long, without knowing the length of the string?
For example:
>>> x = "qwertyui"
>>> split(x, one, two, three, four)
>>> two
'er'
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
I tried Alexander's answer but got this error in Python 3:
TypeError: 'float' object cannot be interpreted as an integer
This is because the / division operator in Python 3 returns a float. This works for me:
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
Notice the // at the end of line 2, to ensure truncation to an integer.
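The difference between the two operators can be seen in a short sketch (the variable names are only illustrative):

```python
x = "qwertyui"

true_div = len(x) / 4    # true division: always a float in Python 3
floor_div = len(x) // 4  # floor division: an int for int operands

print(true_div)   # 2.0
print(floor_div)  # 2

# range() needs integer arguments, so the float step raises TypeError.
try:
    range(0, len(x), true_div)
except TypeError as exc:
    print(exc)
```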
:param s: str; source string
:param w: int; width to split on
Using the textwrap module:
PyDocs-textwrap
import textwrap
def wrap(s, w):
    return textwrap.fill(s, w)
:return str:
Inspired by Alexander's Answer
PyDocs-data structures
def wrap(s, w):
    return [s[i:i + w] for i in range(0, len(s), w)]
:return list:
Inspired by Eric's answer
PyDocs-regex
import re
def wrap(s, w):
    sre = re.compile(rf'(.{{{w}}})')
    return [x for x in re.split(sre, s) if x]
:return list:
some_string="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
x=3
res=[some_string[y-x:y] for y in range(x, len(some_string)+x,x)]
print(res)
will produce
['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU', 'VWX', 'YZ']
In Split string every nth character?, "the wolf" gives the most concise answer:
>>> import re
>>> re.findall('..','1234567890')
['12', '34', '56', '78', '90']
Here is a one-liner that doesn't need to know the length of the string beforehand:
from functools import partial
from io import StringIO  # Python 3; in Python 2 this was: from StringIO import StringIO
[l for l in iter(partial(StringIO(data).read, 4), '')]
If you have a file or socket, then you don't need the StringIO wrapper:
[l for l in iter(partial(file_like_object.read, 4), '')]
def split2len(s, n):
    def _f(s, n):
        while s:
            yield s[:n]
            s = s[n:]
    return list(_f(s, n))
Got an re trick:
In [28]: import re
In [29]: x = "qwertyui"
In [30]: [x for x in re.split(r'(\w{2})', x) if x]
Out[30]: ['qw', 'er', 'ty', 'ui']
As a function, it might look like:
def split(string, split_len):
    # Regex: `r'.{1}'` for example works for all characters
    regex = r'(.{%s})' % split_len
    return [x for x in re.split(regex, string) if x]
Here are two generic approaches. Probably worth adding to your own lib of reusables. First one requires the item to be sliceable and second one works with any iterables (but requires their constructor to accept iterable).
def split_bylen(item, maxlen):
    '''
    Requires item to be sliceable (with __getitem__ defined)
    '''
    return [item[ind:ind+maxlen] for ind in range(0, len(item), maxlen)]
    #You could also replace outer [ ] brackets with ( ) to use as generator.
def split_bylen_any(item, maxlen, constructor=None):
    '''
    Works with any iterables.
    Requires item's constructor to accept iterable or alternatively
    constructor argument could be provided (otherwise use item's class)
    '''
    if constructor is None: constructor = item.__class__
    return [constructor(part) for part in zip(* ([iter(item)] * maxlen))]
    #OR: return map(constructor, zip(* ([iter(item)] * maxlen)))
    #    which would be faster if you need an iterable, not list
So, in the original poster's case, the usage is:
string = 'Baboons love bananas'
parts = 5
splitlen = -(-len(string) // parts) # is alternative to math.ceil(len/parts)
first_method = split_bylen(string, splitlen)
#Result :['Babo', 'ons ', 'love', ' ban', 'anas']
second_method = split_bylen_any(string, splitlen, constructor=''.join)
#Result :['Babo', 'ons ', 'love', ' ban', 'anas']
length = 4
string = "abcdefgh"
str_dict = [ o for o in string ]
parts = [ ''.join( str_dict[ (j * length) : ( ( j + 1 ) * length ) ] ) for j in range(len(string) // length) ]  # range and // for Python 3
# splitting a string into chunks as long as sub_string
def len_split(string,sub_string):
    n,sub,str1=list(string),len(sub_string),')/^0*/-'
    for i in range(sub,len(n)+((len(n)-1)//sub),sub+1):
        n.insert(i,str1)
    n="".join(n)
    n=n.split(str1)
    return n
x="divyansh_looking_for_intership_actively_contact_Me_here"
sub="four"
print(len_split(x,sub))
# Result-> ['divy', 'ansh', '_loo', 'king', '_for', '_int', 'ersh', 'ip_a', 'ctiv', 'ely_', 'cont', 'act_', 'Me_h', 'ere']
There is a built-in function in Python for that:
import textwrap
text = "Your Text.... and so on"
width = 5
textwrap.wrap(text,width)
Voilà
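One caveat: textwrap.wrap is designed for wrapping prose, so it breaks at whitespace and drops the whitespace at line breaks, which is not the same as cutting into fixed-length pieces. A small sketch comparing it with plain slicing on an illustrative string:

```python
import textwrap

text = "ab cd ef"
width = 5

wrapped = textwrap.wrap(text, width)
sliced = [text[i:i + width] for i in range(0, len(text), width)]

print(wrapped)  # ['ab cd', 'ef']
print(sliced)   # ['ab cd', ' ef']
```

Joining the sliced pieces gives back the original string, while joining the wrapped pieces does not, because the space at the break was dropped.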
And for dudes who prefer it to be a bit more readable:
def itersplit_into_x_chunks(string,x=10): # we assume here that x is an int and > 0
    size = len(string)
    chunksize = size//x
    for pos in range(0, size, chunksize):
        yield string[pos:pos+chunksize]
output:
>>> list(itersplit_into_x_chunks('qwertyui',x=4))
['qw', 'er', 'ty', 'ui']
My solution
st =' abs de fdgh 1234 556 shg shshh'
print(st)
def splitStringMax( si, limit):
    ls = si.split()
    lo=[]
    st=''
    ln=len(ls)
    if ln==1:
        return [si]
    i=0
    for l in ls:
        st+=l
        i+=1
        if i <ln:
            lk=len(ls[i])
            if (len(st))+1+lk < limit:
                st+=' '
                continue
        lo.append(st);st=''
    return lo
############################
print(splitStringMax(st,7))
# ['abs de', 'fdgh', '1234', '556', 'shg', 'shshh']
print(splitStringMax(st,12))
# ['abs de fdgh', '1234 556', 'shg shshh']
l = 'abcdefghijklmn'
def group(l,n):
    tmp = len(l)%n
    zipped = list(zip(*[iter(l)]*n))  # list() so it can be concatenated in Python 3
    return zipped if tmp == 0 else zipped+[tuple(l[-tmp:])]
print(group(l,3))
String splitting is needed in many cases, such as sorting the characters of a string or replacing one character with another. These operations can all be performed with the string splitting methods below.
String splitting can be done in two ways:
Slicing the given string based on the length of the split.
Converting the given string to a list with the list(str) function, so that each character of the string becomes an element of the list. Then do the required operation and join the elements with 'separator'.join(list) to get a new processed string.
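The two approaches described above can be sketched as follows; the sample string and the choice of sorting as the "required operation" are only illustrative:

```python
s = "qwertyui"

# 1. Slicing the string based on the chunk length.
chunks = [s[i:i + 2] for i in range(0, len(s), 2)]
print(chunks)  # ['qw', 'er', 'ty', 'ui']

# 2. Converting to a list of characters, processing the list,
#    and joining the characters back with a separator.
chars = list(s)
chars.sort()
joined = '-'.join(chars)
print(joined)  # e-i-q-r-t-u-w-y
```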

rreplace - How to replace the last occurrence of an expression in a string?

Is there a quick way in Python to replace strings but, instead of starting from the beginning as replace does, starting from the end? For example:
>>> def rreplace(old, new, occurrence):
>>> ... # Code to replace the last occurrences of old by new
>>> '<div><div>Hello</div></div>'.rreplace('</div>','</bad>',1)
>>> '<div><div>Hello</div></bad>'
>>> def rreplace(s, old, new, occurrence):
... li = s.rsplit(old, occurrence)
... return new.join(li)
...
>>> s
'1232425'
>>> rreplace(s, '2', ' ', 2)
'123 4 5'
>>> rreplace(s, '2', ' ', 3)
'1 3 4 5'
>>> rreplace(s, '2', ' ', 4)
'1 3 4 5'
>>> rreplace(s, '2', ' ', 0)
'1232425'
Here is a one-liner:
result = new.join(s.rsplit(old, maxreplace))
Like str.replace, this returns a copy of string s with occurrences of substring old replaced by new, except that only the last maxreplace occurrences are replaced.
and a full example of this in use:
s = 'mississipi'
old = 'iss'
new = 'XXX'
maxreplace = 1
result = new.join(s.rsplit(old, maxreplace))
>>> result
'missXXXipi'
I'm not going to pretend that this is the most efficient way of doing it, but it's a simple way. It reverses all the strings in question, performs an ordinary replacement using str.replace on the reversed strings, then reverses the result back the right way round:
>>> def rreplace(s, old, new, count):
... return (s[::-1].replace(old[::-1], new[::-1], count))[::-1]
...
>>> rreplace('<div><div>Hello</div></div>', '</div>', '</bad>', 1)
'<div><div>Hello</div></bad>'
Just reverse the string, replace first occurrence and reverse it again:
mystr = "Remove last occurrence of a BAD word. This is a last BAD word."
removal = "BAD"
reverse_removal = removal[::-1]
replacement = "GOOD"
reverse_replacement = replacement[::-1]
newstr = mystr[::-1].replace(reverse_removal, reverse_replacement, 1)[::-1]
print("mystr:", mystr)
print("newstr:", newstr)
Output:
mystr: Remove last occurrence of a BAD word. This is a last BAD word.
newstr: Remove last occurrence of a BAD word. This is a last GOOD word.
If you know that the 'old' string does not contain any special characters you can do it with a regex:
In [44]: s = '<div><div>Hello</div></div>'
In [45]: import re
In [46]: re.sub(r'(.*)</div>', r'\1</bad>', s)
Out[46]: '<div><div>Hello</div></bad>'
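If the old string may contain regex special characters, re.escape can make them literal. A hedged sketch of that variation follows; rreplace_regex is a hypothetical name, and it replaces only the last occurrence:

```python
import re

def rreplace_regex(s, old, new):
    # re.escape makes characters such as '.' or '(' in `old` literal.
    # re.DOTALL lets the greedy '.*' span newlines, so the match ends
    # at the last occurrence of `old` in the whole string.
    pattern = re.compile('(.*)' + re.escape(old), re.DOTALL)
    # A function replacement avoids backslash processing in `new`.
    return pattern.sub(lambda m: m.group(1) + new, s, count=1)

print(rreplace_regex('<div><div>Hello</div></div>', '</div>', '</bad>'))
# <div><div>Hello</div></bad>
print(rreplace_regex('a.b.c', '.', '-'))
# a.b-c
```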
Here is a recursive solution to the problem:
def rreplace(s, old, new, occurence = 1):
    if occurence == 0:
        return s
    left, found, right = s.rpartition(old)
    if found == "":
        return right
    else:
        return rreplace(left, old, new, occurence - 1) + new + right
Try this:
def replace_last(string, old, new):
    old_idx = string.rfind(old)
    if old_idx == -1:
        return string  # old not found: leave the string unchanged
    return string[:old_idx] + new + string[old_idx+len(old):]
Similarly you can replace first occurrence by replacing string.rfind() with string.find().
I hope it helps.
If you have a list of strings, you can use a list comprehension and string slicing in a one-liner to cover the whole list, with no need for a function:
myList = [x[::-1].replace('<div>'[::-1],'<bad>'[::-1],1)[::-1] if x.endswith('<div>') else x for x in myList]
I use if/else to keep the items that don't meet the criteria for replacement; otherwise the list would contain only the items that do.
