Related
My code right now is very simple
sentence = input("Input a sentence: ")
print(sentence[::2])
My goal is to instead of having the spliced list replace the characters with nothing it will replace with another character like 'A'
Some things I've tried are
print(sentence[::2].replace("A", "B")
print(sentence[::2], sep = "A")
print(sentence[::2], "A")
One solution with print:
s = 'test'
print(*s[::2], sep='A', end='\n' if len(s) % 2 else 'A\n')
Prints:
tAsA
If s='tes':
tAs
You have the right idea with sep, but the wrong function.
extract alternate characters with [::2]
convert to a list of individual chars
join them with the desired separator, A
Step by step:
>>> s = "Hello, world"
>>> s[::2]
'Hlo ol'
>>> list(s[::2])
['H', 'l', 'o', ' ', 'o', 'l']
>>> 'A'.join(list(s[::2]))
'HAlAoA AoAl'
Q.E.D.
Thanks to kaya3 for the bug catching. Kludge solution:
>>> new = 'A'.join(list(s[::2]))
>>> if len(new) < len(s):
... new += 'A'
...
>>> new
'HAlAoA AoAlA'
>>>
The join method almost does this, but fails on even-lengthed strings:
>>> 'A'.join('hello'[::2])
'hAlAo'
>>> 'A'.join('test'[::2])
'tAs'
To solve this we can add on an extra A if the length is even:
def replace_alternating(s, sep):
result = sep.join(s[::2])
if len(s) % 2 == 0:
result += sep
return result
Here's an alternative solution using a regex to replace pairs of characters:
>>> re.sub('(.).', r'\1A', 'hello')
'hAlAo'
>>> re.sub('(.).', r'\1A', 'test')
'tAsA'
What I was trying to achieve, was something like this:
>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']
So I searched and found this perfect regular expression:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
As the next logical step I tried:
>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']
Why does this not work, and how do I achieve the result from the linked question in python?
Edit: Solution summary
I tested all provided solutions with a few test cases:
string: ''
AplusKminus: ['']
casimir_et_hippolyte: []
two_hundred_success: []
kalefranz: string index out of range # with modification: either [] or ['']
string: ' '
AplusKminus: [' ']
casimir_et_hippolyte: []
two_hundred_success: [' ']
kalefranz: [' ']
string: 'lower'
all algorithms: ['lower']
string: 'UPPER'
all algorithms: ['UPPER']
string: 'Initial'
all algorithms: ['Initial']
string: 'dromedaryCase'
AplusKminus: ['dromedary', 'Case']
casimir_et_hippolyte: ['dromedary', 'Case']
two_hundred_success: ['dromedary', 'Case']
kalefranz: ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']
string: 'CamelCase'
all algorithms: ['Camel', 'Case']
string: 'ABCWordDEF'
AplusKminus: ['ABC', 'Word', 'DEF']
casimir_et_hippolyte: ['ABC', 'Word', 'DEF']
two_hundred_success: ['ABC', 'Word', 'DEF']
kalefranz: ['ABCWord', 'DEF']
In summary you could say the solution by #kalefranz does not match the question (see the last case) and the solution by #casimir et hippolyte eats a single space, and thereby violates the idea that a split should not change the individual parts. The only difference among the remaining two alternatives is that my solution returns a list with the empty string on an empty string input and the solution by #200_success returns an empty list.
I don't know how the python community stands on that issue, so I say: I am fine with either one. And since 200_success's solution is simpler, I accepted it as the correct answer.
As #AplusKminus has explained, re.split() never splits on an empty pattern match. Therefore, instead of splitting, you should try finding the components you are interested in.
Here is a solution using re.finditer() that emulates splitting:
def camel_case_split(identifier):
matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
return [m.group(0) for m in matches]
Use re.sub() and split()
import re
name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()
Result
'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
Most of the time when you don't need to check the format of a string, a global research is more simple than a split (for the same result):
re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')
returns
['Camel', 'Case', 'XYZ']
To deal with dromedary too, you can use:
re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')
Note: (?=[A-Z]|$) can be shorten using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
Working solution, without regexp
I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.
Here is a quite straightforward solution in pure python:
def camel_case_split(s):
idx = list(map(str.isupper, s))
# mark change of case
l = [0]
for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
if x and not y: # "Ul"
l.append(i)
elif not x and y: # "lU"
l.append(i+1)
l.append(len(s))
# for "lUl", index of "U" will pop twice, have to filter that
return [s[x:y] for x, y in zip(l, l[1:]) if x < y]
And some tests
TESTS = [
("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
("XYZCamelCaseXYZ", ['XYZ', 'Camel', 'Case', 'XYZ']),
("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
("Ta", ['Ta']),
("aT", ['a', 'T']),
("a", ['a']),
("T", ['T']),
("", []),
]
def test():
for (q,a) in TESTS:
assert camel_case_split(q) == a
if __name__ == "__main__":
test()
Edit: a solution which streams data in one pass
This solution leverages the fact that the decision to split word or not can be taken locally, just considering the current character and the previous one.
def camel_case_split(s):
u = True # case of previous char
w = b = '' # current word, buffer for last uppercase letter
for c in s:
o = c.isupper()
if u and o:
w += b
b = c
elif u and not o:
if len(w)>0:
yield w
w = b + c
b = ''
elif not u and o:
yield w
w = ''
b = c
else: # not u and not o:
w += c
u = o
if len(w)>0 or len(b)>0: # flush
yield w + b
It is theoretically faster and lesser memory usage.
same tests suite applies
but list must be built by caller
def test():
for (q,a) in TESTS:
r = list(camel_case_split(q))
print(q,a,r)
assert r == a
Try it online
I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.
RE_WORDS = re.compile(r'''
# Find words in a string. Order matters!
[A-Z]+(?=[A-Z][a-z]) | # All upper case before a capitalized word
[A-Z]?[a-z]+ | # Capitalized words / all lower case
[A-Z]+ | # All upper case
\d+ # Numbers
''', re.VERBOSE)
The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:
assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
import re
re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result
# '(?<=[a-z])' --> means preceding lowercase char (group A)
# '(?=[A-Z])' --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on
The documentation for python's re.split says:
Note that split will never split a string on an empty pattern match.
When seeing this:
>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']
it becomes clear, why the split does not work as expected. The remodule finds empty matches, just as intended by the regular expression.
Since the documentation states that this is not a bug, but rather intended behavior, you have to work around that when trying to create a camel case split:
def camel_case_split(identifier):
matches = finditer('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
split_string = []
# index of beginning of slice
previous = 0
for match in matches:
# get slice
split_string.append(identifier[previous:match.start()])
# advance index
previous = match.start()
# get remaining string
split_string.append(identifier[previous:])
return split_string
This solution also supports numbers, spaces, and auto remove underscores:
def camel_terms(value):
return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)
Some tests:
tests = [
"XYZCamelCase",
"CamelCaseXYZ",
"Camel_CaseXYZ",
"3DCamelCase",
"Camel5Case",
"Camel5Case5D",
"Camel Case XYZ"
]
for test in tests:
print(test, "=>", camel_terms(test))
results:
XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']
Simple solution:
re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
Here's another solution that requires less code and no complicated regular expressions:
def camel_case_split(string):
bldrs = [[string[0].upper()]]
for c in string[1:]:
if bldrs[-1][-1].islower() and c.isupper():
bldrs.append([c])
else:
bldrs[-1].append(c)
return [''.join(bldr) for bldr in bldrs]
Edit
The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like
def camel_case_split2(string):
# set the logic for creating a "break"
def is_transition(c1, c2):
return c1.islower() and c2.isupper()
# start the builder list with the first character
# enforce upper case
bldr = [string[0].upper()]
for c in string[1:]:
# get the last character in the last element in the builder
# note that strings can be addressed just like lists
previous_character = bldr[-1][-1]
if is_transition(previous_character, c):
# start a new element in the list
bldr.append(c)
else:
# append the character to the last string
bldr[-1] += c
return bldr
I know that the question added the tag of regex. But still, I always try to stay as far away from regex as possible. So, here is my solution without regex:
def split_camel(text, char):
if len(text) <= 1: # To avoid adding a wrong space in the beginning
return text+char
if char.isupper() and text[-1].islower(): # Regular Camel case
return text + " " + char
elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
return text[:-1] + " " + text[-1] + char
else: # Do nothing part
return text + char
text = "PathURLFinder"
text = reduce(split_camel, a, "")
print text
# prints "Path URL Finder"
print text.split(" ")
# prints "['Path', 'URL', 'Finder']"
EDIT:
As suggested, here is the code to put the functionality in a single function.
def split_camel(text):
def splitter(text, char):
if len(text) <= 1: # To avoid adding a wrong space in the beginning
return text+char
if char.isupper() and text[-1].islower(): # Regular Camel case
return text + " " + char
elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
return text[:-1] + " " + text[-1] + char
else: # Do nothing part
return text + char
converted_text = reduce(splitter, text, "")
return converted_text.split(" ")
split_camel("PathURLFinder")
# prints ['Path', 'URL', 'Finder']
Putting a more comprehensive approach otu ther. It takes care of several issues like numbers, string starting with lower case, single letter words etc.
def camel_case_split(identifier, remove_single_letter_words=False):
"""Parses CamelCase and Snake naming"""
concat_words = re.split('[^a-zA-Z]+', identifier)
def camel_case_split(string):
bldrs = [[string[0].upper()]]
string = string[1:]
for idx, c in enumerate(string):
if bldrs[-1][-1].islower() and c.isupper():
bldrs.append([c])
elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
bldrs.append([c])
else:
bldrs[-1].append(c)
words = [''.join(bldr) for bldr in bldrs]
words = [word.lower() for word in words]
return words
words = []
for word in concat_words:
if len(word) > 0:
words.extend(camel_case_split(word))
if remove_single_letter_words:
subset_words = []
for word in words:
if len(word) > 1:
subset_words.append(word)
if len(subset_words) > 0:
words = subset_words
return words
My requirement was a bit more specific than the OP. In particular, in addition to handling all OP cases, I needed the following which the other solutions do not provide:
- treat all non-alphanumeric input (e.g. !##$%^&*() etc) as a word separator
- handle digits as follows:
- cannot be in the middle of a word
- cannot be at the beginning of the word unless the phrase starts with a digit
def splitWords(s):
new_s = re.sub(r'[^a-zA-Z0-9]', ' ', # not alphanumeric
re.sub(r'([0-9]+)([^0-9])', '\\1 \\2', # digit followed by non-digit
re.sub(r'([a-z])([A-Z])','\\1 \\2', # lower case followed by upper case
re.sub(r'([A-Z])([A-Z][a-z])', '\\1 \\2', # upper case followed by upper case followed by lower case
s
)
)
)
)
return [x for x in new_s.split(' ') if x]
Output:
for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
print test + ':' + str(splitWords(test))
:[]
:[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']
Maybe this will be enough to for some people:
a = "SomeCamelTextUpper"
def camelText(val):
return ''.join([' ' + i if i.isupper() else i for i in val]).strip()
print(camelText(a))
It dosen't work with the type "CamelXYZ", but with 'typical' CamelCase scenario should work just fine.
I think below is the optimim
Def count_word():
Return(re.findall(‘[A-Z]?[a-z]+’, input(‘please enter your string’))
Print(count_word())
I have a string of characters which I know to be sorted. Example:
myString = "aaaabbbbbbcccddddd"
I want to split this item into a list at the point when the character I am on is different than its preceding character, as shown below:
splitList = ["aaaa","bbbbbb","ccc","ddddd"]
I am working in Python 3.4.
Thanks!
In [294]: myString = "aaaabbbbbbcccddddd"
In [295]: [''.join(list(g)) for i,g in itertools.groupby(myString)]
Out[295]: ['aaaa', 'bbbbbb', 'ccc', 'ddddd']
myString = "aaaabbbbbbcccddddd"
result = []
for i,s in enumerate(myString):
l = len(result)
if l == 0 or s != myString[i-1]:
result.append(s)
else:
result[l-1] = result[l-1] + s
print result
Output:
['aaaa', 'bbbbbb', 'ccc', 'ddddd']
Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
Use a simple list comprehension, as a filter and get only the alphabets from the actual string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?
Is there a way to take a string that is 4*x characters long, and cut it into 4 strings, each x characters long, without knowing the length of the string?
For example:
>>>x = "qwertyui"
>>>split(x, one, two, three, four)
>>>two
'er'
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
I tried Alexanders answer but got this error in Python3:
TypeError: 'float' object cannot be interpreted as an integer
This is because the division operator in Python3 is returning a float. This works for me:
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
Notice the // at the end of line 2, to ensure truncation to an integer.
:param s: str; source string
:param w: int; width to split on
Using the textwrap module:
PyDocs-textwrap
import textwrap
def wrap(s, w):
return textwrap.fill(s, w)
:return str:
Inspired by Alexander's Answer
PyDocs-data structures
def wrap(s, w):
return [s[i:i + w] for i in range(0, len(s), w)]
:return list:
Inspired by Eric's answer
PyDocs-regex
import re
def wrap(s, w):
sre = re.compile(rf'(.{{{w}}})')
return [x for x in re.split(sre, s) if x]
:return list:
some_string="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
x=3
res=[some_string[y-x:y] for y in range(x, len(some_string)+x,x)]
print(res)
will produce
['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU', 'VWX', 'YZ']
In Split string every nth character?, "the wolf" gives the most concise answer:
>>> import re
>>> re.findall('..','1234567890')
['12', '34', '56', '78', '90']
Here is a one-liner that doesn't need to know the length of the string beforehand:
from functools import partial
from StringIO import StringIO
[l for l in iter(partial(StringIO(data).read, 4), '')]
If you have a file or socket, then you don't need the StringIO wrapper:
[l for l in iter(partial(file_like_object.read, 4), '')]
def split2len(s, n):
def _f(s, n):
while s:
yield s[:n]
s = s[n:]
return list(_f(s, n))
Got an re trick:
In [28]: import re
In [29]: x = "qwertyui"
In [30]: [x for x in re.split(r'(\w{2})', x) if x]
Out[30]: ['qw', 'er', 'ty', 'ui']
Then be a func, it might looks like:
def split(string, split_len):
# Regex: `r'.{1}'` for example works for all characters
regex = r'(.{%s})' % split_len
return [x for x in re.split(regex, string) if x]
Here are two generic approaches. Probably worth adding to your own lib of reusables. First one requires the item to be sliceable and second one works with any iterables (but requires their constructor to accept iterable).
def split_bylen(item, maxlen):
'''
Requires item to be sliceable (with __getitem__ defined)
'''
return [item[ind:ind+maxlen] for ind in range(0, len(item), maxlen)]
#You could also replace outer [ ] brackets with ( ) to use as generator.
def split_bylen_any(item, maxlen, constructor=None):
'''
Works with any iterables.
Requires item's constructor to accept iterable or alternatively
constructor argument could be provided (otherwise use item's class)
'''
if constructor is None: constructor = item.__class__
return [constructor(part) for part in zip(* ([iter(item)] * maxlen))]
#OR: return map(constructor, zip(* ([iter(item)] * maxlen)))
# which would be faster if you need an iterable, not list
So, in topicstarter's case, the usage is:
string = 'Baboons love bananas'
parts = 5
splitlen = -(-len(string) // parts) # is alternative to math.ceil(len/parts)
first_method = split_bylen(string, splitlen)
#Result :['Babo', 'ons ', 'love', ' ban', 'anas']
second_method = split_bylen_any(string, splitlen, constructor=''.join)
#Result :['Babo', 'ons ', 'love', ' ban', 'anas']
length = 4
string = "abcdefgh"
str_dict = [ o for o in string ]
parts = [ ''.join( str_dict[ (j * length) : ( ( j + 1 ) * length ) ] ) for j in xrange(len(string)/length )]
# spliting a string by the length of the string
def len_split(string,sub_string):
n,sub,str1=list(string),len(sub_string),')/^0*/-'
for i in range(sub,len(n)+((len(n)-1)//sub),sub+1):
n.insert(i,str1)
n="".join(n)
n=n.split(str1)
return n
x="divyansh_looking_for_intership_actively_contact_Me_here"
sub="four"
print(len_split(x,sub))
# Result-> ['divy', 'ansh', 'tiwa', 'ri_l', 'ooki', 'ng_f', 'or_i', 'nter', 'ship', '_con', 'tact', '_Me_', 'here']
There is a built in function in python for that
import textwrap
text = "Your Text.... and so on"
width = 5 #
textwrap.wrap(text,width)
Vualla
And for dudes who prefer it to be a bit more readable:
def itersplit_into_x_chunks(string,x=10): # we assume here that x is an int and > 0
size = len(string)
chunksize = size//x
for pos in range(0, size, chunksize):
yield string[pos:pos+chunksize]
output:
>>> list(itersplit_into_x_chunks('qwertyui',x=4))
['qw', 'er', 'ty', 'ui']
My solution
st =' abs de fdgh 1234 556 shg shshh'
print st
def splitStringMax( si, limit):
ls = si.split()
lo=[]
st=''
ln=len(ls)
if ln==1:
return [si]
i=0
for l in ls:
st+=l
i+=1
if i <ln:
lk=len(ls[i])
if (len(st))+1+lk < limit:
st+=' '
continue
lo.append(st);st=''
return lo
############################
print splitStringMax(st,7)
# ['abs de', 'fdgh', '1234', '556', 'shg', 'shshh']
print splitStringMax(st,12)
# ['abs de fdgh', '1234 556', 'shg shshh']
l = 'abcdefghijklmn'
def group(l,n):
tmp = len(l)%n
zipped = zip(*[iter(l)]*n)
return zipped if tmp == 0 else zipped+[tuple(l[-tmp:])]
print group(l,3)
The string splitting is required in many cases like where you have to sort the characters of the string given, replacing a character with an another character etc. But all these operations can be performed with the following mentioned string splitting methods.
The string splitting can be done in two ways:
Slicing the given string based on the length of split.
Converting the given string to a list with list(str) function, where characters of the string breakdown to form the the elements of a list. Then do the required operation and join them with 'specified character between the characters of the original string'.join(list) to get a new processed string.