Removing orphan letters in a list with Python - python

I have a python list:
list = ['clothing items s','shoes s','handbag d','fashion k']
I have used a for loop that removed words from the above list using another list.
The challenge I have been facing is the issue around plurals/singulars. This has left me with random orphan letters.
Do you know how to loop through the list items and identify single letters such as 's','d','k' (in the above example) and remove them? While in the example the orphan is at the end of the string, it is not always the case.
Here is my current loop:
new_new_keywords = []
# first we start looping over every keyword
for keyword in new_keywords2:
    # loop over every stop
    for stop in new_stops:
        # check if this stop is inside the current new_key
        if stop in keyword:
            # if it is, update the new key to remove the current stop
            keyword = keyword.replace(stop, '')
            # regex removes numbers at the end of the string in the list
            keyword = re.sub(" \d+", " ", keyword)
    # loop over the keyword over and over again until
    # remove every stop word
    # append the new stop-less keyword to the end of the array
    # even if there are no changes
    new_new_keywords.append(keyword)

The following is a rather old fashioned (and inefficient) approach which should work. This will preserve your original strings, apart from removing the unwanted characters:
test_list = ['clothing items s','shoes s','handbag d','fashion k', 'keep a', 'keep i', 'leave a alone remove k', 'keep , spacing b']
remove_list = "sdk" # letters that need to be removed
newlist = []
for item in test_list:
    item += "_"  # append unused symbol to end of string
    for letter in remove_list:
        item = item.replace(" %s " % letter, "")
        item = item.replace(" %s_" % letter, "")
    newlist.append(item.rstrip("_"))
print(newlist)
It gives the following output:
['clothing items', 'shoes', 'handbag', 'fashion', 'keep a', 'keep i', 'leave a alone remove', 'keep , spacing b']
If at some point you choose to give regular expressions a go, then similar logic can be achieved using the following:
import re
test_list = ['clothing items s','shoes s','handbag d','fashion k', 'keep a', 'keep i', 'leave a alone remove k', 'keep , spacing b']
remove_list = "sdk"
newlist = [re.sub(" ([%s])( |$)" % remove_list, "", item) for item in test_list]
print(newlist)

Take each string s, split it in words w, then reassemble s filtering out words with only 1 letter:
map(lambda s: ' '.join(w for w in s.split() if len(w) > 1), list)
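A minimal Python 3 sketch of the same split-and-filter idea, written as a plain function so the result is a list rather than a lazy map object. The function name drop_orphans and the keep parameter are assumptions made for illustration: they let you preserve real one-letter words such as "a" and "I". Note that any other single-character token, punctuation included, is dropped.

```python
def drop_orphans(phrases, keep=("a", "i")):
    """Remove single-character words, except those listed in `keep`.

    `keep` is a hypothetical parameter (not in the original answer);
    the comparison against it is case-insensitive.
    """
    cleaned = []
    for phrase in phrases:
        # split() with no argument also collapses repeated whitespace
        words = [w for w in phrase.split()
                 if len(w) > 1 or w.lower() in keep]
        cleaned.append(" ".join(words))
    return cleaned


keywords = ['clothing items s', 'shoes s', 'handbag d', 'fashion k',
            'keep a', 's leading orphan']
print(drop_orphans(keywords))
```

Because it works on whole words, an orphan in the middle of a string is removed without gluing its neighbours together.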

You can use a set to decide which trailing single letters, preceded by a space, are invalid. If the string is longer than one character, the second-to-last character is a space, and the last character is in the rm set, slice the string to remove both characters; otherwise keep the string as is:
lst = ['clothing items s','clothing s','shoes s','handbag d','fashion k']
rm = set((" bcdefghjklnpqrstuvwzy"))
print([ch[:-2] if len(ch) > 1 and ch[-2].isspace() and ch[-1] in rm else ch
       for ch in lst])
['clothing items', 'clothing', 'shoes', 'handbag', 'fashion']
You can reverse the logic and list the letters that are valid instead:
lst = ['clothing items s','clothing s','shoes s','handbag d','fashion k']
st = set("ioa")
print([ch[:-2] if len(ch) > 1 and ch[-2].isspace() and ch[-1] not in st else ch
       for ch in lst])
You might also want to call str.lower on the strings as I and O should be capitalised when used by themselves.
You can also use rsplit and a loop. You just have to decide whether to keep only the valid single-letter words I, O and a, though that would not make your sentence grammatically correct either:
lst = ['clothing items s', 'clothing s', 'shoes s', 'handbag d', 'fashion k']
rm = set("bcdefghjklnpqrstuvwzy")
out = []
for s in lst:
    spl = s.rsplit(None,1)
    if spl[-1] not in rm:
        out.append(s)
    else:
        out.append(s[:-2])
print(out)
Or using a regex:
lst = ['clothing items s', 'clothing s', 'shoes s', 'handbag d', 'fashion k']
import re
r = re.compile(r"\s[bcdefghjklnpqrstuvwzy]$")
print([r.sub("", ele) for ele in lst])
['clothing items', 'clothing', 'shoes', 'handbag', 'fashion']
Even if you account for the possible one-letter words, you would still need to check whether the sentence is grammatically correct, and for that you would need something like nltk. You could add a lowercase i and o to the regex or to the set of letters to filter your data further, but only you can decide what is relevant. If you want a robust solution that leaves the sentence grammatically correct, there is a lot more work involved than simply removing all or certain single trailing letters from the end of the string.
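As a rough sketch of the "decide which single letters are valid" idea, the character class can be built from the set of letters you want to keep rather than typed out by hand. The names VALID_SINGLE, TRAILING_ORPHAN and strip_trailing_orphan are assumptions made for illustration, and treating a, i and o as the only valid one-letter words is a choice, not a rule.

```python
import re
import string

# Assumption: these are the only one-letter words worth keeping.
VALID_SINGLE = set("aio")

# Every other ASCII letter counts as an orphan when it trails the string.
orphans = "".join(sorted(set(string.ascii_lowercase) - VALID_SINGLE))
TRAILING_ORPHAN = re.compile(r"\s+[%s]$" % orphans, re.IGNORECASE)


def strip_trailing_orphan(text):
    """Drop one trailing orphan letter together with the whitespace before it."""
    return TRAILING_ORPHAN.sub("", text)


lst = ['clothing items s', 'clothing s', 'shoes s', 'handbag d', 'fashion k', 'keep a']
print([strip_trailing_orphan(s) for s in lst])
```

Unlike the slicing versions, this is safe on empty or one-character strings, because the pattern simply does not match them.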

Straightforward solution - it deletes single letter words starting from last element:
def trim(s):
    parts = s.split()
    while parts:
        if len(parts[-1]) == 1:
            del parts[-1]
        else:
            break
    return ' '.join(parts)
assert trim('clothing items s') == 'clothing items'
assert trim('fashion a b c') == 'fashion'
assert trim('stack overflow') == 'stack overflow'
assert trim('have a nice day') == 'have a nice day'
assert trim('a b c') == ''

Related

Python - Trying to replace words in a list of strings but having problems with single letter words

I have a list of strings such as
words = ['Twinkle Twinkle', 'How I wonder']
I am trying to create a function that will find and replace words in the original list and I was able to do that except for when the user inputs single letter words such as 'I' or 'a' etc.
current function
def sub(old: string, new: string, words: list):
    words[:] = [w.replace(old, new) for w in words]
if input for old = 'I'
and new = 'ASD'
current output = ['TwASDnkle TwASDnkle', 'How ASD wonder']
intended output = ['Twinkle Twinkle', 'How ASD wonder']
This is my first post here and I have only been learning python for a few months now so I would appreciate any help, thank you
Don't use str.replace in a loop. This often doesn't do what is expected as it doesn't work on words but on all matches.
Instead, split the words, replace on match and join:
l = ['Twinkle Twinkle', 'How I wonder']
def sub(old: str, new: str, words: list):
    words[:] = [' '.join(new if w==old else w for w in x.split()) for x in words]
sub('I', 'ASD', l)
Output: ['Twinkle Twinkle', 'How ASD wonder']
Or use a regex with word boundaries:
import re
def sub(old, new, words):
    words[:] = [re.sub(fr'\b{re.escape(old)}\b', new, w) for w in words]
l = ['Twinkle Twinkle', 'How I wonder']
sub('I', 'ASD', l)
# ['Twinkle Twinkle', 'How ASD wonder']
NB. As @re-za pointed out, it might be better practice to return a new list rather than mutate the input; just be aware of the difference.
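For comparison, here is a sketch of a non-mutating variant that returns a new list and leaves the input untouched. The name sub_words is hypothetical. The replacement is passed as a function so that backslashes in new are inserted literally instead of being read as group references. It assumes old starts and ends with a word character, since \b only matches at word boundaries.

```python
import re


def sub_words(old, new, words):
    """Return a new list with whole-word occurrences of `old` replaced by `new`."""
    pattern = re.compile(r"\b%s\b" % re.escape(old))
    # A callable replacement avoids escape processing of `new`.
    return [pattern.sub(lambda match: new, w) for w in words]


original = ['Twinkle Twinkle', 'How I wonder']
print(sub_words('I', 'ASD', original))
print(original)  # the input list is unchanged
```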
It seems like you are replacing letters and not words. I recommend splitting sentences (strings) into words by splitting strings by the ' ' (space char).
output = []
I would first get each string from the list like this:
    for string in words:
I would then split the strings into a list of words like this:
        temp_string = '' # a temp string we will use later to reconstruct the words
        for word in string.split(' '):
Then I would check to see if the word is the one we are looking for by comparing it to old, and replacing (if it matches) with new:
            if word == old:
                temp_string += new + ' '
            else:
                temp_string += word + ' '
Now that we have each word reconstructed or replaced (if needed) back into a temp_string we can put all the temp_strings back into the array like this:
        output.append(temp_string[:-1]) # [:-1] means we omit the space at the end
It should finally look like this:
def sub(old: str, new: str, words: list):
    output = []
    for string in words:
        temp_string = '' # a temp string we will use later to reconstruct the words
        for word in string.split(' '):
            if word == old:
                temp_string += new + ' '
            else:
                temp_string += word + ' '
        output.append(temp_string[:-1]) # [:-1] means we omit the space at the end
    return output

Replace a string with corresponding value from a list of strings

I have a list of strings and some text strings as follows:
my_list = ['en','di','fi','ope']
test_strings = ['you cannot enter', 'the wound is open', 'the house is clean']
I would like to replace the last words in the test_strings with their corresponding strings from the list above, if they appear in the list. i have written a for loop to capture the pattern, but do not know how to proceed with the replacement (?).
for entry in [entry for entry in test_strings]:
    if entry.split(' ', 1)[-1].startswith(tuple(my_list)):
        output = entry.replace(entry,?)
as output I would like to have:
output = ['you cannot en', 'the wound is ope', 'the house is clean']
for i,v in enumerate(test_strings):
    last_term = v.split(' ')[-1]
    for k in my_list:
        if last_term.startswith(k):
            test_strings[i] = v.replace(last_term,k)
            break
print(test_strings)
['you cannot en', 'the wound is ope', 'the house is clean']
my_list = ['en', 'di', 'fi', 'ope']
test_strings = ['you cannot enter', 'the wound is open', 'the house is clean']
for i in range(len(test_strings)):
    temp = test_strings[i].split(" ")
    last_word = temp[-1]
    for j in my_list:
        if last_word.startswith(j):
            break
    else:
        j = last_word
    temp[-1] = j
    test_strings[i] = " ".join(temp)
After this, test_strings will be what is required.
For replacing the text, the index of the list has been reassigned to the new text.
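A sketch of the same replacement that touches only the last word, using str.rpartition so that an earlier occurrence of the same word in the sentence is left alone. The function name shorten_last_word is an assumption made for illustration.

```python
def shorten_last_word(sentence, prefixes):
    """Replace the last word with the first prefix in `prefixes` it starts with."""
    # rpartition splits on the LAST space: (everything before, ' ', last word)
    head, sep, last = sentence.rpartition(" ")
    for prefix in prefixes:
        if last.startswith(prefix):
            return head + sep + prefix
    return sentence


my_list = ['en', 'di', 'fi', 'ope']
test_strings = ['you cannot enter', 'the wound is open', 'the house is clean']
print([shorten_last_word(s, my_list) for s in test_strings])
```

For a sentence such as 'open the open', only the final word is shortened, whereas a plain str.replace on the whole sentence would change both.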

How do I split a string with several delimiters, but only once on each delimiter? Python

I am trying to split a string such as the one below, with all of the delimiters below, but only once.
string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
delimiters = '\t \v ;'
The output, in this case, would be:
['it', ' seems; like', 'a good\tday to watch', 'a\vmovie.']
Obviously the example above is a nonsense example, but I am trying to learn whether or not this is possible. Would a fairly involved regex be in order?
Apologies if this question had been asked before. I did a fair bit of searching and could not find something quite like my example. Thanks for your time!
This should do the trick:
import re
def split_once_by(s, delims):
    delims = set(delims)
    parts = []
    while delims:
        delim_re = '({})'.format('|'.join(re.escape(d) for d in delims))
        result = re.split(delim_re, s, maxsplit=1)
        if len(result) == 3:
            first, delim, s = result
            parts.append(first)
            delims.remove(delim)
        else:
            break
    parts.append(s)
    return parts
Example:
>>> split_once_by('it; seems; like\ta good\tday to watch\va\vmovie.', '\t\v;')
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
Burning Alcohol's answer inspired me to write this (IMO) better function:
def split_once_by(s, delims):
    split_points = sorted((s.find(d), -len(d), d) for d in delims)
    start = 0
    for stop, _longest_first, d in split_points:
        if stop < start: continue
        yield s[start:stop]
        start = stop + len(d)
    yield s[start:]
with usage:
>>> list(split_once_by('it; seems; like\ta good\tday to watch\va\vmovie.', '\t\v;'))
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
A simple algorithm would do,
test_string = r'it; seems; like\ta good\tday to watch\va\vmovie.'
delimiters = [r'\t', r'\v', ';']
# find the index of each first occurence and sort it
delimiters = sorted(delimiters, key=lambda delimiter: test_string.find(delimiter))
splitted_string = [test_string]
# perform split with option maxsplit
for index, delimiter in enumerate(delimiters):
    if delimiter in splitted_string[-1]:
        splitted_string += splitted_string[-1].split(delimiter, maxsplit=1)
        splitted_string.pop(index)
print(splitted_string)
# ['it', ' seems; like', 'a good\\tday to watch', 'a\\vmovie.']
Just create a list of patterns and apply them once:
string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
patterns = ['\t', '\v', ';']
for pattern in patterns:
    string = '*****'.join(string.split(pattern, maxsplit=1))
print(string.split('*****'))
Output:
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
So, what is "*****" ??
On each iteration, when you apply the split method you get a list. So, in the next iteration, You can't apply the .split () method (because you have a list), so you have to join each value of that list with some weird character like "****" or "###" or "^^^^^^^" or whatever you want, in order to re-apply the split () in the next iteration.
Finally, for each "*****" on your string, you will have one pattern of the list, so you can use this to make a final split.
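The placeholder trick breaks if the placeholder itself (or a delimiter) can appear in the data. A possible sketch that avoids the placeholder keeps a list of parts and, for each delimiter, splits only the first part that contains it. The function name split_once_each is an assumption made for illustration.

```python
def split_once_each(text, delimiters):
    """Split `text` at the first occurrence of each delimiter, once per delimiter."""
    parts = [text]
    for delim in delimiters:
        for idx, part in enumerate(parts):
            if delim in part:
                # Replace this one part with its two halves, then stop looking.
                parts[idx:idx + 1] = part.split(delim, 1)
                break
    return parts


s = 'it; seems; like\ta good\tday to watch\va\vmovie.'
print(split_once_each(s, ['\t', '\v', ';']))
```

Since the parts stay in their original order, the first part containing a delimiter is also where that delimiter first occurs in the remaining text.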

Python dictionary replacement with space in key

I have a string and a dictionary, I have to replace every occurrence of the dict key in that text.
text = 'I have a smartphone and a Smart TV'
dict = {
    'smartphone': 'toy',
    'smart tv': 'junk'
}
If there are no spaces in the keys, I can break the text into words and compare them one by one with the dict, which looks like O(n). But now the keys contain spaces, so things are more complicated. Please suggest a good way to do this, and note that the keys may not match the case of the text.
Update
I have thought of this solution, but it is not efficient: O(m*n) or more...
for k,v in dict.iteritems():
    text = text.replace(k,v) #or regex...
If the keywords in the text are not adjacent to each other (keyword, other word, keyword), we can do the following, which looks like O(n) to me:
def dict_replace(dictionary, text, strip_chars=None, replace_func=None):
    """
        Replace word or word phrase in text with keyword in dictionary.
        Arguments:
            dictionary: dict with key:value, key should be in lower case
            text: string to replace
            strip_chars: string contain character to be strip out of each word
            replace_func: function if exist will transform final replacement.
                          Must have 2 params as key and value
        Return:
            string
        Example:
            my_dict = {
                "hello": "hallo",
                "hallo": "hello",    # Only one pass, don't worry
                "smart tv": "http://google.com?q=smart+tv"
            }
            dict_replace(my_dict, "hello google smart tv",
                         replace_func=lambda k,v: '[%s](%s)'%(k,v))
    """
    # First break word phrase in dictionary into single word
    dictionary = dictionary.copy()
    # iterate over a snapshot of the keys, because the dict grows inside the loop
    for key in list(dictionary.keys()):
        if ' ' in key:
            key_parts = key.split()
            for part in key_parts:
                # Mark single word with False
                if part not in dictionary:
                    dictionary[part] = False
    # Break text into words and compare one by one
    result = []
    words = text.split()
    words.append('')
    last_match = ''  # Last keyword (lower) match
    original = ''  # Last match in original
    for word in words:
        key_word = word.lower().strip(strip_chars) if \
            strip_chars is not None else word.lower()
        if key_word in dictionary:
            last_match = last_match + ' ' + key_word if \
                last_match != '' else key_word
            original = original + ' ' + word if \
                original != '' else word
        else:
            if last_match != '':
                # If match whole word
                if last_match in dictionary and dictionary[last_match] != False:
                    if replace_func is not None:
                        result.append(replace_func(original, dictionary[last_match]))
                    else:
                        result.append(dictionary[last_match])
                else:
                    # Only match partial of keyword
                    match_parts = last_match.split(' ')
                    match_original = original.split(' ')
                    for i in range(0, len(match_parts)):
                        if match_parts[i] in dictionary and \
                                dictionary[match_parts[i]] != False:
                            if replace_func is not None:
                                result.append(replace_func(match_original[i], dictionary[match_parts[i]]))
                            else:
                                result.append(dictionary[match_parts[i]])
            result.append(word)
            last_match = ''
            original = ''
    return ' '.join(result)
If your keys have no spaces:
output = [dct[i] if i in dct else i for i in text.split()]
' '.join(output)
You should use dct instead of dict so it doesn't shadow the built-in dict().
This makes use of a list comprehension and a conditional expression (Python's ternary operator) to substitute the words that appear as keys.
If your keys do have spaces, you are correct:
for k, v in dct.items():
    text = text.replace(k, v)
And yes, this time complexity will be m*n, as you have to iterate through the string every time for each key in dct.
Drop all dictionary keys and the input text to lower case, so the comparisons are easy. Now ...
for entry in my_dict:
    if entry in text:
        # process the match
This assumes that the dictionary is small enough to warrant the match. If, instead, the dictionary is large and the text is small, you'll need to take each word, then each 2-word phrase, and see whether they're in the dictionary.
Is that enough to get you going?
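One way the "each word, then each 2-word phrase" lookup might be sketched: walk the words once and, at each position, try the longest phrase first. The function name replace_by_ngrams and the max_words parameter are assumptions made for illustration. Keys are lowercased up front, and the text is split on whitespace, so punctuation attached to a word will prevent a match.

```python
def replace_by_ngrams(text, mapping, max_words=2):
    """Replace dictionary phrases of up to `max_words` words, longest match first."""
    lowered = {key.lower(): value for key, value in mapping.items()}
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n]).lower()
            if phrase in lowered:
                out.append(lowered[phrase])
                i += n
                break
        else:
            # No phrase starting here is a key: keep the word as it is.
            out.append(words[i])
            i += 1
    return " ".join(out)


my_dict = {'smartphone': 'toy', 'smart tv': 'junk'}
print(replace_by_ngrams('I have a smartphone and a Smart TV', my_dict))
```

Each position does at most max_words dictionary lookups, so the cost grows with the length of the text rather than with the size of the dictionary.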
You need to test all the neighbor permutations from 1 (each individual word) to len(text) (the entire string). You can generate the neighbor permutations this way:
text = 'I have a smartphone and a Smart TV'
array = text.lower().split()
key_permutations = [" ".join(array[j:j + i]) for i in range(1, len(array) + 1) for j in range(0, len(array) - (i - 1))]
>>> key_permutations
['i', 'have', 'a', 'smartphone', 'and', 'a', 'smart', 'tv', 'i have', 'have a', 'a smartphone', 'smartphone and', 'and a', 'a smart', 'smart tv', 'i have a', 'have a smartphone', 'a smartphone and', 'smartphone and a', 'and a smart', 'a smart tv', 'i have a smartphone', 'have a smartphone and', 'a smartphone and a', 'smartphone and a smart', 'and a smart tv', 'i have a smartphone and', 'have a smartphone and a', 'a smartphone and a smart', 'smartphone and a smart tv', 'i have a smartphone and a', 'have a smartphone and a smart', 'a smartphone and a smart tv', 'i have a smartphone and a smart', 'have a smartphone and a smart tv', 'i have a smartphone and a smart tv']
Now we substitute through the dictionary:
import re
for permutation in key_permutations:
    if permutation in dict:
        text = re.sub(re.escape(permutation), dict[permutation], text, flags=re.IGNORECASE)
>>> text
'I have a toy and a junk'
Though you'll likely want to try the permutations in the reverse order, longest first, so more specific phrases have precedence over individual words.
You can do this pretty easily with regular expressions.
import re
text = 'I have a smartphone and a Smart TV'
dict = {
    'smartphone': 'toy',
    'smart tv': 'junk'
}
for k, v in dict.items():
    regex = re.compile(re.escape(k), flags=re.I)
    text = regex.sub(v, text)
It still suffers from the problem of depending on processing order of the dict keys, if the replacement value for one item is part of the search term for another item.
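A sketch of one way around the ordering problem: compile all keys into a single alternation, longest key first, and substitute in one pass, so a replacement value is never scanned again. The function name replace_phrases is hypothetical. It assumes every key starts and ends with a word character (because of the \b anchors) and that words inside a key are separated by exactly one space in the text.

```python
import re


def replace_phrases(text, mapping):
    """Case-insensitive, single-pass replacement of all keys in `mapping`."""
    lowered = {key.lower(): value for key, value in mapping.items()}
    if not lowered:
        return text
    # Longest keys first, so 'smart tv' wins over a shorter key like 'smart'.
    keys = sorted(lowered, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(k) for k in keys),
        re.IGNORECASE,
    )
    # Look up the replacement for whatever text actually matched.
    return pattern.sub(lambda m: lowered[m.group(0).lower()], text)


my_dict = {'smartphone': 'toy', 'smart tv': 'junk'}
print(replace_phrases('I have a smartphone and a Smart TV', my_dict))
```

With a mapping such as {'hello': 'hallo', 'hallo': 'hello'}, the two words are swapped rather than both ending up the same, because each position in the text is replaced at most once.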

How to do CamelCase split in python

What I was trying to achieve, was something like this:
>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']
So I searched and found this perfect regular expression:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
As the next logical step I tried:
>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']
Why does this not work, and how do I achieve the result from the linked question in python?
Edit: Solution summary
I tested all provided solutions with a few test cases:
string: ''
AplusKminus: ['']
casimir_et_hippolyte: []
two_hundred_success: []
kalefranz: string index out of range # with modification: either [] or ['']
string: ' '
AplusKminus: [' ']
casimir_et_hippolyte: []
two_hundred_success: [' ']
kalefranz: [' ']
string: 'lower'
all algorithms: ['lower']
string: 'UPPER'
all algorithms: ['UPPER']
string: 'Initial'
all algorithms: ['Initial']
string: 'dromedaryCase'
AplusKminus: ['dromedary', 'Case']
casimir_et_hippolyte: ['dromedary', 'Case']
two_hundred_success: ['dromedary', 'Case']
kalefranz: ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']
string: 'CamelCase'
all algorithms: ['Camel', 'Case']
string: 'ABCWordDEF'
AplusKminus: ['ABC', 'Word', 'DEF']
casimir_et_hippolyte: ['ABC', 'Word', 'DEF']
two_hundred_success: ['ABC', 'Word', 'DEF']
kalefranz: ['ABCWord', 'DEF']
In summary, the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space, thereby violating the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string for an empty input, while the solution by @200_success returns an empty list.
I don't know how the python community stands on that issue, so I say: I am fine with either one. And since 200_success's solution is simpler, I accepted it as the correct answer.
As @AplusKminus has explained, re.split() never splits on an empty pattern match (this holds for Python versions before 3.7). Therefore, instead of splitting, you should try finding the components you are interested in.
Here is a solution using re.finditer() that emulates splitting:
from re import finditer

def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]
Use re.sub() and split()
import re
name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()
Result
'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
Most of the time, when you don't need to check the format of a string, a global search is simpler than a split (for the same result):
re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')
returns
['Camel', 'Case', 'XYZ']
To deal with dromedary too, you can use:
re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')
Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
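A small sketch comparing the two spellings of the lookahead on a few identifiers. The names EXPLICIT, COMPACT and samples are made up for illustration. The two forms agree on strings without a trailing newline; with one, $ can still match just before the newline while (?![^A-Z]) cannot.

```python
import re

# Lookahead written out: next character is uppercase, or end of string.
EXPLICIT = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')
# Double negation: NOT followed by a character that is NOT uppercase.
COMPACT = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])')

samples = ['camelCaseXYZ', 'XYZCamelCase', 'CamelCase', 'lower', 'UPPER']
for name in samples:
    assert EXPLICIT.findall(name) == COMPACT.findall(name)
    print(name, COMPACT.findall(name))
```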
Working solution, without regexp
I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.
Here is a quite straightforward solution in pure python:
def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]



And some tests
TESTS = [
    ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
    ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
    ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ("XYZCamelCaseXYZ", ['XYZ', 'Camel', 'Case', 'XYZ']),
    ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
    ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
    ("Ta", ['Ta']),
    ("aT", ['a', 'T']),
    ("a", ['a']),
    ("T", ['T']),
    ("", []),
]
def test():
    for (q,a) in TESTS:
        assert camel_case_split(q) == a
if __name__ == "__main__":
    test()
Edit: a solution which streams data in one pass
This solution leverages the fact that the decision to split word or not can be taken locally, just considering the current character and the previous one.
def camel_case_split(s):
    u = True  # case of previous char
    w = b = ''  # current word, buffer for last uppercase letter
    for c in s:
        o = c.isupper()
        if u and o:
            w += b
            b = c
        elif u and not o:
            if len(w)>0:
                yield w
            w = b + c
            b = ''
        elif not u and o:
            yield w
            w = ''
            b = c
        else:  # not u and not o:
            w += c
        u = o
    if len(w)>0 or len(b)>0:  # flush
        yield w + b
In theory it is faster and uses less memory. The same test suite applies, but the caller must build the list:
def test():
    for (q,a) in TESTS:
        r = list(camel_case_split(q))
        print(q,a,r)
        assert r == a
I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.
RE_WORDS = re.compile(r'''
# Find words in a string. Order matters!
[A-Z]+(?=[A-Z][a-z]) | # All upper case before a capitalized word
[A-Z]?[a-z]+ | # Capitalized words / all lower case
[A-Z]+ | # All upper case
\d+ # Numbers
''', re.VERBOSE)
The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:
assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
import re
re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result
# '(?<=[a-z])' --> means preceding lowercase char (group A)
# '(?=[A-Z])' --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on
The documentation for python's re.split says:
Note that split will never split a string on an empty pattern match.
When seeing this:
>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']
it becomes clear why the split does not work as expected. The re module finds empty matches, just as intended by the regular expression.
Since the documentation states that this is not a bug, but rather intended behavior, you have to work around that when trying to create a camel case split:
from re import finditer

def camel_case_split(identifier):
    matches = finditer('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
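For readers on Python 3.7 or later: re.split() was changed in that release to split on patterns that can match an empty string, so the workaround is no longer required there. A minimal sketch under that version assumption, using the regular expression from the question (the constant name CAMEL_BOUNDARY is made up for illustration):

```python
import re

# Zero-width boundaries: lower->Upper, or Upper followed by Upper+lower.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def camel_case_split(identifier):
    """Split on zero-width camel-case boundaries (needs Python 3.7+)."""
    return CAMEL_BOUNDARY.split(identifier)


print(camel_case_split("CamelCaseXYZ"))
print(camel_case_split("XYZCamelCase"))
```

On older interpreters the same call returns the input unsplit, which is the behaviour described in the question.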
This solution also supports numbers, spaces, and auto remove underscores:
def camel_terms(value):
    return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)
Some tests:
tests = [
    "XYZCamelCase",
    "CamelCaseXYZ",
    "Camel_CaseXYZ",
    "3DCamelCase",
    "Camel5Case",
    "Camel5Case5D",
    "Camel Case XYZ"
]
for test in tests:
    print(test, "=>", camel_terms(test))
results:
XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']
Simple solution:
re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
Here's another solution that requires less code and no complicated regular expressions:
def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]
Edit
The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like
def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
        return c1.islower() and c2.isupper()
    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
I know that the question added the tag of regex. But still, I always try to stay as far away from regex as possible. So, here is my solution without regex:
from functools import reduce  # reduce is no longer a builtin in Python 3

def split_camel(text, char):
    if len(text) <= 1:  # To avoid adding a wrong space in the beginning
        return text+char
    if char.isupper() and text[-1].islower():  # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else:  # Do nothing part
        return text + char
text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints "['Path', 'URL', 'Finder']"
EDIT:
As suggested, here is the code to put the functionality in a single function.
from functools import reduce  # reduce is no longer a builtin in Python 3

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1:  # To avoid adding a wrong space in the beginning
            return text+char
        if char.isupper() and text[-1].islower():  # Regular Camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else:  # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")
split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']
Putting a more comprehensive approach out there. It takes care of several issues such as numbers, strings starting with a lower-case letter, single-letter words, etc.
def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)
    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)
        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words
    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words
My requirement was a bit more specific than the OP. In particular, in addition to handling all OP cases, I needed the following which the other solutions do not provide:
- treat all non-alphanumeric input (e.g. !@#$%^&*() etc) as a word separator
- handle digits as follows:
  - cannot be in the middle of a word
  - cannot be at the beginning of the word unless the phrase starts with a digit
def splitWords(s):
    new_s = re.sub(r'[^a-zA-Z0-9]', ' ',  # not alphanumeric
        re.sub(r'([0-9]+)([^0-9])', '\\1 \\2',  # digit followed by non-digit
            re.sub(r'([a-z])([A-Z])','\\1 \\2',  # lower case followed by upper case
                re.sub(r'([A-Z])([A-Z][a-z])', '\\1 \\2',  # upper case followed by upper case followed by lower case
                    s
                )
            )
        )
    )
    return [x for x in new_s.split(' ') if x]
Output:
for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))
:[]
:[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']
Maybe this will be enough for some people:
a = "SomeCamelTextUpper"
def camelText(val):
    return ''.join([' ' + i if i.isupper() else i for i in val]).strip()
print(camelText(a))
It doesn't work with input of the type "CamelXYZ", but for a 'typical' CamelCase scenario it should work just fine.
I think the version below is the optimum:
import re
def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))
print(count_word())
