How to split strings with special characters without removing those characters?

I'm writing a function that needs to return an abbreviated version of a str. Each abbreviated word must consist of its first letter, the number of characters removed, and its last letter. The abbreviation must be applied per word, not per sentence, and afterwards I need to join the words again in the same format, including the special characters. I tried re.findall(), but it drops the special characters, so " ".join() produces a result without them.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!

No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = r"\b[a-z]([a-z]{2,})[a-z]\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
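As a rough sketch of how the limitations mentioned above might be handled, the pattern below treats any run of ASCII letters and apostrophes as one word, so uppercase words and contractions such as "don't" are abbreviated as a unit. The names WORD, shorten and abbreviate, and the decision to count the apostrophe as an ordinary character, are assumptions made for illustration and are not part of the original answer.

```python
import re

# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD = re.compile(r"[A-Za-z']+")

def abbreviate(text):
    def shorten(match):
        word = match.group(0)
        # Words shorter than 4 characters are left untouched, as in the question.
        if len(word) < 4:
            return word
        # First character, number of inner characters, last character.
        return f"{word[0]}{len(word) - 2}{word[-1]}"
    # Everything that is not part of a word (spaces, hyphens, "!") stays in place.
    return WORD.sub(shorten, text)

print(abbreviate("elephant-rides are really fun!"))  # e6t-r3s are r4y fun!
print(abbreviate("Don't stop"))                      # D3t s2p
```

Whether "don't" should become d3t or be handled some other way depends on your requirements; adjust the character class accordingly.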

Thank you for viewing my question. After some more searching and trial and error, I finally found a way to make my code work without changing it too much. I simply replaced re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and removed the space in " ".join(), because the split result now keeps the delimiters (including the spaces) and no extra separator is needed.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
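The reason this works is that re.split keeps the delimiters in its result when the pattern contains a capturing group. A small sketch of the difference, using a made-up sample string and the variable names dropped and kept purely for illustration:

```python
import re

text = "elephant-rides are fun!"

# Without a capturing group the delimiters are discarded.
dropped = re.split(r"[\W\d_]", text)

# With a capturing group the delimiters are returned as separate items.
kept = re.split(r"([\W\d_])", text)

print(dropped)  # ['elephant', 'rides', 'are', 'fun', '']
print(kept)     # ['elephant', '-', 'rides', ' ', 'are', ' ', 'fun', '!', '']

# Joining the kept pieces with an empty string reproduces the input exactly.
print("".join(kept) == text)  # True
```

Note the trailing empty string: it appears because the text ends with a delimiter, and it is harmless when the pieces are joined back together.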

Related

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But it gives the wrong output for the first string. I also tried another one (here):
(.)(.*?)*?\1
But this one gives the wrong output as well. What's going wrong here?
The Python code is a single print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string

It can be solved without a regular expression, as shown below (here s1 is 'abca' and s is 'abbcabb'):
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
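The same loop can be sketched as a reusable function. This version uses re.subn, which returns the new string together with the number of substitutions made, so the loop can stop as soon as a pass changes nothing. The names PAIR and remove_pairs are made up for this example.

```python
import re

# One character, the shortest possible middle part, then the same character again.
PAIR = re.compile(r"(.)(.*?)\1")

def remove_pairs(s):
    while True:
        # subn returns (new_string, number_of_substitutions).
        s, count = PAIR.subn(r"\2", s)
        if count == 0:
            return s

print(remove_pairs("abbcabb"))  # c
print(remove_pairs("abca"))     # bc
```

Each substitution removes two characters, so the loop always terminates.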

Regular expressions don't seem to be the ideal solution here:
they don't handle overlapping matches, so a loop is needed (like in this answer), and each pass creates a new string, so performance suffers;
they're overkill, since we just need to count the characters.
I like this answer, but calling count repeatedly in a list comprehension loops over the whole string for every character.
The problem can be solved without regular expressions and in O(n) rather than O(n**2) time using collections.Counter:
first count the characters of the string (quick and easy),
then filter the string, keeping the characters whose count matches, using the counter we just created.
Like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because the first group is not greedy, our "subpattern A" can just match the first a in the string (it does appear again later, after all). If we use a greedy match instead, Python will backtrack until it finds the longest possible subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything sandwiched in the middle, until that same character is encountered again.
So, for abbcabb, the "sandwiched" portion is bbc, between the two a characters.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of a character, not the first.
For the "first" occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts, e.g. count[letter] % 2 == 1.
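A possible sketch of that counter-based variant, assuming "first occurrence" means that each character with an odd count is kept once, at the position where it first appears. The function name odd_letters is made up for this example.

```python
from collections import Counter

def odd_letters(s):
    counts = Counter(s)   # one pass to count every character
    seen = set()
    result = []
    for ch in s:
        # Keep a character only if its total count is odd,
        # and only the first time it is encountered.
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return "".join(result)

print(odd_letters("abbcabb"))  # c
print(odd_letters("abca"))     # bc
print(odd_letters("abaa"))     # ab
```

For "abaa" the list-based loop above yields ba instead, which illustrates the first/last difference.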

Python: strip function definition using regex

I am a complete beginner at programming and am reading the book "Automate the Boring Stuff with Python". In Chapter 7 there is a practice project: a regex version of strip(). My code below does not work (I use Python 3.6.1). Could anyone help?
import re
string = input("Enter a string to strip: ")
strip_chars = input("Enter the characters you want to be stripped: ")
def strip_fn(string, strip_chars):
    if strip_chars == '':
        blank_start_end_regex = re.compile(r'^(\s)+|(\s)+$')
        stripped_string = blank_start_end_regex.sub('', string)
        print(stripped_string)
    else:
        strip_chars_start_end_regex = re.compile(r'^(strip_chars)*|(strip_chars)*$')
        stripped_string = strip_chars_start_end_regex.sub('', string)
        print(stripped_string)

You can also use re.sub to remove the characters at the start or end.
Say the character is 'x':
re.sub(r'^x+', "", string)
re.sub(r'x+$', "", string)
The first line acts as lstrip and the second as rstrip.
This just looks simpler.

When you use the r'^(strip_chars)*|(strip_chars)*$' string literal, strip_chars is not interpolated, i.e. it is treated as literal text in the pattern. You need to pass it to the regex as a variable. However, just passing it in its current form would result in a "corrupt" regex, because (...) in a regex is a grouping construct, while you want to match a single char from the defined set of chars stored in the strip_chars variable.
You could just wrap the string with a pair of [ and ] to create a character class, but if the variable contains, say z-a, it would make the resulting pattern invalid. You also need to escape each char to play it safe.
Replace
r'^(strip_chars)*|(strip_chars)*$'
with
r'^[{0}]+|[{0}]+$'.format("".join([re.escape(x) for x in strip_chars]))
I advise replacing the * quantifier (zero or more occurrences) with + (one or more occurrences) because in most cases, when we want to remove something, we need to match at least one occurrence of the unwanted string(s).
Also, you may replace r'^(\s)+|(\s)+$' with r'^\s+|\s+$', since the repeated capturing groups keep rewriting the group values on each iteration, which slightly hampers the regex execution.
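Putting these suggestions together, a complete function might look roughly like the sketch below. The name regex_strip and the choice to return the result instead of printing it are assumptions made for this example, not the asker's original design.

```python
import re

def regex_strip(text, strip_chars=""):
    if not strip_chars:
        # No characters given: behave like str.strip() and remove whitespace.
        pattern = r"^\s+|\s+$"
    else:
        # Escape every character so that "-", "^" or "]" are taken literally
        # inside the character class.
        char_class = "[{}]".format("".join(re.escape(c) for c in strip_chars))
        pattern = r"^{0}+|{0}+$".format(char_class)
    return re.sub(pattern, "", text)

print(regex_strip("   hello   "))         # hello
print(regex_strip("xxhixyx", "x"))        # hixy
print(regex_strip("z-a-hello-a-z", "z-a"))  # hello
```

The last call shows why escaping matters: without it, z-a inside the brackets would be read as an invalid character range.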

#! python
# Regex version of strip()
import re
def RegexStrip(mainString, charsToBeRemoved=None):
    if charsToBeRemoved:
        # Escape the characters so they are safe inside a character class,
        # and match them only at the start or the end of the string.
        chars = re.escape(charsToBeRemoved)
        regex = re.compile(r'^[%s]+|[%s]+$' % (chars, chars))
    else:
        # Default: strip leading and trailing whitespace.
        regex = re.compile(r'^\s+|\s+$')
    return regex.sub('', mainString)
Str = ' hello3123my43name is antony '
print(RegexStrip(Str))
Maybe this could help, it can be further simplified of course.

How to delete shortest matched pattern from the end of string by re module of python?

I am converting my bash code to Python.
I would like to write a function with the same functionality as ${variable%pattern} in bash, which deletes the shortest match of the pattern from the end of the string.
For example, I expect delete_tail('_usr_home_you_file.ext.tar.oz', r'\.') to return '_usr_home_you_file.ext.tar'.
I wrote the Python function below:
import re
def delete_tail(word,pattern):
    return re.sub('{0}.*?$'.format(pattern), '', word)
However, it deletes the longest match, as shown below.
word='_usr_home_you_file.ext.tar.oz'
delete_shortest_match_tail=delete_tail(word,r'\.')
print("word = {0}".format(word))
print("delete_shortest_match_tail = {0}".format(delete_shortest_match_tail))
Output:
word = _usr_home_you_file.ext.tar.oz
delete_shortest_match_tail = _usr_home_you_file
How can I write a function that deletes the shortest match from the end of the string, as described above?
Thank you very much.

You want to search for the part of the string in front of the pattern, rather than for the pattern itself in order to replace it. A regex scans from left to right and reports matches in that order, and we can't simply reverse the string, because that would mess up the regex pattern. So instead of using sub, note that replacing something with an empty string is the same as deleting it, which is the same as keeping the rest of the string. That is what this solution does: it searches for the result and simply omits the part you don't want.
import re
def removeFromEnd(pattern, target):
    # The greedy (.*) captures everything before the last match of pattern.
    m = re.match("(.*)" + pattern + ".*$", target)
    if m:
        return m.group(1)
    else:
        return target
>>> removeFromEnd(r"\.", "foo.tar.gz")
'foo.tar'
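Applied to the example from the question, the same idea could be sketched as follows. This is only an approximation of bash's ${variable%pattern}: here pattern is assumed to be a regular expression marking where the removed tail starts, not a shell glob, and the flags=re.DOTALL argument is an addition so that strings containing newlines are handled too.

```python
import re

def delete_tail(word, pattern):
    # Greedy (.*) grabs as much as possible, so the pattern is matched at its
    # last occurrence and only the shortest tail is dropped.
    m = re.fullmatch(r"(.*)(?:{}).*".format(pattern), word, flags=re.DOTALL)
    return m.group(1) if m else word

print(delete_tail("_usr_home_you_file.ext.tar.oz", r"\."))  # _usr_home_you_file.ext.tar
print(delete_tail("no_extension", r"\."))                   # no_extension
```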

I want to split a string by a character on its first occurence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char") != -1:
        pass  # do xyz with some_char
    elif string.find("another_char") != -1:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more Pythonic way of doing this? Or would the above procedure be fine? Please help, I'm not very proficient in Python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.

I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
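A minimal sketch of the maxsplit=1 approach described above, wrapped in a helper. The name split_first and its delimiters parameter are hypothetical, and returning False when nothing matches simply mirrors the function in the question.

```python
import re

def split_first(text, delimiters=",*;/"):
    # re.escape keeps characters such as "*" or "-" literal inside the class.
    pattern = "[{}]".format(re.escape(delimiters))
    # maxsplit=1 splits only at whichever delimiter appears first in the text.
    parts = re.split(pattern, text, maxsplit=1)
    return parts if len(parts) == 2 else False

print(split_first("a*b,c"))    # ['a', 'b,c']
print(split_first("a,b*c,d"))  # ['a', 'b*c,d']
print(split_first("abc"))      # False
```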
If you are doing the same split many times, you can compile the regex once and reuse the compiled expression, which is a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string, maxsplit=1)
If this speed-up is important (it probably isn't), you can wrap the compiled expression in a function instead of recompiling it every time. The following function builds a separate function that stores the compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple consecutive delimiters as one, if that is desired (thank you Jon Clements). If it is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that this matters even with maxsplit=1: with the +, a run such as ",," is consumed by the single split instead of leaving the second delimiter at the start of the remainder.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.

How do you filter a string to only contain letters?

How do I make a function that filters out all the non-letters from a string? For example, letters("jajk24me") should return "jajkme". It needs to use a for loop. Will the string.isalpha() function help me with this?
My attempt:
def letters(input):
    valids = []
    for character in input:
        if character in letters:
            valids.append( character)
    return (valids)

If it needs to be in that for loop, and a regular expression won't do, then this small modification of your loop will work:
def letters(input):
    valids = []
    for character in input:
        if character.isalpha():
            valids.append(character)
    return ''.join(valids)
(The ''.join(valids) at the end takes all of the characters that you have collected in a list, and joins them together into a string. Your original function returned that list of characters instead)
You can also filter out characters from a string:
def letters(input):
    return ''.join(filter(str.isalpha, input))
or with a list comprehension:
def letters(input):
    return ''.join([c for c in input if c.isalpha()])
or you could use a regular expression, as others have suggested.

import re
valids = re.sub(r"[^A-Za-z]+", '', my_string)
EDIT: If it needs to be a for loop, something like this should work:
output = ''
for character in input:
    if character.isalpha():
        output += character

See re.sub; for performance, consider using re.compile to compile the pattern once.
Below is a short version that matches all characters not in the range A to Z and replaces them with the empty string. The re.I flag makes the match case-insensitive, so lowercase (a-z) characters are kept as well.
import re
def charFilter(myString):
    return re.sub('[^A-Z]+', '', myString, flags=re.I)
If you really need that loop, there are many answers explaining that specifically. However, you might want to give a reason why you need a loop.
If you want to operate on the number sequences and that's the reason for the loop, consider replacing the replacement string parameter with a function like:
import re
def numberPrinter(match):
    # Called once per match; match.group(0) is the removed sequence.
    print(match.group(0))
    return ''
def charFilter(myString):
    return re.sub('[^A-Z]+', numberPrinter, myString, flags=re.I)
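If the removed sequences are needed later rather than just printed, the replacement function can collect them instead. A small sketch along those lines; filter_and_collect and collect are hypothetical names, and the pattern is written with an explicit A-Za-z range instead of the re.I flag.

```python
import re

def filter_and_collect(text):
    removed = []

    def collect(match):
        # Remember each run of non-letters, then replace it with nothing.
        removed.append(match.group(0))
        return ""

    cleaned = re.sub(r"[^A-Za-z]+", collect, text)
    return cleaned, removed

print(filter_and_collect("jajk24me!"))  # ('jajkme', ['24', '!'])
```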

The method string.isalpha() checks whether string consists of alphabetic characters only. You can use it to check if any modification is needed.
As to the other part of the question, pst is just right. You can read about regular expressions in the python doc: http://docs.python.org/library/re.html
They might seem daunting but are really useful once you get the hang of them.
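One thing worth knowing when choosing between the two approaches: str.isalpha() is Unicode-aware, while a pattern such as [^A-Za-z] only keeps ASCII letters. The sketch below shows the difference; the function names letters_isalpha and letters_ascii are made up for this example.

```python
import re

def letters_isalpha(text):
    # Keeps every character that Unicode classifies as a letter.
    return "".join(ch for ch in text if ch.isalpha())

def letters_ascii(text):
    # Keeps only the ASCII letters A-Z and a-z.
    return re.sub(r"[^A-Za-z]+", "", text)

print("jajkme".isalpha())    # True
print("jajk24me".isalpha())  # False
print("".isalpha())          # False (an empty string is never alphabetic)

sample = "caf\u00e924"          # "café24", written with an escape for clarity
print(letters_isalpha(sample))  # café
print(letters_ascii(sample))    # caf
```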

Of course you can use isalpha. Also, valids can be a string.
Here you go:
def letters(input):
    valids = ""
    for character in input:
        if character.isalpha():
            valids += character
    return valids

Not using a for-loop. But that's already been thoroughly covered.
Might be a little late, and I'm not sure about performance, but I just thought of this solution which seems pretty nifty:
set(x).intersection(y)
You could use it like:
from string import ascii_letters
def letters(string):
    return ''.join(set(string).intersection(ascii_letters))
NOTE:
This will not preserve the original order, and repeated letters are collapsed into a single occurrence. In my use case that is fine, but be warned.
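If order and repeated letters do matter, the set can still be used for fast membership tests while the string itself is traversed in order. A sketch under that assumption (ALLOWED is a made-up name, and only ASCII letters are kept):

```python
from string import ascii_letters

# A frozenset gives constant-time membership checks.
ALLOWED = frozenset(ascii_letters)

def letters(text):
    # Walk the string in order, so both order and duplicates are preserved.
    return "".join(ch for ch in text if ch in ALLOWED)

print(letters("jajk24me"))  # jajkme
```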
