Python - regex, blank element at the end of the list? - python

I have this code:
print(re.split(r"[\s\?\!\,\;]+", "Holy moly, feferoni!"))
which results in:
['Holy', 'moly', 'feferoni', '']
What causes this last blank element, and how can I get rid of it?
If this is a dirty way to strip punctuation and spaces from a string, how else could I write it, other than with a regex?

Expanding on what @HamZa said in his comment, you would use re.findall and a negated character set:
>>> from re import findall
>>> findall(r"[^\s?!,;]+", "Holy moly, feferoni!")
['Holy', 'moly', 'feferoni']

You get the empty string as the last element of your list because the regex splits at the final !. re.split gives you what comes before the ! and what comes after it, and after it there is simply nothing, i.e. an empty string. You would have the same problem in the middle of the string if you hadn't wisely added the + to your regex.
If you want to elegantly get rid of the optional empty string, do:
list(filter(None, re.split(r"[\s?!,;]+", "Holy moly, feferoni!")))
This will result in:
['Holy', 'moly', 'feferoni']
On Python 3, filter returns an iterator, hence the call to list; drop it if you can work with an iterator (on Python 2, filter already returns a list).
What this does is remove every element that is not truthy. The filter function generally keeps only the elements that satisfy a requirement given as a function, but if you pass None it checks the truth value of each element itself. Because an empty string is falsy and every other string is truthy, it removes every empty string from the list.
Also note that I removed the escaping of special characters in the character class, as it is simply not necessary and just makes the regex harder to read.
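As a minimal sketch of the behaviour described in this answer, the snippet below shows the trailing empty string produced by re.split and two equivalent ways of dropping it. The variable names are illustrative only and do not come from the question.

```python
import re

text = "Holy moly, feferoni!"
pattern = r"[\s?!,;]+"

# re.split keeps the (empty) piece that follows the final "!".
parts = re.split(pattern, text)

# Option 1: filter(None, ...) drops falsy items such as ''.
# list() is needed on Python 3, where filter returns an iterator.
filtered = list(filter(None, parts))

# Option 2: the equivalent list comprehension.
comprehended = [p for p in parts if p]

print(parts)         # ['Holy', 'moly', 'feferoni', '']
print(filtered)      # ['Holy', 'moly', 'feferoni']
print(comprehended)  # ['Holy', 'moly', 'feferoni']
```

Both options also remove a leading empty string, which appears when the input starts with a delimiter.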

The first thing that comes to mind is something like this:
>>> import re
>>> mystring = re.split(r"[\s\?\!\,\;]+", "Holy moly, feferoni!")
>>> mystring
['Holy', 'moly', 'feferoni', '']
>>> mystring.pop()
''
>>> print(mystring)
['Holy', 'moly', 'feferoni']
Note that this assumes the string ends with a delimiter; otherwise pop() removes a real word.

__import__('re').findall(r'[^\s?!,;]+', 'Holy moly, feferoni!')

Related

Regex to get all occurrences of a pattern followed by a value in a comma separate string

This is in python
Input string:
Str = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
Expected output
'FU=_COG-GAB-CANE_,FU=FARE,FU=#-_MAP.com_'
Here 'FU=' is the occurrence we are looking for, together with the value that follows it.
I want to return all occurrences of FU= (with their associated values) from a comma-separated string. They can occur anywhere within the string, and special characters are allowed.
Here is one approach.
>>> import re
>>> str_ = 'Y=DAT,X=ZANG,FU=FAT,T=TART,FU=GEM,RO=TOP,FU=MAP,Z=TRY'
>>> re.findall.__doc__[:58]
'Return a list of all non-overlapping matches in the string'
>>> re.findall(r'FU=\w+', str_)
['FU=FAT', 'FU=GEM', 'FU=MAP']
>>> ','.join(re.findall(r'FU=\w+', str_))
'FU=FAT,FU=GEM,FU=MAP'
Got it working
Python Code
import re
str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
str2='FU='+',FU='.join(re.findall(r'FU=(.*?),', str_))
print(str2)
Gives the desired output:
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
This gives me all the occurrences of FU= followed by their values, irrespective of order and number of special characters.
It is a bit unclean, though: I am manually adding FU= for the first occurrence, and the pattern only matches a value that is followed by a comma, so an FU= entry at the very end of the string would be missed.
Is there a cleaner way of doing it?
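One possible cleaner variant, offered as a sketch rather than the definitive answer: either filter the comma-separated fields by prefix, or let the regex consume everything up to the next comma so that FU= stays part of the match. Both assume that values never contain a comma. The names by_split and by_regex are illustrative.

```python
import re

str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'

# Variant 1: split into fields and keep those whose key is exactly "FU".
by_split = ','.join(
    field for field in str_.split(',') if field.startswith('FU=')
)

# Variant 2: match "FU=" plus every character up to (not including) the
# next comma. Unlike r'FU=(.*?),' this also works for the last field.
# Caveat: it would also match inside a longer key such as "SNAFU=".
by_regex = ','.join(re.findall(r'FU=[^,]*', str_))

print(by_split)  # FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-
print(by_regex)  # FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-
```

The split-based variant is stricter about the key name, which may matter if other keys can end in FU.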

Python 3 split()

When I'm splitting a string "abac" I'm getting undesired results.
Example
print("abac".split("a"))
Why does it print:
['', 'b', 'c']
instead of
['b', 'c']
Can anyone explain this behavior and guide me on how to get my desired output?
Thanks in advance.
As @DeepSpace pointed out (referring to the docs):
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
Therefore I'd suggest using a better delimiter such as a comma. If this is the formatting you're stuck with, you could use the built-in filter() function as suggested in this answer; when passed None as the function, it removes any "empty" strings. On Python 3, filter returns an iterator, so wrap it in list() before printing:
sample = 'abac'
filtered_sample = list(filter(None, sample.split('a')))
print(filtered_sample)
# ['b', 'c']
When you split a string in Python, you keep everything between your delimiters (even when it's an empty string!)
For example, if you had a list of letters separated by commas:
>>> "a,b,c,d".split(',')
['a', 'b', 'c', 'd']
If your list had some missing values you might leave the space in between the commas blank:
>>> "a,b,,d".split(',')
['a', 'b', '', 'd']
The start and end of the string act as delimiters themselves, so if you have a leading or trailing delimiter you will also get this "empty string" sliced out of your main string:
>>> "a,b,c,d,,".split(',')
['a', 'b', 'c', 'd', '', '']
>>> ",a,b,c,d".split(',')
['', 'a', 'b', 'c', 'd']
If you want to get rid of any empty strings in your output, you can use the filter function.
If instead you just want to get rid of this behavior near the edges of your main string, you can strip the delimiters off first:
>>> ",,a,b,c,d".strip(',')
'a,b,c,d'
>>> ",,a,b,c,d".strip(',').split(',')
['a', 'b', 'c', 'd']
In your example, "a" is what's called a delimiter. It acts as a boundary between the characters before it and after it. So, when you call split, it gets the characters before "a" and after "a" and inserts it into the list. Since there's nothing in front of the first "a" in the string "abac", it returns an empty string and inserts it into the list.
split will return the characters between the delimiters you specify (or between an end of the string and a delimiter), even if there aren't any, in which case it will return an empty string. (See the documentation for more information.)
In this case, if you don't want any empty strings in the output, you can use filter to remove them:
list(filter(lambda s: len(s) > 0, "abac".split("a")))
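To tie the answers above together, here is a small sketch that contrasts the raw split result with the two clean-up strategies mentioned (filtering out empty strings, or stripping the delimiter from the edges first). It also shows the whitespace case, where calling split() without a separator groups consecutive delimiters. Variable names are illustrative.

```python
sample = "abac"

# Raw result: the leading "a" has nothing before it, hence the ''.
pieces = sample.split("a")

# Strategy 1: drop every empty string, wherever it occurs.
non_empty = [s for s in pieces if s]

# Strategy 2: strip delimiters from both edges first, then split.
# Empty strings caused by doubled delimiters in the middle would remain.
trimmed = sample.strip("a").split("a")

# With an explicit separator, consecutive delimiters yield empty strings;
# with no separator, runs of whitespace are treated as one delimiter.
explicit = " a  b ".split(" ")
implicit = " a  b ".split()

print(pieces)     # ['', 'b', 'c']
print(non_empty)  # ['b', 'c']
print(trimmed)    # ['b', 'c']
print(explicit)   # ['', 'a', '', 'b', '']
print(implicit)   # ['a', 'b']
```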

Python - Comma in string causes issue with strip

I have strings as tuples that I'm trying to remove quotation marks from. If there isn't a comma present in the string, then it works. But if there is a comma, then quotation marks still remain:
example = [('7-30-17','0x34','"Upload Complete"'),('7-31-17','0x35','"RCM","Interlock error"')]
example = [(x,y,(z.strip('"')))
for x,y,z in example]
The result is that quotation marks partially remain in the strings that had commas in them. The second tuple now reads RCM","Interlock error as opposed to RCM, Interlock error
('7-30-17','0x34','Upload Complete')
('7-31-17','0x35','RCM","Interlock error')
Any ideas what I'm doing wrong? Thanks!
You can use list comprehension to iterate the list items and similarly for the inner tuple items
>>> [tuple(s.replace('"','') for s in tup) for tup in example]
[('7-30-17', '0x34', 'Upload Complete'), ('7-31-17', '0x35', 'RCM,Interlock error')]
It seems like you're looking for the behaviour of replace(), rather than strip().
Try using replace('"', '') instead of strip('"'). strip only removes characters from the beginning and end of strings, while replace will take care of all occurrences.
Your example would be updated to look like this:
example = [('7-30-17','0x34','"Upload Complete"'),('7-31-17','0x35','"RCM","Interlock error"')]
example = [(x,y,(z.replace('"', '')))
for x,y,z in example]
example ends up with the following value:
[('7-30-17', '0x34', 'Upload Complete'), ('7-31-17', '0x35', 'RCM,Interlock error')]
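The difference between strip and replace can be seen on the problematic value alone. The sketch below also goes one step beyond the answers here and, as an assumption about the data, treats the value as a line of quoted, comma-separated fields that the standard csv module can parse; that yields the "RCM, Interlock error" form the question asks for.

```python
import csv

value = '"RCM","Interlock error"'

# strip only removes the quote characters at the two ends.
stripped = value.strip('"')

# replace removes every occurrence, including the inner ones.
replaced = value.replace('"', '')

# Assumption: the value is CSV-style quoted data, so csv.reader can
# split it into its fields, which are then joined with ", ".
fields = next(csv.reader([value]))
joined = ', '.join(fields)

print(stripped)  # RCM","Interlock error
print(replaced)  # RCM,Interlock error
print(joined)    # RCM, Interlock error
```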
The problem is that strip removes characters only from the ends of the string.
Use a regex to replace every ":
import re
example = [('7-30-17','0x34','"Upload Complete"'),('7-31-17','0x35','"RCM","Interlock error"')]
example = [(x,y,(re.sub('"','',z)))
for x,y,z in example]
print(example)
# [('7-30-17', '0x34', 'Upload Complete'), ('7-31-17', '0x35', 'RCM,Interlock error')]

Python: list strip overkill

I just want to remove the '.SI' from each item in the list, but strip overkills by also removing any leading or trailing S or I characters.
ab = ['abc.SI','SIV.SI','ggS.SI']
[x.strip('.SI') for x in ab]
>> ['abc','V','gg']
output which I want is
>> ['abc','SIV','ggS']
Is there an elegant way to do it? I'd prefer not to use a for loop, as my list is long.
Why strip? You can use .replace():
[x.replace('.SI', '') for x in ab]
Output:
['abc', 'SIV', 'ggS']
(this will remove .SI anywhere, have a look at other answers if you want to remove it only at the end)
The reason strip() doesn't work is explained in the docs:
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped
So it will strip any character in the string that you pass as an argument.
If you want to remove the substring only from the end, the correct way to achieve this will be:
>>> ab = ['abc.SI','SIV.SI','ggS.SI']
>>> sub_string = '.SI'
>>> # slice off the substring only when it is present at the end
>>> [s[:-len(sub_string)] if s.endswith(sub_string) else s for s in ab]
['abc', 'SIV', 'ggS']
This is safer than str.replace() (as mentioned in TrakJohnson's answer), which removes the substring even if it is in the middle of the string. For example:
>>> 'ab.SIrt'.replace('.SI', '')
'abrt'
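The contrast between the three approaches can be sketched with a small helper. The function name remove_suffix is hypothetical and only used for illustration; on Python 3.9 and later the built-in str.removesuffix method does the same job.

```python
def remove_suffix(text, suffix):
    """Return text without suffix if it ends with it, else text unchanged."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


ab = ['abc.SI', 'SIV.SI', 'ggS.SI']

# strip treats '.SI' as a set of characters {'.', 'S', 'I'} and removes
# any of them from both ends, which eats into the names themselves.
stripped = [x.strip('.SI') for x in ab]

# The helper removes the exact suffix, and only at the end.
cleaned = [remove_suffix(x, '.SI') for x in ab]

# replace would also touch an occurrence in the middle; the helper does not.
middle = remove_suffix('ab.SIrt', '.SI')

print(stripped)  # ['abc', 'V', 'gg']
print(cleaned)   # ['abc', 'SIV', 'ggS']
print(middle)    # ab.SIrt
```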
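Use this: [x[:-3] for x in ab]. Note that it drops the last three characters unconditionally, so it is only correct if every item ends with .SI.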
Use split instead of strip and get the first element:
[x.split('.SI')[0] for x in ab]

Getting rid of certain characters in a string in python

I have characters in the middle of a string that I want to get rid of. These characters are =, p, a space, and H. Since they are not the leftmost or rightmost characters in the string, I cannot use strip(). Is there a function that removes a certain character at any location in a string?
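The usual tool for this job is str.translate (shown here with the Python 2 signature, where the second argument lists the characters to delete):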
https://docs.python.org/2/library/stdtypes.html#str.translate
>>> 'hello=potato'.translate(None, '=p')
'hellootato'
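The call above uses the Python 2 form of translate. On Python 3, str.translate takes a single translation table, which can be built with str.maketrans; its third argument lists the characters to delete. A minimal sketch of the same example:

```python
# Build a table that maps nothing to nothing and deletes '=' and 'p'.
table = str.maketrans('', '', '=p')

result = 'hello=potato'.translate(table)

print(result)  # hellootato
```

The table can be built once and reused across many strings, which is convenient when cleaning a long list.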
Check the .replace() function:
>>> 'aaba'.replace('a','').replace('b','')
''
My usual tool for this is the regular expression.
>>> import re
>>> invalidCharacters = r'[=p H]'
>>> re.sub(invalidCharacters, '', ' poH==hHoPPp p')
'ohoPP'
If you need to limit how many characters are removed, see the count argument of re.sub.
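As a short sketch of the count argument mentioned above: re.sub replaces at most count matches, scanning from the left, and the default of 0 means "replace all". The variable names are illustrative.

```python
import re

pattern = r'[=p H]'
text = ' poH==hHoPPp p'

# Default: every matching character is removed.
all_removed = re.sub(pattern, '', text)

# count=3: only the first three matches (' ', 'p', 'H') are removed.
first_three = re.sub(pattern, '', text, count=3)

print(repr(all_removed))  # 'ohoPP'
print(repr(first_three))  # 'o==hHoPPp p'
```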
