How to extend str.isspace()? - python

str.isspace() is very convenient for checking whether a line is empty (it covers spaces, tabs, newlines, and other whitespace characters).
Is it possible to extend the check with some other characters (say, a comma) which should also be treated as "space characters"?

There is no way to extend str.isspace. But you can do the same thing yourself, slightly more verbosely, in a few ways.
An explicit loop:
all(c in my_space_set for c in s)
Make a set:
set(s).issubset(my_space_set)
Or a regular expression with a character class, or…
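A minimal sketch of all three approaches, using an illustrative my_space_set that treats the comma as an extra "space" character:

```python
import re

# Hypothetical set of characters to treat as "space"; adjust to taste.
my_space_set = set(" \t\n\r\v\f,")

s = " ,, \n"

# 1. Explicit loop over the string's characters
print(all(c in my_space_set for c in s))       # True

# 2. Build a set from the string and compare
print(set(s).issubset(my_space_set))           # True

# 3. Regular expression with a character class
print(re.fullmatch(r"[\s,]*", s) is not None)  # True
```

Note one difference from str.isspace(): all three checks return True for the empty string, while "".isspace() is False, so decide separately how you want to treat "".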

you can use str.strip() with the characters you consider as unimportant, and check if the result is empty:
a = " , "
if not a.strip(", \n\t\v\r"):
    print("empty")
(str.strip removes all occurrences of the characters in its argument from both ends of the string; for this emptiness test, str.lstrip or str.rstrip alone would work just as well, and would even be slightly faster)
What slightly bothers me is that it creates a throwaway string just to test that it's empty. I liked abamert's set concept, although creating a set just to use issubset has the same issue (it creates a throwaway set). I would do:
spaceset = set(", \t\n\r\v")  # initialized once and for all
then use issuperset on the existing set, with the string as argument:
spaceset.issuperset(a)
(that doesn't create a set every time)
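A runnable version of that idea (is_blankish is a hypothetical name; the set contents are whatever you consider insignificant):

```python
spaceset = set(", \t\n\r\v\f")  # built once and reused

def is_blankish(s):
    # True when every character of s is in spaceset
    # (issuperset accepts any iterable, including a str)
    return spaceset.issuperset(s)

print(is_blankish(" , "))   # True
print(is_blankish(" a, "))  # False
```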

You can do this with regular expressions
import re
import re

def is_all_modified_whitespace(s):
    return not re.search(r'[^\s,]', s)
is_all_modified_whitespace(s)
Regular expressions are great for stuff like this because they allow you to easily modify the character set to check for.

Related

Should I check if a substring exists before trying to replace it?

In Python 3, is it worth checking whether a substring exists before attempting to replace it? I'm checking about 40,000 strings and only expect to find substring1 in about 1% of them. Does it take longer to check and skip, or to try and fail to replace?
if substring1 in string:
    string = string.replace(substring1, substring2)
or just
string = string.replace(substring1, substring2)
Simple is better than complex. Doing the replacement necessitates checking anyway, so no reason to write it out yourself. Special cases aren't special enough; you don't need any special handling to replace zero instances of the substring - it works the same as replacing any other number of instances.
Don't check.
Both options give the same result. In terms of performance, though, the second is better than the first.
The first makes you go through the string up to twice: once to check, and a second time to replace if the check succeeds. So it always goes through the string once, and possibly twice.
The second always goes through the string only once, replacing as it goes.
As for the replace method itself: if the first argument (substring1) isn't found in the string, nothing happens, so the second form is perfectly safe to use.
Always remember that simpler is better.
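If you want to verify this on your own data, here is a quick timeit sketch. The sample string and the miss rate are made up; absolute numbers will vary by machine, so run it against your actual inputs.

```python
import timeit

# A string that never contains the substring (the common case here)
setup = "s = 'x' * 100"

unconditional = timeit.timeit(
    "s.replace('needle', 'thread')", setup=setup, number=100_000)
guarded = timeit.timeit(
    "s.replace('needle', 'thread') if 'needle' in s else s",
    setup=setup, number=100_000)

print(f"unconditional: {unconditional:.4f}s")
print(f"guarded:       {guarded:.4f}s")
```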

How do I use re.search starting from a certain index in the string?

Seems like a simple thing but I'm not seeing it. How do I start the search in the middle of a string?
The re.search function doesn't take a start argument like the str methods do. But the search method of a compiled pattern (the object returned by re.compile) does take a pos argument.
This makes sense if you think about it. If you really need to use the same regular expressions over and over, you probably should be compiling them. Not so much for efficiency—the cache works nicely for most applications—but just for readability.
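For example (pattern and string are illustrative):

```python
import re

pat = re.compile(r"abc")
s = "abcdefabcdef"

# The search method of a compiled pattern accepts pos (and endpos);
# the module-level re.search does not.
m = pat.search(s, 3)  # start looking at index 3
print(m.start())      # 6 -- finds the second 'abc'
```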
But what if you need to use the top-level function, because you can't pre-compile your patterns for some reason?
Well, there are plenty of third-party regular expression libraries. Some of these wrap PCRE or Google's RE2 or ICU, some implement regular expressions from scratch, and they all have at least slightly different, sometimes radically different, APIs.
But the regex module, which is designed as an eventual replacement for re in the stdlib (although it's been bumped a couple of times now because it's not quite ready), is pretty much usable as a drop-in replacement for re, and (among other extensions) its search function takes pos and endpos arguments.
Normally, the most common reason you'd want to do this is to "find the next match after the one I just found", and there's a much easier way to do that: use finditer instead of search.
For example, this str-method loop:
i = 0
while True:
    i = s.find(sub, i)
    if i == -1:
        break
    do_stuff_with(s, i)
    i += 1  # advance past this match, or find will return it again
… translates to this much nicer regex loop:
for match in re.finditer(pattern, s):
    do_stuff_with(match)
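A concrete version of that loop, with an illustrative pattern that finds every run of digits:

```python
import re

s = "a1b22c333"
for match in re.finditer(r"\d+", s):
    print(match.start(), match.group())
# 1 1
# 3 22
# 6 333
```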
When that isn't appropriate, you can always slice the string:
match = re.search(pattern, s[index:])
But that makes an extra copy of half your string, which could be a problem if the string is actually, say, a 12GB mmap. (Of course for the 12GB mmap case, you'd probably want to map a new window… but there are cases where that won't help.)
Finally, you can always just modify your pattern to skip over index characters:
match = re.search('.{%d}%s' % (index, pattern), s)
All I've done here is to add, e.g., .{20} to the start of the pattern, which means to match exactly 20 of any character, plus whatever else you were trying to match. Here's a simple example:
.{3}(abc)
If I give this abcdefabcdef, it will match the first 'abc' after the 3rd character—that is, the second abc.
But notice that what it actually matches is 'defabc'. Because I'm using capture groups for my real pattern, and I'm not putting the .{3} in a group, match.group(1) and so on will work exactly as I'd want them to, but match.group(0) will give me the wrong thing. If that matters, you need a lookbehind.
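A sketch of both variants. Since .{3} has a fixed width, it is allowed inside a Python re lookbehind, which keeps the skipped characters out of group(0):

```python
import re

s = "abcdefabcdef"

with_skip = re.search(r".{3}(abc)", s)
print(with_skip.group(0))  # 'defabc' -- skipped characters included
print(with_skip.group(1))  # 'abc'

with_look = re.search(r"(?<=.{3})abc", s)
print(with_look.group(0))  # 'abc'
print(with_look.start())   # 6
```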

Python: regular expressions in control structures [duplicate]

Closed 10 years ago.
Possible Duplicate:
How to check if text is “empty” (spaces, tabs, newlines) in Python?
I am trying to write a short function to process lines of text in a file. When it encounters a line with significant content (meaning more than just whitespace), it is to do something with that line. The control structure I wanted was
if '\S' in line: do something
or
if r'\S' in line: do something
(I tried the same combinations with double quotes also, and yes I had imported re.) The if statement above, in all the forms I tried, always returns False. In the end, I had to resort to the test
if re.search('\S', line) is not None: do something
This works, but it feels a little clumsy in relation to a simple if statement. My question, then, is why isn't the if statement working, and is there a way to do something as (seemingly) elegant and simple?
I have another question unrelated to control structures, but since my suspicion is that it is also related to a possibly illegal use of regular expressions, I'll ask it here. If I have a string
s = " \t\tsome text \t \n\n"
The code
s.strip('\s')
returns the same string complete with spaces, tabs, and newlines (r'\s' is no different). The code
s.strip()
returns "some text". This, even though strip called with no character string supposedly defaults to stripping whitespace characters, which to my mind is exactly what the expression '\s' is doing. Why is the one stripping whitespace and the other not?
Thanks for any clarification.
Python string functions are not aware of regular expressions, so if you want to use them you have to use the re module.
However, if you are only interested in finding out whether a string is entirely whitespace, you can use the str.isspace() method:
>>> 'hello'.isspace()
False
>>> ' \n\t '.isspace()
True
This is what you're looking for:
if not line.isspace(): do something
Also, str.strip does not use regular expressions.
If you really just want to find out whether the line consists only of whitespace characters, a regex is a little overkill. You should go for the following instead:
if text.strip():
    # do stuff
which is basically the same as:
if not text.strip() == "":
    # do stuff
Python evaluates every non-empty string to True. So if text consists only of whitespace-characters, text.strip() equals "" and therefore evaluates to False.
The expression '\S' in line does the same thing as any other string in line test; it tests whether the string on the left occurs inside the string on the right. It does not implicitly compile a regular expression and search for a match. This is a good thing. What if you were writing a program that manipulated regular expressions input by the user and you actually wanted to test whether some sub-expression like \S was in the input expression?
Likewise, read the documentation of str.strip. Does it say that it will treat its input as a regular expression and remove matching strings? No. If you want to do something with regular expressions, you have to actually tell Python that, not expect it to somehow guess that you meant a regular expression this time when at other times you meant a plain string. While you might think of searching for a regular expression as very similar to searching for a string, they are completely different operations as far as the language implementation is concerned. And most str methods wouldn't even make sense applied to a regular expression.
Because match objects are "truthy" in boolean context (like most class instances), you can at least shorten your if statement by dropping the is not None test. The rest of the line is necessary to actually tell Python what you want. As for your str.strip case (or other cases where you want to do something similar to a string operation, but with a regular expression), have a look at the functions in the re module; there are a number of convenience functions there that can be helpful. Or else it should be pretty easy to implement such a function yourself.
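For instance, a regex-flavored strip could be sketched like this (re_strip is a hypothetical helper; for plain whitespace, s.strip() is simpler and faster):

```python
import re

def re_strip(s):
    # Remove leading and trailing runs of whitespace, like s.strip()
    return re.sub(r"^\s+|\s+$", "", s)

s = " \t\tsome text \t \n\n"
print(repr(re_strip(s)))  # 'some text'
```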

Find which lines in a file contain certain characters

Is there a way to find out if a string contains any one of the characters in a set with python?
It's straightforward to do it with a single character, but I need to check and see if a string contains any one of a set of bad characters.
Specifically, suppose I have a string:
s = 'amanaplanacanalpanama~012345'
and I want to see if the string contains any vowels:
bad_chars = 'aeiou'
and do this in a for loop for each line in a file:
if [any one or more of the bad_chars] in s:
do something
I am scanning a large file, so if there is a faster method, that would be ideal. Also, not every bad character has to be checked; so long as one is encountered, that is enough to end the search.
I'm not sure if there is a builtin function or easy way to implement this, but I haven't come across anything yet. Any pointers would be much appreciated!
any((c in badChars) for c in yourString)
or
any((c in yourString) for c in badChars) # extensionally equivalent, slower
or
set(yourString) & set(badChars) # extensionally equivalent, slower
"so long as one is encountered that is enough to end the search." - This will be true if you use the first method.
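The variants side by side, using the question's sample data:

```python
s = "amanaplanacanalpanama~012345"
bad_chars = "aeiou"

print(any(c in bad_chars for c in s))  # True -- stops at the first hit
print(any(c in s for c in bad_chars))  # True
print(bool(set(s) & set(bad_chars)))   # True

print(any(c in "xyz" for c in s))      # False -- no bad characters found
```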
You say you are concerned with performance: performance should not be an issue unless you are dealing with a huge amount of data. If you encounter issues, you can try:
Regexes
edit: Previously I had written a section here on using regexes via the re module, programmatically generating a regex consisting of a single character class [...] and using .finditer, with the caveat that naively putting a backslash before every character might not work correctly. After testing it, that is indeed the case, and I would definitely not recommend this method. Doing it properly would require reverse-engineering the entire (slightly complex) sub-grammar of regex character classes (e.g. you might have characters like \ followed by w, like ] or [, or like -, and merely escaping some, like \w, may give them a new meaning).
Sets
Since the in operation on a string is O(N), it may be worthwhile to first convert badChars into a set so that each membership test is O(1), if you have many bad characters:
badCharSet = set(badChars)
any((c in badCharSet) for c in yourString)
(the similar one-liner any((c in set(yourString)) for c in badChars) is not equivalent in cost: CPython rebuilds set(yourString) on every iteration of the generator, so build the set once beforehand)
Do you really need to do this line-by-line?
It may be faster to do this once for the entire file than once for every line in the file, though the constant factors may be such that it won't matter.
Use python's any function.
if any((bad_char in my_string) for bad_char in bad_chars):
    # do something
This should be very efficient and clear. It uses sets:
#!/usr/bin/python
bad_chars = set('aeiou')
with open('/etc/passwd', 'r') as file_:
    file_string = file_.read()
file_chars = set(file_string)
if file_chars & bad_chars:
    print('found something bad')
This compiled regular expression was twice as fast as any in my minimal testing. You should try it with your own data.
import re

r = re.compile('[aeiou]')
if r.search(s):
    # do something
The following Python code does something for each character of bad_chars that occurs in s:
for i in bad_chars:
    if i in s:
        # do_something
You could also use Python's built-in any, for example:
>>> any(e for e in bad_chars if e in s)
True

making difflib's SequenceMatcher ignore "junk" characters

I have a lot of strings that I want to match for similarity (each string is 30 characters on average). I found difflib's SequenceMatcher great for this task, as it is simple and the results were good. But if I compare hellboy and hell-boy like this
>>> sm=SequenceMatcher(lambda x:x=='-','hellboy','hell-boy')
>>> sm.ratio()
0.93333333333333335
I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the function above are not excluded from comparison, but only from finding the longest contiguous matching subsequence. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?
If you wish to do as I suggested in the comments, (removing the junk characters) the fastest method is to use str.translate().
E.g:
to_compare = to_compare.translate(None, "-")
As shown here, this is significantly (3x) faster (and I feel nicer to read) than a regex.
Note that under Python 3.x, or if you are using Unicode under Python 2.x, this will not work, as the deletechars parameter is not accepted. In this case, you simply need to make a mapping to None. E.g:
translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)
You could also have a small function to save some typing if you have a lot of characters you want to remove, just make a set and pass through:
def to_translation_map(iterable):
    return {key: None for key in iterable}
    # return dict((key, None) for key in iterable)  # for old versions of Python without dict comprehensions
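Putting it together with SequenceMatcher under Python 3: the helper's character-keyed mapping needs to go through str.maketrans, which converts the keys to the ordinals that str.translate expects:

```python
from difflib import SequenceMatcher

def to_translation_map(iterable):
    return {key: None for key in iterable}

# str.maketrans turns {'-': None, ...} into {45: None, ...}
table = str.maketrans(to_translation_map("-_*"))
a = "hellboy".translate(table)
b = "hell-boy".translate(table)
print(SequenceMatcher(None, a, b).ratio())  # 1.0
```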
If you were to make a function to remove all the junk character before hand you could use re:
string = re.sub(r'-|_|\*', '', string)
For the regular expression r'-|_|\*', just put a | between all the junk characters, and if one is a special re character, put a \ before it (like * and +).
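A sketch of that approach using a character class built with re.escape instead of hand-escaped alternation, checked against the question's hellboy/hell-boy example (strip_junk is a hypothetical helper):

```python
import re
from difflib import SequenceMatcher

def strip_junk(s, junk="-_*"):
    # A character class [...] built with re.escape handles
    # special characters like * safely
    return re.sub("[" + re.escape(junk) + "]", "", s)

a = strip_junk("hellboy")
b = strip_junk("hell-boy")
print(SequenceMatcher(None, a, b).ratio())  # 1.0
```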
