Is there a way to find out if a string contains any one of the characters in a set with python?
It's straightforward to do it with a single character, but I need to check and see if a string contains any one of a set of bad characters.
Specifically, suppose I have a string:
s = 'amanaplanacanalpanama~012345'
and I want to see if the string contains any vowels:
bad_chars = 'aeiou'
and do this in a for loop for each line in a file:
if [any one or more of the bad_chars] in s:
do something
I am scanning a large file so if there is a faster method to this, that would be ideal. Also, not every bad character has to be checked---so long as one is encountered that is enough to end the search.
I'm not sure if there is a builtin function or easy way to implement this, but I haven't come across anything yet. Any pointers would be much appreciated!
any((c in badChars) for c in yourString)
or
any((c in yourString) for c in badChars) # extensionally equivalent, slower
or
set(yourString) & set(badChars) # extensionally equivalent, slower
"so long as one is encountered that is enough to end the search." - This will be true if you use the first method.
You say you are concerned with performance: performance should not be an issue unless you are dealing with a huge amount of data. If you encounter issues, you can try:
Regexes
edit: Previously I had written a section here on using regexes via the re module, programmatically generating a regex consisting of a single character class [...] and using .finditer, with the caveat that naively putting a backslash before every character might not work correctly. After testing it, that is indeed the case, and I would definitely not recommend this method. Using it would require reverse-engineering the entire (slightly complex) sub-grammar of regex character classes (e.g. you might have characters like \ followed by w, like ] or [, or like -, and merely escaping some, like \w, would give them a new meaning).
Sets
Depending on whether the str.__contains__ operation is O(1) or O(N), it may be justifiable to first convert badChars into a set, so that each in check is O(1) if you have many badChars:
badCharSet = set(badChars)
any((c in badCharSet) for c in yourString)
(the one-liner any((c in set(yourString)) for c in badChars) will not help in CPython, which rebuilds set(yourString) on every iteration, so precompute the set as above)
Do you really need to do this line-by-line?
It may be faster to do this once for the entire file O(#badchars), than once for every line in the file O(#lines*#badchars), though the asymptotic constants may be such that it won't matter.
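As a sketch of the whole-file variant (the file name here is hypothetical):

bad_chars = set('aeiou')

# one pass over the whole file instead of one check per line
with open('input.txt') as f:
    if any(c in bad_chars for c in f.read()):
        print('the file contains a bad character')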
Use python's any function.
if any((bad_char in my_string) for bad_char in bad_chars):
    # do something
This should be very efficient and clear. It uses sets:
#!/usr/bin/python
bad_chars = set('aeiou')
with open('/etc/passwd', 'r') as file_:
    file_string = file_.read()
file_chars = set(file_string)
if file_chars & bad_chars:
    print('found something bad')
This regular expression is twice as fast as any with my minimal testing. You should try it with your own data.
import re

r = re.compile('[aeiou]')
if r.search(s):
    # do something
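If you want to verify that claim on your own data, a minimal timeit sketch:

import re
import timeit

s = 'amanaplanacanalpanama~012345'
bad_chars = 'aeiou'
r = re.compile('[aeiou]')

# rough comparison; numbers will vary with your data
print(timeit.timeit(lambda: any(c in bad_chars for c in s), number=100000))
print(timeit.timeit(lambda: bool(r.search(s)), number=100000))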
The following Python code prints each character from bad_chars that exists in s:

for i in bad_chars:
    if i in s:
        print(i)
You could also use Python's built-in any, for example:
>>> any(e for e in bad_chars if e in s)
True
Related
str.isspace() is very convenient to check whether a line is empty (it covers spaces, tabs, and newline/return characters).
Is it possible to extend with some other characters (say, a comma) which should also be treated as "space characters" during the check?
There is no way to extend str.isspace. But you can do the same thing yourself, slightly more verbosely, in a few ways.
An explicit loop:
all(c in my_space_set for c in s)
Make a set:
set(s).issubset(my_space_set)
Or a regular expression with a character class, or…
you can use str.strip() with the characters you consider as unimportant, and check if the result is empty:
a=" , "
if not a.strip(", \n\t\v\r"):
print("empty")
(str.strip removes all occurrences of the given characters from both ends of the string; for this emptiness check, str.lstrip or str.rstrip would work just as well, and would even be slightly faster)
What slightly bothers me is that it creates a throwaway string just to test that it's empty. I liked abamert's set concept, although creating a set just to call issubset has the same issue (it creates a throwaway set). I would do:
spaceset = set(", \n\r\v") # initialized once and for all
then use issuperset on the existing set, with the string as argument:
spaceset.issuperset(a)
(that doesn't create a set every time)
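A quick usage sketch, including the empty-string edge case:

spaceset = set(", \n\r\v")

print(spaceset.issuperset(" , "))   # True: every character of the string is in the set
print(spaceset.issuperset(" x, "))  # False: 'x' is not in the set
print(spaceset.issuperset(""))      # True: note that the empty string passes too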
You can do this with regular expressions
import re

def is_all_modified_whitespace(s):
    return not re.search(r'[^\s,]', s)

is_all_modified_whitespace(s)
Regular expressions are great for stuff like this because they allow you to easily modify the character set to check for.
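A few example calls (a quick sketch):

print(is_all_modified_whitespace(" , \t"))  # True: only whitespace and commas
print(is_all_modified_whitespace(" a, "))   # False: 'a' is outside the allowed set
print(is_all_modified_whitespace(""))       # True: nothing outside the set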
I've got a string in this format:
a = "[a,b,c],[e,d,f],[g,h,i]"
Each part I want to split on is separated by ],[. I tried a.split("],["), but the brackets at the split points get removed.
In my example that would be:
["[a,b,c","e,d,f","g,h,i]"]
I was wondering if there was a way to keep the brackets after the split?
Desired outcome:
["[a,b,c]","[e,d,f]","[g,h,i]"]
The problem is that str.split removes whatever substring you split on from the resulting list. I think it would be better in this case to use the slightly more powerful split function from the re module:
>>> from re import split
>>> a = "[a,b,c],[e,d,f],[g,h,i]"
>>> split(r'(?<=\]),(?=\[)', a)
['[a,b,c]', '[e,d,f]', '[g,h,i]']
>>>
(?<=\]) is a lookbehind assertion which looks for ]. Similarly, (?=\[) is a lookahead assertion which looks for [. Both constructs are explained in Regular Expression Syntax.
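A related option (a sketch, not part of the answer above) is to extract the bracketed groups directly rather than splitting:

>>> import re
>>> a = "[a,b,c],[e,d,f],[g,h,i]"
>>> re.findall(r'\[[^\]]*\]', a)
['[a,b,c]', '[e,d,f]', '[g,h,i]']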
Python is very flexible, so you just have to manage it a bit and be adaptive to your case.
In [8]: a = "[a,b,c],[e,d,f],[g,h,i]"
   ...: a.replace('],[', '] [').split(" ")
Out[8]: ['[a,b,c]', '[e,d,f]', '[g,h,i]']
The other answers are correct, but here is another way to go.
Important note: this is just to present another option that may prove useful in certain cases. Don't do it in the general case, and only do so if you're absolutely certain you have control over the expression you're passing to exec.
>>> a, b, c, d, e, f, g, h, i = range(1, 10)  # the names must be declared beforehand
>>> exp = "[a,b,c],[e,d,f],[g,h,i]"
>>> exec("my_object = " + exp)
>>> my_object
([1, 2, 3], [5, 4, 6], [7, 8, 9])
Then, you can do whatever you like with my_object.
Provided that you have full control over exp, this approach seems more appropriate and Pythonic to me, because you are treating a piece of Python code written in a string as... a piece of Python code written in a string (hence exec), without manipulating it through regexps or artificial hacks.
Just keep in mind that it can be dangerous.
I have a lot of strings that I want to match for similarity (each string is 30 characters on average). I found difflib's SequenceMatcher great for this task, as it is simple and the results are good. But if I compare hellboy and hell-boy like this:
>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335
I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the call above are not used for comparison but for finding the longest contiguous matching subsequence. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?
If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().
E.g:
to_compare = to_compare.translate(None, "-")
As shown here, this is significantly (3x) faster (and I feel nicer to read) than a regex.
Note that under Python 3.x, or if you are using Unicode under Python 2.x, this will not work, as the delchars parameter is not accepted. In that case, you simply need to make a mapping to None. E.g.:
translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)
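Tying this back to the question, a Python 3 sketch showing that the ratio becomes 1.0 once the junk is removed:

from difflib import SequenceMatcher

translation_map = str.maketrans({"-": None})
a = 'hellboy'.translate(translation_map)
b = 'hell-boy'.translate(translation_map)
print(SequenceMatcher(None, a, b).ratio())  # 1.0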
You could also have a small function to save some typing if you have a lot of characters you want to remove, just make a set and pass through:
def to_translation_map(iterable):
    return {key: None for key in iterable}
    # return dict((key, None) for key in iterable)  # for old versions of Python without dict comprehensions
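For example (a sketch, Python 3):

junk = {"-", "_", "*"}
translation_map = str.maketrans(to_translation_map(junk))
print("hell-boy_*".translate(translation_map))  # hellboy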
If you were to make a function to remove all the junk characters beforehand, you could use re:
string = re.sub(r'-|_|\*', '', string)
For the regular expression r'-|_|\*', just put a | between all the junk characters, and if one of them is a special regex character, put a \ before it (like * and +).
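To avoid escaping by hand, a sketch that lets re.escape do the work:

import re

junk = '-_*+'
pattern = '|'.join(re.escape(c) for c in junk)  # escaping handled for you
string = re.sub(pattern, '', 'hell-boy_+*')
print(string)  # hellboy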
I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") contains white spaces, in the second part (a subtitle) the words are separated with the "_" character, and at the end there is a number (a page number). I want to split each line into 3 (obvious) parts, because I want to create some sort of directory in Python.
I was trying with the re module, but as the caret character has a special meaning there, I couldn't figure out how to do it.
Could someone please help me?
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want the three non-empty pieces, you can accomplish this with a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that it will produce an empty string between consecutive caret characters. You could protect against this by either collapsing consecutive carets into a single one, or by filtering empty strings out of the result.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
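For illustration, a csv sketch (Python 3; the file name is hypothetical):

import csv

with open('menus.txt', newline='') as f:
    for row in csv.reader(f, delimiter='^'):
        print(row)  # e.g. ['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']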
Also, for the re module, any special characters can be escaped with \, like r'\^'. I'd suggest before jumping to use re to 1) learn how to write regular expressions, 2) first look for a solution to your problem instead of jumping to regular expressions - «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. »
I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.
I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.
My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?
Edit: So, I guess an example of my question would be: How expensive would it be to compile and search with one regular expression versus say, iterating 'if "word" in string' say, 5 times?
You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
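A sketch of that approach:

import re

words = ['foo', 'bar', 'baz']
# one combined pattern, with a group per alternative
pattern = re.compile('|'.join('(%s)' % re.escape(w) for w in words))

m = pattern.search('some text with bar in it')
if m:
    print(m.lastindex)             # 1-based index of the group that matched (here: 2)
    print(words[m.lastindex - 1])  # 'bar'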
If speed is of the essence, you are better off running some tests before you decide how to code your production application.
First of all, you said that you are searching for words, which suggests that you may be able to do this using split() to break up the string on whitespace, and then use simple string comparisons to do your search.
Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the string class for a full list.
Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.
The classic algorithm for this task is Aho-Corasick for which there is a Python extension (written in C). This should beat the socks off any alternative that's using the re module.
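A sketch using the third-party pyahocorasick package (one such C extension; install with pip install pyahocorasick):

import ahocorasick  # third-party: pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in ['foo', 'bar', 'baz']:
    automaton.add_word(word, word)
automaton.make_automaton()

# one pass over the text finds every occurrence of every word
for end_index, word in automaton.iter('some text with bar and baz in it'):
    print(end_index, word)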
If you want to know how fast compiling regex patterns is, you need to benchmark it.
Here is how I do that. It compiles each pattern 1 million times.
import time, re

def taken(f):
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print t2 - t1, "s taken"
        return r
    return wrap

@taken
def regex_compile_test(x):
    for i in range(1000000):
        re.compile(x)
    print "for", x,

# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}")
It took around 5 seconds for each pattern on my computer.
for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken
The real bottleneck is not in compiling patterns; it's in extracting text, e.g. with re.findall, or replacing it with re.sub. Used against several MB of text, that is quite slow.
If your text is fixed, use normal str.find; it's faster than regex.
Actually, if you give us your text samples and your regex pattern samples, we could give you a better idea; there are many great regex and Python folks out there.
Hope this helps, and sorry if my answer doesn't.
When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.
There are other string searching approaches, such as the Boyer-Moore algorithm. But I'd wager the complexity of searching for multiple separate strings one by one is much higher than that of a single regexp that can branch on each successive character.
This is a question that can readily be answered by just trying it.
>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
...     words = f.readlines()
...
>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758
It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.
If you're just searching for a particular substring, use str.find() instead.
Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.
However, when it comes to short pieces of text, regexes have incredibly good performance. Personally, I only remember running into problems when text sizes became ridiculous, like 100k words or something like that.
Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes then goes through all the pieces of text in a big loop or runs as a service. This way you will only have to compile the regexes once.
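A common lightweight version of the same idea (a sketch) is to compile at import time, so each pattern is built once per process:

import re

# compiled once at import time, reused on every call
PATTERNS = [re.compile(p) for p in (r'foo\d+', r'bar\w*')]

def find_matches(text):
    return [m.group(0) for p in PATTERNS for m in p.finditer(text)]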
In the general case, you can use the "in" keyword:
for line in open("file"):
    if "word" in line:
        print line.rstrip()
regex is usually not needed when you use Python :)