substring with a small change - python

I'm trying to solve a problem where I'm given a set of strings and have to count how many times a certain word, like 'code', appears within a string. The program should also count any variant where the 'd' changes, like 'coze', but something like 'coz' doesn't count. This is what I made:
def count(word):
    count=0
    for i in range(len(word)):
        lo=word[i:i+4]
        if lo=='co': # this is what gives me trouble
            count+=1
    return count

Test whether the first two characters match 'co' and the fourth character matches 'e'.
def count(word):
    count=0
    for i in range(len(word)-3):
        if word[i:i+2] == 'co' and word[i+3] == 'e':
            count+=1
    return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
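The same slicing idea can be generalised to any pattern with a "don't care" position. The following is only a sketch: the helper name count_pattern and the '?' wildcard convention are made up for illustration and are not part of the original answer.

```python
def count_pattern(text, pattern, wildcard="?"):
    """Count start positions where pattern matches text.

    The wildcard character (hypothetical convention) matches any single character.
    """
    width = len(pattern)
    total = 0
    # Stop early enough that every window holds exactly `width` characters.
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        if all(p == wildcard or p == ch for p, ch in zip(pattern, window)):
            total += 1
    return total


print(count_pattern("aaacodebbb", "co?e"))  # 1
print(count_pattern("cozexxcope", "co?e"))  # 2
print(count_pattern("coz", "co?e"))         # 0
```

Because the range is empty when the text is shorter than the pattern, short inputs such as 'coz' return 0 without any special case.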

You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
re.findall(r'co.e', string)
['code', 'coze']
From there you could write a function such as:
def count(string, word):
    return len(re.findall(word, string))

Regex is the answer to your question, as mentioned above, but you need a more refined pattern. Since you are looking for occurrences of a whole word, you need to anchor the pattern on word boundaries. So your pattern should be something like this:
pattern = r'\bco.e\b'
This way your search will not match inside words like testcodetest or cozetest; it only matches standalone words such as code, coze or coke, with no leading or trailing characters.
If you are going to run the search many times, it's better to use a compiled pattern, which avoids looking the pattern up again on every call.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
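To see what the word boundaries change, here is a small sketch that counts matches with and without \b on a shortened version of the sample string. The function name count_word_variants and its default pattern are assumptions made for this example.

```python
import re


def count_word_variants(text, pattern=r"\bco.e\b"):
    """Return how many non-overlapping matches of pattern occur in text."""
    compiled = re.compile(pattern)
    return len(compiled.findall(text))


sample = "codeorg testcozetest code, coze, and coz"

print(count_word_variants(sample))           # 2: only the standalone words
print(count_word_variants(sample, r"co.e"))  # 4: also matches inside longer words
```

Note that '.' matches any character except a newline, so a pattern such as 'co.e' would also accept a digit or a space in the third position.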
Hope that helps.

Related

Finding letters within items in lists in Python

I'm trying to go through a list to find letter combinations that don't exist in English. After a fair amount of arguing, I have a word list that I can mess with. Each word is listed as 'word\n' since each word is on a line. If I wanted to find, say, the word 'winter', the in operator works, but only if I'm looking for 'winter\n'. I can't look for just 'winter', so I can't find individual letter pairs, which is the goal.
There's over a quarter million items, so I can't cycle through the list every time, it would take ages. I don't care about index, I just need a true/false of if a letter pair is anywhere in the list.
Sorry if this was a bit rambly, I hope I got my point across. Thanks!
Assuming you don't want to alter your wordlist, it sounds like you're looking for something like this:
def search(word_list, word): # word_list is your list of words, word is the word you're searching for
    for w in word_list: # iterate over the list
        if w.startswith(word): # check if any of them start with the word you're looking for
            return True # return true if a match is found
    return False # return false if no matches are found
If you want to find a substring anywhere in a word rather than only at the beginning, replace w.startswith(word) with word in w.
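Since the question asks for many letter-pair lookups against a large list, one possible alternative is to scan the list once and collect every adjacent pair into a set, so that each later lookup is a single membership test. This is a sketch under that assumption; build_pair_set is a hypothetical helper, and the three-word list stands in for the real quarter-million entries.

```python
def build_pair_set(word_list):
    """Collect every two-letter sequence that occurs in any word."""
    pairs = set()
    for raw in word_list:
        word = raw.strip().lower()  # drop the trailing '\n' each entry carries
        for i in range(len(word) - 1):
            pairs.add(word[i:i + 2])
    return pairs


words = ["winter\n", "summer\n", "quiz\n"]  # made-up sample data
pairs = build_pair_set(words)

print("nt" in pairs)  # True, from 'winter'
print("qz" in pairs)  # False, no word contains it
```

The one-off pass costs time proportional to the total number of letters, after which the list never has to be cycled through again.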
There are several ways to do that but the easiest one is like this:
flag = True
STRING = 'YOUR STRING'
def check(letter):
    for k in range(33, 127):
        if chr(k) == letter:
            return True
    return False
for i in STRING:
    if not check(i):
        flag = False
        break
The reason for 33 and 127 in the for loop is that codes 33 to 126 are the printable ASCII characters other than the space: the English letters, digits and punctuation (such as ?, !, *, (, ), etc.).
Notice: This code is just for one string!
And also you can use regex library to do that.
You can create a variable like pattern like this:
pattern = '[A-Za-z]'
this pattern is for all of the English letters.
And then:
import re
new_string = re.sub(pattern, '', STRING)
if new_string == '':
    flag = True
else:
    flag = False
The sub method works like replace: you give it a pattern, the replacement for the pattern, and the string to search.
So we replace all of the English letters in a string with '' and when there is nothing left on your string it means that your string is made of English letters.
But I'm not sure about syntax for re. You have to take look at doc.
If you are looking for a fast algorithm, do not use the first way: check() loops over up to 94 character codes for every character of the string, and that is for a single string, not a list.

I want to split a string by a character on its first occurence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char"):
        #do xyz with some_char
    elif string.find("another_char"):
        #do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or the above procedure would be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
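As a sketch of how this could be wrapped up for the question's use case, the hypothetical helper below splits on whichever listed character appears first in the string. Returning None when no delimiter is present is a choice made for this example (the question's own code returns False).

```python
import re


def split_on_first_delimiter(text, delimiters=",*;/"):
    """Split text once, at the earliest occurrence of any delimiter character."""
    # re.escape keeps characters such as '*' or ']' literal inside the class.
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, text, maxsplit=1)
    return parts if len(parts) == 2 else None


print(split_on_first_delimiter("a*b,c,d"))  # ['a', 'b,c,d']
print(split_on_first_delimiter("a,b*c"))    # ['a', 'b*c']
print(split_on_first_delimiter("abc"))      # None
```

The star wins in the first call because it comes first in the string, regardless of its position in the delimiter list.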
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed up is important (it probably isn't) you can use the compiled version in a function instead of having it re compile every time. Then make a separate function that stores the actual compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
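But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that with maxsplit=1 the + only matters when the first delimiter is immediately followed by another one: the whole run is then consumed, instead of the second delimiter being left at the start of the remainder.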
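A short illustration of what the + changes when no maxsplit is given. The input string is made up so that two delimiters sit next to each other.

```python
import re

text = "a,;b*c"  # made-up input with two adjacent delimiters

one_at_a_time = re.split(r"[,*;/]", text)
runs_merged = re.split(r"[,*;/]+", text)

print(one_at_a_time)  # ['a', '', 'b', 'c']
print(runs_merged)    # ['a', 'b', 'c']
```

Without the +, the empty string between ',' and ';' shows up as its own element; with it, the run is treated as a single separator.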
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.

How do you filter a string to only contain letters?

How do I make a function where it will filter out all the non-letters from the string? For example, letters("jajk24me") will return back "jajkme". (It needs to be a for loop) and will string.isalpha() function help me with this?
My attempt:
def letters(input):
    valids = []
    for character in input:
        if character in letters:
            valids.append( character)
    return (valids)
If it needs to be in that for loop, and a regular expression won't do, then this small modification of your loop will work:
def letters(input):
    valids = []
    for character in input:
        if character.isalpha():
            valids.append(character)
    return ''.join(valids)
(The ''.join(valids) at the end takes all of the characters that you have collected in a list, and joins them together into a string. Your original function returned that list of characters instead)
You can also filter out characters from a string:
def letters(input):
    return ''.join(filter(str.isalpha, input))
or with a list comprehension:
def letters(input):
    return ''.join([c for c in input if c.isalpha()])
or you could use a regular expression, as others have suggested.
import re
valids = re.sub(r"[^A-Za-z]+", '', my_string)
EDIT: If it needs to be a for loop, something like this should work:
output = ''
for character in input:
    if character.isalpha():
        output += character
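One difference between the two approaches is worth seeing side by side: str.isalpha() accepts any Unicode letter, while the [^A-Za-z] pattern keeps only the 26 ASCII letters in each case. The function names and the sample string below are made up for this sketch.

```python
import re


def letters_unicode(text):
    """Keep every character that str.isalpha() considers a letter."""
    return "".join(ch for ch in text if ch.isalpha())


def letters_ascii(text):
    """Keep only the ASCII letters A-Z and a-z."""
    return re.sub(r"[^A-Za-z]+", "", text)


sample = "caf\u00e924me"  # 'café24me' with a precomposed é

print(letters_unicode(sample))  # caféme
print(letters_ascii(sample))    # cafme
```

Which behaviour is wanted depends on whether accented and non-Latin letters should count as letters.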
See re.sub, for performance consider a re.compile to optimize the pattern once.
Below you find a short version which matches all characters not in the range from A to Z and replaces them with the empty string. The re.I flag ignores case, so lowercase (a-z) characters are kept as well.
import re
def charFilter(myString):
    return re.sub('[^A-Z]+', '', myString, flags=re.I)
If you really need that loop, there are many answers explaining that specifically. However, you might want to give a reason why you need a loop.
If you want to operate on the number sequences and that's the reason for the loop, consider replacing the replacement string parameter with a function like:
import re
def numberPrinter(match):
    # called once per removed run of non-letters; receives the match object
    print(match.group())
    return ''
def charFilter(myString):
    return re.sub('[^A-Z]+', numberPrinter, myString, flags=re.I)
The method string.isalpha() checks whether string consists of alphabetic characters only. You can use it to check if any modification is needed.
As to the other part of the question, pst is just right. You can read about regular expressions in the python doc: http://docs.python.org/library/re.html
They might seem daunting but are really useful once you get the hang of them.
Of course you can use isalpha. Also, valids can be a string.
Here you go:
def letters(input):
    valids = ""
    for character in input:
        if character.isalpha():
            valids += character
    return valids
Not using a for-loop. But that's already been thoroughly covered.
Might be a little late, and I'm not sure about performance, but I just thought of this solution which seems pretty nifty:
set(x).intersection(y)
You could use it like:
from string import ascii_letters
def letters(string):
    return ''.join(set(string).intersection(ascii_letters))
NOTE:
This will not preserve the original order, and because it goes through a set, repeated letters are collapsed into one. In my use case that is fine, but be warned.
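If order does matter, the set can still be used for fast membership tests while the string itself is walked left to right. A minimal sketch of that variant, with ALLOWED and letters_in_order as made-up names:

```python
from string import ascii_letters

ALLOWED = frozenset(ascii_letters)  # built once, membership tests are O(1)


def letters_in_order(text):
    """Keep ASCII letters, preserving their order and any repeats."""
    return "".join(ch for ch in text if ch in ALLOWED)


print(letters_in_order("jajk24me"))  # jajkme
```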

Parse 4th capital letter of line in Python?

How can I parse lines of text from the 4th occurrence of a capital letter onward? For example given the lines:
adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ
I would like to capture:
`ZsdalkjgalsdkjTlaksdjfgasdkgj`
`PlsdakjfsldgjQ`
I'm sure there is probably a better way than regular expressions, but I attempted to do a non-greedy match; something like this:
match = re.search(r'[A-Z].*?$', line).group()
I present two approaches.
Approach 1: all-out regex
In [1]: import re
In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'
The .*?[A-Z] consumes characters up to, and including, the first uppercase letter.
The (?:...){3} repeats the above three times without creating any capture groups.
The following .*? matches the remaining characters before the fourth uppercase letter.
Finally, the ([A-Z].*) captures the fourth uppercase letter and everything that follows into a capture group.
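One thing to keep in mind is that re.match returns None when a line has fewer than four capitals, so calling .group(1) directly raises AttributeError. A possible wrapper is sketched below; the name from_fourth_capital and the choice to return an empty string are assumptions of this example.

```python
import re

# Same pattern as above, compiled once for reuse.
FOURTH_CAPITAL = re.compile(r'(?:.*?[A-Z]){3}.*?([A-Z].*)')


def from_fourth_capital(line):
    """Return the line from its 4th capital letter on, or '' if there is none."""
    m = FOURTH_CAPITAL.match(line)
    return m.group(1) if m else ""


print(from_fourth_capital('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
print(repr(from_fourth_capital('abcDef')))  # ''
```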
Approach 2: simpler regex
In [1]: import re
In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'
This attacks the problem directly, and I think is easier to read.
Not using regular expressions may seem too verbose, although at the bytecode level it is a very simple algorithm running, and therefore lightweight.
It may be that regexps are faster, since they are implemented in native code, but the "one obvious way to do it", though boring, certainly beats any suitable regexp in readability hands down:
def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()
        if count == n:
            return string[index:]
    return ""
Found this simpler to deal with by using a regular expression to split the string, then slicing the resulting list:
import re
text = ["adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
        "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ"]
for t in text:
    print("".join(re.split("([A-Z])", t, maxsplit=4)[7:]))
Conveniently, this gives you an empty string if there aren't enough capital letters.
A nice, one-line solution could be:
>>> s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2 = 'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ'
>>> s1[list(re.finditer('[A-Z]', s1))[3].start():]
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2[list(re.finditer('[A-Z]', s2))[3].start():]
'PlsdakjfsldgjQ'
Why this works (in just one line)?
Searches for all capital letters in the string: re.finditer('[A-Z]', s1)
Gets the 4th capital letter found: [3]
Returns the position from the 4th capital letter: .start()
Using slicing notation, we get the part we need from the string s1[position:]
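The one-liner raises IndexError when the string has fewer than four capitals, and it builds a list of every match before picking one. A possible variation, sketched here with itertools.islice, stops at the wanted match and falls back to an empty string; from_nth_capital and its n parameter are names invented for this example.

```python
import re
from itertools import islice


def from_nth_capital(text, n=4):
    """Return text from its n-th capital letter on, or '' if there are fewer."""
    matches = re.finditer(r"[A-Z]", text)
    # Skip the first n-1 matches lazily and take the next one, if any.
    nth = next(islice(matches, n - 1, None), None)
    return text[nth.start():] if nth else ""


s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
print(from_nth_capital(s1))             # ZsdalkjgalsdkjTlaksdjfgasdkgj
print(from_nth_capital("ABCDE", n=2))   # BCDE
print(repr(from_nth_capital("abc")))    # ''
```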
I believe that this will work for you, and be fairly easy to extend in the future:
check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
print(re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check).group(2))
The first part of the regex ([^A-Z]*[A-Z]){3} is the real key, this finds the first three upper case letters and stores them along with the characters between them in group 1, then we skip any number of non-upper case letters after the third upper case letter, and finally, we capture the rest of the string.
Testing a variety of methods. I originally wrote string_after_Nth_upper and didn't post it, seeing that jsbueno's method was similar; but because his does an addition and a count comparison for every character (even lowercase letters), it is slightly slower.
s='adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
import re
def string_after_Nth_upper(your_str, N=4):
    upper_count = 0
    for i, c in enumerate(your_str):
        if c.isupper():
            upper_count += 1
            if upper_count == N:
                return your_str[i:]
    return ""
def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()
        if count == n:
            return string[index:]
    return ""
def regex1(s):
    return re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
def regex2(s):
    return re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', s).group(2)
def regex3(s):
    return s[list(re.finditer('[A-Z]', s))[3].start():]
if __name__ == '__main__':
    from timeit import Timer
    t_simple = Timer("string_after_Nth_upper(s)", "from __main__ import s, string_after_Nth_upper")
    print('simple:', t_simple.timeit())
    t_jsbueno = Timer("find_capital(s)", "from __main__ import s, find_capital")
    print('jsbueno:', t_jsbueno.timeit())
    t_regex1 = Timer("regex1(s)", "from __main__ import s, regex1; import re")
    print("Regex1:", t_regex1.timeit())
    t_regex2 = Timer("regex2(s)", "from __main__ import s, regex2; import re")
    print("Regex2:", t_regex2.timeit())
    t_regex3 = Timer("regex3(s)", "from __main__ import s, regex3; import re")
    print("Regex3:", t_regex3.timeit())
Results:
Simple: 4.80558681488
jsbueno: 5.92122507095
Regex1: 3.21153497696
Regex2: 2.80767202377
Regex3: 6.64155721664
So regex2 wins for time.
import re
strings = [
    'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
    'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]
for s in strings:
    m = re.match('[a-z]*[A-Z][a-z]*[A-Z][a-z]*[A-Z][a-z]*([A-Z].+)', s)
    if m:
        print(m.group(1))
It's not the prettiest approach, but:
re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', line).group(2)
caps = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
temp = ''
for char in inputStr:
    if char in caps:
        temp += char
        if len(temp) == 4:
            print(temp[-1]) # this is the answer that you are looking for
            break
Alternatively, you could use re.sub to get rid of anything that's not a capital letter and get the 4th character of what's left
Another version... not that pretty, but gets the job done.
def stringafter4thupper(s):
    i,r = 0,''
    for c in s:
        if c.isupper() and i < 4:
            i+=1
        if i==4:
            r+=c
    return r
Examples:
stringafter4thupper('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj')
stringafter4thupper('oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ')
stringafter4thupper('')
stringafter4thupper('abcdef')
stringafter4thupper('ABCDEFGH')
Respectively results:
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
'PlsdakjfsldgjQ'
''
''
'DEFGH'
Parsing almost always involves regular expressions. However, a regex by itself does not make a parser. In the most simple sense, a parser consists of:
text input stream -> tokenizer
Usually it has an additional step:
text input stream -> tokenizer -> parser
The tokenizer handles opening the input stream and collecting text in a proper manner, so that the programmer doesn't have to think about it. It consumes text elements until there is only one match available to it. Then it runs the code associated with this "token". If you don't have a tokenizer, you have to roll it yourself (in pseudocode):
while stuffInStream:
    currChars += getNextCharFromString
    if regex('firstCase'):
        do stuff
    elif regex('other stuff'):
        do more stuff
This loop code is full of gotchas unless you build such loops all the time. It is also easy to have a computer produce it from a set of rules. That's how Lex/flex works. You can have the rules associated with a token pass the token to yacc/bison as your parser, which adds structure.
Notice that the lexer is just a state machine. It can do anything when it migrates from state to state. I've written lexers that would strip characters from the input stream, open files, print text, send email and so on.
So, if all you want is to collect the text after the fourth capital letter, a regex is not only appropriate, it is the correct solution. BUT if you want to do parsing of textual input, with different rules for what to do and an unknown amount of input, then you need a lexer/parser. I suggest PLY since you are using python.
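To make the tokenizer idea concrete, here is a minimal regex-driven sketch using only the standard re module. The token names and the tiny grammar (numbers, names, a few operators) are invented for illustration; it is not how Lex, flex or PLY are implemented.

```python
import re

# Hypothetical token rules, tried in order; the first alternative that matches wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Yield (kind, value) pairs; whitespace is dropped, unknown characters raise."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup  # name of the rule that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {m.group()!r} at {m.start()}")
        yield kind, m.group()


print(list(tokenize("x = 40 + 2")))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

A parser would then consume this stream of tokens and add structure, which is the second stage in the diagram above.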

Replacing each match with a different word

I have a regular expression like this:
findthe = re.compile(r" the ")
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
What I am trying to do is to replace each occurrence with an associated replacement word from a list so that the end sentence would look like this:
>>> print sentence
This is firstthe first sentence in secondthe whole universe
I tried using re.sub inside a for loop enumerating over replacement but it looks like re.sub returns all occurrences. Can someone tell me how to do this efficiently?
If it is not required to use a regex, then you can try the following code:
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
words = sentence.split()
counter = 0
for i,word in enumerate(words):
    if word == 'the':
        words[i] = replacement[counter]
        counter += 1
sentence = ' '.join(words)
Or something like this will work too:
import re
findthe = re.compile(r"\b(the)\b")
print(re.sub(findthe, replacement[1], re.sub(findthe, replacement[0], sentence, count=1), count=1))
And finally:
re.sub(findthe, lambda matchObj: replacement.pop(0), sentence)
Artsiom's last answer is destructive of replacement variable. Here's a way to do it without emptying replacement
re.sub(findthe, lambda m, r=iter(replacement): next(r), sentence)
You can use a callback function as the replace parameter, see how at:
http://docs.python.org/library/re.html#re.sub
Then use some counter and replace depending on the counter value.
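A sketch of that counter-based callback follows. The helper name replace_each is made up, and leaving a match unchanged once the replacements run out is a design choice of this example rather than something the answers above specify.

```python
import itertools
import re


def replace_each(pattern, replacements, text):
    """Replace the i-th match of pattern with replacements[i], if it exists."""
    counter = itertools.count()

    def pick(match):
        index = next(counter)  # 0 for the first match, 1 for the second, ...
        if index < len(replacements):
            return replacements[index]
        return match.group(0)  # out of replacements: keep the original text

    return re.sub(pattern, pick, text)


sentence = "This is the first sentence in the whole universe!"
print(replace_each(r"\bthe\b", ["firstthe", "secondthe"], sentence))
# This is firstthe first sentence in secondthe whole universe!
```

Unlike the pop(0) version, this leaves the replacement list untouched and does not raise when there are more matches than replacements.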
