searching for specific words in a text python

I'm trying to write a function that takes a word (or set of characters) and the speech as arguments, and returns a Boolean saying whether the word is there or not.
speech2 = open("Obama_DNC.txt", "r")
speech2_words = speech2.read()
def search(word):
    if word in speech2_words:
        if len(word) == len(word in speech2_words):
            print(True)
        elif len(word) != len(word in speech2_words):
            print(False)
    elif not word in speech2_words:
        print(False)
word = input("search?")
search(word)
I want the program to match the searched word exactly as entered, and not when it is only part of another word (for example, "America" inside "American"). I thought of using the len() function, but it doesn't seem to work and I am stuck. Any help figuring this out would be much appreciated. Thank you in advance.

You can also use mmap, which maps the file into memory so it can be searched without first reading it into a Python string.
mmap is treated differently in Python 3 than in Python 2.7: in Python 3 the mapping holds bytes, so the search term must be a bytes object such as b'blabla'.
The code below is for Python 2.7; it looks for a string in the text file.
#!/usr/bin/python
import mmap
f = open('Obama_DNC.txt')
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
if s.find('blabla') != -1:
    print 'true'
See also: Why mmap doesn't work with large files.

One option is to use the findall() function in the re module, which finds all occurrences of a pattern. Wrapping the word in \b (word boundary) anchors makes it match only as a whole word.
Optionally, you can use list.count() to check how many times the searched string occurs in the text:
import re
def search(word):
    found = re.findall('\\b' + word + '\\b', speech2_words)
    if found:
        print(True, '{word} occurs {counts} time'.format(word=word, counts=found.count(word)))
    else:
        print(False)
output:
search?America
(True, 'America occurs 28 time')
search?American
(True, 'American occurs 12 time')
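For readers on Python 3, the same whole-word idea can be sketched as a function that returns a Boolean instead of printing. The function name contains_word and the sample sentence are illustrative assumptions, not part of the original answer; re.escape is added so that characters such as "." in the search term are treated literally.

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word."""
    # re.escape keeps regex metacharacters in the search term literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None

# Hypothetical sample text standing in for the contents of Obama_DNC.txt.
sample = "The American people believe in America."
print(contains_word("America", sample))
print(contains_word("Americ", sample))
```

Note that \b only marks a boundary between a word character and a non-word character, so search terms that begin or end with punctuation may need a different boundary test.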

Related

My Python Regex code isn't finding consecutive sets of characters

I'm trying to code a program to find if there are three consecutive sets of double letters in a .txt file (E.G. bookkeeper). So far I have:
import re
text = open(r'C:\Users\Jimbo.Wimbo\Desktop\List.txt')
for line in text:
    x = re.finditer(r'((\w)\2)+', line)
    if True:
        print("Yes")
    else:
        print("No")
List.txt has 5 words. There is one word with three consecutive sets of double letters right at the end, but it prints 5 "Yes"'s. What can I do to fix it using re and os?

You don't need re.finditer(), you can just use re.search().
Your regexp is wrong, it will match at least 1 set of duplicate characters, not 3.
if True: doesn't do anything useful. It doesn't mean "if the last assignment was a truthy value". You need to test the result of the regexp search.
Use any() to test whether the condition matches any line in the file. Your code prints Yes or No for every line in the file instead.
if any(re.search(r'((\w)\2){3}', line) for line in text):
    print('Yes')
else:
    print('No')
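A self-contained way to try the pattern without a file is sketched below. The word list and the function name has_three_double_letters are assumptions for illustration; the regex is a non-capturing variant of the one above, where \1 refers to the single character captured in the current repetition.

```python
import re

# Three back-to-back pairs: each pair is one character followed by itself.
TRIPLE_DOUBLE = re.compile(r"(?:(\w)\1){3}")

def has_three_double_letters(word):
    """Return True if `word` contains three consecutive double letters."""
    return TRIPLE_DOUBLE.search(word) is not None

# Hypothetical stand-in for the lines of List.txt.
words = ["balloon", "committee", "bookkeeper", "Mississippi"]
for w in words:
    print(w, has_three_double_letters(w))
```

Because \w also matches digits and underscores, a stricter character class such as [a-zA-Z] may be preferable if the file can contain non-letter characters.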

I think your regex is incorrect.
A good way to check your regex is to use an online regex checker, and you can test your regex against any number of strings you provide.
Here is one possible solution to your query:
import re
text = open(r'C:\Users\Jimbo.Wimbo\Desktop\List.txt')
for line in text:
    x = len(re.findall(r'(.)\1', line))
    if x == 3:
        print(f"Found a word with 3 duplicate letters : {line}")
    else:
        print(f"Word: {line}, Duplicate letters : {x}")
Hope this helps.

Formatting in python(Kivy) like in Stack overflow

My issue is that I would like to take input text with formatting like you would use when creating a Stackoverflow post and reformat it into the required text string. The best way I can think is to give an example....
# This is the input string
Hello **there**, how are **you**
# This is the intended output string
Hello [font=Nunito-Black.ttf]there[/font], how are [font=Nunito-Black.ttf]you[/font]
So each ** is replaced by a different string that has an opening and a closing part, and this needs to work as many times as needed for any string (twice in the example).
I have tried to use a variable to record if the ** in need of replacing is an opening or a closing part, but haven't managed to get a function to work yet, hence it being incomplete
I think replacing the correct ** is hard because I have been trying to use index which will only return the position of the 1st occurrence in the string
My attempt as of now
def formatting_text(input_text):
    if input_text:
        if '**' in input_text:
            d = '**'
            for line in input_text:
                s = [e+d for e in line.split(d) if e]
                count = 0
                for y in s:
                    if y == '**' and count == 0:
                        s.index(y)
                        # replace with required part
            return output_text
        return input_text
I have tried to find this answer, so I'm sorry if it has already been asked, but I have had no luck finding it and don't know what to search for.
Of course thank you for any help

A general solution for your case,
Using re
import re
def formatting_text(input_text, special_char, left_repl, right_repl):
    # Define the re pattern; re.escape makes characters such as * match literally.
    marker = re.escape(special_char)
    RE_PATTERN = marker + r"\w+" + marker
    for word in re.findall(RE_PATTERN, input_text):
        # Re-assign with replacement with the parts.
        new_word = left_repl+word.strip(special_char)+right_repl
        input_text = input_text.replace(word, new_word)
    return input_text
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
Without using re
def formatting_text(input_text, special_char, left_repl, right_repl):
    while True:
        # Replace the left part.
        input_text = input_text.replace(special_char, left_repl, 1)
        # Replace the right part.
        input_text = input_text.replace(special_char, right_repl, 1)
        if input_text.find(special_char) == -1:
            # Nothing found, time to stop.
            break
    return input_text
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
The above solutions should also work for other values of special_char such as __, * or <. But if you only want to make the text bold, you may prefer Kivy's own markup tags for labels, i.e. [b] to open and [/b] to close.

The formatting Stack Overflow uses is Markdown, implemented in JavaScript. If you only want this single case formatted, you can look at how an existing implementation does it: it uses a regex to find the matches and then just iterates through them.
STRONG_RE = r'(\*{2})(.+?)\1'
I would recommend against re-implementing an entire markdown solution yourself when you can just import one.
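If only this single case is needed, the STRONG_RE pattern above can be combined with re.sub, roughly as in the sketch below. The function name bold_to_font and its font parameter are assumptions made for illustration; the pattern is the one quoted above.

```python
import re

# (\*{2}) matches the opening **, (.+?) lazily captures the text,
# and \1 requires the same marker to close it.
STRONG_RE = re.compile(r"(\*{2})(.+?)\1")

def bold_to_font(text, font="Nunito-Black.ttf"):
    """Replace **text** with [font=...]text[/font] for every occurrence."""
    # A function replacement avoids backslash handling in a template string.
    return STRONG_RE.sub(
        lambda m: "[font={}]{}[/font]".format(font, m.group(2)), text
    )

print(bold_to_font("Hello **there**, how are **you**"))
```

Because the replacement is computed per match, this handles any number of marked spans in one pass and leaves an unpaired ** untouched.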

Remove members of a list in another list

I'm writing a program that checks if a word or sentence given by user input is a palindrome or not. This is the program so far:
def reverse(text):
    a = text[::-1]
    if a == text:
        print "Yes, it's a palindrome."
    else:
        print "No, it's not a palindrome."
string = str(raw_input("Enter word here:")).lower()
reverse(string)
However, this code doesn't work for sentences. So I tried to do it like this:
import string
def reverse(text):
    a = text[::-1]
    if a == text:
        print "Yes, it's a palindrome."
    else:
        print "No, it's not a palindrome."
notstring = str(raw_input("Enter word here:")).lower()
liststring = list(notstring)
forbiddencharacters = string.punctuation + string.whitespace
listcharacters = list(forbiddencharacters)
newlist = liststring - listcharacters
finalstring = "".join(newlist)
reverse(finalstring)
My goal is to put the punctuation and whitespace into a list and then subtract those characters from the user's input, so that the program can tell whether it's a palindrome even if the string has punctuation and/or whitespace. However, I don't know how to subtract the elements of one list from the elements of another. The way I did it, creating another list that equals the user input minus the characters, doesn't work (I tried it in my Xubuntu terminal emulator). Apart from that, when I run the program this error appears:
Traceback (most recent call last):
  File "reverse.py", line 12, in <module>
    forbiddencharacters = string.punctuation + string.whitespace
AttributeError: 'str' object has no attribute 'punctuation'
Ok so I have changed the variable name and I don't get that mistake above. Now I still don't know how to subtract the elements of the lists.
Since I'm a beginner programmer this might seem stupid to you. If that's the case, I'm sorry in advance. If anyone can solve one or both of the two problems I have, I'd be extremely grateful. Thanks in advance for your help. Sorry for bad english and long post :)

You should add some filtering along the way since palindromes have various syntax tricks (spaces, commas, etc.).
palindrome = "Rail at a liar"
def is_palindrome(text):
    text = text.lower() #Avoid case issues
    text = ''.join(ch for ch in text if ch.isalnum()) #Strips down everything but alphanumeric characters
    return text == text[::-1]
if is_palindrome(palindrome):
    print "Yes, it's a palindrome."
else:
    print "No, it's not a palindrome."

You are on the right track, but you have used the identifier string for two different purposes.
Since you assigned to this variable name with the line:
string = str(raw_input("Enter word here:")).lower()
You can now no longer access the attributes string.punctuation and string.whitespace from the import string, because the name string is no longer bound to the module but to the user input instead.

A somewhat different approach to testing if a string is a palindrome
def palindrome(s):
    s = s.lower()
    ln=len(s)
    for n in xrange(ln/2):
        if s[n] != s[(ln-n)-1]:
            return False
    return True
print palindrome('Able was I ere I saw Elba')
FYI -- you'll need to tweak this to strip punctuation and whitespace if you like (left as an exercise for the OP).

You can do that by splitting the phrase and storing it in a list. I am going to use your function (but there are more Pythonic ways to do that).
def reverse(textList1):
    textList2 = textList1[::-1] #or we can use list(reversed(textList1))
    if textList2 == textList1:
        print "Yes, it's a palindrome."
    else:
        print "No, it's not a palindrome."
test1= "I am am I"
You should split the phrase and store it in a list:
test1List= test1.split(' ')
reverse(test1List)

Checking for a palindrome is simple.
This works for both words and sentences.
import string
def ispalindrome(input_str):
    input_str = list(input_str)
    forbidden = list(string.punctuation + string.whitespace)
    for forbidden_char in forbidden: # Remove all forbidden characters
        while forbidden_char in input_str:
            input_str.remove(forbidden_char)
    return input_str == list(reversed(input_str)) # Checks if it is a palindrome
input_str = raw_input().lower() # Get input; lowercase it to avoid case issues
print ispalindrome(input_str)
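On the original question of how to "subtract" one list from another: Python lists do not support the - operator, but a comprehension that keeps only the characters not in the forbidden set has the same effect. The following is a minimal Python 3 sketch; the names strip_forbidden and is_palindrome are illustrative and not taken from the answers above.

```python
import string

def strip_forbidden(text):
    """Return `text` without punctuation or whitespace characters."""
    # A set gives fast membership tests; the comprehension does the "subtraction".
    forbidden = set(string.punctuation + string.whitespace)
    return "".join(ch for ch in text if ch not in forbidden)

def is_palindrome(text):
    """Return True if `text` reads the same both ways, ignoring case and punctuation."""
    cleaned = strip_forbidden(text.lower())
    return cleaned == cleaned[::-1]

print(is_palindrome("A man, a plan, a canal: Panama!"))
print(is_palindrome("Hello, world"))
```

Keeping the input in a variable that is not called string also avoids the name clash with the string module described earlier.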

How can I replace substrings without replacing all at the same time? Python

I have written a really good program that uses text files as word banks for generating sentences from sentence skeletons. An example:
The skeleton
"The noun is good at verbing nouns"
can be made into a sentence by searching a word bank of nouns and verbs to replace "noun" and "verb" in the skeleton. I would like to get a result like
"The dog is good at fetching sticks"
Unfortunately, the handy replace() method was designed with speed, not custom functions, in mind. I made methods that accomplish the task of selecting random words from the right banks, but doing something like skeleton = skeleton.replace('noun', getNoun(file.txt)) replaces ALL instances of 'noun' with the result of a single call to getNoun(), instead of calling it for each replacement. So the sentences look like
"The dog is good at fetching dogs"
How might I work around this feature of replace() and make my method get called for each replacement? My minimum length code is below.
import random
def getRandomLine(rsv):
    #parameter must be a return-separated value text file whose first line contains the number of lines in the file.
    f = open(rsv, 'r') #file handle on read mode
    n = int(f.readline()) #number of lines in file
    n = random.randint(1, n) #line number chosen to use
    s = "" #string to hold data
    for x in range (1, n):
        s = f.readline()
    s = s.replace("\n", "")
    return s
def makeSentence(rsv):
    #parameter must be a return-separated value text file whose first line contains the number of lines in the file.
    pattern = getRandomLine(rsv) #get a random pattern from file
    #replace word tags with random words from matching files
    pattern = pattern.replace('noun', getRandomLine('noun.txt'))
    pattern = pattern.replace('verb', getRandomLine('verb.txt'))
    return str(pattern);
def main():
    result = makeSentence('pattern.txt');
    print(result)
main()

The re module's re.sub function does the job str.replace does, but with far more abilities. In particular, it offers the ability to pass a function for the replacement, rather than a string. The function is called once for each match with a match object as an argument and must return the string that will replace the match:
import re
pattern = re.sub('noun', lambda match: getRandomLine('noun.txt'), pattern)
The benefit here is added flexibility. The downside is that if you don't know regexes, the fact that the replacement interprets 'noun' as a regex may cause surprises. For example,
>>> re.sub('Aw, man...', 'Match found.', 'Aw, manatee.')
'Match found.e.'
If you don't know regexes, you may want to use re.escape to create a regex that will match the raw text you're searching for even if the text contains regex metacharacters:
>>> re.sub(re.escape('Aw, man...'), 'Match found.', 'Aw, manatee.')
'Aw, manatee.'
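To see the per-match call in action without any files, here is a small sketch. The in-memory WORD_BANKS dictionary is a hypothetical stand-in for noun.txt and verb.txt, and fill_skeleton and its rng parameter are names invented for this example.

```python
import random
import re

# Hypothetical word banks; the verbs are stems so that "verbing" becomes e.g. "fetching".
WORD_BANKS = {
    "noun": ["dog", "stick", "ball"],
    "verb": ["fetch", "find", "paint"],
}

# One pattern that matches any tag; re.escape keeps the tag text literal.
TAG_RE = re.compile("|".join(re.escape(tag) for tag in WORD_BANKS))

def fill_skeleton(skeleton, rng=random):
    """Replace every tag with a word chosen separately for each occurrence."""
    # re.sub calls the lambda once per match, so each tag gets its own draw.
    return TAG_RE.sub(lambda m: rng.choice(WORD_BANKS[m.group(0)]), skeleton)

print(fill_skeleton("The noun is good at verbing nouns"))
```

Since the choice is made inside the replacement function, the two noun slots are filled independently and can differ, unlike with a single str.replace call.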

I don't know if you are asking to edit your code or to write new code, so I wrote new code:
import random
verbs = open('verb.txt').read().split()
nouns = open('noun.txt').read().split()
def makeSentence(sent):
    sent = sent.split()
    for k in range(0, len(sent)):
        if sent[k] == 'noun':
            sent[k] = random.choice(nouns)
        elif sent[k] == 'nouns':
            sent[k] = random.choice(nouns)+'s'
        elif sent[k] == 'verbing':
            sent[k] = random.choice(verbs)
    return ' '.join(sent)
var = raw_input('Enter: ')
print makeSentence(var)
This runs as:
$ python make.py
Enter: the noun is good at verbing nouns
the mouse is good at eating cats

Parse 4th capital letter of line in Python?

How can I parse lines of text from the 4th occurrence of a capital letter onward? For example given the lines:
adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ
I would like to capture:
`ZsdalkjgalsdkjTlaksdjfgasdkgj`
`PlsdakjfsldgjQ`
I'm sure there is probably a better way than regular expressions, but I attempted a non-greedy match, something like this:
match = re.search(r'[A-Z].*?$', line).group()

I present two approaches.
Approach 1: all-out regex
In [1]: import re
In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'
The .*?[A-Z] consumes characters up to, and including, the first uppercase letter.
The (?:...){3} repeats the above three times without creating any capture groups.
The following .*? matches the remaining characters before the fourth uppercase letter.
Finally, the ([A-Z].*) captures the fourth uppercase letter and everything that follows into a capture group.
Approach 2: simpler regex
In [1]: import re
In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'
This attacks the problem directly, and I think is easier to read.

Not using regular expressions may seem too verbose, although at the bytecode level it is a very simple algorithm running, and therefore lightweight.
It may be that regexps are faster, since they are implemented in native code, but the "one obvious way to do it", though boring, certainly beats any suitable regexp in readability hands down:
def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()
        if count == n:
            return string[index:]
    return ""

Found this simpler to deal with by using a regular expression to split the string, then slicing the resulting list:
import re
text = ["adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
        "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ"]
for t in text:
    print "".join(re.split("([A-Z])", t, maxsplit=4)[7:])
Conveniently, this gives you an empty string if there aren't enough capital letters.

A nice, one-line solution could be:
>>> import re
>>> s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2 = 'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ'
>>> s1[list(re.finditer('[A-Z]', s1))[3].start():]
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2[list(re.finditer('[A-Z]', s2))[3].start():]
'PlsdakjfsldgjQ'
Why does this work (in just one line)?
Searches for all capital letters in the string: re.finditer('[A-Z]', s1)
Gets the 4th capital letter found: [3]
Returns the position from the 4th capital letter: .start()
Using slicing notation, we get the part we need from the string s1[position:]
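The one-liner raises IndexError when a line has fewer than four capital letters. A guarded variant of the same finditer idea might look like the sketch below; the function name from_nth_capital and the use of itertools.islice are choices made for this example, not part of the answer above.

```python
import re
from itertools import islice

def from_nth_capital(line, n=4):
    """Return `line` from its n-th capital letter onward, or "" if there is none."""
    # islice skips the first n-1 matches lazily instead of building a full list;
    # next(..., None) yields None when there are fewer than n capitals.
    match = next(islice(re.finditer(r"[A-Z]", line), n - 1, None), None)
    return line[match.start():] if match else ""

for s in ("adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
          "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ",
          "abcdef"):
    print(repr(from_nth_capital(s)))
```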

I believe that this will work for you, and be fairly easy to extend in the future:
import re
check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
print re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check ).group(2)
The first part of the regex ([^A-Z]*[A-Z]){3} is the real key, this finds the first three upper case letters and stores them along with the characters between them in group 1, then we skip any number of non-upper case letters after the third upper case letter, and finally, we capture the rest of the string.

Testing a variety of methods. I originally wrote string_after_Nth_upper and didn't post it, seeing that jsbueno's method was similar; however, because his method does an addition and a count comparison for every character (even lowercase letters), it is slightly slower.
s='adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
import re
def string_after_Nth_upper(your_str, N=4):
    upper_count = 0
    for i, c in enumerate(your_str):
        if c.isupper():
            upper_count += 1
            if upper_count == N:
                return your_str[i:]
    return ""
def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()
        if count == n:
            return string[index:]
    return ""
def regex1(s):
    return re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
def regex2(s):
    return re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', s).group(2)
def regex3(s):
    return s[list(re.finditer('[A-Z]', s))[3].start():]
if __name__ == '__main__':
    from timeit import Timer
    t_simple = Timer("string_after_Nth_upper(s)", "from __main__ import s, string_after_Nth_upper")
    print 'simple:', t_simple.timeit()
    t_jsbueno = Timer("find_capital(s)", "from __main__ import s, find_capital")
    print 'jsbueno:', t_jsbueno.timeit()
    t_regex1 = Timer("regex1(s)", "from __main__ import s, regex1; import re")
    print "Regex1:",t_regex1.timeit()
    t_regex2 = Timer("regex2(s)", "from __main__ import s, regex2; import re")
    print "Regex2:", t_regex2.timeit()
    t_regex3 = Timer("regex3(s)", "from __main__ import s, regex3; import re")
    print "Regex3:", t_regex3.timeit()
Results:
Simple: 4.80558681488
jsbueno: 5.92122507095
Regex1: 3.21153497696
Regex2: 2.80767202377
Regex3: 6.64155721664
So regex2 wins for time.

import re
strings = [
    'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
    'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]
for s in strings:
    m = re.match('[a-z]*[A-Z][a-z]*[A-Z][a-z]*[A-Z][a-z]*([A-Z].+)', s)
    if m:
        print m.group(1)

It's not the prettiest approach, but:
re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', line).group(2)

caps = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
temp = ''
for char in inputStr:
    if char in caps:
        temp += char
        if len(temp) == 4:
            print temp[-1] # this is the answer that you are looking for
            break
Alternatively, you could use re.sub to get rid of anything that's not a capital letter and get the 4th character of what's left.

Another version... not that pretty, but gets the job done.
def stringafter4thupper(s):
    i,r = 0,''
    for c in s:
        if c.isupper() and i < 4:
            i+=1
        if i==4:
            r+=c
    return r
Examples:
stringafter4thupper('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj')
stringafter4thupper('oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ')
stringafter4thupper('')
stringafter4thupper('abcdef')
stringafter4thupper('ABCDEFGH')
Respectively results:
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
'PlsdakjfsldgjQ'
''
''
'DEFGH'

Parsing almost always involves regular expressions. However, a regex by itself does not make a parser. In the most simple sense, a parser consists of:
text input stream -> tokenizer
Usually it has an additional step:
text input stream -> tokenizer -> parser
The tokenizer handles opening the input stream and collecting text in a proper manner, so that the programmer doesn't have to think about it. It consumes text elements until there is only one match available to it. Then it runs the code associated with this "token". If you don't have a tokenizer, you have to roll it yourself (in pseudocode):
while stuffInStream:
    currChars += getNextCharFromString()
    if regex('firstCase'):
        do stuff
    elif regex('other stuff'):
        do more stuff
This loop code is full of gotchas, unless you build them all the time. It is also easy to have a computer produce it from a set of rules. That's how Lex/flex works. You can have the rules associated with a token pass the token to yacc/bison as your parser, which adds structure.
Notice that the lexer is just a state machine. It can do anything when it migrates from state to state. I've written lexers that would strip characters from the input stream, open files, print text, send email and so on.
So, if all you want is to collect the text after the fourth capital letter, a regex is not only appropriate, it is the correct solution. BUT if you want to do parsing of textual input, with different rules for what to do and an unknown amount of input, then you need a lexer/parser. I suggest PLY since you are using python.
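As a rough illustration of the tokenizer step described above, a hand-rolled regex tokenizer in Python might look like the following sketch. It does not use PLY; the token names, the tiny grammar in TOKEN_SPEC and the tokenize function are all assumptions made for this example.

```python
import re

# Hypothetical token rules, tried in order; MISMATCH catches anything else.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
# Combine the rules into one pattern with a named group per token kind.
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs for each token in `text`."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup  # name of the group that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError("unexpected character %r" % m.group())
        yield kind, m.group()

print(list(tokenize("x = 40 + 2")))
```

A parser would then consume these (kind, value) pairs and add structure, which is the role yacc/bison or PLY's parser module plays in the pipeline above.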
