Search for words (exact matches) in multiple texts using Python

I want to let the user choose and open multiple texts and perform a search for exact matches in the texts.
I want the encoding to be Unicode.
If I search for "cat" I want it to find "cat", "cat,", ".cat" but not "catalogue".
I don't know how to let the user search for two words ("cat" OR "dog") in all of the texts at the same time.
Maybe I can use RE?
So far I have only made it possible for the user to enter the path to the directory containing the text files to search. Now I want to let the user (via raw_input) search for two words in all of the texts, then print the results (e.g. "search_word_1" and "search_word_2" found in document1.txt, "search_word_2" found in document4.txt) and save them in a separate document (search_words).
import re, os

path = raw_input("insert path to directory :")
ex_library = os.listdir(path)
search_words = open("sword.txt", "w")  # file (or maybe a list) to put the results in
thelist = []
for texts in ex_library:
    f = os.path.join(path, texts)
    text = open(f, "r")
    textname = os.path.basename(texts)
    print textname
    for line in text:  # iterate over lines; text.read() would loop over single characters
        pass           # search logic should go here
    text.close()       # close the file after the loop, not inside it

Regular expressions are the appropriate tool in this case.
I want it to find "cat", "cat,", ".cat" but not "catalogue".
Pattern: r'\bcat\b'
\b matches at a word boundary.
how to let the user search for two words ("cat" OR "dog") in all of the texts at the same time
Pattern: r'\bcat\b|\bdog\b'
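A quick check of the combined pattern (a minimal sketch):
import re
print(re.findall(r'\bcat\b|\bdog\b', "The cat, a .cat, my dog, the catalogue"))
# -> ['cat', 'cat', 'dog']; 'catalogue' is not matched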
To print "filename: <words that are found in it>":
#!/usr/bin/env python
import os
import re
import sys
def fgrep(words, filenames, encoding='utf-8', case_insensitive=False):
    findwords = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                           flags=re.I if case_insensitive else 0).findall
    for name in filenames:
        with open(name, 'rb') as file:
            text = file.read().decode(encoding)
        found_words = set(findwords(text))
        yield name, found_words

def main():
    words = [w.decode(sys.stdin.encoding) for w in sys.argv[1].split(",")]
    filenames = sys.argv[2:]  # the rest is filenames
    for filename, found_words in fgrep(words, filenames):
        print "%s: %s" % (os.path.basename(filename), ",".join(found_words))

main()
Example:
$ python findwords.py 'cat,dog' /path/to/*.txt
Alternative solutions
To avoid reading the whole file in memory:
import codecs
...
with codecs.open(name, encoding=encoding) as file:
    found_words = set(w for line in file for w in findwords(line))
You could also print the found words in the context in which they occur, e.g., print lines with the words highlighted:
from colorama import init  # pip install colorama
init(strip=not sys.stdout.isatty())  # strip colors if stdout is redirected
from termcolor import colored  # pip install termcolor

highlight = lambda s: colored(s, on_color='on_red', attrs=['bold', 'reverse'])
...
regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                   flags=re.I if case_insensitive else 0)
for line in file:
    if regex.search(line):  # line contains words
        line = regex.sub(lambda m: highlight(m.group()), line)
        yield line

You need to split the text in each file on whitespace and punctuation. Once that's done, you can simply look for the words you are searching for in the resulting list. You also need to convert everything to lowercase, unless you want a case-sensitive search.
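A minimal sketch of that approach (assuming the file's text is already read into a string):

import re

def find_words(text, search_words):
    # split on any run of non-word characters and lowercase everything
    tokens = set(re.split(r'\W+', text.lower()))
    return tokens.intersection(w.lower() for w in search_words)

print(find_words("The cat, the dog. A catalogue.", ["cat", "dog"]))
# -> {'cat', 'dog'}; 'catalogue' stays out because it is a separate token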

Some (maybe useful) information in addition to the existing answers:
You should be aware that what the user means by a "character" (= grapheme) is not always the same as a Unicode character, and some graphemes can be represented by Unicode characters in more than one way (e.g. a composite character vs. a base character plus a combining mark).
To search based on graphemes (= what the user expects in most cases) rather than on specific Unicode character sequences, you need to normalize your strings before you search.
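For example, a small illustration with unicodedata (assuming Python 3 strings):

import unicodedata

composed = "caf\u00e9"     # "é" as one precomposed code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent
print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True: same grapheme after normalization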

Related

python re.match doesn't work for a multi-line text file

I would like to write a Python program that searches and saves the vocabulary definition of an English word.
I have converted a Babylon English-Italian dictionary into a text file.
I would like re.match() to match the first word of each line, but it doesn't.
I always get 'not found', namely None, for any query (the word copied in my code) I use.
The Babylon text file can be found here:
https://tempfile.io/en/ba2voaBnDJsn24P/file
Thanks
import clipboard
import json
import re
wordcopied = clipboard.paste()
print(wordcopied)
dictionary = {}
with open("/home/user/Babylon-EI.txt", "r") as source:
lsource = source.readlines()
for line in lsource:
#print(type(line))
matc = re.match(wordcopied, line, re.MULTILINE | re.DOTALL)
if matc != None:
print(line)
dictionary = {wordcopied:line}
else:
print('not found')
break
I've also tried re.search and multiple flags.
Related questions are all answered regarding flags and blanks.

How to take user input of a file name and only count letters as words instead of punctuation?

I have written this code for a word frequency calculator. It works, but it counts the word 'not,' differently from 'not'. I am also attempting to make the program ask for user input of the filename and return 'wrong file' if the user inputs the wrong file. I am unsure how to code the user input and how to make sure the program counts only letters (not punctuation).
file = open('document (1).txt')
empty_dictionary = dict()
for sentence in file:
    sentence = sentence.strip()
    sentence = sentence.lower()
    words = sentence.split(" ")
    for word in words:
        if word in empty_dictionary:
            empty_dictionary[word] = empty_dictionary[word] + 1
        else:
            empty_dictionary[word] = 1
for key in list(empty_dictionary.keys()):
    print(key, "- ", empty_dictionary[key])
If there are a few specific characters you want to remove, you can do it this way:
word = word.replace(',', '')
word = word.replace(';', '')
word = word.replace(':', '')
Or a more general solution that removes any character that isn't a letter:
word = ''.join(ch for ch in word if ch.isalpha())
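For instance, applied to the token 'not,' from the question:
word = 'not,'
word = ''.join(ch for ch in word if ch.isalpha())
print(word)  # -> not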
Regular expressions are actually perfectly suited for this! They have the notion of a word boundary.
The regular expression /\w+\b/ will match any number of "word characters" (\w+) followed by a word boundary (\b). Python provides regex functionality through its re module and you can use it like this:
# import the regex module
import re

# just some example text
text = """Not? Not!
Not, not."""

# pre-compile the regex
pattern = re.compile(r'\w+\b')

empty_dictionary = dict()
for sentence in text.split('\n'):
    sentence = sentence.strip()
    sentence = sentence.lower()
    # `findall`, as you might imagine, finds all matches in a given string
    for word in pattern.findall(sentence):
        if word in empty_dictionary:
            empty_dictionary[word] = empty_dictionary[word] + 1
        else:
            empty_dictionary[word] = 1

for key in list(empty_dictionary.keys()):
    print(key, "- ", empty_dictionary[key])
Regex will help you out, and context managers are nice because they handle closing the file for you, whether things go naughty (an unexpected error) or nice. A Counter is probably something you will find useful :) (it works as a dict in loops and can be updated). It also has some nice methods like most_common, which might be beneficial for the work you are doing.
import re
from collections import Counter

# with open("document (1).txt") as fp:
#     text = fp.read().lower()

# mock file content
text = """Not? Not!
Not, not."""

print(Counter(re.findall(r'\w+\b', text)))
For file input control you could go for something like
import re
from collections import Counter
from pathlib import Path

while True:
    candidate = Path(input("Yo give me that file name"))
    if candidate.is_file():
        break
    print(f"{candidate} is not a file on the system")

text = candidate.read_text().lower()

# use .items() instead of .most_common() if order is not important
for k, v in Counter(re.findall(r'\w+\b', text)).most_common():
    print(k, v, sep=" - ")
As some of the others said, regex is probably one of the best solutions, but for your user input issue:
import tkinter as tk
import re
import sys
from tkinter import filedialog
from os.path import exists
from os import execv

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
pattern = re.compile(r'\w+\b')
if exists(file_path):
    with open(file_path, 'r') as reader:
        text = reader.read()  # read() returns one string; readlines() would return a list with no .split
    empty_dictionary = dict()
    for sentence in text.split('\n'):
        sentence = sentence.strip().lower()
        for word in pattern.findall(sentence):
            if word in empty_dictionary:
                empty_dictionary[word] = empty_dictionary[word] + 1
            else:
                empty_dictionary[word] = 1
    for key in list(empty_dictionary.keys()):
        print(key, "- ", empty_dictionary[key])
else:
    from colorama import Fore
    print(Fore.RED, "The File does not exist.\nHit Enter to retry.")
    input()
    execv(sys.argv[0], sys.argv)  # restarts the script to ask again
All of this works like so:
The tkinter import creates a GUI window which is instantly withdrawn (hidden); it only shows the file dialog and puts the chosen path into file_path. Then exists() checks that the file is real and returns True if it is, in which case your code runs. If not, it prints red text like an error and waits for Enter to restart and try again. The compiled regex then finds all the words you want to count.

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words from an RSS feed using a central "replacements" file. The source file should have two columns: old word, new word, like the replace function replace('old', 'new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete words that are unnecessary for me, e.g. replace('to', ''). But I also want to be able to change special names, e.g. replace("D'Amelio", 'DAmelio'). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I can't find any help by Googling, but it could well be that I don't know the right terms or cannot formulate the question.
with open('<filepath>', 'r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
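If the file instead holds old,new pairs (as described in the question), a variation on the same idea might look like the following sketch, assuming one comma-separated pair per line in a hypothetical replacements.txt:

def load_replacements(path):
    # each line: old_word,new_word (leave new_word empty to delete the word)
    pairs = []
    with open(path) as f:
        for line in f:
            if line.strip():
                old, _, new = line.partition(',')
                pairs.append((old.strip(), new.strip()))
    return pairs

def apply_replacements(text, pairs):
    for old, new in pairs:
        text = text.replace(old, new)
    return text

entries = apply_replacements(entries, load_replacements('replacements.txt'))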
I was unable to understand your question completely, but as far as I understand, you have strings like cat, dog, etc. and a file containing the data with which you want to replace them. If that is your requirement, I have given a solution below; try running it to see if it satisfies your requirement.
If that's not what you meant, please comment below.
TXT file (don't use '' around the strings in the text file):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ")  # string you want to replace
with open('textfile.txt', "r") as file1:  # open your file
    lines = file1.readlines()
for line in lines:  # take the lines of the file one by one
    string1 = f'{line}'
    string1 = string1.split()  # split the line of the file into a list like ['cat,', 'kitten']
    if your_string == string1[0][:-1]:  # compare your string with the first word, dropping its trailing comma
        your_string = your_string.replace(your_string, string1[1])  # if the input was e.g. "cat", it is replaced with "kitten"
        print(your_string)
    else:
        pass

regex extract list of strings from file

I have an input file (input.txt) which contains some data that follows a standard format similar to the following lines:
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"#de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"#en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"#de .
I want to extract a list of the English strings, which lie between the quotes followed by #en, into outputfile-en.txt, and the German strings, which lie between the quotes followed by #de, into outputfile-de.txt.
In this example outputfile-en.txt should contain:
Political inclusion
and outputfile-de.txt should contain:
Politische Inklusion
Radiologische Kampfmittel
Which regex is suitable here?
With such a simple pattern there's no need for regex at all, especially not to re-iterate over the same data to pick up different languages - you can stream parse and write your results on the fly:
with open("input.txt", "r") as f: # open the input file
file_handles = {} # a map of our individual output file handles
for line in f: # read it line by line
rindex = line.rfind("#") # find the last `#` character
language = line[rindex+1:rindex+3] # grab the following two characters as language
if rindex != -1: # char found, consider the line...
lindex = line.rfind("\"", 0, rindex-1) # find the preceding quotation
if lindex != -1: # found, we have a match
if language not in file_handles: # add a file handle for this language:
file_handles[language] = open("outputfile-{}.txt".format(language), "w")
# write the found slice between `lindex` and `rindex` + a new line
file_handles[language].write(line[lindex+1:rindex-1] + "\n")
for handle in file_handles.values(): # lets close our output file handles
handle.close()
It should be significantly faster than regex, and as a bonus it will work with any language, so if you have ...#it lines it will save an outputfile-it.txt as well.
You could do something like this:
import re
str = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"#de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"#en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"#de . """
german = re.compile('"(.*)"#de')
english = re.compile('"(.*)"#en')
print german.findall(str)
print english.findall(str)
This would give you
['Politische Inklusion', 'Radiologische Kampfmittel']
and
['Political inclusion'].
Now you only have to iterate over those results and write them to the appropriate file.
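For completeness, that last step might look like this (a minimal sketch reusing the german and english patterns from above):

with open('outputfile-de.txt', 'w') as f:
    f.write('\n'.join(german.findall(str)) + '\n')
with open('outputfile-en.txt', 'w') as f:
    f.write('\n'.join(english.findall(str)) + '\n')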

Replace section of text with only knowing the beginning and last word using Python

In Python, is it possible to cut out a section of text in a document when you only know the beginning and end words?
For example, using the bill of rights as the sample document, search for "Amendment 3" and remove all the text until you hit "Amendment 4" without actually knowing or caring what text exists between the two end points.
The reason I'm asking is that I would like to use this Python script to modify my other Python programs when I upload them to the client's computer, removing sections of code that exist between a comment that says "#chop-begin" and one that says "#chop-end". I do not want the client to have access to all of the functions without paying for the better version of the code.
You can use Python's re module.
I wrote this example script for removing the sections of code in file:
import re

# create the regular expression pattern
chop = re.compile('#chop-begin.*?#chop-end', re.DOTALL)

# open the file
f = open('data', 'r')
data = f.read()
f.close()

# chop text between #chop-begin and #chop-end
data_chopped = chop.sub('', data)

# save the result
f = open('data', 'w')
f.write(data_chopped)
f.close()
With data.txt
do_something_public()
#chop-begin abcd
get_rid_of_me() #chop-end
#chop-beginner this should stay!
#chop-begin
do_something_private()
#chop-end The rest of this comment should go too!
but_you_need_me() #chop-begin
last_to_go()
#chop-end
the following code
import re

class Chopper(object):
    def __init__(self, start='\\s*#ch'+'op-begin\\b', end='#ch'+'op-end\\b.*?$'):
        # the marker strings are built by concatenation so this script cannot chop itself
        super(Chopper, self).__init__()
        self.re = re.compile('{0}.*?{1}'.format(start, end), flags=re.DOTALL+re.MULTILINE)

    def chop(self, s):
        return self.re.sub('', s)

    def chopFile(self, infname, outfname=None):
        if outfname is None:
            outfname = infname
        with open(infname) as inf:
            data = inf.read()
        with open(outfname, 'w') as outf:
            outf.write(self.chop(data))

ch = Chopper()
ch.chopFile('data.txt')
results in data.txt
do_something_public()
#chop-beginner this should stay!
but_you_need_me()
Use regular expressions:
import re
string = re.sub('#chop-begin.*?#chop-end', '', string, flags=re.DOTALL)
The non-greedy .*? matches everything between the two markers without running past the first #chop-end.
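To see why the non-greedy qualifier matters, a quick illustration (a minimal sketch):

import re

s = "#chop-begin a #chop-end keep #chop-begin b #chop-end"
print(re.sub('#chop-begin.*?#chop-end', '', s))  # ' keep '; each block is removed separately
print(re.sub('#chop-begin.*#chop-end', '', s))   # ''; the greedy .* also swallows 'keep'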
