I have a source.txt file consisting of words, one word per line.
apple
tree
bee
go
apple
see
I also have a target_words.txt file, also with one word per line.
apple
bee
house
garden
eat
Now I have to search for each of the target words in the source file. If a target word is found, e.g. apple, a dictionary entry should be made for the target word and each of the 3 preceding and 3 following words. In the example case, that would be
words_dict = {'apple':'tree', 'apple':'bee', 'apple':'go'}
How can I tell Python, when creating and populating the dictionary, to consider the 3 words before and after each match in the source file?
My idea was to use lists, but ideally the code should be very efficient and fast, as the files consist of several million words. I guess that with lists the computation would be very slow.
from collections import defaultdict
words_occ = {}
defaultdict = defaultdict(words_occ)
with open('source.txt') as s_file, open('target_words.txt') as t_file:
    for line in t_file:
        keys = [line.split()]
    lines = s_file.readlines()
    for line in lines:
        s_words = line.strip()
        # if key is found in s_words
        # look at the 1st, 2nd, 3rd word before and after
        # create a key, value entry for each of them
Later, I have to count the occurrences of each key-value pair and add the numbers to a separate dictionary; that is why I started with a defaultdict.
I would be glad about any suggestion for the above code.
The first issue you will face is a misunderstanding of dicts. Each key can occur only once, so if you ask the interpreter to evaluate the literal you wrote, you might get a surprise:
>>> {'apple':'tree', 'apple':'bee', 'apple':'go'}
{'apple': 'go'}
The problem is that there can only be one value associated with the key 'apple'.
You appear to be searching for a suitable data structure, but Stack Overflow is for improving or fixing problematic code.
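For illustration only (a sketch, not the asker's final code): one common way to keep several values per key is to map each target word to a list, for example with collections.defaultdict(list), and to slide an index window over the source words. The sketch below reuses the filenames from the question and assumes the whole source file fits in memory.

from collections import defaultdict

with open('target_words.txt') as t_file:
    targets = set(t_file.read().split())

with open('source.txt') as s_file:
    source_words = s_file.read().split()

context = defaultdict(list)                   # target word -> list of nearby words
for i, word in enumerate(source_words):
    if word in targets:
        start = max(0, i - 3)                 # up to 3 words before
        stop = min(len(source_words), i + 4)  # up to 3 words after
        context[word].extend(source_words[start:i] + source_words[i + 1:stop])

# With the sample files, context['apple'] collects the neighbours of both
# occurrences of 'apple': ['tree', 'bee', 'go', 'tree', 'bee', 'go', 'see']

The resulting (target, neighbour) pairs could then be fed into a collections.Counter to get the occurrence counts mentioned at the end of the question.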
Related
Trying to speed up spelling check on a large dataset with 147k rows. The following function has been running for an entire afternoon and is still running. Is there a way to speed up the spelling check? The messages have already been lowercased, had punctuation removed, and been lemmatized, and they are all strings.
import autocorrect
from autocorrect import Speller
spell = Speller()
def spell_check(x):
    correct_word = []
    mispelled_word = x.split()
    for word in mispelled_word:
        correct_word.append(spell(word))
    return ' '.join(correct_word)
df['clean'] = df['old'].apply(spell_check)
The autocorrect library is not very efficient, and is not made for tasks such as the one you present. What it does is generate all the possible candidates with one or two typos and check which of them are valid words, and it does this in plain Python.
Take a six-letter word like "source":
from autocorrect.typos import Word
print(sum(1 for c in Word('source').typos()))
# => 349
print(sum(1 for c in Word('source').double_typos()))
# => 131305
autocorrect generates as many as 131654 candidates to test, just for this word. What if it is longer? Let's try "transcompilation":
print(sum(1 for c in Word('transcompilation').typos()))
# => 889
print(sum(1 for c in Word('transcompilation').double_typos()))
# => 813325
That's 814214 candidates, just for one word! And note that numpy can't speed it up, as the values are Python strings and you're invoking a Python function on every row. The only way to speed this up is to change the method you are using for spell-checking: for example, using the aspell-python-py3 library instead (a wrapper for aspell, AFAIK the best free spellchecker for Unix).
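If you do switch to aspell, a rough sketch of the replacement function could look like the code below. The constructor and method names (aspell.Speller('lang', 'en'), .check(), .suggest()) are my recollection of the aspell-python-py3 interface, so treat them as an assumption and verify against the library's README.

# Sketch only: assumes aspell-python-py3 exposes Speller('lang', 'en'),
# .check(word) and .suggest(word); verify against the package documentation.
import aspell

speller = aspell.Speller('lang', 'en')

def spell_check(text):
    corrected = []
    for word in text.split():
        if speller.check(word):
            corrected.append(word)                 # already a valid word
        else:
            suggestions = speller.suggest(word)
            corrected.append(suggestions[0] if suggestions else word)
    return ' '.join(corrected)

df['clean'] = df['old'].apply(spell_check)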
In addition to what @Amadan said, which is definitely true (autocorrect does the correction in a very inefficient way):
You treat each word in the giant dataset as if it is being looked up for the first time, because you call spell() on every occurrence. In reality (at least after a while) almost every word has already been looked up, so storing those results and reusing them would be much more efficient.
Here is one way to do it:
import autocorrect
from autocorrect import Speller
spell = Speller()
# get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
unique_words = {word for words in df["old"].apply(str.split) for word in words}
# get the corrected version of each unique word and put this mapping in a dictionary
corrected_words = {word: spell(word) for word in unique_words}
# write the cleaned row by looking up the corrected version of each unique word
df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]
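An equivalent alternative, if you prefer not to build the set of unique words yourself, is to memoize the speller with functools.lru_cache from the standard library, so each distinct word is only corrected once (same df and column names as above):

from functools import lru_cache

# The Speller instance is callable, so it can be wrapped directly.
cached_spell = lru_cache(maxsize=None)(spell)

df['clean'] = df['old'].apply(
    lambda text: " ".join(cached_spell(word) for word in text.split())
)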
I am trying to clean up text from an actual English dictionary as my source. I have already written a Python program which loads the data from a .txt file into a SQL DB, with columns id, word, and definition. In the next step, though, I am trying to determine what 'type' each word is by extracting from its definition strings like n. for noun, adj. for adjective, adv. for adverb, and so on.
Now, using the following regex, I am trying to extract all words that end with a '.', like adv./abbr./n./adj. etc., and get a histogram of all such words to see what the different types can be. My assumption is that these type markers will be much more frequent than ordinary words that happen to end with a '.', but even so I plan to check the top results manually to confirm. Here's my code:
for row in cur:
    # 'split' holds the definition text for the current row (set up elsewhere in the program)
    temp_var = re.findall('\w+[.]+ ', split)
    if len(temp_var) >= 1:
        temp_var = temp_var.pop()
        typ_dict[temp_var] = typ_dict.get(temp_var, 0) + 1

for key in typ_dict:
    if typ_dict[key] > 50:
        print(key, typ_dict[key])
After running this code I am not getting the desired result: the counts are much lower than what is actually in the definitions. I have tested the word 'Abbr.', which this code says occurs 125 times, but if you change the regex '\w+[.]+ ' to 'Abbr. ' the result shoots up to 186. I am not sure why my regex is not capturing all the occurrences.
Any idea as to why I am not getting all the matches?
Edit:
Here is the type of text I am working with
Aback - adv. take aback surprise, disconcert. [old english: related to *a2]
Abm - abbr. Anti-ballistic missile
Abnormal - adj. Deviating from the norm; exceptional. abnormality n. (pl. -ies). Abnormally adv. [french: related to *anomalous]
This is broken into two parts, the word and the rest as the definition, and loaded into a SQL table.
If you are using a dictionary to count items, then the best variant of a dictionary to use is Counter from the collections package. But you have another problem with your code. You check temp_var for length >= 1, but then you only do one pop operation. What happens when findall returns multiple items? You also do temp_var = temp_var.pop(), which would prevent you from popping more items even if you wanted to. So the result is that you only count the last match in each row.
import re
from collections import Counter

counters = Counter()
for row in cur:
    temp_var = re.findall(r'\w+[.]+ ', split)
    for x in temp_var:
        counters[x] += 1

for key in counters:
    if counters[key] > 50:
        print(key, counters[key])
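As a small follow-up, Counter can also hand you the most frequent tokens directly, which fits the manual check of the top results you mention:

# Print the 20 most frequent '.'-terminated tokens for manual inspection.
for token, count in counters.most_common(20):
    print(token, count)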
I'm doing a NLP project with my university, collecting data on words in Icelandic that exist both spelled with an i and with a y (they sound the same in Icelandic fyi) where the variants are both actual words but do not mean the same thing. Examples of this would include leyti (an approximation in time) and leiti (a grassy hill), or kirkja (church) and kyrkja (choke). I have a dataset of 2 million words. I have already collected two wordlists, one of which includes words spelled with a y and one includes the same words spelled with a i (although they don't seem to match up completely, as the y-list is a bit longer, but that's a separate issue). My problem is that I want to end up with pairs of words like leyti - leiti, kyrkja - kirkja, etc. But, as y is much later in the alphabet than i, it's no good just sorting the lists and pairing them up that way. I also tried zipping the lists while checking the first few letters to see if I can find a match but that leaves out all words that have y or i as the first letter. Do you have a suggestion on how I might implement this?
Try something like this:
s = "trydfydfgfay"
l = list(s)
candidateWords = []
for idx, c in enumerate(l):
    if c == 'y':
        newList = l.copy()
        newList[idx] = "i"
        candidateWord = "".join(newList)
        candidateWords.append(candidateWord)
print(candidateWords)
#['tridfydfgfay', 'trydfidfgfay', 'trydfydfgfai']
#look up these words to see if they are real words
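If a word can contain more than one y, you may also want the variants where several of them are swapped at once, not just one at a time. A small extension of the idea above (my addition, using itertools.product) generates every y/i combination:

from itertools import product

def i_candidates(word):
    # every variant of `word` with each 'y' either kept or turned into 'i'
    positions = [i for i, c in enumerate(word) if c == 'y']
    candidates = set()
    for choices in product('yi', repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choices):
            chars[pos] = ch
        candidates.add(''.join(chars))
    candidates.discard(word)    # drop the unchanged spelling
    return candidates

print(i_candidates("kyrkja"))
# {'kirkja'}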
I do not think this is a programming challenge; it looks more like an NLP challenge in itself. Spelling variations are a hurdle that one often faces during preprocessing.
I would suggest that you use an edit-distance based approach to identify word pairs while allowing for some variation. Specifically for the kind of problem you have described above, I would recommend the Jaro-Winkler distance. This method can give a higher similarity score to word pairs whose only variation is in specific character pairs, say y and i.
All these approaches are implemented in the jellyfish library.
You may also have a look at the fuzzywuzzy package. Hope this helps.
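For a concrete feel, the scoring could look roughly like this with jellyfish; note that recent versions name the function jaro_winkler_similarity (older releases call it jaro_winkler), so adjust to whichever your installed version provides:

import jellyfish

# Score every y-word against every i-word and keep the best-scoring partner.
# (Quadratic in the list sizes; fine for wordlists, slow for millions of pairs.)
y_words = ["leyti", "kyrkja"]
i_words = ["leiti", "kirkja"]

pairs = {}
for y_word in y_words:
    best_match = max(i_words, key=lambda w: jellyfish.jaro_winkler_similarity(y_word, w))
    pairs[y_word] = best_match

print(pairs)
# {'leyti': 'leiti', 'kyrkja': 'kirkja'}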
So this accomplishes my task; it's kind of an easy, not-that-pretty solution I suppose, but it works:
wordlist = open("data.txt", "r", encoding='utf-8')
y_wordlist = open("y_wordlist.txt", "w+", encoding='utf-8')

all_words = []
y_words = []
for word in wordlist:
    word = word.lower().strip()    # strip the trailing newline so comparisons and output stay clean
    all_words.append(word)

for word in all_words:
    if "y" in word:
        y_words.append(word)

word_dict = {}
for word in y_words:
    newwith1y = word.replace("y", "i", 1)
    newwith2y = word.replace("y", "i", 2)
    newyback = word[::-1].replace("y", "i", 1)
    newyback = newyback[::-1]
    # keep all distinct candidates; assigning them to the same key one after the
    # other would overwrite each other and leave only the last one
    word_dict[word] = {newwith1y, newwith2y, newyback}

all_words_set = set(all_words)    # set membership is much faster than searching a list
for key, values in word_dict.items():
    for value in values:
        if value in all_words_set:
            y_wordlist.write(key)
            y_wordlist.write(" - ")
            y_wordlist.write(value)
            y_wordlist.write("\n")
I have a list of reviews and a list of words, and I am trying to count how many times each word shows up in each review. The list of keywords is roughly 30 and could grow/change. The current population of reviews is roughly 5000, with review word counts ranging from 3 to several hundred words. The number of reviews will definitely grow. Right now the keyword list is static and the number of reviews will not grow too much, so any solution to get the counts of keywords in each review will work, but ideally it will be one that doesn't hit a major performance issue if the number of reviews drastically increases or the keywords change and all the reviews have to be reanalyzed.
I have been reading through different methods on Stack Overflow and haven't been able to get any to work. I know you can use scikit-learn to get the count of each word, but I haven't figured out if there is a way to count a phrase. I have also tried various regex expressions. If the keyword list were all single words, I know I could very easily use scikit-learn, a loop, or a regex, but I am having issues when a keyword has multiple words.
Two links I have tried
Python - Check If Word Is In A String
Phrase matching using regex and Python
the solution here is close, but it doesn't count all occurrences of the same word
How to return the count of words from a list of words that appear in a list of lists?
Both the list of keywords and the reviews are being pulled from a MySQL DB. All keywords are lowercase. All text has been lowercased and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but not knowing how to handle counting a phrase, I switched. I am currently attempting it with loops and regex, but I am open to any solution.
# Example of what I am currently attempting with regex
keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']
for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b', review)  # this returns no results, the variable word is not getting picked up
        # --also tried variations of this to no avail
        # --tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern, review)  # this errors with the msg: sre_constants.error: multiple repeat at position 9
#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1
I would first do it by brute force rather than overcomplicating it, and try to optimize later.
keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing',
           'this pharse contains test and blue sky and look another test',
           'the grass is green test']

results = dict()
for i in keywords:
    for j in reviews:
        results[i] = results.get(i, 0) + j.count(i)

print(results)
# => {'test': 6, 'blue sky': 1, 'grass is green': 1}
It's important that we query the dict with .get: in case a key isn't set yet, we don't want to deal with a KeyError exception.
If you want to go the complicated route, you can build your own trie and counter structure to do searches in large text files.
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
None of the options you tried searches for the value of word:
results = re.findall(r'\bword\b', review) checks for the literal word word in the string.
When you try pattern = "r'\\b" + word + "\\b'" you check for the string "r'\b[value of word]\b'".
You can use the first option, but the pattern should be r'\b%s\b' % word. That will search for the value of word.
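Putting that together, and adding re.escape so a keyword containing regex metacharacters cannot break the pattern (my addition, not strictly needed for these keywords), a per-review count looks like this and reproduces the expected results from the question:

import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing',
           'this pharse contains test and blue sky and look another test',
           'the grass is green test']

for number, review in enumerate(reviews, start=1):
    counts = {word: len(re.findall(r'\b%s\b' % re.escape(word), review))
              for word in keywords}
    print('review%d: %s' % (number, counts))

# review1: {'test': 2, 'blue sky': 0, 'grass is green': 0}
# review2: {'test': 2, 'blue sky': 1, 'grass is green': 0}
# review3: {'test': 1, 'blue sky': 0, 'grass is green': 1}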
I have a log file containing search queries entered into my site's search engine. I'd like to "group" related search queries together for a report. I'm using Python for most of my webapp - so the solution can either be Python based or I can load the strings into Postgres if it is easier to do this with SQL.
Example data:
dog food
good dog trainer
cat food
veterinarian
Groups should include:
cat:
cat food
dog:
dog food
good dog trainer
food:
dog food
cat food
etc...
Ideas? Some sort of "indexing algorithm" perhaps?
f = open('data.txt', 'r')
raw = f.readlines()
#generate set of all possible groupings
groups = set()
for lines in raw:
    data = lines.strip().split()
    for items in data:
        groups.add(items)

#parse input into groups
for group in groups:
    print("Group '%s':" % group)
    for line in raw:
        if line.find(group) != -1:
            print(line.strip())
    print()
#consider storing into a dictionary instead of just printing
This could be heavily optimized, but this will print the following result, assuming you place the raw data in an external text file:
Group 'trainer':
good dog trainer
Group 'good':
good dog trainer
Group 'food':
dog food
cat food
Group 'dog':
dog food
good dog trainer
Group 'cat':
cat food
Group 'veterinarian':
veterinarian
Well it seems that you just want to report every query that contains a given word. You can do this easily in plain SQL by using the wildcard matching feature, i.e.
SELECT * FROM QUERIES WHERE `querystring` LIKE '%dog%'.
The only problem with the query above is that it also finds queries with strings like "dogbah"; you need to write a couple of alternatives using OR to cater for the different cases, assuming your words are separated by whitespace.
Not a concrete algorithm, but what you're looking for is basically an index created from words found in your text lines.
So you'll need some sort of parser to recognize words, then you put them in an index structure and link each index entry to the line(s) where it is found. Then, by going over the index entries, you have your "groups".
Your algorithm needs the following parts (if done by yourself):
a parser for the data, breaking it up into lines and breaking the lines up into words.
A data structure to hold key-value pairs (like a hashtable). The key is a word, the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers suffice).
in pseudocode (generation):
create empty set S for key-value pairs
for each line L parsed
    for each word W in line L
        seek W in set S -> Item
        if not found -> add word W -> (empty array) to set S
        add line L reference to array in Item
    endfor
endfor

(lookup (word: W))
seek W in set S into Item
if found return array from Item
else return empty array
Modified version of @swanson's answer (not tested):
from collections import defaultdict
from itertools import chain
# generate set of all possible words
lines = open('data.txt').readlines()
words = set(chain.from_iterable(line.split() for line in lines))
# parse input into groups
groups = defaultdict(list)
for line in lines:
    for word in words:
        if word in line:
            groups[word].append(line)
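To get a printed report like the one in the earlier answer, you can then iterate over the dictionary (the lines still carry their trailing newline, hence the strip()):

for word, matched_lines in groups.items():
    print("Group '%s':" % word)
    for matched in matched_lines:
        print(matched.strip())
    print()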