Python - Efficiently working with large permuted lists

I've written a short Anagram Solver script in Python 2.7.
# Import Permutations Module
from itertools import permutations as perm

# Defined Functions
def check(x, y):
    # Opens Dictionary File
    with open('wordsEn.txt') as dictionary:
        '''Checks the permutations against all of the dictionary words and appends
        any matching ones to one of the empty lists'''
        for line in dictionary:
            for i in x:
                if i + '\n' == line:
                    y.append(i)

# Empty Lists that will be appended to later
anagram_perms = []
possible_words = []

# Anagram User Input
anagram = list(raw_input("Input scrambled word: ").lower())

# Creates single list items from the permutations, deletes duplicates
for char in perm(anagram):
    anagram_perms.append("".join(char))
anagram_perms = list(set(anagram_perms))

# Uses the defined function
check(anagram_perms, possible_words)

# Prints the number of perms created, then prints the possible words beneath it
print len(anagram_perms)
print '\n'.join(possible_words)
It essentially takes a user-inputted anagram, generates all possible combinations of the letters (using itertools.permutations) and places them into a list, deleting any duplicates. It then checks each of these combinations against a 100,000-word dictionary text file, placing any matching words into a list to be printed.
I've run into the issue that if a user inputs a word with more than 6 unique letters, the number of permutations generated causes hanging and crashes. 9-letter anagrams would be the typical input; however, these produce 362,880 (9!) permutations if all the letters are different, which is unfeasible.
I've thought about a couple of potential solutions:
Creating a number of empty lists which can each hold only a certain number of appended permutations. Once these lists are 'full', permutations are added to the next one. Each of these lists is then checked against the text file in turn.
Creating one empty list contained within a loop. Permutations are generated and appended to the list up to a certain workable number; the list is then used to check the text file before emptying itself and taking in the next batch of permutations (a rough sketch of this idea appears after this list).
Some other method whereby a certain number of permutations are generated, then the process is paused while the currently generated ones are checked against the text file, and resumed and repeated.
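Something like this is what I have in mind for the second idea (an untested sketch reusing the check() function from my code above, with itertools.islice pulling the permutations in batches):
from itertools import islice

perm_iter = perm(anagram)
while True:
    # pull the next batch of permutations from the iterator (batch size picked arbitrarily)
    batch = list(set("".join(p) for p in islice(perm_iter, 10000)))
    if not batch:
        break
    # note: duplicates are only removed within a batch, not across batches
    check(batch, possible_words)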
I'm fairly new to Python development, however, and don't really know whether these are possible or how I would go about implementing them in my code; other questions on similar topics haven't really been able to help.
If anyone would like to see my code so far I'd be happy to condense it down and post it, but for the sake of not making this question any longer I'll leave it out unless it's requested. (Updated Above)
Thanks!

I think the best solution may be to not use permutations at all. Most generated permutations will not be real words, so it's a waste to generate them all.
Consider pre-processing the word list into a dictionary that maps each word's letters, sorted, to the list of words made of those letters. Then your anagram solver becomes a simple dictionary lookup after sorting the input.
First, create the dictionary from your word list and save to a file:
from collections import defaultdict
import json

word_list = ['tab', 'bat', 'cat', 'rat', ...]  # 100k words

word_dict = defaultdict(list)
for word in word_list:
    # all anagrams of a word share the same sorted-letters key
    word_dict[''.join(sorted(word))].append(word)

with open('word_dict.json', 'w') as f:  # note: must be opened for writing
    f.write(json.dumps(dict(word_dict)))
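With the sample word_list above, the saved mapping would look something like this (each key is a word's letters in sorted order):
{"abt": ["tab", "bat"], "act": ["cat"], "art": ["rat"], ...}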
Then, when running your anagram code, load the dictionary and use it to look up the sorted input:
import json

with open('word_dict.json', 'r') as f:
    word_dict = json.loads(f.read())

while True:
    anagram = raw_input('Enter in an anagram: ')
    sorted_anagram = ''.join(sorted(anagram))
    print word_dict.get(sorted_anagram, [])

This should help. The itertools.permutations function returns an iterator. This means that the entire list is not stored in memory; rather you can call for the next value and it will compute what is needed on the fly.
from itertools import permutations

with open('./wordlist.txt', 'r') as fp:
    wordlist_str = fp.read()
wordlist = set(wordlist_str.lower().split('\n'))  # use '\r\n' in Windows

def get_anagrams(word):
    out = set()
    for w in permutations(word.lower()):
        candidate = ''.join(w)
        if candidate in wordlist:
            out.add(candidate)
    return out
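Usage would then look something like this (the exact results depend, of course, on what is in wordlist.txt):
print(get_anagrams('silent'))  # e.g. set(['enlist', 'listen', 'silent', 'tinsel'])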

Related

Calculate a measure between keywords and each word of a textfile

I have two .txt files: one contains 200,000 words and the other contains 100 keywords (one per line). I want to calculate the cosine similarity between each of the 100 keywords and each of my 200,000 words, and display for every keyword the 50 words with the highest score.
Here's what I did; note that BertClient is what I'm using to extract vectors:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    for keyword in keyword_file:
        vector_key = bc.encode([keyword])
        for w in words:
            vector_word = bc.encode([w])
            cosine_lib = cosine_similarity(vector_key, vector_word)
            print(cosine_lib)
This keeps running and never finishes. Any idea how I can correct this?
I know nothing of Bert...but there's something fishy with the import and run. I don't think you have it installed correctly or something. I tried to pip install it and just run this:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')
and it never finished. Take a look at the docs for Bert and see if something else needs to be done.
On your code: it is generally better to do ALL of the reading first, then the processing. So read in both lists first, separately, and check a few values with something like:
# check first five
print(words[:5])
Also, you need to look at a different way to do your comparisons instead of the nested loops. As written, you are encoding each word in words EVERY TIME for each keyword, which is not necessary and probably really slow. I would recommend you either use a dictionary to pair each word with its encoding, or make a list of (word, encoding) tuples if you are more comfortable with that.
Comment me back if that doesn't make sense after you get Bert up and running.
--Edit--
Here is a chunk of code that works similarly to what you want to do. There are a lot of options for how you can hold results, etc. depending on your needs, but this should get you started with a "fake bert":
from operator import itemgetter

# fake bert ... just return something like length
def bert(word):
    return len(word)

# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
    return abs(x - y)

# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode the words and put result in dictionary
encoded_words = {}
for word in words:
    encoded_words[word] = bert(word)

encoded_keywords = {}
for word in keywords:
    encoded_keywords[word] = bert(word)

# let's use our bert conversions to find which keyword is most similar in
# length to the word
for word in encoded_words.keys():
    result = []  # make a new result set for each pass
    for kword in encoded_keywords.keys():
        similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
        # stuff the answer into a tuple that can be sorted
        result.append((word, kword, similarity))
    result.sort(key=itemgetter(2))
    print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')
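For the real client, if bc.encode behaves the way the question's bc.encode([keyword]) call suggests (a list of strings in, one vector per string out), a batched version might look roughly like this (untested sketch):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode each list ONCE instead of once per keyword/word pair
word_vectors = bc.encode(words)        # assumed shape: (len(words), dim)
keyword_vectors = bc.encode(keywords)  # assumed shape: (len(keywords), dim)

# one call gives a (keywords x words) similarity matrix
similarities = cosine_similarity(keyword_vectors, word_vectors)

for keyword, row in zip(keywords, similarities):
    # indices of the 50 most similar words, highest score first
    top50 = np.argsort(row)[::-1][:50]
    print(keyword, [(words[i], row[i]) for i in top50])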

How do I read a file and convert each line into strings and see if any of the strings are in a user input?

What I am trying to do in my program is to have the program open a file with many different words inside it.
Then receive a user input and check whether any word from the file is in the user input.
Inside the file redflags.txt:
happy
angry
ball
jump
each word on a different line.
For example if the user input is "Hey I am a ball" then it will print redflag.
If the user input is "Hey this is a sphere" then it will print noflag.
Redflags = open("redflags.txt")
data = Redflags.read()
text_post = raw_input("Enter text you wish to analyse")
words = text_post.split() and text_post.lower()
if data in words:
    print("redflag")
else:
    print("noflag")
This should do the trick! Set lookups are generally much faster than list lookups for comparisons like this. Sets can tell you the intersection in this case (overlapping words), differences, etc. We consume the file that has the list of words, remove newline characters, lowercase, and we have our first list. The second list is created from the user input, split on whitespace. Now we can take the set intersection to see if any common words exist.
# read in the word list (one word per line),
# strip newline chars and lowercase
words = [word.replace('\n', '').lower() for word in open('filepath_to_word_list.txt', 'r').readlines()]

# collect user input, lowercase it and split into a list
user_input = raw_input('Please enter your words').lower().split()

# use set intersection to find if common words exist and print redflag
if set(words).intersection(set(user_input)):
    print('redflag')
else:
    print('noflag')
with open('redflags.txt', 'r') as f:
    # See this post: https://stackoverflow.com/a/20756176/1141389
    file_words = f.read().splitlines()

# get the user's words, lower case them all, then split
user_words = raw_input('Please enter your words').lower().split()

# use sets to find if any user_words are in file_words
if set(file_words).intersection(set(user_words)):
    print('redflag')
else:
    print('noredflag')
I would suggest you use list comprehensions.
Take a look at this code, which does what you want (I will explain it below):
Redflags = open("redflags.txt")
data = Redflags.readlines()
data = [d.strip() for d in data]
text_post = input("Enter text you wish to analyse:")
text_post = text_post.split()
fin_list = [i for i in data if i in text_post]
if fin_list:
    print("RedFlag")
else:
    print("NoFlag")
Output 1:
Enter text you wish to analyse:i am sad
NoFlag
Output 2:
Enter text you wish to analyse:i am angry
RedFlag
So first, open the file and read it using readlines(); this gives a list of the lines from the file:
>>> data = Redflags.readlines()
['happy\n', 'angry \n', 'ball \n', 'jump\n']
See all those unwanted spaces and newlines (\n)? Use strip() to remove them! But you can't strip() a list, so take the individual items from the list and apply strip() to each. This can be done efficiently with a list comprehension.
data = [d.strip() for d in data]
Also, why are you using raw_input()? In Python 3, use input() instead.
After getting and splitting the input text.
fin_list = [i for i in data if i in text_post]
I'm creating a list of items where each item i from the data list is also in the text_post list. This way I get the common items that are in both lists.
>>> fin_list
['angry'] #if my input is "i am angry"
Note: in Python, empty lists are considered False,
>>> bool([])
False
While those with values are considered True,
>>> bool(['some','values'])
True
This way your if block only executes if the list is non-empty, meaning "RedFlag" will be printed only when some common item is found between the two lists. You get what you want.

Can I use range for a list filled with words?

I wanted to make a password guesser in Python, so I thought I could use the method in the code below. I'm new to Python so I don't really know what I'm doing.
In the code, I read in a file containing 109,538 words from a dictionary, and now it's a list.
For the code to work, the range needs to be set to the list "DictionaryWords", which consists of all the words in the dictionary as a list.
# I made this to open the file and read it
dictfile = open('c:/ScienceFairDictionaryFolder/wordsEn.txt', 'r')
DictionaryWords = dictfile.readlines()
Password = "Cat"
for x in range(0, 109538):
    if x is Password:
        print(x, "Found it!")
    else:
        print("Nope! Not here!")
Is it possible, or does range only work for lists consisting of integers?

Scan through txt file, translate certain words found in dictionary [Python]

I have a file that contains this:
((S000383212:0.0,JC0:0.244562):0.142727,(S002923086:0.0,(JC1:0.0,JC2:0.0):0.19717200000000001):0.222151,((S000594619:0.0,JC3:0.21869):0.13418400000000003,(S000964423:0.122312,JC4:0.084707):0.18147100000000002):0.011521999999999977);
I have two dictionaries that contain:
org = {'JC4': 'a','JC0': 'b','JC1': 'c','JC2': 'c','JC3': 'd'}
RDP = {'S000383212': 'hello', 'S002923086': 'this', 'S000594619': 'is'}
How would I find every time it says one of the words in one of the dictionaries and convert it to its alternative term?
i.e. if it encounters 'JC0' then it would translate it to 'b'
for key in org.keys() + RDP.keys():
    text = text.replace(key, org.get(key, None) or RDP.get(key, None))
Of course, as TryPyPy said, if you just merge the dicts, it becomes much simpler:
org.update(RDP)
for item in org.items():
    text = text.replace(*item)
If the performance isn't very important you can use the following code:
with open('your_file_name.txt') as f:
    text = f.read()

for key, value in org.items() + RDP.items():
    text = text.replace(key, value)
This code has O(n * k) time complexity, where n is the length of the text and k is the number of entries in both dictionaries. If this complexity doesn't suit your task, the Aho-Corasick algorithm can help you.
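If you don't want to pull in a full Aho-Corasick implementation, a simpler single-pass option (a sketch using only the standard re module and the two dicts above, not the algorithm itself) is to build one alternation pattern from all the keys and replace through a callback:
import re

# merge both dictionaries (keys are assumed not to clash)
replacements = dict(org)
replacements.update(RDP)

# longest keys first, so longer identifiers match before shorter prefixes
pattern = re.compile('|'.join(sorted(map(re.escape, replacements), key=len, reverse=True)))

# one pass over the text, looking up each matched key in the merged dict
text = pattern.sub(lambda m: replacements[m.group(0)], text)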
You should use the replace string method.

best way to compare sequence of letters inside file?

I have a file that has lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all to all.
I'm doing something like this, but it isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want is to know whether any sequence is totally equal to sequence 1, or to 2, and so on.
If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO

records = list(SeqIO.parse(file('spoo.fa'), 'fasta'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)

for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print '>%s' % ids
    print seq
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare a hash of the sequences. Python offers two built-in data types that use hashing: set and dict. It's best to use a dict here, as we can also store the identifiers of all the matches.
I've assumed the file has identifiers and sequences on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as the key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
from collections import defaultdict

lines = filetext.split("\n")
sequences = defaultdict(list)

while lines:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results
The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and the numbers (the first part of each line) where these sequences occur.
#!/usr/bin/python

import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)
Sample output:
etc#etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update
Changed code to use defaultdict and a for loop. Thanks @KennyTM.
Update 2
Changed code to use append rather than +. Thanks @Dave Webb.
