Looking for matching strings of length > 4 between two text files

Looking for matching strings of length > 4 between two text files - python

I'm trying to read in two text files, and then search each for strings that are common between the two, of minimum length 5.
The code I've written:
db = open("list_of_2","r").read()
lp = open("lastpass","r").read()
word = ''
length = 0
for dbchar in db:
for lpchar in lp:
if dbchar == lpchar:
word += str(dbchar)
length += 1
else:
length = 0
word = ''
if length > 4:
print(word)
The code currently prints strings like '-----' and '55555', over and over and doesn't seem to break the loop (these particular strings only appear in lp once). I also don't believe it's finding strings that aren't just the same character repeated.
How do I alter the code to:
Only make it run through and print each occurrence once, and
Not just find strings of the same character repeated?
Edit: Here are some mock text files. In these, the string 'ghtyty' appears in file1 three times, and in file2 4 times. The code should print 'ghtyty' to console once.
file1
file2

I would suggest a different approach. Split the files into words and keep only words 5 characters or greater. Use sets to find the intersection--this will be faster.
db_words = set([x for x in db.split() if len(x) > 4])
lp_words = set([x for x in lp.split() if len(x) > 4])
matches = db_words & lp_words
If you want to exclude words of all same character, you can define the list comprehension like this:
[x for x in db.split() if len(x) > 4 and x != x[0]*len(x)]
If you are looking for any consecutive sequence of characters that match, this might work better:
i_skip = set() # characters to skip if they are already in a printed word
j_skip = set()
for i in range(len(db)-4):
if i in i_skip: continue
for j in range(len(lp)-4):
if j in j_skip: continue
if db[i] == lp[j]:
word_len = 5
while db[i:i+word_len] == lp[j:j+word_len]:
if db[i:i+word_len+1] == lp[j:j+word_len+1]:
word_len += 1
else:
print(db[i:i+word_len])
i_skip.update(range(i, i+word_len))
j_skip.update(range(j, j+word_len))
break

Related

How to extract words from repeating strings

Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' in the list, but not just one of them, instead how many times the word 'april' actually occurs the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well the word actually didn't occurred three times, all the characters did. So I want to order these characters to 'april' for how many times did they appeared in the string.
My idea is basically to extract words from some random strings, but not just extracting the word, instead to extract all of the word that appears in the string. Each word should be extracted and the word (characters) should be ordered the way I wanted to.
But here I have some annoying conditions; you can't delete all the elements in the list and then just replace them with the word 'april'(you can't replace the whole string with the word 'april'); you can only extract 'april' from the string, not replacing them. You can't also delete the list with the string. Just think of all the string there being very important data, we just want some data, but these data must be ordered, and we need to delete all other data that doesn't match our "data chain" (the word 'april'). But once you delete the whole string you will lose all the important data. You don't know how to make another one of these "data chains", so we can't just put the word 'april' back in the list.
If anyone know how to solve my weird problem, please help me out, I am a beginner python programmer. Thank you!

One way is to use itertools.groupby which will group the characters individually and unpack and iterate them using zip which will iterate n times given n is the number of characters in the smallest group (i.e. the group having lowest number of characters)
from itertools import groupby
'aaaaaaappppppprrrrrriiiiiilll'
result = ''
for each in zip(*[list(g) for k, g in groupby('aaaaaaappppppprrrrrriiiiiilll')]):
result += ''.join(each)
# result = 'aprilaprilapril'
Another possible solution is to create a custom counter that will count each unique sequence of characters (Please be noted that this method will work only for Python 3.6+, for lower version of Python, order of dictionaries is not guaranteed):
def getCounts(strng):
if not strng:
return [], 0
counts = {}
current = strng[0]
for c in strng:
if c in counts.keys():
if current==c:
counts[c] += 1
else:
current = c
counts[c] = 1
return counts.keys(), min(counts.values())
result = ''
counts=getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
result += ''.join(counts[0])
# result = 'aprilaprilapril'

How about using regex?
import re
word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
# Find the lowest amount of character repeats
lowest_amount = min(len(g) for g in match.groups())
print(word * lowest_amount)
else:
print("no match")
Outputs:
aprilaprilapril
Works like a charm

Here is a more native approach, with plain iteration.
It has a time complexity of O(n).
It uses an outer loop to iterate over the character in the search key, then an inner while loop that consumes all occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumes, it updates a the minLetterCount to be the minimum of its previous value or this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
def countCompleteSequenceOccurences(searchString, key):
left = 0
minLetterCount = 0
letterCount = 0
for i, searchChar in enumerate(key):
while left < len(searchString) and searchString[left] == searchChar:
letterCount += 1
left += 1
minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
letterCount = 0
return minLetterCount
Testing:
testCasesToOracles = {
"aaaaaaappppppprrrrrriiiiiilll": 3,
"ppppppprrrrrriiiiiilll": 0,
"aaaaaaappppppprrrrrriiiiii": 0,
"aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
"pppppppaaaaaaarrrrrriiiiiilll": 0,
"zaaaaaaappppppprrrrrriiiiiilll": 3,
"zzzaaaaaaappppppprrrrrriiiiiilll": 3,
"aaaaaaappppppprrrrrriiiiiilllzzz": 3,
"zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}
key = "april"
for case, oracle in testCasesToOracles.items():
result = countCompleteSequenceOccurences(case, key)
assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril

A word will only occur as many times as the minimum letter recurrence. To account for the possibility of having repeated letters in the word (for example, appril, you need to factor this count out. Here is one way of doing this using collections.Counter:
from collections import Counter
def count_recurrence(kernel, string):
# we need to count both strings
kernel_counter = Counter(kernel)
string_counter = Counter(string)
# now get effective count by dividing the occurence in string by occurrence
# in kernel
effective_counter = {
k: int(string_counter.get(k, 0)/v)
for k, v in kernel_counter.items()
}
# min occurence of kernel is min of effective counter
min_recurring_count = min(effective_counter.values())
return kernel * min_recurring_count

Finding a word from a text dictionary with given random letters

When a person enters a function (e.g. find_from_dict(letters)), the function searches a word from dictionary.txt that can be made from the letters that the user has inputted—a word that contains the most letters inputted).
For example, letters is input as random typing such as "BAJPPNLE" which will then find "APPLE" from the dictionary since "APPLE" has the most letters from "BAJPPNLE".
def find_from_dict(letters):
n = 0
y = 0
x = 0
dictFile = [line.rstrip('\n') for line in open("dictionary.txt")]
listLetters = list(letters)
final = []
while True:
if n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] in listLetters:
x = x + 1
elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] not in listLetters:
x = 0
n = n + 1
elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x == len(list(dictFile[n])):
final.append(dictFile[n])
elif n < len(dictFile) and len(list(dictFile[n])) > len(listLetters):
n = n + 1
else:
print(final)
break
I have this code at the moment, but since my dictionary.txt file is huge and the code is inefficient, it takes forever to go through..
Does anyone have any idea how I could make this code efficient?

You can speed this up by preparing a word index formed of the sorted letters in your word list. Then look for sorted combinations of the letters in that index:
for example:
from collections import defaultdict
from itertools import combinations
with open("/usr/share/dict/words","r") as wordList:
words = defaultdict(list)
for word in wordList.read().upper().split("\n"):
words[tuple(sorted(word))].append(word) # index by sorted letters
def findWords(letters):
for size in range(len(letters),2,-1): # from large to small (minimum 3 letters)
for combo in combinations(sorted(letters),size): # combinations of that size
for word in (w for w in words[combo]): # matching fords from index
yield word # return as you go (iterator)
# If you only want one, change this to: return word
testing:
while True:
letters = input("Enter letters:")
if not letters: break
for word in findWords(letters.upper()):
stop = input(word)
if stop: break
print("")
sample output:
Enter letters:BAJPPNLE
JELAB
BEJAN
LEBAN
NABLE
PEBAN
PEBAN
ALPEN
NEPAL
PANEL
PENAL
PLANE
ALPEN
NEPAL
PANEL
PENAL
PLANE
APPLE
NAPPE.
Enter letters:EPROING
PERIGON
PIGEON
IGNORE
REGION
PROGNE
OPINER.
Enter letters:
if you need a solution without using libraries, you will need to use a recursive approach that does a breadth first traversal of the combination tree:
with open("/usr/share/dict/words","r") as wordList:
words = dict()
for word in wordList.read().upper().split("\n"):
words.setdefault(tuple(sorted(word)),list()).append(word) # index by sorted letters
def findWords(letters,size=None):
if size == None:
letters = sorted(letters)
for size in range(len(letters),2,-1):
for word in findWords(letters,size): yield word
elif len(letters) == size:
for word in words.get(tuple(letters),[]): yield word
elif len(letters)>size:
for i in range(len(letters)):
for word in findWords(letters[:i]+letters[i+1:],size):
yield word

You can kind of "cheat" your way through it by pre-processing the dictionary file.
The idea is: instead of having a list of words, you have a list of groups which is determined by the sorted letters of the words.
For example, something like:
"aeegr": [
"agree",
"eager",
],
"alps": [
"alps",
"laps",
"pals",
]
Then if you wanted to just find the exact match, you could sort the letters from the input and search in the processed file.
But you want the one that matches the most letters, so what you could do is number the letters with prime numbers (I'm only considering lowercase ascii characters), so that a is 2, b is 3, c is 5, d is 7 and so on.
Then, you can get a number by multiplying all the letters, so for example for alps you'd get 2*37*53*67.
In your dictionary file you then have the numbers obtained the same way for each word.
Like:
262774: [
"alps",
"laps",
"pals",
]
You then go through your dictionary and if the initial number divided by the dictionary number has a remainder of 0, that's a possible match.
The maximum number with a remainder of 0 is the one that you want, because that's the one with the most letters present.
Keep in mind that the numbers might get very big very quickly, depending on how many letters you use.

Remove repeated characters in a word

I am trying to clean texts for a data set, and many of the words are misspelled, for example, many times I will see the word "hellllo." I wish to remove repeated characters where a character is repeated more than twice in a row. Obviously this will not work with words such as "nooooo" because that will convert it to "noo", but I have functions to handle this written already. All I want to do is convert words such as "hellllo" to "hello".

Here is a generic function that handles arbitrary number of repetitions allowed:
def remove_multiple(s, n=2):
'''
param s: string
param n: number of max repetition allowed in the string
'''
if n < 0:
return
elif n==1:
return ''.join(sorted(set(s), key=s.index))
elif n > 1:
temp = []
temps = s + ' '*n
for i, c in enumerate(s):
if len(set(temps[i:n+1+i])) > 1:
temp.append(c)
return "".join(temp)
>>> remove_multiple('helllllllllllllllooooooooooooooo', 2)
Out: 'helloo'
>>> remove_multiple('helllllllllllllllooooooooooooooo', 5)
Out[]: 'helllllooooo'

Binary Search using a for loop, searching for words in a list and comparing

I'm trying to compare the words in "alice_list" to "dictionary_list", and if a word isnt found in the "dictionary_list" to print it and say it is probably misspelled. I'm having issues where its not printing anything if its not found, maybe you guys could help me out. I have the "alice_list" being appended to uppercase, as the "dictionary_list" is all in capitals. Any help with why its not working would be appreciated as I'm about to pull my hair out over it!
import re
# This function takes in a line of text and returns
# a list of words in the line.
def split_line(line):
return re.findall('[A-Za-z]+(?:\'[A-Za-z]+)?', line)
# --- Read in a file from disk and put it in an array.
dictionary_list = []
alice_list = []
misspelled_words = []
for line in open("dictionary.txt"):
line = line.strip()
dictionary_list.extend(split_line(line))
for line in open("AliceInWonderLand200.txt"):
line = line.strip()
alice_list.extend(split_line(line.upper()))
def searching(word, wordList):
first = 0
last = len(wordList) - 1
found = False
while first <= last and not found:
middle = (first + last)//2
if wordList[middle] == word:
found = True
else:
if word < wordList[middle]:
last = middle - 1
else:
first = middle + 1
return found
for word in alice_list:
searching(word, dictionary_list)
--------- EDITED CODE THAT WORKED ----------
Updated a few things if anyone has the same issue, and used "for word not in" to double check what was being outputted in the search.
"""-----Binary Search-----"""
# search for word, if the word is searched higher than list length, print
words = alice_list
for word in alice_list:
first = 0
last = len(dictionary_list) - 1
found = False
while first <= last and not found:
middle = (first + last) // 2
if dictionary_list[middle] == word:
found = True
else:
if word < dictionary_list[middle]:
last = middle - 1
else:
first = middle + 1
if word > dictionary_list[last]:
print("NEW:", word)
# checking to make sure words match
for word in alice_list:
if word not in dictionary_list:
print(word)

Your function split_line() returns a list. You then take the output of the function and append it to the dictionary list, which means each entry in the dictionary is a list of words rather than a single word. The quick fix it to use extend instead of append.
dictionary_list.extend(split_line(line))
A set might be a better choice than a list here, then you wouldn't need the binary search.
--EDIT--
To print words not in the list, just filter the list based on whether your function returns False. Something like:
notfound = [word for word in alice_list if not searching(word, dictionary_list)]

Are you required to use binary search for this program? Python has this handy operator called "in". Given an element as the first operand and and a list/set/dictionary/tuple as the second, it returns True if that element is in the structure, and false if it is not.
Examples:
1 in [1, 2, 3, 4] -> True
"APPLE" in ["HELLO", "WORLD"] -> False
So, for your case, most of the script can be simplified to:
for word in alice_list:
if word not in dictionary_list:
print(word)
This will print each word that is not in the dictionary list.

I want to create a program that finds the longest substring in alphabetical order using python using basics

my code for finding longest substring in alphabetical order using python
what I mean by longest substring in alphabetical order?
if the input was"asdefvbrrfqrstuvwxffvd" the output wil be "qrstuvwx"
#we well use the strings as arrays so don't be confused
s='abcbcd'
#give spaces which will be our deadlines
h=s+' (many spaces) '
#creat outputs
g=''
g2=''
#list of alphapets
abc='abcdefghijklmnopqrstuvwxyz'
#create the location of x"the character the we examine" and its limit
limit=len(s)
#start from 1 becouse we substract one in the rest of the code
x=1
while (x<limit):
#y is the curser that we will move the abc array on it
y=0
#putting our break condition first
if ((h[x]==' ') or (h[x-1]==' ')):
break
for y in range(0,26):
#for the second character x=1
if ((h[x]==abc[y]) and (h[x-1]==abc[y-1]) and (x==1)):
g=g+abc[y-1]+abc[y]
x+=1
#for the third to the last character x>1
if ((h[x]==abc[y]) and (h[x-1]==abc[y-1]) and (x!=1)):
g=g+abc[y]
x+=1
if (h[x]==' '):
break
print ("Longest substring in alphabetical order is:" +g )
it doesn't end,as if it's in infinite loop
what should I do?
I am a beginner so I want some with for loops not functions from libraries
Thanks in advance

To avoid infinite loop add x += 1 in the very end of your while-loop. As a result your code works but works wrong in general case.
The reason why it works wrong is that you use only one variable g to store the result. Use at least two variables to compare previous found substring and new found substring or use list to remember all substrings and then choose the longest one.

s = 'abcbcdiawuhdawpdijsamksndaadhlmwmdnaowdihasoooandalw'
longest = ''
current = ''
for idx, item in enumerate(s):
if idx == 0 or item > s[idx-1]:
current = current + item
if idx > 0 and item <= s[idx-1]:
current = ''
if len(current)>len(longest):
longest = current
print(longest)
Output:
dhlmw
For your understanding 'b'>'a' is True, 'a'>'b' is False etc
Edit:
For longest consecutive substring:
s = 'asdefvbrrfqrstuvwxffvd'
abc = 'abcdefghijklmnopqrstuvwxyz'
longest = ''
current = ''
for idx, item in enumerate(s):
if idx == 0 or abc.index(item) - abc.index(s[idx-1]) == 1:
current = current + item
else:
current = item
if len(current)>len(longest):
longest = current
print(longest)
Output:
qrstuvwx

def sub_strings(string):
substring = ''
string +='\n'
i = 0
string_dict ={}
while i < len(string)-1:
substring += string[i]
if ord(substring[-1])+1 != ord(string[i+1]):
string_dict[substring] = len(substring)
substring = ''
i+=1
return string_dict
s='abcbcd'
sub_strings(s)
{'abc': 3, 'bcd': 3}
To find the longest you can domax(sub_strings(s))
So here which one do you want to be taken as the longest??. Now that is a problem you would need to solve

You can iterate through the string and keep comparing to the last character and append to the potentially longest string if the current character is greater than the last character by one ordinal number:
def longest_substring(s):
last = None
current = longest = ''
for c in s:
if not last or ord(c) - ord(last) == 1:
current += c
else:
if len(current) > len(longest):
longest = current
current = c
last = c
if len(current) > len(longest):
longest = current
return longest
so that:
print(longest_substring('asdefvbrrfqrstuvwxffvd'))
would output:
qrstuvwx

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.