I am developing a word game, and for this game, I needed a list of words. Sadly, this list was so long that I just had to refine it (this list of words can be found on any Mac at /usr/share/dict/).
To refine it, I decided to use my own Python scripts. I already wrote a script before that removes all words that start with capital letters (thus removing names of places, etc.), and it worked. This is it:
with open("/Users/me/Desktop/oldwords.txt", "r") as text:
    with open("/Users/me/Desktop/newwords.txt", "w") as towriteto:
        for word in text:
            if word[0]==word[0].lower():
                towriteto.write(word)
Then, I decided to refine it even further by deleting all words that are not in the pyenchant module's English dictionary. The code for this operation is very similar to the previous one. This is my code:
import enchant
with open("/Users/me/Desktop/newwords.txt", "r") as text:
    with open("/Users/me/Desktop/words.txt", "w") as towriteto:
        d = enchant.Dict("en_US")
        for word in text:
            if d.check(word):
                towriteto.write(word)
Sadly, this did not write anything to the "towriteto" file, and after some debugging, I found that
d.check(word) -> False
It always returned false. However, when I checked words separately, real words returned True, and fake words returned False as they should.
I have no idea what is wrong with my second script. The file locations are correct and the pyenchant installation had no issues.
Thanks in advance!
I don't know the input file format, but if there is only one word per line, try removing the end-of-line character from word before calling d.check(word):
word = word.rstrip()
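Each line read from the file still carries its trailing newline, so the dictionary is asked about "apple\n" rather than "apple". The following is a minimal sketch of the filtering loop with the strip applied. The helper name filter_words and the tiny stand-in vocabulary are assumptions made so that the example runs without pyenchant; with pyenchant installed, d.check from the question would be passed as is_word.

```python
def filter_words(lines, is_word):
    """Return the lines whose stripped word passes is_word, each ending in a newline."""
    kept = []
    for line in lines:
        word = line.strip()  # drop the trailing "\n" (and any stray spaces)
        if word and is_word(word):  # skip blank lines
            kept.append(word + "\n")
    return kept

# Hypothetical stand-in for d.check: membership in a tiny vocabulary.
known = {"apple", "pear"}
lines = ["apple\n", "qwzx\n", "pear\n"]

print(filter_words(lines, known.__contains__))    # ['apple\n', 'pear\n']
print([line for line in lines if line in known])  # [] because the newline is still attached
```

The second print reproduces the symptom from the question: without stripping, no line matches.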
I have a text file that contains some sentences. I check each one against some rules and write "valid" or "not valid" to a separate text file. My main problem is that when I press Ctrl+F and enter my regex in the search bar, it matches the strings I want, but in code it does not work. Here is my code:
import re
pattern = re.compile('(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text=open('validSentences',"w+")
with open('sentences.txt',encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        matches = pattern.fullmatch(line)
        if(matches==None):
            text.write("not valid"+"\n")
        else:
            text.write("valid"+"\n")
file.close()
The documentation says that fullmatch only matches when the whole string matches, and that's what I'm trying to do, but this code writes "not valid" for every sentence I have. The text file that I have:
How can you say that to me?
As he looked at his reflection in the mirror, he took a deep breath.
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the
extremely tall man, who was waiting by the door.
David said ‘Oh, sorry!’.
The happy pair discussed their future life 2gether and shared sweet words of admiration.
We will not stop you; I promise!
Come here ASAP!
He pushed his chair back and went to the kitchen at 2 pM.
I do not know...
The main character in the movie said: "Play hard. Work harder."
When I enter my regex in VS Code with Ctrl+F, the whole first, second, fourth, seventh and eighth lines are highlighted, so according to the fullmatch() function they should be written as "valid", but they aren't. I need help with this issue.
First, you can drop lines = file.readlines() and iterate over the file object directly; readlines() reads the whole file into memory and leaves the file handle at the end of the stream. Then, keep in mind that whether you use for line in lines: or for line in file:, the line variable has a trailing newline, so
Either use line=line.rstrip() to remove the trailing whitespace before running the regex or
Ensure your pattern ends in \n? (an optional newline), or even \s* (any zero or more whitespace).
So, a possible solution looks like
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        matches = pattern.fullmatch(line.rstrip('\n'))
        ...
Or,
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        ....
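The effect of the trailing newline on fullmatch() can be checked in isolation. The sketch below uses a deliberately simplified pattern (an assumption for illustration, not the full pattern from the question), and the helper name classify is hypothetical.

```python
import re

# Simplified stand-in for the pattern in the question: a capital letter,
# lowercase words, then one closing punctuation mark.
pattern = re.compile(r"[A-Z][a-z\s,]*[.?!]")

def classify(lines):
    """Label each line 'valid' or 'not valid' after removing its newline."""
    return ["valid" if pattern.fullmatch(line.rstrip("\n")) else "not valid"
            for line in lines]

line = "How can you say that to me?\n"
print(pattern.fullmatch(line))                # None: the "\n" is left unmatched
print(pattern.fullmatch(line.rstrip("\n")))   # a match object
print(classify([line, "Come here ASAP!\n"]))  # ['valid', 'not valid']
```

An editor's search box looks for the pattern inside each line, which is why it highlights sentences that fullmatch() rejects once the newline is counted.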
I have a small issue with PyDictionary: when I enter a list of words, the printed output does not keep the order of the word list.
For example:
from PyDictionary import PyDictionary
dictionary=PyDictionary(
    "bad ",
    "omen",
    "azure ",
    "sky",
    "icy ",
    "smile")
print(dictionary.printMeanings())
This will print omen first, then sky, and so on. What I need is for the words to print in their original order. I searched on Google but found nothing related, and I searched the posts in this forum and found nothing. I hope you can help me. Thank you in advance.
I found a workaround that gives me a full solution to my initial printing issues.
The main problem is that I am using an old laptop (more than 12 years old), so I have not been able to use Python 3+. Using PyDictionary with Python 2.7 gave rise to the problem of the word list printing in a seemingly random order.
The solution is to print a single word per call, but I have to do this about 25,000 times! Using Notepad++, I made a macro that generates the Python code for each word. I was even able to add the Spanish translation to each English word. Printing each word individually has the added benefit that each word's definitions are kept separate from the others.
Using Notepad++ and regex, I can do the final cleanup of each word and its meaning.
So I am happy with this workaround. Thank you for your help.
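The one-word-at-a-time idea can also be expressed as a loop rather than a macro. This is only a sketch: the function name meanings_in_order and the fake_meanings table are hypothetical, and the lookup is passed in as a parameter so the example runs without PyDictionary or a network connection. With the library installed, its per-word lookup would presumably take the place of fake_meanings.get, which is an assumption about the installed version.

```python
def meanings_in_order(words, lookup):
    """Look up each word separately so results follow the input order."""
    results = []
    for raw in words:
        word = raw.strip()  # some words in the question have trailing spaces
        results.append((word, lookup(word)))
    return results

# Hypothetical stand-in for a real dictionary lookup.
fake_meanings = {"bad": "not good", "omen": "a sign", "sky": "the heavens"}

for word, meaning in meanings_in_order(["bad ", "omen", "sky"], fake_meanings.get):
    print("%s: %s" % (word, meaning))
```

Because the results are collected in a list, the order is that of the input regardless of how the lookup stores its data internally.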
I am currently trying to build a Caesar decrypter that automatically tries all the possibilities and compares them to a big list of words to see whether each one is a real word, so some sort of dictionary attack, I guess.
I found a list with a lot of German words, already split so that each word is on its own line. Currently, I am struggling with comparing the sentence I have against the whole word list, so that when the program sees that a word in my sentence is also in the word list, it prints out that this is a real word and possibly the right sentence.
This is how far I have got. I have not included the code that tries all 26 letters, only the part that looks through the word list and compares it to a sentence. Maybe someone can tell me what I am doing wrong and why it doesn't work:
No idea why it doesn't work. I have also tried it with regular expressions but nothing works. The list is really long (166k Words).
There is a \n at the end of each word in the list you created from the file, so the words will never be equal to what they are compared with.
Remove the newline character before appending (for example, wordlist.append(line.rstrip())).
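Since the asker's code is not shown, the following is only a sketch of how the stripped word list and the comparison might look. The names load_wordlist and known_words are hypothetical, and io.StringIO stands in for the real word-list file. A set is used instead of a list because membership tests against 166k words are much faster that way.

```python
import io

def load_wordlist(lines):
    """Build a set of lowercase words with surrounding whitespace removed."""
    return {line.strip().lower() for line in lines if line.strip()}

def known_words(sentence, wordlist):
    """Return the words of sentence that appear in wordlist."""
    return [w for w in sentence.split() if w.lower() in wordlist]

# io.StringIO stands in for the real word-list file (one word per line).
fake_file = io.StringIO("Haus\nBaum\nist\n")
wordlist = load_wordlist(fake_file)

print(known_words("das Haus ist xqz", wordlist))  # ['Haus', 'ist']
```

A candidate decryption could then be scored by how many of its words known_words returns.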
I'm trying to capitalize the first letter of every name in a file, so I wrote the following code:
with open('C:/Users/Nishesh/Documents/updated_firstnames.txt', 'r+', encoding='utf-8') as updated_fnames_file:
    with open('C:/Users/Nishesh/Documents/capitalized.txt', 'w', encoding='utf-8') as new_fnames:
        for line in updated_fnames_file:
            new_fnames.write(line.capitalize())
I'm new to Python, so I'm well aware that this is probably poor formatting/logic (and I'd appreciate suggestions to improve it), but for my purposes, this did manage to correctly capitalize every item in the file other than the very first one, as far as I can tell. Actually, the first name in the original file was already capitalized, but after I ran this it ended up lower case in the resulting file. The other items in the first file which were already capitalized were not made lower case however - just this one. Why is this happening?
capitalize():
It returns a copy of the string with its first character capitalized and the rest lowercased.
You probably need capwords() from string lib.
string.capwords() :
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join().
Or you can do the same method by hand
new_fnames.write(' '.join(map(str.capitalize, line.split())) + '\n')  # split() drops the newline, so add it back
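Two details are easy to miss here: str.capitalize() lowercases everything after the first character, and string.capwords() strips the trailing newline. The sketch below shows both. It also illustrates one possible cause of the lowercased first name, namely an invisible byte order mark at the start of the file. That cause is an assumption and is not confirmed anywhere in the post; the helper name capitalize_names is hypothetical.

```python
import string

BOM = "\ufeff"  # byte order mark; assumed here, not confirmed by the post

def capitalize_names(lines):
    """Capitalize every word on each line and keep one name per line."""
    result = []
    for line in lines:
        # Drop an invisible leading BOM if present, then the newline.
        cleaned = line.lstrip(BOM).rstrip("\n")
        # capwords() strips surrounding whitespace, so re-add the newline.
        result.append(string.capwords(cleaned) + "\n")
    return result

# With a BOM in front, the "first character" is the BOM, so "J" is lowercased.
print("\ufeffJohn".capitalize() == "\ufeffjohn")  # True
print(capitalize_names(["\ufeffJohn\n", "mary ann\n"]))  # ['John\n', 'Mary Ann\n']
```

If a byte order mark does turn out to be the cause, opening the input file with encoding='utf-8-sig' removes it while reading.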
I'm using NLTK to process some text that is extracted from PDF files. I can recover the text mostly intact, but there are lots of instances where spaces between words are not captured, so I get words like ifI instead of if I, or thatposition instead of that position, or andhe's instead of and he's.
My question is this: how can I use NLTK to look for words it does not recognize/has not learned, and see if there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this kind of check than simply marching through the unrecognized word, one character at a time, splitting it, and seeing if it makes two recognizable words?
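For reference, the character-by-character check described above can be sketched in a few lines. This is a baseline only, under stated assumptions: the toy vocabulary set stands in for whatever word list would really be used, and the function name split_candidates is hypothetical.

```python
def split_candidates(token, vocabulary):
    """Return every (left, right) split of token where both halves are known words."""
    pairs = []
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left.lower() in vocabulary and right.lower() in vocabulary:
            pairs.append((left, right))
    return pairs

# Toy vocabulary for illustration only.
vocabulary = {"if", "i", "that", "position", "and", "he's"}

for token in ["ifI", "thatposition", "andhe's"]:
    print(token, split_candidates(token, vocabulary))
```

With a realistic vocabulary, a token can have several valid splits, and this baseline offers no way to rank them.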
I would suggest that you consider using pyenchant instead, since it is a more robust solution for this sort of problem. You can download pyenchant here. Here is an example of how you would obtain your results after you install it:
>>> text = "IfI am inthat position, Idon't think I will." # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
        for suggestion in error.suggest():
            # accept a suggestion only if it has exactly the same characters, in the same order, as the error, ignoring spaces
            if error.word.replace(' ', '') == suggestion.replace(' ', ''):
                error.replace(suggestion)
                break
>>> checker.get_text()
"If I am in that position, I don't think I will." # text is now fixed