concatenate words in a text file - python

I have exported a PDF file as a .txt and noticed that many words were broken into two parts by line breaks. In this program, I want to join the words that were split while keeping the correct words of the sentence intact. In the end, I want a final .txt file (or at least a list of tokens) with all words properly spelt. Can anyone help me?
My current text is like this:
I need your help be cause I am not a good progra mmer.
result I need:
I need your help because I am not a good programmer.
from collections import defaultdict
import re
import string
import enchant

document_text = open('test-list.txt', 'r')
text_string = document_text.read().lower()
lst = []
errors = []
dic = enchant.Dict('en_UK')
d = defaultdict(int)
match_pattern = re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', text_string)
for w in match_pattern:
    lst.append(w)
for i in lst:
    if dic.check(i) is True:
        continue
    else:
        a = list(map(''.join, zip(*([iter(lst)] * 2))))
        if dic.check(a) is True:
            continue
        else:
            errors.append(a)
print(lst)

You have a bigger problem - how will your program know that:
be
cause
... should be treated as one word?
If you really wanted to, you could replace newline characters with empty strings:
import re

document_text = """
i need your help be
cause i am not a good programmer
""".lower().replace("\n", '')

print(re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', document_text))
This will spellcheck "because" correctly, but will fail in cases like:
Hello! My name is
Foo.
... because isFoo is not a word.
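A middle ground, since the question already uses a spellchecking dictionary, is to merge across line breaks only when the joined form spellchecks. Below is a minimal sketch assuming pyenchant is installed with an 'en_GB' dictionary (the question used 'en_UK', which may not exist on every system); it joins a line's last word with the next line's first word only when the concatenation is a dictionary word, so "be" + "cause" merges but "is" + "Foo" does not. It can still mis-merge pairs like "in" + "to", which is the inherent ambiguity noted above.

import enchant

dic = enchant.Dict('en_GB')

def rejoin(text):
    lines = [line.split() for line in text.splitlines()]
    words = []
    for idx, line in enumerate(lines):
        if not line:  # blank, or first word consumed by a previous merge
            continue
        words.extend(line[:-1])
        last = line[-1]
        nxt = lines[idx + 1] if idx + 1 < len(lines) else []
        if nxt and dic.check(last + nxt[0]):
            # The two fragments form a dictionary word: join them.
            words.append(last + nxt[0])
            nxt.pop(0)
        else:
            words.append(last)
    return ' '.join(words)

print(rejoin("i need your help be\ncause i am not a good programmer"))
# -> i need your help because i am not a good programmer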

Related

How do I remove non-English words from a file?

I am trying to process a file with two columns: text and categories. From the text column, I need to remove non-English words. I am new to Python, so I would appreciate any suggestions on how to do this. My file has 60,000 rows of instances.
I can get to the point below but need help on how to move forward.
If you want to remove non-English characters, such as punctuation, symbols, or script of any other language, you can use the isalpha() method of str.
words=[word.lower() for word in words if word.isalpha()]
To remove meaningless English words you can proceed with @Infinity's suggestion, but a dictionary of 20,000 words will not cover all scenarios.
Since this question is tagged text-mining, you can pick a source similar to the corpus you are working with, collect all the words in that source, and then proceed with @Infinity's approach.
This code should do the trick.
import pandas
import requests
import string

# The following link contains a text file with the 20,000
# most frequent words in English, one per line.
DICTIONARY_URL = 'https://raw.githubusercontent.com/first20hours/' \
                 'google-10000-english/master/20k.txt'

PATH = r"C:\path\to\file.csv"
FILTER_COLUMN_NAME = 'username'
PRINTABLES_SET = set(string.printable)

def is_english_printable(word):
    return PRINTABLES_SET >= set(word)

def prepare_dictionary(url):
    return set(requests.get(url).text.splitlines())

DICTIONARY = prepare_dictionary(DICTIONARY_URL)

df = pandas.read_csv(PATH, encoding='ISO-8859-1')
df = df[df[FILTER_COLUMN_NAME].map(is_english_printable) &
        df[FILTER_COLUMN_NAME].map(str.lower).isin(DICTIONARY)]
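If the goal is instead to drop individual non-English words inside each text cell, rather than filtering whole rows, a per-token variant could reuse the same DICTIONARY set. A sketch; the column name 'text' is an assumption, since the question does not name its columns:

def keep_english_words(text):
    # Keep only whitespace-separated tokens found in the dictionary.
    return ' '.join(w for w in str(text).split() if w.lower() in DICTIONARY)

df['text'] = df['text'].map(keep_english_words)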

how to take a txt file and split it into strings, getting rid of any floats or ints

New to programming, using python 3.0.
I have to write a program that takes an input file name for a .txt file, reads the file but only reads certain words, ignoring the floats and integers and any words that don't match anything in another list.
Basically, I have wordsList and messages.txt.
This program has to read through messages.txt (example of text:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with the love of my life ... ARREIC)
It then has to ignore all the numbers and search for whether or not any of the words in the message match the words in the wordsList and then match those words to a value(int) in the hvList.
What I have so far (wordsList and hvList are defined in another part of the code that I don't think is necessary for understanding what I'm trying to do; let me know if you need it):
def tweetsvalues():
    tweetwList = []
    tinputfile = open(tweetsinputfile, "r")
    for line in tinputfile:
        entries = line.split()
The last line, entries = line.split(), is the one that needs to be changed, I'm guessing.
Yes, split is your best friend here. You can also look up the documentation for the str.is* methods (isalpha(), isdigit(), and so on). Although the full code is beyond the normal scope of StackOverflow, your central work will look something like
words = sentence.split()
good_words = [word for word in words if word.isalpha()]
This can be done in a "Pythonic" fashion using filter, removing punctuation, etc. However, from the way you write in your posting, I suspect you can take it from here.
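Putting those pieces together for this question, a sketch of the full function might look like the following. It assumes wordsList and hvList are parallel lists with lowercase entries; the question implies each word maps to a value in hvList but does not show the actual structure:

def tweetsvalues(tweetsinputfile, wordsList, hvList):
    values = []
    with open(tweetsinputfile, "r") as tinputfile:
        for line in tinputfile:
            for entry in line.split():
                word = entry.strip(".,!?").lower()
                # Skip numbers and anything not in the word list.
                if word.isalpha() and word in wordsList:
                    values.append(hvList[wordsList.index(word)])
    return values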
This is some code that I wrote earlier today as a really basic spellchecker. I would answer your question more specifically, but I do not have the time right now; this should accomplish what you are looking for. The hard-coded .txt file that I am opening contains a lot of correctly spelled English words. Feel free to fold my ideas into your work as needed, but make sure you understand all of the code you are using; otherwise I would only be hindering your learning by giving this code to you. In your case, you would likely want to output all of the words regardless of their spelling, whereas my code only outputs the incorrectly spelled ones. Feel free to ask any questions.
#---------------------------------------------------------
# The "spell_check" function determines whether the input
# from the inputFile is a correctly spelled word, and if not
# it will return the word, to be written later to a file
# containing misspelled words
#---------------------------------------------------------
def spell_check(word, english):
    if word in english:
        return None
    else:
        return word

#---------------------------------------------------------
# The main function will include all of the code that will
# perform actions that are not contained within our other
# functions, and will generally call on those other functions
# to perform required tasks
#---------------------------------------------------------
def main():
    # Grabbing user input
    inputFile = input('Enter the name of the file to input from: ')
    outputFile = input('Enter the name of the file to output to: ')

    english = {}  # Will contain all available correctly spelled words.
    wrong = []    # Will contain all incorrectly spelled words.
    num = 0       # Used as a line counter.

    # Opening, closing, and adding words to the spell check dictionary
    with open('wordlist.txt', 'r') as c:
        for line in c:
            key = line.strip()
            english[key] = ''

    # Opening, closing, checking words, and adding wrong ones to the wrong list
    with open(inputFile, 'r') as i:
        for line in i:
            line = line.strip()
            fun = spell_check(line, english)
            if fun is not None:
                wrong.append(fun)

    # Opening, closing, and writing to the output file
    with open(outputFile, 'w') as o:
        for i in wrong:
            o.write('%d %s\n' % (num, i))
            num += 1

main()

Split function keeps punctuation attached to words. Want to prevent. After splitting, how to alphabetize?

I have two issues. Here goes the code:
Read = open("C:\Users\Moondra\Desktop/test1.txt", 'r')
text = Read.read()
words = text.split()

print(words)
print(words.sort())

##counts = dict()
##for word in words:
##    counts[word] = counts.get(word, 0) + 1
##
##print counts
And the text that I am trying to read:
test1.txt
Hello Hello Hello.
How is everything. What is going on?
Where are you? Hello!!
Hope to see you soon.
When are you coming by?
What should I make for dinner?
The end!
End of text from txt file
My two questions are the following:
I'm trying to implement a count-each-word code where I count the number of times each word appears in a document.
However, when I split the words using the above code, "Hello" appears as "Hello!" or even "Hello." as separate words. How can I avoid this?
Next, I tried to sort the elements of the list alphabetically, but all I get back from the sort() method is None, which is really confusing me.
Thanks!
This code should work for what you described:
import re

with open(r"C:\Users\Moondra\Desktop/test1.txt", 'r') as fh:
    text = fh.read()

words_list = re.findall(r"\w+", text)
words_list = sorted(words_list, key=str.lower)

patterns = ["Hello"]
counter = 0
for word in words_list:
    for pattern in patterns:
        if word == pattern:
            counter += 1

print("The word Hello occurred {0} times".format(counter))  # the number of times 'Hello' was found
print(words_list)  # your list, alphabetized
There are a few things you should note however:
I used the re module instead of split(). This is because using the regular expression engine in the re module is much less complex here than trying to split the strings with the split() function and then stripping the punctuation off each word.
I renamed some of your variables to follow the PEP8 guide and naming convention for Python. Feel free to rename to your liking.
the reason that sort() returned None is that the sort() method of a list does not return a new list; it sorts the existing list in place, and its return value is None. If you want a sorted copy, use the built-in sorted() function instead, which returns a new list.
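For the original count-each-word goal, collections.Counter pairs naturally with the same re.findall() tokenization. A sketch; the path is the one from the question, written as a raw string so the backslashes are taken literally:

import re
from collections import Counter

with open(r"C:\Users\Moondra\Desktop\test1.txt", 'r') as fh:
    text = fh.read()

# Count each word, ignoring case; punctuation is excluded by \w+.
counts = Counter(w.lower() for w in re.findall(r"\w+", text))
for word, count in counts.most_common():
    print(word, count)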

Nested for loop in python does not increment the outer for loop

I have two files: q.txt contains words and p.txt contains sentences. I need to check if any of the words in q.txt is present in p.txt. Following is what I wrote:
#!/usr/bin/python
twts = open('p.txt', 'r')
words = open('q.txt', 'r')

for wrd in words:
    for iter in twts:
        if wrd in iter:
            print "Found at line" + iter
It does not print any output even when there is a match. I can also see that the outer for loop does not proceed to the next value in the words object. Could someone please explain what I am doing wrong here?
Edit 1: I'm using Python 2.7
Edit 2: Sorry I've mixed up the variable names. Have corrected it now.
When you iterate over a file object, the cursor ends up at the end of the file once the iteration completes, so trying to iterate over it again (in the next iteration of the outer for loop) will not work. The easiest fix is to seek back to the start of the file at the top of the outer for loop. Example -
#!/usr/bin/python
words = open('q.txt', 'r')
twts = open('p.txt', 'r')

for wrd in words:
    twts.seek(0)
    for twt in twts:
        if wrd.strip() in twt:
            print "Found at line " + twt
Also, according to the question, it seems you have the files mixed up: twts should be the file with sentences and words the file with words, but you have opened p.txt for words and q.txt for sentences. If it is the opposite, you should swap the two files.
I would also advise against using iter as a variable name, since it is the name of a built-in function; writing for iter in twts shadows the built-in iter().
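An alternative to seeking back, assuming the words file fits in memory, is to read it into a list once up front so it can be iterated over repeatedly. A sketch:

#!/usr/bin/python
with open('q.txt', 'r') as words_file:
    words = [w.strip() for w in words_file]

with open('p.txt', 'r') as twts:
    for twt in twts:
        for wrd in words:
            if wrd in twt:
                print "Found at line " + twt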
It would be better if you had posted the contents of the files, but have you stripped the \n from the lines? This works for me:
words = open('words.txt', 'r')
twts = open('sentences.txt', 'r')

for w in words:
    for t in twts:
        if w.rstrip('\n') in t.rstrip('\n'):
            print w, t
It seems that you mixed up the 2 files. You say that q.txt contains the words, but you stored p.txt into the words variable.
When you iterate over twts, the iterator is exhausted after the first pass: the file pointer sits at the end of the file, so there is nothing left to iterate over on later iterations of the outer loop. You can seek(0) repeatedly, but if the words file is not huge you can instead build a set of all the words, so you only iterate over the sentences file once, giving O(n*k) running time instead of a quadratic solution that reads every single line for every word in your words file. Splitting each line also matches exact words, not substrings:
from string import punctuation

with open('p.txt', 'r') as twts, open('q.txt', 'r') as words:
    st = set(map(str.rstrip, words))
    for line in twts:
        if any(word.rstrip(punctuation) in st for word in line.split()):
            print("Found at line {}".format(line))

How to remove special characters from txt files using Python

from glob import glob

pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)

def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())

print "There are", sum(map(countwords, filelist)), "words in the files. From directory", pattern

import os

uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root, name)).read().split()]

print "There are", len(uniquewords), "unique words in the files. From directory", pattern
So far, my code is this. It counts the number of unique words and the total words from D:\report\shakeall\*.txt.
The problem is that, for example, this code recognizes code, code. and code! as different words. So this can't give an exact number of unique words.
I'd like to remove the special characters from the 42 text files using a Windows text editor, or make an exception rule that solves this problem.
If using the latter, how should I write my code? Should it directly modify the text files, or make an exception so that special characters aren't counted?
import re

text = open('a.txt').read()
new_text = re.sub(r'[^a-zA-Z0-9\n.]', ' ', text)
open('b.txt', 'w').write(new_text)
It will change every non-alphanumeric character (other than newlines and periods) to a space.
I'm pretty new and I doubt this is very elegant at all, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).
As far as the actual code, it might be something like this (but maybe someone better than me can confirm/improve on it):
fileString = fileString.translate(None, string.punctuation)
where "fileString" is the string that your open(fp) read in. None is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols), is the set of characters that will be deleted from your string. Since strings are immutable, translate() returns a new string, so the result is assigned back to fileString.
In the event that the above doesn't work, you could modify it as follows:
from string import maketrans, punctuation

inChars = punctuation
outChars = ' ' * len(punctuation)  # maketrans needs two equal-length strings,
                                   # so this maps punctuation to spaces
translateTable = maketrans(inChars, outChars)
fileString = fileString.translate(translateTable)
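For reference, in Python 3 the translate()/maketrans() pair moved onto str itself, and the three-argument form of str.maketrans() can delete characters outright. A sketch, where fileString stands for the same string read from the file as above:

import string

# Python 3: the third argument lists characters to delete.
table = str.maketrans('', '', string.punctuation)
fileString = fileString.translate(table)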
There are a couple of other answers to similar questions I found via a quick search. I'll link them here too, in case you can get more from them.
Removing Punctuation From Python List Items
Remove all special characters, punctuation and spaces from string
Strip Specific Punctuation in Python 2.x
Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try what I've said and become frustrated.
import re
Then replace
[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
By
[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root, name)).read().split()]
This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
When working on Linux, some system files under /proc contain characters with ASCII value 0.
full_file_path = 'test.txt'
result = []

with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
    for c in line:
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)

print(''.join(result))
