I'm learning text cleaning with Python online.
I have removed some stop words and lowercased the letters,
but when I execute this code, it doesn't show anything,
and I don't know why.
# we add some words to the stop word list
texts, article = [], []
for w in doc:
    # if it's not a stop word or punctuation mark, add it to our article!
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':
        # we add the lemmatized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []
When I try to output texts, it's just blank.
I believe the 'texts' list and the 'article' list refer to the same content, and hence clearing one list's content in place also clears the other.
Here is a link to a similar question: Python: Append a list to another list and Clear the first list.
Please see if the above is useful.
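For illustration, here is a minimal, self-contained sketch (not taken from the question's code) showing the difference between clearing a shared list in place and rebinding the name:

article = ['a', 'b']
texts = []
texts.append(article)

article.clear()        # clears in place: the list inside texts is the same object
print(texts)           # [[]]

article = ['c', 'd']
texts.append(article)
article = []           # rebinds the name: the appended list is untouched
print(texts)           # [[], ['c', 'd']]

So if the real code empties article in place rather than rebinding it, every entry already appended to texts is wiped as well.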
So my task is as follows:
Write a program that displays a list of all the unique words found in the file uniq_words.txt.
Print your results in alphabetical order and in lowercase. Hint: store the words as the elements of a set;
remove punctuation by using string.punctuation from the string module.
Currently, the code that I have is:
def main():
    import string
    with open('uniq_words.txt') as content:
        new = sorted(set(content.read().split()))
        for i in new:
            while i in string.punctuation:
                new.discard(i)
                print(new)

main()
If I run the code as is, it goes into an infinite loop, printing the unique words over and over again. There are still words in my set that appear as e.g. "value." or "never/have". How do I remove the punctuation with string.punctuation? Or am I approaching this from the wrong direction? I would appreciate any advice!
Edit: The link does not help me, in that the method given does not work on a list.
My solution:

import string

with open('sample_string.txt') as content:
    sample_string = content.read()

print(sample_string)
# Sample string: containing punctuation! As well as CAPITAL LETTERS and duplicates duplicates.

sample_string = sample_string.strip('\n')
sample_string = sample_string.translate(str.maketrans('', '', string.punctuation)).lower()
out = sorted(set(sample_string.split(" ")))
print(out)
# ['and', 'as', 'capital', 'containing', 'duplicates', 'letters', 'punctuation', 'sample', 'string', 'well']
This is actually two tasks, so let's split this into two questions. I will deal with your problem of stripping punctuation, because you have shown your own efforts in that matter. For the problem of determining the unique words, please open a new question (and also look for similar questions here on Stack Overflow before posting; I am pretty sure you will find something useful!).
You correctly found out that you are ending up in an infinite loop. This is because your while loop condition is always true once i is a punctuation character; removing i from new does not change that. You avoid this by using a simple if condition. Your code is mixing up the concepts of while and if, and your scenario is tailored for an if statement. I think you reached for a while loop because you had the concept of iteration in mind, but you are already iterating in the for loop. So the bug fix would be:
for i in new:
    if i in string.punctuation:
        new.discard(i)
However, a different and more "pythonic" way would be to use a list comprehension instead of a for loop:
with open("uniq_words.txt") as content:
stripped_content = "".join([
x
for x in content.read()
if x not in string.punctuation
])
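To see what that comprehension produces, here is the same idea applied to an inline string (a self-contained illustration; the filename above comes from the question):

import string

text = "value. one, two!"
stripped = "".join(x for x in text if x not in string.punctuation)
print(stripped)  # value one two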
I am creating a Python script that is supposed to load a bunch of PDF files from the system, do some data analysis, and output the results. The nature of the analysis is such that I must parse each PDF by paragraph, and for every paragraph I must iterate over every phrase and check whether some conditions are met.
I am currently parsing using Tika, and this is the way I am evaluating paragraphs.
I load the content, then replace every occurrence of one or more newlines with a unique marker string, replace every remaining newline with a space, and replace the marker with a double newline. I did this so it's clearer which newlines delimit a paragraph. Then I extract the paragraphs and return the list with no duplicates (Tika sometimes duplicates content).
import re

def getpdfcontent(path):
    pdf_content = extract_pdf(path)
    # collapse runs of blank lines into a unique marker, flatten single
    # newlines, then turn the marker back into a paragraph break
    text = re.sub(r"\n{2,}", "<131313>", pdf_content['content'])
    text = text.replace("\n", " ")
    text = text.replace("<131313>", "\n\n")
    paragraphs = extractparagraphs(text.splitlines())
    return removeduplicates(paragraphs)
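To see the newline normalisation in isolation, here is a small standalone illustration (the sample text is made up):

import re

raw = "First line\nof a paragraph.\n\n\nSecond paragraph.\n"
text = re.sub(r"\n{2,}", "<131313>", raw)
text = text.replace("\n", " ")
text = text.replace("<131313>", "\n\n")
print(repr(text))  # 'First line of a paragraph.\n\nSecond paragraph. '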
This is how I extract paragraphs. I check if the current line is empty and, if the current paragraph has something in it, I append it to a list.
def extractparagraphs(lines):
    current = ""
    paragraphs = []
    for line in lines:
        if not line.strip():
            if current.strip():
                paragraphs.append(current)
            current = ""
            continue
        current += line.strip()
    return paragraphs
This is how I get phrases; I might add ! and ? to the split too, as sketched below.
def getphrases(document):
    phrases = []
    phr = document.split(".")
    phrases.extend(phr)
    return phrases
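If I do end up adding ! and ?, a variant I am considering (a sketch using re.split, which also drops empty fragments) would be:

import re

def getphrases(document):
    # split on '.', '!' or '?' and drop empty/whitespace-only fragments
    return [p.strip() for p in re.split(r"[.!?]", document) if p.strip()]

print(getphrases("First phrase. Second one! Third?"))
# ['First phrase', 'Second one', 'Third']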
Now my main question is: can I improve the parsing?
If not, are there any optimisations I can make?
I'm currently working on a file compression task for school, and I find myself unable to understand what's happening in this code (more specifically, what ISN'T happening and why it is not happening).
In this section of the code, what I'm aiming to do is, in non-coding terms, change two adjacent letters which are the same into one symbol, thereby taking up less memory:
for i, word in enumerate(file_contents):
    # file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    for ind, letter in enumerate(word_contents[:-1]):
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
However, when I run the full code with a sample text file, it seemingly doesn't do what I told it to do. For instance, the word 'Sally' should be 'Sa★y' but instead stays the same.
Could anyone help me get on the right track?
EDIT: I missed out a pretty key detail. I want the compressed string to somehow appear back in the original file_contents list where there are double letters, as the purpose of the full compression algorithm is to return a compressed version of the text in an inputted file.
I would suggest using a regex matching the same adjacent characters.
Example:
import re
txt = 'sally and bobby'
print(re.sub(r"(.)\1", '*', txt))
# sa*y and bo*y
The loop and condition checking in your code are not required. Since re.sub works on strings (and word_contents in your code is a list), apply the substitution to each word and store the result back:
file_contents[i] = re.sub(r"(.)\1", '*', file_contents[i])
There are a few things wrong with your code (I think).
1) split() produces a list, not a str, so when you write enumerate(word_contents[:-1]) you seem to be assuming that gets you a string; at any rate, I'm not sure you are.
But then!
2) With this line:

if word_contents[ind] == word_contents[ind+1]:
    word_contents[ind] = ''
    word_contents[ind+1] = '★'

you're operating on your list again, where it looks pretty clear that you want to be operating on the string (or on a list of characters of the word you're processing). At best this will do nothing, and at worst you're corrupting the word contents list.
So when you perform your modifications, you are modifying the word_contents list and not the slice [:-1] you are actually looping over. There are more issues, but I think that answers your question (I hope).
If you really want to understand what you're doing wrong, I recommend putting in print statements along the way. If you're looking for someone to do your homework for you, another answer already gives you one, I guess.
Here is an example of how you could add logging to the function:
for i, word in enumerate(file_contents):
    # file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    # See what the word contents list actually is
    print(word_contents)
    # See what your slice is actually returning
    print(word_contents[:-1])
    # Unless something modifies your list elsewhere, you probably want to
    # iterate over the whole word list, not just a slice of it.
    for ind, letter in enumerate(word_contents[:-1]):
        # See what your test is comparing
        print(word_contents[ind], word_contents[ind+1])
        # Here you probably actually want word_contents[:-1][ind],
        # which is the list item you iterate over and then the actual
        # string I suspect you get back
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
UPDATE: based on the follow-up questions from the OP, I've made a sample program annotated with descriptions. Note this isn't an optimal solution, but mainly an exercise in teaching flow control and basic structures.
# define the initial data...
file = "sally was a quick brown fox and jumped over the lazy dog which we'll call billy"
file_contents = file.split()

# enumerate isn't needed in your example unless you intend to use the index later (example below)
for list_index, word in enumerate(file_contents):
    # Changing something you iterate over is dangerous and sometimes confusing;
    # in your case you iterated over word_contents and then modified it. If you
    # take out characters you change the indices and size of the structure,
    # making later changes potentially invalid. So we'll create a new data
    # structure to dump the results in.
    compressed_word = []
    # since we have a list of strings, we'll iterate over each string (word) individually
    for character in word:
        # check whether there is any data in the intermediate structure yet;
        # if not, there are no duplicate chars yet
        if compressed_word:
            # if there are chars in the new structure, test whether we hit the same character twice
            if character == compressed_word[-1]:
                # looks like we did; replace it with your star
                compressed_word[-1] = "*"
                # continue skips the rest of this iteration of the loop
                continue
        # if we haven't seen the character before, or it is the first character, just add it to the list
        compressed_word.append(character)
    # I guess this is one reason why you may want enumerate: to update the list with the new item.
    # join() just converts the list back to a string.
    file_contents[list_index] = "".join(compressed_word)

# prints the new version of the original "file" string
print(" ".join(file_contents))
Output: "sa*y was a quick brown fox and jumped over the lazy dog which we'* ca* bi*y"
I am trying to read a file, test.txt, and count the total number of words in it.
I have written the following code for it:
import re
import string

def create_wordlist(filename, is_Gutenberg=True):
    words = 0
    wordList = []
    data = False
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    file1 = open("temp", 'w+')
    with open(filename, 'r') as file:
        if is_Gutenberg:
            for line in file:
                if line.startswith("*** START "):
                    data = True
                    continue
                if line.startswith("End of the Project Gutenberg EBook"):
                    #data = False
                    break
                if data:
                    line = line.strip().replace("-", " ")
                    line = line.replace("_", " ")
                    line = regex.sub("", line)
                    for word in line.split():
                        wordList.append(word.lower())
                    #print(wordList)
                    #words = words + len(wordList)
    return len(wordList)
    #return wordList

create_wordlist('test.txt', True)
Here are a few rules to be followed:
1. Strip off whitespace and punctuation.
2. Replace hyphens with spaces.
3. Skip the file header and footer. The header ends with a line that starts with "*** START OF THIS" and the footer starts with "End of the Project".
My answer is 60513, but the actual answer is 60570 (that number came with the question itself, so it may be correct or wrong). Where am I going wrong?
You give a number for the actual answer -- the answer you consider correct, that you want your code to output -- but you did not tell us how you got that number.
It looks to me like the two numbers come from different definitions of "word".
For example, you have in your example text several numbers in the form:
140,000,000
Is that one word or three?
You are replacing hyphens with spaces, so a hyphenated word will be counted as two. Other punctuation you are removing, which would make the above number (and there are other, similar examples in your text) into one word. Is that what you intended? Is that what was done to get your "correct" number? I suspect this is all or part of your difference.
At a quick glance, I see three numbers in the form above (counted as either 3 or 9, a difference of 6).
I see 127 apostrophes (words like wife's, which could be counted as either one word or two), for a difference of 127.
Your difference is 57, so the answer is not quite so simple, but I still strongly suspect different definitions of what counts as a word for specific corner cases.
By the way, I am not sure why you are collecting all the words into a huge list and then taking its length. You could skip the append loop and just accumulate a sum of len(line.split()). This would remove complexity, which lessens the possibility of bugs (and probably makes the program faster, if that matters in this case).
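For instance, the running-sum version could look roughly like this (a sketch only; count_words is a hypothetical helper, and the header/footer handling from your function is omitted for brevity):

import re
import string

def count_words(filename):
    # same cleaning as the question's code, but keeping a running total
    # instead of collecting every word into a list
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    total = 0
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip().replace("-", " ").replace("_", " ")
            line = regex.sub("", line)
            total += len(line.split())
    return total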
Also, you have a line:
if line.startswith("*** START " in"):
When I try that in my python interpreter, I get a syntax error. Are you sure the code you posted here is what you are running? I would have expected:
if line.startswith("*** START "):
Without an example text file that shows this behaviour, it is difficult to guess what goes wrong. But there is one clue: your number is less than what you expect. That seems to imply that you somehow glue separate words together and count them as a single word. And the obvious candidate for this behaviour is the statement line = regex.sub("", line): this replaces any punctuation character with an empty string, so if the text contains that's, your program changes it to thats.
If that is not the cause, you really need to provide a small sample of text that shows the behaviour you get.
Edit: if your intention is to treat punctuation as word separators, you should replace each punctuation character with a space, so: line = regex.sub(" ", line).
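A quick standalone illustration of the difference:

import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))
print(regex.sub("", "that's all"))   # thats all  -> words glued together
print(regex.sub(" ", "that's all"))  # that s all -> apostrophe becomes a separator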
New to programming, using Python 3.0.
I have to write a program that takes an input file name for a .txt file and reads the file, but only reads certain words, ignoring the floats and integers and any words that don't match anything in another list.
Basically, I have wordsList and messages.txt.
This program has to read through messages.txt (example of the text:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with the love of my life ... ARREIC).
It then has to ignore all the numbers and check whether any of the words in the message match the words in wordsList, and then match those words to a value (int) in hvList.
What I have so far (wordsList and hvList are in another part of the code that I don't think is necessary to show; let me know if you need it):
def tweetsvalues():
    tweetwList = []
    tinputfile = open(tweetsinputfile, "r")
    for line in tinputfile:
        entries = line.split()
The last line, entries = line.split(), is the one that needs to be changed, I'm guessing.
Yes, split is your best friend here. You can also look up the documentation for the str.is...() methods, such as isalpha. Although the full code is beyond the normal scope of Stack Overflow, your central work will look something like:

words = sentence.split()
good_words = [word for word in words if word.isalpha()]

This can be done in a "Pythonic" fashion using filter, removing punctuation, etc. However, from the way you write in your posting, I suspect you can take it from here.
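For example, a tiny sketch of the filter variant (the sample text is loosely based on the message in your question):

words = "Work needs to fly by 6 times".split()
good_words = list(filter(str.isalpha, words))
print(good_words)  # ['Work', 'needs', 'to', 'fly', 'by', 'times']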
This is some code that I wrote earlier today as a really basic spellchecker. I would answer your question more specifically, but I do not have the time right now; this should accomplish what you are looking for. The hard-coded .txt file that I am opening contains a lot of English words that are spelled correctly. Feel free to adapt my ideas into your work as needed, but make sure you understand all of the code you are using; otherwise I would only be hindering your learning by giving this code to you. In your case, you would likely want to output all of the words regardless of their spelling; in my code, I am only outputting incorrectly spelled words. Feel free to ask any questions.
#---------------------------------------------------------
# The spell_check function determines whether a word from
# the input file is spelled correctly, and if not it
# returns the word, which is later written to a file
# containing misspelled words
#---------------------------------------------------------
def spell_check(word, english):
    if word in english:
        return None
    else:
        return word

#---------------------------------------------------------
# The main function includes all of the code that performs
# actions not contained within our other functions, and
# generally calls on those other functions to perform the
# required tasks
#---------------------------------------------------------
def main():
    # Grabbing user input
    inputFile = input('Enter the name of the file to input from: ')
    outputFile = input('Enter the name of the file to output to: ')

    english = {}  # Will contain all available correctly spelled words.
    wrong = []    # Will contain all incorrectly spelled words.
    num = 0       # Used as a line counter.

    # Opening, closing, and adding words to the spell-check dictionary
    with open('wordlist.txt', 'r') as c:
        for line in c:
            key = line.strip()
            english[key] = ''

    # Opening, closing, checking words, and adding wrong ones to the wrong list
    with open(inputFile, 'r') as i:
        for line in i:
            line = line.strip()
            fun = spell_check(line, english)
            if fun is not None:
                wrong.append(fun)

    # Opening, closing, and writing to the output file
    with open(outputFile, 'w') as o:
        for i in wrong:
            o.write('%d %s\n' % (num, i))
            num += 1

main()