Removing custom stop-words from CSV files in Python

Hi, I am new to Python programming and I need help removing custom stop-words from multiple files in a directory. I have read almost all the relevant posts online.
I am using Python 2.7.
Here are two sample lines from one of my files. I want to keep this format and just remove the stop-words from the rows:
"8806";"Demonstrators [in Chad] demand dissolution of Legis Assembly many hurt as police disperse crowd.";"19"
"44801";"Role that American oil companies played in Iraq's oil-for-food program is coming under greater scrutiny.";"19"
I have a list of stop-words in a file called StopWords.dat.
This is my code:
import io
import os
import os.path
import csv

os.chdir('/home/Documents/filesdirectory')
stopwords = open('/home/StopWords.dat', 'r').read().split('\n')
for i in os.listdir(os.getcwd()):
    name = os.path.splitext(i)[0]
    with open(i, "r") as fin:
        with open(name, "w") as fout:
            writer = csv.writer(fout)
            for w in csv.reader(fin):
                if w not in stopwords:
                    writer.writerow(w)
It does not give me any errors, but it creates empty files. Any help is very much appreciated.

import os
import os.path

os.chdir('/home/filesdirectory')
for i in os.listdir(os.getcwd()):
    filein = open(i, 'r').readlines()
    fileout = open(i, 'w')
    stopwords = open('/home/stopwords.dat', 'r').read().split()
    for line in filein:
        linewords = line.split()
        filteredtext1 = [t for t in linewords if t not in stopwords]
        filteredtext = str(filteredtext1)
        fileout.write(filteredtext + '\n')
Well, I solved the problem.
This code removes the stop-words (or any list of words you give it) from each line, writes each line to a file with the same filename, and at the end replaces the old file with a new file without stop-words. Here are the steps:
declare the working directory
enter a loop to go over each file
open the file for reading and read each line using readlines()
open a file for writing
read the stopwords file and split it into words
enter a for loop to deal with each line separately
split the line into words
create a list
add the words of the line to the list if they are not in the stopwords list
convert the list to a string
write the string to the file
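The steps above can be sketched more cleanly in modern Python 3. This is only a sketch under two assumptions: the fields are semicolon-delimited, as the sample rows suggest, and the helper name remove_stopwords_from_csv is mine, not from the original post. Using the csv module with delimiter=';' keeps the "id";"text";"code" layout intact instead of writing a Python list repr to the output file:

```python
import csv

def remove_stopwords_from_csv(in_path, out_path, stopwords):
    """Copy a semicolon-delimited CSV, dropping stop-words inside each field.

    Filtering happens within the fields, so the row structure is preserved.
    """
    stopset = set(stopwords)  # set membership tests are O(1) per word
    with open(in_path, "r") as fin, open(out_path, "w") as fout:
        reader = csv.reader(fin, delimiter=";")
        writer = csv.writer(fout, delimiter=";", quoting=csv.QUOTE_ALL)
        for row in reader:
            writer.writerow(
                [" ".join(w for w in field.split() if w not in stopset)
                 for field in row]
            )
```

Writing to a separate output path also avoids the original pitfall of reading and truncating the same file at once.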

Related

Running out of Memory when trying to find and replace in a CSV

When trying to find and replace on a 12 MB CSV, I am running out of memory.
This code checks a CSV file against a list of 5000 names and replaces each of them with the word 'REDACTED'.
I've tried putting this onto an AWS XL instance and still ran out of memory.
import csv

input_file = csv.DictReader(open("names.csv"))
newword = 'REDACTED'
with open('new.txt', 'w') as outfile, open('test.txt') as infile:
    for line in infile:
        for oldword, newword in input_file:
            line = line.replace(oldword, newword)
            print('Replaced')
        outfile.write(line)
I expect it to output new.txt with the replacements intact, but I am currently getting a MemoryError.
There are multiple problems with your code before we can even check what's causing the MemoryError.
for oldword, newword in input_file: overrides newword = 'REDACTED'.
Then, as far as I know, you cannot iterate over a DictReader multiple times:
input_file = csv.DictReader(open("names.csv"))
for line in infile:
    for oldword, newword in input_file:
And lastly, I assume "names.csv" contains all possible names, so why read it with a DictReader? What is the structure of the names file? If it is a CSV file, shouldn't you only take the values of one column and not the whole line?
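Assuming the names file simply holds one name per line (the post never shows its structure), a streaming sketch avoids the MemoryError by holding only the small name list and one line of the big file in memory at a time. The helper name redact is hypothetical:

```python
def redact(names_path, in_path, out_path, replacement="REDACTED"):
    # Load the (small) name list once; 5000 short strings is tiny in memory.
    with open(names_path) as f:
        names = [line.strip() for line in f if line.strip()]
    # Stream the large file one line at a time so memory use stays constant.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            for name in names:
                line = line.replace(name, replacement)
            fout.write(line)
```

Note the replacement string is a keyword argument rather than a loop variable, so it can no longer be clobbered by the inner loop.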

Issue with appending to txt file

I am trying to read and write to the same file. Currently the data in 2289newsML.txt exists as normal sentences, but I want to append to the file so it stores only tokenized versions of the same sentences.
I used the code below, but even though it prints out tokenized sentences, it doesn't write them to the file.
from pathlib import Path
from nltk.tokenize import word_tokenize

news_folder = Path("file\\path\\")
news_file = (news_folder / "2289newsML.txt")

f = open(news_file, 'r+')
data = f.readlines()
for line in data:
    words = word_tokenize(line)
    print(words)
    f.writelines(words)
f.close
any help will be appreciated.
Thanks :)
from nltk.tokenize import word_tokenize

with open("input.txt") as f1, open("output.txt", "w") as f2:
    f2.writelines("\n".join(word_tokenize(line)) for line in f1)
Using with ensures the file handles will be closed for you, so you do not need f1.close().
This program is writing to a different file.
Of course, you can do it this way too:
f = open(news_file)
data = f.readlines()
file = open("output.txt", "w")
for line in data:
    words = word_tokenize(line)
    print(words)
    file.write('\n'.join(words))
f.close()
file.close()
Output.txt will have the tokenized words.
"I am trying to read and write to the same file. Currently the data in 2289newsML.txt exists as normal sentences but I want to append the file..."
Because you are opening the file in r+ mode.
'r+' opens for reading and writing; the stream is positioned at the beginning of the file.
If you want to append new text at the end of the file, consider opening it in a+ mode instead.
Read more about open
Read more about file modes
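A minimal sketch of the read-then-append approach, using str.split as a stand-in tokenizer so the example has no NLTK dependency (swap word_tokenize back in for real use); the helper name append_tokenized is mine:

```python
def append_tokenized(path):
    """Append a tokenized copy of each existing line to the same file."""
    # Read the existing lines first, with the file opened only for reading.
    with open(path, "r") as f:
        lines = f.readlines()
    # Reopen in append mode: every write now lands at the end of the file,
    # and a trailing newline keeps one token per line.
    with open(path, "a") as f:
        for line in lines:
            f.write("\n".join(line.split()) + "\n")
```

Separating the read pass from the write pass also sidesteps any surprises about where the file position sits after readlines().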

Writing results into a .txt file

I created a script to take two .txt files, compare them, and export the results to another .txt file. Below is my code (sorry about the mess).
Any ideas? Or am I just an imbecile?
Using Python 3.5.2:
# Barcodes Search (V3actual)

# Import the text files, putting them into arrays/lists
with open('Barcodes1000', 'r') as f:
    barcodes = {line.strip() for line in f}
with open('EANstaging1000', 'r') as f:
    EAN_staging = {line.strip() for line in f}

##diff = barcodes ^ EAN_staging
##print (diff)

in_barcodes_but_not_in_EAN_staging = barcodes.difference(EAN_staging)
print(in_barcodes_but_not_in_EAN_staging)

# Exporting in_barcodes_but_not_in_EAN_staging to a .txt file
with open("BarcodesSearch29_06_16", "wt") as BarcodesSearch29_06_16:  # Create .txt file
    BarcodesSearch29_06_16.write(in_barcodes_but_not_in_EAN_staging)  # Write results to the .txt file
From the comments to your question, it sounds like your issue is that you want to save your set of strings to a file. file.write expects a single string as input, while file.writelines expects a sequence of strings, which is what your data appears to be.
with open("BarcodesSearch29_06_16", "wt") as BarcodesSearch29_06_16:
    BarcodesSearch29_06_16.writelines(in_barcodes_but_not_in_EAN_staging)
That will iterate through in_barcodes_but_not_in_EAN_staging and write each element to the file BarcodesSearch29_06_16. Note that writelines adds no separators of its own, so append '\n' to each element if you want one barcode per line.
Try BarcodesSearch29_06_16.write(str(in_barcodes_but_not_in_EAN_staging)) to turn the set into a single string first. There is no need to call BarcodesSearch29_06_16.close() afterwards, since the with block already closes the file for you.
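The one-per-line caveat is worth a short sketch: since writelines() adds no line separators, joining with newlines yourself is the simplest fix. The helper name write_lines and the sorted() call are my additions, since sets have no stable order:

```python
def write_lines(path, items):
    # writelines() adds no separators of its own, so join with newlines first.
    # sorted() gives the unordered set a deterministic order in the file.
    with open(path, "w") as f:
        f.write("\n".join(sorted(items)) + "\n")
```

Sorting also makes the output file easy to diff against a later run of the same comparison.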

How to add a removed word back into the original text file

I'm quite new to Python. With help from the Stack Overflow community, I have managed to do part of my task. My program removes a random word from a small text file, assigns it to a variable, and puts it in another text file.
However, at the end of my program I need to put that random word back into the text file so someone can use the program multiple times.
The words in the text file are in no specific order, but each word is, and needs to be, on a separate line.
This is the program which removes the random word from the text file:
import random

with open("words.txt") as f:  # Open the text file
    wordlist = [x.rstrip() for x in f]

replaced_word = random.choice(wordlist)
newwordlist = [word for word in wordlist if word != replaced_word]

with open("words.txt", 'w') as f:  # Open file for writing
    f.write('\n'.join(newwordlist))
If I have missed out any vital information which is needed I'm happy to provide that :)
Why not just copy the text file at the start of your program? Perform your changes on the copy, so you always leave the original file unaltered.
import random
import shutil

shutil.copyfile("words.txt", "newwords.txt")

with open("newwords.txt") as f:  # Open the text file
    wordlist = [x.rstrip() for x in f]

replaced_word = random.choice(wordlist)
newwordlist = [word for word in wordlist if word != replaced_word]

with open("newwords.txt", 'w') as f:  # Open file for writing
    f.write('\n'.join(newwordlist))
You are overwriting your words.txt file, and therefore losing all the words. If you just write the random word to a new file, you don't need to rewrite your original file. Something like:
...
with open("words_random.txt", 'w') as f:
    f.write(replaced_word)
You will have a new text file containing only the random word.
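To actually put the drawn word back at the end of the program, append mode is enough, since each word lives on its own line. A sketch with hypothetical helper names draw_word and put_back:

```python
import random

def draw_word(path):
    """Remove one random word from the file and return it."""
    with open(path) as f:
        words = [w.strip() for w in f if w.strip()]
    word = random.choice(words)
    words.remove(word)
    # Rewrite the file without the drawn word, one word per line.
    with open(path, "w") as f:
        f.write("\n".join(words) + "\n")
    return word

def put_back(path, word):
    # Append mode restores the word on its own line
    # without touching the rest of the file.
    with open(path, "a") as f:
        f.write(word + "\n")
```

After put_back, the file holds the same set of words as before, so the program can be run again.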

Deleting a specific word from a file in python

I am quite new to Python and have just started importing text files. I have a text file which contains a list of words. I want to be able to enter a word and have that word deleted from the text file. Can anyone explain how I can do this?
text_file = open('FILE.txt', 'r')
ListText = text_file.read().split(',')
DeletedWord = input('Enter the word you would like to delete:')
NewList = ListText.remove(DeletedWord)
I have this so far, which reads the file into a list. I can then delete a word from the new list, but I also want to delete the word from the text file.
Here's what I would recommend, since it's fairly simple and I don't think you're concerned with performance:
f = open("file.txt", 'r')
lines = f.readlines()
f.close()

excludedWord = "whatever you want to get rid of"

newLines = []
for line in lines:
    newLines.append(' '.join([word for word in line.split() if word != excludedWord]))

f = open("file.txt", 'w')
for line in lines:
    f.write("{}\n".format(line))
f.close()
This allows a line to have multiple words on it, but it will work just as well if there is only one word per line.
In response to the updated question:
You cannot edit the file in place (or at least I don't know how); you must instead read all the contents into Python, edit them, and then re-write the file with the altered contents.
Another thing to note: lst.remove(item) will throw out only the first instance of item in lst, so any second instance of item is safe from .remove(). This is why my solution uses a list comprehension to exclude all instances of excludedWord from the list. If you really want to use .remove(), you can do something like this:
while excludedWord in lst:
    lst.remove(excludedWord)
But I would discourage this in favor of the equivalent list comprehension.
We can replace strings in files (some imports needed ;)):
import os
import sys
import fileinput

for line in fileinput.input('file.txt', inplace=1):
    sys.stdout.write(line.replace('old_string', 'new_string'))
I found this (maybe) here: http://effbot.org/librarybook/fileinput.htm
If 'new_string' is changed to '', this is the same as deleting 'old_string'.
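One caveat with the replace-with-'' trick: str.replace deletes substrings, not whole words, so removing "cat" would also clip "category" down to "egory". A hedged sketch using regex word boundaries instead (the helper name delete_word is mine, and it writes to a separate output path rather than in place):

```python
import re

def delete_word(path_in, path_out, word):
    # \b word boundaries match only whole words, so deleting "cat"
    # leaves "category" untouched; re.escape guards special characters.
    pattern = re.compile(r"\b%s\b" % re.escape(word))
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            fout.write(pattern.sub("", line))
```

The same pattern could be dropped into the fileinput loop above in place of line.replace.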
So I was trying something similar; here are some points for people who might end up reading this thread. The only way you can replace the modified contents is by opening the same file in "w" mode; then Python just overwrites the existing file.
I tried this using re and sub():
import re

f = open("inputfile.txt", "rt")
inputfilecontents = f.read()
f.close()

newline = re.sub("trial", "", inputfilecontents)

f = open("inputfile.txt", "w")
f.write(newline)
f.close()
@Wnnmaw, your code is a little bit wrong there; it should go like this:
f = open("file.txt", 'r')
lines = f.readlines()
f.close()

excludedWord = "whatever you want to get rid of"

newLines = []
for line in lines:
    newLines.append(' '.join([word for word in line.split() if word != excludedWord]))

f = open("file.txt", 'w')
for line in newLines:
    f.write("{}\n".format(line))
f.close()
