I'm using this code to take a text file as input and produce a CSV file as output. The CSV file has two columns: one for the words, and the other for the count of each word.
from collections import Counter
file = open(r"/Users/abdullahtammour/Documents/txt/1984/1984.txt", "r", encoding="utf-8-sig")
wordcount={}
wordcount = Counter((file.read().split()))
for item in wordcount.items():
    print("{}\t{}".format(*item), file=open("/Users/abdullahtammour/Documents/txt/1984/1984.csv", "a"))
file.close()
I want to enhance the code and add two features:
1st (and most important): I want only words in the output file, with no numbers and no characters like (*&-//.,!?) and so on.
2nd: turn all the words in the output file to lower case.
Any help will be appreciated.
You can use the string method isalpha() to check if there are only alphabetic characters in a word, and you can use lower() to convert it to lower case. I'm assuming you don't want apostrophes or other punctuation in your words either, but if that is OK then you could strip such characters out with replace, like this:
word.replace("'",'').isalpha()
It's also better to open a file just once than to open and close it a thousand times, which is what you do by opening it in the body of the loop. It is not only inefficient but could conceivably have weird results if buffering is involved.
I rewrote it with a 'with' statement, which is roughly equivalent to opening the file at the beginning of the block and closing it at the end.
Not as important, but you can use the 'sep' keyword in print() instead of manually inserting a tab, like this:
print(arg1, arg2, sep='\t')
Revising your code:
from collections import Counter

file = open(r"/Users/abdullahtammour/Documents/txt/1984/1984.txt", "r", encoding="utf-8-sig")
wordcount = Counter(file.read().split())
file.close()

with open("/Users/abdullahtammour/Documents/txt/1984/1984.csv", "w") as file:
    for word, count in wordcount.items():
        if word.isalpha():
            print(word.lower(), count, sep='\t', file=file)
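If you would rather keep words that contain an apostrophe, the loop could be adapted using the replace() idea from above; whether such words should count is an assumption about your data:

with open("/Users/abdullahtammour/Documents/txt/1984/1984.csv", "w") as file:
    for word, count in wordcount.items():
        # Ignore apostrophes during the check so that words like "don't" pass
        if word.replace("'", "").isalpha():
            print(word.lower(), count, sep='\t', file=file)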
I am working with a VCF file. I am trying to extract information from this file, but the file has errors in its format.
In this file there is a column that contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column. So when I try to read in this tab-delimited file, all the columns are messed up.
I have an idea of how to solve this, but don't know how to execute it in code. The string is DNA, so it always consists of ATCG. Basically, if one could look for a number of tabs and a newline within the characters ATCG and remove them, then the file would be fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to look into this file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then replace this with nothing. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
Here's one way with regex:
First read the file in:
import re
with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", dnafile))
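As a quick sanity check (my addition, not part of the original answer), applying the substitution to the example string from the question gives the expected result:

import re

s = "ACTGCTGA\t\t\t\t\nCTGATCGA"
print(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", s))  # ACTGCTGACTGATCGA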
I have created one txt file consisting of five other text files (alltext.txt). I also have a text file with words on each line (removewords.txt). I would like to remove the words in removewords.txt from alltext.txt, without creating a new text file and without typing the words from removewords.txt manually.
I have thought about using sets, but am a bit confused about how to approach this.
My merging of the files looks like this:
files=["file1.txt", "file2.txt"...."file5.txt"]
with open("compare_out.txt", "w") as fout:
for file in files:
with open (file) as complete_file:
for line in complete_file:
fout.write(line)
Any suggestions? Thank you very much
I would do the following:
read all words from "removewords.txt" into a list called remove_words
read all words from "alltext.txt" into a list called all_words
open the file "alltext.txt" in write mode ("w") and write content to it as follows:
for each word in all_words, check if that word is in the list remove_words. If it is not, write it to "alltext.txt"
Are these steps detailed enough so that you can solve your problem?
If not, comment below on what you are having problems with.
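In case it helps, here is a minimal, untested sketch of those steps; it assumes words are whitespace-separated and that rewriting "alltext.txt" in place is acceptable:

with open("removewords.txt") as f:
    remove_words = f.read().split()

with open("alltext.txt") as f:
    all_words = f.read().split()

# Rewrite alltext.txt, keeping only the words that are not in remove_words
with open("alltext.txt", "w") as f:
    for word in all_words:
        if word not in remove_words:
            f.write(word + " ")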
If it is not a problem, you can load all the words to remove into a set using split, then check each word before you write it to the output file.
split separates a string into list elements based on a delimiting character; in the case of words we can use a space character " " to separate each word from the others.
rm_word_file = open('removewords.txt', 'r')
remove_words = set(rm_word_file.read().split(" "))
rm_word_file.close()

files = ["file1.txt", "file2.txt"...."file5.txt"]

with open("compare_out.txt", "w") as fout:
    for file in files:
        with open(file) as complete_file:
            for line in complete_file:
                # Write back only the words that are not in the removal set
                kept = [word for word in line.split(" ") if word not in remove_words]
                fout.write(" ".join(kept))
Something else to think about is, if there is punctuation in your text body, how you will handle that?
You can just remove all punctuation, but then its and it's would be treated as the same word, which may not be the intended behaviour.
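For illustration (my own sketch, not part of the original answer), removing every character found in string.punctuation conflates the two:

import string

word = "it's"
# Dropping every punctuation character makes "it's" identical to "its"
no_punct = ''.join(ch for ch in word if ch not in string.punctuation)
print(no_punct)  # its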
Issue: Remove the hyperlinks, numbers and signs like ^&*$ etc from twitter text. The tweet file is in CSV tabulated format as shown below:
s.No. username tweetText
1. #abc This is a test #abc example.com
2. #bcd This is another test #bcd example.com
Being a novice at Python, I searched and strung together the following code, thanks to the code given here:
import re

fileName = "path-to-file//tweetfile.csv"
fileout = open("Output.txt", "w")
with open(fileName, 'r') as myfile:
    data = myfile.read().lower()  # read the file and convert all text to lowercase
    clean_data = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", data).split())  # regular expression to strip the html out of the text
    fileout.write(clean_data + '\n')  # write the cleaned data to a file
fileout.close()
myfile.close()
print "All done"
It does the data stripping, but the output file format is not what I want. The output text file is all on a single line, like this:
s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet
How can I fix this code to give me an output like given below:
s.No. username tweetText
1 abc This is a test
2 bcd This is another test
3 efg This is yet another test
I think something needs to be added in the regular expression code but I don't know what it could be. Any pointers or suggestions will be helpful.
You can read the line, clean it, and write it out in one loop. You can also use the CSV module to help you build out your result file.
import csv
import re

exp = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]

with open('input.csv', 'r') as i, open('output.csv', 'wb') as o:
    reader = csv.reader(i, delimiter=',')  # Comma is the default
    writer = csv.writer(o, delimiter=',')

    # Take the first row from the input file (the header)
    # and write it to the output file
    writer.writerow(next(reader))

    for row in reader:
        writer.writerow(cleaner(row))
The csv module knows how to correctly add separators between items, as long as you pass it a collection of items.
So what the cleaner function does is take each item (column) in the row from the input file, apply the substitution to the lowercase version of the item, and then return the results as a list.
The rest of the code is simply opening the files and configuring the csv module with the separators you want for the input and output files (in the example code, the separator for both files is a comma, but you can change the output separator).
Next, the first row of the input file is read and written out to the output file. No transformation is done on this row (which is why it is not in the loop).
Reading that row automatically puts the file pointer on the next row, so then we simply loop through the input rows (in reader), apply the cleaner function to each row to get a list back, and write that list to the output file with writer.writerow().
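For example, if you wanted tab-separated output to match the question's sample layout (my assumption of the intent), only the writer line changes:

writer = csv.writer(o, delimiter='\t')  # write tab-separated rows instead of comma-separated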
Instead of applying the re.sub() and .lower() expressions to the entire file at once, try iterating over each line in the CSV file, like this:
for line in myfile:
    line = line.lower()
    line = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", line)
    fileout.write(line + '\n')
Also, when you use the with <file> as myfile statement, there is no need to close the file at the end of your program; this is done automatically when you use with.
Try this regex:
clean_data=' '.join(re.sub("[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)"," ",data).split()) # regular expression to strip the html out of the text
Explanation:
[#\^&\*\$] matches on the characters, you want to replace
#\S+matches on hash tags
\S+[a-z0-9]\.(com|net|org) matches on domain names
If the URLs can't be identified by https?, you'll have to complete the list of potential TLDs.
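As a small demo (my own sketch, reusing sample tweet text from the question):

import re

data = "This is a test #abc example.com ^&*$"
clean_data = ' '.join(re.sub("[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)", " ", data).split())
print(clean_data)  # This is a test abc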
I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split each element into individual words using split(), which would give you a nested list that you can easily index (assuming your text has whitespace, of course).
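For instance, a one-line sketch of that splitting step (my addition), so the indexing below operates on words rather than characters:

words = [line.split() for line in clean_text]  # nested list: words[line][word]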
words[4][7]
would return the 8th word of the 5th line.
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []

with open("file.txt", "r") as f:  # "r" = read mode
    for line in f:
        if line[:2] != "##":  # check the first two characters
            listoflines.append(line)

print listoflines
If you're feeling brave, you can also do the following, CREDITS GO TO ALEX THORNTON:
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially teaching the .startswith function, but I think this is the more pythonic way and also has the advantage of automatically closing the file as soon as you're done with it.
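Spelled out with the file handling included (a sketch, assuming the same hypothetical filename as above):

with open("file.txt", "r") as f:
    listoflines = [l for l in f if not l.startswith('##')]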
Strange question here.
I have a .txt file that I want to iterate over. I can get all the words from the file into an array, which is good, but what I want to know is how to iterate over the whole file by words, not by individual letters.
I want to be able to go through the array which houses all the text from the file, and basically count all the instances in which a word appears in it.
Only problem is I don't know how to write the code for it.
I tried using a for loop, but that just iterates over every single letter, when I want the whole words.
This code reads the space separated file.txt
f = open("file.txt", "r")
words = f.read().split()
for w in words:
    print w
file = open("test")
for line in file:
    for word in line.split(" "):
        print word
Untested:
def produce_words(file_):
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('in.txt', 'r') as file_:
        for word in produce_words(file_):
            print word
If you want to loop over an entire file, then the sensible thing to do is to iterate over it, taking the lines and splitting them into words. Working line by line is best, as it means we don't read the entire file into memory first (which, for large files, could take a lot of time or cause us to run out of memory):
with open('in.txt') as input:
    for line in input:
        for word in line.split():
            ...
Note that you could use line.split(" ") if you want to preserve more whitespace, as line.split() will remove all excess whitespace.
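To illustrate the difference (my own small example):

line = "two  spaces\n"
print(line.split(" "))  # ['two', '', 'spaces\n'] -- keeps the empty string and the newline
print(line.split())     # ['two', 'spaces'] -- collapses runs of whitespace and strips it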
Also note my use of the with statement to open the file, as it's more readable and handles closing the file, even on exceptions.
While this is a good solution, if you are not doing anything within the first loop, it's also a little inefficient. To reduce this to one loop, we can use itertools.chain.from_iterable and a generator expression:
import itertools

with open('in.txt') as input:
    for word in itertools.chain.from_iterable(line.split() for line in input):
        ...