Reading the files and finding the unique set of words - Python

I am new to the world of Python and programming in general.
I have a folder which consists of two .txt files. I want to read the files and create a data structure to store all the unique words in those files. This is what I have written:
import glob
import errno

path = '/path/to/my/files/*.txt'
files = glob.glob(path)
for name in files:
    try:
        with open(name, encoding="ISO-8859-1") as f:
            f.read()
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
But I don't know how to modify the program to find the unique words. I would appreciate it if you could guide me. Thank you.

You may do this:
import glob
import errno

path = '/path/to/my/files/*.txt'
files = glob.glob(path)
unique = set()
for name in files:
    try:
        with open(name, encoding="ISO-8859-1") as f:
            data = f.read()
            for word in data.split(' '):
                if word.strip():
                    unique.add(word)
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
print(unique)

[Edited] Changed dictionary to set.
Use a set to save the words.
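A side note on splitting (my own observation, not part of the answers above): str.split(' ') breaks only on single space characters, so words joined by newlines stay glued together and doubled spaces produce empty strings, whereas a bare str.split() breaks on any run of whitespace:

```python
data = "apple banana\ncherry  apple"

# split(' ') breaks only on single spaces: the newline-joined pair
# stays together and the doubled space yields an empty string.
print(data.split(' '))   # ['apple', 'banana\ncherry', '', 'apple']

# split() with no argument breaks on any whitespace run.
print(data.split())      # ['apple', 'banana', 'cherry', 'apple']

# A set over the whitespace-split tokens gives the unique words.
print(set(data.split()))
```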
I recommend creating a function that reads a file, and then using it in your for loop.
For example:
term_list = set()

def unique_words(filename):
    with open(filename, "r") as text:
        for line in text:
            if line != '\n':
                for word in line.strip().split(' '):
                    term_list.add(word)

Try adding encoding="latin-1" to the open call. So:
with open(name, encoding="latin-1") as f:
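If the true encoding is uncertain, another hedged option is errors="replace", which substitutes U+FFFD for undecodable bytes instead of raising UnicodeDecodeError (sample.txt is a made-up demo file, not one from the question):

```python
# Create a demo file containing a Latin-1 byte (0xE9, "é").
with open("sample.txt", "w", encoding="latin-1") as f:
    f.write("caf\xe9")

# Reading it as UTF-8 would normally fail; errors="replace"
# turns the bad byte into the replacement character instead.
with open("sample.txt", encoding="utf-8", errors="replace") as f:
    data = f.read()

print(data)  # the 0xE9 byte is not valid UTF-8, so it becomes U+FFFD
```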


How to read and process data from a set of text files sequentially?

I have 50 text files (namely Force1.txt, Force2.txt, ..., Force50.txt). The files look like this:
0.0000000e+000 -1.4275799e-003
2.0000000e-002 -1.1012760e-002
4.0000000e-002 -1.0298970e-002
6.0000000e-002 -8.9733599e-003
8.0000000e-002 -9.6871497e-003
1.0000000e-001 -1.2236400e-002
1.2000000e-001 -1.4479739e-002
1.4000000e-001 -1.3052160e-002
1.6000000e-001 -1.1216700e-002
1.8000000e-001 -8.6674497e-003
2.0000000e-001 -8.6674497e-003
2.2000000e-001 -1.3358070e-002
2.4000000e-001 -1.7946720e-002
2.6000000e-001 -1.9782179e-002
I wish to read the data from Force1.txt, store it in a list of tuples, and analyze it (the details of such analysis are not relevant to the question). Then I have to do the same with Force2.txt, Force3.txt, and so on.
Here is my attempt:
import os

def load_data(fn):
    with open(fn) as f:
        lines = f.readlines()
    return [tuple(map(float, x)) for x in [row.split() for row in lines]]

def display_data(lst):
    return lst.__repr__().replace('[', '').replace(']', '')

pp = []
for file in os.listdir("dir"):
    if file.endswith('.txt'):
        if file.startswith('Force'):
            print os.path.join(r"dir", file)
            with open(file) as f:
                for line in f:
                    pp.append(map(float, line.split()))

mdb.models['Model-1'].TabularAmplitude(data=pp, name='Table', smooth=SOLVER_DEFAULT, timeSpan=STEP)
I'm getting this error:
'Table', smooth=SOLVER_DEFAULT, timeSpan=STEP):
Invalid time values, expected monotonically increasing numbers
How can I solve this issue?
This code should do it:
import os

def load_data(fn):
    with open(fn) as f:
        lines = f.readlines()
    return [tuple(map(float, x)) for x in [row.split() for row in lines]]

def display_data(lst):
    return lst.__repr__().replace('[', '').replace(']', '')

dirname = r"Your dir name goes here"
for filename in os.listdir(dirname):
    if filename.endswith('.txt'):
        if filename.startswith('Force'):
            pathfile = os.path.join(dirname, filename)
            print pathfile
            pp = load_data(pathfile)
            print display_data(pp)
            mdb.models['Model-1'].TabularAmplitude(data=pp,
                                                   name='Table',
                                                   smooth=SOLVER_DEFAULT,
                                                   timeSpan=STEP)
You just need to update dirname with the name of the directory which contains the text files. I also recommend not using file as a variable identifier, because file is a built-in name in Python; I've used filename instead.
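One caveat the answer doesn't mention: os.listdir returns names in arbitrary order, and even sorted() puts Force10.txt before Force2.txt because string comparison is lexicographic. If the files must be processed in numeric order, a small sort key helps (the filenames below are assumptions matching the question):

```python
import re

filenames = ["Force10.txt", "Force1.txt", "Force2.txt"]

def force_number(name):
    # Extract the integer embedded in e.g. "Force10.txt".
    return int(re.search(r'\d+', name).group())

# Sorting on the extracted integer gives the natural numeric order.
ordered = sorted(filenames, key=force_number)
print(ordered)  # ['Force1.txt', 'Force2.txt', 'Force10.txt']
```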

How do I perform error handling with two files?

I have two files, and to check their validity I am performing try and except twice. But I don't think this is a good method; can you suggest a better way?
Here is my code:
def form_density_dictionary(self, word_file, fp_exclude):
    self.freq_dictionary = {}
    try:
        with open(fp_exclude, 'r') as fp2:
            words_excluded = fp2.read().split()  # words to be excluded, stored in a list
            print("**Read file successfully :" + fp_exclude + "**")
            words_excluded = [words.lower() for words in words_excluded]  # converted to lowercase
    except IOError:
        print("**Could not read file:", fp_exclude, " :Please check file name**")
        sys.exit()

    try:
        with open(word_file, 'r') as file:
            print("**Read file successfully :" + word_file + "**")
            words_list = file.read()
            if not words_list:
                print("**No data in file:", word_file + ":**")
                sys.exit()
            words_list = words_list.split()
            words_list = [words.lower() for words in words_list]  # lowercasing entire list
            unique_words = list(set(words_list) - set(words_excluded))
            self.freq_dictionary = {word: ("%6.2f" % (float(words_list.count(word)) / len(words_list) * 100)) for word in unique_words}
            # print(len(self.freq_dictionary))
    except IOError:
        print("**Could not read file:", word_file, " :Please check file name**")
        sys.exit()
Any other suggestions to make it more Pythonic are also welcome.
The first thing that jumps out is the lack of consistency and readability: in some lines you indent with 4 spaces, on others you only use two; in some places you put a space after a comma, in others you don't; and in most places you don't have spaces around the assignment operator (=)...
Be consistent and make your code readable. The most commonly used formatting is to use four spaces for indenting and to always have a space after a comma but even more important than that is to be consistent, meaning that whatever you choose, stick with it throughout your code. It makes it much easier to read for everyone, including yourself.
Here are a few other things I think you could improve:
Have a single exception handling block instead of two.
You can also open both files in a single line.
Even better, combine both previous suggestions and have a separate method to read data from the files, thus eliminating code repetition and making the main method easier to read.
For string formatting it's preferred to use .format() instead of %. Check this out: https://pyformat.info/
Overall try to avoid repetition in your code. If there's something you're doing more than once, extract it to a separate function or method and use that instead.
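As a quick illustration of the .format() suggestion (the word and percentage below are sample values, not taken from the original code):

```python
word = "python"
share = 6.25

# Old-style % formatting:
print("%s: %6.2f%%" % (word, share))

# Equivalent str.format() call; {:6.2f} pads to width 6
# with 2 decimal places, just like %6.2f.
print("{}: {:6.2f}%".format(word, share))  # python:   6.25%
```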
Here's your code quickly modified to how I'd probably write it, and taking these things into account:
import sys


class AtifImam:
    def __init__(self):
        self.freq_dictionary = {}

    def form_density_dictionary(self, word_file, exclude_file):
        words_excluded = self.read_words_list(exclude_file)
        words_excluded = self.lowercase(words_excluded)

        words_list = self.read_words_list(word_file)
        if len(words_list) == 0:
            print("** No data in file: {} **".format(word_file))
            sys.exit()

        words_list = self.lowercase(words_list)
        unique_words = list(set(words_list) - set(words_excluded))
        self.freq_dictionary = {
            word: ("{:6.2f}".format(
                float(words_list.count(word)) / len(words_list) * 100))
            for word in unique_words
        }

    @staticmethod
    def read_words_list(file_name):
        try:
            with open(file_name, 'r') as file:
                data = file.read()
                print("** Read file successfully: {} **".format(file_name))
                return data.split()
        except IOError as e:
            print("** Could not read file: {0.filename} **".format(e))
            sys.exit()

    @staticmethod
    def lowercase(word_list):
        return [word.lower() for word in word_list]
Exceptions thrown that involve a file system path have a filename attribute that can be used instead of explicit attributes word_file and fp_exclude as you do.
This means you can wrap these IO operations in the same try-except and use the exception_instance.filename which will indicate in which file the operation couldn't be performed.
For example:
try:
    with open('unknown_file1.py') as f1, open('known_file.py') as f2:
        f1.read()
        f2.read()
except IOError as e:
    print("No such file: {0.filename}".format(e))
Eventually prints out:
No such file: unknown_file1.py
While the opposite:
try:
    with open('known_file.py') as f1, open('unknown_file2.py') as f2:
        f1.read()
        f2.read()
except IOError as e:
    print("No such file: {0.filename}".format(e))
Prints out:
No such file: unknown_file2.py
To be more 'Pythonic' you could use something called a Counter, from the collections library.
from collections import Counter

def form_density_dictionary(self, word_file, fp_exclude):
    success_msg = '**Read file successfully: {filename}**'
    fail_msg = '**Could not read file: {filename}: Please check file name**'
    empty_file_msg = '**No data in file: {filename}:**'

    exclude_read = self._open_file(fp_exclude, success_msg, fail_msg, '')
    exclude = Counter([word.lower() for word in exclude_read.split()])

    word_file_read = self._open_file(word_file, success_msg, fail_msg, empty_file_msg)
    words = Counter([word.lower() for word in word_file_read.split()])

    unique_words = words - exclude
    self.freq_dictionary = {word: '{:.2f}'.format(count / len(unique_words))
                            for word, count in unique_words.items()}
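To show what the Counter subtraction above does (toy words, chosen for illustration; note that subtraction differs slightly from set difference):

```python
from collections import Counter

words = Counter(["the", "cat", "sat", "the", "mat"])
exclude = Counter(["the", "a", "an"])

# Counter subtraction keeps only positive counts: "a" and "an"
# never appeared, and "the" (count 2) minus "the" (count 1)
# survives with count 1 -- unlike a set difference, which would
# remove "the" entirely.
remaining = words - exclude
print(remaining)
```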
It would also be better to create a separate file-opening method, like:
def _open_file(self, filename, success_msg, fail_msg, empty_file_msg):
    try:
        with open(filename, 'r') as file:
            if success_msg:
                print(success_msg.format(filename=filename))
            data = file.read()
            if not data and empty_file_msg:
                print(empty_file_msg.format(filename=filename))
            return data
    except IOError:
        if fail_msg:
            print(fail_msg.format(filename=filename))
        sys.exit()

Searching and extracting WH-word from a file line by line with Python and regex

I have a file that has one sentence per line. I am trying to read the file, check whether each sentence is a question using a regex, extract the wh-word from those sentences, and save the results to another file in the order they appeared in the first file.
This is what I have so far..
def whWordExtractor(inputFile):
    try:
        openFileObject = open(inputFile, "r")
        try:
            whPattern = re.compile(r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)
            with openFileObject as infile:
                for line in infile:
                    whWord = whPattern.search(line)
                    print whWord

                    # Save the whWord extracted from inputFile into another whWord.txt file
                    # writeFileObject = open('whWord.txt', 'a')
                    # if not whWord:
                    #     writeFileObject.write('None' + '\n')
                    # else:
                    #     whQuestion = whWord
                    #     writeFileObject.write(whQuestion + '\n')
        finally:
            print 'Done. All WH-word extracted.'
            openFileObject.close()
    except IOError:
        pass
The result after running the code above: set([])
Is there something I am doing wrong here? I would be grateful if someone can point it out to me.
Something like this:
def whWordExtractor(inputFile):
    try:
        with open(inputFile) as f1:
            whPattern = re.compile(r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)
            with open('whWord.txt', 'a') as f2:  # open the output file only once, to reduce I/O operations
                for line in f1:
                    whWord = whPattern.search(line)
                    print whWord
                    if not whWord:
                        f2.write('None' + '\n')
                    else:
                        # re.search returns an SRE_Match object, not a string, so use
                        # whWord.group() -- or better, whPattern.findall(line)
                        whQuestion = whWord.group()
                        f2.write(whQuestion + '\n')
        print 'Done. All WH-word extracted.'
    except IOError:
        pass
Not sure if it's what you're looking for, but you could try something like this:
def whWordExtractor(inputFile):
    try:
        whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
        with open(inputFile, "r") as infile:
            for line in infile:
                whMatch = whPattern.search(line)
                if whMatch:
                    whWord = whMatch.group()
                    print whWord
                    # save to file
                else:
                    pass  # no match
    except IOError:
        pass
Change '(.*)who|what|how|where|when|why|which|whom|whose(\.*)' to
".*(?:who|what|how|where|when|why|which|whom|whose).*\."

How to read and divide individual lines of a file in python?

Thanks to Stack Overflow, I am able to read and copy a file. However, I need to read a picture file one line at a time, and the buffer array can't exceed 3,000 integers. How would I separate the lines, read them, and then copy them? Is that the best way to do this?
Here is my code, courtesy of @Chayim:
import os
import sys
import shutil
import readline

source = raw_input("Enter source file path: ")
dest = raw_input("Enter destination path: ")

file1 = open(source, 'r')

if not os.path.isfile(source):
    print "Source file %s does not exist." % source
    sys.exit(3)

file_line = infile.readline()

try:
    shutil.copy(source, dest)
    infile = open(source, 'r')
    outfile = open(dest, 'r')
    file_contents = infile.read()
    file_contents2 = outfile.read()
    print(file_contents)
    print(file_contents2)
    infile.close()
    outfile.close()
except IOError, e:
    print "Could not copy file %s to destination %s" % (source, dest)
    print e
    sys.exit(3)
I added
file_line = infile.readline()
but I'm concerned that infile.readline() will return a string, instead of integers. Also, how do I limit the number of integers it processes?
I think you want to do something like this:
infile = open(source, 'r')
file_contents_lines = infile.readlines()
for line in file_contents_lines:
    print line
This will get you all the lines in the file and put them into a list, with each line as an element in the list.
Take a look at the documentation here.
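Since readlines() loads the whole file at once, and the question mentions a 3,000-integer buffer limit, a hedged alternative is to iterate the file lazily and hand off fixed-size buffers (the limit, filename, and generator shape here are my assumptions, not part of the original answer):

```python
BUFFER_LIMIT = 3000  # maximum integers to keep in memory at once (assumed limit)

def read_limited(path):
    """Yield lists of at most BUFFER_LIMIT integers from a text file."""
    buffer = []
    with open(path) as infile:
        for line in infile:              # reads one line at a time
            for token in line.split():
                buffer.append(int(token))
                if len(buffer) >= BUFFER_LIMIT:
                    yield buffer         # hand off a full buffer
                    buffer = []
    if buffer:                           # whatever is left at end of file
        yield buffer
```

Each yielded list can then be processed or copied to the destination before the next one is read, so memory use stays bounded.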

Python: Regular Expression only works on directories with 1 file in

I have the following function which uses a RE:
def friendSearch():
    os.chdir("C:/Users/David/myFiles")
    files = os.listdir(".")
    for x in files:
        inputFile = open(x, "r")
        content = inputFile.read()
        inputFile.close()
        match = re.search(r'(?<="NAME":)("[^"]+")', content)
    print (match)
It works fine when the file containing the string is in a directory on its own, but when other files are added to the directory it returns nothing.
Is this because "match" is over written with each file that is processed? If so how can I stop this?
Thanks in advance
You are correct that the issue is match being overwritten with each file. I am assuming you want a single list with all of the matches from each file, so instead of doing match = ..., use matches.extend(...) and initialize matches to an empty list before your loop.
For example:
def friendSearch():
    matches = []
    os.chdir("C:/Users/Luke/Desktop/Files")
    files = os.listdir(".")
    for x in files:
        inputFile = open(x, "r")
        try:
            content = inputFile.read()
        except UnicodeDecodeError:
            continue
        inputFile.close()
        matches.extend(re.findall(r'(?<="text":)("[^"]+")', content))
    print (matches)
Your match will contain search results from the very last file only. This line:
match = re.findall(r'(?<="text":)("[^"]+")',content)
discards what was in match before it. Try this:
match = []
for x in files:
    inputFile = open(x, "r")
    try:
        content = inputFile.read()
    except UnicodeDecodeError:
        continue
    inputFile.close()
    match = match + re.findall(r'(?<="text":)("[^"]+")', content)
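To see what the lookbehind pattern in these answers actually returns, here is a small demo on a made-up JSON-like line (using the "text" key from the answers, not the "NAME" key from the question):

```python
import re

content = '{"text":"hello","x":0,"text":"world"}'

# (?<="text":) is a zero-width lookbehind: it requires "text": to
# appear immediately before, but findall returns only the captured
# quoted string, one entry per occurrence.
matches = re.findall(r'(?<="text":)("[^"]+")', content)
print(matches)  # ['"hello"', '"world"']
```

Note the lookbehind allows no whitespace after the colon, so `"text": "hello"` (with a space) would not match this pattern.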
