Naming lists based on the names of text files - python

I'm trying to write a function which takes word lists from text files and appends each word in the file to a list with the same name as the text file. For instance, using the text files Verbs.txt and Nouns.txt would result in all the words in Verbs.txt being in a verbs list and all the words in Nouns.txt in a nouns list. I'm trying to do it in a for loop:
    def loadAllWords():
        fileList = ['Adjectives.txt', 'Adverbs.txt', 'Conjunctions.txt',
                    'IntransitiveVerbs.txt', 'Leadin.txt', 'Nounmarkers.txt',
                    'Nouns.txt', 'TransitiveVerbs.txt']
        for file in fileList:
            infile = open(file, 'r')
            word_type = file[:-4]
            word_list = [line for line in infile]
        return word_list
Of course, I could do it easily once for each text file:
    def loadAllWords():
        infile = open("Adjectives.txt", "r")
        wordList = []
        wordList = [word for word in infile]
        return wordList
but I'd like my function to do it automatically with each one. Is there a way to do this, or should I just stick with a for loop for each file?

You should use a dict for that, like this (untested):
    results = {}
    for file in file_list:
        infile = open(file, 'r')
        word_type = file[:-4]
        results[word_type] = [line for line in infile]
    return results
Also, you don't need the list comprehension; you can just do:
    results[word_type] = list(infile)
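For example, the loader could look like this (a sketch; the helper name, the temporary directory, and the sample words are stand-ins for the question's word files, and newlines are stripped for convenience):

```python
import os
import tempfile

def load_all_words(file_list):
    # Map each file's base name (e.g. 'Nouns') to its list of words.
    results = {}
    for path in file_list:
        with open(path, 'r') as infile:
            key = os.path.basename(path)[:-4]   # strip the '.txt' extension
            results[key] = [line.rstrip('\n') for line in infile]
    return results

# Demo with throwaway files standing in for the question's word files.
with tempfile.TemporaryDirectory() as d:
    for fname, content in [('Nouns.txt', 'cat\ndog\n'), ('Verbs.txt', 'run\n')]:
        with open(os.path.join(d, fname), 'w') as f:
            f.write(content)
    words = load_all_words([os.path.join(d, n) for n in ('Nouns.txt', 'Verbs.txt')])

print(words['Nouns'])  # ['cat', 'dog']
```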

You can create new variables with custom names by manipulating the locals() dictionary, which is where local variables are stored. But it is hard to imagine any case where this would be a good idea. I strongly recommend Stephen Roach’s suggestion of using a dictionary, which will let you keep track of the lists more neatly. But if you really want to create local variables for each file, you can use a slight variation on his code:
    results = {}
    for file in file_list:
        with open(file, 'r') as infile:
            word_type = file[:-4]
            results[word_type] = list(infile)
    # store each list in a local variable with that name
    # (note: in CPython this has no effect inside a function)
    locals().update(results)
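To see why this is discouraged: in CPython, writes to the dictionary returned by locals() inside a function are simply discarded, so the "variables" never actually appear. A minimal sketch:

```python
def demo():
    results = {'nouns': ['cat', 'dog']}
    locals().update(results)   # silently ignored inside a function (CPython)
    try:
        return nouns           # NameError: the name was never created
    except NameError:
        return None

print(demo())  # None
```

At module level, locals() is the same dictionary as globals(), so the update does work there; that inconsistency is another reason to prefer the dict.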

Related

Read multiple files, search for string and store in a list

I am trying to search through a list of files, look for the word 'type' and the following word, then put them into a list along with the file name. For example, this is what I am looking for:
    File Name, Type
    [1.txt, [a, b, c]]
    [2.txt, [a, b]]
My current code returns a list for every type:
    [1.txt, [a]]
    [1.txt, [b]]
    [1.txt, [c]]
    [2.txt, [a]]
    [2.txt, [b]]
Here is my code. I know my logic will return a single value into the list, but I'm not sure how to edit it so it will just be the file name with a list of types.
    output = []
    for file_name in find_files(d):
        with open(file_name, 'r') as f:
            for line in f:
                line = line.lower().strip()
                match = re.findall('type ([a-z]+)', line)
                if match:
                    output.append([file_name, match])
Learn to categorize your actions at the proper loop level.
In this case, you say that you want to accumulate all of the references into a single list, but then your code creates one output line per reference, rather than one per file. Change that focus:
    output = []
    for file_name in find_files(d):
        with open(file_name, 'r') as f:
            ref_list = []
            for line in f:
                line = line.lower().strip()
                match = re.findall('type ([a-z]+)', line)
                if match:
                    ref_list.extend(match)
            # Once you've been through the entire file,
            # THEN you add a line for that file,
            # with the entire reference list
            output.append([file_name, ref_list])
You might find it useful to use a dict here instead:
    output = {}
    for file_name in find_files(d):
        with open(file_name, 'r') as f:
            output[file_name] = []
            for line in f:
                line = line.lower().strip()
                match = re.findall('type ([a-z]+)', line)
                if match:
                    output[file_name].extend(match)
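As a quick check of the accumulation logic, using in-memory strings in place of the question's files (the file names and contents here are invented):

```python
import re

# Hypothetical in-memory stand-ins for the question's files.
files = {
    "1.txt": "type a\ntype b\nsome text type c\n",
    "2.txt": "type a\nTYPE B\n",
}

output = {}
for name, text in files.items():
    refs = []
    for line in text.splitlines():
        # Same pattern as in the question, one flat list per file.
        refs.extend(re.findall('type ([a-z]+)', line.lower().strip()))
    output[name] = refs

print(output)  # {'1.txt': ['a', 'b', 'c'], '2.txt': ['a', 'b']}
```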

Searching keywords in a text file with a dictionary with python

I am trying to search for a lot of keywords in a text file and return the integers/floats that come after each keyword.
I think it's possible using a dictionary where the keys are the keywords that are in the text file and the values are functions that return the following value.
    import re

    def store_text():
        with open("path_to_file.txt", 'r') as f:
            text = f.readlines()
            return text

    abc = store_text()

    def search():
        for index, line in enumerate(abc):
            if "His age is:" in line:
                return int(re.search(r"\d+", line).group())

    dictionary = {
        "His age is:": print(search())
    }
The code returns the value I search for in the text file, but in search() I want to get rid of typing the keyword again, because it's already in the dictionary.
Later on I want to store the values found in an excel file.
If you have the keywords ready to be in a list, the following approach can help.
    import re
    from multiprocessing import Pool

    search_kwrds = ["His age is:", "His name is:"]  # add more keywords if you need
    search_regex = "|".join(search_kwrds)

    def read_search_text():
        with open("path_to_file.txt", 'r') as f:
            text = f.readlines()
        return text

    def search(search_line):
        search_res = re.search(search_regex, search_line)
        if search_res:
            kwrd_found = search_res.group(0)
            if kwrd_found:
                suffix_val = int(re.search(r"\d+", search_line).group())
                return {kwrd_found: suffix_val}
        return {}

    if __name__ == '__main__':
        search_lines = read_search_text()
        p = Pool(processes=1)  # increase, if you want a faster search
        s_res = p.map(search, search_lines)
        search_results = {kwrd: suffix for d in s_res for kwrd, suffix in d.items()}
        print(search_results)
You can add more keywords to the list and search for them. This focuses on searches where there is at most a single keyword on a given line and keywords do not repeat on later lines.
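The core of this approach, the alternation regex, can be checked without multiprocessing (the sample lines below are invented; re.escape is added as a precaution in case a keyword contains regex metacharacters):

```python
import re

search_kwrds = ["His age is:", "His name is:"]
# Build one pattern matching any keyword, e.g. "His\ age\ is\:|His\ name\ is\:"
search_regex = "|".join(re.escape(k) for k in search_kwrds)

lines = ["His age is: 42", "unrelated text", "His weight is: 80"]
results = {}
for line in lines:
    m = re.search(search_regex, line)
    if m:
        # m.group(0) is whichever keyword matched on this line
        results[m.group(0)] = int(re.search(r"\d+", line).group())

print(results)  # {'His age is:': 42}
```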
You can put the keywords that you need to search for in a list. This way you end up specifying your input keywords just once in your program. Also, I've modified your program to make it a bit more efficient. Explanations are given in the comments.
    import re
    import csv

    # You can add more keywords to find and match to this list
    list_of_keywords = ["His age is:", "His number is:", "Your_Keyword3"]

    def store_text():
        with open("/Users/karthick/Downloads/sample.txt", 'r') as f:
            text = f.readlines()
        return text

    abc = store_text()

    def search(input_file):
        # Initialize an empty dictionary to store the extracted values
        dictionary = dict()
        # Iterate through the lines of the text file
        for line in input_file:
            # For every line, check whether any keyword is present in it
            for keyword in list_of_keywords:
                if keyword in line:
                    # If a matching keyword is present, add the following number to the dictionary
                    dictionary.update({keyword: re.search(r"\d+", line).group()})
        return dictionary

    # Call the above function with the input
    output_dict = search(abc)
For storing the output values in a CSV file that Excel can open:
    # Write the extracted dictionary to a csv file
    with open('mycsvfile.csv', 'w') as f:  # Specify the path of your output csv file here
        w = csv.writer(f)
        w.writerows(output_dict.items())

How to insert lines in file if not already present?

So, I have a text file with multiple lines:
    orange
    melon
    applez
    more fruits
    abcdefg
And I have a list of strings that I want to check:
    names = ["apple", "banana"]
Now I want to go through all the lines in the file, and I want to insert the missing strings from the names list, if they are not present. If they are present, then they should not be inserted.
Generally this should not be too difficult, but taking care of all the newlines and such is pretty finicky. This is my attempt:
    if not os.path.isfile(fpath):
        raise FileNotFoundError('Could not load username file', fpath)
    with open(fpath, 'r+') as f:
        lines = [line.rstrip('\n') for line in f]
        if not "banana" in lines:
            lines.insert(0, 'banana')
        if not "apple" in lines:
            lines.insert(0, 'apple')
        f.writelines(lines)
    print("done")
The problem is, my values are not inserted in new lines but are appended.
Also I feel like my solution is generally a bit clunky. Is there a better way to do this that automatically inserts the missing strings and takes care of all the newlines and such?
You need to seek to the first position in the file and use join to write each word on its own line, overwriting the file's contents:
    names = ["apple", "banana"]
    with open(fpath, 'r+') as f:
        lines = [line.rstrip('\n') for line in f]
        for name in names:
            if name not in lines:
                # inserts on top; otherwise use lines.append(name) to append at the end of the file
                lines.insert(0, name)
        f.seek(0)  # move to the first position in the file, to overwrite it
        f.write('\n'.join(lines))
        f.truncate()  # discard any leftover bytes if the new content is shorter
    print("done")
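A self-contained run of this approach against a throwaway file (the contents are invented to match the question's example):

```python
import os
import tempfile

names = ["apple", "banana"]

# Create a scratch file standing in for the question's text file.
fd, fpath = tempfile.mkstemp()
os.close(fd)
with open(fpath, 'w') as f:
    f.write("orange\nmelon\napplez\n")

with open(fpath, 'r+') as f:
    lines = [line.rstrip('\n') for line in f]
    for name in names:
        if name not in lines:
            lines.insert(0, name)
    f.seek(0)                  # rewind to overwrite
    f.write('\n'.join(lines))
    f.truncate()               # discard leftover bytes, if any

with open(fpath) as f:
    final = f.read().splitlines()
os.remove(fpath)

print(final)  # ['banana', 'apple', 'orange', 'melon', 'applez']
```

Note that 'applez' is kept and 'apple' is still inserted, since the check is for exact lines.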
First get a list of all the usernames in your file by using readlines() (stripping the newlines), then use a list comprehension to identify the usernames from your names list that are missing.
Create a new list and write that one to your file.
    names = ["apple", "banana"]
    with open(fpath, 'r') as f:
        usernames = [line.rstrip('\n') for line in f]
    res = [name for name in names if name not in usernames]
    new_list = usernames + res
    with open(fpath, 'w') as f:
        for item in new_list:
            f.write("%s\n" % item)
    file_name = r'<file-path>'  # full path of file
    names = ["apple", "banana"]  # user list of words
    with open(file_name, 'r+') as f:  # opening the file as a context manager closes it automatically
        x = [line.rstrip('\n') for line in f]  # read all the lines, stripping the trailing '\n'
        # check whether each user word is present in the file
        for name in names:
            if name not in x:  # if the word is not in the file, append it
                f.write('\n' + name)

import filenames iteratively from a different file

I have a large number of entries in a file. Let me call it file A.
File A:
    ('aaa.dat', 'aaa.dat', 'aaa.dat')
    ('aaa.dat', 'aaa.dat', 'bbb.dat')
    ('aaa.dat', 'aaa.dat', 'ccc.dat')
I want to use these entries, line by line, in a program that would iteratively pick an entry from file A, concatenate the files in this way:
    filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']  # entry number 3
    with open('out.dat', 'w') as outfile:  # the name has to be aaa-aaa-ccc.dat
        for fname in filenames:
            with open(fname) as infile:
                outfile.write(infile.read().strip())
All I need to do is to substitute the filenames iteratively and create an output in a "aaa-aaa-aaa.dat" format. I would appreciate any help-- feeling a bit lost!
Many thanks!!!
You can retrieve and modify the file names in the following way:
    import re

    pattern = re.compile(r'\W')

    with open('fnames.txt', 'r') as infile:
        for line in infile:
            line = (re.sub(pattern, ' ', line)).split()
            # Old filenames - to concatenate contents
            content = [x + '.dat' for x in line[::2]]
            # New filename
            new_name = '-'.join(line[::2]) + '.dat'
            # Write the concatenated content to the new
            # file (first read the content all at once)
            with open(new_name, 'w') as outfile:
                for con in content:
                    with open(con, 'r') as old:
                        new_content = old.read()
                    outfile.write(new_content)
This program reads your input file, here named fnames.txt, with the exact structure from your post, line by line. For each line it splits the entries using a precompiled regex (precompiling the regex is suitable here and should make things faster). This assumes that your filenames contain only alphanumeric characters, since the regex substitutes all non-alphanumeric characters with a space.
Splitting turns each line into a list of strings like 'aaa' and 'dat'. The program forms the new name by joining every second entry, starting from 0, with a - (as in your post) and adding a .dat extension.
It then collects, in the list content, the individual file names whose contents will be concatenated, again by selecting every second entry of line.
Finally, it reads each of the files in content and writes them to the common file new_name. It reads each of them all at once, which may be a problem if these files are big, and in general there may be more efficient ways of doing all this. Also, if you are planning to do more things with the content from the old files before writing, consider moving the old-file-specific operations to a separate function for readability and easier debugging.
Something like this (using ast.literal_eval rather than eval, since evaluating arbitrary file contents with eval is unsafe):
    import ast

    with open(fname) as infile, open('out.dat', 'w') as outfile:
        for line in infile:
            line = line.strip()
            if line:  # not empty
                filenames = ast.literal_eval(line)  # read tuple
                filenames = [f[:-4] for f in filenames]  # remove extension
                filename = '-'.join(filenames) + '.dat'  # make filename
                outfile.write(filename + '\n')  # write
If your problem is just calculating the new filenames, how about using os.path.splitext?
    '-'.join([
        f[0] for f in [os.path.splitext(path) for path in filenames]
    ]) + '.dat'
Which can be probably better understood if you see it like this:
    import os

    clean_fnames = []
    filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']

    for fname in filenames:
        name, extension = os.path.splitext(fname)
        clean_fnames.append(name)

    name_without_ext = '-'.join(clean_fnames)
    name_with_ext = name_without_ext + '.dat'
    print(name_with_ext)
HOWEVER: If your issue is that you can not get the filenames in a list by reading the file line by line, you must keep in mind that when you read files, you get text (strings) NOT Python structures. You need to rebuild a list from a text like: "('aaa.dat', 'aaa.dat', 'aaa.dat')\n".
You could take a look at ast.literal_eval or try to rebuild it yourself. The code below outputs a lot of messages to show what's happening:
    import pprint

    collected_fnames = []

    with open('./fileA.txt') as f:
        for line in f:
            print("Read this (literal) line: %s" % repr(line))
            line_without_whitespaces_on_the_sides = line.strip()
            if not line_without_whitespaces_on_the_sides:
                print("line is empty... skipping")
                continue
            else:
                line_without_parenthesis = (
                    line_without_whitespaces_on_the_sides
                    .lstrip('(')
                    .rstrip(')')
                )
                print("Cleaned parenthesis: %s" % line_without_parenthesis)
                chunks = line_without_parenthesis.split(', ')
                print("Collected %s chunks in a %s: %s" % (len(chunks), type(chunks), chunks))
                chunks_without_quotations = [chunk.replace("'", "") for chunk in chunks]
                print("Now we don't have quotations: %s" % chunks_without_quotations)
                collected_fnames.append(chunks_without_quotations)

    print("collected %s lines with filenames:\n%s" %
          (len(collected_fnames), pprint.pformat(collected_fnames)))
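For comparison, ast.literal_eval does the same rebuilding in one step (the sample line mirrors file A's format):

```python
import ast

# One raw line as it would come out of file A, trailing newline included.
line = "('aaa.dat', 'aaa.dat', 'ccc.dat')\n"

# literal_eval safely parses Python literals (tuples, strings, numbers, ...)
filenames = ast.literal_eval(line.strip())

print(filenames)  # ('aaa.dat', 'aaa.dat', 'ccc.dat')
```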

Return only returning 1st item

In my function below I'm trying to return the full text from multiple .txt files and append the results to a list. However, my function only returns the text from the first file in my directory. If I replace return with print, it logs all the results to the console, each on its own line, but I can't seem to append the results to a list. What am I doing wrong with the return statement?
Thanks.
    import glob
    import copy

    file_paths = []
    file_paths.extend(glob.glob("C:\Users\7812397\PycharmProjects\Sequence diagrams\*"))
    matching_txt = [s for s in file_paths if ".txt" in s]
    full_text = []

    def fulltext():
        for file in matching_txt:
            f = open(file, "r")
            ftext = f.read()
            all_seqs = ftext.split("title ")
            return all_seqs

    print fulltext()
You're putting return inside your loop, so the function exits on the first iteration. I think you want to do something like the following (although you could also yield the values here):
    for file in matching_txt:
        f = open(file, "r")
        ....
        full_text.append(all_seqs)
    return full_text
You can convert your function to a generator which is more efficient in terms of memory use:
    def fulltext():
        for file_name in matching_txt:
            with open(file_name) as f:
                ftext = f.read()
                all_seqs = ftext.split("title ")
                yield all_seqs
Then convert the generator to a list if the files are not huge; otherwise you can simply loop over the result to use the generator's items:
    full_text = list(fulltext())
Some issues in your function:
First off, don't use Python built-in names as variable names (in this case file).
Secondly, for dealing with files (and other external connections) it's better to use a with statement, which will close the file at the end of the block automatically.
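Putting both fixes together, a corrected version of the function might look like this (a sketch; the path list is taken as a parameter instead of the question's module-level matching_txt):

```python
def fulltext(matching_txt):
    full_text = []
    for file_name in matching_txt:        # avoid shadowing the built-in 'file'
        with open(file_name, "r") as f:   # 'with' closes the file automatically
            full_text.append(f.read().split("title "))
    return full_text                      # return AFTER the loop, not inside it
```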
