Strip file names from files and open recursively? Saving previous strings?

I have a question about reading in a .txt file and taking the strings inside it to be used later on in the code.
If I have a file called 'file0.txt' and it contains:
file1.txt
file2.txt
The rest of the files either contain more string file names or are empty.
How can I save both of these strings for later use? What I attempted to do was:
infile = open(file, 'r')
line = infile.readline()
line.split('\n')
But that returned the following:
['file1.txt', '']
I understand that readline only reads one line, but I thought that by splitting it by the return key it would also grab the next file string.
I am attempting to simulate a file tree or to show which files are connected together, but as it stands now it is only going through the first file string in each .txt file.
Currently my output is:
File 1 crawled.
File 3 crawled.
Dead end reached.
My hope was that instead of just recursively crawling the first file it would go through the entire web, but that goes back to my issue of not giving the program the second file name in the first place.
I'm not asking for a specific answer, just a push in the right direction on how to better handle the strings from the files and be able to store both of them instead of one.
My current code is pretty ugly, but hopefully it gets the idea across. I will just post it for reference to show what I'm trying to accomplish.
def crawl(file):
    infile = open(file, 'r')
    line = infile.readline()
    print(line.split('\n'))
    if 'file1.txt' in line:
        print('File 1 crawled.')
        return crawl('file1.txt')
    if 'file2.txt' in line:
        print('File 2 crawled.')
        return crawl('file2.txt')
    if 'file3.txt' in line:
        print('File 3 crawled.')
        return crawl('file3.txt')
    if 'file4.txt' in line:
        print('File 4 crawled.')
        return crawl('file4.txt')
    if 'file5.txt' in line:
        print('File 5 crawled.')
        return crawl('file5.txt')
    #etc...etc...
    else:
        print('Dead end reached.')
Outside the function:
file = 'file0.txt'
crawl(file)

Using read() or readlines() will help, e.g.
infile = open(file, 'r')
lines = infile.readlines()
print(lines)
gives
['file1.txt\n', 'file2.txt\n']
or
infile = open(file, 'r')
lines = infile.read()
print(lines.splitlines())
gives
['file1.txt', 'file2.txt']

readline only gets one line from the file, which is why the result ends with a newline. What you want is file.read(), which gives you the whole file as a single string. Split that on newlines and you should have what you need. Also remember to save the list of lines in a new variable, i.e. assign the result of line.split('\n') to a name; split returns a new list and does not change line. You could also just use readlines, which returns a list of the lines in the file.
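As a rough sketch of how that could feed back into the crawl: read every name with read().splitlines() and recurse into each one, instead of returning after the first match. The seen list, the blank-line check, and resolving names relative to the folder of the file that lists them are my own assumptions, not something from the question.

```python
import os


def crawl(path, seen=None):
    """Visit path and every file named inside it, depth first."""
    if seen is None:
        seen = []
    if path in seen:
        # Assumed guard: stops files that list each other from looping forever
        return seen
    seen.append(path)
    with open(path, "r") as infile:
        # read() returns the whole file; splitlines() drops the newlines
        names = [name.strip() for name in infile.read().splitlines()]
    folder = os.path.dirname(path)
    for name in names:
        if name:  # skip blank lines
            crawl(os.path.join(folder, name), seen)
    return seen
```

Calling crawl('file0.txt') would then return the list of visited paths in the order they were reached, so a file with no names in it is simply a dead end rather than the end of the whole crawl.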

Change readline to readlines. There is no need to split on '\n'; the result is already a list.
Here is a tutorial you should read.

I prepared file0.txt with two file names in it, file1.txt with one file name in it, plus file2.txt and file3.txt, which contained no data. Note that this won't add file names that are already in the list.
def get_files(current_file, files=None):
    # Initialize file list with previous values, or initial value
    if not files:
        new_files = [current_file]
    else:
        new_files = files
    # Add file names not already in the list (blank lines are skipped)
    with open(current_file, "r") as f_in:
        for new_file in f_in.read().splitlines():
            new_file = new_file.strip()
            if new_file and new_file not in new_files:
                new_files.append(new_file)
    # Do we need to recurse?
    cur_file_index = new_files.index(current_file)
    if cur_file_index < len(new_files) - 1:
        next_file = new_files[cur_file_index + 1]
        # Recurse
        get_files(next_file, new_files)
    # We're done
    return new_files
initial_file = "file0.txt"
files = get_files(initial_file)
print(files)
Returns: ['file0.txt', 'file1.txt', 'file2.txt', 'file3.txt']
Contents of file0.txt:
file1.txt
file2.txt
Contents of file1.txt:
file3.txt
file2.txt and file3.txt were blank.
Edits: Added .strip() for safety, and added the contents of the data files so this can be replicated.

Related

Failure while trying to append in a try statement

So I have a small program that reads a file and creates it if it doesn't exist.
But it fails when it tries to read the contents of the second and third files and append them to the first.
I marked in the code exactly where it fails.
It always jumps to the except part. I didn't include the except part here because it seemed unnecessary.
with open ('lista1.txt','r') as file_1:
    reader_0 = file_1.readlines() #reads a list of searchterms, the first search term of this list is "gt-710"
    for search in reader_0:
        # creates the txt string component of the file to be created, this is the first one
        file_0 = search.replace("\n","") +".txt"
        file_1 = str(file_0.strip())
        # creates the txt string component of the file to be created, this is the second one
        files_2 = search.replace("\n","") +"2.txt"
        file_2 = str(files_2.strip())
        # creates the txt string component of the file to be created, this is the second one
        files_3 = search.replace("\n","") +"3.txt"
        file_3 = str(files_3.strip())
        try: #if the file named the same as the searchterm exists, read its contents
            file = open(file_1,"r")
            file2 = open(file_2,"r")
            file3 = open(file_3,"r")
            file_contents = file.readlines()
            file_contents2 = file2.readlines()
            file_contents3 = file3.readlines()
            file = open(file_1,"a") #appends the contents of file 3 and file 2 to file 1
            print("im about here")
            file.write(file_contents2) #fails exactly here I don't know why
            file.write(file_contents3)
            file2 = open(file_2,"w+")
            file2.write(file_contents)
            file3 = open(file_3,"w+")
            file3.write(file_contents2)
The reason it fails at the point you mention is that you are trying to write a list to the file, not a string. file2.readlines() returns a list of strings, one per line. To fix this, change every readlines() call to filexxx.read(), which returns the whole file contents as a single string.
I also recommend making the changes the other answer states to make your code more readable/robust.
You start reading from file_1 with file = open(file_1, 'r'), and then open it again in append mode without closing the first handle. That leaves files open and makes it hard to tell which mode each handle is in when you attempt to write.
Change your file reading/writing to use the less error-prone with open syntax, as follows:
with open(file_1, 'r') as file_handle:
    file_contents = file_handle.read()
with open(file_2, 'r') as file_handle:
    file_contents2 = file_handle.read()
with open(file_3, 'r') as file_handle:
    file_contents3 = file_handle.read()
with open(file_1, 'a') as file_handle:
    file_handle.write(file_contents2)
# etc.
This syntax makes it very evident when a file is no longer open, and in what state it is open in.
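Putting both answers together, a minimal sketch might look like the following. It assumes the intent of the question's code is: append the contents of files 2 and 3 to file 1, then overwrite file 2 with file 1's old contents and file 3 with file 2's old contents. The helper names read_text and rotate_contents are made up for illustration.

```python
def read_text(path):
    """Return the whole file as one string (read(), not readlines())."""
    with open(path, "r") as handle:
        return handle.read()


def rotate_contents(path_1, path_2, path_3):
    """Hypothetical helper mirroring the writes in the question's code."""
    # Read everything first, so every read handle is closed before any write
    contents_1 = read_text(path_1)
    contents_2 = read_text(path_2)
    contents_3 = read_text(path_3)
    # write() needs a string, which is exactly what read() returned
    with open(path_1, "a") as handle:
        handle.write(contents_2)
        handle.write(contents_3)
    with open(path_2, "w") as handle:
        handle.write(contents_1)
    with open(path_3, "w") as handle:
        handle.write(contents_2)
```

If you would rather keep readlines(), the matching call on the writing side is writelines(), which accepts a list of strings.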

Pull specific line from every .txt file in folder and output lines to another .txt

I have a folder of 400 .txt files and am attempting to take the sixth line from every file in the directory, and output each line all into a new singular .txt file with the sixth line from each file listed one after the other in the new file. For example, the output I am attempting to create should look like:
**output.txt**
This is the sixth line from 1.txt
This is the sixth line from 2.txt
This is the sixth line from 3.txt
So far I'm able to print off all the files in the directory in a list to be acted upon with:
import os
entries = os.listdir(r'C:/Users/defaultuser/Desktop/UprocScripts')
for entry in entries:
    print(entry)
I have researched and tried various combinations of the readlines() method, but I'm not sure exactly how to combine them in multiples over an entire directory of 400 files. I'm still trying to learn, any ideas if I'm on the right path and how to combine them is appreciated.
Here is another way, if you want to use a for loop to iterate over each text file and pick a specific line. In this code all the .txt files are fetched at the beginning.
import glob
list_of_txt = glob.glob(r"C:\Users\defaultuser\Desktop\UprocScripts\*.txt")
for textfiles in list_of_txt:
    with open(r"C:\Users\defaultuser\Desktop\UprocScripts\final.txt", 'a+') as final_text_file:
        with open(textfiles, 'r') as textFile:
            for n, line in enumerate(textFile):
                if n+1 == 6: # if it's line no. 6 then write it on your final txt file
                    final_text_file.writelines(line)
Also note that I am using the glob module here. In addition if you want to add "from some.txt" after each line then just replace the last line with this:
final_text_file.write(line.strip() + " from " + textfiles.split('\\')[-1] + "\n")
You need to read each file, get the sixth line from each of them, then write that line to the output file.
Like so:
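import os
folder = r'C:/Users/defaultuser/Desktop/UprocScripts'
entries = os.listdir(folder)
# Open the output file once; reopening it with 'w' inside the loop would overwrite it every time
with open('output.txt', 'w') as out_file:
    for entry in entries:
        # listdir returns bare names, so join them with the folder
        with open(os.path.join(folder, entry)) as text_file:
            lines = text_file.readlines()
            target_line = lines[5] # sixth line
            out_file.write(target_line)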
Note that this reads the complete file for each input file, which might be inefficient. You can try to get around that by using the hint parameter of readlines, which accepts an approximate number of bytes to read. If you know the approximate size of each line (in bytes), you can pass 6 * line_size as hint to try to optimize the read.
You don't need to read the whole file; you can read only the first 6 lines like this:
import os
folder = r'C:/Users/defaultuser/Desktop/UprocScripts'
entries = os.listdir(folder)
final = []
for entry in entries:
    # Read the first 6 lines and keep the last one (you don't need to read everything):
    with open(os.path.join(folder, entry)) as f:
        lines = []
        for _ in range(6):
            lines.append(f.readline())
    final.append(lines[-1])
# And write
with open("final.txt", "w") as f:
    f.writelines(final)
import os
files_list = []
sixth_line_list = []
output_list = []
directory = 'C:\\Users\\defaultuser\\Desktop\\UprocScripts'
for file in os.listdir(directory):
    if file.endswith('.txt'):
        files_list.append(os.path.join(directory, file))
for file in files_list:
    with open(file, 'r') as file_:
        # Store (path, sixth line) pairs
        sixth_line_list.append((file, file_.readlines()[5]))
for path, line in sixth_line_list:
    # Drop the line's own newline so " from <path>" stays on the same line
    output_list.append(''.join([line.rstrip('\n'), ' from ', path, '\n']))
with open(os.path.join(directory, 'output.txt'), 'w') as output:
    output.writelines(output_list)
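For what it's worth, the same idea can be sketched with itertools.islice, which stops reading each file once the wanted line has been reached. The function names nth_line and collect_lines are made up, and skipping files with fewer than six lines is an assumption on my part, since the question doesn't say what should happen to them.

```python
import glob
import os
from itertools import islice


def nth_line(path, n):
    """Return line number n (1-based) of path, or None if the file is shorter."""
    with open(path, "r") as handle:
        # islice consumes at most n lines and yields only the last of them
        return next(islice(handle, n - 1, n), None)


def collect_lines(folder, output_path, n=6):
    """Write line n of every .txt file in folder to output_path."""
    output_abs = os.path.abspath(output_path)
    with open(output_path, "w") as out_file:
        for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            if os.path.abspath(path) == output_abs:
                continue  # never read the output file itself
            line = nth_line(path, n)
            if line is not None:
                out_file.write(line.rstrip("\n") + "\n")
```

Something like collect_lines(r'C:/Users/defaultuser/Desktop/UprocScripts', 'output.txt') would then produce one line per input file. Note that sorted() orders the names as strings, so 10.txt comes before 2.txt.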

Read file and find if all lines are the same length

Using python I need to read a file and determine if all lines are the same length or not. If they are I move the file into a "good" folder and if they aren't all the same length I move them into a "bad" folder and write a word doc that says which line was not the same as the rest. Any help or ways to start?
You should use all():
with open(filename) as read_file:
    length = len(read_file.readline())
    if all(len(line) == length for line in read_file):
        pass  # Move to good folder
    else:
        pass  # Move to bad folder
Since all() is short-circuiting, it will stop reading the file at the first non-match.
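Since the question also asks which line was different, here is a small sketch that returns the number of the first mismatching line and then moves the file with shutil.move. The names first_mismatch, sort_file, good_dir and bad_dir are hypothetical, and unlike the all() version it ignores the trailing newline, so a last line without one is not flagged just for that.

```python
import os
import shutil


def first_mismatch(path):
    """Return the 1-based number of the first line whose length differs
    from the first line's length, or None if every line matches."""
    with open(path, "r") as read_file:
        expected = len(read_file.readline().rstrip("\n"))
        # The first line is already consumed, so numbering starts at 2
        for number, line in enumerate(read_file, start=2):
            if len(line.rstrip("\n")) != expected:
                return number
    return None


def sort_file(path, good_dir, bad_dir):
    """Move path into good_dir or bad_dir; return the offending line number."""
    mismatch = first_mismatch(path)
    target = good_dir if mismatch is None else bad_dir
    shutil.move(path, os.path.join(target, os.path.basename(path)))
    return mismatch
```

Like all(), the loop stops at the first mismatch, so the rest of a bad file is never read. The returned line number is what you would write into the report.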
First off, you can read the file, here example.txt and put all lines in a list, content:
with open(filename) as f:
    content = f.readlines()
Next you need to trim all the newline characters from the end of a line and put it in another list result:
result = []
for line in content:
    line = line.strip()
    result.append(line)
Now it's not that hard to get the length of every sentence, and since you want lines that are bad, you loop through the list:
lengths = []
for line in result:
    lengths.append(len(line))
So the i-th element of result has length [i-th element of lengths]. We can make a counter for what line length occurs the most in the list, it is as simple as one line!
most_occuring = max(set(lengths), key=lengths.count)
Now we can make another for-loop to check which lengths don't correspond with the most-occuring and add those to bad-lines:
bad_lines = []
for i in range(len(lengths)):
    if lengths[i] != most_occuring:
        bad_lines.append([i, result[i]])
The next step is check where the file needs to go, the good folder, or the bad folder:
if len(bad_lines) == 0:
    #Good file, move it to the good folder, use the os or shutil module
    os.rename("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
else:
    #Bad file, one or more lines are bad, thus move it to the bad folder
    os.rename("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
The last step is writing the bad lines to another file, which is do-able, since we have the bad lines already in a list bad_lines:
with open("bad_lines.txt", "w") as f:
    for bad_line in bad_lines:
        f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))
It's not a doc file, but I think this is a nice start. You can take a look at the docx module if you really want to write to a doc file.
EDIT: Here is an example python script.
with open("example.txt") as f:
    content = f.readlines()
result = []
lengths = []
#Strip the file of \n
for line in content:
    line = line.strip()
    result.append(line)
    lengths.append(len(line))
most_occuring = max(set(lengths), key=lengths.count)
bad_lines = []
for i in range(len(lengths)):
    if lengths[i] != most_occuring:
        #Append the bad_line to bad_lines
        bad_lines.append([i, result[i]])
#Check if it's a good, or a bad file
#if len(bad_lines) == 0:
    #Good File
    #Move file to the good folder...
#else:
    #Bad File
# Text mode ("w"), because the lines are written as strings
with open("bad_lines.txt", "w") as f:
    for bad_line in bad_lines:
        f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))

"Move" some parts of the file to another file

Let's say I have a file with 48,222 lines. I then give an index value of, let's say, 21,000.
Is there any way in Python to "move" the contents of the file starting from index 21,000, so that I end up with two files: the original one and the new one? The original one would then have 21,000 lines and the new one 27,222 lines.
I read this post which uses partition and is quite describing what I want:
with open("inputfile") as f:
    contents1, sentinel, contents2 = f.read().partition("Sentinel text\n")
with open("outputfile1", "w") as f:
    f.write(contents1)
with open("outputfile2", "w") as f:
    f.write(contents2)
Except that (1) it uses "Sentinel text" as the separator, and (2) it creates two new files and requires me to delete the old file. As of now, the way I do it is like this:
for r in result.keys(): #the filenames are in my dictionary, don't bother that
    f = open(r)
    lines = f.readlines()
    f.close()
    with open("outputfile1.txt", "w") as fn:
        for line in lines[0:21000]:
            fn.write(line) #write each line
    with open("outputfile2.txt", "w") as fn:
        for line in lines[21000:]:
            fn.write(line) #write each line
This is quite manual. Is there a built-in or more efficient way?
You can also use writelines() and dump the sliced list of lines from 0 to 20999 into one file and another sliced list from 21000 to the end into another file.
with open("inputfile") as f:
    content = f.readlines()
content1 = content[:21000]
content2 = content[21000:]
with open("outputfile1.txt", "w") as fn1:
    fn1.writelines(content1)
with open('outputfile2.txt','w') as fn2:
    fn2.writelines(content2)
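If the goal is to keep the original file and only create one new file, the same slicing can be written back in place. This is just a sketch: split_file and its parameters are names I made up, and it still loads the whole file into memory, which should be fine for roughly 48,000 lines.

```python
def split_file(path, new_path, index):
    """Keep the first `index` lines in path and move the rest to new_path."""
    with open(path, "r") as source:
        lines = source.readlines()
    # Write the tail first, so the original is untouched if this step fails
    with open(new_path, "w") as tail:
        tail.writelines(lines[index:])
    # Reopening in "w" mode truncates the original before the head is written
    with open(path, "w") as head:
        head.writelines(lines[:index])
```

With split_file("inputfile", "outputfile2.txt", 21000) the original would end up with 21,000 lines and the new file with the remaining 27,222, and there is no old file left to delete.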

Read first not empty line from file and store in XML

I need to create a script that collects data about the files in a certain folder and stores it in an XML file, which is created during the process.
I am stuck at the point where the first sentence of the file should be stored in the firstsentence element.
The files have one sentence per line, but some start with empty lines. For files that start with empty lines, the first non-empty line should be stored.
This is my attempt:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text=text.readline() #this line was before if!!
if text.readline() != '':
    print text.readline()
    first_sent.text = text.readline()
Currently it only stores some (random) sentence, and only for very few files.
You're calling text.readline() again instead of checking the value previously read. And you need a loop in order to skip all the blank lines.
Something resembling this should work:
first_sent = et.SubElement(file_corpus, 'firstsentence')
with open(filename, 'r') as text:
    line = text.readline()
    # A blank line is read as '\n'; readline() returns '' only at end of file
    while line and not line.strip():
        line = text.readline()
first_sent.text = line.strip()
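One detail to watch for: readline() returns '\n' for a blank line and '' only at the end of the file, so the check has to look at the stripped line. A compact sketch of the same skip-the-blanks idea, using next() over the file's lines, could look like this. The helper name first_nonblank_line and the empty-string fallback for files with no text are my assumptions.

```python
def first_nonblank_line(path):
    """Return the first line that contains text, stripped of surrounding
    whitespace, or '' if the file has no such line."""
    with open(path, "r") as text:
        # The generator stops reading as soon as a non-blank line is found
        return next((line.strip() for line in text if line.strip()), "")
```

The result could then be assigned with first_sent.text = first_nonblank_line(filename).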
