Read file and find if all lines are the same length - python

Using python I need to read a file and determine if all lines are the same length or not. If they are I move the file into a "good" folder and if they aren't all the same length I move them into a "bad" folder and write a word doc that says which line was not the same as the rest. Any help or ways to start?

You should use all():
with open(filename) as read_file:
    length = len(read_file.readline())
    if all(len(line) == length for line in read_file):
        pass  # Move to good folder
    else:
        pass  # Move to bad folder
Since all() is short-circuiting, it will stop reading the file at the first non-match.
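A fuller sketch of this approach follows. The folder names good and bad, the function names, and the use of shutil.move are assumptions for illustration, not part of the answer above. It strips only the trailing newline, so a final line without a newline is not counted as shorter than the rest:

```python
import os
import shutil


def all_lines_same_length(path):
    """Return True if every line in the file has the same length."""
    with open(path) as read_file:
        # Strip only the newline so a last line without one compares fairly.
        lengths = (len(line.rstrip("\n")) for line in read_file)
        first = next(lengths, None)  # None means the file is empty
        return all(length == first for length in lengths)


def sort_file(path, good_dir="good", bad_dir="bad"):
    """Move the file into good_dir or bad_dir (hypothetical folder names)."""
    target = good_dir if all_lines_same_length(path) else bad_dir
    os.makedirs(target, exist_ok=True)
    return shutil.move(path, os.path.join(target, os.path.basename(path)))
```

Calling sort_file("example.txt") would move the file into whichever folder applies, creating the folder if it does not exist yet.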

First, read the file (example.txt here) and put all of its lines in a list, content:
with open("example.txt") as f:
    content = f.readlines()
Next, trim the newline character from the end of each line and put the result in another list, result:
result = []
for line in content:
    line = line.strip()
    result.append(line)
Now it's easy to get the length of every line. Since you want to find the lines that are bad, loop through the list and record each length:
lengths = []
for line in result:
    lengths.append(len(line))
So the i-th element of result has length lengths[i]. We can find the line length that occurs most often in the list with a single line:
most_occuring = max(set(lengths), key=lengths.count)
Now we can use another for-loop to check which lengths differ from the most common one and add those lines to bad_lines:
bad_lines = []
for i in range(len(lengths)):
    if lengths[i] != most_occuring:
        bad_lines.append([i, result[i]])
The next step is to check where the file needs to go, the good folder or the bad folder:
import os
if len(bad_lines) == 0:
    # Good file: move it to the good folder (use the os or shutil module)
    os.rename("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
else:
    # Bad file: one or more lines are bad, so move it to the bad folder
    os.rename("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
The last step is writing the bad lines to another file, which is doable since we already have them in the list bad_lines:
with open("bad_lines.txt", "w") as f:
    for bad_line in bad_lines:
        f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))
It's not a doc file, but I think this is a nice start. You can take a look at the docx module if you really want to write to a doc file.
EDIT: Here is an example python script.
with open("example.txt") as f:
    content = f.readlines()
result = []
lengths = []
# Strip the trailing \n from each line
for line in content:
    line = line.strip()
    result.append(line)
    lengths.append(len(line))
most_occuring = max(set(lengths), key=lengths.count)
bad_lines = []
for i in range(len(lengths)):
    if lengths[i] != most_occuring:
        # Append the bad line to bad_lines
        bad_lines.append([i, result[i]])
# Check if it's a good or a bad file
# if len(bad_lines) == 0:
#     Good file
#     Move file to the good folder...
# else:
#     Bad file
with open("bad_lines.txt", "w") as f:
    for bad_line in bad_lines:
        f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))
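The same idea can be written more compactly with collections.Counter. This is a sketch rather than the answer's own code: the function name find_bad_lines is made up for illustration, and it strips only the newline so that leading spaces still count toward the length.

```python
from collections import Counter


def find_bad_lines(path):
    """Return (index, text) pairs for lines whose length is not the most common one."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    if not lines:
        return []
    # most_common(1) gives [(length, count)] for the most frequent length.
    expected = Counter(len(line) for line in lines).most_common(1)[0][0]
    return [(i, line) for i, line in enumerate(lines) if len(line) != expected]
```

An empty result means the file is good. Otherwise the pairs can be written to a report file as in the script above.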

Related

Read first not empty line from file and store in XML

I need to write a script that collects data about the files in a certain folder and stores it in an XML file, which is created during the process.
I am stuck at the point where the first sentence of each file should be stored in the firstsentence element.
The files have one sentence per line, but some start with empty lines. For those files, the first non-empty line should be stored.
This is my attempt:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text=text.readline() #this line was before if!!
if text.readline() != '':
    print text.readline()
    first_sent.text = text.readline()
Currently it only stores some (random) sentence, and only for very few files.
You're calling text.readline() again instead of checking the value previously read. And you need a loop in order to skip all the blank lines. Note that a blank line is read as '\n', not '', so strip it before testing; readline() returns '' only at the end of the file.
Something resembling this should work:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
line = text.readline()
# keep reading while the line is blank and the end of the file is not reached
while line and not line.strip():
    line = text.readline()
first_sent.text = line
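As a self-contained sketch of the skip-blank-lines logic, separate from the XML handling, a small helper could look like this. The name first_nonempty_line is hypothetical, and it assumes that whitespace-only lines should also count as empty:

```python
def first_nonempty_line(path):
    """Return the first line containing non-whitespace text, or '' if none."""
    with open(path) as text:
        for line in text:
            stripped = line.strip()
            if stripped:
                return stripped
    return ""
```

The result could then be assigned to first_sent.text in the question's code.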

How to Iterate over readlines() in python

I am trying to add lines from a txt file to a python list for iteration, and the script wants to print every line and return an error. I'm using the readlines() function, but when I use list.remove(lines), it returns an error: File "quotes.py", line 20, in main list.remove(lines) TypeError: remove() takes exactly one argument (0 given).
def main():
    while True:
        try:
            text_file = open("childrens-catechism.txt", "r")
            lines = text_file.readlines()
            # print lines
            # print len(lines)
            if len(lines) > 0:
                print lines
                list.remove(lines)
                time.sleep(60)
            else:
                print "No more lines!"
                break
            text_file.close()
I can't see what I'm doing wrong. I know it has to do with list.remove(). Thank you in advance.
You can write it this way. It is simpler and more efficient, because it reads one line at a time instead of loading the whole file:
import time
def main():
    with open("childrens-catechism.txt", "r") as file:
        for line in file:
            print line,
            time.sleep(60)
Try this as per your requirements, this will do what you need.
import time
def main():
    with open("childrens-catechism.txt", "r") as file:
        lines = file.readlines()
    # print the first remaining line, delete it, then wait
    while len(lines) > 0:
        print lines[0],
        del lines[0]
        time.sleep(60)
    print "No more lines to remove"
Here lines is a list built from your text file, and list.remove(lines) is not what you want: list is the built-in type, so calling remove() on it treats lines as the list to modify and passes no value to remove, which is why the error says 0 arguments were given. You can delete elements from lines like this:
del lines[0]
del lines[1]
...
or
lines.remove("something")
The logic is that remove() deletes an element from a list: write the list first, then .remove(), then the value you want to delete inside the parentheses.
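A minimal illustration of the two ways to delete, using a made-up three-line list in place of the file contents:

```python
lines = ["first\n", "second\n", "third\n"]

# remove() is called on the list object and takes the value to delete.
lines.remove("second\n")

# del deletes by position instead of by value.
del lines[0]

print(lines)  # ['third\n']
```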
After opening a file, we can convert its lines into a list:
lines = list(open("childrens-catechism.txt", "r"))
From this list we can now remove entries with length greater than zero. Iterate over a copy (lines[:]), because removing from a list while looping over it skips elements:
for line in lines[:]:
    if len(line) > 0:
        # do sth
        lines.remove(line)
If you are trying to read all the lines from the file, print them in order, and delete each one after printing it, I would recommend this approach:
import time
try:
    file = open("childrens-catechism.txt")
    lines = file.readlines()
    while len(lines) != 0:
        print lines[0],
        lines.remove(lines[0])
        time.sleep(60)
except IOError:
    print 'No such file in directory'
This prints the first line and then deletes it. When the first value is removed, the list shifts up by one, so the next line (lines[1]) becomes the new start of the list, lines[0].
EDITED:
If you want to delete the lines from the file as well as from the list of lines, you will have to do this:
import time
try:
    file = open("childrens-catechism.txt", 'r+') #open the file for reading and writing
    lines = file.readlines()
    while len(lines) != 0:
        print lines[0],
        lines.remove(lines[0])
        time.sleep(60)
    file.truncate(0) #this truncates the file to 0 bytes
except IOError:
    print 'No such file in directory'
As for deleting the lines from the file one at a time, I am not sure whether that is possible or efficient.

parse blocks of text from text file using Python

I am trying to parse some text files and need to extract blocks of text. Specifically, the lines that start with "1:" and 19 lines after the text. The "1:" does not start on the same row in each file and there is only one instance of "1:". I would prefer to save the block of text and export it to a separate file. In addition, I need to preserve the formatting of the text in the original file.
Needless to say I am new to Python. I generally work with R but these files are not really compatible with R and I have about 100 to process. Any information would be appreciated.
The code that I have so far is:
tmp = open(files[0],"r")
lines = tmp.readlines()
tmp.close()
num = 0
a=0
for line in lines:
    num += 1
    if "1:" in line:
        a = num
        break
a = num is the line number for the block of text I want. I then want to save the next 19 lines to another file, but can't figure out how to do this. Any help would be appreciated.
Here is one option. Read all lines from your file, iterate until you find your line, and take the next 19 lines. You would need to handle files that don't contain 19 additional lines.
fh = open('yourfile.txt', 'r')
all_lines = fh.readlines()
fh.close()
block = []  # stays empty if "1:" is never found
for count, line in enumerate(all_lines):
    if "1:" in line:
        block = all_lines[count+1:count+20]
        break
Could be done in a one-liner...
open(files[0]).read().split('1:', 1)[1].split('\n')[:19]
or more readable
txt = open(files[0]).read() # read the file into a big string
before, after = txt.split('1:', 1) # split the file on the first "1:"
after_lines = after.split('\n') # create lines from the after text
lines_to_save = after_lines[:19] # grab the first 19 lines after "1:"
then join the lines with a newline (and add a newline to the end) before writing it to a new file:
out_text = "1:" # add back "1:"
out_text += "\n".join(lines_to_save) # add all 19 lines with newlines between them
out_text += "\n" # add a newline at the end
open("outputfile.txt", "w").write(out_text)
To comply with best practice for reading and writing files, you should also use the with statement to ensure that the file handles are closed as soon as possible. You can create convenience functions for it:
def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()
def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)
then you can replace the first line above with:
txt = read_file(files[0])
and the last line with:
write_file("outputfile.txt", out_text)
I always prefer to read the file into memory first, but sometimes that's not possible. If you want to use iteration then this will work:
def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return  # no '1:' in file
        yield line  # yield line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line
if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)
It uses the else: clause on a for-loop, which you don't see very often anymore but which was created for just this purpose (the else clause is executed only if the for-loop doesn't break).
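For readers who have not met for/else before, here is the construct in isolation. This is a minimal sketch; the function name find_marker and its default marker are assumptions for illustration:

```python
def find_marker(lines, marker="1:"):
    """Return the index of the first line starting with marker, or -1."""
    for index, line in enumerate(lines):
        if line.startswith(marker):
            break
    else:
        # Runs only when the loop finished without hitting break.
        return -1
    return index
```

With an empty list the loop body never runs, no break happens, and the else branch returns -1.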

Strip file names from files and open recursively? Saving previous strings? - PYTHON

I have a question about reading in a .txt file and taking the strings from inside it to be used later on in the code.
If I have a file called 'file0.txt' and it contains:
file1.txt
file2.txt
The rest of the files either contain more string file names or are empty.
How can I save both of these strings for later use. What I attempted to do was:
infile = open(file, 'r')
line = infile.readline()
line.split('\n')
But that returned the following:
['file1.txt', '']
I understand that readline only reads one line, but I thought that by splitting it by the return key it would also grab the next file string.
I am attempting to simulate a file tree or to show which files are connected together, but as it stands now it is only going through the first file string in each .txt file.
Currently my output is:
File 1 crawled.
File 3 crawled.
Dead end reached.
My hope was that instead of just recursively crawling the first file it would go through the entire web, but that goes back to my issue of not giving the program the second file name in the first place.
I'm not asking for a specific answer, just a push in the right direction on how to better handle the strings from the files and be able to store both of them instead of 1.
My current code is pretty ugly, but hopefully it gets the idea across, I will just post it for reference to what I'm trying to accomplish.
def crawl(file):
    infile = open(file, 'r')
    line = infile.readline()
    print(line.split('\n'))
    if 'file1.txt' in line:
        print('File 1 crawled.')
        return crawl('file1.txt')
    if 'file2.txt' in line:
        print('File 2 crawled.')
        return crawl('file2.txt')
    if 'file3.txt' in line:
        print('File 3 crawled.')
        return crawl('file3.txt')
    if 'file4.txt' in line:
        print('File 4 crawled.')
        return crawl('file4.txt')
    if 'file5.txt' in line:
        print('File 5 crawled.')
        return crawl('file5.txt')
    #etc...etc...
    else:
        print('Dead end reached.')
Outside the function:
file = 'file0.txt'
crawl(file)
Using read() or readlines() will help. e.g.
infile = open(file, 'r')
lines = infile.readlines()
print list(lines)
gives
['file1.txt\n', 'file2.txt\n']
or
infile = open(file, 'r')
lines = infile.read()
print list(lines.split('\n'))
gives
['file1.txt', 'file2.txt']
readline() only gets one line from the file, so the string it returns ends with a newline. What you want is file.read(), which gives you the whole file as a single string. Split that on newlines and you should have what you need. Also remember that split() returns a new list rather than changing the string, so assign the result of line.split('\n') to a variable. You could also just use readlines(), which returns a list of the lines in the file.
Change readline to readlines. There is no need to split on '\n'; the result is already a list.
here is a tutorial you should read
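The difference between the three calls can be seen side by side. This sketch writes its own throwaway file0.txt to a temporary directory so that it runs anywhere; the temporary path is an assumption for illustration only:

```python
import os
import tempfile

# Build a throwaway file0.txt so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "file0.txt")
with open(path, "w") as f:
    f.write("file1.txt\nfile2.txt\n")

with open(path) as f:
    first = f.readline()           # one line only, newline kept
with open(path) as f:
    kept = f.readlines()           # every line, newlines kept
with open(path) as f:
    clean = f.read().splitlines()  # every line, newlines removed

print(repr(first))  # 'file1.txt\n'
print(kept)         # ['file1.txt\n', 'file2.txt\n']
print(clean)        # ['file1.txt', 'file2.txt']
```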
I prepared file0.txt with two file names in it, file1.txt with one file name in it, plus file2.txt and file3.txt, which contained no data. Note that this won't add names that are already in the list.
def get_files(current_file, files=[]):
    # Initialize file list with previous values, or initial value
    new_files = []
    if not files:
        new_files = [current_file]
    else:
        new_files = files
    # Read files not already in list, to the list
    with open(current_file, "r") as f_in:
        for new_file in f_in.read().splitlines():
            if new_file not in new_files:
                new_files.append(new_file.strip())
    # Do we need to recurse?
    cur_file_index = new_files.index(current_file)
    if cur_file_index < len(new_files) - 1:
        next_file = new_files[cur_file_index + 1]
        # Recurse
        get_files(next_file, new_files)
    # We're done
    return new_files
initial_file = "file0.txt"
files = get_files(initial_file)
print(files)
Returns: ['file0.txt', 'file1.txt', 'file2.txt', 'file3.txt']
Contents of file0.txt:
file1.txt
file2.txt
Contents of file1.txt:
file3.txt
file2.txt and file3.txt were blank.
Edits: Added .strip() for safety, and added the contents of the data files so this can be replicated.
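The same traversal can also be written without recursion. This is a sketch, not the answer's code: the name crawl_all is hypothetical, and it assumes every listed name is a readable file relative to the current working directory.

```python
def crawl_all(start):
    """Return every file name reachable from start, in discovery order."""
    found = [start]
    index = 0
    # found grows while we walk it, so newly discovered files get visited too.
    while index < len(found):
        with open(found[index]) as f_in:
            for name in f_in.read().splitlines():
                name = name.strip()
                if name and name not in found:
                    found.append(name)
        index += 1
    return found
```

Because each name is added at most once, files that reference each other in a cycle do not cause an endless loop.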

Python: Copying lines that meet requirements

So, basically, I need a program that opens a .dat file, checks each line to see if it meets certain prerequisites, and if they do, copy them into a new csv file.
The prerequisites are that it must 1) contain "$W" or "$S" and 2) have the last value at the end of the line of the DAT say one of a long list of acceptable terms. (I can simply make-up a list of terms and hardcode them into a list)
For example, if the CSV was a list of purchase information and the last item was what was purchased, I only want to include fruit. In this case, the last item is an ID tag, and I only want to accept a handful of ID tags; there is a list of about 5 acceptable tags. The tags vary a lot in length, but they are always the last item in the line (and always the 4th item in the line).
Let me give a better example, again with the fruit.
My original .DAT might be:
DGH$G$H $2.53 London_Port Gyro
DGH.$WFFT$Q5632 $33.54 55n39 Barkdust
UYKJ$S.52UE $23.57 22#3 Apple
WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana
Only the line "UYKJ$S.52UE $23.57 22#3 Apple" would be copied, because only it has both 1) $W or $S (in this case a $S) and 2) a fruit as the last item. Once the .csv file is made, I am going to need to go back through it and replace all the spaces with commas, but that's not nearly as problematic for me as figuring out how to scan each line for requirements and only copy the ones that are wanted.
I am making a few programs all very similar to this one, that open .dat files, check each line to see if they meet requirements, and then decides to copy them to the new file or not. But sadly, I have no idea what I am doing. They are all similar enough that once I figure out how to make one, the rest will be easy, though.
EDIT: The .DAT files are a few thousand lines long, if that matters at all.
EDIT2: Some of my current code snippets
Right now, my current version is this:
def main():
    #NewFile_Loc = C:\Users\J18509\Documents
    OldFile_Loc=raw_input("Input File for MCLG:")
    OldFile = open(OldFile_Loc,"r")
    OldText = OldFile.read()
    # for i in range(0, len(OldText)):
    #     if (OldText[i] != " "):
    #         print OldText[i]
    i = split_line(OldText)
    if u'$S' in i:
        # $S is in the line
        print i
main()
But it's very choppy still. I'm just learning python.
Brief update: the server I am working on is down, and might be for the next few hours, but I have my new code, which has syntax errors in it, but here it is anyways. I'll update again once I get it working. Thanks a bunch everyone!
import os
NewFilePath = "A:\test.txt"
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,','Name Header,','Header 3,','Header 4)
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        if (LineParts[0].find($W)) or (LineParts[0].find($S)):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(LineParts[1],',',LineParts[0],',',LineParts[2],',',LineParts[3])
    OldFile.close()
    NewFile.close()
main()
There are two parts you need to implement. First, read the file line by line and write out the lines that meet specific criteria. This is done by
with open('file.dat') as f:
    for line in f:
        stripped = line.strip() # remove '\n' from the end of the line
        if test_line(stripped):
            print stripped # Write to stdout
The criteria you want to check for are implemented in the function test_line. To check for the occurrence of "$W" or "$S", you can simply use the in operator like
if not '$W' in line and not '$S' in line:
    return False
else:
    return True
To check whether the last item in the line is contained in a fixed list, first split the line using split(), then take the last item using the index notation [-1] (negative indices count from the end of a sequence), and then use the in operator again against your fixed list. This looks like
items = line.split() # items is a list of strings
last_item = items[-1] # take the last element of the list
if last_item in ['Apple', 'Banana']:
    return True
else:
    return False
Now, you combine these two parts into the test_line function like
def test_line(line):
    if not '$W' in line and not '$S' in line:
        return False
    items = line.split() # items is a list of strings
    last_item = items[-1] # take the last element of the list
    if last_item in ['Apple', 'Banana']:
        return True
    else:
        return False
Note that the program writes the result to stdout, which you can easily redirect. If you want to write the output to a file, have a look at Correct way to write line to file in Python
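If the goal is a CSV file rather than stdout, one possible sketch uses the standard csv module. The names line_matches, copy_matching, and ACCEPTED are made up for illustration (line_matches plays the role of test_line above), and the sketch assumes fields are separated by whitespace and contain no spaces themselves:

```python
import csv

ACCEPTED = {"Apple", "Banana"}  # hypothetical list of accepted tags


def line_matches(line):
    """True if the line contains $W or $S and ends with an accepted tag."""
    items = line.split()
    if not items:
        return False  # blank line: nothing to check
    return ("$W" in line or "$S" in line) and items[-1] in ACCEPTED


def copy_matching(dat_path, csv_path):
    """Write the whitespace-separated fields of each matching line as a CSV row."""
    with open(dat_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            if line_matches(line):
                writer.writerow(line.split())
```

Letting csv.writer join the fields avoids a separate pass to replace spaces with commas.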
inlineRequirements = ['$W','$S']
endlineRequirements = ['Apple','Banana']
inputFile = open(input_filename,'r')
outputFile = open(output_filename,'w')
for line in inputFile.readlines():
    line = line.strip()
    #trailing and leading whitespace has been removed
    if any(req in line for req in inlineRequirements):
        #passed inline requirement
        lastWord = line.split(' ')[-1]
        if lastWord in endlineRequirements:
            #passed endline requirement
            outputFile.write(line.replace(' ',',') + '\n')
            #replaced spaces with commas and wrote the line with its newline restored
inputFile.close()
outputFile.close()
tags = ['apple', 'banana']
match = ['$W', '$S']
OldFile_Loc=raw_input("Input File for MCLG:")
OldFile = open(OldFile_Loc,"r")
for line in OldFile.readlines(): # Loop through the file
    line = line.strip() # Remove the newline and whitespace
    if line and not line.isspace(): # If the line isn't empty
        lparts = line.split() # Split the line
        if any(tag.lower() == lparts[-1].lower() for tag in tags) and any(c in line for c in match):
            # $S or $W is in the line AND the last section is in tags(case insensitive)
            print line
import re
list_of_fruits = ["Apple","Banana",...]
with open('some.dat') as f:
    for line in f:
        if re.findall(r"\$[SW]",line) and line.split()[-1] in list_of_fruits:
            print "Found:%s" % line
import os
NewFilePath = r"A:\test.txt" #raw string, so \t is not read as a tab
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,Name Header,Header 3,Header 4\n')
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        #skip lines that do not have at least four fields
        if len(LineParts) < 4:
            continue
        if ('$W' in LineParts[0]) or ('$S' in LineParts[0]):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(','.join([LineParts[1],LineParts[0],LineParts[2],LineParts[3]]) + '\n')
    OldFile.close()
    NewFile.close()
main()
This worked great, and has all the capabilities I needed. The other answers are good, but none of them do 100% of what I needed like this one does.
