Extracting a substring makes the for loop break in Python

I am having a lot of trouble with a text file given to me that I need to parse. This is my third attempt at parsing it (I tried both C and PHP, which seem to fail in different ways).
I have this extremely simple code:
import fileinput
for line in fileinput.input(['basin_stclair.txt']):
    print line[0:64]
For some reason the code exits after the first print.
If I print the lines whole, it never stops, but the lines are still combined. (If I only let the loop run for one iteration, I get two lines, i.e. 14 floats.)
The text file looks like this (several hundred lines like this one, 7 floats per line):
1.749766 3.735660 0.294098 310.461737 0.000000 0.231367 0.230505
When I copy the entire text in Kate it gets all jumbled and the lines combine.
The text file was made using Excel on a Windows machine. (I'm working on a Linux box.)
Any ideas?

You have some problem with the newline characters in your file. Try opening the file using Python's universal newline support:
for line in open('basin_stclair.txt', 'U'):
    print line[0:64]
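For reference, a minimal Python 3 sketch of the same fix. In Python 3, text mode applies universal-newline translation by default, so no 'U' flag is needed; a small demo file with Windows-style \r\n endings stands in for the real basin_stclair.txt, which isn't available here:

```python
# Create a small demo file with Windows-style line endings
# (stand-in for the real basin_stclair.txt).
with open('basin_stclair_demo.txt', 'wb') as f:
    f.write(b'1.0 2.0 3.0\r\n4.0 5.0 6.0\r\n')

# Python 3 text mode translates \r\n (and lone \r) to \n automatically.
lines = []
with open('basin_stclair_demo.txt') as f:
    for line in f:
        lines.append(line[0:64])

print(lines)  # ['1.0 2.0 3.0\n', '4.0 5.0 6.0\n']
```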

Are you trying to print the first 64 lines? If so, try something like this:
i = 0
for line in fileinput.input(['basin_stclair.txt']):
    print line[0:64]
    i = i + 1
    if i > 63:
        break
Are you trying to print the first 64 characters of each line? Try something like this:
for line in fileinput.input(['basin_stclair.txt']):
    if len(line) > 63:
        print line[0:64]

Related

Precurse with open() or .write()?

Is there a way to precurse a write function in Python (I'm working with FASTA files, but any write function that works with text files should work)?
The only way I could think is to read the whole file in as an array and count the number of lines I want to start at and just re-write that array, at that value, to a text file.
I was just thinking there might be a write an option or something somewhere.
I would add some code, but I'm writing it right now, and everyone on here seems to be pretty well versed, and probably know what I'm talking about. I'm an EE in the CS domain and just calling on the StackOverflow community to enlighten me.
From what I understand, you want to truncate a file from the start - i.e. remove the first n lines.
Then no - there is no way to do it without reading in the lines and ignoring them. This is what I would do:
import shutil

remove_to = 5  # Remove lines 0 to 5
try:
    with open('precurse_me.txt') as inp, open('temp.txt', 'w') as out:
        for index, line in enumerate(inp):
            if index <= remove_to:
                continue
            out.write(line)
    # If you don't want to replace the original file - delete this
    shutil.move('temp.txt', 'precurse_me.txt')
except Exception as e:
    raise e
Here I open a file for the output and then use shutil.move() to replace the input file only after the processing (the for loop) is complete. I do this so that I don't break the 'precurse_me.txt' file in case the processing fails. I wrap the whole thing in a try/except so that if anything fails it doesn't try to move the file by accident.
The key is the for loop - read the input file line by line; using the enumerate() function to count the lines as they come in.
Ignore those lines (by using continue) until the index says to not ignore the line - after that simply write each line to the out file.
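An alternative sketch of the same idea uses itertools.islice to skip the leading lines instead of an explicit index check (a sample input file is created here for demonstration; the filename is taken from the answer above):

```python
import shutil
from itertools import islice

remove_to = 5  # drop lines 0 to 5, keep the rest

# Create a sample input file to demonstrate with.
with open('precurse_me.txt', 'w') as f:
    f.writelines('line %d\n' % i for i in range(10))

with open('precurse_me.txt') as inp, open('temp.txt', 'w') as out:
    # islice skips the first remove_to + 1 lines and yields the rest.
    out.writelines(islice(inp, remove_to + 1, None))

shutil.move('temp.txt', 'precurse_me.txt')
print(open('precurse_me.txt').read())  # lines 6 through 9 remain
```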

Python conditional statement based on text file string

Noob question here. I'm scheduling a cron job for a Python script to run every 2 hours, but I want the script to stop running after 48 hours, which is not a feature of cron. To work around this, I'm recording the number of executions at the end of the script in a text file, using a tally mark x, and opening the text file at the beginning of the script to only run if the count is less than n.
However, my script seems to always run regardless of the conditions. Here's an example of what I've tried:
with open("curl-output.txt", "a+") as myfile:
    data = myfile.read()
    finalrun = "xxxxx"
    if data != finalrun:
        [CURL CODE]
    with open("curl-output.txt", "a") as text_file:
        text_file.write("x")
        text_file.close()
I think I'm missing something simple here. Please advise if there is a better way of achieving this. Thanks in advance.
The problem with your original code is that you're opening the file in a+ mode, which seems to set the seek position to the end of the file (try print(data) right after you read the file). If you use r instead, it works. (I'm not sure that's how it's supposed to be. This answer states it should write at the end, but read from the beginning. The documentation isn't terribly clear).
Some suggestions: Instead of comparing against the "xxxxx" string, you could just check the length of the data (if len(data) < 5). Or alternatively, as was suggested, use pickle to store a number, which might look like this:
import pickle

try:
    with open("curl-output.txt", "rb") as myfile:
        num = pickle.load(myfile)
except FileNotFoundError:
    num = 0

if num < 5:
    do_curl_stuff()
    num += 1
    with open("curl-output.txt", "wb") as myfile:
        pickle.dump(num, myfile)
Two more things concerning your original code: You're making the first with block bigger than it needs to be. Once you've read the string into data, you don't need the file object anymore, so you can remove one level of indentation from everything except data = myfile.read().
Also, you don't need to close text_file manually. with will do that for you (that's the point).
This sounds more like a job for scheduling with the at command.
See http://www.ibm.com/developerworks/library/l-job-scheduling/ for different job scheduling mechanisms.
The first bug that is immediately obvious to me is that you are appending to the file even if data == finalrun. So when data == finalrun, you don't run curl but you do append another 'x' to the file. On the next run, data will be not equal to finalrun again so it will continue to execute the curl code.
The solution is of course to nest the code that appends to the file under the if statement.
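A runnable sketch of that fix, with the curl code stubbed out as a hypothetical do_curl() function; the tally mark is appended only when the job actually ran:

```python
import os

FINALRUN = "xxxxx"  # five tally marks = five allowed runs

def do_curl():
    pass  # stand-in for the real curl code

def run_once(path="curl-output.txt"):
    with open(path, "a+") as myfile:
        myfile.seek(0)  # 'a+' starts at end of file; rewind before reading
        data = myfile.read()
    if data != FINALRUN:
        do_curl()
        # The fix: append the tally mark inside the if statement.
        with open(path, "a") as f:
            f.write("x")

# Fresh start for the demo, then simulate 8 cron invocations.
if os.path.exists("curl-output.txt"):
    os.remove("curl-output.txt")
for _ in range(8):
    run_once()
print(open("curl-output.txt").read())  # only 5 tally marks accumulate
```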
Well, there is probably an end-of-line \n character, which means your file will contain something like xx\n and not simply xx. That is probably why your condition does not work :)
EDIT
If, through the Python command line, you type
open('filename.txt', 'r').read()  # where filename.txt is the name of your file
you will be able to see whether there is a \n or not.
Try using this condition in the if clause instead:
if data.count('x') == 24:
The data string may contain extraneous data like newline characters. Check repr(data) to see if it is actually 24 x's.
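A quick illustration of why repr() and count() both help here (the trailing newline is an assumed example of what the file might contain):

```python
data = "xx\n"           # what the file might actually contain
print(repr(data))       # 'xx\n' - the hidden newline becomes visible
print(data.count('x'))  # 2 - counting marks ignores the newline
```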

File contents not as long as expected

with open(sourceFileName, 'rt') as sourceFile:
    sourceFileConents = sourceFile.read()
sourceFileConentsLength = len(sourceFileConents)
i = 0
while i < sourceFileConentsLength:
    print(str(i) + ' ' + sourceFileConents[i])
    i += 1
Please forgive the unPythonic for i loop; this is only the test code, and there are reasons to do it that way in the real code.
Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.
The sourceFileConentsLength reports as 13,690, but when I print it out char by char, there are still a few hundred chars more in the file which are not being printed out.
What gives?
Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
Have I hit some maximum string length? If so, can I get around it?
Might it be line endings, if the file was edited in Windows and the script is run in Linux? (Sorry, I can't post the file, it's company confidential.)
What else?
[Update] I think we can strike two of those ideas.
For maximum string length, see this question.
I did an ls -lAF to a temp directory. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows and Linux, but the script will only run under Linux.
[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.
If you read a file in text mode it will automatically convert line endings like \r\n to \n.
Try using
with open(sourceFileName, newline='') as sourceFile:
instead; this will turn off newline-translation (\r\n will be returned as \r\n).
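A small demonstration of the difference, using a throwaway file written with Windows-style \r\n endings (the filename is hypothetical):

```python
# Write 3 lines with Windows-style endings.
with open('newline_demo.txt', 'wb') as f:
    f.write(b'a\r\nb\r\nc\r\n')

translated = open('newline_demo.txt', 'rt').read()       # \r\n -> \n
raw = open('newline_demo.txt', 'rt', newline='').read()  # endings kept as-is

print(len(translated), len(raw))  # 6 9 - the raw text is 3 chars longer
```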
If your file is encoded in something like UTF-8, you should decode it before counting the characters:
sourceFileContents_utf8 = open(sourceFileName, 'r+').read()
sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')
print(len(sourceFileContents_unicode))
i = 0
source_file_contents_length = len(sourceFileContents_unicode)
while i < source_file_contents_length:
    print('%s %s' % (str(i), sourceFileContents_unicode[i]))
    i += 1
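Note that the decode() call above is Python 2 style. In Python 3 there is no separate decode step for text files: pass encoding= to open() and read() returns a str of characters, whose length can differ from the byte count (a small sketch with a throwaway file):

```python
# 'héllo' is 5 characters but 6 bytes in UTF-8 ('é' encodes as 2 bytes).
with open('utf8_demo.txt', 'w', encoding='utf-8') as f:
    f.write('héllo')

contents = open('utf8_demo.txt', encoding='utf-8').read()
print(len(contents))                   # 5 characters
print(len(contents.encode('utf-8')))  # 6 bytes
```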

Fastest way to split super long line into multiple lines

I have a huge XML-File (about 1TB) that is written in one long line.
I want to extract some of its features and think that it is easier to do this, as soon as I have the long line split into new lines after each tag.
The file is built like that:
<textA textB textC> <textD textE textF> <textG textH textI>
I now started cracking the long line with this code:
eof = 0
while eof == 0:
    character = historyfile.read(1)
    if character != ">" and character != "":
        output.write(character)
    if character == ">":
        output.write('>' + '\n')
    if character == "":
        eof = 1
Unfortunately this code will take about 12 days to process the whole file.
I am now wondering whether there is a much faster way to process the file in a similar fashion, at least twice as fast.
My first idea is to maybe just parse through the file and replace the closing tag like this:
for line in infile:
    line.replace('>', '>' + '\n')
Do you think this approach will be much faster? I would try it by myself, but I already have the first code running for 1 and a half days ;)
If you were to just read the file line by line, which would be just one line of 1 TB, you would get a str variable of the same length. I do not know the implementation details, but I would guess a MemoryError is raised long before the read finishes.
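A middle ground that avoids both the one-character reads and the 1 TB string is to read fixed-size chunks and run str.replace() on each chunk; replace() runs in C and returns a new string (note it never modifies the string in place, so the result must be written out). A sketch, with a tiny sample input standing in for the real file:

```python
CHUNK = 1 << 20  # 1 MiB per read

# Tiny stand-in for the real one-line 1 TB file.
with open('history_demo.xml', 'w') as f:
    f.write('<a b c> <d e f> <g h i>')

with open('history_demo.xml') as infile, open('split_demo.xml', 'w') as out:
    while True:
        chunk = infile.read(CHUNK)
        if not chunk:
            break
        # replace() returns a new string with '\n' added after each '>'.
        out.write(chunk.replace('>', '>\n'))

print(open('split_demo.xml').read())
```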

Reading file in Python one line at a time

I do appreciate this question has been asked a million times, but I can't figure out why, when attempting to read a .txt file line by line, I get the entire file read in one go.
This is my little snippet
num = 0
with open(inStream, "r") as f:
    for line in f:
        num += 1
        print line + " ..."
print num
Having a look at the open function, there is nothing that suggests a second param to limit the reading, as that is just the "mode" to open the file.
So I can only guess there is some problem with my file, but this is a txt file, with entries line by line.
Any hint?
Without a little more information, it's hard to be absolutely sure… but most likely, your problem is inappropriate line endings.
For example, on a modern Mac OS X system, lines in text files end with '\n' newline characters. So, when you do for line in f:, Python breaks the text file on '\n' characters.
But on classic Mac OS 9, lines in text files ended with '\r' instead. If you have some ancient classic Mac text files lying around, and you give one to Python, it will go looking for '\n' characters and not find any, so it'll think the whole file is one giant line.
(Of course in real life, Windows is a problem more often than classic Mac OS, but I used this example because it's simpler.)
Python 2: Fortunately, Python has a feature called "universal newlines". For full details, see the link, but the short version is that adding "U" onto the end of the mode when opening a text file means Python will read any of the three standard line-ending conventions (and give them to your code as Unix-style '\n').
In other words, just change one line:
with open(inStream, "rU") as f:
Python 3: Universal newlines are part of the standard behavior; adding "U" has no effect and is deprecated.
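A small sketch of that Python 3 behavior, using a throwaway file written with classic Mac-style \r endings:

```python
# Classic Mac OS 9 files ended lines with '\r' alone.
with open('oldmac_demo.txt', 'wb') as f:
    f.write(b'one\rtwo\rthree\r')

# Python 3 text mode translates \r, \r\n and \n all to \n by default.
with open('oldmac_demo.txt') as f:
    lines = [line.rstrip('\n') for line in f]

print(lines)  # ['one', 'two', 'three']
```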
