Read the last line of the file - python

Imagine I have a file with
Xpto,50,30,60
Xpto,a,v,c
Xpto,1,9,0
Xpto,30,30,60
That txt file can be appended to many times, and when I open the file I only want to get the values from its last line. How can I do that in Python? Just read the last line?

I think my answer from the last time this came up was sadly overlooked. :-)
If you're on a unix box, os.popen("tail -10 " + filepath).readlines() will probably be the fastest way. Otherwise, it depends on how robust you want it to be. The methods proposed so far will all fall down, one way or another. For robustness and speed in the most common case you probably want something like a logarithmic search: use file.seek to go to the end of the file minus 1000 characters, read it in, check how many lines it contains; then go to EOF minus 3000 characters, read in 2000 characters, count the lines; then EOF minus 7000, read in 4000 characters, count the lines; and so on, until you have as many lines as you need. But if you know for sure that it's always going to be run on files with sensible line lengths, you may not need that.
You might also find some inspiration in the source code for the unix tail command.
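For illustration, here is a minimal sketch of that growing backwards search (hedged: the function name, the starting chunk size, and the simple doubling of the window are placeholders rather than the answerer's exact 1000/3000/7000 scheme):
def tail(filepath, n=10):
    """Return the last n lines of filepath, reading only the end of the file."""
    with open(filepath, "rb") as f:
        f.seek(0, 2)                  # jump to the end of the file
        file_size = f.tell()
        chunk = 1024
        while True:
            start = max(0, file_size - chunk)
            f.seek(start)
            data = f.read(file_size - start)
            # stop once we have seen enough newlines, or the whole file is in memory
            if data.count(b"\n") > n or start == 0:
                return [line.decode() for line in data.splitlines()[-n:]]
            chunk *= 2                # not enough lines yet: widen the window and retry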

f.seek(pos, 2) seeks to pos bytes relative to the end of the file (in Python 3 the file must be opened in binary mode for a nonzero offset from the end).
Try a reasonable (negative) value for pos, then readlines() and take the last line.
You have to account for the case when pos is not a good guess, e.g. suppose you choose 300 but the last line is 600 characters long. In that case, just try again with a larger guess until you capture the entire line (this worst case should be very rare).
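A minimal sketch of that retry loop (hedged: filepath is a placeholder name, and the file is opened in binary mode so that seeking relative to the end is allowed):
import os

filepath = "data.txt"                       # hypothetical file name
size = os.path.getsize(filepath)
pos = 300                                   # reasonable first guess
with open(filepath, "rb") as f:
    while True:
        f.seek(-min(pos, size), os.SEEK_END)
        lines = f.readlines()
        if len(lines) > 1 or pos >= size:   # last line is complete (or we read the whole file)
            break
        pos *= 2                            # the guess was too small; try a bigger one
last_line = lines[-1].decode() if lines else ""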

Um, why not just seek to the end of the file and read backwards until you hit a newline?
# assumes f is open in binary mode ('rb'); text-mode files in Python 3 reject nonzero seeks from the end
f.seek(0, 2)
size = f.tell()
i = -2                      # skip the very last byte (usually a trailing '\n')
while -i <= size:
    f.seek(i, 2)            # i bytes back from the end of the file
    if f.read(1) == b"\n":  # the read leaves us at the start of the last line
        break
    i -= 1
else:
    f.seek(0)               # no newline found: the whole file is the last line
last_line = f.readline().decode()

Not sure about a Python-specific implementation, but in a more language-agnostic fashion, what you would want to do is skip (seek) to the end of the file, and then read each character in backwards order until you reach the line feed character (value 10; on Windows it may be preceded by a carriage return, value 13). Just read forward from that point to the end of the file, and you will have the last line in the file.

You need to use the realloc() function, which extends the size of an allocated block. The program should look something like this:
Allocate a default-sized block with malloc.
Read the next number from your input.
If you got a number and this is not EOF (end of file), use realloc to extend the allocated size by one element and put the new number at the end.
Keep doing this until you reach EOF.
Of course this is just one solution, and there may be others.
Another solution is a kind of trick that avoids realloc(): you can read your file twice.
Open the file.
Iterate through its content and find the size of the future array.
Close the file.
Allocate the memory.
Open the file again.
Read the numbers from the file and fill your array.
P.S. In the future, try to be more specific when writing question titles.
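Since the rest of this page is Python, here is a rough sketch of the second (two-pass) idea using Python's array module as a stand-in for a fixed-size C array; "numbers.txt" is a hypothetical input file with one number per line:
from array import array

with open("numbers.txt") as f:
    count = sum(1 for _ in f)        # first pass: find the size of the future array
values = array("i", [0]) * count     # "allocate" a fixed-size array of ints
with open("numbers.txt") as f:
    for i, line in enumerate(f):     # second pass: fill the array
        values[i] = int(line)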

Efficiently search for many different strings in large file

I am trying to find a fast way of searching for strings in a file. First of all, I don't have only one string to find: I have a list of 1900 strings to find in a file which is 150 MB. So basically I am opening the file and looping 1900 times over it to find all occurrences of each string. Here are some attributes of my search:
The file to be searched is a 150 MB text file.
I need to find all occurrences of 1900 strings in the file, which means I am looping over the entire file 1900 times.
It's not a simple search; I have to use regex to search for the strings.
In a few cases, I need the line above and the line below where the search string was found, so I need to use file.readlines(), not file.read().
In a few cases I also have to replace the found string with a new string.
First I am trying to find the best way to search the file. My code is taking too long, and I am not sure this is the best way to do it:
#searchstrings is a list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()
This code does the job, but it's extremely slow. Also it does not give me the option to get the line above or below where the search string is found.
The other code I am using, to replace strings, is below. This code is also extremely slow. Here I am using regex.
file = open("mytextfile.txt", "r")
file_data = file.read()
#searchstrings is list of 1900 strings
#replacestrings is list of 1900 strings that needs to be replaced
for i in range(len(searchstrings)):
src_str = re.compile(searchstrings[i], re.IGNORECASE)
file_data = src_str.sub(replacestrings[i], file_data)
file.close()
I know the performance of the code also depends on the computing power; I just want to know the best way to write this code so that it runs at optimum speed on given hardware. I would also like to know how to time the program's execution.
I like Unix commands; they are fun, fast and efficient.
import re, sys
# grep-like filter: print every stdin line that matches the regex given on the command line
pattern = re.compile(sys.argv[1])
sys.stdout.writelines(line for line in sys.stdin if pattern.search(line))
A few observations.
For idiomatic Python, you usually want
for string in searchstrings:
    ...
instead of
for i in range(len(searchstrings)):
    searchstrings[i]
and with open(filename) as f: ... instead of open()/close(). The with statement will close the file automatically.
When you want to replace any of several strings with a regex, you can do
re.sub('|'.join(YOUR_STRINGS), replacement, text)
because | is the regex symbol for "or", instead of looping over them all individually.
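For instance, here is a hedged sketch that assumes the search strings are plain literals (not regex patterns) and reuses the searchstrings, replacestrings, and file_data names from the question:
import re

pairs = {s.lower(): r for s, r in zip(searchstrings, replacestrings)}
combined = re.compile("|".join(re.escape(s) for s in searchstrings), re.IGNORECASE)
# one pass over the whole 150 MB instead of 1900 separate passes
file_data = combined.sub(lambda m: pairs[m.group(0).lower()], file_data)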
For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language but often much faster.
On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool for the job, like Ag or RipGrep, which have already been optimized for this job -- possibly through the subprocess.run() function if you're working in Python.
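For example, a sketch of calling ripgrep from Python (assuming rg is installed and the 1900 patterns have been written, one per line, to a hypothetical patterns.txt; -C 1 prints one line of context above and below each match, which matches the question's requirement):
import subprocess

result = subprocess.run(
    ["rg", "-C", "1", "-f", "patterns.txt", "mytextfile.txt"],
    capture_output=True, text=True,
)
print(result.stdout)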

How to loop through binary blocks using python and stop at file end?

I have a binary file that needs to be read sequentially because it is structured in variable-size blocks that describe themselves at their start.
I thus want to loop through a set of:
+ get next block structure and limits
+ parse it
+ move on to the next block.
At some point I need to know that the file has reached its end.
How can I do this, since Python doesn't have an EOF check to put in a while loop?
Answers to similar questions online simply stated that you can stop parsing when file.read() gives you back no bytes, or fewer bytes than you asked for.
Fine, but to do that I would need to read the next byte after parsing each block, which is annoying because that byte could be part of the next structure definition.
Below are the two while loops I devised to solve this.
1) Taking advantage of file.read(num_of_bytes) in Python, which returns an empty result when there are no more bytes to read, one solution could be:
while dat.read(1):        # probe one byte; an empty result means EOF
    dat.seek(-1, 1)       # step back over the probe byte (dat must be open in binary mode)
    [your business here]
The back-seek lets you continue from where you left off after the previous block.
2) If you know your file doesn't change while you are reading it, you can take advantage of its length and simply use:
flength = os.stat(your_file).st_size  # your file size in bytes
while dat.tell() < flength:
    [your business here]
This takes advantage of the .tell() method, which gives you the byte offset you are currently at in the file.
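Putting the second loop together with a header parse, here is a hedged sketch; it assumes a hypothetical block format where each block starts with a 4-byte little-endian length followed by that many payload bytes, and a hypothetical file name blocks.bin:
import os
import struct

flength = os.stat("blocks.bin").st_size        # total file size in bytes
with open("blocks.bin", "rb") as dat:
    while dat.tell() < flength:
        header = dat.read(4)
        if len(header) < 4:                    # truncated header: treat as end of file
            break
        (size,) = struct.unpack("<I", header)  # payload length of this block
        payload = dat.read(size)
        # ... parse the payload here ...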

Python readline() and readlines() not working

I'm trying to read the contents of a 5GB file and then sort them and find duplicates. The file is basically just a list of numbers (each on a new line). There are no empty lines or any symbols other than digits. The numbers are all pretty big (at least 6 digits). I am currently using
for line in f:
    # do something to line
to avoid memory problems. I am fine with using that. However, I am interested to know why readline() and readlines() didn't work for me. When I try
print f.readline(10)
the program always returns the same line no matter which number I use as a parameter. To be precise, if I do readline(0) it returns an empty line, even though the first line in the file is a big number. If I try readline(1) it returns 2, even though the number 2 is not in the file. When the parameter is >= 6, it always returns the same number: 291965.
Additionally, the readlines() method always returns the same lines no matter what the parameter is. Even if I try to print f.readlines(2), it's still giving me a list of over 1000 numbers.
I am not sure if I explained it very well. Sorry, English is not my first language. Anyway, I can make it work without the readline methods but I really want to know why they don't work as expected.
This is what the first 10 lines of the file look like:
548098
968516
853181
485102
69638
689242
319040
610615
936181
486052
I cannot reproduce f.readline(1) returning 2, or f.readlines(10) returning thousands of lines, but it seems like you misunderstood what the integer parameters to those functions do.
Those numbers do not specify which line to read, but the maximum number of bytes readline will read.
>>> f = open("data.txt")
>>> f.readline(1)
'5'
>>> f.readline(100)
'48098\n'
Both commands will read from the first line, which is 548098; the first will only read 1 byte, and the second command reads the rest of the line, as there are fewer than 100 bytes left. If you call readline again, it will continue with the second line, and so on.
Similarly, f.readlines(10) will read full lines until the total number of bytes read is larger than the specified number:
>>> f.readlines(10)
['968516\n', '853181\n']

how to load a big file and cut it into smaller files?

I have a file of about 4 MB (which I called a big one). This file has about 160,000 lines in a specific format, and I need to cut it at certain intervals (not equal intervals), i.e. at the end of a certain pattern, and write each part into another file.
Basically, what I want is to copy the information from the big file into many smaller files: as I read the big file, keep writing the information into one file, and once a certain pattern occurs, stop and start writing from that line into another file, and so on.
Normally, if it were a small file, I guess it could be done: I could use file.readline() to read each line, check whether the pattern has ended, if not write the line to a file, and if the pattern has ended change the file name and open a new file, and so on. But how do I do this for the big file?
Thanks in advance.
I didn't mention the file format as I felt it is not necessary; I will mention it if required.
I would first read all of the allegedly-big file in memory as a list of lines:
with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()
That should take little more than 4 MB -- tiny even by the standards of today's phones, let alone ordinary computers.
Then, perform whatever processing you need to determine the beginning and end of each group of lines you want to write out to smaller files (I'm not sure from your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution where they're fully allowed to -- this will also cover more constrained use cases with no real performance penalty, though the code might be a tad simpler if the constraints were very rigid).
Say that you put these numbers in lists: starts (0-based index of the first line to write, included), ends (0-based index of the first line NOT to write -- it may legitimately and innocuously be len(lines) or more), and names (the filenames to write to), all lists having the same length of course.
Then, lastly:
assert len(starts) == len(ends) == len(names)
for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])
...and that's all you need to do!
Edit: the OP seems to be confused by the concept of having these lists, so let me try to give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are to be result0.txt, result1.txt, and so on.
It's an error if the number of "closing ends" differs from that of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.
A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)
outfile = 0
starts = []
ends = []
names = []
for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1
That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.
As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
A 4 MB file is very small; it fits in memory for sure. The fastest approach would be to read it all and then iterate over each line, searching for the pattern and writing the line out to the appropriate file depending on the pattern (your approach for small files).
I'm not going to get into the actual code, but pseudocode would look like this:
BIGFILE="filename"
SMALLFILE="smallfile1"
while(readline(bigfile)) {
write(SMALLFILE, line)
if(line matches pattern) {
SMALLFILE="smallfile++"
}
}
That is really bad code, but maybe you get the point. I should also say that it doesn't matter how big your file is, since you have to read the whole file anyway.
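Since this thread is about Python, here is a minimal runnable sketch of that idea (hedged: "bigfile.txt" and the 'end' marker are placeholders, since the question never gives the real file name or pattern):
out_index = 0
out = open("smallfile%d.txt" % out_index, "w")
with open("bigfile.txt") as bigfile:
    for line in bigfile:
        out.write(line)
        if "end" in line:               # placeholder for the real pattern test
            out.close()
            out_index += 1
            out = open("smallfile%d.txt" % out_index, "w")
out.close()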
