You need to use the realloc() function, which resizes a previously allocated block. The program should go something like this:
Allocate an initial buffer with malloc().
Read the next number from your input.
If you got a number and have not hit EOF (end of file), use realloc() to grow the buffer by one element and store the new number at the end.
Keep doing this until you reach EOF.
Of course this is just one solution, and there may be others.
Another solution is a trick that avoids realloc(): read your file twice.
Open the file
Iterate through its contents to determine the size of the future array
Close the file
Allocate the memory
Open the file again
Read the numbers from the file and fill your array
P.S. In the future, try to be more specific when writing question titles.
I have a binary file that needs to be read sequentially because it is structured in variable-size blocks that describe themselves at their start.
I thus want to loop through a set of:
+ get next block structure and limits
+ parse it
+ move on to the next block.
At some point I need to know that the file has reached its end.
How can I do this, since Python doesn't have an EOF check to put in a while loop?
Answers to similar questions online simply state that you can stop parsing when file.read() gives you back no bytes, or fewer bytes than you asked for.
Fine, but after parsing one block I would need to read the next byte just to perform that check, which is annoying because it could be a byte that belongs to the next structure definition.
Below are the two while loops I devised to solve this.
1) Taking advantage of file.read(num_of_bytes) in Python, which tells you when there are no more bytes to read, one solution could be:
while dat.read(1):
    dat.seek(-1, 1)
    [your business here]
The back-seek lets you continue from where you left off in the previous block.
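For completeness, here is a minimal runnable sketch of that first loop. The block layout it assumes (a 4-byte little-endian length prefix followed by the payload) and the file name blocks.bin are made up for illustration, not part of the original question:

import struct

with open("blocks.bin", "rb") as dat:
    while dat.read(1):                      # read(1) returns b'' at EOF, which is falsy
        dat.seek(-1, 1)                     # step back so the block is read from its start
        (block_len,) = struct.unpack("<I", dat.read(4))   # hypothetical length prefix
        payload = dat.read(block_len)
        # ... parse payload here ...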
2) If you know your file doesn't change while reading it, you can take advantage of its length and simply use
import os

flength = os.stat(your_file).st_size  # your file size in bytes
while dat.tell() < flength:
    [your business here]
This takes advantage of the .tell() method, which tells you the byte offset you are currently at in the file.
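And a similarly self-contained sketch of the second loop, under the same hypothetical length-prefixed block layout and made-up file name:

import os
import struct

flength = os.stat("blocks.bin").st_size    # total file size in bytes
with open("blocks.bin", "rb") as dat:
    while dat.tell() < flength:
        (block_len,) = struct.unpack("<I", dat.read(4))   # hypothetical length prefix
        payload = dat.read(block_len)
        # ... parse payload here ...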
I'm trying to read the contents of a 5GB file and then sort them and find duplicates. The file is basically just a list of numbers (each on a new line). There are no empty lines or any symbols other than digits. The numbers are all pretty big (at least 6 digits). I am currently using
for line in f:
    # do something to line
to avoid memory problems. I am fine with using that. However, I am interested to know why readline() and readlines() didn't work for me. When I try
print f.readline(10)
the program always returns the same line no matter which number I use as a parameter. To be precise, if I do readline(0) it returns an empty line, even though the first line in the file is a big number. If I try readline(1) it returns 2, even though the number 2 is not in the file. When the parameter is >= 6, it always returns the same number: 291965.
Additionally, the readlines() method always returns the same lines no matter what the parameter is. Even if I try to print f.readlines(2), it's still giving me a list of over 1000 numbers.
I am not sure if I explained it very well. Sorry, English is not my first language. Anyway, I can make it work without the readline methods but I really want to know why they don't work as expected.
This is what the first 10 lines of the file look like:
548098
968516
853181
485102
69638
689242
319040
610615
936181
486052
I cannot reproduce f.readline(1) returning 2, or f.readlines(10) returning "thousands of lines", but it seems you have misunderstood what the integer parameters to those functions do.
Those numbers do not specify the number of the line to read, but the maximum number of bytes readline will read.
>>> f = open("data.txt")
>>> f.readline(1)
'5'
>>> f.readline(100)
'48098\n'
Both commands read from the first line, which is 548098; the first reads only 1 byte, and the second reads the rest of the line, as there are fewer than 100 bytes left. If you call readline again, it will continue with the second line, and so on.
Similarly, f.readlines(10) will read full lines until the total number of bytes read exceeds the specified hint:
>>> f.readlines(10)
['968516\n', '853181\n']
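If the intent was to read a specific line by its line number, one common idiom (not part of the original answer, just a sketch) is to slice the file iterator:

from itertools import islice

with open("data.txt") as f:
    # skip the first two lines, then take the next one (line numbers start at 0)
    third_line = next(islice(f, 2, 3), None)
    print(third_line)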
Goal
Read in a massive binary file (approx. 1.3 GB), change certain bits, and then write it back to a separate file (the original file cannot be modified).
Method
When I read in the binary file it gets stored in a massive string, encoded in hex format, which is immutable since I am using Python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all the indexes in the string need to be modified to the same value. I cannot do this in place due to the string's immutability. I cannot convert it into a list of chars, because that blows past my memory constraints and takes a very long time. The viable thing to do is to store it in a separate string, but due to the immutability I have to create a ton of string objects and keep concatenating to them.
I used some ideas from https://waymoot.org/home/python_string/, but they don't give me good performance. Any ideas? The goal is to copy an existing, very long string exactly into another, except for certain placeholders determined by the values in the index list.
So, to be honest, you shouldn't be reading your file into a string, and you especially shouldn't be writing anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes, among them Linux, OS X and *BSD, as well as other OSes like Windows), you can use Python's mmap module to map the file, scan through it, and edit the bytes you need without ever loading it completely into RAM and then writing it back out. A contrived example, copying the file while converting every byte with value 12 to something position-dependent:
import mmap

with open("infilename", "rb") as in_f:
    # length = 0: map the complete file, read-only
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)
    length = in_view.size()
    with open("outfilename", "w+b") as out_f:
        # the output file must already have the right size before it can be mapped
        out_f.truncate(length)
        out_view = mmap.mmap(out_f.fileno(), length)
        for i in range(length):
            if in_view[i] == 12:
                out_view[i] = in_view[i] + i % 10
            else:
                out_view[i] = in_view[i]
What about slicing the string, modifying each slice, and writing it back to disk before moving on to the next slice? Too intensive for the disk?
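A minimal sketch of that chunked idea, assuming the replacement can be decided per byte; the chunk size, the to_change offsets and NEW_BYTE are made-up values for illustration:

CHUNK = 1 << 20                       # 1 MiB per slice; an arbitrary choice
to_change = {10, 200, 4096}           # hypothetical absolute byte offsets to patch
NEW_BYTE = 0xFF                       # hypothetical replacement value

with open("infilename", "rb") as src, open("outfilename", "wb") as dst:
    offset = 0
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        buf = bytearray(chunk)        # a mutable copy of this slice only
        for i in to_change:
            if offset <= i < offset + len(buf):
                buf[i - offset] = NEW_BYTE
        dst.write(buf)
        offset += len(buf)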
I have a file of about 4 MB (which I call the big one). This file has about 160,000 lines in a specific format, and I need to cut it at certain points (not at equal intervals), i.e. at the end of a certain pattern, and write each part into another file.
Basically, what I want is to copy the information from the big file into many smaller files: as I read the big file, keep writing the information into one file, and once a certain pattern occurs, stop and start writing from that line into another file.
Normally, if it were a small file, I guess this could be done with file.readline() to read each line, check whether the pattern has ended, write the line to a file if it hasn't, and if it has, change the file name and open a new file, and so on. But how do I do this for the big file?
Thanks in advance.
I didn't mention the file format as I felt it is not necessary; I will mention it if required.
I would first read all of the allegedly big file into memory as a list of lines:
with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()
This should take little more than 4 MB -- tiny even by the standards of today's phones, let alone ordinary computers.
Then, perform whatever processing you need to determine the beginning and end of each group of lines you want to write out to a smaller file (I'm not sure from your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution where they're fully allowed to -- this also covers more constrained use cases with no real performance penalty, though the code might be a tad simpler if the constraints were very rigid).
Say that you put these numbers in lists starts (0-based index of the first line to write, included), ends (0-based index of the first line NOT to write -- may legitimately and innocuously be len(lines) or more), and names (filenames to which you want to write), all lists having the same length, of course.
Then, lastly:
assert len(starts) == len(ends) == len(names)
for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])
...and that's all you need to do!
Edit: the OP seems to be confused by the concept of having these lists, so let me give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are result0.txt, result1.txt, and so on.
It's an error if the number of "closing ends" differs from the number of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.
A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)
outfile = 0
starts = []
ends = []
names = []
for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1
That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.
As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
A 4 MB file is very small; it fits in memory for sure. The fastest approach would be to read it all and then iterate over each line, searching for the pattern and writing the line out to the appropriate file depending on the pattern (your approach for small files).
I'm not going to get into the actual code, but pseudo code would look like this:
BIGFILE="filename"
SMALLFILE="smallfile1"
while(readline(bigfile)) {
write(SMALLFILE, line)
if(line matches pattern) {
SMALLFILE="smallfile++"
}
}
Which is really bad code, but maybe you get the point. I should also say that it doesn't matter how big your file is, since you have to read the whole file anyway.
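A minimal runnable Python sketch of that idea; the terminator pattern "END" and the file names are guesses for illustration, since the OP didn't give the actual format:

PATTERN = "END"                             # hypothetical block terminator; the real format wasn't given

part = 0
out = open("smallfile%d.txt" % part, "w")
with open("bigfile.txt") as big:
    for line in big:
        out.write(line)
        if PATTERN in line:                 # block finished: switch to the next output file
            out.close()
            part += 1
            out = open("smallfile%d.txt" % part, "w")
out.close()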
Imagine I have a file with
Xpto,50,30,60
Xpto,a,v,c
Xpto,1,9,0
Xpto,30,30,60
That txt file can be appended to many times, and when I open it I only want the values from its last line. How can I do that in Python? By reading the last line?
I think my answer from the last time this came up was sadly overlooked. :-)
If you're on a unix box, os.popen("tail -10 " + filepath).readlines() will probably be the fastest way. Otherwise, it depends on how robust you want it to be. The methods proposed so far will all fall down, one way or another.
For robustness and speed in the most common case you probably want something like a logarithmic search: use file.seek to go to the end of the file minus 1000 characters, read that in and check how many lines it contains; then go to EOF minus 3000 characters, read in 2000 characters and count the lines; then EOF minus 7000, read in 4000 characters, count the lines; and so on, until you have as many lines as you need. But if you know for sure that it's always going to be run on files with sensible line lengths, you may not need that.
You might also find some inspiration in the source code for the unix tail command.
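A sketch of that growing back-seek idea (my own rendering under the assumptions above, not the quoted author's code); the file is opened in binary mode so that seeking relative to the end works:

import os

def tail(path, n=10):
    # return (roughly) the last n lines of the file at path
    with open(path, "rb") as f:
        size = os.stat(path).st_size
        block, data = 1024, b""
        # keep doubling the amount read from the end until enough newlines are seen
        while data.count(b"\n") <= n and len(data) < size:
            block = min(block * 2, size)
            f.seek(size - block)
            data = f.read(block)
        return [line.decode() for line in data.splitlines()[-n:]]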
f.seek(-pos, 2) seeks to pos bytes before the end of the file (in Python 3 the file must be opened in binary mode to seek relative to the end).
Try a reasonable value for pos, then call readlines() and take the last line.
You have to account for the case when pos is not a good guess, i.e. suppose you choose 300 but the last line is 600 characters long; in that case, just try again with a bigger guess until you capture the entire line (this worst case should be very rare).
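A small sketch of that guess-and-retry approach; the starting window of 300 bytes is just the guess used in the text above:

def last_line(path, guess=300):
    # return the last line of the file at path
    with open(path, "rb") as f:
        f.seek(0, 2)                  # find the file size
        size = f.tell()
        while True:
            f.seek(max(size - guess, 0))
            tail_lines = f.readlines()
            # more than one line read (or the whole file in view): the last one is complete
            if len(tail_lines) > 1 or guess >= size:
                return tail_lines[-1].decode() if tail_lines else ""
            guess *= 2                # guess was too small; retry with a bigger window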
Um, why not just seek to the end of the file and read backwards until you hit a newline?
# the file must be opened in binary mode ('rb') to seek relative to the end
i = -2                       # start just before the newline that usually ends the file
while True:
    f.seek(i, 2)             # i bytes before the end of the file
    c = f.read(1)
    if c == b'\n':
        break
    i -= 1
last = f.readline()          # everything after that newline is the last line
Not sure about a Python-specific implementation, but in a more language-agnostic fashion, what you would want to do is skip (seek) to the end of the file, and then read each character in backwards order until you reach the line feed character your file is using (usually the character with value 10, possibly preceded by a carriage return, value 13). Just read forward from that point to the end of the file, and you will have the last line of the file.