Using fopen in Raspberry pi4 - python

In Raspberry Pi 4, I am trying to read a series of files (more than 1000 files) in a specific directory with the fopen function in a for loop, but fopen cannot read the file if it exceeds a certain number of iterations. How do I solve this?

but fopen cannot read the file if it exceeds a certain number of iterations.
A wild guess: you neglect to fclose the files after you are done with them, leading to eventual exhaustion of either memory or available file descriptors.
How do I solve this?
Make sure to fclose your files.

When you open a file with fopen, the system allocates a file descriptor that points to that file, and there are only so many of them available. A quick Google search says Microsoft's C runtime usually allows 512 open streams by default. Each open file also consumes some memory for buffering, so keeping a lot of files open will use up your memory fast. You should close each file as soon as you are done with it. This usually is not a problem when working with a few files, but in a case like yours, where thousands of files are needed, they should be closed as soon as possible after use.
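As an illustration, here is a minimal sketch in Python, assuming the files are opened with the built-in open (the directory and pattern below are placeholders). The with statement guarantees each file is closed before the next iteration, so the loop never accumulates open descriptors. If you are actually calling the C fopen through something like ctypes, the same rule applies: pair every fopen with an fclose.

import glob

# Hypothetical directory and pattern; replace with your actual files.
for path in glob.glob('/home/pi/data/*.dat'):
    with open(path, 'rb') as f:   # closed automatically when the block ends
        data = f.read()
    # ... process data here; the file is already closed at this point ...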

Related

Python: Can I read a large text file from different scripts?

I have a large text file stored in a shared directory on a server that several other machines have access to. I'm running various analyses on this text file without changing or updating it. I'd like to know whether I can run different Python scripts on different machines, with all of them reading that large text file. None of the scripts makes any change to the file; they just need to read it.
You should be able to do multiple read accesses, but it might get really, really slow compared to several scripts reading the same file on the same computer (obviously, the degree will very much depend on how much reading you are doing). You may want to copy the file over before processing.
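If the repeated network reads do become the bottleneck, a minimal sketch of the copy-first idea (the paths below are hypothetical) might look like this:

import shutil

remote_path = '/mnt/shared/big_corpus.txt'   # hypothetical path on the network share
local_path = '/tmp/big_corpus.txt'           # hypothetical local scratch copy

shutil.copyfile(remote_path, local_path)     # one sequential transfer over the network

with open(local_path, 'r') as f:
    for line in f:
        pass  # ... run the read-only analysis on each line here ...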

Read entire physical file including file slack with python?

Is there a simple way to read all the allocated clusters of a given file with python? The usual python read() seemingly only allows me to read up to the logical size of the file (which is reasonable, of course), but I want to read all the clusters including slack space.
For example, I have a file called "test.bin" that is 1234 bytes in logical size, but because my file system uses clusters of 4096 bytes, the file has a physical size of 4096 bytes on disk. I.e., there are 2862 bytes of file slack space.
I'm not sure where to even start with this problem... I know I can read the raw disk from /dev/sda, but I'm not sure how to locate the clusters of interest... of course this is the whole point of having a file-system (to match up names of files to sectors on disk), but I don't know enough about how python interacts with the file-system to figure this out... yet... any help or pointers to references would be greatly appreciated.
Assuming an ext2/3/4 filesystem, as you guessed yourself, your best bet is probably to:
use a wrapper (like this one) around debugfs to get the list of blocks associated with a given file:
debugfs: blocks ./f.txt
2562
then read back that block (or those blocks) from the block device / image file:
>>> f = open('/tmp/test.img','rb')
>>> f.seek(2562*4*1024)
10493952
>>> bytes = f.read(4*1024)
>>> bytes
b'Some data\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...'
Not very fancy, but that will work. Please note you don't have to mount the FS to do any of these steps. This is especially important for forensic applications, where you cannot trust the content of the disk in any way and/or are not allowed per regulation to mount the disk image.
There is an open-source forensic tool written in C that implements this kind of file access successfully.
Here is an overview of the tool Link.
You can download it here.
It basically uses the POSIX system call open(), which returns a raw file descriptor (an integer) that you can use with the POSIX system calls read() and write() without the restriction of stopping at EOF, which is what keeps you from accessing the file slack.
There are lots of examples online of how to make such a system call from Python, e.g. this one.
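To sketch what that looks like from Python, the os module exposes thin wrappers around the same POSIX calls. The image path and block number below are taken from the debugfs example above and would need adjusting (and root privileges) on a real block device:

import os

BLOCK_SIZE = 4 * 1024      # filesystem block size
block_no = 2562            # block number reported by debugfs

fd = os.open('/tmp/test.img', os.O_RDONLY)   # raw integer file descriptor
try:
    os.lseek(fd, block_no * BLOCK_SIZE, os.SEEK_SET)
    raw = os.read(fd, BLOCK_SIZE)            # whole block, slack bytes included
finally:
    os.close(fd)

print(len(raw), raw[:16])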

Python - HardDrive access when opening files

If you open a file for reading, read from it, close it, and then repeat the process (in a loop), does python continually access the hard drive? Because it doesn't seem to, from my experimentation, and I'd like to understand why.
An (extremely) simple example:
import time

while True:
    file = open('/var/log/messages', 'r')
    stuff = file.read()
    file.close()
    time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.
The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.
Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.
Python does not cache, the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43 MB are being used by the OS to cache data recently written to or read from the hard disk. You can drop these caches (so that the next read has to go to the disk again) by writing 2 or 3 to /proc/sys/vm/drop_caches.
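To see this in action, a minimal timing sketch along these lines (the path is just an example; any large, rarely-changing file will do) typically shows the first read taking noticeably longer than the later ones, because only the first read has to touch the disk:

import time

path = '/var/log/messages'  # example file; substitute any large, unchanging file

for i in range(3):
    start = time.perf_counter()
    with open(path, 'rb') as f:
        f.read()
    elapsed = time.perf_counter() - start
    print('read %d took %.4f seconds' % (i, elapsed))
    time.sleep(2)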

Read/Write CSV as binary with Python

I just found out that I can save space and speed up reads of CSV files,
using the answer to my previous question
How do I create a CSV file from database in Python?
and 'wb' for the open calls:
w = csv.writer(open(Fn,'wb'),dialect='excel')
How can I open all the files in a directory and save each one back under its original name, using 'wb' to rewrite them all? I guess I want to convert all the CSVs to binary CSVs.
You can't "overwrite a file on the fly". You have two options:
- If the files are small enough (smaller than the amount of available RAM by a comfortable margin), just loop over them (os.listdir makes that loop easy, or os.walk if you want to catch the whole tree of subdirectories, not just one directory), and for each file, read it into memory first, then overwrite the on-disk copy.
- Otherwise, loop over them, and each time write to a new file (e.g. by appending .new to the name), then move the new file over the old one. This is safer (no risk of running out of memory, no risk of damaging a file if the computer crashes) but more complicated (see the sketch after this answer).
So, what is your situation: small-enough files (and backups as a safeguard against computer and disk crashes), in which case I can, if you wish, show the simple code; or huge multi-GB files -- in which case it will have to be the complex code? Let us know!
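In the meantime, here is a rough sketch of the second (safer) option, assuming a placeholder directory and that each CSV is simply read and rewritten via a temporary .new file (on Python 3, opening with newline='' stands in for the old 'wb' idiom from the question):

import csv
import os

directory = '/path/to/csvs'   # hypothetical directory

for name in os.listdir(directory):
    if not name.endswith('.csv'):
        continue
    src = os.path.join(directory, name)
    tmp = src + '.new'
    # Read the rows and write them to a temporary file in the same directory.
    with open(src, newline='') as fin, open(tmp, 'w', newline='') as fout:
        writer = csv.writer(fout, dialect='excel')
        for row in csv.reader(fin):
            writer.writerow(row)
    os.replace(tmp, src)      # move the new file over the old one

Because os.replace only runs after the new file has been fully written, a crash mid-loop leaves every original file intact (at worst a stray .new file remains).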

Having several file pointers open simultaneously alright?

I'm reading from certain offsets in several hundred and possibly thousands of files. Because I need only certain data from certain offsets at that particular time, I must either keep the file handles open for later use OR write the parts I need into separate files.
I figured keeping all these file handles open, rather than writing a significant amount of new temporary files to disk, is the lesser of two evils. I was just worried about the efficiency of having so many file handles open.
Typically, I'll open a file, seek to an offset, read some data, then 5 seconds later do the same thing but at another offset, and do all this on thousands of files within a 2 minute timeframe.
Is that going to be a problem?
A follow-up: really, I'm asking which is better: to leave these thousands of file handles open, or to constantly close them and re-open them just when I need them.
Some systems may limit the number of file descriptors that a single process can have open simultaneously. 1024 is a common default, so if you need "thousands" open at once, you might want to err on the side of portability and design your application to use a smaller pool of open file descriptors.
I recommend that you take a look at Storage.py in BitTorrent. It includes an implementation of a pool of file handles.
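For a rough idea of what such a pool looks like, here is a minimal LRU-style sketch (the class name and default pool size are made up for illustration; Storage.py handles writes and errors more carefully):

import collections

class FileHandlePool:
    """Keeps at most max_open files open, closing the least recently used."""

    def __init__(self, max_open=256):
        self.max_open = max_open
        self._open = collections.OrderedDict()   # path -> open file object

    def read_at(self, path, offset, size):
        f = self._open.pop(path, None)
        if f is None:
            if len(self._open) >= self.max_open:
                _, oldest = self._open.popitem(last=False)
                oldest.close()                   # evict the least recently used handle
            f = open(path, 'rb')
        self._open[path] = f                     # re-insert as most recently used
        f.seek(offset)
        return f.read(size)

    def close_all(self):
        for f in self._open.values():
            f.close()
        self._open.clear()

Usage would be something like pool = FileHandlePool() followed by pool.read_at('somefile.bin', 4096, 128); files evicted from the pool are simply reopened the next time they are needed.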
