This question already has answers here:
How did Python read this binary faster the second time?
(2 answers)
Closed 1 year ago.
This is something I came across while working on a project and I'm kind of confused. I have a .txt file with ~15,000 lines. When I run the program once, it takes around 4-5 seconds to go through all the lines. I then added a while True before opening the file and called file.close() at the end, so that the program continuously opens the file, goes through all the lines, and then closes it.
After the first run, I noticed that each pass takes only around 1 second to complete. I made sure to close the file afterwards, so what might be causing it to be so much faster?
It's called "file caching" or "warming the cache". All of the major operating systems allocate a goodly portion of your RAM to a file cache. When you read a file, those buffers are retained for a while instead of being released right away. If you read the same file again, it can often pull the data from RAM instead of going to disk.
This question already has answers here:
Locking a file in Python
(15 answers)
Closed 3 years ago.
This is not a duplicate of Locking a file in Python
I have two scripts: one runs every 30 minutes, the other runs every minute.
They both use the same file to do a few things.
At some point, every 30 minutes, they try to access the file at the same time and corrupt it.
I was thinking about making one script wait, but they are two independent scripts and I am not sure if this is possible.
Any idea?
I thought about using
with FileLock("document.txt")
The problem that arises is: if script-1 acquires the lock for "document.txt" and then script-2 wants to access document.txt, will it wait for script-1 to finish, or will it skip that line of code? Skipping isn't an option for me.
Also, once the lock is acquired, how do I release it when it's no longer needed?
One of the simplest ways to get this done (in case you have write access to the file's directory) is to create an additional file (like filename.lck) to signal that some script is working on that file. Obviously, once a script has finished working with the file, that lock-file needs to be removed.
But honestly, I would be very surprised if such a locking mechanism isn't already available in Python. How exactly do you open and close the mentioned file? Maybe some parameter already takes care of the locking.
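For the lock-file idea above, a minimal sketch (the file names are placeholders): create the lock atomically so two scripts can never both grab it, wait and retry if it already exists, and always remove it in a finally block.
import os
import time

LOCK_PATH = "document.txt.lck"  # placeholder lock-file name

def acquire_lock(timeout=60):
    # O_CREAT | O_EXCL fails if the file already exists, so creation is atomic.
    deadline = time.time() + timeout
    while True:
        try:
            fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.time() > deadline:
                raise TimeoutError("could not acquire lock")
            time.sleep(1)  # the other script still holds the lock; wait and retry

def release_lock():
    os.remove(LOCK_PATH)

acquire_lock()
try:
    with open("document.txt", "a") as f:  # do the actual work on the shared file
        f.write("some data\n")
finally:
    release_lock()
With this scheme the second script blocks inside acquire_lock() until the first one removes the lock file, and the lock is released as soon as the work is done.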
This question already has answers here:
Find free disk space in python on OS/X
(7 answers)
Closed 3 years ago.
I have a script that is going to download a lot of data from the internet, but I have no idea how long it will take or how big the data will be.
To be more precise, I want to analyze some live videos, and for that I will download the content using youtube-dl. Since I want to leave it running for a week or two, is there a way to avoid running out of space, i.e. can the script check at a specific interval how much free space is left and stop execution if it drops below a certain value?
Thanks
You can use shutil.disk_usage(path). From the docs:
shutil.disk_usage(path)
Return disk usage statistics about the given path as a named tuple with the attributes total, used and free, which are the amount of
total, used and free space, in bytes. On Windows, path must be a
directory; on Unix, it can be a file or directory.
Use shutil.disk_usage.
total, used, free = shutil.disk_usage("/")
So I'm running multiple functions; each function takes a section out of a million-line .txt file, and each function has a for loop that runs through every line in its section.
It takes info from those lines to see if it matches info in 2 other files, one about 50,000-100,000 lines long, the other about 500-1,000 lines long. I check whether the lines match by running for loops through the other 2 files. Once the info matches, I write the output to a new file; all functions write to the same file. The program produces about 2,500 lines a minute, but slows down the longer it runs. Also, when I run one of the functions on its own it manages about 500 lines a minute, but when I run it alongside 23 other processes the total is only 2,500 a minute. Why is that?
Does anyone know why that would happen? Is there anything I could import to make the program run/read through the files faster? I am already using the with open(...) as file1: method.
Can the multi-processes be redone to run faster?
Threads can only use the resources you have: 4 cores = 4 threads with full resources. There are a few cases where having more threads can improve performance, but this is not one of them, so keep the thread count at the number of cores you have.
Also, because you have concurrent access to a single output file, you need a lock on that file, which slows the process down a bit.
What could be improved, however, is the code you use to compare the strings, but that is another question.
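A minimal sketch of the process-count advice above (the file names and the matching logic are placeholders): cap the workers at the number of cores and serialize writes to the shared output file with a lock.
import multiprocessing as mp

def process_section(args):
    section_lines, lock, out_path = args
    for line in section_lines:
        result = line if "needle" in line else None  # placeholder for the real matching logic
        if result is not None:
            with lock:                               # serialize writes to the shared output file
                with open(out_path, "a") as out:
                    out.write(result)

if __name__ == "__main__":
    with open("big_file.txt") as f:                  # placeholder input file
        lines = f.readlines()

    workers = mp.cpu_count()                         # one worker per core, no more
    chunk = len(lines) // workers + 1
    sections = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]

    manager = mp.Manager()
    lock = manager.Lock()                            # a lock that can be shared across processes
    with mp.Pool(processes=workers) as pool:
        pool.map(process_section, [(s, lock, "output.txt") for s in sections])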
This question already has answers here:
Is explicitly closing files important?
(7 answers)
Is close() necessary when using iterator on a Python file object [duplicate]
(8 answers)
Closed 8 years ago.
Usually when I open files I never call the close() method, and nothing bad happens. But I've been told this is bad practice. Why is that?
For the most part, not closing files is a bad idea, for the following reasons:
It puts your program in the garbage collector's hands. Though the file will in theory be auto-closed, it may not be closed promptly. Python 3 and CPython generally do a pretty good job at garbage collecting, but not always, and other implementations generally suck at it.
It can slow down your program. Too many things open, and thus more used space in RAM, will impact performance.
For the most part, many changes to files in Python do not take effect until after the file is closed, so if your script edits a file, leaves it open, and then reads it, it won't see the edits.
You could, theoretically, run into limits on how many files you can have open.
As #sai stated below, Windows treats open files as locked, so legit things like AV scanners or other Python scripts can't read the file.
It is sloppy programming (then again, I'm not exactly the best at remembering to close files myself!)
Found some good answers:
(1) It is a matter of good programming practice. If you don't close
them yourself, Python will eventually close them for you. In some
versions of Python, that might be the instant they are no longer
being used; in others, it might not happen for a long time. Under
some circumstances, it might not happen at all.
(2) When writing to a file, the data may not be written to disk until
the file is closed. When you say "output.write(...)", the data is
often cached in memory and doesn't hit the hard drive until the file
is closed. The longer you keep the file open, the greater the
chance that you will lose data.
(3) Since your operating system has strict limits on how many file
handles can be kept open at any one instant, it is best to get into
the habit of closing them when they aren't needed and not wait for
"maid service" to clean up after you.
(4) Also, some operating systems (Windows, in particular) treat open
files as locked and private. While you have a file open, no other
program can also open it, even just to read the data. This spoils
backup programs, anti-virus scanners, etc.
http://python.6.x6.nabble.com/Tutor-Why-do-you-have-to-close-files-td4341928.html
https://docs.python.org/2/tutorial/inputoutput.html
Open files use resources and may be locked, preventing other programs from using them. Anyway, it is good practice to use with when reading files, as it takes care of closing the file for you.
with open('file', 'r') as f:
    read_data = f.read()
Here's an example of something "bad" that might happen if you leave a file open.
Open a file for writing in your Python interpreter, write a string to it, then open that file in a text editor. On my system, the file will be empty until I close the file handle.
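A minimal sketch of that experiment (demo.txt is a placeholder name):
f = open("demo.txt", "w")   # placeholder file name
f.write("hello")            # the text usually sits in a buffer, not yet on disk
# At this point, opening demo.txt in a text editor often shows an empty file.
f.close()                   # close() flushes the buffer; now the text is on disk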
The close() method of a file object flushes any unwritten information and closes the file object, after which no more writing can be done.
Python automatically closes a file when the file object's reference is reassigned to another file. It is good practice to use the close() method to close a file. I hope this helps.
You only have to call close() when you're writing to a file.
Python automatically closes files most of the time, but sometimes it won't, so you want to call it manually just in case.
I had a problem with that recently:
I was writing some stuff to a file in a for loop, but when I interrupted the script with ^C, a lot of data that should have been written to the file wasn't there. It looked like Python stopped writing for no reason, but the data was still sitting in the unflushed write buffer. I had opened the file before the for loop, so I changed the code so that Python opens and closes the file for every single pass of the loop.
Basically, if you write stuff just for yourself and you don't have any issues, it's fine; if you write stuff for more people than just yourself, put a close() in the code, because someone could randomly run into an error and you should try to prevent that.
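A minimal sketch of a safer pattern for that loop (log.txt is a placeholder name): let a with block close the file even if the script is interrupted, and flush after each write so a ^C loses at most the last line.
with open("log.txt", "a") as f:   # placeholder file name
    for item in range(1000):
        f.write(f"{item}\n")
        f.flush()                 # push each line to the OS so an interrupt loses almost nothing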
This question already has answers here:
In-memory size of a Python structure
(7 answers)
Closed 8 years ago.
So clearly there cannot be unlimited memory in Python. I am writing a script that creates lists of dictionaries. But each list has between 250K and 6M objects (there are 6 lists).
Is there a way to actually calculate (possibly based on the RAM of the machine) the maximum memory and the memory required to run the script?
The actual issue I came across:
When running one of the scripts that populates a list with 1.25-1.5 million dictionaries, once it hits 1.227... it simply stops, but returns no error, let alone a MemoryError. So I am not even sure this is a memory limit. I have print statements so I can watch what is going on, and it seems to hang forever: nothing prints, and up until that specific point the code was running at a couple thousand lines per second. Any ideas as to what is making it stop? Is this memory or something else?
If you have that many objects to store, you need to store them on disk, not in memory. Consider using a database.
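A minimal sketch of that idea using the standard-library sqlite3 module (the database file, table, and column names are placeholders): insert the dictionaries as rows instead of holding millions of them in a Python list.
import sqlite3

conn = sqlite3.connect("records.db")   # placeholder database file
conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, value REAL)")

# A generator of dictionaries stands in for the data you would otherwise keep in a list.
rows = ({"name": f"item{i}", "value": i * 0.5} for i in range(1_000_000))
conn.executemany(
    "INSERT INTO items (name, value) VALUES (:name, :value)",
    rows,                              # dictionary keys map onto the named parameters
)
conn.commit()
conn.close()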
If you import the sys module, you will have access to the function sys.getsizeof(). You will have to look at each object in the list and, for each dictionary, compute the value for every key. For more on this see this previous question - In-memory size of a Python structure.
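A rough sketch of that approach (it only approximates the real footprint, since sys.getsizeof does not follow nested references):
import sys

records = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]  # placeholder data

total = sys.getsizeof(records)                    # the list object itself
for d in records:
    total += sys.getsizeof(d)                     # each dictionary
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)   # keys and values

print(f"approximate size: {total} bytes")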