I'm using GitHub's Linguist to identify unknown source code files. Running it from the command line after a gem install github-linguist is extremely slow. I'm using Python's subprocess module to make the command-line call on a stock Ubuntu 14 installation.
Running against an empty file, linguist __init__.py takes about 2 seconds (with similar results for other files). I assumed this was entirely Ruby startup time, but as @MartinKonecny points out, it seems to be the linguist program itself.
Is there some way to speed this process up -- or a way to bundle the calls together?
One possibility is to just adapt the linguist program (https://github.com/github/linguist/blob/master/bin/linguist) to take multiple paths on the command-line. It requires mucking with a bit of Ruby, sure, but it would make it possible to pass multiple files without the startup overhead of Linguist each time.
A script this simple could suffice:
require 'linguist/file_blob'

ARGV.each do |path|
  blob = Linguist::FileBlob.new(path, Dir.pwd)
  # print out blob.name, blob.language, blob.sloc, etc., for example:
  puts "#{blob.name}\t#{blob.language}"
end
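On the Python side, you could then hand a whole batch of paths to one subprocess call, so the Ruby/Linguist startup cost is paid once per batch instead of once per file. A minimal sketch, assuming the script above is saved as batch_linguist.rb (a hypothetical name):

import subprocess

paths = ["__init__.py", "models.py", "views.py"]  # files you want classified
# one Ruby startup for the whole batch instead of one per file
output = subprocess.check_output(["ruby", "batch_linguist.rb"] + paths)
for line in output.decode("utf-8").splitlines():
    print(line)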
When I load the file with json, Python's memory usage spikes to about 1.8 GB and I can't seem to get that memory released. I put together a very simple test case:
import json

with open("test_file.json", 'r') as f:
    j = json.load(f)
I'm sorry that I can't provide a sample JSON file; my test file has a lot of sensitive information, but for context, it is on the order of 240 MB. After running the above two lines I have the previously mentioned 1.8 GB of memory in use. If I then do del j, memory usage doesn't drop at all. If I follow that with gc.collect(), it still doesn't drop. I even tried unloading the json module and running another gc.collect().
I'm trying to run some memory profiling but heapy has been churning 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu server 11.10.
I'm happy to load up any memory profiler and see if it does better than heapy, and to provide any diagnostics you might think are necessary. I'm hunting around for a large test JSON file that I can provide for anyone else to give it a go.
I think these two links address some interesting points: this is not necessarily a json issue, but rather a "large object" issue, and a question of how Python's memory management interacts with the operating system.
See Why doesn't Python release the memory when I delete a large object? for why memory released by Python is not necessarily reflected in what the operating system reports:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
On running memory-hungry work in a subprocess so that the OS handles the cleanup:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
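A minimal sketch of that pattern, assuming the only thing you need back from the parse is a small summary (here just the top-level length); the file name follows the test case above:

import json
import multiprocessing

def summarize(path, queue):
    # all of the parser's memory lives in this child process...
    with open(path, 'r') as f:
        data = json.load(f)
    # ...so only send back the small result you actually need
    queue.put(len(data))

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=summarize, args=("test_file.json", queue))
    p.start()
    result = queue.get()
    p.join()  # the child exits here and the OS reclaims all of its memory
    print(result)

When the child terminates, the 1.8 GB goes with it, regardless of what Python's allocator would have done in-process.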
When I launch a .jar from Python, I pass a filename as an argument to the .jar, for example: java -jar xx.jar -file xx.file
I just noticed that when the Java process reads xx.file, it costs about 30% CPU usage in Task Manager.
So I'm wondering: if the file could be pre-read into memory, could mmap do that?
Any suggestions for improving this? If I have more than 50 java.exe processes, the CPU usage and I/O could become a big issue for me.
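For reference, this is roughly what pre-reading a file with mmap looks like from Python (a minimal sketch). Note that it maps the file into the Python process; the java.exe processes read through their own handles, so what would actually help them is the OS page cache, which touching the mapping happens to warm:

import mmap

with open("xx.file", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # reading through the mapping pulls the file's pages into the page cache
    data = mm[:]
    mm.close()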
This does not really seem like a python question at all. But...
If you have 50 processes reading the same file, your OS will most likely cache this file for you already (if there's enough space to cache it, of course).
That should remove the problems with I/O (disk) cost.
But you say that reading results in 30% CPU usage in Task Manager. Do you know what this CPU time is really used for? Is it for reading the file? JIT-compiling your Java code? Just starting up the JVMs?
You should make sure you know exactly what the problem is before trying to solve it.
In a django project, I need to generate some pdf files for objects in db. Since each file takes a few seconds to generate, I use celery to run tasks asynchronously.
Problem is, I need to add each file to a zip archive. I was planning to use the python zipfile module, but different tasks can be run in different threads, and I wonder what will happen if two tasks try to add a file to the archive at the same time.
Is the following code thread-safe or not? I cannot find any useful information in Python's official docs.
import os
from zipfile import ZipFile

try:
    zippath = os.path.join(pdf_directory, 'archive.zip')
    zipfile = ZipFile(zippath, 'a')
    zipfile.write(pdf_fullname)
finally:
    zipfile.close()
Note: this is running under python 2.6
No, it is not thread-safe in that sense.
If you're appending to the same zip file, you'd need a lock there, or the file contents could get scrambled.
If you're appending to different zip files, using separate ZipFile() objects, then you're fine.
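A minimal sketch of the lock approach, assuming all tasks run as threads in one process and append to the same archive (names such as pdf_directory and pdf_fullname follow the question):

import os
import threading
from zipfile import ZipFile

zip_lock = threading.Lock()

def add_to_archive(pdf_directory, pdf_fullname):
    zippath = os.path.join(pdf_directory, 'archive.zip')
    with zip_lock:  # only one thread may touch the shared archive at a time
        archive = ZipFile(zippath, 'a')
        try:
            archive.write(pdf_fullname)
        finally:
            archive.close()

If your celery workers run as separate processes rather than threads, a threading.Lock won't protect the file; in that case use file-level locking or have a single dedicated task build the archive.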
Python 3.5.5 makes writing to ZipFile and reading multiple ZipExtFiles threadsafe: https://docs.python.org/3.5/whatsnew/changelog.html#id93
As far as I can tell, the change has not been backported to Python 2.7.
Update: after studying the code and some testing, it becomes apparent that the locking is still not thoroughly implemented. It correctly works only for writestr and doesn't work for open and write.
While this question is old, it's still high up in Google results, so I just want to chime in and say that I noticed that on Python 3.4, 64-bit, on Windows, the LZMA zipfile is thread-safe; all other compression types fail.
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as zip:
    # do stuff in threads
Note that you can't bind the same file to multiple zipfile.ZipFile instances; instead, you have to use the same one in all threads. Here that is the variable named zip.
In my case I get about 80-90% CPU use on 8 cores and SSD, which is nice.
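For illustration, here is a sketch of the pattern this answer describes: one shared ZipFile used by every worker thread. Whether this is actually safe depends on your Python version and compression type, as discussed in the answers above, so treat it as an experiment rather than a guarantee:

import threading
import zipfile

def worker(archive, name, payload):
    # writestr is the call the shared-instance pattern relies on
    archive.writestr(name, payload)

with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_LZMA) as archive:
    threads = [
        threading.Thread(target=worker, args=(archive, "file%d.txt" % i, b"x" * 1000))
        for i in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()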
I'm looking for a high performance method or library for scanning all files on disk or in a given directory and grabbing their basic stats - filename, size, and modification date.
I've written a python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.
Any ideas for speeding up the file scan? Here's my Python code, but keep in mind that I'm not at all married to os.walk and am perfectly willing to use other APIs (including Windows-native APIs) if there are better alternatives.
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        ...
I should also note I realize the python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.
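For reference, the elided loop body above amounts to something like the following sketch; os.stat is assumed here so that size and modification date come from a single call per file rather than separate getsize/getmtime calls:

import os

def scan(top):
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files:
            full = os.path.join(root, name)
            st = os.stat(full)  # one system call for both size and mtime
            yield full, st.st_size, st.st_mtime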
Well, I would expect this to be a heavily I/O-bound task.
As such, optimizations on the Python side would be quite ineffective; the only optimization I can think of is some different way of accessing/listing files, in order to reduce the actual reads from the file system.
That of course requires a deep knowledge of the file system, which I do not have, and which I do not expect Python's developers to have had when implementing os.walk.
What about spawning a command prompt and then issuing 'dir' and parsing the results?
It could be a bit of an overkill, but with any luck, 'dir' is making some effort at such optimizations.
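A rough sketch of that idea, assuming Windows; /s recurses, /b prints bare full paths, and /a-d skips directories. Note that in bare mode 'dir' gives you only the names, so you would still have to stat each file for size and date:

import subprocess

out = subprocess.check_output('dir /s /b /a-d "C:\\some\\path"', shell=True)
paths = out.decode("mbcs", "replace").splitlines()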
It seems as if os.walk has been considerably improved in python 2.5, so you might check if you're running that version.
Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not by a margin that would actually justify using it.
You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").
Use the scandir Python module (formerly betterwalk) by Ben Hoyt, on GitHub:
http://github.com/benhoyt/scandir
It is much faster than os.walk, but uses the same syntax. Just import scandir and change os.walk() to scandir.walk(). That's it. It is the fastest way to traverse directories and files in Python.
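If you want the stats as well as the names, scandir's DirEntry objects expose them directly, which is where most of the speedup comes from on Windows. A small sketch, assuming the module is installed via pip install scandir:

import scandir

def walk_stats(top):
    for entry in scandir.scandir(top):
        if entry.is_dir():
            for item in walk_stats(entry.path):
                yield item
        else:
            st = entry.stat()  # on Windows this reuses data from the directory scan
            yield entry.path, st.st_size, st.st_mtime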
When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.
For example, the following is only a hair faster than os.walk.
import os
import stat

listdir = os.listdir
pathjoin = os.path.join
fstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = fstat(nm).st_mode
        if is_dir(s):
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield f
        else:
            pass  # ignore these
Consequently, the overheads must be in the os module itself. You'll have to resort to making direct Windows API calls.
Look at the Python for Windows Extensions.
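For what it's worth, a hedged sketch of what that could look like with pywin32: win32api.FindFiles wraps FindFirstFile/FindNextFile, so each directory entry already carries its attributes, size and timestamps. The tuple layout below follows the WIN32_FIND_DATA structure; verify it against the pywin32 docs before relying on it:

import os
import win32api
import win32con

def scan(directory):
    for data in win32api.FindFiles(os.path.join(directory, "*")):
        # WIN32_FIND_DATA tuple: attributes, create/access/write times,
        # size high/low, two reserved fields, file name, alternate name
        attrs, write_time = data[0], data[3]
        size = (data[4] << 32) + data[5]
        name = data[8]
        if name in (".", ".."):
            continue
        full = os.path.join(directory, name)
        if attrs & win32con.FILE_ATTRIBUTE_DIRECTORY:
            for item in scan(full):
                yield item
        else:
            yield full, size, write_time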
I'm wondering if you might want to group your I/O operations.
For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.
If your OS stores these two kinds of data in different locations (directory structure in one place, file stats in another), then this might be a significant optimization.
Anyway, that's something I'd try before digging further.
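A minimal sketch of that two-phase idea, using top for the root directory as in the question:

import os

# phase 1: walk the tree and collect paths only
paths = []
for root, dirs, files in os.walk(top):
    for name in files:
        paths.append(os.path.join(root, name))

# phase 2: stat everything in one tight loop
stats = [(p, os.stat(p)) for p in paths]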
Python 3.5 just introduced os.scandir (see PEP 471), which avoids a number of unnecessary system calls such as stat() and GetFileAttributes() to provide a significantly quicker file-system iterator.
os.walk() will now be implemented using os.scandir() as its iterator, and so you should see potentially large performance improvements whilst continuing to use os.walk().
Example usage:
for entry in os.scandir(path):
    if not entry.name.startswith('.') and entry.is_file():
        print(entry.name)
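If you also need size and modification date, DirEntry.stat() provides them. On Windows the data is gathered during the directory scan itself, so no extra system call is needed; on Linux is_file() is answered from the scan and only stat() costs a call per file:

for entry in os.scandir(path):
    if entry.is_file():
        info = entry.stat()
        print(entry.name, info.st_size, info.st_mtime)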
The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.
This is only partial help, more like pointers; however:
I believe you need to do the following:
fp = open("C:/$MFT", "rb")
using an account that has SYSTEM permissions, because even as an admin you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. After you succeed in that, you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname), and then you can read the disk structure at very high speed.
I would suggest using folderstats for creating statistics from a folder structure; I have tested it on folder/file structures of up to 400k files and folders.
It is as simple as:
import folderstats
import pandas as pd

path5 = '.'  # the folder you want to analyse
df = folderstats.folderstats(path5, ignore_hidden=True)
df.head()
df.shape
The output will be a DataFrame; see the example below:
path                   | name             | extension | size   | atime               | mtime               | ctime               | folder | num_files | depth | uid  | md5
./folder_structure.png | folder_structure | png       | 525239 | 2022-01-10 16:08:32 | 2020-11-22 19:38:03 | 2020-11-22 19:38:03 | False  |           | 0     | 1000 | a3cac43de8dd5fc33d7bede1bb1849de
./requirements-dev.txt | requirements-dev | txt       | 33     | 2022-01-10 14:14:50 | 2022-01-08 17:54:50 | 2022-01-08 17:54:50 | False  |           | 0     | 1000 | 42c7e7d9bc4620c2c7a12e6bbf8120bb