Short description of the problem:
I have some directories (dir_1, ..., dir_N) and want to merge them into a new directory (dir_X), but without copying all the files from those directories (that would be a waste of disk space). All directories are on the same physical disk. Because files in dir_1, ..., dir_N can have the same names, I also need to give them new names in dir_X. When I walk through the files in dir_X, all those linked files should be usable/accessible like normal files.
I have read a little bit about symbolic and hard links but don't know what's best to use. If I understand it right, hard links are something like shared pointers to space on disk and symlinks are something like normal pointers, speaking in C++ terms ;-) So I suppose hard links best meet my requirements, right? How can I create them with Python for this use case?
Thanks for help!
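If hard links are acceptable, a minimal sketch could look like the following; the renaming scheme (prefixing each link with its source directory's name) and the helper name are just illustrative assumptions:
import os

def merge_with_hardlinks(source_dirs, dest_dir):
    """Hard-link every regular file from the source directories into dest_dir,
    prefixing each new name with its source directory to avoid collisions."""
    if not os.path.isdir(dest_dir):
        os.mkdir(dest_dir)
    for src in source_dirs:
        prefix = os.path.basename(src.rstrip(os.sep))
        for name in os.listdir(src):
            src_path = os.path.join(src, name)
            if os.path.isfile(src_path):  # hard links only work for regular files
                os.link(src_path, os.path.join(dest_dir, "%s_%s" % (prefix, name)))

# merge_with_hardlinks(["dir_1", "dir_2"], "dir_X")
Hard links require source and destination to be on the same filesystem (being on the same physical disk is not quite enough if it holds several partitions), and they cannot point at directories; if either restriction bites, os.symlink is the fallback, at the cost of the links breaking when the originals move.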
I'm mostly done with the functionality part of a Python program I'm writing, and now I have to take care of the performance side. The part I want to optimize first is the search for directories meeting certain criteria.
In detail:
I'm searching in a certain directory and all subdirectories for folders that contain specific subfolders, because when they do, they are most certainly the folders I'm looking for.
For example, a folder containing the subfolders "wp-content", "wp-admin" and "wp-includes" is most certainly the WordPress installation folder. The chance this is not the case is pretty slim, so I can live with that.
Now the thing is that (in this case) the WordPress installation folder can be anywhere, the only thing I know for sure is that it is somewhere in the html/ directory.
Of course the html/ directory can contain hundreds of thousands of files and subdirectories. If this is the case, this part of my program quickly becomes its bottleneck, so I need to make sure to search in the most efficient way.
What I've originally been doing:
1) Recursively search all directories and subdirectories
2) Configure the maximum search depth
3) Use os.path.isdir(...) to check whether (in this case) "wp-content" etc. exist in the current folder.
So the only limitation here is the search depth. I've also thought about:
Excluding directories from the search when they contain more than XXX files
Problem: counting the files with len(os.listdir()) may still slow the program down
Getting the entire folder tree structure in one go before searching
I'm not really sure this is faster
Excluding directories with certain names that most likely belong to unrelated applications
But first I want to know if there's something on the Python side I can do to improve the performance of my search. It's easy to increase performance when file reading is involved, but in this case I'm not really reading anything, just checking whether directories with a certain path exist. I'm not sure how efficiently Python's os.path.isdir(path) works. So... are there better ways?
Have you looked at the os.walk option? It does the recursion and directory-tree walking for you...
https://docs.python.org/2/library/os.html
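A minimal sketch of that idea, combining os.walk with a depth limit and the wp-content/wp-admin/wp-includes markers from the question (the max_depth value and function name are illustrative):
import os

MARKERS = {"wp-content", "wp-admin", "wp-includes"}

def find_wordpress_dirs(root, max_depth=6):
    """Walk root, pruning anything deeper than max_depth, and yield
    directories whose immediate subdirectories include all MARKERS."""
    root = root.rstrip(os.sep)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        if MARKERS.issubset(dirnames):
            yield dirpath
            dirnames[:] = []  # no need to look inside a match
            continue
        if dirpath.count(os.sep) - base_depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further

# for path in find_wordpress_dirs("/var/www/html"):
#     print(path)
Checking the dirnames list that os.walk already hands you avoids extra os.path.isdir() calls, and clearing dirnames in place is the standard way to prune whole subtrees without visiting them.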
You can make use of a regex together with os.walk, as in the snippet below.
This will find all files starting with two digits and ending in gif; you can append the matches to a global list if you wish:
import re
import os

r = re.compile(r'\d{2}.+gif$')  # file names starting with two digits and ending in "gif"
for root, dirs, files in os.walk('/home/vinko'):
    l = [os.path.join(root, x) for x in files if r.match(x)]
    if l:
        print(l)  # or append to a global list, whatever
Is it possible for the user to input a specific directory on their computer, and for Python to write all of the file/folder names to a text document? And if so, how?
The reason I ask is because, if you haven't seen my previous posts, I'm learning Python, and I want to write a little program that tracks how many files are added to a specific folder on a daily basis. This program isn't like a project of mine or anything, but I'm writing it to help me further learn Python. The more I write, the more I seem to learn.
I don't need any help beyond the original question so far, but any help will be appreciated!
Thank you!
EDIT: If there's a way to do this with "pickle", that'd be great! Thanks again!
os.walk() is what you'll be looking for. Keep in mind that this doesn't change the current directory your script is executing from; it merely iterates over all possible paths that stem from the arguments to it.
for dirpath, dirnames, files in os.walk(os.path.abspath(os.curdir)):
    print(files)
    # insert long (or short) list of files here
You probably want to be using os.walk; it will help you create a listing of the files under your directory.
Once you have the listing, you can indeed store it in a file using pickle, but that's a separate problem (i.e. pickle is not about generating the listing but about storing it).
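For the original question, a small sketch along those lines might look like this; the listing.txt and listing.pickle file names are hypothetical choices for illustration:
import os
import pickle

def list_paths(directory):
    """Collect every file and folder path below directory."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def save_and_compare(directory):
    """Write today's listing to a text file, pickle it, and report
    how many entries are new compared with the previous run."""
    current = list_paths(directory)
    with open("listing.txt", "w") as f:
        f.write("\n".join(current))
    previous = []
    if os.path.exists("listing.pickle"):
        with open("listing.pickle", "rb") as f:
            previous = pickle.load(f)
    with open("listing.pickle", "wb") as f:
        pickle.dump(current, f)
    print("%d new entries since last run" % len(set(current) - set(previous)))

# save_and_compare("/path/to/watched/folder")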
While the approach of @Makato is certainly right, for your 'diff'-like application you want to capture the stat() information (inode data) of the files in your directory and pickle that Python object from day to day, looking for updates; this is one way to do it - maybe overkill - but more suitable than saving to and re-parsing text files, IMO.
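A rough sketch of that stat()-based snapshot idea, where the snapshot file name and the (size, mtime) fields being compared are assumptions made for illustration:
import os
import pickle

def stat_snapshot(directory):
    """Map every file path under directory to its (size, mtime) from os.stat()."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between the walk and the stat call
            snapshot[path] = (st.st_size, st.st_mtime)
    return snapshot

def diff_since_last_run(directory, snapshot_file="stat_snapshot.pickle"):
    """Compare today's snapshot with the pickled previous one and return the changes."""
    current = stat_snapshot(directory)
    old = {}
    if os.path.exists(snapshot_file):
        with open(snapshot_file, "rb") as f:
            old = pickle.load(f)
    with open(snapshot_file, "wb") as f:
        pickle.dump(current, f)
    added = set(current) - set(old)
    removed = set(old) - set(current)
    changed = {p for p in set(current) & set(old) if current[p] != old[p]}
    return added, removed, changed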
In short:
I need to split a single file (or several files) into multiple max-sized archives in a dummy-safe format (e.g. zip or rar - anything that works will do!).
I would love to know when a certain part is done (a callback?) so I could start shipping it away.
I would rather not do it with the rar or zip command-line utilities unless it's impossible otherwise.
I'm trying to make it OS-independent for the future, but right now I can live with the compression only working on Linux (my main PC); I still need the result to open easily on Windows (my wife's PC).
In long:
I'm writing a hopefully-to-be-awesome backup utility that scans my pictures folder, zips each folder and uploads the archives to whatever uploading class is registered (be it mail-sending, FTP uploading, HTTP uploading).
I used zipfile to create a gigantic archive for each folder, but since my upload speed is really bad I only let it run at night, and my internet connection occasionally drops, which messes the whole thing up. So I decided to split the archives into ~10 MB pieces. I found no way of doing that with zipfile, so I just added files to the zip until it exceeded 10 MB.
The problem is that there are often 200-300 MB (and sometimes larger) videos in there, and again we hit the middle-of-the-night cutoffs.
I am using subprocess with "rar" right now to create the split archives, but since the directories are so big and I'm using heavy compression this takes ages, even though the first volumes are already finished - this is why I'd love to know when a part is ready to be sent.
So, to make a short story long: I need a good way to split the data into max-sized archives.
I'm looking at making it somewhat generic and as dummy-proof as possible, as eventually I'm planning on turning it into some awesome extensible backup library.
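One possible sketch under those constraints, using only the standard zipfile module: files are packed into numbered zip parts, a new part is started once the current one has taken in roughly the size limit, and a hypothetical part_ready callback fires each time a part is closed. The part naming, the 10 MB limit and the callback are all illustrative assumptions, and a single file larger than the limit still ends up in one oversized part (zipfile has no native multi-volume support).
import os
import zipfile

MAX_PART_SIZE = 10 * 1024 * 1024  # ~10 MB per part (illustrative)

def pack_into_parts(source_dir, out_prefix, part_ready=None):
    """Pack the files under source_dir into out_prefix.part001.zip, .part002.zip, ...
    A part is closed (and reported via part_ready) once the uncompressed size
    of the files added to it reaches MAX_PART_SIZE."""
    part_no, written = 1, 0
    part_name = "%s.part%03d.zip" % (out_prefix, part_no)
    archive = zipfile.ZipFile(part_name, "w", zipfile.ZIP_DEFLATED)
    for dirpath, dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            archive.write(path, os.path.relpath(path, source_dir))
            written += os.path.getsize(path)
            if written >= MAX_PART_SIZE:
                archive.close()
                if part_ready:
                    part_ready(part_name)  # e.g. hand the finished part to the uploader
                part_no, written = part_no + 1, 0
                part_name = "%s.part%03d.zip" % (out_prefix, part_no)
                archive = zipfile.ZipFile(part_name, "w", zipfile.ZIP_DEFLATED)
    archive.close()
    if part_ready:
        part_ready(part_name)

# pack_into_parts("/home/me/Pictures/2013", "/tmp/pictures_2013",
#                 part_ready=lambda p: print("ready to upload:", p))
Because the check uses uncompressed sizes, parts usually come out below the limit; true fixed-size volumes would mean splitting the archive bytes afterwards or falling back to an external tool such as rar's volume option.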
I am creating a sort of "command line" in Python. I already added a few functions, such as changing login/password, executing commands, etc. But is it possible to browse the files in the directory that the main file is in with a command/module, or will I have to write the module myself and use the import command? Same thing for changing which directory is being viewed, too.
Browsing files is as easy as using the standard os module. If you want to do something with those files, that's entirely different.
import os
all_files = os.listdir('.') # gets all files in current directory
To change directories you can issue os.chdir('path/to/change/to'). In fact there are plenty of useful functions found in the os module that facilitate the things you're asking about. Making them pretty and user-friendly, however, is up to you!
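A minimal sketch of how those two calls could back "ls" and "cd" commands in a hand-rolled command loop (the prompt, command names and error handling are just illustrative):
import os

def command_loop():
    """Tiny shell: 'ls' lists the current directory, 'cd <path>' changes it."""
    while True:
        try:
            line = input("%s> " % os.getcwd()).strip()
        except EOFError:
            break
        if line == "ls":
            for name in os.listdir("."):
                print(name)
        elif line.startswith("cd "):
            try:
                os.chdir(line[3:].strip())
            except OSError as exc:
                print("cannot change directory:", exc)
        elif line in ("exit", "quit"):
            break
        elif line:
            print("unknown command:", line)

# command_loop()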
I'd like to see someone write a semantic file browser, i.e. one that auto-generates tags for files according to their content and then allows views and searching accordingly.
Think about it... take an MP3, look up the lyrics, run it through Zemanta, bam! A PDF file, an OpenOffice file, etc. - that'd be pretty kick-butt! Probably fairly intensive too, but it'd be pretty dang cool!
Cheers,
-C
Is it bad to output many files to the same directory in Unix/Linux? I run thousands of jobs on a cluster and each outputs a file to a single directory. The upper bound here is around ~50,000 files. Can I/O speed suffer because of this? If so, does the problem go away with a nested directory structure?
Thanks.
See:
How many files can I put in a directory?
I believe that most filesystems store the names of contained files in a list (or some other linear-time-access data structure), so storing large numbers of files in a single directory can cause slowness for simple operations like listing. Having a nested structure can ameliorate this problem by creating a tree (or even a trie, if it makes sense) of names, which can reduce the time it takes to retrieve file stats.
My suggestion is to use a nested directory structure (i.e. categorization). You can name the subdirectories using timestamps, special prefixes for each application, etc. This gives you a sense of order when you need to search for specific files and makes it easier to manage them.
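A small sketch of one common way to do that in Python, hashing each output file's name and using the first characters of the digest as subdirectory names (the MD5 choice, bucket width and level count are illustrative assumptions):
import hashlib
import os

def sharded_path(base_dir, filename, levels=1, width=2):
    """Return e.g. base_dir/ab/filename, where 'ab' comes from the MD5 hex
    digest of the filename; the bucket directories are created as needed."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    buckets = [digest[i * width:(i + 1) * width] for i in range(levels)]
    target_dir = os.path.join(base_dir, *buckets)
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    return os.path.join(target_dir, filename)

# with open(sharded_path("results", "job_12345.out"), "w") as f:
#     f.write("...")
With a single level of two hex characters, ~50,000 files spread across 256 buckets, so each directory holds roughly 200 files; adding a second level is easy if the counts grow.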