Alternative for nested loop operation in Python?

I want a faster alternative to a nested loop operation in which the second loop runs after some operation in the first loop.
For example:
import pandas as pd

date = target_date_list = pd.date_range(start=start_date, end=end_date).strftime(f'year=%Y/month=%m/day=%d')
for date in target_date_list:
    folder = f'path_to_folder/{date}'
    for file in folder:
        ...  # some operation

There is no meaningfully faster alternative here. The inner loop's values are dependent on the value generated by the outer loop, so the micro-optimization of using itertools.product isn't available.
If you're actually iterating a directory (not characters in a string describing a directory), I'd strongly recommend using os.scandir over os.listdir (assuming like many folks you were using the latter without knowing the former existed), as it's much faster when:
You're operating on large directories
You're filtering the contents based on stat info (in particular entry types, which come for free without a stat at all; on Windows, you get even more for free, and anywhere else if you do stat, it's cached on the entry so you can check multiple results without triggering a re-stat)
With os.scandir, an inner loop previously implemented like:
for file in os.listdir(dir):
    path = os.path.join(dir, file)
    if file.endswith('.txt') and os.path.isfile(path) and os.path.getsize(path) > 4096:
        ...  # do stuff with 4+KB file described by "path"
can simplify slightly and speed up by changing to:
with os.scandir(dir) as direntries:
    for entry in direntries:
        if entry.name.endswith('.txt') and entry.is_file() and entry.stat().st_size >= 4096:
            ...  # do stuff with 4+KB file described by "entry.path"
but fundamentally, this optimization has nothing to do with avoiding nested loops; if you want to iterate all the files, you have to iterate all the files. A nested loop will need to occur somehow even if you hide it behind utility methods, and the cost will not be meaningful relative to the cost of file system access.
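Applied to the question's date-partitioned folders, a minimal sketch might look like the following (the path layout, the isdir guard, and the process() placeholder are assumptions added here, not part of the original code):
import os
import pandas as pd

target_date_list = pd.date_range(start=start_date, end=end_date).strftime('year=%Y/month=%m/day=%d')

for date in target_date_list:
    folder = f'path_to_folder/{date}'  # hypothetical partitioned layout from the question
    if not os.path.isdir(folder):  # skip dates that have no folder on disk
        continue
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                process(entry.path)  # placeholder for "some operation"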

As a rule of thumb, your best bet for better performance in a for loop is to use a generator expression. However, I suspect that the performance boost for your particular example will be minimal, since your outer loop is just the trivial task of building a string.
date = target_date_list = pd.date_range(start=start_date, end=end_date).strftime(f'year=%Y/month=%m/day=%d')
for file in (f'path_to_folder/{date}' for date in target_date_list):
    ...  # some operation

Related

Fast looping through two lists comparing regex pattern using value from one list against values in another list

I'm currently trying to deal with a tricky problem in Python. To set the scene, I'm using Let's Encrypt, so I have the (posix)paths of the live directory and I have a list of domains from a CMS.
I'm trying to compare the domains against the paths, which requires a regular-expression-style match using the value from the first list, because the paths contain the domain names. I can't see how I would do a set intersection, because the values don't match exactly, and a traditional for loop is really painfully slow when you have >12000 domain names and >10000 certificates (paths).
So, some explanatory code:
import re
from datetime import datetime
from cryptography.x509 import load_pem_x509_certificate
from cryptography.hazmat.backends import default_backend, openssl

all_domains = function_that_returns_domains_as_list()
all_paths = function_that_returns_certificate_paths()
nocert_list = list()

def cert_check(path):
    cert = load_pem_x509_certificate(path.read_bytes(), default_backend())
    cur_date = datetime.now()
    end_date = cert.not_valid_after
    ...  # More logic and functions for checking if the certificate has expired etc.

def path(domain):
    for path in all_paths:
        if path.match(f"*/{domain}*"):
            return path

def check_domain_certs():
    for domain in all_domains:
        path_check = path(domain)
        if not path_check:
            nocert_list.append(domain)
        if path_check:
            cert_path = path_check
            cert_check(cert_path)
Even if I don't call the cert_check function inside check_domain_certs and instead just add the path to a list to process outside the loop, the looping itself takes a long time (I ran it whilst typing out this message and it only just finished some ~30 minutes later, probably something to do with it having to loop about 120 million times).
I've run down a lot of stackoverflow rabbit holes today so I'm actually turning to the community for help this time.
The things that you may try are:
Search for optimization techniques in Python (using Python builtins might help!).
Generally, list comprehensions are faster than equivalent for loops.
My general advice is to use list comprehensions together with Cython: first rewrite the loop as a list comprehension, then move it into a .pyx file and cythonize it.
Cython, in general, translates Python code to C, and loops in particular run significantly faster with it. You can check one of Sentdex's videos explaining the general principles of Cython; as you can see around 23:20 of the video, it can make a significant difference, on the order of ~100x faster than a plain Python loop.
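For reference, a minimal sketch of that cythonize step, assuming the loop has been moved into a file named check_domains.pyx (the file name is a placeholder, not from the question):
# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("check_domains.pyx"))
After building, the compiled module can be imported like any other Python module.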
In case the search fails more often than it succeeds, this may be faster:
def check_domain_certs():
    all_paths_as_string = str(set(all_paths))
    for domain in all_domains:
        if domain not in all_paths_as_string:
            nocert_list.append(domain)
        else:
            cert_path = path(domain)  # full search returning the cert path
            cert_check(cert_path)
...

One-line "for" loop using list comprehension

Someone has challenged me to create a program that sorts their pictures into folders based on the month they were taken, and I want to do it in one line (I know it's inefficient and unreadable, but I still want to do it because one-liners are cool).
I needed a for loop to accomplish this, and the only way I know of to use a for loop in one line is a list comprehension, so that's what I did; but it creates an empty list and doesn't print anything from the list.
What I'm doing is renaming the file to be the month created + original filename (ex: bacon.jpg --> May\bacon.jpg)
Here is my code (Python 3.7.3):
import time
import os.path
[os.rename(str(os.fspath(f)), str(time.ctime(os.path.getctime(str(os.fspath(f))))).split()[1] + '\\' + str(os.fspath(f))) for f in os.listdir() if f.endswith('.jpg')]
and the more readable, non-list-comprehension version:
import time
import os.path
for f in os.listdir():
    fn = str(os.fspath(f))
    dateCreated = str(time.ctime(os.path.getctime(fn)))
    monthCreated = dateCreated.split()[1]
    os.rename(fn, monthCreated + '\\' + fn)
Is list comprehension a bad way to do it? Also, is there a reason why, if I print the list it's [] instead of [None, None, None, None, None, (continuing "None"s for every image moved)]?
Please note: I understand that it's inefficient and bad practice. If I were doing this for purposes other than just for fun to see if I could do it, I would obviously not try to do it in one line.
This is bad in two immediate respects:
You're using a list comprehension when you're not actually interested in constructing a list -- you ignore the object you just constructed.
Your construction has an ugly side effect in the OS.
Your purpose appears to be renaming a sequence of files, not constructing a list. The Python facility you want is, I believe, the map function. Write a function to change one file name, and then use map on a list of file names -- or tuples of old, new file names -- to run through the sequence of desired changes.
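A minimal sketch of that map-based approach (the rename_by_month helper, the makedirs call, and the list() wrapper to force the lazy map are assumptions added here, not part of the original answer):
import os
import os.path
import time

def rename_by_month(filename):
    # hypothetical helper: move one file into a folder named after its creation month
    month = time.ctime(os.path.getctime(filename)).split()[1]
    os.makedirs(month, exist_ok=True)  # the target folder must exist (see the note about os.mkdir below)
    os.rename(filename, os.path.join(month, filename))

# map is lazy in Python 3, so wrap it in list() to actually run the renames
list(map(rename_by_month, (f for f in os.listdir() if f.endswith('.jpg'))))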
Is list comprehension a bad way to do it?
YES. But if you want to do it in one line, it is either that or using ";". For instance:
for x in range(5): print(x);print(x+2)
And, by the way, just renaming a file including a slash will not create a folder. You have to use os.mkdir('foldername').
In the end, if you really want to do that, I would just recommend doing it normally in many lines and then separating it with semicolons in a single line.

Data structure for filesystem

I'm storing / caching the filesystem (filenames only) in memory to be able to do fast searches à la Everything. Thus I don't want to use the OS's built-in file search GUI.
I do it with:
import os
L = []
for root, dirs, files in os.walk(PATH):
    L.append([root, files])
and the result is like this:
[['D:\\', ['a.jpg', 'b.jpg']],
...
['D:\\Temp12', ['test.txt', 'test2.txt']]]
The problem is that doing a search takes too much time when L contains millions of elements:
query = 'test2'  # searching for filenames containing this text
for dir in L:
    for f in dir[1]:
        if query in f:
            print '%s found: %s' % (query, os.path.join(dir[0], f))
Indeed, this is a very naive search because it requires browsing the whole list to find items.
How can I make the queries faster?
It seems that a list is not the right data structure for full-text search; is there a tree-like structure I could use instead?
Searching a list is O(n); searching a dictionary is amortized O(1). If you don't need to associate values, use sets.
If you want to read more about this: https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
In your case, I would use sets. It will make your queries a lot faster.
EDIT:
The way you are doing it, checking every filename for a match, can't be made quicker as such. Even if you used a dict, you would still check every filename for a match.
New idea:
You can create a dict with every filename as a key and its root as the value (see the sketch below). This way you can recreate the full path later.
The idea is then to build a tree (a trie) where each node is a letter and where the paths between nodes spell out words (the filenames). It could be difficult to implement, and the result may not be faster depending on how you construct the tree.
You have to remember that for substring queries you want to check each and every filename, and using a list or a dict won't change that. The tree/graph is the only alternative I can think of.
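A minimal sketch of that filename-to-root dict (a defaultdict of lists is used here so that duplicate filenames in different directories are not lost; that detail is an assumption, not part of the answer):
import os
from collections import defaultdict

index = defaultdict(list)  # filename -> list of directories that contain it
for root, dirs, files in os.walk(PATH):
    for name in files:
        index[name].append(root)

# exact-name lookups are now amortized O(1)
for root in index.get('test2.txt', []):
    print(os.path.join(root, 'test2.txt'))
Substring queries like 'test2' still have to scan all the keys, as the answer notes.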
Could you consider using a database for this?
SQLite offers a :memory: option, which creates your database in memory only. Of course you can optimise your algorithm and data structure as pointed out in other answers and comments, but databases are generally already very good at this with their indexing, and you would not need to design something similar yourself.
Your table(s) could be as simple as one table with the fields full_path and filename; if you indexed it by filename, lookups would be fast. This would store a lot of redundant information, as every file would carry its full path in full_path. A better solution would be one table for directories and another for files, with files referencing directories, so you can reconstruct the full path of a match.
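A rough sketch of the in-memory, single-table variant (the column and index names are assumptions; the LIKE query still does a substring scan, but the data stays in one indexed place):
import os
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE files (full_path TEXT, filename TEXT)')
conn.execute('CREATE INDEX idx_filename ON files (filename)')

with conn:  # one transaction for the bulk insert
    for root, dirs, files in os.walk(PATH):
        conn.executemany(
            'INSERT INTO files VALUES (?, ?)',
            ((os.path.join(root, name), name) for name in files))

# substring query on the filename column
for full_path, name in conn.execute(
        'SELECT full_path, filename FROM files WHERE filename LIKE ?', ('%test2%',)):
    print(full_path)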
Just a thought.
Hannu

Most efficient way in Python to iterate over a large file (10GB+)

I'm working on a Python script to go through two files: one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the first file. The purpose of the program is to create a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with each UUID as the key and 'hits' as the value. Then another loop iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle):  # start matching UUID entries in log file to UUIDs from rulebase
    if logFunc.progress(lineCount, logSize):  # check progress
        print logFunc.progress(lineCount, logSize)  # print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1:  # for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1  # if matched, increment the relevant value in the uidHits list
            break  # as we've already found the match, don't process the rest
    lineCount += 1
It works as it should, but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading files in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
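A hedged sketch of what that might look like once adapted to the question (the log-line format, the UUID regex, and the filtering against uidHits are all assumptions about data that isn't shown):
import re
from collections import Counter

UUID_RE = re.compile(r'[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}')

def uuid(line):
    # hypothetical extractor: pull the first UUID-shaped token out of a log line
    match = UUID_RE.search(line)
    return match.group(0) if match else None

with open('log.txt') as log:
    counts = Counter(uuid(line) for line in log)

# keep only the UUIDs from the rulebase (uidHits in the question's code)
hits = {uid: counts.get(uid, 0) for uid in uidHits}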
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In Python 2.x it'll look something like:
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure whether you'll see a performance gain, since I haven't processed 10GB of data with it myself, but you might explore this framework.
Try measuring where most of the time is spent, using a profiler: http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data. If the list of UUIDs isn't very long, you may find, for example, that a large proportion of the time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more (see the sketch below).
In any case, you can eliminate the lineCount variable and use i instead, and logLine.find(uid) != -1 might be faster than logLine.count(uid) == 1 if the lines are very long.
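A minimal sketch of that restructuring, assuming the UUID sits in a fixed field of each line (SPLIT_CHAR and UUID_FIELD are the placeholders used earlier in this thread):
for i, logLine in enumerate(logHandle):
    uid = logLine.split(SPLIT_CHAR)[UUID_FIELD]  # extract the candidate id once per line
    if uid in uidHits:  # O(1) membership test against the dict of known UUIDs
        uidHits[uid] += 1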

Comparing file contents in Python

I have two files, say source and target. I compare each element in source to check if it also exists in target. If it does not exist in target, I print it (the end goal is to have 0 difference). Here is the code I have written.
def finddefaulters(source, target):
    f = open(source, 'r')
    g = open(target, 'r')
    reference = f.readlines()
    done = g.readlines()
    for i in reference:
        if i not in done:
            print i,
I need help with:
How this code would be rated on a scale of 1-10.
How I can make it better and more optimal if the file sizes are huge.
Another question: when I read all the lines as list elements, they are interpreted as 'element\n', so for a correct comparison I have to add a newline at the end of each file. Is there a way to strip the newlines so I do not have to add a newline at the end of the files? I tried rstrip, but it did not work.
Thanks in advance.
Regarding efficiency: The method you show has an asymptotic runtime complexity of O(m*n), where m and n are the number of elements in reference and done, i.e. if you double the size of both lists, the algorithm will run 4 times longer (times a fixed constant that is uninteresting to theoretical computer scientists). If m and n are very large, you will probably want to choose a faster algorithm, e.g. sort the two lists first using .sort() (runtime complexity: O(n * log(n))) and then go through the lists just once (runtime complexity: O(n)). That algorithm has a worst-case runtime complexity of O(n * log(n)), which is already a big improvement. However, you trade readability and simplicity of the code for efficiency, so I would only advise you to do this if absolutely necessary.
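A minimal sketch of that sort-then-scan pass (it assumes the lines compare directly once sorted, and it collects the missing lines in a list rather than printing them; that detail is an addition here):
reference.sort()
done.sort()

missing = []
j = 0
for item in reference:
    # advance through the sorted 'done' list until it reaches or passes 'item'
    while j < len(done) and done[j] < item:
        j += 1
    if j >= len(done) or done[j] != item:
        missing.append(item)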
Regarding coding style: You do not .close() the file handles, which you should. Instead of opening and closing the file handles explicitly, you could use Python's with statement. Also, if you like the functional style, you could replace the for loop with a list comprehension:
for i in reference:
    if i not in done:
        print i,
then becomes:
items = [i.strip() for i in reference if i not in done]
print ' '.join(items)
However, this way you will not see any progress while the list is being composed.
As joaquin already mentions, you can loop over f directly instead of over f.readlines(), since file handles support the iterator protocol.
Some ideas:
1) Use the with statement to open files safely:
with open(source) as f:
    .............
The with statement is used to wrap the execution of a block with methods defined by a context manager. This allows common try...except...finally usage patterns to be encapsulated for convenient reuse.
2) You can iterate over the lines of a file directly instead of using readlines():
for line in f:
    ..........
3) Although for this short snippet it could be enough, try to use more informative names for your variables. One-letter names are not recommended.
4) If you want to take advantage of the Python standard library, try the functions in the difflib module. For example:
make_file(fromlines, tolines[, fromdesc][, todesc][, context][, numlines])
Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML file containing a table showing line-by-line differences with inter-line and intra-line changes highlighted.
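For instance, a hedged sketch of calling it through difflib.HtmlDiff (the diff.html output name is a placeholder; source and target are the file paths from the question):
import difflib

with open(source) as f, open(target) as g:
    fromlines = f.readlines()
    tolines = g.readlines()

html = difflib.HtmlDiff().make_file(fromlines, tolines, fromdesc=source, todesc=target)
with open('diff.html', 'w') as out:
    out.write(html)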
