First, I am aware of the number of Python memory error questions on SO, but so far none has matched my use case.
I am currently trying to parse a bunch of text files (~6k files, ~30 GB) and store each unique word. Yes, I am building a wordlist; no, I am not planning on doing evil things with it, it is for university.
I implemented the list of found words as a set (created with words = set([]), used with words.add(word)) and I am just adding every found word to it, considering that the set mechanics should remove all duplicates.
This means that I need permanent access to the whole set for this to work (Or at least I see no alternative, since the whole list has to be checked for duplicates on every insert).
Right now, I am running into a MemoryError about 25% through, when it uses about 3.4 GB of my RAM. I am on 32-bit Linux, so I know where that limitation comes from, and my PC only has 4 GB of RAM, so even 64-bit would not help here.
I know that the complexity is probably terrible (probably O(n) on each insert, although I don't know how Python sets are implemented (trees?)), but it is still (probably) faster and (definitely) more memory efficient than adding each word to a primitive list and removing duplicates afterwards.
Is there any way to get this to run? I expect about 6-10 GB of unique words, so using my current RAM is out of the question, and upgrading my RAM is currently not possible (and does not scale too well once I start letting this script loose on larger amounts of files).
My only idea at the moment is caching on disk (which will slow the process down even more), or writing temporary sets to disk and merging them afterwards, which will take even more time and the complexity would be horrible indeed. Is there even a solution that will not result in horrible runtimes?
For the record, this is my full source. As it was written for personal use only, it is pretty horrible, but you get the idea.
import os
import sys

words=set([])
lastperc = 0
current = 1
argl = 0

print "Searching for .txt-Files..."
for _,_,f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            argl=argl+1

print "Found " + str(argl) + " Files. Beginning parsing process..."
print "0% 50% 100%"
for r,_,f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            fobj = open(os.path.join(r,file),"r")
            for line in fobj:
                line = line.strip()
                word, sep, remains = line.partition(" ")
                if word != "":
                    words.add(word)
                word, sep, remains = remains.partition(" ")
                while sep != "":
                    words.add(word)
                    word, sep, remains2 = remains.partition(" ")
                    remains = remains2
                if remains != "":
                    words.add(remains)
            newperc = int(float(current)/argl*100)
            if newperc-lastperc > 0:
                for i in range(newperc-lastperc):
                    sys.stdout.write("=")
                    sys.stdout.flush()
                lastperc = newperc
            current = current+1

print ""
print "Done. Set contains " + str(len(words)) + " different words. Sorting..."
sorteddic = sorted(words, key=str.lower)
print "Sorted. Writing to File"
print "0% 50% 100%"
lastperc = 0
current = 1
sdicl = len(sorteddic)-1
fobj = open(sys.argv[1],"w")
for element in sorteddic:
    fobj.write(element+"\n")
    newperc = int(float(current)/sdicl*100)
    if newperc-lastperc > 0:
        for i in range(newperc-lastperc):
            sys.stdout.write("=")
            sys.stdout.flush()
        lastperc = newperc
    current = current+1

print ""
print "Done. Enjoy your wordlist."
Thanks for your help and ideas.
You're probably going to need to store the keys on disk. A key-value store like Redis might fit the bill.
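A minimal sketch of that idea (Python 3, using the redis-py client and a Redis server assumed to be running on localhost; the key name "words" is arbitrary). The server does the de-duplication, so the Python process never holds the full set:
import os
import re
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumes a local Redis server

for root, _, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            with open(os.path.join(root, name), errors="ignore") as f:
                for line in f:
                    tokens = re.findall(r"\w+", line)
                    if tokens:
                        r.sadd("words", *tokens)  # SADD silently ignores duplicates

print(r.scard("words"))  # number of unique words stored so far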
Do you really mean 6-10GB of unique words? Is this English text? Surely even counting proper nouns and names there shouldn't be more than a few million unique words.
Anyway, what I would do is process one file at a time, or even one section (say, 100k) of a file at a time, and build a unique wordlist just for that portion. Then just union all the sets as a post-processing step.
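A rough sketch of that approach (Python 3), assuming simple regex tokenisation; one small set is built per file and they are unioned at the end as described:
import os
import re

def words_in(path):
    # one small set per file
    with open(path, errors="ignore") as f:
        return set(re.findall(r"\w+", f.read().lower()))

partial_sets = []
for root, _, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            partial_sets.append(words_in(os.path.join(root, name)))

# post-processing step: union all the per-file sets
all_words = set().union(*partial_sets)
print(len(all_words))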
My inclination is towards a database table, but if you want to stay within a single framework, check out PyTables: http://www.pytables.org/moin
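For illustration, here is a sketch of the database-table route using the standard-library sqlite3 module rather than PyTables itself (the table and file names are made up); the primary-key constraint does the de-duplication on disk:
import os
import re
import sqlite3

conn = sqlite3.connect("words.db")
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

for root, _, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            with open(os.path.join(root, name), errors="ignore") as f:
                for line in f:
                    # duplicate inserts are ignored thanks to the PRIMARY KEY
                    conn.executemany(
                        "INSERT OR IGNORE INTO words VALUES (?)",
                        ((w,) for w in re.findall(r"\w+", line)),
                    )

conn.commit()
print(conn.execute("SELECT COUNT(*) FROM words").fetchone()[0])
conn.close()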
The first thing I'd try would be to restrict words to lower-case characters – as Tyler Eaves pointed out, this will probably reduce the set size enough to fit into memory. Here's some very basic code to do this:
import os
import fnmatch
import re

def find_files(path, pattern):
    # os.walk yields (dirpath, dirnames, filenames)
    for root, directories, files in os.walk(path):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(root, f)

words = set()
for file_name in find_files(".", "*.txt"):
    with open(file_name) as f:
        data = f.read()
        words.update(re.findall(r"\w+", data.lower()))
A few more comments:
I would usually expect the dictionary to grow rapidly at the beginning; very few new words should be found late in the process, so your extrapolation might severely overestimate the final size of the word list.
Sets are very efficient for this purpose. They are implemented as hash tables, and adding a new word has an amortised complexity of O(1).
Hash your keys into a codespace that is smaller and more manageable. Key the hash to a file containing the keys with that hash. The table of hashes is much smaller and the individual key files are much smaller.
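A possible sketch of that scheme (Python 3): words are bucketed into 256 files by hash, and each much smaller bucket is then de-duplicated on its own. The bucket count and file names are arbitrary:
import os
import re

N_BUCKETS = 256
buckets = [open("bucket_%03d.txt" % i, "w") for i in range(N_BUCKETS)]

for root, _, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            with open(os.path.join(root, name), errors="ignore") as f:
                for line in f:
                    for word in re.findall(r"\w+", line):
                        # same word always lands in the same bucket (within one run)
                        buckets[hash(word) % N_BUCKETS].write(word + "\n")

for b in buckets:
    b.close()

# each bucket fits in memory on its own, so it can be de-duplicated separately
total = 0
for i in range(N_BUCKETS):
    with open("bucket_%03d.txt" % i) as f:
        total += len(set(f.read().split()))
print(total)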
Related
I am trying to process ~20 GB of data on an Ubuntu system with 64 GB of RAM.
This step is part of some preprocessing steps to generate feature vectors for training an ML algorithm.
The original implementation (written by someone on my team) uses lists. It does not scale well as we add more training data. It looks something like this:
from glob import glob
from tqdm import tqdm

all_files = glob("./Data/*.*")
file_ls = []
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        file_ls.append(f.read())
This runs into a memory error (the process gets killed).
So I thought I should try replacing the list-based approach with tries:
def insert(word):
    cur_node = trie_root
    for letter in word:
        if letter in cur_node:
            cur_node = cur_node[letter]
        else:
            cur_node[letter] = {}
            cur_node = cur_node[letter]
    cur_node[None] = None  # marks the end of a word

trie_root = {}
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        for word in f.read().split():  # insert each word, not the whole list
            insert(word)
This too gets killed. The above is demo code that I wrote to capture the memory footprint of the objects. The worst part is that the list demo runs standalone, but the trie demo gets killed, leading me to believe that this implementation is worse than the list implementation.
My goal is to write some efficient code in Python to resolve this issue.
Kindly help me solve this problem.
EDIT:
Responding to @Paul Hankin: the data processing involves first taking each file and adding a generic placeholder for terms with a normalized term frequency greater than 0.01, after which each file is split into a list and a vocabulary is calculated taking all the processed files into consideration.
One of the simple solutions to this problem might be to NOT store the data in a list or any other in-memory data structure. You can try writing the data to a file as you read it.
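A rough sketch of that idea, reusing the glob/tqdm pattern from the question; process_text is a placeholder for whatever per-file preprocessing you actually do, and the output file name is made up:
from glob import glob
from tqdm import tqdm

def process_text(text):
    # placeholder: replace with the real term-frequency/placeholder logic
    return " ".join(text.split())

with open("processed.txt", "w", encoding="utf-8") as out:
    for fi in tqdm(glob("./Data/*.*")):
        with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
            # one file in memory at a time; results go straight to disk
            out.write(process_text(f.read()) + "\n")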
I'm trying to write a script in Python for sorting through files (photos, videos), checking the metadata of each, and finding and moving all duplicates to a separate directory. I got stuck with the metadata checking part. I tried os.stat, but it doesn't return True for duplicate files. Ideally, I should be able to do something like:
if os.stat("original.jpg") == os.stat("duplicate.jpg"):
    shutil.copy("duplicate.jpg", "C:\\Duplicate Folder")
Pointers anyone?
There are a few things you can do. You can compare the contents or hash of each file, or you can check a few select properties from the os.stat result, e.g.:
import os

def is_duplicate(file1, file2):
    stat1, stat2 = os.stat(file1), os.stat(file2)
    return stat1.st_size == stat2.st_size and stat1.st_mtime == stat2.st_mtime
A basic loop using a set to keep track of already encountered files:
import glob
import hashlib

uniq = set()
for fname in glob.glob('*.txt'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
        if sig not in uniq:
            uniq.add(sig)
            print fname
        else:
            print fname, " (duplicate)"
Please note that, as with any hash function, there is a slight chance of collision, that is, two different files having the same digest. Depending on your needs, this may or may not be acceptable.
According to Thomas Pornin in another answer:
"For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability [of collision] is about 4.3*10-60."
Given your needs, if you have to check additional properties in order to identify "true" duplicates, change the sig = ... line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uid as returned by os.stat()), write:
sig = ( hashlib.sha256(f.read()).digest(),
os.stat(fname).st_uid )
If two files have the same md5 they are exact duplicates.
from hashlib import md5
with open(file1, "r") as original:
    original_md5 = md5(original.read()).hexdigest()
with open(file2, "r") as duplicate:
    duplicate_md5 = md5(duplicate.read()).hexdigest()
if original_md5 == duplicate_md5:
    do_stuff()
In your example you're using jpg files; in that case you want to call open with its second argument equal to "rb". For that, see the documentation for open.
os.stat offers information about a file's metadata and attributes, including the creation time. That is not a good approach for finding out whether two files are the same.
For instance, two files can be the same and have different creation times, so comparing stats will fail here. Sylvain Leroux's approach is the best one for combining performance and accuracy, since it is very rare for two different files to have the same hash.
So, unless you have an incredibly large amount of data and a wrongly flagged duplicate would be catastrophic, this is the way to go.
If that is your case (it doesn't seem to be), well... the only way you can be 100% sure two files are the same is to iterate over them and compare byte by byte.
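If you do need that byte-by-byte check, the standard library already has it; a minimal sketch using filecmp.cmp with shallow=False, which compares contents in small chunks rather than loading whole files into memory:
import filecmp

def identical(file1, file2):
    # shallow=False forces an actual content comparison, not just an os.stat check
    return filecmp.cmp(file1, file2, shallow=False)

if identical("original.jpg", "duplicate.jpg"):
    print("exact duplicates")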
I have big svmlight files that I'm using for machine learning purposes. I'm trying to see if subsampling those files would lead to good enough results.
I want to extract random lines from my files to feed into my models, but I want to load as little information into RAM as possible.
I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions end up loading everything into memory.
Could someone give me some hints? Thank you.
EDIT: I forgot to say that I know the number of lines in my files beforehand.
You can use a heapq to select n records based on a random number, e.g.:
import heapq
import random

SIZE = 10

with open('yourfile') as fin:
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())
This is remarkably efficient: the heap remains a fixed size, it doesn't require a pre-scan of the data, and elements get swapped out as other elements are chosen instead, so at most you'll end up with SIZE elements in memory at once.
One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.
import random
import os
import glob

for name in glob.glob("*.py"):
    mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime = os.stat(name)
    inf = open(name, "r")
    location = random.randint(0, size)
    inf.seek(location)
    # scan backwards until a newline (or the start of the file) is found
    while location > 0:
        char = inf.read(1)
        if char == "\n":
            break
        location -= 1
        inf.seek(location)
    line = inf.readline()
    print name, ":", line[:-1]
As long as the lines aren't huge this shouldn't be unduly burdensome.
You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.
Actually since you're interested in multiple lines, you should look at Efficiently selecting a set of random elements from a linked list.
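A minimal sketch of that count-then-reread idea extended to several lines at once (Python 3; the file name and counts below are made up). The target line numbers are picked up front and emitted during a single pass, so only the chosen lines are kept in memory:
import random

def random_lines(path, k, total_lines):
    # total_lines is known beforehand (or obtained by one counting pass)
    wanted = set(random.sample(range(total_lines), k))
    picked = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i in wanted:
                picked.append(line)
    return picked

print(random_lines("train.svmlight", 10, 1000000))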
I have a JSON file, data_large, of size 150.1 MB. The content inside the file is of the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores across the items.
This is what I'm doing:
import ijson # since json file is large, hence making use of ijson
f = open ('data_large')
content = ijson.items(f, 'item') # json loads quickly here as compared to when json.load(f) is used.
print set(i['score'] for i in content) #this line is actually taking a long time to get processed.
Can I make the print set(i['score'] for i in content) line more efficient? Currently it takes 201 seconds to execute. Can it be made more efficient?
This will give you the set of unique score values (only) as ints. You'll need the 150 MB of free memory. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).
import re
import time

t = time.time()

obj = re.compile('{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(m.group(1) for m in obj.finditer(data))
s = set(map(int, s))

print time.time() - t
Using re.findall() also seems to be about three times faster than the json parser, it consumes about 260 MB:
import re

obj = re.compile('{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))
I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (when consuming the generator from ijson.items), the whole file needs to be processed eventually.
The advantage to using ijson is that you only need to have a small amount of data in memory at any given time. This probably doesn't matter too much for a file with a hundred or so megabytes of data, but would be a very big deal if your data file grew to be gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to be running on a high-performance server or workstation with lots and lots of RAM available, there may not be any reason to hold back.
So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might try testing to see if using json.load to read the whole file at the start, then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.
On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?
#!/usr/local/cpython-2.7/bin/python

from __future__ import print_function

import json  # plain json module: loads the whole file at once

with open('data_large', 'r') as file_:
    content = json.load(file_)

print(set(element['score'] for element in content))
Try using a set
set([x['score'] for x in scores])
For example
>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])
In the directory I have, say, 30 txt files, each containing two columns of numbers with roughly 6000 numbers in each column. What I want to do is to import the first 3 txt files, process the data, which gives me the desired output, then I want to move on to the next 3 txt files.
The directory looks like:
file0a
file0b
file0c
file1a
file1b
file1c ... and so on.
I don't want to import all of the txt files simultaneously; I want to import the first 3, process the data, then the next 3, and so forth. I was thinking of making a dictionary, though I have a feeling this might involve writing each file name in the dictionary, which would take far too long.
EDIT:
For those that are interested, I think I have come up with a workaround. Any feedback would be greatly appreciated, since I'm not sure if this is the quickest way to do things or the most Pythonic.
import glob
import numpy as np

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

Data = []
txt_files = glob.iglob("./*.txt")
for data in txt_files:
    d = np.loadtxt(data, dtype=np.float64)
    Data.append(d)

Data_raw_all = list(chunks(Data, 3))
Here the list 'Data' holds all of the text files from the directory, and 'Data_raw_all' uses the function 'chunks' to group the elements of 'Data' into sets of 3. This way, selecting one element of Data_raw_all selects the corresponding 3 text files in the directory.
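A possible variation on this workaround (Python 3 here, and only a sketch): chunk the file names instead of the loaded arrays, so that only three files' worth of data is in memory at any one time:
import glob
import numpy as np

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

txt_files = sorted(glob.glob("./*.txt"))
for group in chunks(txt_files, 3):
    data = [np.loadtxt(name, dtype=np.float64) for name in group]
    # process `data` (three arrays) here; it is released before the next group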
First of all, I have nothing original to include here and I definitely do not want to claim credit for it at all because it all comes from the Python Cookbook 3rd Ed and from this wonderful presentation on generators by David Beazley (one of the co-authors of the aforementioned Cookbook). However, I think you might really benefit from the examples given in the slideshow on generators.
What Beazley does is chain a bunch of generators together in order to do the following:
yields filenames matching a given filename pattern.
yields open file objects from a sequence of filenames.
concatenates a sequence of generators into a single sequence
greps a series of lines for those that match a regex pattern
All of these code examples are located here. The beauty of this method is that the chained generators simply chew up the next pieces of information: they don't load all files into memory in order to process all the data. It's really a nice solution.
Anyway, if you read through the slideshow, I believe it will give you a blueprint for exactly what you want to do: you just have to change it for the information you are seeking.
In short, check out the slideshow linked above and follow along and it should provide a blueprint for solving your problem.
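For a rough idea of what that looks like, here is a simplified pipeline in the spirit of those slides (not Beazley's exact code); each stage pulls data lazily, so no file is ever loaded whole, and the regex and pattern at the end are just placeholders:
import fnmatch
import os
import re

def gen_find(pattern, top):
    # yield file names matching a pattern
    for root, _, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def gen_open(filenames):
    # yield open file objects, one at a time
    for name in filenames:
        with open(name, errors="ignore") as f:
            yield f

def gen_cat(sources):
    # concatenate a sequence of iterables into a single stream of lines
    for src in sources:
        for item in src:
            yield item

def gen_grep(pattern, lines):
    # keep only lines matching a regex
    pat = re.compile(pattern)
    return (line for line in lines if pat.search(line))

# usage: count matching lines across all .txt files without loading them whole
lines = gen_cat(gen_open(gen_find("*.txt", ".")))
print(sum(1 for _ in gen_grep(r"\d+", lines)))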
I'm presuming you want to hardcode as few of the file names as possible. Therefore most of this code is for generating the filenames. The files are then opened with a with statement.
Example code:
from itertools import count

root = "UVF2CNa"

for n in count(1):
    for char in "abc":  # try each suffix once per n (cycle() here would never advance n)
        first_part = "{}{}{}".format(root, n, char)
        try:
            with open(first_part + "i") as i,\
                 open(first_part + "j") as j,\
                 open(first_part + "k") as k:
                # do stuff with files i, j and k here
                pass
        except FileNotFoundError:
            # deal with this however
            pass