I'd like to calculate the number of documents stored in a MongoDB BSON file without having to import the file into the database via mongorestore.
The best I've been able to come up with in Python is:
import bson

bson_doc = open('./archive.bson', 'rb')
it = bson.decode_file_iter(bson_doc)
total = sum(1 for _ in it)
print(total)
This works in theory, but is slow in practice when the BSON documents are large. Does anyone have a quicker approach to counting the documents in a BSON file without doing a full decode?
I am currently using Python 2.7 and pymongo.
https://api.mongodb.com/python/current/api/bson/index.html
I don't have a file at hand to try, but I believe there's a way - if you parse the data by hand.
The source for bson.decode_file_iter (sans the docstring) goes like this:
_UNPACK_INT = struct.Struct("<i").unpack

def decode_file_iter(file_obj, codec_options=DEFAULT_CODEC_OPTIONS):
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normaly.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        obj_size = _UNPACK_INT(size_data)[0] - 4
        elements = size_data + file_obj.read(obj_size)
        yield _bson_to_dict(elements, codec_options)
I presume the time-consuming operation is the _bson_to_dict call - and you don't need it.
So all you need is to read the file: get the int32 value that holds the next document's size, skip that many bytes, and count how many documents you've encountered along the way.
So, I believe, this function should do the trick:
import struct
import os
from bson.errors import InvalidBSON

def count_file_documents(file_obj):
    """Counts how many documents the provided BSON file contains."""
    cnt = 0
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normally.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        obj_size = struct.Struct("<i").unpack(size_data)[0] - 4
        # Skip the next obj_size bytes
        file_obj.seek(obj_size, os.SEEK_CUR)
        cnt += 1
    return cnt
(I haven't tested the code, though. Don't have MongoDB at hand.)
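For reference, a minimal usage sketch of the function above against the file from the question (assuming ./archive.bson exists):
# Hypothetical usage of count_file_documents() on the question's file
with open('./archive.bson', 'rb') as bson_file:
    print(count_file_documents(bson_file))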
Related
I am writing code for an information retrieval project. It reads Wikipedia pages in XML format from a file, processes the strings (I've omitted this part for the sake of simplicity), tokenizes them, and builds positional indexes for the terms found on the pages. It then saves the indexes to a file using pickle once, and reads them back from that file on subsequent runs to save processing time (I've included the code for those parts, but they're commented out).
After that, I need to fill a 1572 * ~97000 matrix (1572 is the number of Wiki pages, and ~97000 is the number of terms found in them). Each Wiki page is treated as a vector of words, and vectors[i][j] is the number of occurrences of the j'th word of the word set on the i'th Wiki page. (Again this is simplified, but it doesn't matter.)
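For scale, a quick back-of-envelope check of the dense matrix footprint alone, assuming the stated 1572 x ~97000 float16 dimensions:
import numpy as np

# Rough size of a dense 1572 x 97000 float16 matrix (2 bytes per element)
vectors = np.zeros((1572, 97000), dtype=np.float16)
print(vectors.nbytes / 2**30)  # ~0.28 GiB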
The problem is that the code takes far too much memory to run, and even then, from some point between the 350th and 400th row of the matrix onward, it stops making progress (without actually terminating). I thought the problem was memory, because once usage exceeded my 7.7 GiB of RAM and 1.7 GiB of swap, it stopped and printed:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
But when I added 6 GiB of memory by creating a swap file for Python 3.7 (using the script recommended here), the program didn't run out of memory, but instead got stuck, as I said, at a point between the 350th and 400th iteration of i in the loop at the bottom, with 7.7 GiB of RAM + 3.9 GiB of swap occupied. Instead of Ubuntu 18.04, I tried it on Windows 10, where the screen simply went black. I tried it on Windows 7 as well, again to no avail.
Next I thought it was a PyCharm issue, so I ran the file with the python3 file.py command, and it got stuck at the very same point as with PyCharm. I even used the numpy.float16 datatype to save memory, but it had no effect. I asked a colleague about their matrix dimensions; they were similar to mine, but they weren't having problems with it. Is it malware or a memory leak? Or am I doing something wrong here?
import pickle
from hazm import *
from collections import defaultdict
import numpy as np

'''For each word there's one of these. It stores the word's frequency, and the positions it has occurred in on each wiki page'''
class Positional_Index:
    def __init__(self):
        self.freq = 0
        self.title = defaultdict(list)
        self.text = defaultdict(list)

'''Here I tokenize words and construct indexes for them'''
# tree = ET.parse('Wiki.xml')
# root = tree.getroot()
# index_dict = defaultdict(Positional_Index)
# all_content = root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')
#
# for page_index, pg in enumerate(all_content):
#     title = pg.find('{http://www.mediawiki.org/xml/export-0.10/}title').text
#     txt = pg.find('{http://www.mediawiki.org/xml/export-0.10/}revision') \
#         .find('{http://www.mediawiki.org/xml/export-0.10/}text').text
#
#     title_arr = word_tokenize(title)
#     txt_arr = word_tokenize(txt)
#
#     for term_index, term in enumerate(title_arr):
#         index_dict[term].freq += 1
#         index_dict[term].title[page_index] += [term_index]
#
#     for term_index, term in enumerate(txt_arr):
#         index_dict[term].freq += 1
#         index_dict[term].text[page_index] += [term_index]
#
# with open('texts/indices.txt', 'wb') as f:
#     pickle.dump(index_dict, f)

with open('texts/indices.txt', 'rb') as file:
    data = pickle.load(file)

'''Here I'm trying to keep the number of occurrences of each word on each page'''
page_count = 1572
vectors = np.array([[0 for j in range(len(data.keys()))] for i in range(page_count)], dtype=np.float16)
words = list(data.keys())
word_count = len(words)
const_log_of_d = np.log10(1572)

""" :( """
for i in range(page_count):
    for j in range(word_count):
        vectors[i][j] = (len(data[words[j]].title[i]) + len(data[words[j]].text[i]))
    if i % 50 == 0:
        print("i:", i)
Update: I tried this on a friend's computer, and this time it killed the process somewhere between the 1350th and 1400th iteration.
I'm trying to count the number of docs in a Firestore collection with Python. When I use db.collection('xxxx').stream() I get the following error:
503 The datastore operation timed out, or the data was temporarily unavailable.
about halfway through. It was working fine up to that point. Here is the code:
docs = db.collection(u'theDatabase').stream()

count = 0
for doc in docs:
    count += 1
print(count)
Every time I get a 503 error at about 73,000 records. Does anyone know how to overcome the 20 second timeout?
Although Juan's answer works for basic counting, if you need more of the data from Firebase and not just the id (a common use case being a full migration of the data that doesn't go through GCP), the recursive algorithm will eat your memory.
So I took Juan's code and transformed it into a standard iterative algorithm. Hope this helps someone.
limit = 1000  # Reduce this if it uses too much of your RAM

def stream_collection_loop(collection, count, cursor=None):
    while True:
        docs = []  # Very important. This frees the memory incurred in the recursion algorithm.

        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in collection.limit(limit).order_by('__name__').stream()]

        for doc in docs:
            print(doc.id)
            print(count)
            # The `doc` here is already a `DocumentSnapshot` so you can already call `to_dict` on it to get the whole document.
            process_data_and_log_errors_if_any(doc)
            count = count + 1

        if len(docs) == limit:
            cursor = docs[limit - 1]
            continue

        break

stream_collection_loop(db_v3.collection('collection'), 0)
Try using a recursive function to batch document retrievals and keep them under the timeout. Here's an example based on the delete_collections snippet:
from google.cloud import firestore

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()

def count_collection(coll_ref, count, cursor=None):
    if cursor is not None:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()]
    else:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").stream()]

    count = count + len(docs)

    if len(docs) == 1000:
        return count_collection(coll_ref, count, docs[999].get())
    else:
        print(count)

count_collection(db.collection('users'), 0)
Other answers have shown how to use pagination to solve the timeout issue.
I suggest using a generator in combination with pagination, which lets you process the documents the same way you did with query.stream().
Here is an example of a function that takes a Query and returns a generator in the same way as the Query.stream() method.
from typing import Generator, Optional, Any

from google.cloud.firestore import Query, DocumentSnapshot

def paginate_query_stream(
    query: Query,
    order_by: str,
    cursor: Optional[DocumentSnapshot] = None,
    page_size: int = 10000,
) -> Generator[DocumentSnapshot, Any, None]:
    paged_query = query.order_by(order_by)
    document = cursor
    has_any = True
    while has_any:
        has_any = False
        if document:
            paged_query = paged_query.start_after(document)
        paged_query = paged_query.limit(page_size)
        for document in paged_query.stream():
            has_any = True
            yield document
Keep in mind that if your target collection is constantly growing, you need to bound the query from above in advance to prevent a potential infinite loop.
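As a sketch of such bounding (this assumes each document carries a server timestamp field, hypothetically named created here, and that the query orders by that same field):
import datetime
from google.cloud.firestore import Query

# Hypothetical: only count documents that existed when the scan started,
# assuming each document carries a 'created' timestamp field
cutoff = datetime.datetime.now(tz=datetime.timezone.utc)
query = Query(db.collection(u'theDatabase')).where(u'created', u'<=', cutoff)

count = sum(1 for _ in paginate_query_stream(query, order_by=u'created'))
print(count)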
A usage example that counts documents:
from google.cloud.firestore import Query

docs = db.collection(u'theDatabase')
# Query without conditions, get all documents.
query = Query(docs)

count = 0
for doc in paginate_query_stream(query, order_by='__name__'):
    count += 1
print(count)
Is there a memory limit for Python? I've been using a Python script to calculate average values from a file that is at least 150 MB in size.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count += 1

    for k in range(0, count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4

    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years, most not too major. This is so that if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError:  # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))

        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]

        for print_counter, average in enumerate(averages):
            print(' {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
    file_write.write('\n')

file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better to iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
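For example, a minimal sketch with csv.reader (the file name is taken from the question; what you do with each row is up to you):
import csv

with open("A1_B1_100000.txt") as input_file:
    for fields in csv.reader(input_file, delimiter='\t'):
        # fields is a list of column strings, with the line ending already handled
        numbers = [float(value) for value in fields if value]
        # ... accumulate running totals here, as in the answer above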
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On Jython 2.5 it crashes earlier:
239000 [MiB]
Probably I could configure Jython to use more memory (it uses the limits from the JVM).
Test app:
import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read it line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)

    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4

    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages and (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]

    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1
I'm using a function to build an array of strings (which happen to consist of 0s and 1s only), and the strings can be rather large. The function works when I am building smaller strings, but somehow the data type seems to restrict the strings to 32 characters (U32) without my having asked for it. Am I missing something simple?
As I build the strings, I first cast them to lists so I can more easily manipulate individual characters before joining them back into a string. Am I somehow limiting my ability to use 'larger' data types by this method? The value of np.max(CM1) in this case is something like ~300 (one recent run yielded 253), but the strings only come out 32 characters long...
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
    for position, cell in np.ndenumerate(biopsy_list):
        if cell == 0: continue
        temp_parent = 2
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if cell == 1:
            derived_genomes_inBx[position] = ''.join(bitstring)
            continue
        else:
            while temp_parent > 1:
                temp_parent = family_dict[cell]
                bitstring[cell-1] = '1'
                if temp_parent == 1: break
                cell = family_dict[cell]
            derived_genomes_inBx[position] = ''.join(bitstring)
    return derived_genomes_inBx
The specific error message I get is:
Traceback (most recent call last):
  File "biopsyCA.py", line 77, in <module>
    if genome[site] == '1':
IndexError: string index out of range
family_dict is a dictionary which carries the parent-child relationships that the algorithm above works through to reconstruct the 'genome' of individuals from the branching family tree. It basically sets positions in the bitstring to '1' if your parent had it, then your grandparent, and so on, until you get to the first bit, which is always '1', at which point it should be done.
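As a toy illustration of that ancestry walk (not the author's exact function; the family_dict values here are made up and assume each cell id maps to its parent id):
# Hypothetical parent map: cell 4's parent is 3, 3's is 2, 2's is 1 (the root)
family_dict = {2: 1, 3: 2, 4: 3}

cell = 4
bits = ['1'] + ['0'] * 3   # first bit is always '1'
while cell > 1:
    bits[cell - 1] = '1'   # mark this ancestor's position
    cell = family_dict[cell]
print(''.join(bits))       # -> '1111'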
The 32 character limitation comes from the conversion of the float64 array to a string array in this line:
derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
The resulting array contains datatype S32 values which limit the contents to 32 characters.
To change this limit, use 'S300' or larger instead of str.
You may also use map(str, np.zeros(len(biopsy_list))) to get a more flexible string list and convert it back to a numpy array with numpy.array() after you have populated it.
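A small sketch of the truncation and of the wider-dtype fix (the 300-character width is illustrative; in practice it would come from np.max(CM1)):
import numpy as np

a = np.zeros(3).astype(str)    # dtype ends up as '<U32' (or 'S32' on Python 2)
a[0] = '1' + '0' * 299         # silently truncated
print(a.dtype, len(a[0]))      # <U32 32

b = np.zeros(3, dtype='U300')  # room for 300-character bitstrings
b[0] = '1' + '0' * 299
print(b.dtype, len(b[0]))      # <U300 300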
Thanks to help from a number of folks here and local, I finally got this working and the working function is:
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = list(map(str, np.zeros(len(biopsy_list))))
    for biopsy in range(0, len(biopsy_list)):
        if biopsy_list[biopsy] == 0:
            bitstring = (np.max(CM1))*'0'
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if biopsy_list[biopsy] == 1:
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        else:
            temp_parent = family_dict[biopsy_list[biopsy]]
            bitstring[biopsy_list[biopsy]-1] = '1'
            while temp_parent > 1:
                temp_parent = family_dict[temp_parent]
                bitstring[temp_parent-1] = '1'
                if temp_parent == 1: break
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
    return derived_genomes_inBx
The original problem was, as Teppo Tammisto pointed out, an issue with the 'str' datatype taking the 'S32' format. Once I changed to using list(map(str, ...)) a few more issues arose with the original code, which I've now fixed. When I finish this thesis chapter I'll publish the whole family of functions used to virtually 'biopsy' a cellular automaton model (well, just an array really) and reconstruct 'genomes' from family tree data and the current automaton state vector.
Thanks all!
I have a large 40 million line, 3 gigabyte text file (it probably won't fit in memory) in the following format:
399.4540176 {Some other data}
404.498759292 {Some other data}
408.362737492 {Some other data}
412.832976111 {Some other data}
415.70665675 {Some other data}
419.586515381 {Some other data}
427.316825959 {Some other data}
.......
Each line starts off with a number and is followed by some other data. The numbers are in sorted order. I need to be able to:
Given a number x and a range y, find all the lines whose number is within y of x. For example, if x=20 and y=5, I need to find all lines whose number is between 15 and 25.
Store these lines into another separate file.
What would be an efficient method to do this without having to trawl through the entire file?
If you don't want to generate a database ahead of time for line lengths, you can try this:
import os
import sys

# Configuration, change these to suit your needs
maxRowOffset = 100  # increase this if some lines are being missed
fileName = 'longFile.txt'
x = 2000
y = 25

# seek to first character c before the current position
def seekTo(f, c):
    while f.read(1) != c:
        f.seek(-2, 1)

def parseRow(row):
    return (int(row.split(None, 1)[0]), row)

minRow = x - y
maxRow = x + y
step = os.path.getsize(fileName)/2.

with open(fileName, 'r') as f:
    while True:
        f.seek(int(step), 1)
        seekTo(f, '\n')
        row = parseRow(f.readline())
        if row[0] < minRow:
            if minRow - row[0] < maxRowOffset:
                with open('outputFile.txt', 'w') as fo:
                    for row in f:
                        row = parseRow(row)
                        if row[0] > maxRow:
                            sys.exit()
                        if row[0] >= minRow:
                            fo.write(row[1])
            else:
                step /= 2.
                step = step * -1 if step < 0 else step
        else:
            step /= 2.
            step = step * -1 if step > 0 else step
It starts by performing a binary search on the file until it is near (less than maxRowOffset away from) the row to find. Then it starts reading every line until it finds one that is greater than or equal to x-y. That line, and every line after it, is written to an output file until a line is found that is greater than x+y, at which point the program exits.
I tested this on a 1,000,000 line file and it runs in 0.05 seconds. Compare this to reading every line which took 3.8 seconds.
You need random access to the lines, which you won't get with a text file unless the lines are all padded to the same length.
One solution is to dump the table into a database (such as SQLite) with two columns, one for the number and one for all the other data (assuming that the data is guaranteed to fit into whatever the maximum number of characters allowed in a single column in your database is). Then index the number column and you're good to go.
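A minimal sketch of that SQLite route (the file, table, and column names here are placeholders, and it assumes every line has data after the leading number):
import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS lines (number REAL, rest TEXT)')

# Load the file once; each line is "number<whitespace>other data"
with open('longFile.txt') as f:
    rows = ((float(number), rest) for number, rest in
            (line.split(None, 1) for line in f))
    conn.executemany('INSERT INTO lines VALUES (?, ?)', rows)
conn.execute('CREATE INDEX IF NOT EXISTS idx_number ON lines (number)')
conn.commit()

# Range query: all lines whose number is within y of x
x, y = 20.0, 5.0
for number, rest in conn.execute(
        'SELECT number, rest FROM lines WHERE number BETWEEN ? AND ?',
        (x - y, x + y)):
    print(number, rest)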
Without a database, you could read through the file one time and create an in-memory data structure containing pairs of values (number, line-offset). You calculate the line-offset by adding up the lengths of each row (including the line end). Now you can binary-search these value pairs on number and randomly access the lines in the file using the offset. If you need to repeat the search later, pickle the in-memory structure and reload it for later re-use.
This reads the entire file (which you said you don't want to do), but does so only once to build the index. After that you can execute as many requests against the file as you want and they will be very fast.
Note that this second solution is essentially creating a database index on your text file.
Rough code to create the index in the second solution:
import pickle

line_end_length = len('\n')  # must be a better way to do this!
offset = 0
index = []  # probably a better structure to use than a list

f = open(filename)
for row in f:
    nbr = float(row.split(' ')[0])
    index.append([nbr, offset])
    offset += len(row) + line_end_length

pickle.dump(index, open('filename.idx', 'wb'))  # saves it for future use
Now, you can perform a binary search on the list. There's probably a much better data structure to use for accruing the index values than a list, but I'd have to read up on the various collection types.
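A sketch of the lookup side, using bisect on the [number, offset] pairs built above (the output path and the x, y values are placeholders):
import bisect

def write_lines_in_range(index, data_path, out_path, x, y):
    # First index entry whose number could be >= x - y
    start = bisect.bisect_left(index, [x - y, -1])
    with open(data_path) as data_file, open(out_path, 'w') as out_file:
        for number, offset in index[start:]:
            if number > x + y:
                break
            data_file.seek(offset)
            out_file.write(data_file.readline())

write_lines_in_range(index, filename, 'output.txt', x=20, y=5)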
Since you want to match the first field, you can use gawk:
$ gawk '{if ($1 >= 15 && $1 <= 25) { print }; if ($1 > 25) { exit }}' your_file
Edit: Taking a file with 261,775,557 lines that is 2.5 GiB big, and searching for lines 50,010,015 to 50,010,025, this takes 27 seconds on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz. Sounds good enough for me.
In order to find the line that starts with the number just above your lower limit, you have to go through the file line by line until you find that line. There is no other way, i.e. all the data in the file has to be read and parsed for newline characters.
We have to run this search up to the first line that exceeds your upper limit and then stop. Hence, it helps that the file is already sorted. This code will hopefully help:
with open(outpath, 'w') as outfile:
    with open(inpath) as infile:
        for line in infile:
            t = float(line.split()[0])
            if lower_limit <= t <= upper_limit:
                outfile.write(line)
            elif t > upper_limit:
                break
I think theoretically there is no other option.