Process millions of image files and insert each file to mongodb - python

I have a large collection (~800,000 images, 1TB+ total size) of image files on S3 that I use some Python code to process into a dictionary for insertion to MongoDB. The dictionary contains a buffer that is used by a command like np.frombuffer to reconstruct the image.
I need to process each file and insert it into MongoDB. So far I've tried multiprocessing the code, and while this is effective, it gets slower and slower with each insert: it takes 20 min for 50,000 files but 5 hours for 250,000 files.
I have two things I'm unsure about:
1) Why does inserting get so much slower as the number of documents in the database increases, and how can I address that? I'm guessing that the more records there are, the more work Mongo has to do to check whether the record being inserted already exists, but I'm not sure how to mitigate this.
2) What is the best approach to this type of problem? Another idea I had was to write the processed image files locally and then bulk insert them.
Code sample below:
import io
from multiprocessing import Pool

import boto3
import cv2
import numpy as np
from PIL import Image
from pymongo import MongoClient

s3 = boto3.resource('s3')

def process_image(img_file):
    # define MongoClient and collection ('db_name' is a placeholder)
    client = MongoClient(...)
    collection = client['db_name']['collection_name']
    # read image file from s3
    obj = s3.Object(bucket_name='test_bucket', key=img_file)
    im = Image.open(io.BytesIO(obj.get()['Body'].read()))
    # create image buffer (imencode returns a (success, array) pair)
    _, buffer = cv2.imencode(".jpg", np.asarray(im))
    buffer = buffer.flatten().tobytes()  # usually around 100,000 bytes
    # dict to be written to mongo
    d = {}
    d['filename'] = img_file
    d['buffer'] = buffer
    # insert to mongo
    collection.insert_one(d)

### multiprocessing code
pool = Pool(processes=16)
results = pool.map(process_image, ls_filenames, chunksize=500)
pool.close()
pool.join()
ls_filenames has around 800k image paths in it.

There's unnecessary overhead creating the MongoClient each time. Create it once and reuse the connection.
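One way to follow this advice with multiprocessing is to open the client once per worker process through the pool's initializer, and to write each batch with a single insert_many call instead of one insert_one per file. The sketch below is only an illustration under assumptions: the names init_worker, chunked, build_doc, process_batch and run are made up here, build_doc is a stand-in for the S3 download and JPEG encoding from the question, and the batch size of 100 is a guess that would need tuning.

```python
from multiprocessing import Pool

_collection = None  # set once per worker process by init_worker


def init_worker(uri, db_name, coll_name):
    # Runs once when each worker starts, so the client is not rebuilt per file.
    global _collection
    from pymongo import MongoClient  # assumes pymongo is installed in the worker
    _collection = MongoClient(uri)[db_name][coll_name]


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_doc(filename):
    # Hypothetical stand-in for the S3 download + JPEG encoding step.
    return {"filename": filename, "buffer": b""}


def process_batch(filenames, collection=None):
    # Build the documents for one batch and send them in one insert_many call.
    target = collection if collection is not None else _collection
    docs = [build_doc(name) for name in filenames]
    target.insert_many(docs, ordered=False)
    return len(docs)


def run(ls_filenames, uri, db_name, coll_name, workers=16, batch_size=100):
    # Each worker keeps one client for its whole lifetime.
    batches = list(chunked(ls_filenames, batch_size))
    with Pool(workers, initializer=init_worker,
              initargs=(uri, db_name, coll_name)) as pool:
        return sum(pool.map(process_batch, batches))
```

On platforms that start workers with spawn, run(...) should be called from under an `if __name__ == "__main__":` guard. Whether this removes the slowdown described in the question is not established here; it only removes the per-file connection setup and the per-file round trip.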

Related

SQLite Database: One Big vs. Several Small? Write to Database in Parallel?

As I'm new to sqlite databases, I highly appreciate every useful comment, answer or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files each with the size of ~7GB. The relevant information in these files are written into a sqlite database resulting in a 17.000.000x4 table, which takes approximately 1 day. Later on the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel. For instance, I could run several processes in parallel, each process taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT1: Answer w.r.t. W4t3randWinds' comment: It is possible (and faster) to process one file per core, write the results into separate databases, and merge all databases afterwards. However, writing into a single database using multithreading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database? For instance, does it make sense to create a database per txt file resulting in 400 databases consisting of a 17.000.000/400 x 4 table?
At last, I'm storing the database as a file on my machine. However, I also read about the possibility to set up a server. So when does it make sense to use a server and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
import io
import os
import sqlite3

### SET UP
# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")
# set up variable to store db rows
to_db = []
# set input directory (expanduser resolves the leading ~)
indir = os.path.expanduser('~/data/')
### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
    if filename.endswith(".txt"):
        filename = os.path.join(indir, filename)
        # open txt files in dir
        with io.open(filename, mode = 'r', encoding = 'utf-8') as mytxt:
            ### EXTRACT RELEVANT INFORMATION
            # for every line in txt file
            for i, line in enumerate(mytxt):
                # strip linebreak
                line = line.strip()
                # read line where the sentence is stated
                if i == 0 or i % 9 == 0:
                    sentence = line
                    ngram = " ".join(line.split(" ")[:-1])
                    word = line.split(" ")[-1]
                # read line where the result is stated
                if (i-4) == 0 or (i-4) % 9 == 0:
                    result = line.split(r'= ')[1].split(r' [')[0]
                    # make a tuple representing a new row of db
                    db_row = (sentence, ngram, word, result)
                    to_db.append(db_row)
### WRITE TO DATABASE
# add new rows to db (column names must match the CREATE TABLE statement)
cur.executemany("INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel
I am not sure of that. There is only a little processing, so the whole process is likely to be I/O bound. SQLite is a very nice tool, but it allows only one writer at a time.
Possible improvements:
- Use x threads to read and process the text files, plus a single thread that writes to the database in large chunks, connected by a queue. As the process is I/O bound, the Python Global Interpreter Lock should not be a problem.
- Use a full-featured database like PostgreSQL or MariaDB on a separate machine, with multiple processes on the client machine each processing its own set of input files.
In either case, I am unsure of the benefit...
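The first suggestion (several reader threads feeding one writer through a queue) could look roughly like the sketch below. It is a simplified illustration, not the asker's parser: the "sentence|probability" line format, the two-column table and the names reader, writer and load are assumptions made to keep the example short.

```python
import queue
import sqlite3
import threading

SENTINEL = None  # a reader puts this on the queue when it has finished


def reader(lines, out_q, chunk_size=1000):
    # Hypothetical parser: each line "sentence|probability" becomes one row.
    chunk = []
    for line in lines:
        sentence, prob = line.rsplit("|", 1)
        chunk.append((sentence, float(prob)))
        if len(chunk) >= chunk_size:
            out_q.put(chunk)
            chunk = []
    if chunk:
        out_q.put(chunk)
    out_q.put(SENTINEL)


def writer(db_path, in_q, n_readers):
    # Only this thread touches the connection, so SQLite sees a single writer.
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS t (sentence, probability)")
    finished = 0
    while finished < n_readers:
        chunk = in_q.get()
        if chunk is SENTINEL:
            finished += 1
            continue
        db.executemany("INSERT INTO t VALUES (?, ?)", chunk)
        db.commit()
    db.close()


def load(db_path, inputs):
    # `inputs` is a list of line iterables, one per reader thread.
    q = queue.Queue(maxsize=16)  # bounded so readers cannot outrun the writer
    w = threading.Thread(target=writer, args=(db_path, q, len(inputs)))
    w.start()
    readers = [threading.Thread(target=reader, args=(lines, q)) for lines in inputs]
    for r in readers:
        r.start()
    for r in readers:
        r.join()
    w.join()
```

In practice each element of inputs would be an open text file rather than a list. As the answer says, the gain is uncertain if disk I/O is the real bottleneck.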
I do daily updates to an SQLite database using Python multithreading. It works beautifully. Two different tables have nearly 20,000,000 records, one with 8 fields and the other with 10. This is on my laptop, which is 4 years old.
If you are having performance issues I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD you will gain amazing performance by upgrading to an SSD.

Hitting AWS Lambda Memory limit in Python

I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here lets say the buckets are Sectors and Market Cap ranges which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests that cover 20 years of data (and yes, I get them), I begin to hit the memory limits in AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe. This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end result dataframe that is then converted back into JSON and is less than 500 MB.
This entire process when it works locally outside the lambda is about 40 seconds.
I have tried running this with threads and processing single frames at once but the performance degrades to about 1 min 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside of node.js without the use of a lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this a little to pull out some items but here is the code used.
Read data from S3 into JSON this will result in a list of string data.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket,key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe for working
for k,v in s3Data.items():
    lstdf.append(buildDataframefromJson(k,v))

def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker','totalReturn','isExcluded','marketCapStartUsd',
                                  'category','marketCapBand','peGreaterThanMarket', 'Month','epsUsd']
                         )
    #Read the json into a dataframe
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker':str,
                             'totalReturn':np.float32,
                             'isExcluded':bool,
                             'marketCapStartUsd':np.float32,
                             'category':str,
                             'marketCapBand':str,
                             'peGreaterThanMarket':bool,
                             'epsUsd':np.float32
                         })[['ticker','totalReturn','isExcluded','marketCapStartUsd','category',
                             'marketCapBand','peGreaterThanMarket','epsUsd']]
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-'+ str(dtTmp.month),'%Y-%m')
    tmpdf.insert(0,'Month',dtTmp, allow_duplicates=True)
    return tmpdf
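Since the memory limit is hit when every file is held in one large dataframe, one alternative is to fold each monthly file into running per-bucket totals and discard it before reading the next. The sketch below shows that idea in plain Python. It assumes each file is a JSON array of records carrying the field names from the dtype mapping above, that rows with isExcluded set are skipped, and that a count and a mean return per (category, marketCapBand) bucket are the wanted outputs. None of that is confirmed by the question, and the function names are made up.

```python
import json
from collections import defaultdict


def aggregate_file(json_text, totals):
    # Fold one monthly file into the running totals; nothing else is kept.
    for row in json.loads(json_text):
        if row.get("isExcluded"):
            continue
        bucket = totals[(row["category"], row["marketCapBand"])]
        bucket["count"] += 1
        bucket["return_sum"] += row["totalReturn"]


def aggregate_all(json_texts):
    # `json_texts` may be a generator that downloads one file at a time.
    totals = defaultdict(lambda: {"count": 0, "return_sum": 0.0})
    for text in json_texts:
        aggregate_file(text, totals)
    return {
        key: {"count": b["count"], "mean_return": b["return_sum"] / b["count"]}
        for key, b in totals.items()
    }
```

With this shape, peak memory depends on the size of one file plus the number of buckets rather than on the number of months requested. Aggregations that cannot be built from running sums, such as medians, would need a different approach.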

Optimising / minimizing a local copy of the DB table using python (Currently using Pickle)

I am establishing a connection to a DB hosted on RedShift and want to save the table locally as a pickle file (or any other way) to save time.
When I save 100k records, it takes 163 MB of space. The table has ~14 million records in total, so at that rate the entire table would take around 23 GB.
Whereas one of my colleagues saved the table as .rds file through R and it only took ~500 MBs for the entire 14 million records.
How can I save space and make things faster in python? Why is rds format taking such low space?
Here's a sample of my code (Using Python 2.7.11, cPickle, protocol=cPickle.HIGHEST_PROTOCOL)
import cPickle
import gzip
import os

def write_pickle(folder, filename, my_obj, overwrite = False):
    filepath = os.path.join(folder, filename)
    # write when asked to overwrite, or when the file does not exist yet
    if overwrite or not os.path.isfile(filepath):
        print('Writing into pickle file...')
        with gzip.open(filepath, 'wb') as handle:
            cPickle.dump(my_obj, handle, protocol=cPickle.HIGHEST_PROTOCOL)

connection_obj = form_connection_obj()
cursor = connection_obj.cursor()
cursor.execute('SELECT TOP 100000 * FROM my_dataset;')
write_pickle('folder', 'filename.p', list(cursor.fetchall()))
UPDATE 1: After converting to cPickle from pickle and using the HIGHEST_PROTOCOL the size reduced to half (~83 MBs for 100k records) but it is still very large compared to the R counterpart.
UPDATE 2: After using gzip the size was reduced (~17 MB for 100k records) but it is still more than 3 times R's .rds format and slower too.
This question would be little different to the one suggested since
1) I am not fixated on using pickle, I just want to have a local dump of data to avoid remote DB connection everytime.
2) I want to find the best way to do this (it may not even involve dumping the table locally).
3) Solution suggested in the other question using gzip has made loading the pickle considerably slow.
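The size effects described in UPDATE 1 and UPDATE 2 can be reproduced on a small scale with the standard library. The sketch below uses Python 3 (the question uses Python 2.7 with cPickle) and made-up rows in place of cursor.fetchall(), so the absolute numbers will differ from the ones quoted above; dump_sizes is a hypothetical helper name.

```python
import gzip
import pickle


def dump_sizes(rows):
    # Size in bytes of the same object under three serialisation choices.
    text = pickle.dumps(rows, protocol=0)  # oldest, ASCII-based protocol
    binary = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)
    zipped = gzip.compress(binary)
    return {
        "protocol_0": len(text),
        "highest": len(binary),
        "highest_gzip": len(zipped),
    }


# Made-up rows standing in for the result of cursor.fetchall()
rows = [(i, "customer_%d" % (i % 50), 19.99, "2016-01-01") for i in range(10000)]
sizes = dump_sizes(rows)
print(sizes)
```

A list of row tuples stores every value as a separate Python object, which is one plausible reason a pickle of it is larger than a columnar format. That is an inference, not something measured against the .rds file.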

Pickling pandas dataframe multiplies by 5 the file size

I am reading an 800 MB CSV file with pandas.read_csv, and then using the standard Python pickle.dump(dataframe) to save it. The result is a 4 GB pkl file, so the CSV size is multiplied by 5.
I expected pickle to compress the data rather than expand it, especially since gzip compresses the CSV file to 200 MB, dividing its size by 4.
I want to speed up the loading time of my program and thought that pickling would help, but since disk access is the main bottleneck, I now understand that I should rather compress the files and then use the compression option of pandas.read_csv to speed up loading.
Is that correct?
Is it normal that pickling a pandas dataframe expands the data size?
How do you usually speed up loading time?
What data-size limit would you set for loading with pandas?
Not sure why you think pickling compresses the data; pickling creates a serialized byte-stream version of your Python object so that it can be loaded back as a Python object:
In [388]:
import sys
import os
df = pd.DataFrame({'a':np.arange(5)})
df.to_pickle(r'c:\data\df.pkl')
print(sys.getsizeof(df))
statinfo = os.stat(r'c:\data\df.pkl')
print(statinfo.st_size)
with open(r'c:\data\df.pkl', 'rb') as f:
    print(f.read())
56
700
b'\x80\x04\x95\xb1\x02\x00\x00\x00\x00\x00\x00\x8c\x11pandas.core.frame\x94\x8c\tDataFrame\x94\x93\x94)}\x94\x92\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)}\x94\x92\x94(]\x94(\x8c\x11pandas.core.index\x94\x8c\n_new_Index\x94\x93\x94h\x0b\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94\x8c\x01a\x94at\x94b\x8c\x04name\x94Nu\x86\x94R\x94h\rh\x0b\x8c\nInt64Index\x94\x93\x94}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x05\x85\x94h\x1f\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C(\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x94t\x94bh(Nu\x86\x94R\x94e]\x94h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01K\x05\x86\x94h\x1f\x8c\x02i4\x94K\x00K\x01\x87\x94R\x94(K\x03h5NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x14\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x94t\x94ba]\x94h\rh\x0f}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01\x85\x94h"\x89]\x94h&at\x94bh(Nu\x86\x94R\x94a}\x94\x8c\x060.14.1\x94}\x94(\x8c\x06blocks\x94]\x94}\x94(\x8c\x06values\x94h>\x8c\x08mgr_locs\x94\x8c\x08builtins\x94\x8c\x05slice\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94ua\x8c\x04axes\x94h\nust\x94bb.'
The method to_csv does support compression as a kwarg, 'gzip' and 'bz2':
In [390]:
df.to_csv(r'c:\data\df.zip', compression='bz2')
statinfo = os.stat(r'c:\data\df.zip')
print(statinfo.st_size)
29
It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that rather than loading the CSV file to RAM, as Kathirmani suggested. You will see the speedup in loading time that you expect due simply to the fact that you are not filling up 800 MB worth of RAM every time you load your script.
File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.
Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:
import pandas as pd
import sqlite3
conn = sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"
results_df = pd.read_sql_query(query, con=conn)
...
Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.
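The precursory CSV-to-SQLite step mentioned above can be sketched with the standard library alone. This is a minimal illustration, assuming the CSV has a header row and that the table and column names come from a trusted source (they are interpolated into the SQL text). The function names csv_to_sqlite and query_subset are made up, and the table name foo and column bar follow the query in the answer.

```python
import csv
import sqlite3
from contextlib import closing


def csv_to_sqlite(csv_path, db_path, table="foo"):
    # Stream a CSV file with a header row into a SQLite table, row by row.
    with open(csv_path, newline="") as f, closing(sqlite3.connect(db_path)) as conn:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join('"%s"' % name for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
        conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
        conn.commit()


def query_subset(db_path, bar_value, table="foo"):
    # Fetch only the rows whose `bar` column matches, instead of the whole file.
    with closing(sqlite3.connect(db_path)) as conn:
        sql = 'SELECT * FROM "%s" WHERE bar = ?' % table
        return conn.execute(sql, (bar_value,)).fetchall()
```

Every value is stored as text here because csv.reader yields strings; declaring column types or converting values would be a further step. The import runs once, after which each script run only reads the subset it needs.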
Don't load the 800 MB file into memory; it will increase your loading time. Pickled objects also take more time to load. Instead, store the CSV file as a sqlite3 table (sqlite3 comes with Python), and then query the table each time depending on your need.
You can also use pandas' pickle methods, which can compress your data when the filename ends in a compression extension such as .gz or .bz2 (or when you pass the compression argument).
Save a dataframe:
df.to_pickle(filename)
Load it:
df = pd.read_pickle(filename)

Loading in data via Django orm very memory intensive

So I have a script that loads in data that was created from a python pickle file.
dump_file = open('movies.pkl')
movie_data = pickle.load(dump_file)

#transaction.commit_manually
def load_data(data):
    start = False
    counter = 0
    for item in data:
        counter += 1
        film_name = item.decode(encoding='latin1')
        print "at", film_name, str(counter), str(len(data))
        film_rating = float(data[item][0])
        get_votes = int(data[item][2]['votes'])
        full_date = data[item][2]['year']
        temp_film = Film(name=film_name,date=full_date,rating=film_rating, votes=get_votes)
        temp_film.save()
        for actor in data[item][1]:
            actor = actor.decode(encoding='latin1')
            print "adding", actor
            person = Person.objects.get(full=actor)
            temp_film.actors.add(person)
        if counter % 10000 == 0 or counter % len(data) == 0:
            transaction.commit()
            print "COMMITED"

load_data(movie_data)
So this is a very large data set. And it takes up a lot of memory where it slows down to a crawl, and in the past my solution was to just restart the script from where I left off, so it would take quite a few runs to actually save everything into the database.
I'm wondering if there's a better way to do this (even an optimization in my code would be nice) other than writing raw sql to input the data? I've tried JSON fixtures previously and it was even worse than this method.
If size of movie_data is large, you might wanna divide it into smaller files first and then iterate over them one by one.
Remember to free memory of previously loaded pkl files or keep overwriting the same variable.
If movie data is a list, you can free the memory of, say, 1000 records after you have iterated over them by slicing, such as movie_data=movie_data[1000:], to reduce memory consumption over time.
You can use the bulk_create() method on the QuerySet object to create multiple objects in a single query; it has been available since Django 1.4. Please see the following documentation link -
Bulk Create - https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
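A rough sketch of how batching for bulk_create() could be organised is shown below. To keep it runnable without a Django project, the write step is passed in as a callable. In the asker's code that callable would presumably be Film.objects.bulk_create, and make_obj would build an unsaved Film. The helper names in_batches and bulk_load are made up for this illustration.

```python
from itertools import islice


def in_batches(iterable, size):
    # Yield lists of up to `size` items without materialising the whole input.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_load(items, make_obj, bulk_create, batch_size=1000):
    # `make_obj` builds one unsaved object; `bulk_create` writes a list of them.
    total = 0
    for batch in in_batches(items, batch_size):
        bulk_create([make_obj(item) for item in batch])
        total += len(batch)
    return total
```

Note that bulk_create() does not handle many-to-many relations, so the actors links from the question would still have to be added in a separate step.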
You can also optimize your code by opening the file with the "with" statement in Python. It closes the file for you automatically: do all the file operations inside the with block, and the file stays open there and is closed once you leave the block.
