improve performance fetching data from mongodb - python

earnings = self.collection.find({}) #return 60k documents
----
data_dic = {'score': [], 'reading_time': []}
for earning in earnings:
    data_dic['reading_time'].append(earning["reading_time"])
    data_dic['score'].append(earning["score"])
----
df = pd.DataFrame()
df['reading_time'] = data_dic["reading_time"]
df['score'] = data_dic["score"]
The code between the ---- markers takes 4 seconds to complete. How can I improve this function?

The total time consists of several parts: MongoDB query time, data transfer time, network round trips, and Python list operations. You can optimize each of them.
The first is to reduce the amount of data transferred. Since you only need reading_time and score, fetch only those fields with a projection. If your average document is large, this approach is very effective.
earnings = self.collection.find({}, {'reading_time': True, 'score': True})
Second, MongoDB transfers a limited amount of data per batch. With up to 60k rows, multiple round trips are needed to transfer everything. You can increase the cursor's batch size (batch_size() in PyMongo) to reduce the round trip count.
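For example, combining the projection above with a larger batch size might look like the sketch below; the value 10000 is just an illustrative number to tune, not a recommendation.
earnings = self.collection.find(
    {}, {'reading_time': True, 'score': True}
).batch_size(10000)  # illustrative batch size - experiment with this value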
Third, increase your network bandwidth if you can.
Fourth, you can speed up the loop by using NumPy arrays. They are C-style contiguous arrays and are faster than Python lists. Pre-allocate a fixed-length array and assign values by index; this avoids the internal resizing that happens when you call list.append.
count = earnings.count()  # in newer PyMongo, use self.collection.count_documents({}) instead
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')

for i, earning in enumerate(earnings):
    score[i] = earning["score"]
    reading_time[i] = earning["reading_time"]

df = pd.DataFrame()
df['reading_time'] = reading_time
df['score'] = score


Hadoop stuck on reduce 67% (only with large data)

I'm a beginner at Hadoop and Linux.
The Problem
The Hadoop reduce phase gets stuck (or moves really, really slowly) when the input data is large (e.g. 600k rows or 6M rows), even though the Map and Reduce functions are quite simple: 2021-08-08 22:53:12,350 INFO mapreduce.Job: map 100% reduce 67%.
In the Linux System Monitor I can see that when reduce hits 67%, only one CPU keeps running at 100% while the rest are sleeping (see this picture).
What ran successfully
I ran the MapReduce job with small input data (600 rows) quickly and successfully, without any issue: 2021-08-08 19:44:13,350 INFO mapreduce.Job: map 100% reduce 100%.
Mapper (Python)
#!/usr/bin/env python3
import sys
from itertools import islice
from operator import itemgetter

def read_input(file):
    # read the file except for the first line
    for line in islice(file, 1, None):
        # split the line into words
        yield line.split(',')

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        data_row[7] = data_row[7].replace('\n', '')
        # taking the year and month No. from the first column to create the
        # key that will be sent to the reducer
        date = data_row[0].split(' ')[0].split('-')
        key = str(date[0]) + '_' + str(date[1])
        # value that will be sent to the reducer
        value = ','.join(data_row)
        # print here will send the output pair (key, value)
        print('%s%s%s' % (key, separator, value))

if __name__ == "__main__":
    main()
Reducer (Python)
#!/usr/bin/env python3
from itertools import groupby
from operator import itemgetter
import sys
import pandas as pd
import numpy as np
import time

def read_mapper_output(file):
    for line in file:
        yield line

def main(separator='\t'):
    all_rows_2015 = []
    all_rows_2016 = []
    start_time = time.time()
    names = ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance',
             'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
             'dropoff_latitude', 'total_amount']
    df = pd.DataFrame(columns=names)
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin)
    for words in data:
        # get key & value from the Mapper
        key, value = words.split(separator)
        row = value.split(',')
        # separate data belonging to 2015 from data belonging to 2016
        if key in '2015_01 2015_02 2015_03':
            all_rows_2015.append(row)
            if len(all_rows_2015) >= 10:
                df = df.append(pd.DataFrame(all_rows_2015, columns=names))
                all_rows_2015 = []
        elif key in '2016_01 2016_02 2016_03':
            all_rows_2016.append(row)
            if len(all_rows_2016) >= 10:
                df = df.append(pd.DataFrame(all_rows_2016, columns=names))
                all_rows_2016 = []
    print(df.to_string())
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    main()
More Info
I'm using Hadoop v3.2.1 on Linux, installed on VMware, to run MapReduce jobs in Python.
Reduce Job in Numbers:
Input Data Size   Number of rows    Reduce job time        Result
~98 Kb            600 rows          ~0.1 sec               good
~953 Kb           6,000 rows        ~1 sec                 good
~9.5 Mb           60,000 rows       ~52 sec                good
~94 Mb            600,000 rows      ~5647 sec (~94 min)    very slow
~11 Gb            76,000,000 rows   ??                     impossible
The goal is to run on ~76M rows of input data, which is impossible while this issue remains.
"when reduce hit the 67% only one CPU keep running at the time at 100% and the rest of them are sleeping" - you have skew. One key has far more values than any other key.
I see some problems here.
In the reduce phase you don't do any summarization, you just filter the first quarter of 2015 and 2016 - reduce is meant for summarization, such as grouping by key or doing calculations based on the keys.
If you just need to filter data, do it in the map phase to save cycles (assume you're billed for all the data you move).
You store a lot of data in RAM inside a DataFrame. Since you don't know in advance how many values a key has, you end up thrashing. Combined with heavy keys, this will make your process take a page fault on every DataFrame.append after a while.
There are some fixes:
Do you really need a reduce phase? Since you are just filtering the first three months of 2015 and 2016, you can do this in the Map phase. This will also make the process a bit faster if you need a reduce later, since less data will reach the reduce phase.
def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        # Find out first whether you are keeping this row at all.
        # Take the year and month No. from the first column to create the
        # key that will be sent to the reducer
        date = data_row[0].split(' ')[0].split('-')
        # Filter out everything except Q1 of 2015 and 2016
        # (the split produces strings, so compare against strings)
        if (date[1] in ('01', '02', '03')) and (date[0] in ('2015', '2016')):
            # We keep this row. Calculate the key and clean up data_row[7]
            key = str(date[0]) + '_' + str(date[1])
            data_row[7] = data_row[7].replace('\n', '')
            # value that will be sent to the reducer
            value = ','.join(data_row)
            # print here will send the output pair (key, value)
            print('%s%s%s' % (key, separator, value))
Try not to store data in memory during the reduce. Since you are filtering, print() the results as soon as you have them. If your source data is not sorted, the reduce will serve as a way to bring all data from the same month together.
You've also got a bug in your reduce phase: you're losing the last number_of_records_per_key modulo 10 rows of each key because they are never appended to the dataframe. Don't append to the dataframe at all; print each result as soon as possible.
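For illustration, a minimal sketch of a buffer-free reducer along those lines (hypothetical, not the original code; the key list mirrors the question's filter and all column handling is omitted):
#!/usr/bin/env python3
# Hypothetical sketch: stream matching rows straight to stdout instead of
# accumulating them in a pandas DataFrame.
import sys

def main(separator='\t'):
    wanted_keys = {'2015_01', '2015_02', '2015_03',
                   '2016_01', '2016_02', '2016_03'}
    for line in sys.stdin:
        key, value = line.rstrip('\n').split(separator, 1)
        if key in wanted_keys:
            # emit the row immediately; nothing is held in RAM
            print(value)

if __name__ == "__main__":
    main()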

Trying to calculate correct returns and set constraints on the max and min invested in each asset using 'quad_form'

I am trying to hack together some code that should print the risk and returns of a portfolio, but the first return is 0.00, and that can't be right. Here's the code that I'm testing.
import pandas as pd
# initialize list of lists
data = [[130000, 150000, 190000, 200000], [100000, 200000, 300000, 900000], [350000, 450000, 890000, 20000], [400000, 10000, 500000, 600000]]
# Create the pandas DataFrame
data = pd.DataFrame(data, columns = ['HOSPITAL', 'HOTEL', 'STADIUM', 'SUBWAY'])
# print dataframe.
data
That gives me this data frame.
import numpy as np

symbols = data.columns
# convert daily stock prices into daily returns
returns = data.pct_change()
r = np.asarray(np.mean(returns, axis=1))
r = np.nan_to_num(r)
C = np.asmatrix(np.cov(returns))
C = np.nan_to_num(C)
# print expected returns and risk
for j in range(len(symbols)):
    print('%s: Exp ret = %f, Risk = %f' % (symbols[j], r[j], C[j, j]**0.5))
The result is this.
The hospital risk and return can't be zero. That doesn't make sense. Something is off here, but I'm not sure what.
Finally, I am trying to optimize the portfolio. So, I hacked together this code.
from cvxpy import Variable, quad_form, Problem, Minimize

# Number of variables
n = len(data)
# The variables vector
x = Variable(n)
# The minimum return
req_return = 0.02
# The return
ret = r.T*x
# The risk in xT.Q.x format
risk = quad_form(x, C)
# The core problem definition with the Problem class from CVXPY
prob = Problem(Minimize(risk), [sum(x)==1, ret >= req_return, x >= 0])
try:
    prob.solve()
    print("Optimal portfolio")
    print("----------------------")
    for s in range(len(symbols)):
        print(" Investment in {} : {}% of the portfolio".format(symbols[s], round(100*x.value[s], 2)))
    print("----------------------")
    print("Exp ret = {}%".format(round(100*ret.value, 2)))
    print("Expected risk = {}%".format(round(100*risk.value**0.5, 2)))
except:
    print("Error")
It seems to run, but I don't know how to add constraints. I want to invest at least 5% in every asset and no more than 40% in any one asset. How can I add constraints to do that?
The idea comes from this link.
https://tirthajyoti.github.io/Notebooks/Portfolio_optimization.html
Based on the idea from the link, they drop the NaN row from the monthly returns dataframe, and after converting the returns to a matrix the next step is transposing it - that is the step you are missing, and it is why you are getting 0 returns and risk for HOSPITAL. You might want to change that line to C = np.asmatrix(np.cov(returns.dropna().transpose())) to skip the first NaN row. This should give you the correct returns and risk values.
As for your second question, the CVXPY Problem class takes a list of constraints as its second argument - your code already passes [sum(x)==1, ret >= req_return, x >= 0] - so the minimum and maximum weights can simply be added as extra entries in that list.
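A minimal sketch of how that could look, reusing the variables from the code above (untested against your data; newer CVXPY versions may want cvxpy.sum instead of the built-in sum):
constraints = [
    sum(x) == 1,          # fully invested
    ret >= req_return,    # minimum expected return
    x >= 0.05,            # at least 5% in every asset
    x <= 0.40,            # at most 40% in any one asset
]
prob = Problem(Minimize(risk), constraints)
prob.solve()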
If you would rather post-process an unconstrained solution, you could take the output, cap any over-sized investment at 40%, and distribute the excess among the other assets proportionally. For example, say it tells you to invest 5%, 80% and 15% of your assets in A, B and C. You could cap the investment in B at 40% and move the freed-up 40 points proportionally: (5/(5+15))*40 = 10 points more into A and 30 points more into C.
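A small sketch of that redistribution rule, using the hypothetical 5% / 80% / 15% weights from the example above:
# hypothetical weights, purely illustrative
weights = {'A': 0.05, 'B': 0.80, 'C': 0.15}
cap = 0.40
excess = weights['B'] - cap                    # 0.40 freed up by capping B
weights['B'] = cap
rest = weights['A'] + weights['C']             # 0.20
weights['A'] += excess * weights['A'] / rest   # +0.10 -> 15%
weights['C'] += excess * weights['C'] / rest   # +0.30 -> 45%
print(weights)  # approximately {'A': 0.15, 'B': 0.4, 'C': 0.45}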
DISCLAIMER: I am not an expert in finance, I am just stating my opinion.

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information:
Example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
positon_timestamp_measure: Unix timestamp of when the record was created.
deviceID   mileage   positon_timestamp_measure
54672      10        1600696079
43423      20        1600696079
42342      3         1600701501
54672      3         1600702102
43423      2         1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h), i.e. by calculating the speed of the vehicle from the timestamp and the mileage. The result should then be written back into the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0

for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    #iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[dataset.index == group.dataIndex.values[i], 'validPosition'] = 1

dataset.validPosition.value_counts()
It definitely works the way I want it to, but performance is very poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])

# Subtract the preceding value from the current value within each device group
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')

# The operation above produces NaN for the first value in each group;
# fill 'valid' with 1 for those rows, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1

df_ori['timeHours'] = df_ori['timeGoneSec']/3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1

# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops, i.e. iterating row by row, which can be insanely slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.

Merge CRSP and Eikon through the Eikon api

I am trying to merge CRSP and IBES through the Eikon API.
I have extracted CUSIP codes from CRSP and wants to convert those to RIC codes in order to extract analyst estimates.
When I do the following in Python it returns an error (Payload Too Large). I guess that means I have reached some data limit. But how can the data limit be so low - we are talking about approx. 28,000 requests (data points)? And secondly, how can I circumvent it, if that is possible?
ric = ek.get_symbology(cusips,from_symbol_type="CUSIP", to_symbol_type="RIC")
You could create a loop to retrieve the data in batches:
dfs = []  # Will be a list of dataframes
batchsize = 200

for i in range(0, len(cusips), batchsize):
    batch = cusips[i:i + batchsize]
    r = ek.get_symbology(batch, from_symbol_type="CUSIP", to_symbol_type="RIC")
    dfs.append(r)

rics = pd.concat(dfs)
print(rics)
NB: I haven't tested this specific batch size; play around with the number to see what works best for you.
Hopefully this helps!

mpi4py: dynamic data processing

I have a vector that contains stock tickers like tickers = ['AAPL','XOM','GOOG']. In my "traditional" Python program I would loop over this tickers vector, select one ticker string like AAPL, import a csv file that contains the AAPL stock returns, use the returns as input to a common function, and finally generate a csv file as output. I have over 4000 tickers, and the function applied to each ticker takes time to process. I have access to a computer cluster with the mpi4py package and about 100 processors per job. I understand well (and was able to implement) this MPI example in Python:
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    data = [i for i in range(8)]
    # dividing data into chunks
    chunks = [[] for _ in range(size)]
    for i, chunk in enumerate(data):
        chunks[i % size].append(chunk)
else:
    data = None
    chunks = None

data = comm.scatter(chunks, root=0)
print(str(rank) + ': ' + str(data))
[cha#cluster] ~/utils> mpirun -np 3 ./mpi.py
2: [2, 5]
0: [0, 3, 6]
1: [1, 4, 7]
So in this example, we have a data vector of size 8 and assign an (almost) equal number of its elements to each of the 3 processors. How can I adapt the example above to assign each processor one stock ticker and apply the function that needs to run for each ticker? How can I tell Python that, once a processor gets free, it should go back to the tickers vector and process a ticker that has not yet been processed?
There's another way to think of this. You have 100 processors processing 4000 chunks of data. One way to look at it is that each processor gets a block of data on which to operate. Evenly split, each processor gets 40 tickers to process: processor 0 gets tickers 0-39, processor 1 gets 40-79, and so on.
Thinking this way, you don't need to worry about what happens when a processor finishes its tasks. Just have a loop:
block_size = len(tickers) // size  # this will be 40 in your example
for i in range(block_size):
    ticker = tickers[rank * block_size + i]
    process(ticker)

def process(ticker):
    # load data
    # process data
    # output data
    pass
Does this make sense?
[edit]
If you want to read more, this is really just a variation on row-major order indexing, a common method for accessing multidimensional data that is stored in a single dimension of memory.
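As a tiny illustration of that indexing (toy numbers, not the 100-processor case), element (rank, i) of a conceptual size x block_size grid lands at position rank * block_size + i in the flat tickers list:
# Row-major flattening: element (rank, i) of a (size x block_size) grid
# maps to index rank * block_size + i in the flat list.
size, block_size = 3, 4
flat = list(range(size * block_size))  # stand-in for the tickers list
grid = [[flat[rank * block_size + i] for i in range(block_size)]
        for rank in range(size)]
print(grid)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]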
