I have 22 million rows of house property sale data in a database table called sale_transactions. I am performing a job where I read information from this table, perform some calculations, and use the results to create entries to a new table. The process looks like this:
for index, row in zipcodes.iterrows():  # ~100k zipcodes
    sql_string = """SELECT * from sale_transactions WHERE zipcode = '{ZIPCODE}' """
    sql_query = sql_string.format(ZIPCODE=row['zipcode'])
    df = pd.read_sql(sql_query, _engine)
    area_stat = create_area_stats(df)  # function does calculations
    area_stat.save()  # saves a Django model
At the moment each iteration of this loop takes about 20 seconds on my MacBook Pro (16GB RAM), which means the code is going to take weeks to finish. The expensive part is the read_sql line.
How can I optimize this? I can't read the whole sale_transactions table into memory (it is about 5 GB), hence the SQL query each time to capture the relevant rows with the WHERE clause.
Most answers about optimizing pandas talk about reading with chunking, but here I need all the rows for a zipcode together, since I am performing calculations in the create_area_stats function such as the number of sales over a ten-year period. I don't have easy access to a machine with loads of RAM, unless I start going to town with EC2, which I worry will be expensive and quite a lot of hassle.
Suggestions would be greatly appreciated.
I also faced a similar problem, and the code below helped me read a database (~40 million rows) effectively.
offsetID = 0
totalrow = 0
while True:
    df_Batch = pd.read_sql_query(
        'set work_mem="1024MB"; SELECT * FROM ' + tableName +
        ' WHERE row_number > ' + str(offsetID) + ' ORDER BY row_number LIMIT 100000',
        con=engine)
    if df_Batch.empty:  # no more rows to read
        break
    offsetID = offsetID + len(df_Batch)
    # your operation
    totalrow = totalrow + len(df_Batch)
You have to create an indexed row_number column in your table, so this code reads the table 100,000 rows at a time by that index. For example, when you want to read rows 200,000 - 210,000, you don't need to scan from 0 to 210,000; the rows are fetched directly via the index, which improves performance.
Since the bottleneck in the operation was the SQL WHERE query, the solution was to index the column upon which the WHERE statement was operating (i.e. the zipcode column).
In MySQL, the command for this was:
ALTER TABLE `db_name`.`table`
ADD INDEX `zipcode_index` USING BTREE (`zipcode` ASC);
After making this change, loop execution speed increased roughly 8-fold.
I found this article useful because it encouraged profiling queries with EXPLAIN and looking for column-indexing opportunities when the key and possible_keys values are NULL.
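As a quick way to do that profiling from Python, you can run EXPLAIN through the same engine used in the question. This is only a sketch: the zipcode value is an arbitrary example, and possible_keys / key / rows are columns of MySQL's EXPLAIN output.
import pandas as pd

plan = pd.read_sql(
    "EXPLAIN SELECT * FROM sale_transactions WHERE zipcode = '02138'",
    _engine,
)
# After adding zipcode_index, the 'key' column should name it instead of NULL.
print(plan[['possible_keys', 'key', 'rows']])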
I have a large dataset of 2M rows with columns warehouse code, part code, transport mode, and time_taken. There are 10 ways of shipping a product, and I want to find the transport mode that took the least time for each product. There are 10 different warehouses and more than 2,700 unique products. I have written the function below, but it takes more than 14 hours to execute. Can anyone share a better solution to this problem?
def model_selection_df(df):
    value = pd.DataFrame()
    for WH in Whse:
        wh_df = df[df['WAREHOUSE_CODE'] == WH]
        unique_SKU = wh_df['VEND_PART'].unique()
        for part in unique_SKU:
            df_1 = wh_df[wh_df['VEND_PART'] == part]
            other_method = wh_df[(wh_df['VEND_PART'] == part) & (wh_df['TIME_TAKEN'] == df_1['TIME_TAKEN'].min())]
            value = value.append(other_method)
    return value
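For comparison, a vectorized version of the same logic avoids the nested loops entirely. A minimal sketch, assuming the column names used above (like the original, it keeps every row that ties for the minimum):
def model_selection_df_fast(df):
    # Minimum TIME_TAKEN per (warehouse, part), broadcast back onto every row.
    min_time = df.groupby(['WAREHOUSE_CODE', 'VEND_PART'])['TIME_TAKEN'].transform('min')
    # Keep only the rows that achieve that minimum.
    return df[df['TIME_TAKEN'] == min_time]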
In the following example, t is an increasing sequence going roughly from 0 to 5,000,000 across 1 million rows.
import sqlite3, random, time
t = 0
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, t INTEGER, label TEXT);")
for i in range(1000*1000):
    t += random.randint(0, 10)
    db.execute("INSERT INTO data(t, label) VALUES (?, ?)", (t, 'hello'))
Selecting a range (let's say t = 1,000,000 ... 2,000,000) with an index:
db.execute("CREATE INDEX t_index ON data(t);")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("index: %.1f ms" % ((time.time()-start)*1000)) # index: 15.0 ms
is 4-5 times faster than doing it without an index:
db.execute("DROP INDEX IF EXISTS t_index;")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("no index: %.1f ms" % ((time.time()-start)*1000)) # no index: 73.0 ms
but the database size is at least 30% bigger with an index.
Question: in general, I understand how indexes massively speed up queries, but in a case like this where t is an increasing integer, why is an index even needed to speed up the query?
After all, we only need to find the row for which t=1,000,000 (this is possible in O(log n) since the sequence is increasing), find the row for which t=2,000,000, and then we have the range.
TL;DR: when a column is an increasing integer sequence, is there a way to get fast range queries without growing the database size by ~30% with an index?
For example, by setting a parameter when creating the table that informs SQLite the column is increasing / already sorted?
The short answer is that sqlite doesn't know that your table is sorted by column t. This means it has to scan through the whole table to extract the data.
When you add an index on the column, the column t is sorted in the index, so the query can seek straight to the start of the range and then stream the rows of interest to you. You are extracting 20% of the rows, and it returns in 15 ms / 73 ms = 21% of the time. If that fraction is smaller, the benefit you derive from the index is larger.
If the column t is unique, then consider using that column as the primary key, as you would get the index for "free". If you can bound the number of rows with the same t, then you can use (t, offset) as the primary key, where offset might be a tinyint. The point being that size(primary key index) + size(t index) would be larger than size(t+offset index). If t were in ms or ns instead of s, it might be unique in practice, or you could fiddle with it when it was not (and just truncate to second resolution when you need the data).
If you don't need a primary key (as a unique index), leave it out and just have the non-unique index on t. Without a primary key you can identify a unique row by rowid, or if all the columns collectively define a unique row. If you create the table WITHOUT ROWID, you could still use LIMIT to operate on identical rows.
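A minimal sketch of the primary-key idea, reusing the in-memory db from the question; the (t, "offset") layout and the assumption that only a handful of rows ever share the same t are mine:
# WITHOUT ROWID stores the table in primary-key order, so no separate index is needed.
db.execute("""CREATE TABLE IF NOT EXISTS data2(
                  t INTEGER,
                  "offset" INTEGER,   -- small disambiguator for rows sharing the same t
                  label TEXT,
                  PRIMARY KEY (t, "offset")
              ) WITHOUT ROWID;""")
# A range query such as
#   SELECT COUNT(*) FROM data2 WHERE t BETWEEN 1000000 AND 2000000
# then walks the primary-key B-tree directly.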
You could use data-warehouse techniques: if you don't need per-record data, store it in a less granular fashion (one record per minute, or per hour, and group_concat the text column).
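A sketch of that coarser storage, keeping the schema from the question; treating t as seconds and bucketing by t / 60 is my assumption of what "per minute" would mean here:
db.execute("""CREATE TABLE data_per_minute AS
              SELECT t / 60 AS minute,
                     COUNT(*) AS n,
                     GROUP_CONCAT(label) AS labels
              FROM data
              GROUP BY t / 60;""")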
Finally, there are databases that are optimized for time-series data. They may, for instance, only allow you to remove the oldest data or append new data, but not make any changes. That allows such a system to store the data pre-sorted (MySQL, by the way, calls this feature an index-ordered table). Since the data cannot change, such a database may run-length- or delta-compress data by column so it only stores the differences between rows.
I have two pandas dataframes, bookmarks and ratings, whose columns are respectively:
id_profile, id_item, time_watched
id_profile, id_item, score
I would like to find the score for each (profile, item) couple in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows, so it takes a very long time (after 15 min the code is still running). I suppose there is a better way to do it.
Here is my code:
def find_rating(val):
    res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
    if res.empty:
        return 0
    return res['score'].values[0]

arr = bookmarks[['id_profile','id_asset']].values
rates = [find_rating(i) for i in arr]
I work on Colab.
Do you think I can improve the speed of execution?
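For comparison, a vectorized sketch of the same lookup: one left merge instead of 73 million row-wise filters. It assumes each (id_profile, id_asset) pair occurs at most once in ratings and uses the column names from the code above:
merged = bookmarks.merge(
    ratings[['id_profile', 'id_asset', 'score']],
    on=['id_profile', 'id_asset'],
    how='left',
)
# Pairs missing from ratings come back as NaN; set them to 0 as in find_rating.
rates = merged['score'].fillna(0).to_numpy()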
Just some thoughts! I have not tried data this large in Pandas.
In pandas, the data is indexed on rows as well as columns, so if you have 1 million rows with 5 columns, you effectively have 5 million indexed records.
For a performance boost:
Check if you can use Sparse* data structures - https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
Filter out the unnecessary data as much as feasible.
If you can, try to stick to plain numpy. You give up a few features, but you also lose some overhead. Worth exploring.
Use distributed multithreading/multiprocessing tools like Dask/Ray. If you have 4 cores -> 4 parallel jobs, so in the ideal case roughly 4x faster (about 25% of the runtime).
SELECT A, B, C, D, E, F,
       EXTRACT(MONTH FROM PARSE_DATE('%b', Month)) AS MonthNumber,
       PARSE_DATETIME(' %Y%b%d ', CONCAT(CAST(Year AS STRING), Month, '1')) AS G
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
ORDER BY A, Year
The query processes about 20 GB per run.
My table ZZZ has 396,567,431 (396 million) rows with a size of 53 GB. If I execute the above query without a LIMIT clause, I get an error saying "Resources exceeded".
If I execute it with a LIMIT clause, it gives the same error for larger limits.
I am writing a Python script that uses the API to run the query above, compute some metrics, and write the output to another table. It writes some 1.7 million output rows, so it basically aggregates the first table based on column A, i.e. the original table has multiple rows per value of column A.
Now I know we can set "Allow large results" to on and select an output table to get around this error, but for the purposes of my script it doesn't serve the purpose.
Also, I read that ORDER BY is the expensive part causing this, but below is my algorithm and I don't see a way around ORDER BY.
Also, my script pages the query results 100,000 rows at a time.
log = []
allresults = []
page_token = None
while True:
    rows, total_rows, page_token = query_job.results.fetch_data(max_results=100000, page_token=page_token)
    for row in rows:
        try:
            lastAValue = log[-1][0]
        except IndexError:
            lastAValue = None
        if lastAValue is None or row[0] == lastAValue:
            log.append(row)
        else:
            res = Compute(lastAValue, EntityType, lastAValue)
            allresults.append(res)
            log = []
            log.append(row)
    if not page_token:
        break
I have a few questions:
Column A | Column B ......
123 | NDG
123 | KOE
123 | TR
345 | POP
345 | KOP
345 | POL
The way I set up my logic is: I iterate through the rows and check if column A is the same as the previous row's column A. If it is, I add that row to an array. The moment I encounter a different column A value, i.e. 345, I send the first group of column A rows for processing, compute, and add the result to my array. Based on this approach I had some questions:
1) I am effectively querying only once, so I should be charged for only one query. Does BigQuery charge per totalRows / number of pages, i.e. will the individual pages fetched by the code above count as separate queries and be charged separately?
2) Assume the page size in the above example were 5: the 345 entries would be spread across pages. In that case, will I lose the 6th entry (345 - POL) because it falls on a different page? Is there a workaround for this?
3) Is there a direct way to avoid the whole "check whether successive rows differ in value" step, like a direct group-by that returns each group as an array? The above approach takes a couple of hours (estimated) to run if I add a limit of 1 million.
4) How can I get around the "Resources exceeded" error when specifying limits higher than 1 million?
You are asking BigQuery to produce one huge sorted result, which BigQuery currently cannot efficiently parallelize, so you get the "Resources exceeded" error.
The efficient way to perform this kind of query is to let your computation happen in SQL inside BigQuery, rather than extracting a huge result from it and doing post-processing in Python. Analytic functions are a common way to do what you described, if the Compute() function can be expressed in SQL.
E.g. for finding the value of B in the last row before A changes, you can find this row using the LAST_VALUE function, something like
select LAST_VALUE(B) OVER(PARTITION BY A ORDER BY Year) from ...
If you could describe what Compute() does, we could try to fill details.
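For question 3 specifically, if Compute() can consume one whole group of rows at a time, a GROUP BY with ARRAY_AGG run inside BigQuery sidesteps the page-boundary problem entirely. This is only a sketch: it uses the current google-cloud-bigquery client rather than the fetch_data API shown in the question, and the column/table names A, B, C, Year, `XXX.YYY.ZZZ` are the placeholders from the question.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT A, ARRAY_AGG(STRUCT(B, C, Year) ORDER BY Year) AS group_rows
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
GROUP BY A
"""

allresults = []
for row in client.query(sql):
    # Each result row carries every record for one value of A, so a group can
    # never be split across pages.
    res = Compute(row["A"], row["group_rows"])   # hypothetical: Compute adapted to take a whole group
    allresults.append(res)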
Imagine a large dataset (>40GB parquet file) containing value observations of thousands of variables as triples (variable, timestamp, value).
Now think of a query in which you are just interested in a subset of 500 variables, and you want to retrieve the observations (values --> time series) for those variables for specific points in time (an observation window or timeframe), each with a start and end time.
Without distributed computing (Spark), you could code it like this:
for var_ in variables_of_interest:
    for incident in incidents:
        var_df = df_all.filter(
            (df.Variable == var_)
            & (df.Time > incident.startTime)
            & (df.Time < incident.endTime))
My question is: how to do that with Spark/PySpark? I was thinking of either:
joining the incidents somehow with the variables and filtering the dataframe afterward.
broadcasting the incident dataframe and using it within a map-function when filtering the variable observations (df_all).
using RDD.cartesian or RDD.mapPartitions somehow (remark: the parquet file was saved partitioned by variable).
The expected output should be:
incident1 --> dataframe 1
incident2 --> dataframe 2
...
Where dataframe 1 contains all variables and their observed values within the timeframe of incident 1 and dataframe 2 those values within the timeframe of incident 2.
I hope you got the idea.
UPDATE
I tried to code a solution based on idea #1 and the code from the answer given by zero323. It works quite well, but I wonder how to aggregate/group it to the incident in the final step. I tried adding a sequential number to each incident, but then I got errors in the last step. It would be great if you could review and/or complete the code, so I uploaded sample data and the scripts. The environment is Spark 1.4 (PySpark):
Incidents: incidents.csv
Variable value observation data (77MB): parameters_sample.csv (put it to HDFS)
Jupyter Notebook: nested_for_loop_optimized.ipynb
Python Script: nested_for_loop_optimized.py
PDF export of Script: nested_for_loop_optimized.pdf
Generally speaking only the first approach looks sensible to me. The exact joining strategy depends on the number of records and their distribution, but you can either create a top-level data frame:
ref = sc.parallelize([
    (var_, incident)
    for var_ in variables_of_interest
    for incident in incidents
]).toDF(["var_", "incident"])
and simply join
same_var = col("Variable") == col("var_")
same_time = col("Time").between(
col("incident.startTime"),
col("incident.endTime")
)
ref.join(df.alias("df"), same_var & same_time)
or perform joins against particular partitions:
incidents_ = sc.parallelize([
    (incident, ) for incident in incidents
]).toDF(["incident"])

for var_ in variables_of_interest:
    df = spark.read.parquet("/some/path/Variable={0}".format(var_))
    df.join(incidents_, same_time)
optionally marking one side as small enough to be broadcasted.
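A sketch of that last point; the explicit broadcast hint appeared in Spark releases newer than the 1.4 mentioned above, so treat the API availability as an assumption:
from pyspark.sql.functions import broadcast, col

same_var = col("Variable") == col("var_")
same_time = col("Time").between(col("incident.startTime"), col("incident.endTime"))

# ref (variables x incidents) is small compared to df, so mark it for a map-side join.
joined = df.alias("df").join(broadcast(ref), same_var & same_time)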