I have two pandas dataframes, bookmarks and ratings, whose columns are respectively:
id_profile, id_item, time_watched
id_profile, id_item, score
I would like to find the score for each (profile, item) pair in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows, so it takes a very long time (after 15 minutes the code is still running). I suppose there is a better way to do it.
Here is my code:
def find_rating(val):
    res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
    if res.empty:
        return 0
    return res['score'].values[0]

arr = bookmarks[['id_profile', 'id_asset']].values
rates = [find_rating(i) for i in arr]
I work on Colab.
Do you think I can improve the speed of the execution?
Just some thoughts! I have not tried such large data in Pandas.
In pandas, the data is indexed on rows as well as columns, so if you have 1 million rows with 5 columns, you effectively have 5 million indexed records.
For a performance boost:
Check if you can use Sparse* data structures - https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
Filter out unnecessary data as much as is feasible.
If you can, try sticking to plain NumPy. You give up a few features, but you also shed some overhead. Worth exploring.
Use distributed multithreaded/multiprocessing tools like Dask/Ray. If you have 4 cores, you can run roughly 4 parallel jobs, cutting the runtime to about a quarter.
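As a concrete first step before any of the above, here is a minimal sketch of a vectorized alternative (my own suggestion, not one of the points listed): replace the per-row .loc lookup with a single left merge and fill missing scores with 0. Column names follow the code in the question (id_asset rather than id_item):

import pandas as pd

# Left-join bookmarks against ratings on the (profile, asset) pair, so every
# bookmark row picks up its score in one vectorized pass.
merged = bookmarks.merge(
    ratings[['id_profile', 'id_asset', 'score']],
    on=['id_profile', 'id_asset'],
    how='left',
)

# Pairs that have no rating get a score of 0, as required.
rates = merged['score'].fillna(0).to_numpy()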
Related
I have a Dask dataframe of 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire dataframe into RAM.
For an experiment, I am trying to access the row with index equal to 1.
%time dask_df.loc[1].compute()
The time it took is a whopping 8.88 s (wall time).
Why is it taking it so long?
What can I do to make it faster?
Thanks in advance.
Per request, here is the code.
It is just reading 100 million rows of data and trying to access a row.
`dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)`
Dask DataFrame Structure:
              avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1
                               float32           int16           int32

len(dask_df)
100,000,000

%time dask_df.loc[1].compute()
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
Writing time is actually very good, around 2 minutes.
I must be doing something wrong here.
Similarly to pandas, dask_df[1] would actually reference a column, not a row. So if you have a column named 1 then you're just loading a column from the whole frame. You can't access rows positionally - df.iloc only supports indexing along the second (column) axis. If your index has the value 1 in it, you could select this with df.loc, e.g.:
df.loc[1].compute()
See the dask.dataframe docs on indexing for more information and examples.
When performing .loc on an unindexed dataframe, Dask will need to decompress the full file. Since each partition will have its own index, .loc[N] will check every partition for that N, see this answer.
One way of resolving this is to pay the cost of generating a unique index once and saving the indexed parquet file. This way .loc[N] will only load information from the specific partition (or row group) that contains row N.
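Here is a rough sketch of that approach, assuming the file from the question and using internalUserID purely as an illustrative index column (any unique, sortable column, or a precomputed row number, would do); the output path is hypothetical:

import dask.dataframe as dd

# Read the unindexed parquet data once.
ddf = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip")

# Pay the sorting/indexing cost a single time...
ddf = ddf.set_index("internalUserID")

# ...and persist the indexed result (output path is illustrative).
ddf.to_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_indexed")

# Later lookups only need the partition whose index range contains the key.
# Depending on the Dask version, read_parquet may need to be told to recover
# the divisions (e.g. calculate_divisions=True) for this pruning to kick in.
indexed = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_indexed")
row = indexed.loc[1].compute()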
It looks like there are performance issues with Dask when trying to access 10 million rows. It took 2.28 secs to access the first 10 rows.
With 100 million rows, it takes a whopping 30 secs.
I am currently working on a project that uses Pandas, with a large dataset (~42K rows x 1K columns).
My dataset has many missing values, which I want to interpolate to obtain a better result when training an ML model on this data. My method of interpolating the data is to take the average of the previous and the next value and use that as the value for any NaN. Example:
TRANSACTION    PAYED  MONDAY  TUESDAY  WEDNESDAY
D8Q3ML42DS0    1      123.2   NaN      43.12
So in the above example the NaN would be replaced with the average of 123.2 and 43.12, which is 83.16. If a value can't be interpolated then a 0 is put in its place. I was able to implement this in a number of ways, but I always run into the issue of it taking a very long time to process all of the rows in the dataset, despite running it on an Intel Core i9. The following are approaches I've tried and have found to take too long:
Interpolating the data and then only replacing the elements that need to be replaced instead of replacing the entire row.
Replacing the entire row with a new pd.Series that has the old and the interpolated values. It seems like my code is able to execute reasonably well on a Numpy Array but the slowness comes from the assignment.
I'm not quite sure why the performance of my code comes nowhere close to df.interpolate() despite it being the same idea. Here is some of my code responsible for the interpolation:
import math
import statistics

import numpy as np

def interpolate(array: np.ndarray):
    arr_len = len(array)
    for i in range(arr_len):  # iterate over positions, not the array itself
        if math.isnan(array[i]):
            # Edges and NaN neighbours cannot be interpolated, so use 0.
            if i == 0 or i == arr_len - 1 or math.isnan(array[i - 1]) or math.isnan(array[i + 1]):
                array[i] = 0
            else:
                # Assign the mean back to the cell.
                array[i] = statistics.mean([array[i - 1], array[i + 1]])
    return array

for transaction_id in df.index:
    df.loc[transaction_id, 2:] = interpolate(df.loc[transaction_id, 2:])
My understanding is that Pandas has some parallelized techniques and functions that it can use to perform this. How can I speed this process up, even a little?
df.interpolate(method='linear', limit_direction='forward', axis=0)
Try doing this; it might help.
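If df.interpolate does not match the exact rule described in the question, a vectorized version of that rule is also possible. This is a hedged sketch (fill_row_means is a hypothetical helper): each NaN is filled with the mean of its left and right neighbours in the same row, and with 0 when either neighbour is missing or out of bounds. Note that it reads neighbours from the original row, which can differ slightly from the in-place loop above when two NaNs are adjacent:

import numpy as np

def fill_row_means(values: np.ndarray) -> np.ndarray:
    # Shift the block left/right so each cell can see its neighbours.
    left = np.full_like(values, np.nan)
    right = np.full_like(values, np.nan)
    left[:, 1:] = values[:, :-1]     # previous column, NaN at the left edge
    right[:, :-1] = values[:, 1:]    # next column, NaN at the right edge

    candidate = (left + right) / 2.0  # NaN whenever either neighbour is NaN
    filled = np.where(np.isnan(values), candidate, values)
    return np.nan_to_num(filled, nan=0.0)  # remaining NaNs become 0

# Apply only to the numeric part of the frame (columns 2 onwards, as above).
numeric = df.iloc[:, 2:].to_numpy(dtype=float)
df.iloc[:, 2:] = fill_row_means(numeric)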
I have 22 million rows of house property sale data in a database table called sale_transactions. I am performing a job where I read information from this table, perform some calculations, and use the results to create entries to a new table. The process looks like this:
for index, row in zipcodes.iterrows():  # ~100k zipcodes
    sql_string = """SELECT * from sale_transactions WHERE zipcode = '{ZIPCODE}' """
    sql_query = sql_string.format(ZIPCODE=row['zipcode'])
    df = pd.read_sql(sql_query, _engine)
    area_stat = create_area_stats(df)  # function does calculations
    area_stat.save()  # saves a Django model
At the moment each iteration of this loop takes about 20 seconds on my macbook pro (16GB RAM), which means that the code is going to take weeks to finish. The expensive part is the read_sql line.
How can I optimize this? I can't read the whole sale_transactions table into memory, it is about 5 GB, hence using the sql query each time to capture the relevant rows with the WHERE clause.
Most answers about optimizing pandas talk about reading with chunking, but in this case I need to perform the WHERE on all the data combined, since I am performing calculations in the create_area_stats function like number of sales over a ten year period. I don't have easy access to a machine with loads of RAM, unless I start going to town with EC2, which I worry will be expensive and quite a lot of hassle.
Suggestions would be greatly appreciated.
I also faced a similar problem, and the code below helped me read a database table (~40 million rows) effectively.

offsetID = 0
totalrow = 0

while True:
    # Fetch the next 100,000 rows, ordered by the indexed row_number column.
    df_Batch = pd.read_sql_query(
        'set work_mem="1024MB"; SELECT * FROM ' + tableName +
        ' WHERE row_number > ' + str(offsetID) + ' ORDER BY row_number LIMIT 100000',
        con=engine)
    if df_Batch.empty:  # no more rows left, stop
        break
    offsetID = offsetID + len(df_Batch)
    # your operation
    totalrow = totalrow + len(df_Batch)

You have to create an index called row_number in your table. This code then reads the table in batches of 100,000 rows using that index. For example, when you want to read rows 200,000 to 210,000, you don't need to scan from 0 to 210,000; the database reads directly via the index, which improves performance.
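As an alternative to managing the offsets by hand, pandas itself can return the result in chunks via the chunksize argument (whether this avoids the driver materializing the full result set depends on the database driver). A hedged sketch, reusing the engine and tableName names from the code above:

import pandas as pd

totalrow = 0
# chunksize makes read_sql_query return an iterator of 100,000-row frames
# instead of one huge DataFrame.
for df_Batch in pd.read_sql_query(
        'SELECT * FROM ' + tableName + ' ORDER BY row_number',
        con=engine,
        chunksize=100000):
    # your operation
    totalrow = totalrow + len(df_Batch)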
Since the bottleneck in the operation was the SQL WHERE query, the solution was to index the column upon which the WHERE statement was operating (i.e. the zipcode column).
In MySQL, the command for this was:
ALTER TABLE `db_name`.`table`
ADD INDEX `zipcode_index` USING BTREE (`zipcode` ASC);
After making this change, the loop execution speed increased 8-fold.
I found this article useful because it encouraged profiling queries with EXPLAIN and looking for column-indexing opportunities when the key and possible_keys values were NULL.
I have a huge dataframe (4 million rows and 25 columns). I am trying to investigate 2 categorical columns. One of them has around 5000 levels (app_id) and the other has 50 levels (app_category).
I have seen that for each level in app_id there is a unique value of app_category. How do I write code to prove that?
I have tried something like this:
app_id_unique = list(train['app_id'].unique())
for unique in app_id_unique:
    train.loc[train['app_id'] == unique].app_category.nunique()
This code takes forever.
I think you need groupby with nunique:
train.groupby('app_id').app_category.nunique()
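To turn that into an explicit check of the claim (every app_id maps to exactly one app_category), a small follow-up along the same lines, assuming the column names from the question:

# Number of distinct categories per app_id.
per_app = train.groupby('app_id').app_category.nunique()

# True if and only if every app_id has exactly one category.
print((per_app == 1).all())

# Show any offending app_ids, if the check fails.
print(per_app[per_app > 1])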
Let's say I have a dataframe with 1 million rows and 30 columns.
I want to add a column to the dataframe whose value is "the most frequent value of the previous 30 columns". I also want to add the "second most frequent value of the previous 30 columns".
I know that you can do df.mode(axis=1) for "the most frequent value of the previous 30 columns", but it is so slow.
Is there any way to vectorize this so it could be fast?
df.mode(axis=1) is already vectorized. However, you may want to consider how it works. It needs to operate on each row independently, which means you would benefit from "row-major order" which is called C order in NumPy. A Pandas DataFrame is always column-major order, which means that getting 30 values to compute the mode for one row requires touching 30 pages of memory, which is not efficient.
So, try loading your data into a plain NumPy 2D array and see if that helps speed things up. It should.
I tried this on my 1.5 GHz laptop:
import numpy as np
import pandas as pd
import scipy.stats

x = np.random.randint(0, 5, (10000, 30))
df = pd.DataFrame(x)
%timeit df.mode(axis=1)
%timeit scipy.stats.mode(x, axis=1)
The DataFrame way takes 6 seconds (!), whereas the SciPy (row-major) way takes 16 milliseconds for 10k rows. Even SciPy in column-major order is not much slower, which makes me think the Pandas version is less efficient than it could be.
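The timing above covers the most frequent value; for the "second most frequent value" part of the question, here is a hedged NumPy sketch (my own addition, not from the answer), assuming the 30 columns hold small non-negative integers so per-row counting applies; top_two_modes is a hypothetical helper name:

import numpy as np
import pandas as pd

def top_two_modes(values: np.ndarray) -> np.ndarray:
    n_rows, _ = values.shape
    n_levels = values.max() + 1
    counts = np.zeros((n_rows, n_levels), dtype=np.int64)

    # Count how often each level appears in each row.
    np.add.at(counts, (np.arange(n_rows)[:, None], values), 1)

    # Sort levels by descending count; the first two columns are the most
    # frequent and second most frequent level per row. Ties (and degenerate
    # rows where the runner-up count is 0) fall back to the smallest level.
    order = np.argsort(-counts, axis=1)
    return order[:, :2]

x = np.random.randint(0, 5, (10000, 30))
df = pd.DataFrame(x)
modes = top_two_modes(x)
df['mode_1'] = modes[:, 0]
df['mode_2'] = modes[:, 1]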