I have a Dask dataframe of 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire dataframe into RAM.
As an experiment, I am trying to access the row at index 1.
%time dask_df.loc[1].compute()
It took a whopping 8.88 s (wall time).
Why is it taking it so long?
What can I do to make it faster?
Thanks in advance.
Per request, here is the code.
It is just reading 100 million rows of data and trying to access a row.
`dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)`
Dask DataFrame Structure:
               avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1
                                float32           int16           int32
len(dask_df)
100,000,000
%time dask_df.loc[1].compute()
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
Writing the file is actually quite fast, around 2 minutes.
I must be doing something wrong here.
Similarly to pandas, dask_df[1] would actually reference a column, not a row. So if you have a column named 1 then you're just loading a column from the whole frame. You can't access rows positionally - df.iloc only supports indexing along the second (column) axis. If your index has the value 1 in it, you could select this with df.loc, e.g.:
df.loc[1].compute()
See the dask.dataframe docs on indexing for more information and examples.
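A tiny toy example of that distinction (the frame and names here are made up, not taken from your data):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({0: [10.0, 20.0, 30.0], 1: [1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=1)

ddf[1].compute()            # selects the COLUMN labelled 1
ddf.loc[1].compute()        # selects the ROW whose index label is 1
ddf.iloc[:, [0]].compute()  # iloc in Dask indexes the column axis only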
When performing .loc on a dataframe without a known, sorted index, Dask will need to decompress the full file. Since each partition has its own index, .loc[N] will check every partition for that N (see this answer).
One way of resolving this is to pay the cost of generating a unique index once and saving the indexed parquet file. This way .loc[N] will only load information from the specific partition (or row group) that contains row N.
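Roughly, that workflow could look like the sketch below. The paths, the choice of index column and the calculate_divisions flag are assumptions on my side (the flag name differs between Dask versions), not something taken from your setup:

import dask.dataframe as dd

ddf = dd.read_parquet("staging_affinity_block1.gzip")

# Pay the shuffle cost once: set a unique, sortable index and write it back out.
ddf = ddf.set_index("internalUserID")
ddf.to_parquet("staging_affinity_indexed/")

# Re-read with known divisions so that .loc[N] only loads the partition
# (row group) whose index range contains N.
ddf = dd.read_parquet("staging_affinity_indexed/", calculate_divisions=True)
ddf.loc[1].compute()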
It looks like there is a performance issue with Dask when trying to access 10 million rows. It took 2.28 secs to access the first 10 rows.
With 100 million rows, it takes a whopping 30 secs.
I have a csv file with 7000 rows and 5 cols.
I have an array of 5000 words that I want to add to the same CSV file in a new column. I added a column 'originalWord' and used pd.Series, which added the 5000 words in a single column as I wanted.
allWords = ['x'] * 5000   # placeholder standing in for the 5000 words
df['originalWord'] = pd.Series(allWords)
My problem now is that I want to get the data in the column 'originalWord' - whether by putting it in an array or accessing the column directly - even though it has only 5000 values while the file has 7000 rows (the last 2000 being null values).
print(len(df['originalWord']))
7000
Any idea how to make it reflect the original length of 5000? Thank you.
If I understand you correctly, what you're asking for isn't possible. From what I can gather, you have a DataFrame that has 7000 rows and 5 columns, meaning that the index is of size 7000. To this DataFrame, you would like to add a column that has 5000 rows. Since there are in total 7000 rows in the DataFrame, the appended column will have 2000 missing values that would thus be assigned NaN. That's why you see the length as 7000.
In short, there is no way to access df['originalWord'] and automatically exclude all missing values, as even that Series has an index of size 7000. The closest you can get is to call dropna() on it (or wrap that in a small helper function if you find it bothersome to call repeatedly).
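A minimal illustration (the data here is just a stand-in for your 5000 words):

import pandas as pd

df = pd.DataFrame({"other": range(7000)})
df["originalWord"] = pd.Series(["x"] * 5000)   # aligned on index: rows 5000-6999 become NaN

print(len(df["originalWord"]))            # 7000 -- the Series keeps the full index
print(len(df["originalWord"].dropna()))   # 5000 -- missing values excluded
words = df["originalWord"].dropna().tolist()   # back to a plain 5000-element list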
I have two pandas dataframes bookmarks and ratings where columns are respectively :
id_profile, id_item, time_watched
id_profile, id_item, score
I would like to find the score for each (profile, item) pair in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows, and it takes a very long time (after 15 minutes the code was still running). I suppose there is a better way to do it.
Here is my code :
def find_rating(val):
    res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
    if res.empty:
        return 0
    return res['score'].values[0]

arr = bookmarks[['id_profile', 'id_asset']].values
rates = [find_rating(i) for i in arr]
I work on Colab.
Do you think I can improve the speed of the execution?
Just some thoughts! I have not tried such large data in pandas.
In pandas, the data is indexed on rows as well as columns, so if you have 1 million rows with 5 columns, you have indexed 5 million values.
For a performance boost:
Check if you can use Sparse* data structures - https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
Filter out the unnecessary data as much feasible.
If you can, try to stick to using only NumPy. It gives up a few features, but it also sheds some drag. Worth exploring.
Use distributed multithreaded/multiprocessing tools like Dask or Ray. If you have 4 cores, you can run 4 parallel jobs, which in the best case cuts the runtime to roughly a quarter (see the sketch after this list).
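A very rough sketch of that last point, using the names from your question. It only spreads the existing find_rating loop over worker processes, so the best case is a core-count speedup; each individual .loc lookup stays as slow as before, and shipping ratings to the workers has its own cost:

import numpy as np
import dask

arr = bookmarks[['id_profile', 'id_asset']].values

@dask.delayed
def rate_chunk(chunk):
    # run the original per-row lookup on one slice of the data
    return [find_rating(row) for row in chunk]

chunks = [rate_chunk(c) for c in np.array_split(arr, 4)]   # 4 = assumed core count
results = dask.compute(*chunks, scheduler="processes")
rates = [r for chunk in results for r in chunk]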
I have enormous arrays of time-series data (~3GB, millions of rows) that I load into a dataframe through numpy memmap. I'd like to summarize them by getting descriptive statistics for each group of n elements (say 1000 per group). I really like the combination of groupby and describe, but it seems like groupby is only useful for categorical data. If it weren't a memmap I could add another column for time-interval categories. Is there a way to get a GroupBy object that I can use describe on, where the groups are by row index?
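One way to express the "groups by row index" idea (just a sketch with made-up data, not from the original post) is to integer-divide the positional index so every block of 1000 consecutive rows gets the same label:

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.rand(1_000_000)})        # stand-in for the memmap-backed frame
summary = df.groupby(np.arange(len(df)) // 1000).describe()    # one row of stats per 1000-row block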
I am reading a CSV file with about 25 million rows and 4 columns - Lat, Long, Country and Level. After filtering out what I don't want, I am left with around 500k rows which I would like to visualise using Folium.
Folium requires the dataframe with the lat, long and level columns passed to it as individual rows in the following manner
data = ddf.apply(lambda row: makeList(row['Latitude'], row['Longitude'], row['Level']), axis=1, meta=object)
makeList is a function defined as follows -
def makeList(x, y, z):
    return [x, y, z]
The above function takes about 120 seconds to compute. I was wondering if there's a way to speed this up, perhaps by using 'ddf.values.tolist()' or any other approach that would compute quicker?
thanks!
The title of your post suggests that you want a list, so maybe a Dask bag would be an option.
But your post also says Folium requires the dataframe with ..., so more likely you just need to generate a DataFrame with the 3 mentioned columns.
To generate a DataFrame with a subset of columns, you can run:
data = ddf[['Latitude', 'Longitude', 'Level']]
Then, you could e.g. save it in a single CSV file:
data.to_csv('your_file.csv', single_file=True)
(500k rows is an acceptable number) and process it in another program as an "ordinary" (Pandas) DataFrame.
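If what Folium ultimately needs is a plain Python list of [lat, lon, level] triples, one option (a sketch; 500k rows fits comfortably in memory) is to bring the three-column subset into pandas first and convert it there:

heat_data = ddf[['Latitude', 'Longitude', 'Level']].compute().values.tolist()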
Let's say I have a dataframe with 1 million rows and 30 columns.
I want to add a column to the dataframe whose value is "the most frequent value of the previous 30 columns". I also want to add the "second most frequent value of the previous 30 columns".
I know that you can do df.mode(axis=1) for "the most frequent value of the previous 30 columns", but it is so slow.
Is there anyway to vectorize this so it could be fast?
df.mode(axis=1) is already vectorized. However, you may want to consider how it works. It needs to operate on each row independently, which means you would benefit from "row-major order" which is called C order in NumPy. A Pandas DataFrame is always column-major order, which means that getting 30 values to compute the mode for one row requires touching 30 pages of memory, which is not efficient.
So, try loading your data into a plain NumPy 2D array and see if that helps speed things up. It should.
I tried this on my 1.5 GHz laptop:
import numpy as np
import pandas as pd
import scipy.stats

x = np.random.randint(0, 5, (10000, 30))   # 10k rows of 30 small integers
df = pd.DataFrame(x)

%timeit df.mode(axis=1)
%timeit scipy.stats.mode(x, axis=1)
The DataFrame way takes 6 seconds (!), whereas the SciPy (row-major) way takes 16 milliseconds for 10k rows. Even SciPy in column-major order is not much slower, which makes me think the Pandas version is less efficient than it could be.