I have enormous arrays of time-series data (~3GB, millions of rows) that I load into a dataframe through numpy memmap. I'd like to summarize them by getting descriptive statistics for each group of n elements (say 1000 per group). I really like the combination of groupby and describe, but it seems like groupby is only useful for categorical data. If it weren't a memmap I could add another column for time-interval categories. Is there a way to get a GroupBy object that I can use describe on, where the groups are by row index?
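In case it is useful, here is a minimal sketch of one way to do this (df and the group size of 1000 are placeholders, assuming a default integer index): pass an array of group labels straight to groupby, computed from the row position, so nothing has to be added to the memmap-backed frame.

import numpy as np

group_size = 1000
# Label rows 0-999 as group 0, rows 1000-1999 as group 1, and so on
groups = df.groupby(np.arange(len(df)) // group_size)
summary = groups.describe()  # descriptive statistics per block of 1000 rows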
Related
I have a Dask dataframe of 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire dataframe
to RAM.
For an experiment, I am trying to access the row with index equal to 1.
%time dask_df.loc[1].compute()
The time it took is a whopping 8.88 s (wall time).
Why is it taking it so long?
What can I do to make it faster?
Thanks in advance.
Per request, here is the code.
It is just reading 100 million rows of data and trying to access a row.
`dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)`
Dask DataFrame Structure:
               avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1
                                float32           int16           int32
len(dask_df)
100,000,000
%time dask_df.loc[1].compute()
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
Writing time is actually very good, at around 2 minutes.
I must be doing something wrong here.
Similarly to pandas, dask_df[1] would actually reference a column, not a row. So if you have a column named 1 then you're just loading a column from the whole frame. You can't access rows positionally - df.iloc only supports indexing along the second (column) axis. If your index has the value 1 in it, you could select this with df.loc, e.g.:
df.loc[1].compute()
See the dask.dataframe docs on indexing for more information and examples.
When performing .loc on an unindexed dataframe, Dask will need to decompress the full file. Since each partition will have its own index, .loc[N] will check every partition for that N, see this answer.
One way of resolving this is to pay the cost of generating a unique index once and saving the indexed parquet file. This way .loc[N] will only load information from the specific partition (or row group) that contains row N.
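A rough sketch of that approach, assuming you have (or can construct) a unique, monotonically increasing column to index on; the column name "row_id" and the output path are hypothetical, not from the original post:

import dask.dataframe as dd

ddf = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip")
ddf = ddf.set_index("row_id")  # "row_id" is a hypothetical unique column
ddf.to_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1_indexed")

# Later lookups against the indexed file should only touch the partition
# (row group) whose division range contains the requested label.
indexed = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1_indexed")
indexed.loc[1].compute()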
It looks like there are performance issues with Dask when trying
to access 10 million rows. It took 2.28 secs to access the first 10 rows.
With 100 million rows, it takes a whopping 30 secs.
I have a large DataFrame with 2 million observations. For my further analysis, I intend to use a relatively smaller sample (around 15-20% of the original DataFrame) drawn from the original DataFrame. While sampling, I also intend to keep the proportion of categorical values present in one of the columns intact.
For example: if one column has 5 categories as its values: red (20% of total observations), green (10%), blue (15%), white (25%), yellow (30%); I would like the column in the sample dataset to also show the same proportion of different categories.
Please assist!
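A minimal sketch of one common approach in pandas, assuming the categorical column is called "color" and you want a 15% sample (both the name and the numbers are placeholders): sample each category separately so its share carries over to the sample.

sample_df = (
    df.groupby("color", group_keys=False)
      .apply(lambda g: g.sample(frac=0.15, random_state=42))
)

Because every group is sampled at the same fraction, the category proportions in sample_df match the original DataFrame up to rounding.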
I am reading a CSV file with about 25 million rows and 4 columns - Lat, Long, Country and Level. After filtering out what I don't want, I am left with around 500k rows which I would like to visualise using Folium.
Folium requires the dataframe with the lat, long and level columns passed to it as individual rows in the following manner
data = ddf.apply(lambda row: makeList(row['Latitude'], row['Longitude'], row['Level']), axis=1, meta=object)
makeList is a function defined as follows -
def makeList(x, y, z):
    return [x, y, z]
The above function takes about 120 seconds to compute. I was wondering if there's a way to speed this up, perhaps by using 'ddf.values.tolist()' or any other way that would compute quicker?
thanks!
The title of your post suggests that you want a list, so maybe a Dask bag would be an option.
But your post also says Folium requires the dataframe with ..., so more
likely you just need to generate a DataFrame with the 3 mentioned columns.
To generate a DataFrame with a subset of columns, you can run:
data = ddf[['Latitude', 'Longitude', 'Level']]
Then, you could e.g. save it in a single CSV file:
data.to_csv('your_file.csv', single_file=True)
(500k rows is an acceptable number) and process it in another program as an "ordinary" (Pandas) DataFrame.
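On the 'ddf.values.tolist()' idea from the question: since the filtered result is only around 500k rows, one option is to materialise just those three columns once and build the list with NumPy instead of row by row with apply. A rough sketch, using the column names from the question:

data = (
    ddf[['Latitude', 'Longitude', 'Level']]
    .compute()   # bring the ~500k filtered rows into memory as a pandas DataFrame
    .values      # NumPy array of shape (n_rows, 3)
    .tolist()    # list of [Latitude, Longitude, Level] lists for Folium
)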
I'm working on product recommendations.
My dataset is as follows (a sample; the full one has more than 110,000 rows and more than 80,000 unique product_ids):
   user_id                  product_id
0  0E3D17EA-BEEF-493        12909837
1  0FD6955D-484C-4FC8-8C3F  12732936
2  CC2877D0-A15C-4C0A       Gklb38
3  b5ad805c-f295-4852       12909841
4  0E3D17EA-BEEF-493        12645715
I want to calculate the cosine similarity between products based on purchased products per user.
Why? I need to have as a final result:
the list of the 5 most similar products for each product_id.
So, I thought the first thing that I need to do is to convert the dataframe into a format where I have one row per user_id and the columns are product_ids. If a user bought product_id X, then the corresponding (row, column) cell will contain the value 1, otherwise 0.
I did that using crosstab function of pandas dataframe.
crosstab_df = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
After that, I calculated the similarities between products.
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new DataFrame matrix with similarities.
    """
    # Create a scipy CSR sparse matrix from the dense crosstab
    data_sparse = sparse.csr_matrix(data_items)
    # Pairwise similarities between all columns (products) of data_items
    similarities = cosine_similarity(data_sparse.transpose())
    # Put the similarities between products in a labelled DataFrame
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim
similarity_matrix = calculate_similarity(crosstab_df)
I know that this is not efficient, because crosstab doesn't perform well when there are many rows and many columns, which is the case I have to handle. So I thought that instead of using a crosstab DataFrame, I should use a scipy sparse matrix, as it makes the calculations (similarity calculations, vector normalisation) faster because the input will be a numpy array, not a dataframe.
However, I didn't know how to do it. I also need to keep track of which product_id each column corresponds to, so that I can then get the most similar product_ids for each product_id.
I found in answers to other questions that:
scipy.sparse.csr_matrix(df.values)
can be used, but in my case I think I can only use it after applying crosstab, while I want to get rid of the crosstab step altogether.
Also, people suggested using scipy coo_matrix, but I didn't understand how I can apply it in my case for the results I want.
I'm looking for a memory-efficient solution, as the initial dataset can grow to thousands of rows and hundreds of thousands of product_ids.
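For what it's worth, here is a rough sketch of the coo_matrix route, assuming df is the two-column user_id/product_id frame from the question (the example product_id and everything else below is illustrative): encode both columns as pandas categoricals and use their integer codes as the sparse coordinates, so the dense crosstab is never built and the column-to-product_id mapping is kept in the categorical's categories.

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

users = df['user_id'].astype('category')
products = df['product_id'].astype('category')

# One entry per (user, product) purchase; duplicates are summed on conversion
purchases = sparse.coo_matrix(
    (np.ones(len(df), dtype=np.float32),
     (users.cat.codes, products.cat.codes)),
    shape=(users.cat.categories.size, products.cat.categories.size),
).tocsr()
purchases.data[:] = 1.0  # clamp repeat purchases to 0/1

# Column-wise (product-to-product) cosine similarity, kept sparse
sim = cosine_similarity(purchases.T, dense_output=False)

# products.cat.categories maps column positions back to product_ids
product_index = products.cat.categories
col = product_index.get_loc('12909837')  # example product_id from the sample
scores = sim[col].toarray().ravel()
top5 = product_index[np.argsort(-scores)[1:6]]  # position 0 is the product itself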
I have a pandas.DataFrame indexed by time, as seen below. The other column contains data recorded from a device measuring current. I want to filter the second column with a low-pass filter with a cutoff frequency of 5 Hz to eliminate high-frequency noise. I want to return a dataframe, but I do not mind if it changes type for the application of the filter (numpy array, etc.).
In [18]: print df.head()
Time
1.48104E+12 1.1185
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
I am graphing this data by df.plot(legend=True, use_index=False, color='red') but would like to graph the filtered data instead.
I am using pandas 0.18.1 but I can change.
I have visited https://oceanpython.org/2013/03/11/signal-filtering-butterworth-filter/ and many other sources of similar approaches.
Perhaps I am over-simplifying this, but you can create a simple condition, create a new dataframe with that filter applied, and then create your graph from the new dataframe. Basically you are just reducing the dataframe to only the records that meet the condition. I admit I do not know what the exact number is for high frequency, but let's assume your second column name is "Frequency":
condition = df["Frequency"] < 1.0
low_pass_df = df[condition]
low_pass_df.plot(legend=True, use_index=False, color='red')
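If the goal is an actual frequency-domain low-pass filter at 5 Hz (as in the oceanpython post linked in the question) rather than a value threshold, a Butterworth filter from scipy.signal is the usual tool. A minimal sketch, assuming the current readings are in the first data column and the signal is sampled at a known, uniform rate fs (the 100 Hz below is a placeholder):

import pandas as pd
from scipy.signal import butter, filtfilt

fs = 100.0    # assumed sampling rate in Hz; use the device's real rate
cutoff = 5.0  # desired cutoff frequency in Hz
order = 4

# Normalised cutoff relative to the Nyquist frequency (fs / 2)
b, a = butter(order, cutoff / (fs / 2.0), btype='low')

# filtfilt runs the filter forward and backward, so there is no phase shift
filtered = filtfilt(b, a, df.iloc[:, 0].values)

filtered_df = pd.DataFrame({'filtered': filtered}, index=df.index)
filtered_df.plot(legend=True, use_index=False, color='red')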