I am using the tsfresh library to extract features from my time-series dataset.
My problem is that I can only use the setting for the minimal feature set (MinimalFCParameters), because even the efficient one (EfficientFCParameters) always gets stuck at 0% and never calculates any features.
The data is pretty large (over 40 million rows, 100k windows), but this is the smallest dataset I am going to use. I am using a compute cluster, so computing resources should not be the issue. I also tried the n_jobs parameter of the extract_features method (n_jobs=32). Finally, as suggested by the tsfresh website, I used Dask instead of pandas for the input data frame, but without success either.
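For reference, this is roughly what the pandas-input version of the call looks like; the file name and the id/sort column names here are placeholders, not my actual data:
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

# placeholder: long-format data with one row per (window id, timestamp, value)
df = pd.read_parquet("windows.parquet")

features = extract_features(
    df,
    column_id="window_id",   # placeholder column names
    column_sort="time",
    default_fc_parameters=EfficientFCParameters(),
    n_jobs=32,
)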
My questions are: Is there anything else I can try? Or are there any other libraries I could use?
Related
Since I couldn't find the best way to deal with my issue, I came here to ask.
I'm a beginner with Python, but I have to handle a large dataset.
However, I don't know the best way to handle the "Memory Error" problem.
I already have a 64-bit Python 3.7.3 installation.
I saw that I could use TensorFlow, specify chunks in the pandas read call, or use the Dask library, but I don't know which one best fits my problem, and as a beginner it's not very clear to me.
I have a huge dataset (over 100M observations); I don't think reducing the dataset would decrease memory usage by much.
What I want to do is test multiple ML algorithms with train and test samples. I don't know how to deal with the problem.
Thanks!
This question is high level, so I'll provide some broad approaches for reducing memory usage in Dask:
Use a columnar file format like Parquet so you can leverage column pruning
Use column dtypes that require less memory, e.g. int8 instead of int64
Strategically persist in memory, where appropriate (a rough code sketch of these points follows this list)
Use a cluster that's sized well for your data (analyzing 2GB of data requires a different amount of memory than analyzing 2TB)
Split data into multiple files so it's easier to process in parallel
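For illustration, here is a minimal Dask sketch of the first three points; the file name, column names, and dtypes are made up for the example:
import dask.dataframe as dd

# column pruning: read only the columns you actually need from the Parquet file
ddf = dd.read_parquet("data.parquet", columns=["user_id", "amount", "category"])

# downcast dtypes that don't need 64 bits
ddf = ddf.astype({"amount": "float32", "category": "int8"})

# persist only if the pruned, downcast data comfortably fits in cluster memory
ddf = ddf.persist()

# check the footprint
print(ddf.memory_usage(deep=True).sum().compute())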
Your data has 100 million rows, which isn't that big (unless it has thousands of columns). Big data typically has billions or trillions of rows.
Feel free to add questions that are more specific and I can provide more specific advice. You can provide the specs of your machine / cluster, the memory requirements of the DataFrame (via ddf.memory_usage(deep=True)) and the actual code you're trying to run.
I currently have simple data processing that involves groupby, merge, and parallel column-to-column operations. The not-so-simple part is the massive number of rows involved (it's detailed cost/financial data), 300-400 GB in size.
Due to limited RAM, I am currently using out-of-core computing with Dask. However, it's really slow.
I've previously read about using cuDF to improve performance of map_partitions and groupby; however, most examples use a mid- to high-end GPU (at least a 1050 Ti, and most run on GV-based cloud VMs) where the data fits in GPU RAM.
My machine specs are an E5-2620v3 (6C/12T), 128 GB of RAM, and a K620 (only 2 GB of dedicated VRAM).
Intermediate dataframes are stored in Parquet.
Will using a low-end GPU with cuDF make this faster? And is it possible to do out-of-core computing on the GPU? (I'm looking all around for examples but have yet to find any.)
Below is simplified pseudo-code of what I'm trying to do.
a.csv is ~300 GB in size, consisting of four columns (Hier1, Hier2, Hier3, value). Hier1-3 are hierarchy levels, as strings; value is the sales value.
b.csv is ~50 GB in size, consisting of four columns (Hier1, Hier2, valuetype, cost). Hier1-2 are hierarchy levels, as strings; valuetype is the cost type, as a string; cost is the cost value.
Basically, I need to prorate top-down, based on the sales value from a.csv, each cost in b.csv. The end goal is to have each cost available at the Hier3 level (the more detailed level).
The first step is to create the prorated ratio:
import dask.dataframe as dd
# read raw data, add a PartGroup key, set it as the index, and convert to Parquet for both files
raw_reff = dd.read_csv('data/a.csv')
raw_reff = raw_reff.map_partitions(lambda df: df.assign(PartGroup=df['Hier1']+df['Hier2']))
raw_reff = raw_reff.set_index('PartGroup')
raw_reff.to_parquet("data/raw_a.parquet")
cost_reff = dd.read_csv('data/b.csv')
cost_reff = cost_reff.map_partitions(lambda df: df.assign(PartGroup=df['Hier1']+df['Hier2']))
cost_reff = cost_reff.set_index('PartGroup')
cost_reff.to_parquet("data/raw_b.parquet")
# create reference ratio
ratio_reff = dd.read_parquet("data/raw_a.parquet").reset_index()
# to reduce RAM usage, group within each partition instead of using a full Dask groupby;
# this is fine because the data was already partitioned by PartGroup above
ratio_reff = ratio_reff.map_partitions(lambda df: df.groupby(['PartGroup'])['value'].sum().reset_index())
ratio_reff = ratio_reff.set_index('PartGroup')
ratio_reff = ratio_reff.map_partitions(lambda df: df.rename(columns={'value':'value_on_group'}))
ratio_reff.to_parquet("data/reff_a.parquet")
Then do the merge to get the ratio:
raw_data = dd.read_parquet("data/raw_a.parquet").reset_index()
reff_data = dd.read_parquet("data/reff_a.parquet").reset_index()
ratio_data = raw_data.merge(reff_data, on=['PartGroup'], how='left')
ratio_data['RATIO'] = ratio_data['value'].fillna(0)/ratio_data['value_on_group'].fillna(0)
ratio_data = ratio_data[['PartGroup','Hier3','RATIO']]
ratio_data = ratio_data.set_index('PartGroup')
ratio_data.to_parquet("data/ratio_a.parquet")
Then merge the cost data with the ratio on PartGroup and multiply to get the prorated value:
reff_stg = dd.read_parquet("data/ratio_a.parquet").reset_index()
cost_stg = dd.read_parquet("data/raw_b.parquet").reset_index()
final_stg = reff_stg.merge(cost_stg, on=['PartGroup'], how='left')
final_stg['allocated_cost'] = final_stg['RATIO']*final_stg['cost']
final_stg = final_stg.set_index('PartGroup')
final_stg.to_parquet("data/result_pass1.parquet")
In the real case there will be residual values caused by missing reference data etc., and this will be done in several passes using several references, but the above covers the basic steps.
Even with strictly Parquet-to-Parquet operations, it still takes ~80 GB of my 128 GB of RAM, all of my cores run at 100%, and the job takes 3-4 days. I'm looking for ways to do this faster with the current hardware. As you can see, it's a massively parallel problem that fits the definition of GPU-based processing.
Thanks
@Ditto, unfortunately, this cannot be done with your current hardware. Your K620 has a Kepler architecture GPU and is below the minimum requirements for RAPIDS. You will need a Pascal card or better to run RAPIDS. The good news is that if purchasing a RAPIDS-compatible video card is not a viable option, there are many inexpensive cloud provisioning options. Honestly, for what you're asking to do, I'd want a little extra GPU processing speed and would recommend a multi-GPU setup.
As for the dataset being larger than GPU RAM, you can use dask_cudf to allow your dataset to be processed. There are several examples in our docs and notebooks. Please be advised that the resulting dataset after dask.compute() needs to fit in GPU RAM.
https://rapidsai.github.io/projects/cudf/en/0.12.0/10min.html#10-Minutes-to-cuDF-and-Dask-cuDF
https://rapidsai.github.io/projects/cudf/en/0.12.0/dask-cudf.html#multi-gpu-with-dask-cudf
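For illustration, a rough dask_cudf sketch of the groupby step from your pipeline might look like the following (assuming a RAPIDS-compatible GPU; this is a sketch, not something tested on your workload):
import dask_cudf

# read the partitioned Parquet data onto the GPU, still out of core via Dask
raw_reff = dask_cudf.read_parquet("data/raw_a.parquet").reset_index()

# the same per-group aggregation as the CPU version, executed on the GPU
ratio_reff = raw_reff.groupby("PartGroup")["value"].sum().reset_index()
ratio_reff = ratio_reff.rename(columns={"value": "value_on_group"})

ratio_reff.to_parquet("data/reff_a_gpu.parquet")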
Once you get a working, RAPIDS-compatible, multi-GPU setup and use dask_cudf, you should see a very worthwhile speed-up, especially for data exploration at that size.
Hope this helps!
I'm working on a dataset of ~100k lines in PySpark, and I want to convert it to pandas. The data on web clicks contains string variables and is read from a .snappy.orc file in an Amazon S3 bucket via spark.read.orc(...).
The conversion is running too slowly for my application (for reasons explained very well here on Stack Overflow), so I've tried to downsample my Spark DataFrame to one tenth; the dataset is so large that the statistical analysis I need to do is probably still valid. However, I need to repeat the analysis for 5000 similar datasets, which is why speed is a concern.
What surprised me is that the running time of df.sample(False, 0.1).toPandas() is exactly the same as that of df.toPandas() (approx. 180 s), so I don't get the reduction in running time I was hoping for.
I'm suspecting it may be a question of putting in a .cache() or .collect(), but I can't figure out a proper way to fit it in.
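For what it's worth, the caching idea from the last paragraph would look roughly like this (the path is a placeholder, and this is only a sketch of the idea, not a verified fix for the timing issue):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.orc("s3://my-bucket/clicks.snappy.orc")  # placeholder path

# materialize the 10% sample once, so toPandas() only touches the cached subset
sampled = df.sample(withReplacement=False, fraction=0.1).cache()
sampled.count()            # action that forces the sample to be computed and cached

pdf = sampled.toPandas()   # convert only the cached sample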
I have a huge database (~100 variables with a few million rows) consisting of stock data. I managed to connect Python to the database via SQLAlchemy (postgresql+psycopg2). I am running it all in the cloud.
In principle I want to do a few things:
1) Regression of all possible combinations: I am running a simple regression for each pair of stocks, i.e. ABC on XYZ and also XYZ on ABC, across the n=100 stocks, resulting in n(n+1)/2 combinations.
-> I am thinking of a function that pulls in a pair of stocks, runs the two regressions, compares the results, and picks one based on some criteria.
My question: Is there an efficient way to generate and loop over all these combinations?
2) Rolling windows: To avoid an overload of data, I thought to only load the window under investigation, i.e. 30 days, and then roll forward one day at a time, meaning my periods are:
1: 1D-30D
2: 2D-31D and so on
Meaning I always drop the first day and add another row at the end of my dataframe. So I have two steps: drop the first day and read the next row from my database.
My question: Is this a meaningful way to do it, or does Python have something better up its sleeve? How would you do it?
3) Expanding windows: Instead of dropping the first row and adding another one, I keep the 30 days and add another 30 days and then run my regression. The problem here: at some point I would be using all the data, which will probably be too big for memory.
My question: What would be a workaround here?
4) As I am running my analysis in the cloud (with a few more cores than my own PC), I could in fact use multithreading, sending "batch" jobs and letting Python do things in parallel. I thought of splitting my dataset into 4x 25 stocks and letting those run in parallel (a vertical split), or would it be better to split horizontally?
Additionally, I am using Jupyter; I am wondering how best to approach this. Usually I have a shell script calling my_program.py, is it the same here?
Let me try to give answers categorically and also note my observations.
From your description, I suppose you have taken each stock scrip as one variable and you are trying to perform pairwise linear regression amongst them. The good news: this is highly parallelizable. All you need to do is generate the unique combinations of all possible pairings, perform your regressions, and then keep only those models that fit your criteria.
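For the pair generation itself, the standard library is enough; here is a minimal sketch with made-up tickers:
from itertools import combinations

stocks = ["ABC", "XYZ", "DEF"]  # placeholder tickers; in practice the 100 column names

# n*(n-1)/2 unique unordered pairs; each pair then gets both directional regressions
pairs = list(combinations(stocks, 2))
print(pairs)  # [('ABC', 'XYZ'), ('ABC', 'DEF'), ('XYZ', 'DEF')]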
Now, as stocks are your variables, I am assuming the rows are their prices or similar values, but definitely some time series data. If my assumption is correct, then there is a problem with the rolling-window approach. In creating these rolling windows, what you are implicitly doing is using a data sampling method called 'bootstrapping', which uses random but repeated sampling. Because you are just rolling your data, you are not using random sampling, which might create problems for your regression results. At best the model may simply be overtrained; at worst, I cannot imagine. Hence, drop this approach. Plus, if it's time series data, the entire concept of windowing would be questionable anyway.
Expanding windows are no good for the same reasons stated above.
About memory and processing: I think this is an excellent scenario for Spark. It is built exactly for this purpose and has excellent Python support. Millions of data points are no big deal for Spark, and you would be able to massively parallelize your operations. Being on cloud infrastructure also gives you an advantage in configurability and expandability without headaches. I don't know why people like to use Jupyter even for batch tasks like these, but if you are set on using it, the PySpark kernel is also supported by Jupyter. A vertical split would probably be the right approach here.
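As an illustrative sketch of fanning the pairwise regressions out over Spark (the tickers and the fit function are placeholders; this only shows the parallelization pattern):
from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stocks = ["ABC", "XYZ", "DEF"]          # placeholder tickers
pairs = list(combinations(stocks, 2))   # unique unordered pairs

def fit_pair(pair):
    a, b = pair
    # placeholder: load the two series, regress a on b and b on a,
    # and return whichever model meets your criteria
    return (a, b, None)

results = spark.sparkContext.parallelize(pairs).map(fit_pair).collect()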
Hope these answer your questions.
I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
Basic idea is described in many places over the internet, for example, in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
This matcher that I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the false match/false non-match rates).
Currently I store matching scores along with labels in CSV file, like following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labeled database.)
Then I run a Python script that loads this table into a pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in Figure 2 in the link above. The processing is rather simple: sort the DataFrame by score, scan the rows from top to bottom, and count the number of impostors/genuines above and below each row.
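For reference, a minimal pandas sketch of that scan (the column names are assumed to match the CSV layout above, and score ties are ignored for simplicity):
import pandas as pd

df = pd.read_csv("scores.csv", names=["label_a", "label_b", "kind", "score"], skipinitialspace=True)
df = df.sort_values("score", ascending=False).reset_index(drop=True)

n_genuine = (df["kind"] == "genuine").sum()
n_impostor = (df["kind"] == "impostor").sum()

# treat each row's score as a candidate threshold (match if score >= threshold)
df["false_matches"] = (df["kind"] == "impostor").cumsum()                 # impostors at or above the threshold
df["false_non_matches"] = n_genuine - (df["kind"] == "genuine").cumsum()  # genuines below the threshold

df["FMR"] = df["false_matches"] / n_impostor
df["FNMR"] = df["false_non_matches"] / n_genuine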
The system should also support finding outliers, in order to support matching-algorithm improvement (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also pretty easy with DataFrames (just sort and take the head rows).
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis in this regard: the amount of data is large, several PCs are involved in the computations, and Redis has a master-slave feature that allows it to quickly sync data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be well suited to storing such tabular data.
Therefore, I need to change data structures and algorithms for their processing.
However, it is not obvious for me, how to translate this table into Redis data structures.
Another option would be using some other data storage system instead of Redis. However, I am unaware of such systems and will be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serves your query(ies) in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
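For example, a rough redis-py sketch of how the scores could be kept in Sorted Sets (the key and member naming here is just illustrative, not something prescribed by Redis or the module docs):
import redis

r = redis.Redis(host="localhost", port=6379)

# one Sorted Set per class: member = "labelA|labelB", score = matching score
r.zadd("scores:genuine", {"label1|label2": 0.1, "label1|label4": 0.2})
r.zadd("scores:impostor", {"label_2|label_n+1": 0.8, "label_2|label_n+3": 0.9})

# outlier candidates: lowest genuine scores, highest impostor scores
worst_genuine = r.zrange("scores:genuine", 0, 9, withscores=True)
worst_impostor = r.zrange("scores:impostor", 0, 9, desc=True, withscores=True)

# counts above/below a candidate threshold feed directly into FMR/FNMR
false_matches = r.zcount("scores:impostor", 0.5, "+inf")
false_non_matches = r.zcount("scores:genuine", "-inf", "(0.5")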
Disclaimer: I work at Redis Labs, home of open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).