This is the code I'm used to.
import numpy as np

dataset = np.loadtxt('path to dataset', delimiter=',')
x_train = dataset[:700, 0:3]   # first 700 rows, first three columns as features
y_train = dataset[:700, 3]     # first 700 rows, fourth column as labels
x_test = dataset[700:, 0:3]
y_test = dataset[700:, 3]
And I have billions of rows of training data.
Putting all of that data into a single CSV file is hard on the computer.
At the moment I use 'sleep' to fetch 100,000 rows at a time into a numpy array.
Is there a way to read the data into memory directly in CSV format?
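For reference, pandas can stream the same file in fixed-size chunks; this is only a rough sketch (the path is the placeholder from the code above, and 100,000 matches the batch size mentioned):

import pandas as pd

# Stream the CSV 100,000 rows at a time instead of loading the whole file
for chunk in pd.read_csv('path to dataset', header=None, chunksize=100_000):
    values = chunk.to_numpy()
    x_batch = values[:, 0:3]   # same feature columns as above
    y_batch = values[:, 3]     # same label column as above
    # feed (x_batch, y_batch) to the training step here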
I was trying to load a 30GB SAS-format data file in pandas, but I don't have enough memory to do so. I then found a Python library called Vaex, which is supposed to analyze big datasets without wasting memory. However, Vaex can only read data from certain file formats, such as CSV or HDF5. The method suggested on its website, shown below, converts the SAS file to pandas before converting it to Vaex, which brings me back to my original problem: I cannot even open this big data file using pandas. Thanks in advance!
import pandas as pd
import vaex

pandas_df = pd.read_sas('./data/io/sample_airline.sas7bdat')
df = vaex.from_pandas(pandas_df, copy_index=False)
df
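One possible workaround is to convert the SAS file in chunks and let Vaex memory-map the result. This is a rough sketch, assuming your pandas version supports chunked read_sas and your vaex version provides export_hdf5 and open_many; the chunk size and output filenames are arbitrary choices:

import pandas as pd
import vaex

hdf5_files = []
# Read the SAS file in pieces small enough to fit in memory
for i, chunk in enumerate(pd.read_sas('./data/io/sample_airline.sas7bdat', chunksize=100_000)):
    part = vaex.from_pandas(chunk, copy_index=False)
    out = f'airline_chunk_{i}.hdf5'
    part.export_hdf5(out)      # write each piece to a vaex-friendly HDF5 file
    hdf5_files.append(out)

# Open all pieces as one memory-mapped vaex DataFrame
df = vaex.open_many(hdf5_files)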
I have to process HDF5 files. Each of them contains data that can be loaded into a pandas DataFrame with 100 columns and almost 5e5 rows. Each HDF5 file is approximately 130MB.
So I want to fetch the data from each HDF5 file, apply some processing, and finally save the new data in a CSV file. In my case, the performance of the process is very important because I will have to repeat it.
So far I have focused on pandas and Dask to get the job done. Dask is good for parallelization, and I would get better processing times with a stronger PC and more CPUs.
Have any of you already encountered this problem and found a good optimization?
As others have mentioned in the comments, unless you have to move it to CSV, I'd recommend keeping it in HDF5. However, below is a description of how you might do it if you do have to carry out the conversion.
It sounds like you have a function for loading the HDF5 file into a pandas data frame. I would suggest using dask's delayed API to create a list of delayed pandas data frames, and then convert them into a dask data frame. The snippet below is copied from the linked page, with an added line to save to CSV.
import dask.dataframe as dd
from dask.delayed import delayed
from my_custom_library import load

filenames = ...
# Build one lazy (delayed) pandas DataFrame per HDF5 file
dfs = [delayed(load)(fn) for fn in filenames]
# Combine the delayed frames into a single dask DataFrame
df = dd.from_delayed(dfs)
df.to_csv(filename, **kwargs)
See dd.to_csv() documentation for info on options for saving to CSV.
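For example, one option worth knowing about (in recent dask versions) is single_file, which controls whether the output is one CSV per partition or a single file; a brief sketch:

df.to_csv('output-*.csv')                   # one CSV per partition (default)
df.to_csv('output.csv', single_file=True)   # a single CSV, written sequentially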
I am working on a large dataset stored as ndjson, where each row of the data is a JSON object. I read this in line by line, use pandas json_normalize() to flatten each one, and save each result in a list as a DataFrame, which I then concat afterwards.
The whole process takes ~2 hours on a high-powered machine, so I would like to save the result so I don't have to repeat it. However, I have tried using to_hdf and to_parquet, but both have been failing, and I believe it is because the majority of columns have mixed data types (strings, floats and ints), which is an unavoidable consequence of a messy data collection system.
What would be the most appropriate way of storing this unprocessed data prior to cleaning it?
I think pickle should help here.
To write a DataFrame/Series, use to_pickle.
To read it back, use read_pickle.
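A minimal sketch (the small DataFrame here is just a stand-in for the concatenated result of your json_normalize step):

import pandas as pd

# stand-in for the real concatenated DataFrame, with a mixed-dtype column
df = pd.DataFrame({'mixed': ['a', 1, 2.5]})

df.to_pickle('raw_data.pkl')          # stores the DataFrame as-is, mixed dtypes included

# later, to resume cleaning:
df = pd.read_pickle('raw_data.pkl')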
"We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3... This data is about 20GB on disk or 60GB in RAM."
I came across this observation while trying out Dask, a Python framework for handling out-of-memory datasets.
Can someone explain to me why there is a 3x difference? I'd imagine it has to do with Python objects, but I'm not 100% sure.
Thanks!
You are reading from a CSV on disk into a structured data frame object in memory. The two things are not at all analogous. The CSV data on disk is a single string of text. The data in memory is a complex data structure, with multiple data types, internal pointers, etc.
The CSV itself is not taking up any RAM. There is a complex data structure that is taking up RAM, and it was populated using data sourced from the CSV on disk. This is not at all the same thing.
To illustrate the difference, you could try reading the CSV into an actual single string variable and seeing how much memory that consumes. In this case, it would effectively be a single CSV string in memory:
with open('data.csv', 'r') as csvFile:
    data = csvFile.read()   # the entire CSV as one Python string
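To make the comparison concrete, you could measure both sides; a small sketch (data.csv stands in for any CSV you have on disk):

import sys
import pandas as pd

with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
print(sys.getsizeof(data))                   # bytes used by the raw text

df = pd.read_csv('data.csv')
print(df.memory_usage(deep=True).sum())      # bytes used by the structured DataFrame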
So I had this large dataset of files and I created a program to put them in a pickle file, but I only have 2GB of RAM, so I can't hold the entire file in an array. How can I append to the array in multiple batches ("stuff data into the array, append to the pickle file, clear the array, repeat")?
Thanks
Actually, I don't think it's possible to append data to a pickle file, and even if it were, you would run into memory issues when trying to read the pickle file back.
Pickle files are not designed for large data storage, so it might be worth switching to another file format.
You could go with text-based formats like CSV, JSON, ... or binary formats like HDF5, which is specifically optimized for large amounts of numerical data.
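If you do switch to HDF5, a minimal sketch of batch-wise appending with pandas might look like this (generate_batches is a hypothetical stand-in for however you produce each batch of records):

import pandas as pd

with pd.HDFStore('dataset.h5') as store:
    for batch in generate_batches():          # hypothetical batch producer
        # only one batch is held in RAM; its rows are appended to the on-disk table
        store.append('data', pd.DataFrame(batch))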