Updating large DataFrame objects not on disk. - python

I've been learning the ins and outs of Pandas by way of manipulating large csv files obtained online, the files are time-series of financial data. I have so far figured out how to use HDFStore to store and manipulate them, however I was wondering if there exists an easier way to update the files, without re-downloading the entire source file?
I ask because I'm working with 12 ~300+MB files, which update every 15mins. While I don't need the update to be continuous it'd be swell to not download what I already have.

The Blaze library from Continuum should help you. You can find an introduction here.

Related

Memory usage due to pickle/joblib

I have a significant large dataset consisting of several thousands of files spread among different directories. These files all have different formats and come from different sensors giving me different sampling rates. Basically, a mess. I created a python module that is able to enter these folders and make sense of all this data, reformat it, get it into a pandas dataframe that I could use for effective and easy resampling, and in general, make it easier to work with.
The problem is that the resulting dataframe is big and takes a large amount of RAM memory. Loading several of these datasets leaves not enough memory available to actually train a ML model. And it is painfully slow to read the data.
So my solution is a two part approach. First, I read the dataset into a big variable. It is a dict with nested pandas DataFrame, then compute a reduced derived DataFrame with the information I actually need to train my model, and remove from memory the dict variable. Not ideal, but it works. However, further computations sometimes needs re-reading the data and as stated previously, it is slow.
Enter the second part. Before removing the dict from memory, I pickle it into a file. sklearn actually recommends using joblib, so that's what I use. So, once the single files for the dataset are stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because is loading a single large file directly into memory than reading and reformatting thousands of files across different directories.
Here's my problem. The same code when is reading the data from the scattered files, ends up with about 70% less RAM than when reading the pickled data. So, although it is faster, it ends up using much memory. Has anybody experienced something like this?
Given that there are some access issues to the data (it is located in a network drive with some weird restrictions for user access) and the fact that I need to make it as user friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the packages required to read the network drive from the go and run Jupyter there, whilst running from a VM will require the manual configuration of the network drive to access the data and that part is not user friendly. The Jupyter tool requires only login information, while the VM requires a basic knowledge of linux sysadmin
I'm using Python 3.9.6. I'll keep trying to get a MWE that has a similar situation. So far I have one that has the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly). Might be because the particular structure of the dict with nested DataFrame
MWE (Warning, running this code will create a 4GB file in your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load
## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
if exists('df.joblib'):
df = load('df.joblib')
print('==== df loaded from .joblib')
else:
df = np.random.rand(1000000,500)
dump(df, 'df.joblib')
print('=== df created and dumped')
tab = df[:100, :10]
del df
return tab
table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, I get when running without the df.joblib file in the pwd
=== df created and dumped
3899.62890625 MB
And then, after that file is created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I have the opposite effect.

Can analyzing the data without creating a data frame in python for data loaded in database possible?

The data is stored in the DBeaver database.I would like to analyze my data through python without creating a data frame.And the python is installed in my computer. As the data is huge, creating a data frame will consume my ram and space.
So, it it possible to directly link my python code to the database and do the necessary aggregation or data manipulation and gather only the output
If you use python directly then also it will consume more ram and space. and if you do directly data analysis with database then it will may lead to unexpected results
instead you can use this Dask Dataframe from Dask Official ... Dask Wikipedia
with dask dataframe you can do data analysis even if you have big dataset
I don't know in which scale you want to work with your data and how big is your data set but if you are going to change the data in large scale i would recommend creating csv file which contains your dataset and working with pandas dataFrames reading csv files are fairly fast and easy to work if your interested you can visit here and read parts needed.
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

can you subset while reading in a csv in python

I have daily weather data in csv since 1980, >10GB in size. The column I am interested in date, and I want to be able to have a user select a date so that only the results from that date are returned.
I wonder if it is possible to read in and subset at the same time to save memory and computation
I am relatively new to python and tried:
d=pd.read_csv('weather.csv',sep='\t')['Date' == 'yyyymmdd']
to no avail.
Is it possible to read in all of the data that is only present for a single day (ei 20011004)?
Short answer: from a csv you'll not be able to do so.
Long answer: csv formats are very handy for humans to read, but it's the worst for machines to operate with. You'll need to parse line by line until you find the lines where the date fits the requested one.
A possible solution: You should convert the csv into a more amenable format for such operations. My suggestion would be to go with something like hdf5. You can read the whole csv with pandas and then save it as a hdf5 file as d.to_hdf('weather.h5', format='table'). You can check the pandas hdf documentation here. This should allow you to handle in a more memory and cpu efficient way.
Binary files can implement indexes and sorting in such a way that you don't have to go through all the data to check for those pieces you need. The same ideas apply to databases.
Addendum: There are other options for binary formats, like parquet (which maybe would be even better you should test) or feather (if you want some level of "native" interoperativity with R). You might want to check the following post for some insights regarding loading/saving times in different formats and their size.

Process a LOT of data

So I'm working with parametric energy simulations and ended up with 500GB+ of data stored in .CSV files. I need to be able to process all this data to compare the results and gain insights of the influence of different parameters.
Each csv file name contains information of the parameters used for the simulation so I can not merge the files.
I normally loaded the .csv files to python using pandas and defining a Class. but now (with all this data) there is not enough memory to do this.
Can you point me out a way to process this data? I need to be able to do plots and compare the csv files.
Thank you for your time.
Convert the csv files to hdf5, which was created to deal with massive and complex datasets. It works with pandas as well as other libraries.

Using pandas over csv library for manipulating CSV files in Python3

Forgive me if my questions is too general, or if its been asked before. I've been tasked to manipulate (e.g. copy and paste several range of entries, perform calculations on them, and then save them all to a new csv file) several large datasets in Python3.
What are the pros/cons of using the aforementioned libraries?
Thanks in advance.
I have not used CSV library, but many people are enjoying the benefits of Pandas. Pandas provides a lot of the tools you'll need, based off Numpy. You can easily then use more advance libraries for all sorts of analysis (sklearn for machine learning, nltk for nlp, etc.).
For your purposes, you'll find it easy to manage different cdv's, merge, concatenate, do whatever you want really.
Heres a link to a quick start guide. Lots of other resources out there as well.
getting started with pandas python
http://pandas.pydata.org/pandas-docs/stable/10min.html
Hope that helps a little bit.
You should always try to use as much as possible the work that other people have already been doing for you (such as programming the pandas library). This saves you a lot of time. Pandas has a lot to offer when you want to process such files so this seems to me to be the the best way to deal with such files. Since the question is very general, I can also only give a general answer... When you use pandas, you will however need to read more in the documentation. But I would not say that this is a downside.

Categories

Resources