Running out of RAM with pandas dataframe - python

My code looks like this:
import pandas as pd
import os
import glob
import numpy as np
# Reading files and getting Dataframes
PathCurrentPeriod = '/home/sergio/Documents/Energyfiles'
allFiles = glob.glob(PathCurrentPeriod + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_)
    list_.append(df)
frame = pd.concat(list_, axis=0)  # axis=0 stacks row-wise ('rows' is not a documented alias)
However, there are about 300 files. I think the terminal prints "Killed" when I run this in VSCode because holding all 300 files in "frame" makes the virtual machine I run this on run out of RAM.
Is there a workaround? Is it possible to use the hard drive for processing instead of RAM?
The problem is not the size of each individual .csv (I could read those in chunks); the problem is that I'm appending too many of them.
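One hedged workaround, assuming the files share the same columns: append each file to an on-disk HDF5 store instead of keeping all 300 DataFrames in memory, so the disk effectively takes the place of RAM. This is only a sketch; the store file name and key below are placeholders, and HDFStore.append requires the PyTables package.
import glob
import pandas as pd

path_current_period = '/home/sergio/Documents/Energyfiles'
all_files = glob.glob(path_current_period + "/*.csv")

# Append each CSV to a single on-disk HDF5 table instead of one in-memory frame.
# 'energy.h5' and the key 'energy' are placeholder names.
with pd.HDFStore('energy.h5', mode='w') as store:
    for file_ in all_files:
        df = pd.read_csv(file_)
        store.append('energy', df, index=False)
Later steps can then read back only the rows or columns they need with pd.read_hdf, or you can hand the files to a library like dask that manages the chunking for you.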

Related

What is most efficient approach to read multiple JSON files between Pandas and Pyspark?

I have a cloud bucket with many (around 1000) small JSON files (a few KB each). I have to read them, select some fields, and store the result in a single parquet file. Since the JSON files are very small, the resulting dataframe (around 100MB) fits in memory.
I tried two ways. The first is using Pandas with a for loop:
import os
import fnmatch
import json
import pandas as pd

path = ...
frames = []
for root, _, filenames in os.walk(path):
    for filename in fnmatch.filter(filenames, '*.json'):
        file_path = os.path.join(root, filename)
        with open(file_path, 'r') as f:
            json_file = json.loads(f.read())
        frames.append(pd.DataFrame(json_file))
# DataFrame.append is deprecated/removed in newer pandas, so collect the pieces and concat once
df = pd.concat(frames, ignore_index=True)
The second option would be using Pyspark:
from pyspark.sql import SparkSession

path = ...
spark = SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
df = spark.read.json(path)
Which of the two approaches is the most efficient way to read many JSON files? And how do the solutions scale if the number of files to read becomes much larger (more than 100K)?
If you are not running Spark in a cluster, it will not change much.
A pandas DataFrame is not distributed. When you perform transformations on a pandas dataset, the data is not spread across the cluster, so all the processing is concentrated in a single node.
Working with Spark datasets, as in the second option, Spark sends chunks of the data to the available workers in your cluster, so the data is processed in parallel, which makes the job much faster. Depending on the size and shape of your data, you can tune how it is partitioned ("sliced") to increase performance even further.
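A minimal sketch of the Spark route under these assumptions; the input pattern, field names, and output path below are placeholders rather than anything taken from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read all JSON files matching the pattern, keep only the needed fields,
# and write them out as a single parquet part file.
df = spark.read.json("path/to/bucket/*.json")       # placeholder input pattern
(df.select("field_a", "field_b")                    # placeholder field names
   .coalesce(1)                                     # one output partition -> one part file
   .write.mode("overwrite")
   .parquet("path/to/output.parquet"))              # placeholder output path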

Storing and accessing dataframe objects

I'm trying to read some Excel files as pandas dataframes. The problem is that they are quite large (about 2500 rows each, with columns up to the 'CYK' label in the Excel sheet, and there are 14 of them).
Every time I run my program, it has to import the files from Excel again. This makes the runtime grow a lot; it is currently a bit more than 15 minutes, and so far the program doesn't even do anything significant except import the files.
I would like to be able to import the files just once, then save the dataframe objects somewhere and make my program work only on those dataframes.
Any suggestions?
This is the code I developed until now:
import pandas as pd
import os
path = r'C:/Users/damia/Dropbox/Tesi/WIOD'
dirs = os.listdir(path)
complete_dirs = []
for f in dirs:
    complete_dirs.append(path + r"/" + f)
data = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    data.append(wiod)
If anyone is interested, you can find the files I'm trying to read at this link:
http://www.wiod.org/database/wiots16
You could use the to_pickle and read_pickle methods provided by pandas to serialize the dataframes and store them in files.
See the pandas docs for DataFrame.to_pickle and pandas.read_pickle.
Example pickling:
data = []
pickle_paths = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    # here's where you store it
    pickle_loc = 'your_unique_path_to_save_this_frame'
    wiod.to_pickle(pickle_loc)
    pickle_paths.append(pickle_loc)
    data.append(wiod)
Depickling
data = []
for el in pickle_paths:
    data.append(pd.read_pickle(el))
Another solution using to_pickle and read_pickle.
As an aside, you can read Excel files directly from URLs if you don't want to save to your drive first.
# read each file from the URL and save to disk
for year in range(2000, 2015):
    pd.read_excel(f"http://www.wiod.org/protected3/data16/wiot_ROW/WIOT{year}_Nov16_ROW.xlsb").to_pickle(f"{year}.pkl")

# read pickle files from disk into a single dataframe
data = list()
for year in range(2000, 2015):
    data.append(pd.read_pickle(f"{year}.pkl"))
data = pd.concat(data)

Unable to allocate

I'm facing an issue where I need to read a file, but instead I get this error:
"Unable to allocate 243. MiB for an array with shape (5, 6362620) and data type float64"
Here is my code:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('D:/School/Classes/2nd Sem/Datasets/fraud.csv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df = pd.read_csv('D:/School/Classes/2nd Sem/Datasets/fraud.csv')
When I run the last line of code, it gives me the error above.
PS: I am using Python 3 in a Jupyter notebook, on Windows 10 Home Single Language.
The MemoryError occurs because your file is too large to load at once; to solve this, you can use the chunksize parameter.
import pandas as pd
# chunksize makes read_csv return an iterator of DataFrames (1000 rows each),
# not a single DataFrame
chunks = pd.read_csv("D:/School/Classes/2nd Sem/Datasets/fraud.csv", chunksize=1000)
Link for more help -
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

How to read multiple files into a dataframe without getting the 'Killed' error message?

I am trying to read multiple csv files into a pandas dataframe. The folder is 16.6 GB in total and consists of multiple csv files. When I run this, after a while I get a 'Killed' error. Is there a way to fix this issue?
Code:
def fetchFolder(folderPath):
    print('Loading files...')
    all_files = glob.glob(folderPath + "/*.csv")
    li = []
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
    histTrades = pd.concat(li, axis=0, ignore_index=True)
    histTrades = histTrades.set_index('date')
    histTrades.index = pd.to_datetime(histTrades.index, unit='ms')
    return histTrades

fetchFolder(r'/run/media/kobej/B204D33B04D300F1/Work/backtra/data/BTCUSDT')
Output
Loading files...
Killed
Do you have at least 16.6 GB of available RAM on your computer? You are trying to load all the data into memory, so if you don't have that much RAM available, this simply isn't possible.
Try to find a way to do your processing on one chunk of the data at a time, save the results to disk, and then move on to the next chunk.
Another note: if you do have that much RAM available, you may still have to tell your IDE that it is allowed to use that much. There is a setting for this, but it is irrelevant if you don't have that much RAM available in the first place.
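A hedged sketch of that chunk-by-chunk idea, reusing the path from the question; the parquet output directory is a placeholder, and writing parquet assumes pyarrow (or fastparquet) is installed:
import glob
import os
import pandas as pd

src = r'/run/media/kobej/B204D33B04D300F1/Work/backtra/data/BTCUSDT'
out_dir = 'parquet_out'  # placeholder output directory
os.makedirs(out_dir, exist_ok=True)

# Process one CSV at a time and write the cleaned piece to disk,
# so the full 16.6 GB never has to sit in RAM at once.
for i, filename in enumerate(glob.glob(src + "/*.csv")):
    df = pd.read_csv(filename, index_col=None, header=0)
    df['date'] = pd.to_datetime(df['date'], unit='ms')
    df.set_index('date').to_parquet(f"{out_dir}/part_{i:04d}.parquet")
Later analysis can then read back only the pieces (or columns) it needs, or hand the whole directory to a tool like dask.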
You have 2 options here.
Option 1:
Use a library built to handle larger-than-memory csv data, such as dask. Use it as shown below.
import dask.dataframe as dd
df = dd.read_csv("file_name.csv")  # also accepts glob patterns such as "folder/*.csv"
Option 2: Process the data in chunks of n rows.
# process 5000 rows at a time
chunk_csv = pd.read_csv('fileName.csv', iterator=True, chunksize=5000)
# note: concatenating every chunk still rebuilds the full frame in memory
df = pd.concat(chunk_csv, ignore_index=True)

How to import multiple csv files at once

I have 30 csv files of wind speed data on my computer; each file represents data at a different location. I have written code to calculate the statistics I need for each site; however, I am currently pulling in each csv file individually to do so (see code below):
import io
import pandas as pd
from google.colab import files

data_to_load = files.upload()
df = pd.read_csv(io.BytesIO(data_to_load['Downtown.csv']))
Is there a way to pull in all 30 csv files at once so each file is run through my statistical analysis code block and spits out an array with the file name and the statistic calculated?
Use a loop.
https://intellipaat.com/community/17913/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
import glob
import pandas as pd
# get data file names
local_path = r'/my_files'
filenames = glob.glob(local_path + "/*.csv")
dfs = [pd.read_csv(filename) for filename in filenames]
# if needed concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
You can also try putting the data online (GitHub or Google Drive) and reading it from there:
https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92
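For example, a minimal sketch of reading a csv straight from a raw GitHub URL; the URL below is a placeholder, not a real file:
import pandas as pd

# pandas can read directly from an HTTP(S) URL; replace with your own raw file link
url = "https://raw.githubusercontent.com/<user>/<repo>/main/Downtown.csv"
df = pd.read_csv(url)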
