Change np to pandas to avoid MemoryError [duplicate] - Python

I'm getting the following error:
MemoryError: Unable to allocate array with shape (118, 840983) and data type float64
in my Python code whenever I run the pandas.read_csv() function to read a text file. Why is this?
This is my code:
import pandas as pd
df = pd.read_csv("LANGEVIN_DATA.txt", delim_whitespace=True)

The MemoryError means your file is too large to read with read_csv in one go; you need to use the chunksize parameter to avoid the error.
For example:
import pandas as pd
df = pd.read_csv("LANGEVIN_DATA.txt", delim_whitespace=True, chunksize=1000)
You can read the official documentation for more help:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
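Note that once chunksize is set, read_csv returns a TextFileReader (an iterator over DataFrames) rather than a single DataFrame, so df in the snippet above cannot be used directly; the chunks have to be consumed one by one. A minimal sketch of doing that, assuming the pieces can simply be concatenated back together:
import pandas as pd

chunks = []
for chunk in pd.read_csv("LANGEVIN_DATA.txt", delim_whitespace=True, chunksize=1000):
    # process each chunk here before keeping it
    chunks.append(chunk)

# combine the processed chunks into one DataFrame
df = pd.concat(chunks, ignore_index=True)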

Related

Unable to allocate

I'm facing an issue where I need to read a file, but instead it gives me an error:
"Unable to allocate 243. MiB for an array with shape (5, 6362620) and data type float64"
Here is my code:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('D:/School/Classes/2nd Sem/Datasets/fraud.csv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df = pd.read_csv('D:/School/Classes/2nd Sem/Datasets/fraud.csv')
When I run the last line of code, it gives me an error.
PS: I am using a Python 3 Jupyter notebook on Windows 10 Home Single Language.
The MemoryError occurs because your file is too large; to solve this, you can use the chunksize parameter.
import pandas as pd
df = pd.read_csv("D:/School/Classes/2nd Sem/Datasets/fraud.csv", chunksize=1000)
Link for more help -
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
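Since concatenating every chunk back into one DataFrame can run into the same memory limit, another option is to reduce each chunk as it is read and only keep small running results. A sketch of that pattern; the column name "amount" is hypothetical and would need to match a real numeric column in fraud.csv:
import pandas as pd

total_rows = 0
total_amount = 0.0
for chunk in pd.read_csv("D:/School/Classes/2nd Sem/Datasets/fraud.csv", chunksize=1000):
    # accumulate per-chunk statistics instead of storing every chunk
    total_rows += len(chunk)
    total_amount += chunk["amount"].sum()  # "amount" is a hypothetical column

print(total_rows, total_amount)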

python/pandas "Kernel died, restarting" while loading a csv file

While trying to load a big csv file (150 MB) I get the error "Kernel died, restarting". The only code that I use is the following:
import pandas as pd
from pprint import pprint
from pathlib import Path
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
basedaily = pd.read_csv('combined_csv.csv')
It used to work before, but I do not know why it is not working anymore. I tried to fix it using engine="python" as follows:
basedaily = pd.read_csv('combined_csv.csv', engine='python')
But it gives me an "execution aborted" error.
Any help would be welcome!
Thanks in advance!
You may be getting this error because of a lack of memory. You can split your data into many data frames, do your work, and then re-merge them; below is some useful code that you may use:
import pandas as pd
# the number of row in each data frame
# you can put any value here according to your situation
chunksize = 1000
# the list that contains all the dataframes
list_of_dataframes = []
for df in pd.read_csv('combined_csv.csv', chunksize=chunksize):
    # process your data frame here
    # then add the current data frame into the list
    list_of_dataframes.append(df)

# if you want all the dataframes together, here it is
result = pd.concat(list_of_dataframes)
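Keep in mind that pd.concat at the end rebuilds the full dataset in memory, so this pattern only helps if each chunk is trimmed before it is appended. A small sketch of that idea; the column names in usecols are hypothetical:
import pandas as pd

needed_columns = ["date", "value"]  # hypothetical column names
list_of_dataframes = []
for df in pd.read_csv("combined_csv.csv", chunksize=1000, usecols=needed_columns):
    # only the required columns are loaded, which shrinks each chunk
    list_of_dataframes.append(df)

result = pd.concat(list_of_dataframes, ignore_index=True)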

Problem reading data from a file with pandas Python (pandas.io.parsers.TextFileReader)

I want to read a dataset from a file with pandas, but when I use pd.read_csv() the program reads it, and when I want to see the dataframe this appears:
pandas.io.parsers.TextFileReader at 0x1b3b6b3e198
As additional information, the file is quite large (around 9 GB).
The file uses vertical bars as the separator, and I tried using chunksize but it doesn't work.
import pandas as pd
df = pd.read_csv(r"C:\Users\dguerr\Documents\files\Automotive\target_file", iterator=True, sep='|',chunksize=1000)
I want to import my data in the traditional pandas dataframe format.
You can load it chunk by chunk by doing:
import pandas as pd
path_to_file = "C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file"
chunk_size = 1000
for chunk in pd.read_csv(path_to_file, chunksize=chunk_size):
    # do your stuff with each chunk here
    pass
You might also want to check the encoding of your file. pd.read_csv defaults to utf-8; should your file be encoded in latin-1, for instance, this could potentially lead to such errors.
import pandas as pd
df = pd.read_csv('C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file',
encoding='latin-1', chunksize=1000)
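The pandas.io.parsers.TextFileReader shown in the question is exactly that iterator, not a DataFrame. If you only want to look at a few rows without loading the whole ~9 GB file, the reader's get_chunk() method can hand back a limited number of rows; a small sketch using the path from the question:
import pandas as pd

reader = pd.read_csv(r"C:\Users\dguerr\Documents\files\Automotive\target_file",
                     sep='|', chunksize=1000)

# peek at the first 5 rows without reading the rest of the file
preview = reader.get_chunk(5)
print(preview)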

How to import large csv file and perform operations

I have a problem opening a large csv file (>5 GB) and performing some simple operations on it. I have written this code:
import pandas as pd
import numpy as np
import os
import glob
os.chdir('C:\\Users\\data')
df = pd.read_csv('myfile.csv', low_memory=False, header=None, names= ['column1','column2', 'column3'])
df
Even setting low_memory to False, it does not work. I used the following code that I found on this site, but it does not work either.
import pandas as pd
import numpy as np
import os
import glob
os.chdir('C:\\Users\\data')
mylist = []
for chunk in pd.read_csv('SME_all.csv', sep=';', chunksize=20000):
    mylist.append(chunk)
big_data = pd.concat(mylist, axis=0)
del mylist
df = pd.read_csv('myfile.csv', low_memory=False, header=None,
error_bad_lines = False, names=['column1','column2', 'column3'])
df
Any suggestion? Should I consider using other application such as Apache Spark?
There are lots of approaches.
Perhaps the simplest is to split your CSV into multiple files. This only works if you don't need to aggregate the data in any way, such as groupby.
You can try specifying dtypes on import; otherwise pandas may interpret columns as objects, which take up more memory (see the sketch after this list).
You can iterate over the CSV using Python's built-in csv reader and perform operations on each row, if that's the type of work you're trying to do.
You can look at Dask, or use PySpark on Google's Dataproc or Azure Databricks.
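As a concrete illustration of the dtype suggestion above, here is a minimal sketch; the column names and dtypes are hypothetical and would need to match the actual file:
import pandas as pd

dtypes = {
    "column1": "int32",     # smaller integer type instead of the default int64
    "column2": "float32",   # smaller float type instead of float64
    "column3": "category",  # repeated strings stored as categories
}
df = pd.read_csv("myfile.csv", header=None,
                 names=["column1", "column2", "column3"], dtype=dtypes)
print(df.memory_usage(deep=True))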

Write pandas DataFrame to HDF in memory buffer

I want to get a dataframe as hdf in memory. The code below results in "AttributeError: '_io.BytesIO' object has no attribute 'put'". I am using python 3.5 and pandas 0.17
import pandas as pd
import numpy as np
import io
df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])
buf = io.BytesIO()
df.to_hdf(buf, 'some_key')
Update:
As UpSampler pointed out "path_or_buf" cannot be an io stream (which I find confusing since buf usually can be an io stream, see to_csv). Other than writing to disk and reading it back in, can I get a dataframe as hdf in memory?
Your first argument to
df.to_hdf()
has to be a "path (string) or HDFStore object" not an io stream. Documentation: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_hdf.html
Just try this:
df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])
df.to_hdf(path_or_buf='path/to/your/file.h5', key='some_key')
Refer to pandas.DataFrame.to_hdf for details.
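To get the HDF bytes without touching disk, one possibility is to back an HDFStore with PyTables' in-memory CORE driver. This is only a sketch: it assumes pandas forwards the extra keyword arguments to tables.open_file() and it reaches into the private _handle attribute, both of which may differ between versions:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])

# Open a store backed by PyTables' in-memory CORE driver; with
# driver_core_backing_store=0 nothing is written to disk.
store = pd.HDFStore("in_memory.h5", mode="w",
                    driver="H5FD_CORE", driver_core_backing_store=0)
store.put("some_key", df)

# Pull the raw HDF5 bytes out of the underlying PyTables file handle
# (_handle is a private pandas attribute, so treat this as fragile).
hdf_bytes = store._handle.get_file_image()
store.close()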
