Write pandas DataFrame to HDF in memory buffer - python

I want to get a DataFrame as HDF in memory. The code below results in "AttributeError: '_io.BytesIO' object has no attribute 'put'". I am using Python 3.5 and pandas 0.17.
import pandas as pd
import numpy as np
import io
df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])
buf = io.BytesIO()
df.to_hdf(buf, 'some_key')
Update:
As UpSampler pointed out "path_or_buf" cannot be an io stream (which I find confusing since buf usually can be an io stream, see to_csv). Other than writing to disk and reading it back in, can I get a dataframe as hdf in memory?

Your first argument to
df.to_hdf()
has to be a "path (string) or HDFStore object", not an io stream. Documentation: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_hdf.html
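That said, you can keep the whole HDF5 file in memory by going through an HDFStore backed by PyTables' in-memory "core" driver. A sketch (the driver keywords are forwarded to tables.open_file, and get_file_image() is a PyTables method on the private handle, not pandas API):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])

# The H5FD_CORE driver keeps the file in RAM; with backing_store=0
# nothing is ever written to disk, so the file name is just a label.
store = pd.HDFStore('in_memory.h5', mode='w',
                    driver='H5FD_CORE',
                    driver_core_backing_store=0)
store.put('some_key', df)
image = store._handle.get_file_image()  # raw HDF5 bytes, if you need them
store.close()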

Just try this:
df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])
df.to_hdf(path_or_buf=r'path\to\your\file', key='some_key')
Note the raw string (in a plain string, \t would become a tab) and the key argument, which to_hdf requires. Refer to pandas.DataFrame.to_hdf.
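For completeness, a minimal round trip through a real file (the file name here is illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(8).reshape(-1, 2), columns=['a', 'b'])
df.to_hdf('data.h5', key='some_key')           # write to disk
df_back = pd.read_hdf('data.h5', 'some_key')   # read it back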

Related

change np to pandas to avoid memoryerror [duplicate]

I'm getting the following error:
MemoryError: Unable to allocate array with shape (118, 840983) and data type float64
in my Python code whenever I run pandas.read_csv() to read a text file. Why is this?
This is my code:
import pandas as pd
df = pd.read_csv("LANGEVIN_DATA.txt", delim_whitespace=True)
The MemoryError means your file is too large to read into memory in one go; you need to use the chunksize parameter to avoid the error, like this:
import pandas as pd
# With chunksize, read_csv returns an iterator of DataFrames, not one DataFrame
reader = pd.read_csv("LANGEVIN_DATA.txt", delim_whitespace=True, chunksize=1000)
You can read the official documentation for more help:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
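A minimal sketch of consuming those chunks (continuing from the reader above; the per-chunk work is a placeholder):
total_rows = 0
for chunk in reader:
    # Each chunk is an ordinary DataFrame of up to 1000 rows.
    total_rows += len(chunk)
print(total_rows)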

How to remove b' from values in dataframe

I read my arff data from https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has b' prefixes in all values in all columns.
How can I remove them?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'
As you can see, .str.decode('utf-8') from Removing b'' from string column in a pandas dataframe didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you can see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue, and I found a workaround that may be helpful. Rather than using from scipy.io import arff, I used another library called liac-arff:
pip install liac-arff
(or whatever pip command works for your operating system or IDE), and then:
import arff
import pandas as pd
# liac-arff's arff.load() takes a file object (arff.loads() takes the file's contents as a string)
with open('Autism-Adult-Data.arff') as f:
    data = arff.load(f)
This returns a dictionary. To find what keys that dictionary has, you do
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
where data holds the actual rows and attributes holds the column names and the possible values of those columns. So to get a DataFrame you need to do the following:
# Column names are the first element of each (name, values) pair in 'attributes'
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])

df = pd.DataFrame(data['data'], columns=colnames)
df.head()
I went a bit overboard building the DataFrame here, but this returns a DataFrame with no b' issues, and the key is using import arff.
The GitHub repository for the library I used can be found here.
Although Shimon shared an answer, you could also give this a try:
df.apply(lambda x: x.str.decode('utf-8'))
Note that this will fail on columns that aren't byte strings; the sketch below decodes only the byte-string columns.
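A minimal sketch, assuming df was loaded with scipy's loadarff above, which typically leaves numeric columns as floats and only nominal columns as bytes:
# Decode only the object-dtype (byte-string) columns, leaving numbers untouched.
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda col: col.str.decode('utf-8'))
df.head()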

Problem reading a data from a file with pandas Python (pandas.io.parsers.TextFileReader)

I want to read a dataset from a file with pandas, but when I use pd.read_csv() the program reads it, and when I try to view the dataframe I get:
pandas.io.parsers.TextFileReader at 0x1b3b6b3e198
As additional information, the file is quite large (around 9 GB).
The file uses vertical bars ('|') as separators, and I tried using chunksize but it doesn't work.
import pandas as pd
df = pd.read_csv(r"C:\Users\dguerr\Documents\files\Automotive\target_file", iterator=True, sep='|',chunksize=1000)
I want to import my data in the traditional pandas dataframe format.
You can load it chunk by chunk by doing:
import pandas as pd

path_to_file = "C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file"
chunk_size = 1000
for chunk in pd.read_csv(path_to_file, sep='|', chunksize=chunk_size):
    pass  # do your stuff with each chunk here
Note the sep='|', since your file is pipe-separated.
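If you want a single traditional DataFrame at the end (and it fits in memory), you can concatenate the chunks; a sketch using the same path as above:
import pandas as pd

path_to_file = "C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file"
# pd.concat accepts the chunk iterator and yields one ordinary DataFrame.
chunks = pd.read_csv(path_to_file, sep='|', chunksize=1000)
df = pd.concat(chunks, ignore_index=True)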
You might also want to check your file's encoding. pd.read_csv defaults to utf-8; if the file is actually latin-1, for instance, that can lead to errors:
import pandas as pd
reader = pd.read_csv('C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file',
                     sep='|', encoding='latin-1', chunksize=1000)

How to import large csv file and perform operations

I have a problem opening a large csv file (>5GB) and performing some simple operations on it. I have made this code:
import pandas as pd
import numpy as np
import os
import glob
os.chdir('C:\\Users\\data')
df = pd.read_csv('myfile.csv', low_memory=False, header=None, names= ['column1','column2', 'column3'])
df
Even setting low_memory=False, it does not work. I used the following code that I found on this site, but it does not work either.
import pandas as pd
import numpy as np
import os
import glob
os.chdir('C:\\Users\\data')
mylist = []
for chunk in pd.read_csv('SME_all.csv', sep=';', chunksize=20000):
    mylist.append(chunk)
big_data = pd.concat(mylist, axis=0)
del mylist
df = pd.read_csv('myfile.csv', low_memory=False, header=None,
                 error_bad_lines=False, names=['column1', 'column2', 'column3'])
df
Any suggestions? Should I consider using another application such as Apache Spark?
There are lots of approaches.
Perhaps the simplest is to split your CSV into multiple files. This only works if you don't need to aggregate the data in any way, such as with groupby.
You can try specifying dtypes on import; otherwise pandas may interpret columns as objects, which take up more memory (see the sketch below).
You can iterate over the CSV using Python's built-in csv reader and perform operations on each row, if that's the type of work you're trying to do.
You can look at Dask, or at using PySpark on Google's Dataproc or Azure's Databricks.
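A minimal sketch of the dtype suggestion combined with chunking (the column types here are hypothetical; adjust them to the real data):
import pandas as pd

# Hypothetical dtypes: narrower numeric types and 'category' cut memory use.
dtypes = {'column1': 'int32', 'column2': 'float32', 'column3': 'category'}
row_count = 0
for chunk in pd.read_csv('myfile.csv', header=None,
                         names=['column1', 'column2', 'column3'],
                         dtype=dtypes, chunksize=20000):
    row_count += len(chunk)  # replace with your real per-chunk operation
print(row_count)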

Read excel file from StringIO buffer to dataframe with pandas.io.parsers.ExcelFile?

I'd like to read a string buffer into a pandas DataFrame. It seems that a good way to do it would be to use pandas' ExcelFile functionality. I've tried to do something like the following:
from io import StringIO
from pandas import ExcelFile as excel_handler
excel_data = excel_handler(StringIO(file_stream.read()).getvalue())
From then on, I guess ExcelFile.parse() can be used.
This produces the following error:
<class 'openpyxl.shared.exc.InvalidFileException'> [Errno 2] No such file or directory: '
Any ideas on how to read in the file from the buffer?
Fixed. I had missed a part earlier in my code where file_stream.read() was already being called, so by the time ExcelFile was called, an empty string was being passed to it, causing the error. getvalue() needed to be removed. Here's how it should go:
from io import StringIO
from pandas import ExcelFile

excel_data = ExcelFile(StringIO(file_stream.read()))
dataframe = excel_data.parse(excel_data.sheet_names[-1])
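Note that in Python 3, Excel files are binary, so an io.BytesIO buffer is usually the right wrapper (a sketch, assuming file_stream was opened in binary mode):
import io
import pandas as pd

# .xls/.xlsx are binary formats, so wrap the raw bytes in BytesIO.
excel_data = pd.ExcelFile(io.BytesIO(file_stream.read()))
dataframe = excel_data.parse(excel_data.sheet_names[-1])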
