pandas: write empty DataFrame to HDF file - python

Is there a way to force pandas to write an empty DataFrame to an HDF file?
import pandas as pd
df = pd.DataFrame(columns=['x','y'])
df.to_hdf('temp.h5', 'xxx')
df2 = pd.read_hdf('temp.h5', 'xxx')
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 389, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 740, in select
return it.get_result()
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1518, in get_result
results = self.func(self.start, self.stop, where)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 733, in func
columns=columns)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2986, in read
idx=i), start=_start, stop=_stop)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2575, in read_index
_, index = self.read_index_node(getattr(self.group, key), **kwargs)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2676, in read_index_node
data = node[start:stop]
File ".../Python-3.6.3/lib/python3.6/site-packages/tables/vlarray.py", line 675, in __getitem__
return self.read(start, stop, step)
File ".../Python-3.6.3/lib/python3.6/site-packages/tables/vlarray.py", line 811, in read
listarr = self._read_array(start, stop, step)
File "tables/hdf5extension.pyx", line 2106, in tables.hdf5extension.VLArray._read_array (tables/hdf5extension.c:24649)
ValueError: cannot set WRITEABLE flag to True of this array
Writing with format='table':
import pandas as pd
df = pd.DataFrame(columns=['x','y'])
df.to_hdf('temp.h5', 'xxx', format='table')
df2 = pd.read_hdf('temp.h5', 'xxx')
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 389, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 722, in select
raise KeyError('No object named {key} in the file'.format(key=key))
KeyError: 'No object named xxx in the file'
Pandas version: 0.24.2
Thank you for your help!

Putting an empty DataFrame into an HDFStore in fixed format should work (you may need to check the versions of other packages, e.g. tables):
import pandas as pd
import tables
# Versions
pd.__version__
tables.__version__
# DF
df = pd.DataFrame(columns=['x','y'])
df
# Dump in fixed format
with pd.HDFStore('temp.h5') as store:
    store.put('df', df, format='f')
    print('Read:')
    store.select('df')
>>> '0.24.2'
>>> '3.5.1'
>>> x y
>>>
>>> Read:
>>> x y
PyTables itself does not allow this (or at least it did not), but pandas has a workaround for the fixed format.
As discussed in the related GitHub issue, some effort has gone into fixing this behavior for the table format as well, but the solution still seems to be up in the air; that was the state at the end of March.
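Until the table format handles empty frames, one possible workaround on the reading side is to treat the KeyError shown in the question as "nothing was stored". The following is only a sketch, assuming the expected column names are known in advance; `read_hdf_or_empty` and its `reader` parameter are hypothetical names introduced here, not part of pandas.

```python
import pandas as pd


def read_hdf_or_empty(path, key, columns, reader=pd.read_hdf):
    """Read `key` from an HDF file, or return an empty frame with `columns`.

    Hypothetical helper: it assumes that a KeyError from the reader means
    an empty frame was "written" in table format and nothing was stored.
    """
    try:
        return reader(path, key)
    except KeyError:
        # Same shape as the frame in the question: columns only, no rows
        return pd.DataFrame(columns=columns)
```

Note that this also hides a genuinely missing key, so it only fits cases where an absent key and an empty frame can be treated the same way.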

Related

Pandas AttributeError: 'DataFrame' object has no attribute 'Timestamp'

I want to get the monthly sum with my script, but I always get an AttributeError that I don't understand. The column Timestamp does exist in my combined_csv. I know for sure that this line is causing the problem, since I tested all of my other code before.
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
I'd appreciate any help I can get. Thanks!
import os
import glob
import pandas as pd
# set working directory
os.chdir("Path to CSVs")
# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=';') for f in all_filenames])
# Format CSV
# Transform Timestamp column into datetime
combined_csv['Timestamp'] = pd.to_datetime(combined_csv.Timestamp)
# Read out first entry of every day of every month
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
# To get the yield of day i have to subtract day 2 HtmDht_Energy - day 1 HtmDht_Energy
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
# combined_csv.reset_index()
# combined_csv.index.set_names(["year", "month"], inplace=True)
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
Output of combined_csv.columns
Index(['Timestamp', 'teHst0101', 'teHst0102', 'teHst0103', 'teHst0104',
'teHst0105', 'teHst0106', 'teHst0107', 'teHst0201', 'teHst0202',
'teHst0203', 'teHst0204', 'teHst0301', 'teHst0302', 'teHst0303',
'teHst0304', 'teAmb', 'teSolFloHexHst', 'teSolRetHexHst',
'teSolCol0501', 'teSolCol1001', 'teSolCol1501', 'vfSol', 'prSolRetSuc',
'rdGlobalColAngle', 'gSolPump01_roActual', 'gSolPump02_roActual',
'gHstPump03_roActual', 'gHstPump04_roActual', 'gDhtPump06_roActual',
'gMB01_isOpened', 'gMB02_isOpened', 'gCV01_posActual',
'gCV02_posActual', 'HtmDht_Energy', 'HtmDht_Flow', 'HtmDht_Power',
'HtmDht_Volume', 'HtmDht_teFlow', 'HtmDht_teReturn', 'HtmHst_Energy',
'HtmHst_Flow', 'HtmHst_Power', 'HtmHst_Volume', 'HtmHst_teFlow',
'HtmHst_teReturn', 'teSolColDes', 'teHstFloDes'],
dtype='object')
Traceback when I select the column with brackets instead:
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
Traceback (most recent call last):
File "D:\Users\wink\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Timestamp'
Traceback with Mustafa's solution:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3862, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\util\_decorators.py", line 312, in wrapper
return func(*args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4176, in reindex
return super().reindex(**kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
return self._reindex_axes(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4022, in _reindex_axes
frame = frame._reindex_index(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4038, in _reindex_index
new_index, indexer = self.index.reindex(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2492, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 175, in new_meth
return meth(self_or_cls, *args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 531, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas\_libs\lib.pyx", line 2527, in pandas._libs.lib.tuples_to_object_array
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3888, in _sanitize_column
value = reindexer(value).T
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3870, in reindexer
raise TypeError(
TypeError: incompatible index of inserted column with frame index
This line makes the Timestamp column the index of combined_csv:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
so Timestamp is no longer a column, and you get an error when you try to access .Timestamp.
The remedy is to call reset_index. Instead of the line above, you can try this:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first'])).reset_index()
This moves Timestamp from the index back into the regular columns, where you can access it again.
Side note:
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
is equivalent to
combined_csv["dailyYield"] = combined_csv["first"].diff()
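The second traceback in the question (TypeError: incompatible index of inserted column with frame index) comes from assigning the result of groupby(...).sum(), which has one row per (year, month) group, to a column of the daily frame. A minimal sketch of one way around that, assuming a per-row monthly total is what is wanted: `transform("sum")` returns a result aligned with the original rows. The sample values and the names `daily`, `year` and `month` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for combined_csv: a cumulative energy counter
# sampled twice a day across a month boundary.
combined_csv = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-01-30 00:00", "2021-01-30 12:00",
        "2021-01-31 00:00", "2021-01-31 12:00",
        "2021-02-01 00:00", "2021-02-01 12:00",
    ]),
    "HtmDht_Energy": [10.0, 12.0, 15.0, 19.0, 24.0, 30.0],
})

daily = combined_csv.resample("D", on="Timestamp")["HtmDht_Energy"].agg(["first"])
in_columns_before = "Timestamp" in daily.columns  # Timestamp is the index here
daily = daily.reset_index()
in_columns_after = "Timestamp" in daily.columns   # back among the columns

daily["dailyYield"] = daily["first"].diff()

# Group keys get distinct names so the two levels are easy to tell apart
year = daily["Timestamp"].dt.year.rename("year")
month = daily["Timestamp"].dt.month.rename("month")
# transform("sum") keeps one value per original row, so the assignment aligns
daily["monthlySum"] = daily.groupby([year, month])["dailyYield"].transform("sum")
print(daily)
```

If one row per month is preferred instead, the grouped sum can be kept as its own frame rather than assigned back as a column.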

Error when using pandas to convert dates on the dataframe or when reading the csv file

I need to import a CSV file with pandas that has a date field in the format 'year.decimal day', such as '1980.042', which corresponds to '11/02/1980' in 'DD/MM/YYYY' format.
File sample:
data
1980.042
1980.125
1980.208
1980.292
1980.375
1980.458
1980.542
1980.625
1980.708
Using pd.to_datetime I can transform it like this:
d = '1980.042'
print(pd.to_datetime(d, format = '%Y.%j'))
Output:
1980-02-11 00:00:00
My first attempt was to read the file and convert the dataframe column:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 4, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
The second attempt was to transform the column into a str and then a date:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].astype(str)
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 6, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
Then I realized that, because of some internal floating-point issue, the data was ending up with more than three decimal places. So I rounded it to three decimal places before converting:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].round(3).astype(str)
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
data object
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 8, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
Finally, I found in the pandas documentation and in some forums that I could define the data type when reading the file and also apply a lambda function:
import pandas as pd
date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df.dtypes, '\n\n', df.head())
Output:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 377, in _convert_listlike
values, tz = conversion.datetime_to_datetime64(arg)
File "pandas/_libs/tslibs/conversion.pyx", line 188, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "datas.py", line 5, in <module>
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 446, in _read
data = parser.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1921, in read
names, data = self._do_date_conversions(names, data)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1675, in _do_date_conversions
self.index_names, names, keep_date_col=self.keep_date_col)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3066, in _process_date_conversion
data_dict[colspec] = converter(data_dict[colspec])
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3033, in converter
return generic_parser(date_parser, *date_cols)
File "/usr/lib/python3/dist-packages/pandas/io/date_converters.py", line 39, in generic_parser
results[i] = parse_func(*args)
File "datas.py", line 3, in <lambda>
date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 469, in to_datetime
result = _convert_listlike(np.array([arg]), box, format)[0]
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 380, in _convert_listlike
raise e
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 347, in _convert_listlike
errors=errors)
File "pandas/_libs/tslibs/strptime.pyx", line 163, in pandas._libs.tslibs.strptime.array_strptime
ValueError: unconverted data remains: 5
Nothing has worked so far. Has anyone run into this? Any suggestions for reading the file with the correct data type, or for converting the column in the dataframe?
I really hadn't realized the problem with the data.
Removing those with decimal parts greater than 365, I tested Tuhin Sharma's idea.
Unfortunately, it returns the value of the first line for all dataframe lines.
But I used the datetime module, as suggested by Tuhin Sharma, in a lambda function when reading the file as follows:
Sample file:
data
1980.042
1980.125
1980.208
1980.292
Code:
import pandas as pd
import datetime
date_parser = lambda col: datetime.datetime.strptime(col, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)
Output:
data
0 1980-02-11
1 1980-05-04
2 1980-07-26
3 1980-10-18
You could try using the datetime module, for example:
import datetime
import pandas as pd
df = pd.read_csv('datas.csv', dtype=str)
df["data"] = df["data"].map(lambda x: datetime.datetime.strptime(x, '%Y.%j'))
However, this code will fail on your sample, because the data itself has a problem:
1980.375
1980.458
1980.542
1980.625
1980.708
In these values the three digits after the decimal point, which %j reads as the day of the year, are greater than 365. No such day exists, and that is why an error is thrown.
Hope this helps!
You can also try the following code, which is a lot cleaner:
import pandas as pd
import datetime
date_parser = lambda x: datetime.datetime.strptime(x, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)
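To find which rows carry an impossible day-of-year instead of stopping at the first one, a possible variation is to read the column as strings and let `pd.to_datetime` turn unparseable values into NaT with `errors="coerce"`. This is a sketch under assumptions: the inline CSV text stands in for datas.csv, and the `parsed` column and `bad_rows` name are made up for illustration.

```python
import io

import pandas as pd

# Inline stand-in for datas.csv; the last value has day-of-year 375
csv_text = "data\n1980.042\n1980.125\n1980.375\n"

# dtype=str keeps '1980.042' exactly as written, with no float round-trip
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# errors="coerce" yields NaT for values that do not fit '%Y.%j'
df["parsed"] = pd.to_datetime(df["data"], format="%Y.%j", errors="coerce")

# Rows that could not be parsed, for inspection
bad_rows = df[df["parsed"].isna()]
print(df)
print(bad_rows)
```

Keeping the original strings next to the parsed column makes it easy to inspect the rejected values before deciding how to handle them.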

Dask not efficient on concatenating large pandas dataframes and gives Memory Error

At first, I tried typical concatenation of pandas dataframe:
df=pd.concat([df,df_filtered2],axis=1,sort=False)
but it gave the error:
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
Traceback (most recent call last):
File "process_data_interpolation.py", line 435, in <module>
df=pd.concat([df,df_filtered2],axis=1,sort=False)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 335, in __init__
obj._consolidate(inplace=True)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5270, in _consolidate
self._consolidate_inplace()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5252, in _consolidate_inplace
self._protect_consolidate(f)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5241, in _protect_consolidate
result = f()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5250, in f
self._data = self._data.consolidate()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 932, in consolidate
bm._consolidate_inplace()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
new_values = new_values[argsort]
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64
so I tried Dask:
df = dd.concat([df,df_filtered2],axis=1)
but it also gave me the MemoryError:
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
Traceback (most recent call last):
File "process_data_interpolation.py", line 443, in <module>
df = dd.concat([df,df_filtered2],axis=1)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/multi.py", line 1045, in concat
dfs = _maybe_from_pandas(dfs)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in _maybe_from_pandas
for df in dfs
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in <listcomp>
for df in dfs
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in from_pandas
for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in <dictcomp>
for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
return self._get_slice_axis(key, axis=axis)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1308, in _get_slice_axis
return self._slice(indexer, axis=axis, kind="iloc")
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 166, in _slice
return self.obj._slice(obj, axis=axis, kind=kind)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 3371, in _slice
result = self._constructor(self._data.get_slice(slobj, axis=axis))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 755, in get_slice
bm._consolidate_inplace()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
new_values = new_values[argsort]
MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64
What else can I try? I am running the Python script on a Linux node with 128 GB of RAM. In my case, one of the pandas dataframes is 44.48 GB after dropping unnecessary columns and converting some columns to integer.
This question is answered in the Dask Best Practices documentation:
https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
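For a sense of scale, the size of the array named in both MemoryError messages can be worked out from its shape and dtype. The numbers below come straight from the error text; the variable names are illustrative.

```python
# Shape and dtype as reported in the MemoryError message
rows, cols = 41, 156_082_680
bytes_per_float64 = 8

array_bytes = rows * cols * bytes_per_float64
array_gib = array_bytes / 2**30

print(f"{array_bytes:,} bytes is about {array_gib:.1f} GiB")
```

That single block has to be allocated on top of the data already held in memory. The Dask traceback also shows `dd.concat` going through `from_pandas` on the existing pandas objects, so the full frames are in RAM before Dask can split anything, which is consistent with the advice to load the data with Dask from the start.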

pandas cannot converge

I get an error that a file does not exist while I have the file there in the folder, would you please tell me where I am making a mistake?
pd.DataFrame.from_csv
I am getting an error shown below.
Traceback (most recent call last):
File "main.py", line 194, in <module>
start_path+end_res)
File "/Users/admin/Desktop/script/mergeT.py", line 5, in merge
df_peak = pd.DataFrame.from_csv(peak_score, index_col = False, sep='\t')
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 1231, in from_csv
infer_datetime_format=infer_datetime_format)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 388, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
self._make_engine(self.engine)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4175)
File "pandas/parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:8440)
IOError: File results\scoring\fed\score_peak.txt does not exist
I have tried to set a path to the exact file
for example
According to the pandas 0.19.1 documentation, pandas.DataFrame.from_csv does not support index_col = False. Try pandas.read_csv instead (with the same parameters). Also make sure you are using an up-to-date version of pandas.
See if this works:
import pandas as pd
def merge(peak_score, profile_score, res_file):
    df_peak = pd.read_csv(peak_score, index_col=False, sep='\t')
    df_profile = pd.read_csv(profile_score, index_col=False, sep='\t')
    result = pd.concat([df_peak, df_profile], axis=1)
    print(result.head())
    # Flag rows where both protein columns match, then keep only the mismatches
    test = []
    for a, b in zip(result['prot_a_p'], result['prot_b_p']):
        if a == b:
            test.append(1)
        else:
            test.append(0)
    result['test'] = test
    result = result[result['test'] == 0]
    del result['test']
    result = result.fillna(0)
    result.to_csv(res_file)
if __name__ == '__main__':
    pass
Regarding the path issue when changing from Windows to OS X:
In all flavours of Unix, paths are written with forward slashes (/), while Windows uses backslashes (\). Since OS X is a descendant of Unix, as other users have correctly pointed out, you need to adapt your paths when you move there from Windows.
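One way to avoid hard-coding either separator is to let the standard library handle it. The sketch below uses the relative path from the error message; the variable names are illustrative.

```python
from pathlib import Path, PureWindowsPath

# The Windows-style relative path from the IOError message
windows_style = r"results\scoring\fed\score_peak.txt"

# Parse it with Windows rules, then render it with forward slashes
portable = PureWindowsPath(windows_style).as_posix()
print(portable)

# Or build the path from its parts; Path uses the separator of the
# operating system the script is running on
peak_score = Path("results") / "scoring" / "fed" / "score_peak.txt"
print(peak_score)
```

Building the path from parts is usually the simpler choice when the directory names are known in the script, since the same code then runs unchanged on Windows and OS X.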

Python Pandas print error in Eclipse's PyDev: unknown encoding: MS874

I am trying to use Pandas library to read csv files, using Eclipse's PyDev.
foo.csv file:
"head1", "head2",
"A", "123"
test.py:
import pandas as pd
data = pd.read_csv('foo.csv');
print data
I ran this and got an error:
Traceback (most recent call last):
File "C:\Users\qqq\studyspace\macd\test3.py", line 4, in <module>
print data
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 666, in __str__
return self.__bytes__()
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 676, in __bytes__
return self.__unicode__().encode(encoding, 'replace')
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 691, in __unicode__
fits_horizontal = self._repr_fits_horizontal_()
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 651, in _repr_fits_horizontal_
d.to_string(buf=buf)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1488, in to_string
formatter.to_string()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 314, in to_string
strcols = self._to_str_columns()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 258, in _to_str_columns
str_index = self._get_formatted_index()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 472, in _get_formatted_index
fmt_index = [index.format(name=show_index_names, formatter=fmt)]
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 450, in format
return self._format_with_header(header, **kwargs)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 472, in _format_with_header
result = _trim_front(format_array(values, None, justify='left'))
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 1321, in format_array
return fmt_obj.get_result()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 1448, in get_result
return _make_fixed_width(fmt_values, self.justify)
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 1495, in _make_fixed_width
max_len = np.max([_strlen(x) for x in strings])
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 184, in _strlen
return len(x.decode(encoding))
LookupError: unknown encoding: MS874
I have tried to run this in IPython, and it does not give the error, so I think the problem is with my Eclipse setting. I use Eclipse Juno and I installed Pandas via Python(x,y).
I have tried to solve it blindly like this
import pandas as pd
data = pd.read_csv('foo.csv');
b = True;
while(b):
try:
print data
b = False
except:
print 'foooo'
And it just printed 'foooo' forever.
I have found the solution.
Right click on the project => Properties => Resource => Text file encoding. Choose other => UTF-8.
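The traceback ends in a LookupError because the encoding name that pandas picked up from the console is not one Python knows. A small diagnostic sketch, assuming the goal is just to check whether a given name is recognised by Python's codec registry; `is_known_encoding` is a hypothetical helper name.

```python
import codecs
import sys


def is_known_encoding(name):
    """Return True if Python's codec registry recognises `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# The encoding the current console reports (may be None when redirected)
console_encoding = getattr(sys.stdout, "encoding", None)
print(console_encoding, is_known_encoding(console_encoding or ""))
```

Running this inside the IDE and in a plain terminal shows whether the two environments report different console encodings, which is what the project setting above changes.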
