something wrong with dataframe.to_json() - python

My dataframe l looks like this:
                            0         1
Sepal Length Label
(4.296, 5.2] setosa        39  0.866667
             versicolor     5  0.111111
             virginica      1  0.022222
(5.2, 6.1]   setosa        11  0.220000
             versicolor    29  0.580000
             virginica     10  0.200000
(6.1, 7]     versicolor    16  0.372093
             virginica     27  0.627907
(7, 7.9]     virginica     12  1.000000
and then I execute ll = l.to_json(); the result is:
'{"0":{"["(4.296, 5.2]","setosa"]":39,"["(4.296, 5.2]","versicolor"]":5,"["(4.296, 5.2]","virginica"]":1,"["(5.2, 6.1]","setosa"]":11,"["(5.2, 6.1]","versicolor"]":29,"["(5.2, 6.1]","virginica"]":10,"["(6.1, 7]","versicolor"]":16,"["(6.1, 7]","virginica"]":27,"["(7, 7.9]","virginica"]":12},"1":{"["(4.296, 5.2]","setosa"]":0.8666666667,"["(4.296, 5.2]","versicolor"]":0.1111111111,"["(4.296, 5.2]","virginica"]":0.0222222222,"["(5.2, 6.1]","setosa"]":0.22,"["(5.2, 6.1]","versicolor"]":0.58,"["(5.2, 6.1]","virginica"]":0.2,"["(6.1, 7]","versicolor"]":0.3720930233,"["(6.1, 7]","virginica"]":0.6279069767,"["(7, 7.9]","virginica"]":1.0}}'
Then I try to read ll back with pd.read_json(ll), and it fails. The following is the message:
ValueError Traceback (most recent call last)
<ipython-input-574-4cb1a3ab2e3c> in <module>()
----> 1 pd.read_json(l1g)
/home/lv/anaconda3/lib/python3.6/site-packages/pandas/io/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
279 obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
280 keep_default_dates, numpy, precise_float,
--> 281 date_unit).parse()
282
283 if typ == 'series' or obj is None:
/home/lv/anaconda3/lib/python3.6/site-packages/pandas/io/json.py in parse(self)
347
348 else:
--> 349 self._parse_no_numpy()
350
351 if self.obj is None:
/home/lv/anaconda3/lib/python3.6/site-packages/pandas/io/json.py in _parse_no_numpy(self)
564 if orient == "columns":
565 self.obj = DataFrame(
--> 566 loads(json, precise_float=self.precise_float), dtype=None)
567 elif orient == "split":
568 decoded = dict((str(k), v)
ValueError: No ':' found when decoding object value
I want to preserve the MultiIndex structure of the data when saving, so how can I do that?
Can anyone help me? Thanks >.<

pandas JSON export and roundtrip still have difficulty with MultiIndex (according to this GitHub issue).
One way to solve this is to do a reset_index before the export and a set_index afterwards, as mentioned in the GitHub issue and this previous answer; a sketch follows below.
Did you try one of the other orient options (see the pandas documentation)?
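A minimal roundtrip sketch along those lines, assuming the index levels are named 'Sepal Length' and 'Label' as in the display above (the intervals are converted to plain strings first, since they don't survive JSON as Interval objects):

import pandas as pd

# Flatten the MultiIndex into ordinary columns before exporting.
flat = l.reset_index()
flat['Sepal Length'] = flat['Sepal Length'].astype(str)  # intervals -> strings
js = flat.to_json(orient='records')

# Rebuild the MultiIndex after importing.
l2 = pd.read_json(js).set_index(['Sepal Length', 'Label'])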

Related

Replace multiple "less than values" in different columns in pandas dataframe

I am working with Python and pandas. I have a dataset of lab analyses where I am dealing with multiple parameters and detection limits (dl). Many of the samples are reported as below the dl (e.g. <dl, <4).
For example:
import pandas as pd
df = pd.DataFrame([['<4', '88.72', '<0.09'], ['<1', '5', '<0.09'], ['2', '17.6', '<0.09']], columns=['var_1', 'var_2', 'var_3'])
df
My goal is to replace all <dl with dl/2 as a float value.
I can do this for one column pretty easily.
df['var_3'] = df.var_3.str.replace('<', '').astype(float)
df['var_3'] = df['var_3'].apply(lambda x: x/2 if x == 0.09 else x)
df
but this requires me to look at the dl and input it manually.
I would like to streamline this so it applies across all variables, with one or more detection limits per variable, since I have many variables and the detection limit will not always be constant from dataframe to dataframe.
I found something similar in R, but I am not sure how to apply it in Python. Any solutions would be appreciated.
Update
So the
df = df.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
works well with dataframes whose columns contain only numbers. I assume that is a limitation of the eval function. For some reason I can get the code to work on smaller dataframes, but after I concatenate them it will not work on the larger dataframe and I get this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/9_/w2qcdj_x2x5852py8xl6b0sh0000gn/T/ipykernel_9403/3946462310.py in <module>
----> 1 MS=MS.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8738 kwargs=kwargs,
8739 )
-> 8740 return op.apply()
8741
8742 def applymap(
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
686 return self.apply_raw()
687
--> 688 return self.apply_standard()
689
690 def agg(self):
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
810
811 def apply_standard(self):
--> 812 results, res_index = self.apply_series_generator()
813
814 # wrap results
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
826 for i, v in enumerate(series_gen):
827 # ignore SettingWithCopy here in case the user mutates
--> 828 results[i] = self.f(v)
829 if isinstance(results[i], ABCSeries):
830 # If we have a view on v, we need to make a copy because
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
351 eng = ENGINES[engine]
352 eng_inst = eng(parsed_expr)
--> 353 ret = eng_inst.evaluate()
354
355 if parsed_expr.assigner is None:
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in evaluate(self)
78
79 # make sure no names in resolvers and locals/globals clash
---> 80 res = self._evaluate()
81 return reconstruct_object(
82 self.result_type, res, self.aligned_axes, self.expr.terms.return_type
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in _evaluate(self)
119 scope = env.full_scope
120 _check_ne_builtin_clash(self.expr)
--> 121 return ne.evaluate(s, local_dict=scope)
122
123
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
821
822 # Create a signature
--> 823 signature = [(name, getType(arg)) for (name, arg) in
824 zip(names, arguments)]
825
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in <listcomp>(.0)
821
822 # Create a signature
--> 823 signature = [(name, getType(arg)) for (name, arg) in
824 zip(names, arguments)]
825
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in getType(a)
703 if kind == 'U':
704 raise ValueError('NumExpr 2 does not support Unicode as a dtype.')
--> 705 raise ValueError("unknown type %s" % a.dtype.name)
706
707
ValueError: unknown type object
Use replace instead of str.replace, then eval all the expressions:
>>> df.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5.00 0.045
2 2.0 17.60 0.045
\1 will be replaced by the first capture group (.*).
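If pd.eval keeps raising the numexpr "unknown type object" error on the bigger frame, here is a minimal eval-free sketch that may sidestep it (assuming every cell is a numeric string, optionally prefixed with '<'; the helper name half_below_dl is mine):

def half_below_dl(s):
    # Flag cells reported below the detection limit, e.g. '<4'.
    below = s.str.startswith('<')
    # Drop the '<' prefix and convert everything to float.
    vals = s.str.lstrip('<').astype(float)
    # Halve only the below-detection-limit values.
    return vals.where(~below, vals / 2)

>>> df.apply(half_below_dl)
   var_1  var_2  var_3
0    2.0  88.72  0.045
1    0.5   5.00  0.045
2    2.0  17.60  0.045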
Update
Alternative:
out = df.melt(ignore_index=False)
m = out['value'].str.startswith('<')
out.loc[m, 'value'] = out.loc[m, 'value'].str.strip('<').astype(float) / 2
out = out.reset_index().pivot('index', 'variable', 'value') \
.rename_axis(index=None, columns=None)
Output:
>>> out
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5 0.045
2 2 17.6 0.045
Update
Alternative using melt to flatten your dataframe and pivot_table to reshape it back to your original dataframe:
df1 = df.melt(ignore_index=False)
m = df1['value'].str.startswith('<')
df1['value'] = df1['value'].mask(~m).str[1:].astype(float).div(2) \
.fillna(df1['value']).astype(float)
df1 = df1.reset_index().pivot_table('value', 'index', 'variable') \
.rename_axis(index=None, columns=None)
Output:
>>> df1
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5.00 0.045
2 2.0 17.60 0.045

Error with pd.IntervalIndex.from_arrays during groupby merge

I have two dataframes. They look like this:
df_a
Framecount probability
0 0.0 [0.00019486549333333332, 4.883635666666667e-06...
1 1.0 [0.00104359155, 3.9232405e-05, 0.0015722045000...
2 2.0 [0.00048501002666666667, 1.668179e-05, 0.00052...
3 3.0 [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4 4.0 [0.0004808829, 5.389742e-05, 0.002522127933333...
.. ... ...
906 906.0 [1.677140566666667e-05, 1.1745095666666665e-06...
907 907.0 [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908 908.0 [8.1334184e-05, 0.00012675669636333335, 0.0028...
909 909.0 [0.00014893802999999998, 1.0407592500000001e-0...
910 910.0 [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...
And:
df_b
start stop
0 12.12 12.47
1 13.44 20.82
2 20.88 29.63
3 31.61 33.33
4 33.44 42.21
.. ... ...
228 880.44 887.92
229 888.63 892.07
230 892.13 895.30
231 895.31 900.99
232 907.58 908.35
I want to merge df_a.probability onto df_b where df_a.Framecount falls between df_b.start and df_b.stop. The aggregation statistic for df_a.probability should be the mean. Each df_a.probability entry is a NumPy array.
I am trying to adapt this code from a previous answer:
df_text.idx = pd.IntervalIndex.from_arrays(df_b['start'],df_b['stop'],closed='both')
df_a['Framecount'].apply(lambda x : df_b.iloc[df_b.idx.get_loc(x)]['probability'].mean())
but it breaks with the first line with this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-200-0c75f94bf6e2> in <module>
----> 1 df_textidx = pd.IntervalIndex.from_arrays(df_text['start'],df_text['stop'],closed='both')
2 df_vid['Framecount'].apply(lambda x : df_text.iloc[df_text.idx.get_loc(x)]['probability'])
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/interval.py in from_arrays(cls, left, right, closed, name, copy, dtype)
314 with rewrite_exception("IntervalArray", cls.__name__):
315 array = IntervalArray.from_arrays(
--> 316 left, right, closed, copy=copy, dtype=dtype
317 )
318 return cls._simple_new(array, name=name)
~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/interval.py in from_arrays(cls, left, right, closed, copy, dtype)
380
381 return cls._simple_new(
--> 382 left, right, closed, copy=copy, dtype=dtype, verify_integrity=True
383 )
384
~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
239 result._closed = closed
240 if verify_integrity:
--> 241 result._validate()
242 return result
243
~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/interval.py in _validate(self)
486 if not (self.left[left_mask] <= self.right[left_mask]).all():
487 msg = "left side of interval must be <= right side"
--> 488 raise ValueError(msg)
489
490 # ---------
ValueError: left side of interval must be <= right side
Why is this happening? IIUC the left side is <= right side...
All left-side values must be less than or equal to the right-side values, which isn't an issue with the test data provided.
For pd.IntervalIndex.from_arrays, the left and right parameters require a 1-d array.
.from_arrays(df_b['start'], df_b['stop']) passes pandas.Series objects, while df_b['start'].values passes a np.ndarray.
Either should work.
Also see pandas: Indexing with an IntervalIndex
import pandas as pd
# setup the test dataframe
data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58], 'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
df = pd.DataFrame(data)
# create the interval index by passing the column as np.array
idx = pd.IntervalIndex.from_arrays(df.start.values, df.stop.values, closed='both')
# display(idx)
IntervalIndex([[12.12, 12.47], [13.44, 20.82], [20.88, 29.63], [31.61, 33.33], [33.44, 42.21], [880.44, 887.92], [888.63, 892.07], [892.13, 895.3], [895.31, 900.99], [907.58, 908.35]],
closed='both',
dtype='interval[float64]')
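From here, one way to finish the merge from the question might be to look up each Framecount's interval with get_indexer and average the probability arrays per interval. A hedged sketch (assuming df_a and df_b exist as shown in the question, idx is built from df_b's columns, and the intervals don't overlap; variable names are mine):

import numpy as np

pos = idx.get_indexer(df_a['Framecount'])   # -1 where a frame falls in no interval
matched = df_a[pos >= 0].copy()
matched['interval'] = pos[pos >= 0]         # row position of df_b's interval

# Element-wise mean of the probability arrays within each interval.
prob_mean = matched.groupby('interval')['probability'].apply(
    lambda arrs: np.mean(np.stack(arrs), axis=0))

df_merged = df_b.join(prob_mean)            # align on df_b's row positions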

Possible bug with `xarray.Dataset.groupby()`?

I'm using Xarray version 0.8.0, Python 3.5.1, on Mac OS X El Capitan 10.11.6.
The following code works as expected.
id_data_array = xarray.DataArray([280, 306, 280], coords={"index": range(3)})
random = numpy.random.rand(3)
score_data_array = xarray.DataArray(random, coords={"index": range(3)})
score_dataset = xarray.Dataset({"id": id_data_array, "score": score_data_array})
print(score_dataset)
print("======")
print(score_dataset.groupby("id").count())
Output:
<xarray.Dataset>
Dimensions: (index: 3)
Coordinates:
* index (index) int64 0 1 2
Data variables:
id (index) int64 280 306 280
score (index) float64 0.8358 0.7536 0.9495
======
<xarray.Dataset>
Dimensions: (id: 2)
Coordinates:
* id (id) int64 280 306
Data variables:
score (id) int64 2 1
However, if I change just one little thing, to make the elements of id_data_array all distinct, then there is an error.
Code:
id_data_array = xarray.DataArray([280, 306, 120], coords={"index": range(3)})
random = numpy.random.rand(3)
score_data_array = xarray.DataArray(random, coords={"index": range(3)})
score_dataset = xarray.Dataset({"id": id_data_array, "score": score_data_array})
print(score_dataset)
print("======")
print(score_dataset.groupby("id").count())
Output:
<xarray.Dataset>
Dimensions: (index: 3)
Coordinates:
* index (index) int64 0 1 2
Data variables:
id (index) int64 280 306 120
score (index) float64 0.1353 0.0437 0.1687
======
---------------------------------------------------------------------------
InvalidIndexError Traceback (most recent call last)
<ipython-input-92-cc412270ba2e> in <module>()
5 print(score_dataset)
6 print("======")
----> 7 print(score_dataset.groupby("id").count())
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/common.py in wrapped_func(self, dim, keep_attrs, **kwargs)
44 return self.reduce(func, dim, keep_attrs,
45 numeric_only=numeric_only, allow_lazy=True,
---> 46 **kwargs)
47 return wrapped_func
48
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/groupby.py in reduce(self, func, dim, keep_attrs, **kwargs)
605 def reduce_dataset(ds):
606 return ds.reduce(func, dim, keep_attrs, **kwargs)
--> 607 return self.apply(reduce_dataset)
608
609 def assign(self, **kwargs):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/groupby.py in apply(self, func, **kwargs)
562 kwargs.pop('shortcut', None) # ignore shortcut if set (for now)
563 applied = (func(ds, **kwargs) for ds in self._iter_grouped())
--> 564 combined = self._concat(applied)
565 result = self._maybe_restore_empty_groups(combined)
566 return result
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/groupby.py in _concat(self, applied)
570 concat_dim, positions = self._infer_concat_args(applied_example)
571
--> 572 combined = concat(applied, concat_dim)
573 reordered = _maybe_reorder(combined, concat_dim, positions)
574 return reordered
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/combine.py in concat(objs, dim, data_vars, coords, compat, positions, indexers, mode, concat_over)
114 raise TypeError('can only concatenate xarray Dataset and DataArray '
115 'objects, got %s' % type(first_obj))
--> 116 return f(objs, dim, data_vars, coords, compat, positions)
117
118
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/combine.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions)
276 if coord is not None:
277 # add concat dimension last to ensure that its in the final Dataset
--> 278 result[coord.name] = coord
279
280 return result
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/dataset.py in __setitem__(self, key, value)
536 raise NotImplementedError('cannot yet use a dictionary as a key '
537 'to set Dataset values')
--> 538 self.update({key: value})
539
540 def __delitem__(self, key):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/dataset.py in update(self, other, inplace)
1434 dataset.
1435 """
-> 1436 variables, coord_names, dims = dataset_update_method(self, other)
1437
1438 return self._replace_vars_and_dims(variables, coord_names, dims,
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/merge.py in dataset_update_method(dataset, other)
490 priority_arg = 1
491 indexes = dataset.indexes
--> 492 return merge_core(objs, priority_arg=priority_arg, indexes=indexes)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/merge.py in merge_core(objs, compat, join, priority_arg, explicit_coords, indexes)
371
372 coerced = coerce_pandas_values(objs)
--> 373 aligned = deep_align(coerced, join=join, copy=False, indexes=indexes)
374 expanded = expand_variable_dicts(aligned)
375
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/alignment.py in deep_align(list_of_variable_maps, join, copy, indexes)
146 out.append(variables)
147
--> 148 aligned = partial_align(*targets, join=join, copy=copy, indexes=indexes)
149
150 for key, aligned_obj in zip(keys, aligned):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/alignment.py in partial_align(*objects, **kwargs)
109 valid_indexers = dict((k, v) for k, v in joined_indexes.items()
110 if k in obj.dims)
--> 111 result.append(obj.reindex(copy=copy, **valid_indexers))
112 return tuple(result)
113
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/dataset.py in reindex(self, indexers, method, tolerance, copy, **kw_indexers)
1216
1217 variables = alignment.reindex_variables(
-> 1218 self.variables, self.indexes, indexers, method, tolerance, copy=copy)
1219 return self._replace_vars_and_dims(variables)
1220
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/alignment.py in reindex_variables(variables, indexes, indexers, method, tolerance, copy)
218 target = utils.safe_cast_to_index(indexers[name])
219 indexer = index.get_indexer(target, method=method,
--> 220 **get_indexer_kwargs)
221
222 to_shape[name] = len(target)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
2080
2081 if not self.is_unique:
-> 2082 raise InvalidIndexError('Reindexing only valid with uniquely'
2083 ' valued Index objects')
2084
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To me this seems buggy, because if this is the desired behaviour then it would be very strange. Surely the case where all the elements of the DataArray we're grouping by are distinct should be supported?
Update
I've now uninstalled and reinstalled Xarray. The new Xarray is version 0.8.1, and it seems to work fine. So it may indeed have been a bug in Xarray 0.8.0.

Index must be called with a collection of some kind: assign column name to dataframe

I have reweightTarget as follows and I want to convert it to a pandas DataFrame. However, I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 't' was passed
If I remove columns='t', it works fine. Can anyone please explain what's going on?
reweightTarget
Trading dates
2004-01-31 4.35
2004-02-29 4.46
2004-03-31 4.44
2004-04-30 4.39
2004-05-31 4.50
2004-06-30 4.53
2004-07-31 4.63
2004-08-31 4.58
dtype: float64
pd.DataFrame(reweightTarget, columns='t')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-334-bf438351aaf2> in <module>()
----> 1 pd.DataFrame(reweightTarget, columns='t')
C:\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
253 else:
254 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 255 copy=copy)
256 elif isinstance(data, (list, types.GeneratorType)):
257 if isinstance(data, types.GeneratorType):
C:\Anaconda3\lib\site-packages\pandas\core\frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
421 raise_with_traceback(e)
422
--> 423 index, columns = _get_axes(*values.shape)
424 values = values.T
425
C:\Anaconda3\lib\site-packages\pandas\core\frame.py in _get_axes(N, K, index, columns)
388 columns = _default_index(K)
389 else:
--> 390 columns = _ensure_index(columns)
391 return index, columns
392
C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in _ensure_index(index_like, copy)
3407 index_like = copy(index_like)
3408
-> 3409 return Index(index_like)
3410
3411
C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
266 **kwargs)
267 elif data is None or lib.isscalar(data):
--> 268 cls._scalar_data_error(data)
269 else:
270 if (tupleize_cols and isinstance(data, list) and data and
C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in _scalar_data_error(cls, data)
481 raise TypeError('{0}(...) must be called with a collection of some '
482 'kind, {1} was passed'.format(cls.__name__,
--> 483 repr(data)))
484
485 @classmethod
TypeError: Index(...) must be called with a collection of some kind, 't' was passed
Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
columns : Index or array-like
Column labels to use for resulting frame. Will default to np.arange(n) if no column labels are provided
Example:
df3 = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
Try to use:
pd.DataFrame(reweightTarget, columns=['t'])
When you want to set an index or columns in a DataFrame, you must pass them as a collection, so either:
pd.DataFrame(reweightTarget, columns=['t'])
pd.DataFrame(reweightTarget, columns=list('t'))
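Since reweightTarget is a Series, Series.to_frame offers an equivalent shortcut here:

pd.DataFrame(reweightTarget, columns=['t'])   # a one-element list of labels
reweightTarget.to_frame(name='t')             # same result via Series.to_frame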

non-NDFFrame object error using pandas.SparseSeries.from_coo() function

I am trying to convert a COO-type sparse matrix (from scipy.sparse) to a pandas SparseSeries. The documentation (http://pandas.pydata.org/pandas-docs/stable/sparse.html) says to use SparseSeries.from_coo(A). This seems to be OK, but when I try to see the series' attributes, this is what happens.
10x10 seems OK.
import pandas as pd
import scipy.sparse as ss
import numpy as np
row = (np.random.random(10)*10).astype(int)
col = (np.random.random(10)*10).astype(int)
val = np.random.random(10)*10
sparse = ss.coo_matrix((val,(row,col)),shape=(10,10))
pss = pd.SparseSeries.from_coo(sparse)
print pss
0 7 1.416631
9 5.833902
1 0 4.131919
2 3 2.820531
7 2.227009
3 1 9.205619
4 4 8.309077
6 0 4.376921
7 6 8.444013
7 7.383886
dtype: float64
BlockIndex
Block locations: array([0])
Block lengths: array([10])
But not 100x100.
import pandas as pd
import scipy.sparse as ss
import numpy as np
row = (np.random.random(100)*100).astype(int)
col = (np.random.random(100)*100).astype(int)
val = np.random.random(100)*100
sparse = ss.coo_matrix((val,(row,col)),shape=(100,100))
pss = pd.SparseSeries.from_coo(sparse)
print pss
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-790-f0c22a601b93> in <module>()
7 sparse = ss.coo_matrix((val,(row,col)),shape=(100,100))
8 pss = pd.SparseSeries.from_coo(sparse)
----> 9 print pss
10
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\base.pyc in __str__(self)
45 if compat.PY3:
46 return self.__unicode__()
---> 47 return self.__bytes__()
48
49 def __bytes__(self):
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\base.pyc in __bytes__(self)
57
58 encoding = get_option("display.encoding")
---> 59 return self.__unicode__().encode(encoding, 'replace')
60
61 def __repr__(self):
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\sparse\series.pyc in __unicode__(self)
287 def __unicode__(self):
288 # currently, unicode is same as repr...fixes infinite loop
--> 289 series_rep = Series.__unicode__(self)
290 rep = '%s\n%s' % (series_rep, repr(self.sp_index))
291 return rep
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
895
896 self.to_string(buf=buf, name=self.name, dtype=self.dtype,
--> 897 max_rows=max_rows)
898 result = buf.getvalue()
899
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in to_string(self, buf, na_rep, float_format, header, length, dtype, name, max_rows)
960 the_repr = self._get_repr(float_format=float_format, na_rep=na_rep,
961 header=header, length=length, dtype=dtype,
--> 962 name=name, max_rows=max_rows)
963
964 # catch contract violations
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in _get_repr(self, name, header, length, dtype, na_rep, float_format, max_rows)
989 na_rep=na_rep,
990 float_format=float_format,
--> 991 max_rows=max_rows)
992 result = formatter.to_string()
993
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\format.pyc in __init__(self, series, buf, length, header, na_rep, name, float_format, dtype, max_rows)
145 self.dtype = dtype
146
--> 147 self._chk_truncate()
148
149 def _chk_truncate(self):
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\format.pyc in _chk_truncate(self)
158 else:
159 row_num = max_rows // 2
--> 160 series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
161 self.tr_row_num = row_num
162 self.tr_series = series
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
752 keys=keys, levels=levels, names=names,
753 verify_integrity=verify_integrity,
--> 754 copy=copy)
755 return op.get_result()
756
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
803 for obj in objs:
804 if not isinstance(obj, NDFrame):
--> 805 raise TypeError("cannot concatenate a non-NDFrame object")
806
807 # consolidate
TypeError: cannot concatenate a non-NDFrame object
I don't really understand the error message - I think I am following the example in the documentation to the letter, just using my own COO matrix (could it be the size?)
Regards
I have an older pandas. It has the sparse code, but not the from_coo/to_coo methods.
The pandas issue that has been filed in connection with this is:
https://github.com/pydata/pandas/issues/10818
But I found this in the pandas source on GitHub:
def _coo_to_sparse_series(A, dense_index=False):
""" Convert a scipy.sparse.coo_matrix to a SparseSeries.
Use the defaults given in the SparseSeries constructor. """
s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
s = s.sort_index()
s = s.to_sparse() # TODO: specify kind?
# ...
return s
With a smallish sparse matrix I construct and display without problems:
In [259]: Asml=sparse.coo_matrix(np.arange(10*5).reshape(10,5))
In [260]: s=pd.Series(Asml.data,pd.MultiIndex.from_arrays((Asml.row,Asml.col)))
In [261]: s=s.sort_index()
In [262]: s
Out[262]:
0 1 1
2 2
3 3
4 4
1 0 5
1 6
2 7
[... mine]
3 48
4 49
dtype: int32
In [263]: ssml=s.to_sparse()
In [264]: ssml
Out[264]:
0 1 1
2 2
3 3
4 4
1 0 5
[... mine]
2 47
3 48
4 49
dtype: int32
BlockIndex
Block locations: array([0])
Block lengths: array([49])
but with a larger array (more nonzero elements) I get a display error. I'm guessing it happens when the display for the (plain) series starts to use an ellipsis (...). I'm running in Py3, so I get a different error message.
....\pandas\core\base.pyc in __str__(self)
45 if compat.PY3:
46 return self.__unicode__() # py3
47 return self.__bytes__() # py2 route
e.g.:
In [265]: Asml=sparse.coo_matrix(np.arange(10*7).reshape(10,7))
In [266]: s=pd.Series(Asml.data,pd.MultiIndex.from_arrays((Asml.row,Asml.col)))
In [267]: s=s.sort_index()
In [268]: s
Out[268]:
0 1 1
2 2
3 3
4 4
5 5
6 6
1 0 7
1 8
2 9
3 10
4 11
5 12
6 13
2 0 14
1 15
...
7 6 55
8 0 56
1 57
[... mine]
Length: 69, dtype: int32
In [269]: ssml=s.to_sparse()
In [270]: ssml
Out[270]: <repr(<pandas.sparse.series.SparseSeries at 0xaff6bc0c>)
failed: AttributeError: 'SparseArray' object has no attribute '_get_repr'>
I'm not sufficiently familiar with pandas code and structures to deduce much more for now.
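Until this is fixed upstream, a minimal workaround sketch (assuming pss is the SparseSeries from the question) is to inspect the pieces without triggering the truncated repr:

print(pss.sp_index)             # the block index metadata prints fine on its own
print(pss.sp_values[:10])       # raw nonzero values as a plain NumPy array
print(pss.to_dense().head())    # densify first; a plain Series displays fine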
