Error with pd.IntervalIndex.from_arrays during groupby merge - python

I have two dataframes. They look like this:
df_a
Framecount probability
0 0.0 [0.00019486549333333332, 4.883635666666667e-06...
1 1.0 [0.00104359155, 3.9232405e-05, 0.0015722045000...
2 2.0 [0.00048501002666666667, 1.668179e-05, 0.00052...
3 3.0 [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4 4.0 [0.0004808829, 5.389742e-05, 0.002522127933333...
.. ... ...
906 906.0 [1.677140566666667e-05, 1.1745095666666665e-06...
907 907.0 [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908 908.0 [8.1334184e-05, 0.00012675669636333335, 0.0028...
909 909.0 [0.00014893802999999998, 1.0407592500000001e-0...
910 910.0 [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...
And:
df_b
start stop
0 12.12 12.47
1 13.44 20.82
2 20.88 29.63
3 31.61 33.33
4 33.44 42.21
.. ... ...
228 880.44 887.92
229 888.63 892.07
230 892.13 895.30
231 895.31 900.99
232 907.58 908.35
I want to merge df_a.probability onto df_b when df_a.Framecount is between df_b.start and df_b.stop. The aggregation statistic for df_a.probability should be the mean. Each entry of df_a.probability is a NumPy array.
I am trying to adapt the code from a previous answer:
df_text.idx = pd.IntervalIndex.from_arrays(df_b['start'],df_b['stop'],closed='both')
df_a['Framecount'].apply(lambda x : df_b.iloc[df_b.idx.get_loc(x)]['probability'].mean())
but it breaks with the first line with this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-200-0c75f94bf6e2> in <module>
----> 1 df_textidx = pd.IntervalIndex.from_arrays(df_text['start'],df_text['stop'],closed='both')
2 df_vid['Framecount'].apply(lambda x : df_text.iloc[df_text.idx.get_loc(x)]['probability'])
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/interval.py in from_arrays(cls, left, right, closed, name, copy, dtype)
314 with rewrite_exception("IntervalArray", cls.__name__):
315 array = IntervalArray.from_arrays(
--> 316 left, right, closed, copy=copy, dtype=dtype
317 )
318 return cls._simple_new(array, name=name)
~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/interval.py in from_arrays(cls, left, right, closed, copy, dtype)
380
381 return cls._simple_new(
--> 382 left, right, closed, copy=copy, dtype=dtype, verify_integrity=True
383 )
384
~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
239 result._closed = closed
240 if verify_integrity:
--> 241 result._validate()
242 return result
243
~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/interval.py in _validate(self)
486 if not (self.left[left_mask] <= self.right[left_mask]).all():
487 msg = "left side of interval must be <= right side"
--> 488 raise ValueError(msg)
489
490 # ---------
ValueError: left side of interval must be <= right side
Why is this happening? IIUC the left side is <= right side...

All left-side values must be less than or equal to the right-side values, which isn't an issue with the test data provided.
For pd.IntervalIndex.from_arrays, the left and right parameters require a 1-d array.
.from_arrays(df_b['start'], df_b['stop']) passes a pandas.Series.
df_b['start'].values passes a np.ndarray.
Either should work.
Also see pandas: Indexing with an IntervalIndex
import pandas as pd
# setup the test dataframe
data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58], 'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
df = pd.DataFrame(data)
# create the interval index by passing the column as np.array
idx = pd.IntervalIndex.from_arrays(df.start.values, df.stop.values, closed='both')
# display(idx)
IntervalIndex([[12.12, 12.47], [13.44, 20.82], [20.88, 29.63], [31.61, 33.33], [33.44, 42.21], [880.44, 887.92], [888.63, 892.07], [892.13, 895.3], [895.31, 900.99], [907.58, 908.35]],
closed='both',
dtype='interval[float64]')
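From here, a sketch of the remaining merge step (not part of the original answer; it assumes df_b keeps its default RangeIndex, the intervals don't overlap, and idx is built from df_b['start'] and df_b['stop'] as above):
import numpy as np
# map each frame to the row of df_b whose interval contains it (-1 = no match)
pos = idx.get_indexer(df_a['Framecount'])
matched = df_a.assign(interval=pos).query('interval >= 0')
# element-wise mean of the probability arrays that landed in each interval
means = matched.groupby('interval')['probability'].apply(
    lambda arrs: np.vstack(arrs).mean(axis=0))
df_b['probability'] = df_b.index.map(means)  # NaN where no frame matched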

Related

Unable to allocate 208. GiB for an array with shape (27939587241,) and data type int64?

This is my code:
play_count_with_title = pd.merge(df_count, df_small[['song_id', 'title', 'release']], on = 'song_id' )
final_ratings = pd.merge(play_count_with_title, df_small[['song_id', 'artist_name']], on = 'song_id' )
final_ratings
The error I got is:
Unable to allocate 208. GiB for an array with shape (27939587241,) and data type int64
The library code that raised this error is:
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:124, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
93 #Substitution("\nleft : DataFrame or named Series")
94 #Appender(_merge_doc, indents=0)
95 def merge(
(...)
108 validate: str | None = None,
109 ) -> DataFrame:
110 op = _MergeOperation(
111 left,
112 right,
(...)
122 validate=validate,
123 )
--> 124 return op.get_result(copy=copy)
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:773, in _MergeOperation.get_result(self, copy)
770 if self.indicator:
771 self.left, self.right = self._indicator_pre_merge(self.left, self.right)
--> 773 join_index, left_indexer, right_indexer = self._get_join_info()
775 result = self._reindex_and_concat(
776 join_index, left_indexer, right_indexer, copy=copy
777 )
778 result = result.__finalize__(self, method=self._merge_type)
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:1026, in _MergeOperation._get_join_info(self)
1022 join_index, right_indexer, left_indexer = _left_join_on_index(
1023 right_ax, left_ax, self.right_join_keys, sort=self.sort
1024 )
1025 else:
-> 1026 (left_indexer, right_indexer) = self._get_join_indexers()
1028 if self.right_index:
1029 if len(self.left) > 0:
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:1000, in _MergeOperation._get_join_indexers(self)
998 def _get_join_indexers(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]:
999 """return the join indexers"""
-> 1000 return get_join_indexers(
1001 self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
1002 )
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:1610, in get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1600 join_func = {
1601 "inner": libjoin.inner_join,
1602 "left": libjoin.left_outer_join,
(...)
1606 "outer": libjoin.full_outer_join,
1607 }[how]
1609 # error: Cannot call function of unknown type
-> 1610 return join_func(lkey, rkey, count, **kwargs)
File ~\anaconda3\lib\site-packages\pandas\_libs\join.pyx:48, in pandas._libs.join.inner_join()
As a beginner I don't understand the error. Can you help me out?
It's hard to know what's going on without a sample of your data. However, this looks like the sort of problem you'd see if there are a lot of duplicated values in both dataframes.
Note that if there are multiple rows which match during the merge, then every combination of left and right rows is emitted by the merge.
For example, here's a tiny example of a 3-element DataFrame being merged with itself. The result has 9 elements!
In [7]: df = pd.DataFrame({'a': [1,1,1], 'b': [1,2,3]})
In [8]: df.merge(df, 'left', on='a')
Out[8]:
a b_x b_y
0 1 1 1
1 1 1 2
2 1 1 3
3 1 2 1
4 1 2 2
5 1 2 3
6 1 3 1
7 1 3 2
8 1 3 3
If your song_id column has a lot of duplicates in it, then the number of elements could be as many as N^2, i.e. 154377**2 == 23832258129 in the worst case.
Try using drop_duplicates('song_id') on each of the merge inputs to see what happens in that case.
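A minimal sketch of that check, assuming df_count and df_small are the frames from the question (merging once, with the key deduplicated on both sides, also avoids the second blowup):
import pandas as pd
# deduplicate the join key before merging so the output row count stays bounded
left = df_count.drop_duplicates('song_id')
right = df_small[['song_id', 'title', 'release', 'artist_name']].drop_duplicates('song_id')
final_ratings = pd.merge(left, right, on='song_id')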

Replace multiple "less than values" in different columns in pandas dataframe

I am working with Python and pandas. I have a dataset of lab analyses where I am dealing with multiple parameters and detection limits (dl). Many of the samples are reported as below the dl (e.g. <dl, <4).
For example:
import pandas as pd
df=pd.DataFrame([['<4','88.72','<0.09'],['<1','5','<0.09'],['2','17.6','<0.09']], columns=['var_1','var_2','var_3'])
df
My goal is to replace all <dl with dl/2 as a float value.
I can do this for one column pretty easily.
df['var_3'] = df.var_3.str.replace('<', '').astype(float)
df['var_3'] = df['var_3'].apply(lambda x: x/2 if x == 0.09 else x)
df
but this requires me to look up the dl and input it manually.
I would like to streamline this so it applies across all variables, each with one or more detection limits, since I have many variables and the detection limit will not always be constant from dataframe to dataframe.
I found something similar in R but I'm not sure how to apply it in Python. Any solutions would be appreciated.
Update
So the
df=df.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
works well on a dataframe whose columns contain only numbers. I assume that is a limitation of the eval function. For some reason I can get the code to work on smaller dataframes, but after I concatenate them the code will not work on the larger dataframe and I get this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/9_/w2qcdj_x2x5852py8xl6b0sh0000gn/T/ipykernel_9403/3946462310.py in <module>
----> 1 MS=MS.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8738 kwargs=kwargs,
8739 )
-> 8740 return op.apply()
8741
8742 def applymap(
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
686 return self.apply_raw()
687
--> 688 return self.apply_standard()
689
690 def agg(self):
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
810
811 def apply_standard(self):
--> 812 results, res_index = self.apply_series_generator()
813
814 # wrap results
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
826 for i, v in enumerate(series_gen):
827 # ignore SettingWithCopy here in case the user mutates
--> 828 results[i] = self.f(v)
829 if isinstance(results[i], ABCSeries):
830 # If we have a view on v, we need to make a copy because
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
351 eng = ENGINES[engine]
352 eng_inst = eng(parsed_expr)
--> 353 ret = eng_inst.evaluate()
354
355 if parsed_expr.assigner is None:
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in evaluate(self)
78
79 # make sure no names in resolvers and locals/globals clash
---> 80 res = self._evaluate()
81 return reconstruct_object(
82 self.result_type, res, self.aligned_axes, self.expr.terms.return_type
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/engines.py in _evaluate(self)
119 scope = env.full_scope
120 _check_ne_builtin_clash(self.expr)
--> 121 return ne.evaluate(s, local_dict=scope)
122
123
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
821
822 # Create a signature
--> 823 signature = [(name, getType(arg)) for (name, arg) in
824 zip(names, arguments)]
825
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in <listcomp>(.0)
821
822 # Create a signature
--> 823 signature = [(name, getType(arg)) for (name, arg) in
824 zip(names, arguments)]
825
~/opt/anaconda3/lib/python3.9/site-packages/numexpr/necompiler.py in getType(a)
703 if kind == 'U':
704 raise ValueError('NumExpr 2 does not support Unicode as a dtype.')
--> 705 raise ValueError("unknown type %s" % a.dtype.name)
706
707
ValueError: unknown type object
Use replace instead of str.replace, then eval all expressions:
>>> df.replace(r'<(.*)', r'\1/2', regex=True).apply(pd.eval)
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5.00 0.045
2 2.0 17.60 0.045
\1 will be replaced by the first capture group (.*)
Update
Alternative:
out = df.melt(ignore_index=False)
m = out['value'].str.startswith('<')
out.loc[m, 'value'] = out.loc[m, 'value'].str.strip('<').astype(float) / 2
out = out.reset_index().pivot('index', 'variable', 'value') \
.rename_axis(index=None, columns=None)
Output:
>>> out
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5 0.045
2 2 17.6 0.045
Update
Alternative using melt to flatten your dataframe and pivot to reshape to your original dataframe:
df1 = df.melt(ignore_index=False)
m = df1['value'].str.startswith('<')
df1['value'] = df1['value'].mask(~m).str[1:].astype(float).div(2) \
.fillna(df1['value']).astype(float)
df1 = df1.reset_index().pivot_table('value', 'index', 'variable') \
.rename_axis(index=None, columns=None)
Output:
>>> df1
var_1 var_2 var_3
0 2.0 88.72 0.045
1 0.5 5.00 0.045
2 2.0 17.60 0.045
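If pd.eval keeps failing on the concatenated frame (the ValueError: unknown type object above comes from numexpr receiving object-dtype data), here is a sketch that avoids eval entirely, assuming every below-limit cell starts with '<':
import pandas as pd
df = pd.DataFrame([['<4', '88.72', '<0.09'],
                   ['<1', '5', '<0.09'],
                   ['2', '17.6', '<0.09']], columns=['var_1', 'var_2', 'var_3'])
def half_below_dl(col):
    below = col.str.startswith('<')            # flag '<dl' cells
    nums = pd.to_numeric(col.str.lstrip('<'))  # strip the marker, convert to float
    return nums.where(~below, nums / 2)        # halve only the flagged cells
out = df.apply(half_below_dl)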

Drop duplicates where some rows contain lists and others ints/strings

I have a dataframe where I want to drop rows that have duplicate IDs. For the most part, the IDs are ints and strings. Some of the ID entries, however, are lists of multiple IDs. I cannot split up these lists, but when trying to drop duplicates I get an error. For reference, I used df = df['ID'].astype(str) and it made no difference in the errors shown below.
Code for df:
d = {'ID': [999,
            123,
            'F41',
            '99W21',
            662,
            123,
            [552, 'F430', 'R111'],
            44482,
            'F41',
            ['M192', 5527, 7890, 111120]
            ]}
df = pd.DataFrame(data=d)
The input df ID column looks something like:
Index ID
-------------
0 999
1 123
2 F41
3 99W21
4 662
5 123
6 [552, F430, R111]
7 44482
8 F41
9 [M192, 5527, 7890, 111120]
And I would like to drop duplicates such that the output is:
Index ID
-------------
0 999
1 123
2 F41
3 99W21
4 662
5 [552, F430, R111]
6 44482
7 [M192, 5527, 7890, 111120]
I have tried df.drop_duplicates(subset=['ID'], inplace=True) which gives me the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-0186aa1e1043> in <module>
3 # Reset index and drop CID duplicates
----> 4 df.drop_duplicates(subset=['ID'], inplace=True)
5 df.reset_index(drop=True, inplace=True)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
4907
4908 inplace = validate_bool_kwarg(inplace, "inplace")
-> 4909 duplicated = self.duplicated(subset, keep=keep)
4910
4911 if inplace:
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in duplicated(self, subset, keep)
4967
4968 vals = (col.values for name, col in self.items() if name in subset)
-> 4969 labels, shape = map(list, zip(*map(f, vals)))
4970
4971 ids = get_group_index(labels, shape, sort=False, xnull=False)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in f(vals)
4945 def f(vals):
4946 labels, shape = algorithms.factorize(
-> 4947 vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
4948 )
4949 return labels.astype("i8", copy=False), len(shape)
/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
206 else:
207 kwargs[new_arg_name] = new_arg_value
--> 208 return func(*args, **kwargs)
209
210 return wrapper
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
670
671 labels, uniques = _factorize_array(
--> 672 values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
673 )
674
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
506 table = hash_klass(size_hint or len(values))
507 uniques, labels = table.factorize(
--> 508 values, na_sentinel=na_sentinel, na_value=na_value
509 )
510
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'list'
And also df = pd.DataFrame(np.unique(df), columns=df.columns), which gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-5b335a526fd5> in <module>
3 # Reset index and drop CID duplicates
----> 4 df = pd.DataFrame(np.unique(df), columns=df.columns)
5 df.reset_index(drop=True, inplace=True)
<__array_function__ internals> in unique(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'float' and 'str'
If there is a way around this, I am not sure what it is, so any help would be useful.
The unhashable type: 'list' error means pandas is trying to hash a list.
All of Python's immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are.
Try converting the column to string, dropping the duplicates, and changing it back to a dataframe:
df = df['ID'].astype(str).drop_duplicates().to_frame()
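Note that astype(str) also turns the list entries into plain strings. If the original (list) values need to survive, a variation that only uses the string form as the dedup key might look like this (a sketch; differently ordered lists still count as distinct):
# keep the first row for each stringified ID, preserving the original values
df = df[~df['ID'].astype(str).duplicated()].reset_index(drop=True)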

something wrong with dataframe.to_json()

My dataframe l looks like this:
0 1
Sepal Length Label
(4.296, 5.2] setosa 39 0.866667
versicolor 5 0.111111
virginica 1 0.022222
(5.2, 6.1] setosa 11 0.220000
versicolor 29 0.580000
virginica 10 0.200000
(6.1, 7] versicolor 16 0.372093
virginica 27 0.627907
(7, 7.9] virginica 12 1.000000
and then I execute ll = l.to_json(); the result is:
'{"0":{"["(4.296, 5.2]","setosa"]":39,"["(4.296, 5.2]","versicolor"]":5,"["(4.296, 5.2]","virginica"]":1,"["(5.2, 6.1]","setosa"]":11,"["(5.2, 6.1]","versicolor"]":29,"["(5.2, 6.1]","virginica"]":10,"["(6.1, 7]","versicolor"]":16,"["(6.1, 7]","virginica"]":27,"["(7, 7.9]","virginica"]":12},"1":{"["(4.296, 5.2]","setosa"]":0.8666666667,"["(4.296, 5.2]","versicolor"]":0.1111111111,"["(4.296, 5.2]","virginica"]":0.0222222222,"["(5.2, 6.1]","setosa"]":0.22,"["(5.2, 6.1]","versicolor"]":0.58,"["(5.2, 6.1]","virginica"]":0.2,"["(6.1, 7]","versicolor"]":0.3720930233,"["(6.1, 7]","virginica"]":0.6279069767,"["(7, 7.9]","virginica"]":1.0}}'
Then I try to read ll back with pd.read_json(ll) and it fails. The following is the message:
ValueError Traceback (most recent call last)
<ipython-input-574-4cb1a3ab2e3c> in <module>()
----> 1 pd.read_json(l1g)
/home/lv/anaconda3/lib/python3.6/site-packages/pandas/io/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
279 obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
280 keep_default_dates, numpy, precise_float,
--> 281 date_unit).parse()
282
283 if typ == 'series' or obj is None:
/home/lv/anaconda3/lib/python3.6/site-packages/pandas/io/json.py in parse(self)
347
348 else:
--> 349 self._parse_no_numpy()
350
351 if self.obj is None:
/home/lv/anaconda3/lib/python3.6/site-packages/pandas/io/json.py in _parse_no_numpy(self)
564 if orient == "columns":
565 self.obj = DataFrame(
--> 566 loads(json, precise_float=self.precise_float), dtype=None)
567 elif orient == "split":
568 decoded = dict((str(k), v)
ValueError: No ':' found when decoding object value
I want to preserve the MultiIndex structure when saving the data, so how can I do that?
Can anyone help me? Thanks >.<
Pandas JSON export and round trip still have difficulty with a MultiIndex (according to this GitHub issue).
One way to solve this would be to do a reset_index before the export and a set_index afterwards, as mentioned in the GitHub issue and this previous answer.
Did you try one of the other orient options (see the pandas documentation)?
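A sketch of that round trip, assuming the index levels are named 'Sepal Length' and 'Label' as in the display above (the interval level is stringified first, since Interval objects are not JSON serializable):
flat = l.reset_index()
flat['Sepal Length'] = flat['Sepal Length'].astype(str)  # intervals -> strings
ll = flat.to_json(orient='records')
restored = pd.read_json(ll).set_index(['Sepal Length', 'Label'])
Note that the numeric column labels 0 and 1 come back as the strings '0' and '1'.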

non-NDFFrame object error using pandas.SparseSeries.from_coo() function

I am trying to convert a COO-type sparse matrix (from scipy.sparse) to a pandas SparseSeries. The documentation (http://pandas.pydata.org/pandas-docs/stable/sparse.html) says to use the command SparseSeries.from_coo(A). This seems to be OK, but when I try to see the series' attributes, this is what happens.
10x10 seems OK.
import pandas as pd
import scipy.sparse as ss
import numpy as np
row = (np.random.random(10)*10).astype(int)
col = (np.random.random(10)*10).astype(int)
val = np.random.random(10)*10
sparse = ss.coo_matrix((val,(row,col)),shape=(10,10))
pss = pd.SparseSeries.from_coo(sparse)
print pss
0 7 1.416631
9 5.833902
1 0 4.131919
2 3 2.820531
7 2.227009
3 1 9.205619
4 4 8.309077
6 0 4.376921
7 6 8.444013
7 7.383886
dtype: float64
BlockIndex
Block locations: array([0])
Block lengths: array([10])
But not 100x100.
import pandas as pd
import scipy.sparse as ss
import numpy as np
row = (np.random.random(100)*100).astype(int)
col = (np.random.random(100)*100).astype(int)
val = np.random.random(100)*100
sparse = ss.coo_matrix((val,(row,col)),shape=(100,100))
pss = pd.SparseSeries.from_coo(sparse)
print pss
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-790-f0c22a601b93> in <module>()
7 sparse = ss.coo_matrix((val,(row,col)),shape=(100,100))
8 pss = pd.SparseSeries.from_coo(sparse)
----> 9 print pss
10
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\base.pyc in __str__(self)
45 if compat.PY3:
46 return self.__unicode__()
---> 47 return self.__bytes__()
48
49 def __bytes__(self):
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\base.pyc in __bytes__(self)
57
58 encoding = get_option("display.encoding")
---> 59 return self.__unicode__().encode(encoding, 'replace')
60
61 def __repr__(self):
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\sparse\series.pyc in __unicode__(self)
287 def __unicode__(self):
288 # currently, unicode is same as repr...fixes infinite loop
--> 289 series_rep = Series.__unicode__(self)
290 rep = '%s\n%s' % (series_rep, repr(self.sp_index))
291 return rep
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
895
896 self.to_string(buf=buf, name=self.name, dtype=self.dtype,
--> 897 max_rows=max_rows)
898 result = buf.getvalue()
899
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in to_string(self, buf, na_rep, float_format, header, length, dtype, name, max_rows)
960 the_repr = self._get_repr(float_format=float_format, na_rep=na_rep,
961 header=header, length=length, dtype=dtype,
--> 962 name=name, max_rows=max_rows)
963
964 # catch contract violations
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in _get_repr(self, name, header, length, dtype, na_rep, float_format, max_rows)
989 na_rep=na_rep,
990 float_format=float_format,
--> 991 max_rows=max_rows)
992 result = formatter.to_string()
993
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\format.pyc in __init__(self, series, buf, length, header, na_rep, name, float_format, dtype, max_rows)
145 self.dtype = dtype
146
--> 147 self._chk_truncate()
148
149 def _chk_truncate(self):
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\format.pyc in _chk_truncate(self)
158 else:
159 row_num = max_rows // 2
--> 160 series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
161 self.tr_row_num = row_num
162 self.tr_series = series
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
752 keys=keys, levels=levels, names=names,
753 verify_integrity=verify_integrity,
--> 754 copy=copy)
755 return op.get_result()
756
C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
803 for obj in objs:
804 if not isinstance(obj, NDFrame):
--> 805 raise TypeError("cannot concatenate a non-NDFrame object")
806
807 # consolidate
TypeError: cannot concatenate a non-NDFrame object
I don't really understand the error message - I think I am following the example in the documentation to the letter, just using my own COO matrix (could it be the size?)
Regards
I have an older pandas. It has the sparse code, but not the from_coo conversion.
The pandas issue that has been filed in connection with this is:
https://github.com/pydata/pandas/issues/10818
But I found this on GitHub:
def _coo_to_sparse_series(A, dense_index=False):
""" Convert a scipy.sparse.coo_matrix to a SparseSeries.
Use the defaults given in the SparseSeries constructor. """
s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
s = s.sort_index()
s = s.to_sparse() # TODO: specify kind?
# ...
return s
With a smallish sparse matrix I construct and display without problems:
In [259]: Asml=sparse.coo_matrix(np.arange(10*5).reshape(10,5))
In [260]: s=pd.Series(Asml.data,pd.MultiIndex.from_arrays((Asml.row,Asml.col)))
In [261]: s=s.sort_index()
In [262]: s
Out[262]:
0 1 1
2 2
3 3
4 4
1 0 5
1 6
2 7
[... mine]
3 48
4 49
dtype: int32
In [263]: ssml=s.to_sparse()
In [264]: ssml
Out[264]:
0 1 1
2 2
3 3
4 4
1 0 5
[... mine]
2 47
3 48
4 49
dtype: int32
BlockIndex
Block locations: array([0])
Block lengths: array([49])
but with a larger array (more nonzero elements) I get a display error. I'm guessing it happens when the display for the (plain) series starts to use an ellipsis (...). I'm running in Py3, so I get a different error message.
....\pandas\core\base.pyc in __str__(self)
45 if compat.PY3:
46 return self.__unicode__() # py3
47 return self.__bytes__() # py2 route
e.g.:
In [265]: Asml=sparse.coo_matrix(np.arange(10*7).reshape(10,7))
In [266]: s=pd.Series(Asml.data,pd.MultiIndex.from_arrays((Asml.row,Asml.col)))
In [267]: s=s.sort_index()
In [268]: s
Out[268]:
0 1 1
2 2
3 3
4 4
5 5
6 6
1 0 7
1 8
2 9
3 10
4 11
5 12
6 13
2 0 14
1 15
...
7 6 55
8 0 56
1 57
[... mine]
Length: 69, dtype: int32
In [269]: ssml=s.to_sparse()
In [270]: ssml
Out[270]: <repr(<pandas.sparse.series.SparseSeries at 0xaff6bc0c>)
failed: AttributeError: 'SparseArray' object has no attribute '_get_repr'>
I'm not sufficiently familiar with pandas code and structures to deduce much more for now.
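If the guess about the ellipsis is right, one possible workaround (an assumption on my part, not verified against that pandas version) is to keep the repr from truncating, so the failing concat path is never hit:
import pandas as pd
# raise the row limit above the series length so the repr never truncates
pd.set_option('display.max_rows', 200)
print(pss)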
