Can an empty MultiIndex be used to slice a DataFrame? - python

I have a pandas MultiIndex that I'm trying to use to slice a DataFrame. When the MultiIndex is empty, it results in a ValueError:
(Pdb) p _ix
MultiIndex(levels=[[], [], [], []],
labels=[[], [], [], []],
names=['foo', 'bar', 'baz', 'raz'])
(Pdb) p df.index
MultiIndex(levels=[['adni'], ['123', '234'], ['M12_s1', 'M24_s1'], ['CRB', 'CRB_crop', 'PON']],
labels=[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]],
names=['foo', 'bar', 'baz', 'raz'])
(Pdb) p df.loc[_ix]
*** ValueError: operands could not be broadcast together with shapes (0,) (4,) (0,)
The names of the two indices match, so from what I understand this slicing should be fine.
In cases where _ix isn't empty, this works as expected. I can't find anything in the documentation that says slicing with an empty index like this isn't supported, though. Am I missing something obvious?
Edit: adding concrete example:
$ cat so_qc.csv
foo,bar,baz,raz,qc
pd,andrew,M24_s1,CRB,True
pd,andrew,M24_s1,CRB_crop,True
$ cat so_df.csv
foo,bar,baz,raz,value
pd,andrew,M24_s1,CRB,0.701794977111406
pd,andrew,M24_s1,CRB,0.309406238674409
$ python
import pandas as pd
qc = pd.read_csv('so_qc.csv', index_col=[0,1,2,3], squeeze=True)
df = pd.read_csv('so_df.csv', index_col=[0,1,2,3])
# This is OK
df.loc[ qc.index.intersection(df.index) ]
# When I select only False elements from `qc` (which is none of them), ValueError about broadcasting
df.loc[ qc[~qc].index.intersection(df.index) ]
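For what it's worth, here is a minimal sketch of one possible guard around the failing case (my own assumption, not something from the original post): check whether the intersected MultiIndex is empty before handing it to .loc.
import pandas as pd

# Hypothetical workaround (an assumption, not a pandas fix): skip .loc when
# the intersected MultiIndex is empty and fall back to an empty frame that
# keeps the same columns.
qc = pd.read_csv('so_qc.csv', index_col=[0, 1, 2, 3], squeeze=True)
df = pd.read_csv('so_df.csv', index_col=[0, 1, 2, 3])

ix = qc[~qc].index.intersection(df.index)
subset = df.loc[ix] if len(ix) > 0 else df.iloc[0:0]  # empty frame, same columns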

Related

Complex index numpy array or indexing dataframe

I have an array (dataframe) with shape 9800, 9800. I need to index it (without labels) like:
x = (9800,9800)
a = x[0:7000,0:7000] (plus) x[7201:9800, 0:7000] (plus) x[0:7000, 7201:9800] (plus) x[7201:9800, 7201:9800]
b = x[7000:7200, 7000:7200]
c = x[7000:7200, 0:7000] (plus) x[7000:7200, 7201:9800]
d = x[0:7000, 7000:7200] (plus) x[7201:9800, 7000:7200]
What I mean by plus is not a proper addition but more like a concatenation: putting the resulting dataframes together one next to the other. See attached image.
Is there any "easy" way of doing this? I need to do this for 10,000 dataframes and add them up individually to save memory.
You have np.r_, which basically creates an index array for you, for example:
np.r_[:3,4:6]
gives
array([0, 1, 2, 4, 5])
So in your case:
a_idx = np.r_[0:7000, 7201:9800]
a = x[np.ix_(a_idx, a_idx)]  # np.ix_ turns the two index arrays into an open mesh, so a 2D block is selected
c = x[7000:7200, a_idx]
In [167]: x=np.zeros((9800,9800),'int8')
The first list of slices:
In [168]: a = [x[0:7000,0:7000], x[7201:9800, 0:7000],x[0:7000, 7201:9800], x[7201:9800, 7201:9800]]
and their shapes:
In [169]: [i.shape for i in a]
Out[169]: [(7000, 7000), (2599, 7000), (7000, 2599), (2599, 2599)]
Since the shapes vary, you can't simply concatenate them all:
In [170]: np.concatenate(a, axis=0)
Traceback (most recent call last):
File "<ipython-input-170-c111dc665509>", line 1, in <module>
np.concatenate(a, axis=0)
File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 7000 and the array at index 2 has size 2599
In [171]: np.concatenate(a, axis=1)
Traceback (most recent call last):
File "<ipython-input-171-227af3749524>", line 1, in <module>
np.concatenate(a, axis=1)
File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 7000 and the array at index 1 has size 2599
You can concatenate subsets:
In [172]: np.concatenate(a[:2], axis=0)
Out[172]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int8)
In [173]: _.shape
Out[173]: (9599, 7000)
I won't take the time to construct the other lists, but it looks like you could construct the first column with
np.concatenate([a[0], c[0], a[1]], axis=0)
similarly for the other columns, and then concatenate columns. Or join them by rows first.
np.block([[a[0],d[0],a[2]],[....]]) with an appropriate mix of list elements should do the same (just a difference in notation, same concatenation work).
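For illustration, a small sketch of the np.block route on a tiny stand-in array (the 6x6 array and 2x2 corner blocks are made up for brevity; the same pattern applies to the 9800x9800 case):
import numpy as np

x = np.arange(36).reshape(6, 6)

# Corner pieces analogous to the four "a" slices in the question.
a00, a01 = x[0:2, 0:2], x[0:2, 4:6]   # top-left, top-right
a10, a11 = x[4:6, 0:2], x[4:6, 4:6]   # bottom-left, bottom-right

# Inner lists are joined horizontally, then the rows of blocks vertically.
a = np.block([[a00, a01],
              [a10, a11]])
print(a.shape)  # (4, 4)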

Python: Mapping between two arrays with an index array

I have a numpy array
src = np.random.rand(320,240)
and another numpy array idx of size (2 x (320*240)). Each column of idx indexes an entry in a result array dst, e.g., idx[:,20] = [3,10] references row 3, column 10 in dst and the assumption is that 20 corresponds to the flattened index of src, i.e., idx establishes a mapping between the entries of src and dst. Assuming dst is initialized with all zeros, how can I copy the entries in src to their destination in dst without a loop?
Here is the canonical way of doing it:
>>> import numpy as np
>>>
>>> src = np.random.rand(4, 3)
>>> src
array([[0.0309325 , 0.72261479, 0.98373595],
[0.06357406, 0.44763809, 0.45116039],
[0.63992938, 0.6445605 , 0.01267776],
[0.76084312, 0.61888759, 0.2138713 ]])
>>>
>>> idx = np.indices(src.shape).reshape(2, -1)
>>> np.random.shuffle(idx.T)
>>> idx
array([[3, 3, 0, 1, 0, 3, 1, 1, 2, 2, 2, 0],
[1, 2, 2, 0, 1, 0, 1, 2, 2, 1, 0, 0]])
>>>
>>> dst = np.empty_like(src)
>>> dst[tuple(idx)] = src.ravel()
>>> dst
array([[0.2138713 , 0.44763809, 0.98373595],
[0.06357406, 0.63992938, 0.6445605 ],
[0.61888759, 0.76084312, 0.01267776],
[0.45116039, 0.0309325 , 0.72261479]])
If you can't be sure that idx is a proper shuffle it's a bit safer to use np.full with a fill value that does not appear in src instead of np.empty.
>>> dst = np.full_like(src, np.nan)
>>> dst[tuple(idx)] = src.ravel()
>>>
>>> dst
array([[0.27020869, 0.71216066, nan],
[0.63812283, 0.69151451, 0.65843901],
[ nan, 0.02406174, 0.47543061],
[0.05650845, nan, nan]])
If you spot the fill value in dst, something is wrong with idx.
You can try:
dst[idx[0, :], idx[1, :]] = src.flat
In [33]: src = np.random.randn(2, 3)
In [34]: src
Out[34]:
array([[ 0.68636938, 0.60275041, 1.26078727],
[ 1.17937849, -1.0369404 , 0.42847611]])
In [35]: dst = np.zeros_like(src)
In [37]: idx = np.array([[0, 1, 0, 1, 0, 0], [1, 2, 0, 1, 2, 0]])
In [38]: dst[idx[0, :], idx[1, :]] = src.flat
In [39]: dst
Out[39]:
array([[ 0.42847611, 0.68636938, -1.0369404 ],
[ 0. , 1.17937849, 0.60275041]])
dst[0, 1] is src[0, 0], etc.
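A quick check of that mapping (small made-up arrays, just to make the bookkeeping visible): each column k of idx says where the k-th element of src, taken in flat order, lands in dst.
import numpy as np

src = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])
idx = np.array([[0, 1, 0, 1, 0, 0],
                [1, 2, 0, 1, 2, 0]])

dst = np.zeros_like(src)
dst[idx[0, :], idx[1, :]] = src.flat

assert dst[0, 1] == src[0, 0]  # flat index 0 is routed to (0, 1)
assert dst[1, 2] == src[0, 1]  # flat index 1 is routed to (1, 2)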

Assigning particular elements of DataArray based on another

I am having trouble figuring out some basic usage patterns of xarray. Here's something that I used to be able to do easily in numpy: (setting elements where a particular condition is satisfied in another array)
import numpy as np
q_index = np.array([
    [0, 1, 2, 3, 4, 5],
    [1, 5, 3, 2, 0, 4],
])
# any element not yet specified
q_kinds = np.full_like(q_index, 'other', dtype=object)
# any element with q-index 0 should be classified as 'gamma'
q_kinds[q_index == 0] = 'gamma'
# q_kinds is now:
# [['gamma' 'other' 'other' 'other' 'other' 'other']
# ['other' 'other' 'other' 'other' 'gamma' 'other']]
# afterwards I do some other things to fill in some (but not all)
# of the 'other' elements with different labels
But I don't see any reasonable way to do this masked assignment in xarray:
import xarray as xr
ds = xr.Dataset()
ds.coords['q-index'] = (['layer', 'q'], [
    [0, 1, 2, 3, 4, 5],
    [1, 5, 3, 2, 0, 4],
])
ds['q-kinds'] = xr.full_like(ds.coords['q-index'], 'other', dtype=object)
# any element with q-index == 0 should be classified as 'gamma'
# Attempt 1:
# 'IndexError: 2-dimensional boolean indexing is not supported.'
ds['q-kinds'][ds.coords['q-index'] == 0] = 'gamma'
# Attempt 2:
# Under 'More advanced indexing', the docs show that you can
# use isel with DataArrays to do pointwise indexing, but...
ds['q-kinds'].isel(
    # ...I don't know how to compute these index arrays from q-index...
    layer=xr.DataArray([1, 0]),
    q=xr.DataArray([5, 0]),
    # ...and the docs also clearly state that isel does not support mutation.
)[...] = 'gamma'  # FIXME ineffective
"xy-problem" style answers are okay. It seems to me that maybe the way you're supposed to build an array like this is to start with an array that (somehow) describes just the 'gamma' elements (and likewise an array for each other classification), use the immutable APIs to (somehow) merge/combine them, do something to make sure the data is dense along the q dimension, and then .fillna('other'). Or something like that. I really don't know.
You're very close! Instead of boolean indexing, you can use xarray.where() with three arguments:
>>> xr.where(ds.coords['q-index'] == 0, 'gamma', ds['q-kinds'])
<xarray.DataArray (layer: 2, q: 6)>
array([['gamma', 'other', 'other', 'other', 'other', 'other'],
       ['other', 'other', 'other', 'other', 'gamma', 'other']], dtype=object)
Coordinates:
q-index (layer, q) int64 0 1 2 3 4 5 1 5 3 2 0 4
Dimensions without coordinates: layer, q
Or equivalently, instead of using .isel() for assignment, you can use a dictionary inside [], e.g.,
>>> indexer = dict(layer=xr.DataArray([1, 0]), q=xr.DataArray([5, 0]))
>>> ds['q-kinds'][indexer] = 'gamma'
Note that it's important to create the DataArray objects explicitly inside the dictionary, because that way both indexers are created with the same new dimension name dim_0:
>>> indexer
{'layer': <xarray.DataArray (dim_0: 2)>
array([1, 0])
Dimensions without coordinates: dim_0, 'q': <xarray.DataArray (dim_0: 2)>
array([5, 0])
Dimensions without coordinates: dim_0}
If you pass lists or 1D numpy arrays directly, they are assumed to index along independent dimensions, so you would end up with "outer" style indexing instead:
>>> indexer = dict(layer=[1, 0], q=[5, 0])
>>> ds['q-kinds'][indexer] = 'gamma'
>>> ds['q-kinds']
<xarray.DataArray 'q-kinds' (layer: 2, q: 6)>
array([['gamma', 'other', 'other', 'other', 'other', 'gamma'],
['gamma', 'other', 'other', 'other', 'other', 'gamma']], dtype=object)
Coordinates:
q-index (layer, q) int64 0 1 2 3 4 5 1 5 3 2 0 4
Dimensions without coordinates: layer, q
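Building on the dictionary-indexer form above, here is a self-contained sketch of computing those index arrays directly from q-index (this part is my own addition, not from the original answer), which addresses the "I don't know how to compute these index arrays" comment in the question:
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds.coords['q-index'] = (['layer', 'q'], [[0, 1, 2, 3, 4, 5],
                                         [1, 5, 3, 2, 0, 4]])
ds['q-kinds'] = xr.full_like(ds.coords['q-index'], 'other', dtype=object)

# np.nonzero gives the (layer, q) positions where the condition holds;
# wrapping them in DataArrays makes the assignment pointwise (shared dim_0).
layer_idx, q_idx = np.nonzero((ds.coords['q-index'] == 0).values)
indexer = dict(layer=xr.DataArray(layer_idx), q=xr.DataArray(q_idx))
ds['q-kinds'][indexer] = 'gamma'
print(ds['q-kinds'].values)
# [['gamma' 'other' 'other' 'other' 'other' 'other']
#  ['other' 'other' 'other' 'other' 'gamma' 'other']]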

Counting Gene segments in python and print them in columns

I need to convert a text file into species and counts of gene segments. For this I wanted to create a dictionary whose keys I find with a pattern. Every key should have 3 items (counts) starting at 0. With other patterns, I want to look for the gene segments, and whenever one is found, increase the corresponding count.
I'm searching for 3 different gene segments, which is why I only want to increase item1, item2 or item3. Is there a way to do this with Python?
This is the code I've written so far, but I don't know how to continue.
matrix = {}
pattern = re.compile(r"[A-Za-z ]*")
pattern_v = re.compile(r";[A_Z]+V[0-9]?;")
pattern_d = re.compile(r";[A_Z]+D[0-9]?;")
pattern_j = re.compile(r";[A_Z]+J[0-9]?;")
for i in file.readlines():
    name = pattern.search(i)
    if pattern_v.search:
        if name.group() not in matrix:
            matrix.update(name.group(), (1,0,0))
        else:
            matrix[(name.group()[0]] = matrix[(name.group()[0]]+1
...
As you can see, if pattern_v is found, I want to increase the item at position zero.
I know that the last line doesn't work; I just wrote it to explain what I want to do.
EDIT: I got the algorithm working, but now I have the problem that I can't print it the way I want.
{'Mus cookii': [0, 0, 0], 'Ovis aries': [0, 7, 9], 'Camelus dromedarius': [2, 0, 0], 'Danio rerio': [1, 1, 5], 'Mus saxicola': [0, 0, 0], 'Homo sapiens': [21, 6, 33], 'Rattus norvegicus': [0, 1, 12], 'Sus scrofa': [0, 5, 13], 'Vicugna pacos': [0, 9, 7], 'Macaca nemestrina': [0, 0, 0], 'Mus spretus': [4, 0, 2], 'Mus musculus': [30, 5, 28], 'Mus minutoides': [0, 0, 0], 'Oncorhynchus mykiss': [0, 11, 16], 'Canis lupus familiaris': [4, 2, 0], 'Bos taurus': [2, 5, 12], 'Cercocebus atys': [0, 0, 0], 'Oryctolagus cuniculus': [0, 0, 10], 'Rattus rattus': [0, 0, 0], 'Ornithorhynchus anatinus': [0, 4, 9], 'Macaca mulatta': [1, 3, 16], 'Papio anubis anubis': [0, 0, 0], 'Macaca fascicularis': [0, 0, 0], 'Mus pahari': [0, 0, 0]}
is the output, but I need to make it easier to read. The idea is to produce output with columns (name, v, d, j). I tried:
def printStatistics(dict):
    for i in range(0, len(dict)):
        print(" {0:30s}{1:30d}{2:30d}{3:30d}".format(dict[i], dict[i][0], dict[i][1], dict[i][2]), sep="")
but I get
"TypeError: non-empty format string passed to object.format"
You can make your algorithm work with collections.defaultdict:
input data
import re
from collections import defaultdict
import numpy as np
data= '''Bos taurus;TRGV8-1;F;Bos taurus T cell receptor gamma variable 8-1;1;4;4q3.1;AY644517;-;
Bos taurus;TRGV8-2;(F) F;Bos taurus T cell receptor gamma variable 8-2;2;4;4q3.1;AY644517;-;
Camelus dromedarius;TRDV1S3;F;Camelus dromedarius T cell receptor delta variable 1S3;1;-;-;FN298223;-;
Camelus dromedarius;TRDV1S4;F;Camelus dromedarius T cell receptor delta variable 1S4;2;-;-;FN298224;-;
Canis lupus familiaris;TRBD2;F;Canis lupus familiaris T cell receptor beta diversity 2;1;16;-;HE653929;-;'''
patterns = [
re.compile(r"TR.V"),
re.compile(r"TR.D"),
re.compile(r"TR.J")
]
result = defaultdict(lambda:np.array([0,0,0]))
script
for line in data.splitlines():
    result[line.split(';')[0]] += np.array([len(pattern.findall(line)) for pattern in patterns])
print(result)
output
defaultdict(<function <lambda> at 0x7f622f81c140>, {'Camelus dromedarius': array([2, 0, 0]), 'Canis lupus familiaris': array([0, 1, 0]), 'Bos taurus': array([2, 0, 0])})
defaultdict works like a dictionary, but every missing key is initialized with a callable of your choice. lambda: [0,0,0] gives you the ability to immediately increment the group occurrences instead of having to do an update and then an increment.
I decided to work with numpy arrays because they support vector-like addition, which makes the algorithm prettier; you could also do it without numpy.
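For completeness, a small sketch of the same counting loop without numpy (this assumes the data and patterns variables defined above; plain lists are mutated in place instead of adding arrays):
from collections import defaultdict

result = defaultdict(lambda: [0, 0, 0])
for line in data.splitlines():
    counts = result[line.split(';')[0]]
    for pos, pattern in enumerate(patterns):
        counts[pos] += len(pattern.findall(line))
print(dict(result))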
Found a solution now with defaultdict:
def find_name(file):
    gene_count = defaultdict(lambda: [0, 0, 0])
    pattern = re.compile(r"[A-Za-z ]*")
    pattern_v = re.compile(r"\;[A-Z]+V[0-9]?\;")
    pattern_d = re.compile(r"\;[A-Z]+D[0-9]?\;")
    pattern_j = re.compile(r"\;[A-Z]+J[0-9]?\;")
    for i in file.readlines():
        name = pattern.search(i)
        name = name.group()
        if name not in gene_count and name != "Species":
            gene_count.update({name: [0, 0, 0]})
        if pattern_v.search(i):
            gene_count[name][0] += 1
        elif pattern_d.search(i):
            gene_count[name][1] += 1
        elif pattern_j.search(i):
            gene_count[name][2] += 1
    return gene_count
PRINTING:
def printStatistics(dict):
    print(" {0:<30s}{1:<15s}{2:<15s}{3:<15s}".format("Species", "V Count", "D Count", "J Count"), sep="")
    for item in dict:
        print(" {0:<30s}{1:<15d}{2:<15d}{3:<15d}".format(item, dict[item][0], dict[item][1], dict[item][2]), sep="")
Thx 4 help!

Completely remove one index label from a multiindex, in a dataframe

Given I have this multiindexed dataframe:
>>> import pandas as p
>>> import numpy as np
...
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
... np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
...
>>> s = p.Series(np.random.randn(6), index=arrays)
>>> s
bar one -1.046752
two 2.035839
baz one 1.192775
two 1.774266
foo one -1.716643
two 1.158605
dtype: float64
How should I eliminate the index label bar?
I tried with drop
>>> s1 = s.drop('bar')
>>> s1
baz one 1.192775
two 1.774266
foo one -1.716643
two 1.158605
dtype: float64
Seems OK but bar is still there in some bizarre way:
>>> s1.index
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1['bar']
Series([], dtype: float64)
>>>
How can I get rid of any residue of this index label?
Definitely looks like a bug.
s1.index.tolist() returns the expected value without "bar".
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
s1["bar"] returns a null Series.
>>> s1["bar"]
Series([], dtype: float64)
The standard methods to override this don't seem to work either:
>>> del s1["bar"]
>>> s1["bar"]
Series([], dtype: float64)
>>> s1.__delitem__("bar")
>>> s1["bar"]
Series([], dtype: float64)
However, as expected, trying to grab a key that was never in the index raises a KeyError:
>>> s1["booz"]
... KeyError: 'booz'
The main difference shows up when you actually look at the source code for the two accessors in pandas.core.index.py:
class MultiIndex(Index):
    ...
    def _get_levels(self):
        return self._levels
    ...
    def _get_labels(self):
        return self._labels

    # ops compat
    def tolist(self):
        """
        return a list of the Index values
        """
        return list(self.values)
So index.tolist() and _labels aren't accessing the same piece of shared information; in fact, they aren't even close.
So, we can use this to manually update the resulting indexer.
>>> s1.index.labels
FrozenList([[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
>>> s1.index.values
array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
If we compare this to the initial multindexed index, we get
>>> s.index.labels
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
>>> s.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
So the _levels attribute isn't updated, while values is.
EDIT: Overriding it wasn't as easy as I thought.
EDIT: Wrote a custom function to fix this behavior
from pandas.core.base import FrozenList, FrozenNDArray
def drop(series, level, index_name):
    # make new tmp series
    new_series = series.drop(index_name)
    # grab all indexing labels, levels, attributes
    levels = new_series.index.levels
    labels = new_series.index.labels
    index_pos = levels[level].tolist().index(index_name)
    # now need to reset the actual levels
    level_names = levels[level]
    # has no __delitem__, so... need to remake
    tmp_names = FrozenList([i for i in level_names if i != index_name])
    levels = FrozenList([j if i != level else tmp_names
                         for i, j in enumerate(levels)])
    # need to turn off validation
    new_series.index.set_levels(levels, verify_integrity=False, inplace=True)
    # reset the labels
    level_labels = labels[level].tolist()
    tmp_labels = FrozenNDArray([i-1 if i > index_pos else i
                                for i in level_labels])
    labels = FrozenList([j if i != level else tmp_labels
                         for i, j in enumerate(labels)])
    new_series.index.set_labels(labels, verify_integrity=False, inplace=True)
    return new_series
Example use:
>>> s1 = drop(s, 0, "bar")
>>> s1.index
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
>>> s1["bar"]
...
KeyError: 'bar'
EDIT: This seems to be specific to dataframes/series with multiindexing, as the standard pandas.core.index.Index class does not have the same limitations. I would recommend filing a bug report.
Consider the same series with a standard index:
>>> s = p.Series(np.random.randn(6))
>>> s.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
>>> s.drop(0, inplace=True)
>>> s.index
Int64Index([1, 2, 3, 4, 5], dtype='int64')
The same is true for a dataframe
>>> df = p.DataFrame([np.random.randn(6), np.random.randn(6)])
>>> df.index
Int64Index([0, 1], dtype='int64')
>>> df.drop(0, inplace=True)
>>> df.index
Int64Index([1], dtype='int64')
See long discussion here.
Bottom line: it's not obvious when to recompute the levels, as the operation a user is doing is unknown (think of it from the Index's perspective). For example, say you are dropping, then adding a value back to a level (e.g. via indexing); recomputing the levels on every such operation would be very wasteful and somewhat compute intensive.
In [11]: s1.index
Out[11]:
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
Here is the actual index itself.
In [12]: s1.index.values
Out[12]: array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)
In [13]: s1.index.get_level_values(0)
Out[13]: Index([u'baz', u'baz', u'foo', u'foo'], dtype='object')
In [14]: s1.index.get_level_values(1)
Out[14]: Index([u'one', u'two', u'one', u'two'], dtype='object')
If you really feel it is necessary to 'get rid' of the removed level, then simply recreate the index. However, keeping it around is not harmful at all. These factorizations (e.g. the labels) are hidden from the user (yes, they are displayed, but that is honestly more of a confusing pain point, hence this question).
In [15]: pd.MultiIndex.from_tuples(s1.index.values)
Out[15]:
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
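As a hedged aside (not part of the original answer): on pandas 0.20 and later, MultiIndex.remove_unused_levels() performs this rebuild for you, returning a new index whose levels no longer contain the dropped label.
# Assumes a pandas version that provides MultiIndex.remove_unused_levels() (0.20+).
s1.index = s1.index.remove_unused_levels()
s1.index
# MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
#            labels=[[0, 0, 1, 1], [0, 1, 0, 1]])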
