I have a utility function for creating a Pandas MultiIndex when I have two or more iterables and I want an index key for each unique pairing of the values in those iterables. It looks like this:
import pandas as pd
import itertools

def product_index(values, names=None):
    """Make a MultiIndex from the combinatorial product of the values."""
    iterable = itertools.product(*values)
    idx = pd.MultiIndex.from_tuples(list(iterable), names=names)
    return idx
It can be used like:
a = range(3)
b = list("ab")
product_index([a, b])
to create:
MultiIndex(levels=[[0, 1, 2], [u'a', u'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
This works perfectly fine, but it seems like a common use case and I am surprised I had to implement it myself. So my question is: what have I missed or misunderstood in the Pandas library itself that already offers this functionality?
Edit to add: This function has been added to Pandas as MultiIndex.from_product for the 0.13.1 release.
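For reference, a minimal sketch of the built-in equivalent (the names argument here is just illustrative):

import pandas as pd

# pandas >= 0.13.1
idx = pd.MultiIndex.from_product([range(3), list("ab")], names=["num", "let"])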
This is a very similar construction (but using cartesian_product, which for larger arrays is faster than itertools.product):
In [2]: from pandas.tools.util import cartesian_product
In [3]: MultiIndex.from_arrays(cartesian_product([range(3), list('ab')]))
Out[3]:
MultiIndex(levels=[[0, 1, 2], [u'a', u'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
This could be added as a convenience method, maybe MultiIndex.from_iterables(...); please open an issue (and a PR if you'd like).
FYI, I very rarely construct a MultiIndex 'manually'; it is almost always easier to construct a frame and just set_index.
In [10]: df = DataFrame(dict(A = np.arange(6),
   ....:                     B = ['foo'] * 3 + ['bar'] * 3,
   ....:                     C = np.ones(6) + np.arange(6) % 2)
   ....:                ).set_index(['C','B']).sortlevel()
In [11]: df
Out[11]:
         A
C B
1 bar    4
  foo    0
  foo    2
2 bar    3
  bar    5
  foo    1

[6 rows x 1 columns]
I am no data scientist. I do know Python, and I currently have to manage time series data that is coming in at a regular interval. Much of this data is all zeros, or values that stay the same for a long time, and to save memory I'd like to filter them out. Is there some standard method for this (which I am obviously unaware of), or should I implement my own algorithm?
What I want to achieve is the following:
interval  value  result (summed)
1         0      0
2         0      # removed
3         0      0
4         1      1
5         2      2
6         2      # removed
7         2      # removed
8         2      2
9         0      0
10        0      0
Any help appreciated!
You could use pandas query on DataFrames to achieve this:
import pandas as pd
matrix = [[ 1, 0, 0],
          [ 2, 0, 0],
          [ 3, 0, 0],
          [ 4, 1, 1],
          [ 5, 2, 2],
          [ 6, 2, 0],
          [ 7, 2, 0],
          [ 8, 2, 2],
          [ 9, 0, 0],
          [10, 0, 0]]
df = pd.DataFrame(matrix, columns=list('abc'))
print(df.query("c != 0"))
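For this frame, that prints:

   a  b  c
3  4  1  1
4  5  2  2
7  8  2  2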
There is no quick function call to do what you need. The following is one way:
import pandas as pd
df = pd.DataFrame({'interval': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'value':    [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})   # example dataframe

df['group'] = df['value'].ne(df['value'].shift()).cumsum()   # increments every time the value changes
df['key'] = 1                                                 # create a column of ones
df['key'] = df.groupby('group')['key'].transform('cumsum')   # cumulative position within each group
df['key'] = df.groupby('group')['key'].transform(lambda x: x.isin([x.min(), x.max()]))  # flag the first and last row of each group
df = df[df['key'] == True].drop(columns=['group', 'key'])     # keep only the relevant cases
df
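Running this on the example leaves exactly the rows from the desired result (intervals 2, 6, and 7 removed):

   interval  value
0         1      0
2         3      0
3         4      1
4         5      2
7         8      2
8         9      0
9        10      0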
Here is the code:
l = [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]
for (i, ll) in enumerate(l):
    if i != 0 and ll == l[i-1] and i < len(l) - 1 and ll == l[i+1]:
        continue
    print(i + 1, ll)
It produces what you want. You haven't specified the format of your input data, so I assumed it's in a list. The conditions ll == l[i-1] and ll == l[i+1] are the key to skipping the repeated values: a point is dropped only when it equals both of its neighbours.
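For longer series, the same neighbour test can be written as a vectorized boolean mask; a minimal sketch, assuming the data is already in a NumPy array:

import numpy as np

l = np.array([0, 0, 0, 1, 2, 2, 2, 2, 0, 0])
# an interior point is dropped only when it equals both neighbours
interior = (l[1:-1] == l[:-2]) & (l[1:-1] == l[2:])
keep = np.concatenate(([True], ~interior, [True]))
print(np.flatnonzero(keep) + 1)  # intervals: [ 1  3  4  5  8  9 10]
print(l[keep])                   # values:    [0 0 1 2 2 0 0]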
Thanks all!
Looking at the answers, I guess I can conclude I'll need to roll my own. I'll be using your input as inspiration.
Thanks again!
I have a NumPy array with each row representing some (x, y, z) coordinate like so:
a = array([[0, 0, 1],
           [1, 1, 2],
           [4, 5, 1],
           [4, 5, 2]])
I also have another NumPy array with unique values of the z-coordinates of that array like so:
b = array([1, 2])
How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.
In the end, the output would be an array the same shape as b.
I'm trying to vectorize this to make it as fast as possible. Thanks!
Example of an expected output (assuming that f is count()):
c = array([2, 2])
because there are 2 rows in array a which have a z value of 1 in array b and also 2 rows in array a which have a z value of 2 in array b.
A trivial solution would be to iterate over array b like so:

c = []
for val in b:
    # apply the function to the rows of `a` whose z-coordinate equals val
    c.append(f(a[a[:, 2] == val]))
c = np.array(c)
My attempt:
I tried doing something like this, but it just returns an empty array.
func(a[a[:, 2]==b])
The problem is that the groups of rows with the same z can have different sizes, so you cannot stack them into one 3D NumPy array, which would allow you to easily apply a function along the third dimension. One solution is to use a for loop; another is to use np.split:
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])

a_sorted = a[a[:, 2].argsort()]                          # sort the rows by z
inds = np.unique(a_sorted[:, 2], return_index=True)[1]   # first index of each unique z
a_split = np.split(a_sorted, inds)[1:]                   # split into per-z groups
# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),
#  array([[1, 1, 2],
#         [4, 5, 2]])]
f = np.sum # example of a function
result = list(map(f, a_split))
# [19, 15]
But IMHO the best solution is to use pandas and groupby, as suggested by FBruzzesi. You can then convert the result to a NumPy array (see the one-liner after the Pandas snippet below).
EDIT: For completeness, here are the other two solutions:
List comprehension:
b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
Pandas:

import pandas as pd

df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
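Converting the result back to a NumPy array, as mentioned above, is then one more call:

result = np.asarray(result)  # array([19, 15]) for f = np.sum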
Benchmarking with a = np.random.randint(0, 100, (n, 3)), approximately up to n = 10^5 the "split solution" is the fastest, but after that the pandas solution performs better.
If you are allowed to use pandas:
import pandas as pd

df = pd.DataFrame(a, columns=['x', 'y', 'z'])
df.groupby('z').agg(f)
Here f can be any custom function working on grouped data.
Numeric example:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
df = pd.DataFrame(a, columns=['x', 'y', 'z'])
df.groupby('z').size()
z
1    2
2    2
dtype: int64
Note that .size() is the way to count the number of rows per group.
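To get the counts as an array with the same shape as b, as the question asks, one more step works (a sketch; on older pandas use .values instead of .to_numpy()):

c = df.groupby('z').size().to_numpy()  # array([2, 2])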
To keep it in pure NumPy, maybe this can suit your case:
tmp = np.array([a[a[:,2]==i] for i in b])
tmp
array([[[0, 0, 1],
        [4, 5, 1]],

       [[1, 1, 2],
        [4, 5, 2]]])
which is an array with each group of rows. Note this only stacks cleanly because the groups here happen to have equal sizes.
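Applying a function per group is then a short comprehension; for example, counting rows per group to reproduce the expected c:

c = np.array([grp.shape[0] for grp in tmp])  # array([2, 2])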
c = np.array([])
for x in np.nditer(b):
    c = np.append(c, np.where((a[:, 2] == x))[0].shape[0])
Output:
[2. 2.]
I am relatively new to Python, and a piece of existing code has created an object like the one below. This is part of a legacy piece of code; I unfortunately cannot change it. The code creates many objects that look like the following format:
[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
I am trying to transform this object into a matrix or NumPy array. In this specific example, it would have three rows (1, 2, 3) and four columns (a, b, c, d), with the dictionary values inserted in the cells. (I have inserted how this matrix would look as a dinky toy example. However, I am not looking to recreate the table from scratch; I am looking for code that translates the object above into a matrix format.)
I am struggling to find a fast and easy way to do this. Any tips or advice much appreciated.
   a  b  c  d
1  2  3  0  0
2  1  0  3  0
3  0  0  2  4
I suspect you are focusing on the fast and easy, when you need to address the how first. This isn't the normal input format for np.array or pandas. So let's focus on that.
It's a list of lists, suggesting a 2D array. But each sublist contains one dictionary, not a list of values.
In [633]: dd=[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
In [634]: dd[0]
Out[634]: [{'b': 3, 'a': 2}]
So let's define a function that converts a dictionary into a list of numbers. We can address the question of where the a, b, c, d labels come from, and whether you need to collect them from dd or not, later (see the sketch at the end of this answer).
In [635]: dd[0][0]
Out[635]: {'b': 3, 'a': 2}
In [636]: def mk_row(adict):
   .....:     return [adict.get(k, 0) for k in ['a', 'b', 'c', 'd']]
   .....:
In [637]: mk_row(dd[0][0])
Out[637]: [2, 3, 0, 0]
So now we just need to apply the function to each sublist:
In [638]: [mk_row(d[0]) for d in dd]
Out[638]: [[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]]
This is the kind of list that @Colin fed to pandas. It can also be given to np.array:
In [639]: np.array([mk_row(d[0]) for d in dd])
Out[639]:
array([[2, 3, 0, 0],
       [1, 0, 3, 0],
       [0, 0, 2, 4]])
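As for collecting the a, b, c, d labels from dd itself rather than hard-coding them, a minimal sketch:

# gather the sorted union of keys across all the dictionaries
keys = sorted({k for sub in dd for d in sub for k in d})
# keys -> ['a', 'b', 'c', 'd']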
Simply use:
import pandas as pd

df = pd.DataFrame.from_items([('1', [2, 3, 0, 0]),
                              ('2', [1, 0, 3, 0]),
                              ('3', [0, 0, 2, 4])],
                             orient='index', columns=['a', 'b', 'c', 'd'])
arr = df.values
You can then reference it like a normal numpy array:
print(arr[0,:])
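For this data, that prints the first row: [2 3 0 0]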
I have a matrix, and I want to write a script to extract the values which are bigger than zero, along with their row and column numbers (because each value belongs to that (row, column)). Here's an example:
import numpy as np

m = np.array([[0, 2, 4],
              [4, 0, 4],
              [5, 4, 0]])
index_row = []
index_col = []
dist = []
I want to store the row number in index_row, the column number in index_col, and the value in dist. So in this case,
index_row = [0 0 1 1 2 2]
index_col = [1 2 0 2 0 1]
dist = [2 4 4 4 5 4]
How can I add code to achieve this goal? Thanks for any suggestions.
You can use numpy.where for this:
>>> indices = np.where(m > 0)
>>> index_row, index_col = indices
>>> dist = m[indices]
>>> index_row
array([0, 0, 1, 1, 2, 2])
>>> index_col
array([1, 2, 0, 2, 0, 1])
>>> dist
array([2, 4, 4, 4, 5, 4])
Though this has been answered already, I often find np.where to be somewhat cumbersome, though like all things, it depends on the circumstance. For this, I'd probably use a zip and a list comprehension:
index_row = [0, 0, 1, 1, 2, 2]
index_col = [1, 2, 0, 2, 0, 1]
zipped = zip(index_row, index_col)
dist = [m[z] for z in zipped]
The zip will give you an iterable of tuples, which can be used to index NumPy arrays. (In Python 3, zip returns a one-shot iterator, so materialize it with list() if you need to reuse it.)
The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very fast in a Cython loop.
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
the output is:
[0 1 2 2 1 0]
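On newer pandas versions, where pd.lib is no longer exposed, a plain Python zip gives the same result (a sketch):

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
codes, uniques = pd.factorize(pd.Series(list(zip(df.x, df.y))))
print(codes)  # [0 1 2 2 1 0]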
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # this will hold the unique rows of the dataframe
for i in df.index:
    if list(df.iloc[i]) not in arr:
        arr.append(list(df.iloc[i]))
so printing arr would give you

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare an ind array:

ind = []
for i in df.index:
    ind.append(arr.index(list(df.iloc[i])))
Printing ind would give:

>>> print(ind)
[0, 1, 2, 2, 1, 0]
You can use drop_duplicates to drop the duplicated rows:
In [23]: df.drop_duplicates()
Out[23]:
   x  y
0  1  1
1  1  2
2  2  2
EDIT
To achieve your goal, you can join your original df to the drop-duplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
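For this frame, df['newID'] comes out as [0, 1, 2, 2, 1, 0], matching the factorization of the (x, y) tuples.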