I wish to extract values from a MultiIndex DataFrame; this df has two index levels, a_idx and b_idx. Extracting a single pair such as (1, 1) works:
[in] df.loc[(1, 1), :]
[out] 0
Name: (1, 1), dtype: int64
which is as intended. But then if I want to obtain the two values at (1, 2) and (2, 3):
[in] df.loc[([1, 2], [2, 3]), :]
[out]
value
a_idx b_idx
1 2 1
3 6
2 2 3
3 9
which is not what I wanted: I need the specific pairs, not all 4 combinations.
Furthermore, I wish to select elements from this DataFrame with two arrays, select_a and select_b, that have the same length as each other but not the same length as the DataFrame. So for
select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]
My guess was that I should do this using:
df.loc[(select_a, select_b), :]
and then receive a list of all items with a_idx == select_a[i] and b_idx == select_b[i] for all i in range(len(select_a)).
I have tried xs and slice indexing, but these did not return the desired result. My main reason for using index-based selection is computational speed: the real dataset is 4.3 million lines, and the dataset to be created from it will be even larger.
If this is not the best way to achieve this result, then please point me in the right direction. Any sources are also welcome, what I found in the pandas documentation was not geared towards this kind of indexing (or at least I have not been able to find it)
The dataframe is created using the following code:
import numpy as np
import pandas as pd

numbers = pd.DataFrame(np.random.randint(0, 10, 10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
print("before adding the index to the dataframe")
print(numbers)
index_cols = pd.MultiIndex.from_arrays(
    [numbers["a"].values, numbers["b"].values],
    names=["a_idx", "b_idx"])
df = pd.DataFrame(numbers.values,
                  index=index_cols,
                  columns=numbers.columns.values)
df = df.sort_index()
df.drop(columns=["a", "b"], inplace=True)
print("after adding the indexes to the dataframe")
print(df)
You were almost there. To select those specific pairs, pass them to df.loc as a list of tuples:
df.loc[[(1, 2), (2, 3)], :]
You can also do this with select_a and select_b; just make sure you combine them into (a, b) tuples before passing them to df.loc.
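For the larger select_a/select_b case, a sketch (using a deterministic value column instead of the random one from the question, so the output is reproducible):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; "value" is deterministic here for clarity
numbers = pd.DataFrame({"value": np.arange(10)})
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
df = numbers.set_index(["a", "b"]).rename_axis(["a_idx", "b_idx"]).sort_index()

# A list of tuples selects exact (a_idx, b_idx) pairs, not a cross product
print(df.loc[[(1, 2), (2, 3)], :])

# zip pairs the two selection arrays element-wise into that same shape
select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]
selected = df.loc[list(zip(select_a, select_b)), :]
print(selected["value"].tolist())  # [0, 2, 5, 6, 7]
```

One row comes back per (select_a[i], select_b[i]) pair, in the order the pairs are given, which is exactly the behavior asked for.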
I have 2 dataframes: a and b.
When I run print(a.shape, b.shape), I get the following result: (1, 28849) (44, 29025), meaning that b has more columns than a. When I run b.columns.difference(a.columns) the result is an empty index: Index([], dtype='object'). I get the same result when I run a.columns.difference(b.columns). Why do the dataframes have different column counts in shape but not have any different columns between them?
Why do the dataframes have different column counts in shape but not have any different columns between them?
Empty bi-directional pd.Index.difference is no guarantee that columns in 2 dataframes are the same. Consider the following example:
A = pd.DataFrame(columns=[1, 1, 2, 3, 4])
B = pd.DataFrame(columns=[1, 2, 3, 4])
A.columns.difference(B.columns) # Int64Index([], dtype='int64')
B.columns.difference(A.columns) # Int64Index([], dtype='int64')
pd.Index.difference can be compared to set.difference, i.e. it does not consider duplicates. If you print the columns explicitly, you should see they are different.
Or, to explicitly calculate the counts of each column name, you can use numpy.unique:
import numpy as np
print(np.unique(A.columns, return_counts=True))
(array([1, 2, 3, 4], dtype=int64), array([2, 1, 1, 1], dtype=int64))
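A pandas-native variant of the same check, offered as a sketch along the same lines, uses Index.duplicated and Index.value_counts to surface the repeated labels directly:

```python
import pandas as pd

A = pd.DataFrame(columns=[1, 1, 2, 3, 4])
B = pd.DataFrame(columns=[1, 2, 3, 4])

# labels that occur more than once in A
dupes = A.columns[A.columns.duplicated()]
print(list(dupes))  # [1]

# per-label counts, analogous to np.unique(..., return_counts=True)
print(sorted(A.columns.value_counts().items()))  # [(1, 2), (2, 1), (3, 1), (4, 1)]
```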
I have two dataframes, each one having a lot of columns and rows. The elements in each row are the same, but their indexing is different. I want to add the elements of one of the columns of the two dataframes.
As a basic example consider the following two Series:
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Say that each row contains the same element, only in different indexing. I want to add the two columns and get in the end a new column that contains [4,6,0,10]. Instead, due to the indices, I get [nan, 5, 7, 1].
Is there an easy way to solve this without changing the indices?
I want output as a series.
You could use reset_index(drop=True):
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Sr1 + Sr2.reset_index(drop=True)
0 4
1 6
2 0
3 10
dtype: int64
Also,
pd.Series(Sr1.values + Sr2.values, index=Sr1.index)
One way is to use reset_index on one or more series:
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
res = Sr1 + Sr2.reset_index(drop=True)
0 4
1 6
2 0
3 10
dtype: int64
Using zip
Ex:
import pandas as pd
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
sr3 = [sum(i) for i in zip(Sr1, Sr2)]
print(sr3)
Output:
[4, 6, 0, 10]
You could use .values, which gives you the underlying NumPy representation; as long as both series have the same length, you can then add them positionally like this:
Sr1.values + Sr2.values
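To see why the plain + produced the NaNs in the first place, and to get the result back as a Series as the question asks, a small sketch:

```python
import pandas as pd

Sr1 = pd.Series([1, 2, 3, 4], index=[0, 1, 2, 3])
Sr2 = pd.Series([3, 4, -3, 6], index=[1, 2, 3, 4])

# Plain + aligns on the index; labels present in only one series become NaN
aligned = Sr1 + Sr2
print(aligned.tolist())  # [nan, 5.0, 7.0, 1.0, nan]

# .values drops the indices, so addition is purely positional
total = pd.Series(Sr1.values + Sr2.values, index=Sr1.index)
print(total.tolist())  # [4, 6, 0, 10]
```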
I have 1D array in numpy, and I want to add a certain value to part of the array.
For example, if the array is:
a = [1, 2, 3, 4, 5]
I want to add the value 7 to the elements at positions 2 and 3 (0-based) to get:
a = [1, 2, 10, 11, 5]
Is there any simple way to do this?
Thanks!
You can index the array with another array (or list) of positions, provided a is a NumPy array:
a = np.array([1, 2, 3, 4, 5])
a[[2, 3]] += 7
If the positions follow a pattern, as in this specific case where they are contiguous, you can use basic slicing instead:
a = np.array([1, 2, 3, 4, 5])
a[2:4] += 7
Note here 2:4 means "from index 2 (included) to index 4 (excluded)", i.e. indices 2 and 3.
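Both forms side by side, as a small self-contained sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# fancy indexing: works for arbitrary, even non-contiguous, positions
b = a.copy()
b[[2, 3]] += 7
print(b.tolist())  # [1, 2, 10, 11, 5]

# basic slicing: for a contiguous run of positions
c = a.copy()
c[2:4] += 7
print(c.tolist())  # [1, 2, 10, 11, 5]
```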
I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of consecutive elements per each row.
For example consider the following matrix:
a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]
Its COO representation would be
(0, 1) 1
(0, 2) 2
(1, 0) 1
(1, 3) 2
(3, 0) 1
(3, 2) 1
I need the result to be [1, 2, 0, 2]. The first row contains two non-zero elements that lie next to each other, so they form a single group. The second row also has two non-zero elements, but they are not adjacent, so they form two groups. The third row has no non-zeros and hence no groups. The fourth row again has two non-zeros, separated by zeros, so again two groups. It is like the number of clusters per row. Iterating through the rows is an option, but only if there is no faster solution. Any help in this regard is appreciated.
Another simple example: consider the following row:
[1,2,3,0,0,0,2,0,0,8,7,6,0,0]
The above row should return [3], since there are three groups of non-zeros separated by zeros.
Convert it to a dense array, and apply your logic row by row:
- you want the number of groups per row
- zeros count when defining groups
- row iteration is faster with dense arrays
In coo format your matrix looks like:
In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)
This format does not implement row indexing; csr and lil do
In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)
So the sparse information for the 1st row is a list of nonzero data values, [1,2], and list of their column numbers, [1,2]. Compare that with the row of the dense array, [0, 1, 2, 0]. Which is easier to analyze?
Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with M.A[0,:].nonzero().
In your last example, I can get the nonzero indices:
In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)
We may be able to get the desired count from the number of diff values >1, though I'd have to look at more examples to define the details.
Extension of the analysis to multiple rows depends on first thoroughly understanding the single row case.
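The counting rule sketched above (one group, plus one more for every diff value greater than 1) can be written as a small per-row function; groups_in_row is a hypothetical name, and this is a sketch rather than a tested general solution:

```python
import numpy as np

def groups_in_row(row):
    """Count runs of consecutive non-zeros in a 1-D row."""
    idx = np.nonzero(row)[0]
    if idx.size == 0:
        return 0
    # each gap larger than 1 between non-zero positions starts a new group
    return 1 + int(np.sum(np.diff(idx) > 1))

print(groups_in_row([1, 2, 3, 0, 0, 0, 2, 0, 0, 8, 7, 6, 0, 0]))  # 3
rows = [[0, 1, 2, 0], [1, 0, 0, 2], [0, 0, 0, 0], [1, 0, 1, 0]]
print([groups_in_row(r) for r in rows])  # [1, 2, 0, 2]
```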
With the help of hpaulj's comment I came up with the following snippet to do this (note that filter returns an iterator on Python 3, so it must be wrapped in list() before taking len(), and the len(idx) > 2 branch needs an else so counts is always defined):
M = m.tolil()  # m is the sparse matrix in COO format
r = []
for i in range(M.shape[0]):
    sumx = 0
    idx = M.rows[i]
    if len(idx) > 2:
        tempidx = np.diff(idx)
        if 1 in tempidx:
            # gaps > 1 separate groups; each run of diffs == 1 is one group
            temp = list(filter(lambda a: a != 1, tempidx))
            sumx = 1
            counts = len(temp)
        else:
            # no adjacent non-zeros: every entry is its own group
            counts = len(idx)
        r.append(counts + sumx)
    elif len(idx) == 2:
        tempidx = np.diff(idx)
        if tempidx[0] == 1:
            counts = 1
        else:
            counts = 2
        r.append(counts)
    elif len(idx) == 1:
        r.append(1)
    else:
        r.append(0)
tempcluster = np.sum(r) / float(M.shape[0])
cluster.append(tempcluster)  # cluster is a list defined in the surrounding code
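The same counts can also be obtained without any Python-level row iteration, straight from the COO row/col arrays. This is a sketch, and it assumes the COO entries are sorted by row and then by column, which is the order shown in the example above:

```python
import numpy as np

# COO triplets for a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]
row = np.array([0, 0, 1, 1, 3, 3])
col = np.array([1, 2, 0, 3, 0, 2])
nrows = 4

counts = np.zeros(nrows, dtype=int)
np.add.at(counts, row, 1)  # start with the number of non-zeros per row
# merge every entry whose predecessor is in the same row at col - 1
same_row = row[1:] == row[:-1]
adjacent = col[1:] == col[:-1] + 1
np.subtract.at(counts, row[1:][same_row & adjacent], 1)
print(counts.tolist())  # [1, 2, 0, 2]
```

For a real scipy matrix the row/col arrays would come from M.row and M.col after sorting; the counting itself stays fully vectorized, which matters at millions of entries.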
I am relatively new to Python. A piece of existing legacy code creates objects in the format below; I unfortunately cannot change it. The code creates many objects that look like the following:
[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
I am trying to transform this object into a matrix or NumPy array. In this specific example it would have three rows (1, 2, 3) and four columns (a, b, c, d), with the dictionary values inserted in the cells. (I have included how this matrix would look as a toy example below. However, I am not looking to recreate the table by hand; I am looking for code that translates the object above into matrix format.)
I am struggling to find a fast and easy way to do this. Any tips or advice much appreciated.
a b c d
1 2 3 0 0
2 1 0 3 0
3 0 0 2 4
I suspect you are focusing on the fast and easy when you first need to address the how. This isn't the normal input format for np.array or pandas, so let's focus on that.
It's a list of lists; suggesting a 2d array. But each sublist contains one dictionary, not a list of values.
In [633]: dd=[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
In [634]: dd[0]
Out[634]: [{'b': 3, 'a': 2}]
So let's define a function that converts a dictionary into a list of numbers. We can address the question of where a,b,c,d labels come from, and whether you need to collect them from dd or not, later.
In [635]: dd[0][0]
Out[635]: {'b': 3, 'a': 2}
In [636]: def mk_row(adict):
   .....:     return [adict.get(k, 0) for k in ['a', 'b', 'c', 'd']]
   .....:
In [637]: mk_row(dd[0][0])
Out[637]: [2, 3, 0, 0]
So now we just need to apply the function to each sublist
In [638]: [mk_row(d[0]) for d in dd]
Out[638]: [[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]]
This is the kind of list that @Colin fed to pandas. It can also be given to np.array:
In [639]: np.array([mk_row(d[0]) for d in dd])
Out[639]:
array([[2, 3, 0, 0],
[1, 0, 3, 0],
[0, 0, 2, 4]])
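If the a, b, c, d labels are not known in advance, they can be collected from dd itself first. A sketch (sorting the collected keys alphabetically is an assumption, made here to get a deterministic column order):

```python
import numpy as np

dd = [[{'a': 2, 'b': 3}], [{'a': 1, 'c': 3}], [{'c': 2, 'd': 4}]]

# gather every key that occurs in any of the inner dicts
keys = sorted({k for sub in dd for k in sub[0]})
arr = np.array([[sub[0].get(k, 0) for k in keys] for sub in dd])
print(keys)          # ['a', 'b', 'c', 'd']
print(arr.tolist())  # [[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]]
```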
Simply build the frame with the plain DataFrame constructor (pd.DataFrame.from_items is deprecated in recent pandas):
import pandas as pd
df = pd.DataFrame([[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]],
                  index=['1', '2', '3'],
                  columns=['a', 'b', 'c', 'd'])
arr = df.values
You can then reference it like a normal numpy array:
print(arr[0,:])
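A shorter pandas route worth knowing, again as a sketch: pd.DataFrame accepts a list of dicts directly and fills missing keys with NaN, which fillna then turns into zeros.

```python
import pandas as pd

dd = [[{'a': 2, 'b': 3}], [{'a': 1, 'c': 3}], [{'c': 2, 'd': 4}]]

# unwrap the one-element sublists, then let pandas align the keys
df = pd.DataFrame([sub[0] for sub in dd]).fillna(0).astype(int)
df = df[sorted(df.columns)]  # pin down a definite column order
arr = df.values
print(arr.tolist())  # [[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]]
```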