I have a DataFrame on which I want to do multi-label classification. One of the approaches suggested to me was to calculate a probability vector. Here's an example of my DataFrame, representing training data.
id ABC DEF GHI
1 0 0 0 1
2 1 0 1 0
3 2 1 0 0
4 3 0 1 1
5 4 0 0 0
6 5 0 1 1
7 6 1 1 1
8 7 1 0 1
9 8 1 1 0
And I would like to concatenate columns ABC, DEF, GHI into a new column. I will also have to do this with more than 3 columns, so I want to do it relatively cleanly, using a column list or something similar:
col_list = ['ABC','DEF','GHI']
The result I am looking for would be something like:
id ABC DEF GHI Conc
1 0 0 0 1 [0,0,1]
2 1 0 1 0 [0,1,0]
3 2 1 0 0 [1,0,0]
4 3 0 1 1 [0,1,1]
5 4 0 0 0 [0,0,0]
6 5 0 1 1 [0,1,1]
7 6 1 1 1 [1,1,1]
8 7 1 0 1 [1,0,1]
9 8 1 1 0 [1,1,0]
Try:
col_list = ['ABC','DEF','GHI']
df['agg_lst'] = df.apply(lambda x: [x[col] for col in col_list], axis=1)
You can use agg with the built-in list function (col_list is the column list defined above):
df[col_list].agg(list, axis=1)
1 [0, 0, 1]
2 [0, 1, 0]
3 [1, 0, 0]
4 [0, 1, 1]
5 [0, 0, 0]
6 [0, 1, 1]
7 [1, 1, 1]
8 [1, 0, 1]
9 [1, 1, 0]
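Both suggestions can be combined into one runnable sketch. The result column is named Conc as in the question (agg_lst above is just the answerer's name for it), and the two approaches are checked against each other:

```python
import pandas as pd

df = pd.DataFrame({'ABC': [0, 0, 1, 0, 0, 0, 1, 1, 1],
                   'DEF': [0, 1, 0, 1, 0, 1, 1, 0, 1],
                   'GHI': [1, 0, 0, 1, 0, 1, 1, 1, 0]},
                  index=range(1, 10))
col_list = ['ABC', 'DEF', 'GHI']

# apply-based version: build a Python list per row
df['Conc'] = df.apply(lambda x: [x[col] for col in col_list], axis=1)

# agg-based version produces the same column and scales to any column list
assert df[col_list].agg(list, axis=1).tolist() == df['Conc'].tolist()
```

Either way, extending to more columns only requires growing `col_list`.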
I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.
First, the simpler version: if I start with this:
Idx Tag
0 A
1 B
2 A
3 C
4 B
I'd like to get something like this, where the unique values of the starting series become the column labels (the cells could be 1s and 0s, Booleans, etc.):
Idx A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
Second, the more advanced version: if I have values associated with each entry, I'd like to preserve those and fill the rest of the matrix with a placeholder (0, NaN, or something else), e.g. starting from this:
Idx Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
And ending up with this:
Idx A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
What's a Pythonic way to do this?
Here's how to do it, using pandas.get_dummies(), which was designed specifically for this (it's often called "one-hot encoding" in ML). I've done it step-by-step so you can see how it's done ;)
>>> df
Idx Tag Val
0 0 A 5
1 1 B 2
2 2 A 3
3 3 C 7
4 4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
Idx A B C
0 0 1 0 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 0 1 0
>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
>>> df[['Val']].to_numpy()
array([[5],
[2],
[3],
[7],
[1]])
>>> pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy()
array([[5, 0, 0],
[0, 2, 0],
[3, 0, 0],
[0, 0, 7],
[0, 1, 0]])
>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())], axis=1)
Idx Tag Val A B C
0 0 A 5 5 0 0
1 1 B 2 0 2 0
2 2 A 3 3 0 0
3 3 C 7 0 0 7
4 4 B 1 0 1 0
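The steps above collapse into one runnable sketch. Two caveats worth flagging: recent pandas versions return boolean dummies, hence the astype(int); and pd.get_dummies sorts its columns while df['Tag'].unique() returns appearance order — they coincide here, but reusing the dummies' own columns is safer:

```python
import pandas as pd

df = pd.DataFrame({'Idx': range(5),
                   'Tag': ['A', 'B', 'A', 'C', 'B'],
                   'Val': [5, 2, 3, 7, 1]})

# one-hot encode the tags; astype(int) because newer pandas returns bools
dummies = pd.get_dummies(df['Tag']).astype(int)

# broadcast Val across each row of the 0/1 matrix
vals = pd.DataFrame(dummies.to_numpy() * df[['Val']].to_numpy(),
                    columns=dummies.columns)  # safer than df['Tag'].unique()

out = pd.concat([df, vals], axis=1)
```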
Based on @user17242583's answer, I found a pretty simple way to do this using pd.get_dummies combined with DataFrame.multiply:
>>> df
Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
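For completeness, here's that approach as a self-contained sketch (with an astype(int), since recent pandas versions return boolean dummies):

```python
import pandas as pd

df = pd.DataFrame({'Tag': ['A', 'B', 'A', 'C', 'B'],
                   'Val': [5, 2, 3, 7, 1]})

# one-hot encode, then scale each row by its Val (multiply aligns on the index)
result = pd.get_dummies(df['Tag']).astype(int).multiply(df['Val'], axis=0)
```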
If I have an array [1, 2, 3, 4, 5] and a pandas DataFrame
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the DataFrame, adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row (counting from zero). You can build those increments with np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add s to every row, restore the first row, then take the cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result = result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment, with l holding the array:
l = [1, 2, 3, 4, 5]
df[1:] = (df + l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
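A runnable sketch of this shift-based variant, checked against the expected result from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1]] + [[0] * 5] * 3)
a = [1, 2, 3, 4, 5]

# cumulative sum of (row + a), shifted down one row so row 0 stays unchanged
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
result = df.add(updated)
```

Note that shift()/fillna(0) makes the frame float-valued; cast back with astype(int) if integer output matters.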
I have the following indices as you would get them from np.where(...):
coords = (
    np.asarray([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 2, 8, 2, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]),
)
Another tuple of indices is meant to select the entries of coords that also appear in it:
index = (
    np.asarray([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 8, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]),
)
So, for instance, the first coordinate in coords is selected because it appears in index (at position 0), but the second one isn't selected because it's not present in index.
I can calculate the mask easily with [x in zip(*index) for x in zip(*coords)] (converted from bool to int for better readability):
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
but this wouldn't be very efficient for larger arrays. Is there a more "numpy-based" way that could calculate the mask?
Not so sure about efficiency, but given that you're basically comparing coordinate tuples, you could use SciPy's distance functions. Something along these lines:
from scipy.spatial.distance import cdist
c = np.stack(coords).T
i = np.stack(index).T
d = cdist(c, i)
In [113]: np.any(d == 0, axis=1).astype(int)
Out[113]:
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1])
By default it uses the L2 norm. You could probably make it slightly faster with a simpler distance function, e.g.:
d = cdist(c,i, lambda u, v: np.all(np.equal(u,v)))
np.any(d != 0, axis=1).astype(int)
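If you'd rather avoid the SciPy dependency, the same all-pairs equality test can be written with plain NumPy broadcasting. A sketch on smaller stand-in data with the same structure as the question's tuples (note the intermediate array grows as len(coords) × len(index), so this only suits moderate sizes):

```python
import numpy as np

# stand-in data: three index arrays per tuple instead of four
coords = (np.asarray([0, 0, 0, 1]),
          np.asarray([2, 2, 8, 2]),
          np.asarray([0, 1, 0, 0]))
index = (np.asarray([0, 0, 1]),
         np.asarray([2, 8, 2]),
         np.asarray([0, 0, 0]))

c = np.stack(coords).T   # shape (n_coords, ndim), one coordinate per row
i = np.stack(index).T    # shape (n_index, ndim)

# compare every coords row against every index row; all dims must match
mask = (c[:, None, :] == i[None, :, :]).all(-1).any(1).astype(int)
```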
You can use np.ravel_multi_index to compress the columns into unique numbers which are easier to handle:
cmx = *map(np.max, coords),
imx = *map(np.max, index),
shape = np.maximum(cmx, imx) + 1
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)
it.sort()
result = ct == it[it.searchsorted(ct)]
print(result.view(np.int8))
Prints:
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
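A condensed, runnable version of this approach on small stand-in data, with one defensive tweak: searchsorted can return len(it) when a coordinate compresses to a number larger than everything in index, so the positions are clipped before indexing (an edge case the snippet above doesn't guard against):

```python
import numpy as np

# stand-in data with the same structure as the question's tuples
coords = (np.asarray([0, 0, 0, 1]),
          np.asarray([2, 2, 8, 2]),
          np.asarray([0, 1, 0, 0]))
index = (np.asarray([0, 0, 1]),
         np.asarray([2, 8, 2]),
         np.asarray([0, 0, 0]))

# a shape large enough to hold every coordinate from both tuples
shape = np.maximum([m.max() for m in coords], [m.max() for m in index]) + 1

# compress each coordinate tuple into a single unique integer
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)

it.sort()
pos = np.minimum(it.searchsorted(ct), len(it) - 1)  # clip to stay in bounds
mask = (ct == it[pos]).astype(int)
```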
I have a DataFrame which looks like this:
>>> df
type value
0 1 0.698791
1 3 0.228529
2 3 0.560907
3 1 0.982690
4 1 0.997881
5 1 0.301664
6 1 0.877495
7 2 0.561545
8 1 0.167920
9 1 0.928918
10 2 0.212339
11 2 0.092313
12 4 0.039266
13 2 0.998929
14 4 0.476712
15 4 0.631202
16 1 0.918277
17 3 0.509352
18 1 0.769203
19 3 0.994378
I would like to group on the type column and obtain histogram bins for the column value in 10 new columns, e.g. something like that:
1 3 9 6 8 10 5 4 7 2
type
1 0 1 0 0 0 2 1 1 0 1
2 2 1 1 0 0 1 1 0 0 0
3 2 0 0 0 0 1 1 0 0 0
4 1 1 0 0 0 1 0 0 0 1
Where column 1 is the count for the first bin (0.0 to 0.1) and so on...
Using numpy.histogram, I can only obtain the following:
>>> df.groupby('type')['value'].agg(lambda x: numpy.histogram(x, bins=10, range=(0, 1)))
type
1 ([0, 1, 1, 1, 1, 0, 0, 0, 0, 2], [0.0, 0.1, 0....
2 ([2, 0, 1, 0, 1, 0, 0, 0, 1, 1], [0.0, 0.1, 0....
3 ([2, 0, 0, 0, 1, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
4 ([1, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
Name: value, dtype: object
Which I do not manage to put in the correct format afterwards (at least not in a simple way).
I found a trick to do what I want, but it is very ugly:
>>> d = {str(k): lambda x, _k = k: ((x >= (_k - 1)/10) & (x < _k/10)).sum() for k in range(1, 11)}
>>> df.groupby('type')['value'].agg(d)
1 3 9 6 8 10 5 4 7 2
type
1 0 1 0 0 0 2 1 1 0 1
2 2 1 1 0 0 1 1 0 0 0
3 2 0 0 0 0 1 1 0 0 0
4 1 1 0 0 0 1 0 0 0 1
Is there a better way to do what I want? I know that in R, the aggregate method can return a DataFrame, but not in Python...
Is that what you want?
bins = np.linspace(0, 1.0, 11)
labels = list(range(1, 11))
(df.assign(q=pd.cut(df.value, bins=bins, labels=labels, right=False))
   .pivot_table(index='type', columns='q', aggfunc='size', fill_value=0)
)
q 1 2 3 4 5 6 7 8 9 10
type
1 0 1 0 1 0 0 1 1 1 4
2 1 0 1 0 0 1 0 0 0 1
3 0 0 1 0 0 2 0 0 0 1
4 1 0 0 0 1 0 1 0 0 0
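The same pattern on a small deterministic frame, as a self-contained sketch. Two hedges for newer pandas: observed=False is passed explicitly (the pivot_table default is in flux for categorical groupers), and the columns are reindexed so all ten bins appear even when empty:

```python
import pandas as pd

df = pd.DataFrame({'type': [1, 1, 1, 2, 2, 3],
                   'value': [0.05, 0.15, 0.95, 0.05, 0.55, 0.95]})

bins = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
labels = list(range(1, 11))          # bin k covers [(k-1)/10, k/10)

counts = (df.assign(q=pd.cut(df['value'], bins=bins, labels=labels, right=False))
            .pivot_table(index='type', columns='q', aggfunc='size',
                         fill_value=0, observed=False)
            .reindex(columns=labels, fill_value=0))
```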
I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x are always repeated N times, in order.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
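The two versions can be checked against each other in one sketch (here values='y' is passed to pivot so the columns come out flat rather than as a MultiIndex; rename them to val1..val4 if you want the question's exact headers):

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2] * 4,
                   'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})

# robust version: number each repeat of x, then pivot
df['columns'] = df.groupby('x')['y'].cumcount()
pivoted = df.pivot(index='x', columns='columns', values='y')

# fast version: relies on x repeating in order; N = number of distinct x values
N = 3
reshaped = pd.DataFrame(df['y'].values.reshape(-1, N).T)
```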