I have a data set which has a variable with values 0 and 1.
I need output in the following form:
Variable - 0 1 1 1 0 1 1 1 0 1 1 0
Flag - 1 1 1 1 2 2 2 2 3 3 3 4
Every time the variable changes to 0, the flag should increment by 1, and it should remain the same until the next 0 is encountered.
I'm converting code from SAS to Python. This was pretty easy in SAS, but I'm finding it difficult in pandas. Is there an equivalent of SAS's RETAIN in pandas? I don't see anything like it in the pandas documentation.
Thanks in Advance.
I think you need to compare with 0 and take the cumsum:
import pandas as pd

s = pd.Series([0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0])
print(s)
0 0
1 1
2 1
3 1
4 0
5 1
6 1
7 1
8 0
9 1
10 1
11 0
dtype: int64
# True (counted as 1) at each 0; the cumulative sum increments the flag at every 0
s1 = (s == 0).cumsum()
print(s1)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 4
dtype: int32
df = pd.DataFrame({'Variable': [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]})
df['Flag'] = (df.Variable == 0).cumsum()
print(df)
Variable Flag
0 0 1
1 1 1
2 1 1
3 1 1
4 0 2
5 1 2
6 1 2
7 1 2
8 0 3
9 1 3
10 1 3
11 0 4
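One caveat worth noting (my addition, not part of the original answer): if the data does not start with 0, the rows before the first 0 get flag 0, since no 0 has been seen yet:

import pandas as pd

# Rows before the first 0 get flag 0
s2 = pd.Series([1, 1, 0, 1])
print((s2 == 0).cumsum().tolist())   # [0, 0, 1, 1]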
Instead of using pandas, you can just use a loop, like this:
a = '0 1 1 1 0 1 1 1 0 1 1 0'
flags = []
flag = 0
for i in a.split():
    # Increment the flag at each 0, then carry it forward
    if int(i) == 0:
        flag += 1
    flags.append(flag)
print(flags)
Output:
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
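The same flag sequence can also be produced without an explicit loop using itertools.accumulate (a sketch of my own, equivalent to the loop above):

from itertools import accumulate

a = '0 1 1 1 0 1 1 1 0 1 1 0'
# Each '0' token adds 1 to the running total, each '1' adds 0
flags = list(accumulate(int(x == '0') for x in a.split()))
print(flags)   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]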
I'm working in pandas with the following dictionary, which I load into a DataFrame:
import pandas as pd

data = {"Value": [4, 4, 2, 1, 1, 1, 0, 7, 0, 4, 1, 1, 3, 0, 3, 0, 7, 0, 4, 1, 0, 1, 0, 1, 4, 4, 2, 3],
        "IdPar": [0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 0, 0, 22, 22, 28, 28, 28, 28, 0, 0, 38, 38, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
Whenever a number repeats in the IdPar column, I would like a sequential counter in the same row in a new column called Count, with the condition that wherever IdPar is 0, Count should also be 0. This is what I expect to get:
    Value  IdPar  Count
0       4      0      0
1       4      0      0
2       2      0      0
3       1      0      0
4       1      0      0
5       1      0      0
6       0     10      1
7       7     10      2
8       0     10      3
9       4     10      4
10      1     10      5
11      1      0      0
12      3      0      0
13      0     22      1
14      3     22      2
15      0     28      1
16      7     28      2
17      0     28      3
18      4     28      4
19      1      0      0
20      0      0      0
21      1     38      1
22      0     38      2
23      1      0      0
24      4      0      0
25      4      0      0
26      2      0      0
27      3      0      0
I've gone through the pandas documentation and tried many functions, including ne, shift, cumsum, groupby, pivot_table and transform, but none of them give the result I want:
s = df.pivot_table(index=['IdPar'], aggfunc='size')
print(s)
t = df['IdPar'].ne(df['IdPar'].shift()).cumsum()
print(t)
df['Count'] = df['IdPar'].isin(df['IdPar'])
df['Count'] = df.loc[df['Count'] == True, 'IdPar']
print(df)
The closest I've come is the code below, which puts the total count of repetitions in the Count column for every row of a group, but that isn't what I want either:
df['Count'] = df.groupby(['IdPar'])['Value'].transform('count')
print(df['Count'])
I really appreciate anyone who can help me. Any comment helps.
Try cumcount:
df['Count'] = df.groupby('IdPar')['IdPar'].cumcount() + 1
df.loc[df['IdPar'] == 0, 'Count'] = 0
print(df)
Or try in one line:
df['Count'] = df.groupby('IdPar').cumcount().add(1).mask(df['IdPar'].eq(0), 0)
Both produce the same output:
    Value  IdPar  Count
0       4      0      0
1       4      0      0
2       2      0      0
3       1      0      0
4       1      0      0
5       1      0      0
6       0     10      1
7       7     10      2
8       0     10      3
9       4     10      4
10      1     10      5
11      1      0      0
12      3      0      0
13      0     22      1
14      3     22      2
15      0     28      1
16      7     28      2
17      0     28      3
18      4     28      4
19      1      0      0
20      0      0      0
21      1     38      1
22      0     38      2
23      1      0      0
24      4      0      0
25      4      0      0
26      2      0      0
27      3      0      0
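The same logic can also be written with numpy.where (my own variant, not from the original answers):

import numpy as np

# 0 wherever IdPar is 0, otherwise the within-group running count starting at 1
df['Count'] = np.where(df['IdPar'].eq(0), 0, df.groupby('IdPar').cumcount() + 1)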
I have a DataFrame on which I want to do multi-label classification. One of the approaches suggested to me was to compute a probability vector. Here's an example of my DataFrame, representing training data:
   id  ABC  DEF  GHI
1   0    0    0    1
2   1    0    1    0
3   2    1    0    0
4   3    0    1    1
5   4    0    0    0
6   5    0    1    1
7   6    1    1    1
8   7    1    0    1
9   8    1    1    0
I would like to concatenate columns ABC, DEF and GHI into a new column. I will also have to do this with more than 3 columns, so I want to do it relatively cleanly using a column list or something similar:
col_list = ['ABC','DEF','GHI']
The result I am looking for would be something like:
   id  ABC  DEF  GHI       Conc
1   0    0    0    1  [0, 0, 1]
2   1    0    1    0  [0, 1, 0]
3   2    1    0    0  [1, 0, 0]
4   3    0    1    1  [0, 1, 1]
5   4    0    0    0  [0, 0, 0]
6   5    0    1    1  [0, 1, 1]
7   6    1    1    1  [1, 1, 1]
8   7    1    0    1  [1, 0, 1]
9   8    1    1    0  [1, 1, 0]
Try:
col_list = ['ABC','DEF','GHI']
df['agg_lst'] = df.apply(lambda x: [x[col] for col in col_list], axis=1)
You can use agg with the built-in list function:
df[col_list].agg(list, axis=1)
1 [0, 0, 1]
2 [0, 1, 0]
3 [1, 0, 0]
4 [0, 1, 1]
5 [0, 0, 0]
6 [0, 1, 1]
7 [1, 1, 1]
8 [1, 0, 1]
9 [1, 1, 0]
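For larger frames, a variant of my own that avoids the per-row Python call is to go through NumPy:

# .values yields a 2-D array; .tolist() turns it into one list per row
df['Conc'] = df[col_list].values.tolist()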
I have the following indices as you would get them from np.where(...):
import numpy as np

coords = (
    np.asarray([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 2, 8, 2, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
Another tuple with indices is meant to select those that are in coords:
index = (
    np.asarray([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 8, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
So, for instance, the first coordinate in coords, (0, 2, 0, 0), is selected because it also appears in index (at position 0), but the second, (0, 2, 0, 1), isn't, because it appears nowhere in index.
I can calculate the mask easily with [x in zip(*index) for x in zip(*coords)] (converted from bool to int for better readability):
[1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
but this wouldn't be very efficient for larger arrays. Is there a more "numpy-based" way that could calculate the mask?
I'm not so sure about efficiency, but since you're basically comparing coordinate tuples, you could use SciPy's distance functions. Something along these lines:
from scipy.spatial.distance import cdist

# Stack each tuple of index arrays into an (n_points, 4) matrix of coordinates
c = np.stack(coords).T
i = np.stack(index).T
# Pairwise distances; a zero distance means an exact coordinate match
d = cdist(c, i)
In [113]: np.any(d == 0, axis=1).astype(int)
Out[113]:
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1])
By default it uses the L2 norm; you could probably make it slightly faster with a simpler distance function, e.g.:
d = cdist(c, i, lambda u, v: np.all(np.equal(u, v)))
np.any(d != 0, axis=1).astype(int)
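If you would rather stay in plain NumPy (a sketch of my own, not part of the original answer, reusing the stacked arrays c and i from above), broadcasting produces the same mask, at the cost of an (n, m, 4) boolean intermediate:

# Compare every row of c against every row of i, then reduce over the coordinates
mask = (c[:, None, :] == i[None, :, :]).all(axis=-1).any(axis=1).astype(int)
print(mask)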
You can use np.ravel_multi_index to compress the columns into unique numbers which are easier to handle:
# Largest value along each dimension, across both tuples
cmx = tuple(map(np.max, coords))
imx = tuple(map(np.max, index))
shape = np.maximum(cmx, imx) + 1

# Compress each 4-tuple of coordinates into a single unique integer
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)
it.sort()

# For every compressed coordinate, check whether it occurs in the sorted index
result = ct == it[it.searchsorted(ct)]
print(result.view(np.int8))
Prints:
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
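One caveat (my note, not from the original answer): it.searchsorted(ct) returns len(it) when a compressed coordinate is larger than everything in it, which would make the indexing raise an IndexError. Clipping the positions guards against that:

# Out-of-range coordinates clip to the last position and simply compare unequal
pos = np.clip(it.searchsorted(ct), 0, len(it) - 1)
result = ct == it[pos]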
I have a DataFrame which looks like this:
>>> df
type value
0 1 0.698791
1 3 0.228529
2 3 0.560907
3 1 0.982690
4 1 0.997881
5 1 0.301664
6 1 0.877495
7 2 0.561545
8 1 0.167920
9 1 0.928918
10 2 0.212339
11 2 0.092313
12 4 0.039266
13 2 0.998929
14 4 0.476712
15 4 0.631202
16 1 0.918277
17 3 0.509352
18 1 0.769203
19 3 0.994378
I would like to group on the type column and obtain histogram bins for the column value in 10 new columns, e.g. something like this:
      1  2  3  4  5  6  7  8  9  10
type
1     0  1  1  1  1  0  0  0  0   2
2     2  0  1  0  1  0  0  0  1   1
3     2  0  0  0  1  0  0  0  0   1
4     1  1  1  0  0  0  0  0  0   1
Where column 1 is the count for the first bin (0.0 to 0.1) and so on...
Using numpy.histogram, I can only obtain the following:
>>> df.groupby('type')['value'].agg(lambda x: numpy.histogram(x, bins=10, range=(0, 1)))
type
1 ([0, 1, 1, 1, 1, 0, 0, 0, 0, 2], [0.0, 0.1, 0....
2 ([2, 0, 1, 0, 1, 0, 0, 0, 1, 1], [0.0, 0.1, 0....
3 ([2, 0, 0, 0, 1, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
4 ([1, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
Name: value, dtype: object
I can't manage to put this into the desired format afterwards (at least not in a simple way). I found a trick to do what I want, but it is very ugly:
>>> d = {str(k): lambda x, _k = k: ((x >= (_k - 1)/10) & (x < _k/10)).sum() for k in range(1, 11)}
>>> df.groupby('type')['value'].agg(d)
      1  2  3  4  5  6  7  8  9  10
type
1     0  1  1  1  1  0  0  0  0   2
2     2  0  1  0  1  0  0  0  1   1
3     2  0  0  0  1  0  0  0  0   1
4     1  1  1  0  0  0  0  0  0   1
Is there a better way to do what I want? I know that in R the aggregate method can return a data frame, but that doesn't seem to be the case in Python...
Is this what you want?
import numpy as np
import pandas as pd

bins = np.linspace(0, 1.0, 11)
labels = list(range(1, 11))

result = (df.assign(q=pd.cut(df.value, bins=bins, labels=labels, right=False))
            .pivot_table(index='type', columns='q', aggfunc='size', fill_value=0))
print(result)
q 1 2 3 4 5 6 7 8 9 10
type
1 0 1 0 1 0 0 1 1 1 4
2 1 0 1 0 0 1 0 0 0 1
3 0 0 1 0 0 2 0 0 0 1
4 1 0 0 0 1 0 1 0 0 0
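An equivalent one-liner (my own variant, reusing df, bins and labels from above) uses pd.crosstab, which avoids the assign step:

# Cross-tabulate type against the binned values directly; note that bins with
# no observations anywhere may be dropped rather than shown as zero columns
print(pd.crosstab(df['type'], pd.cut(df['value'], bins=bins, labels=labels, right=False)))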
I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the sequence of values in column x is repeated, in order, N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3  # length of each repeating block, i.e. the number of distinct x values
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it is less robust, since it relies on the values in y appearing in the right order.
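To get the exact val1..val4 column names from the desired output (a small addition of my own), you can rename the pivoted columns afterwards:

# Flatten the pivot result's MultiIndex columns to val1..valN
result = df.pivot(index='x', columns='columns')
result.columns = ['val%d' % (i + 1) for i in range(result.shape[1])]
print(result)
#    val1  val2  val3  val4
# x
# 0     1     5     1     1
# 1     1     6     8     7
# 2     2     8     4     3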