pandas add variables according to variable value - python

Suppose the following pandas dataframe
Wafer_Id v1 v2
0 0 9 6
1 0 7 8
2 0 1 5
3 1 6 6
4 1 0 8
5 1 5 0
6 2 8 8
7 2 2 6
8 2 3 5
9 3 5 1
10 3 5 6
11 3 9 8
I want to group it according to WaferId and I would like to get something like
w
Out[60]:
Wafer_Id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3
0 0 9 7 1 6 ... ...
1 1 6 0 5 6
2 2 8 2 3 8
3 3 5 5 9 1
I think that I can obtain the result with the pivot function but I am not sure of how to do it

Possible solution
oes = pd.DataFrame()
oes['Wafer_Id'] = [0,0,0,1,1,1,2,2,2,3,3,3]
oes['v1'] = np.random.randint(0, 10, 12)
oes['v2'] = np.random.randint(0, 10, 12)
oes['id'] = [0, 1, 2] * 4
oes.pivot(index='Wafer_Id', columns='id')
oes
Out[74]:
Wafer_Id v1 v2 id
0 0 8 7 0
1 0 3 3 1
2 0 8 0 2
3 1 2 5 0
4 1 4 1 1
5 1 8 8 2
6 2 8 6 0
7 2 4 7 1
8 2 4 3 2
9 3 4 6 0
10 3 9 2 1
11 3 7 1 2
oes.pivot(index='Wafer_Id', columns='id')
Out[75]:
v1 v2
id 0 1 2 0 1 2
Wafer_Id
0 8 3 8 7 3 0
1 2 4 8 5 1 8
2 8 4 4 6 7 3
3 4 9 7 6 2 1

Related

How to reduce pandas dataframe to only those individuals with all timepoints

I am trying to conduct a mixed model analysis but would like to only include individuals who have data in all timepoints available. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc keeping only the ID's that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3

How can I create header starting with1

I loaded the data without header.
train = pd.read_csv('caravan.train', delimiter ='\t', header=None)
train.index = np.arange(1,len(train)+1)
train
0 1 2 3 4 5 6 7 8 9
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
but the header started from 0, and I want to create header starting with 1 insteade of 0
How can I do this?
In your case
df.columns = df.columns.astype(int)+1
df
Out[99]:
1 2 3 4 5 6 7 8 9 10
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7

Need to count repeating, consecutive values in python dataframe within a groupby

df = pd.DataFrame({'site':[1,1,1,1,1,1,1,1,1,1], 'parm':[8,8,8,8,8,9,9,9,9,9],
'date':[1,2,3,4,5,1,2,3,4,5], 'obs':[1,1,2,3,3,3,5,5,6,6]})
Output
site parm date obs
0 1 8 1 1
1 1 8 2 1
2 1 8 3 2
3 1 8 4 3
4 1 8 5 3
5 1 9 1 3
6 1 9 2 5
7 1 9 3 5
8 1 9 4 6
9 1 9 5 6
I want to count repeating, sequential "obs" values within a "site" and "parm". I have this code which is close:
df['consecutive'] = df.parm.groupby((df.obs != df.obs.shift()).cumsum()).transform('size')
Output
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 3
4 1 8 5 3 3
5 1 9 1 3 3
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
It creates the new column with the count. The gap is when the parm changes from 8 to 9 it includes the parm 9 in the parm 8 count. The expected output is:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
You need to throw site, parm as indicated in the question into groupby:
df['consecutive'] = (df.groupby([df.obs.ne(df.obs.shift()).cumsum(),
'site', 'parm']
)
['obs'].transform('size')
)
Output:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2

Holding a first value in a column while another column equals a value?

I would like to hold the first value in a column while another column does not equal zero. For Column B, values alternate between -1, 0, 1. For Column C, values equal any integer. The objective is holding the first value of Column C while Column B equals zero. The current DataFrame is as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 8
5 5 0 9
6 6 0 1
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 8
11 5 0 9
12 6 0 10
The resulting DataFrame should be as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
You need first create NaNs by condition in column C and then add values by ffill:
mask = (df['B'].shift().fillna(False)).astype(bool) | (df['B'])
df['C'] = df.loc[mask, 'C']
df['C'] = df['C'].ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
Or use where and if type of all values is integer, add astype:
mask = (df['B'].shift().fillna(False)).astype(bool) | (df['B'])
df['C'] = df['C'].where(mask).ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9

numpy replace groups of elements with integers incrementally

import numpy as np
data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])
I need to replace each group of strings with an integer incrementally like this
data = np.array([0,0,0,1,1,1,1,2,2,3,3,3])
I'm looking for a numpy solution
With this dataset http://www.uploadmb.com/dw.php?id=1364341573
import numpy as np
f = open('test.txt','r')
lines = np.array([ line.strip() for line in f.readlines() ])
lines100 = lines[0:100]
_, ind, inv = np.unique(lines100, return_index=True, return_inverse=True)
print ind
print inv
nums = np.argsort(ind)[inv]
print nums
[ 0 83 62 40 19]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]
lines200 = lines[0:200]
_, ind, inv = np.unique(lines200, return_index=True, return_inverse=True)
print ind
print inv
nums = np.argsort(ind)[inv]
print nums
[167 0 83 124 104 144 185 62 40 19]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7
7 7 7 7 7 7 7 7 7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
[9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]
EDIT: This doesn't always work:
>>> a,b,c = np.unique(data, return_index=True, return_inverse=True)
>>> c # almost!!!
array([1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 3, 3])
>>> np.argsort(b)[c]
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3], dtype=int64)
But this does work:
def replace_groups(data):
a,b,c, = np.unique(data, True, True)
_, ret = np.unique(b[c], False, True)
return ret
and is faster than the dictionary replacement approach, about 33% for larger datasets:
def replace_groups_dict(data):
_, ind = np.unique(data, return_index=True)
unqs = data[np.sort(ind)]
data_id = dict(zip(unqs, np.arange(data.size)))
num = np.array([data_id[datum] for datum in data])
return num
In [7]: %timeit replace_groups_dict(lines100)
10000 loops, best of 3: 68.8 us per loop
In [8]: %timeit replace_groups_dict(lines200)
10000 loops, best of 3: 106 us per loop
In [9]: %timeit replace_groups_dict(lines)
10 loops, best of 3: 32.1 ms per loop
In [10]: %timeit replace_groups(lines100)
10000 loops, best of 3: 67.1 us per loop
In [11]: %timeit replace_groups(lines200)
10000 loops, best of 3: 78.4 us per loop
In [12]: %timeit replace_groups(lines)
10 loops, best of 3: 23.1 ms per loop
Given #DSM's noticing that my original idea doesn't work robustly, the best solution I can think of is a replacement dictionary:
data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])
_, ind = np.unique(data, return_index=True)
unqs = data[np.sort(ind)]
data_id = dict(zip(unqs, np.arange(data.size)))
num = np.array([data_id[datum] for datum in data])
for the month data:
In [5]: f = open('test.txt','r')
In [6]: data = np.array([line.strip() for line in f.readlines()])
In [7]: _, ind, inv = np.unique(data, return_index=True)
In [8]: months = data[np.sort(ind)]
In [9]: month_id = dict(zip(months, np.arange(months.size)))
In [10]: np.array([month_id[datum] for datum in data])
Out[10]: array([ 0, 0, 0, ..., 41, 41, 41])

Categories

Resources