I am trying to take the average of every fifth and every sixth row of column A in a dataframe, and put the result in a new column B. But the result shows NaN. Could it be that I am not returning the values correctly?
Here is the sample data:
PID A
1 0
1 3
1 2
1 6
1 0
1 2
2 3
2 3
2 1
2 4
2 0
2 4
Expected results:
PID A B
1 0 1
1 3 1
1 2 1
1 6 1
1 0 1
1 2 1
2 3 2
2 3 2
2 1 2
2 4 2
2 0 2
2 4 2
My code:
lst1 = df.iloc[5::6, :]
lst2 = df.iloc[4::6, :]
df['B'] = (lst1['A'] + lst2['A'])/2
print(df['B'])
The script runs without error, but column B only shows NaN.
Thanks for your help!
The problem is that the data are not aligned: the two slices have different indexes, so the addition produces NaNs.
print(lst1)
PID A
5 1 2
11 2 4
print(lst2)
PID A
4 1 0
10 2 0
print (lst1['A'] + lst2['A'])
4 NaN
5 NaN
10 NaN
11 NaN
Name: A, dtype: float64
The solution is to use values to add the Series to a numpy array, which bypasses index alignment:
print (lst1['A'] + (lst2['A'].values))
5 2
11 4
Name: A, dtype: int64
Or you can sum two numpy arrays:
print (lst1['A'].values + (lst2['A'].values))
[2 4]
It seems you need the following; the computed means land on index 5 and 11 (the last row of each group), so bfill propagates them backwards over the earlier rows:
df['B'] = (lst1['A'] + lst2['A'].values).div(2)
df['B'] = df['B'].bfill()
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
But if you need the mean of the 5th and 6th value per PID group, use groupby with transform:
df['B'] = df.groupby('PID').transform(lambda x: x.iloc[[4, 5]].mean())
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 1
A straightforward way: take the mean of the 5th and 6th positions within each group defined by 'PID'.
df.assign(B=df.groupby('PID').transform(lambda x: x.values[[4, 5]].mean()))
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 2
A fun way using join, assuming there are exactly 6 rows for each 'PID' (a quick check for that assumption follows the output below).
df.join(
    df.set_index('PID').A
      .pipe(lambda d: (d.iloc[4::6] + d.iloc[5::6]) / 2)  # mean of the 5th and 6th row per PID
      .rename('B'),
    on='PID'
)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
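If you want to guard Option 2's exactly-6-rows assumption, a quick check (a suggestion, not part of the original answer):
# Fail fast if any PID does not have exactly 6 rows
assert df.groupby('PID').size().eq(6).all(), "each PID needs exactly 6 rows"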
Related
I have the following pandas dataframe:
SEC POS DATA
1 1 4
2 1 4
3 1 5
4 1 5
5 2 2
6 3 4
7 3 2
8 4 2
9 4 2
10 1 8
11 1 6
12 2 5
13 2 5
14 2 4
15 2 6
16 3 2
17 4 1
Now I want to know the mean value of DATA and the first value of SEC for every block of the POS column.
So like this:
SEC POS DATA
1 1 4.5
5 2 2
6 3 3
8 4 2
10 1 7
12 2 5
16 3 2
17 4 1
Additionally, I want to subtract the DATA value of POS=4 from its 3 prior DATA values, i.e. where POS = [1,2,3].
Obtaining the following:
SEC POS DATA
1 1 2.5
5 2 0
6 3 1
8 4 2
10 1 6
12 2 4
16 3 1
17 4 1
I figured out how to do this by splitting the dataframe into many smaller dataframes with a for loop, taking the mean, and then subtracting across the resulting dataframes (a sketch of that kind of approach is below). However, this is very slow, so I'm wondering if there's a faster way to do this. Can anyone help?
Thanks!
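For reference, a minimal sketch of the loop-based approach described above (the original code was not shown, so this is illustrative only):
import pandas as pd

# Illustrative only: split the frame into blocks of consecutive equal POS
# values with a Python-level loop, then aggregate each block.
# This is slow for large frames because it iterates in pure Python.
blocks, start = [], 0
for i in range(1, len(df) + 1):
    if i == len(df) or df['POS'].iat[i] != df['POS'].iat[start]:
        block = df.iloc[start:i]
        blocks.append({'SEC': block['SEC'].iat[0],
                       'POS': block['POS'].iat[0],
                       'DATA': block['DATA'].mean()})
        start = i
result = pd.DataFrame(blocks)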
Another solution:
# Start a new group whenever POS changes from the previous row
diff_to_previous = df.POS != df.POS.shift(1)
df = df.groupby(diff_to_previous.cumsum(), as_index=False).agg({'SEC': 'first', 'POS': 'first', 'DATA': 'mean'})
# Label each run of blocks that ends with a POS == 4 row
df['tmp'] = (df['POS'] == 4).astype(int).shift(fill_value=0).cumsum()
# Within each run, subtract the last block's DATA from all earlier blocks
df['DATA'] = df.groupby('tmp')['DATA'].transform(lambda x: [*(x[x.index[:-1]] - x[x.index[-1]]), x[x.index[-1]]])
df = df.drop(columns='tmp')
print(df)
Prints:
SEC POS DATA
0 1 1 2.5
1 5 2 0.0
2 6 3 1.0
3 8 4 2.0
4 10 1 6.0
5 12 2 4.0
6 16 3 1.0
7 17 4 1.0
For your first problem, we can use:
# Consecutive rows with equal POS form one block
grps = df['POS'].ne(df['POS'].shift()).cumsum()
dfg = df.groupby(grps).agg(
POS=('POS', 'min'),
SEC=('SEC', 'min'),
DATA=('DATA', 'mean')
).reset_index(drop=True)
POS SEC DATA
0 1 1 4.5
1 2 5 2.0
2 3 6 3.0
3 4 8 2.0
4 1 10 7.0
5 2 12 5.0
6 3 16 2.0
7 4 17 1.0
For your second problem:
# A new cycle starts whenever POS decreases
grps2 = dfg['POS'].lt(dfg['POS'].shift()).cumsum()
m = (
dfg.groupby(grps2)
.apply(lambda x: x.loc[x['POS'].isin([1,2,3]), 'DATA']
- x.loc[x['POS'].eq(4), 'DATA'].iat[0])
.droplevel(0)
)
dfg['DATA'].update(m)
POS SEC DATA
0 1 1 2.5
1 2 5 0.0
2 3 6 1.0
3 4 8 2.0
4 1 10 6.0
5 2 12 4.0
6 3 16 1.0
7 4 17 1.0
I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column. I need to fill them with median values, according to group and subgroup.
I made a pivot table with a double index and the median of the target column, but I don't understand how to get these values and put them back into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1
1 2 1 A 3
2 3 3 B 8
3 4 2 C 1
4 5 3 A 3
5 6 2 C 6
6 7 1 B 2
7 8 1 C 3
8 9 2 A 7
9 10 3 C 4
10 11 2 B 6
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby.transform, then fillna. For demonstration, take a frame where the value at index 1 is NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # <- NaN value to fill
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
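As an aside, if you would rather reuse the pivot table df_struct you already built, one possible sketch (not from the original answer) maps each row's (group, subgroup) pair onto the pivot's index and fills from the result:
# Look up each row's (group, subgroup) median in the pivot table built above,
# then use the looked-up values to fill the NaNs.
medians = df.set_index(['group', 'subgroup']).index.map(df_struct['value'])
df['value'] = df['value'].fillna(pd.Series(medians, index=df.index))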
I cannot figure out a nice pandas-ish way to fill in the NaN values produced by a left join by sampling from the right table, e.g.:
joined_left = left.merge(right, how="left", left_on=[attr1], right_on=[attr2])
where left and right are:
0 1 2
0 1 1 1
1 2 2 2
2 3 3 3
3 9 9 9
4 1 3 2
0 1 2
0 1 2 2
1 1 2 3
2 3 2 2
3 3 2 9
4 3 2 2
produces something like:
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 NaN NaN
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 NaN NaN
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
How do I sample a row from a right table instead of filling NaNs?
This is what I have tried so far:
import numpy as np
import pandas as pd

left = [[1,1,1], [2,2,2], [3,3,3], [9,9,9], [1,3,2]]
right = [[1,2,2], [1,2,3], [3,2,2], [3,2,9], [3,2,2]]
left = np.asarray(left)
right = np.asarray(right)
left = pd.DataFrame(left)
right = pd.DataFrame(right)
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0])
while joined_left.isnull().values.any():
    right_sample = right.sample().drop(0, axis=1)
    joined_left.fillna(value=right_sample, limit=1)
print(joined_left)
Basically, I sample randomly and use fillna() with limit=1 to fill the first occurrence of a NaN value... but for some reason I get no output.
Thank you!
One possible output could be:
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 2.0 2.0
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 2.0 9.0
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
with sampled rows 3 2 2 and 3 2 9.
Using sample with fillna
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0], indicator=True)  # add an indicator column
joined_left
Out[705]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 NaN NaN left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 NaN NaN left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
nnull = joined_left['_merge'].eq('left_only').sum()  # count rows with no match in right
s = right.sample(nnull)  # sample that many rows from right
s.index = joined_left.index[joined_left['_merge'].eq('left_only')]  # align the sample's index with the rows holding NaN
joined_left.fillna(s.rename(columns={1: '1_y', 2: '2_y'}))
Out[706]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 2.0 2.0 left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 2.0 3.0 left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
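Two notes (additions, not part of the original answer): the loop in the question printed nothing because fillna returns a new frame unless it is assigned back, so joined_left never changed and the while condition never became false. Also, DataFrame.sample draws without replacement by default, so the call above assumes right has at least nnull rows; for more rows, or for reproducible draws, you can write:
s = right.sample(nnull, replace=True, random_state=0)  # with replacement, reproducible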
My dataset looks like this (the first row is the header):
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
I want to select a range of columns in each row based on the value in column 5, e.g.:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
Let's try:
df.apply(lambda x: x[:x.iloc[5]], axis=1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Or recreate your dataframe:
df=pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0
How do I count the number of unique strings in a rolling window of a pandas dataframe?
a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
Output, same as original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric, by factorize or by rank. The min_periods parameter is also necessary to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
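As an aside (a sketch, not from the original answers): you can also count the distinct strings directly, without converting to numeric first, by slicing trailing windows in a list comprehension and calling Series.nunique; this skips the fast rolling machinery, at some cost in speed:
import pandas as pd

a = pd.DataFrame(['a', 'b', 'a', 'a', 'b', 'c', 'd', 'e', 'e', 'e', 'e'])
window = 3
# For each position, count the distinct values in the trailing window
b = pd.Series([a[0].iloc[max(0, i - window + 1): i + 1].nunique()
               for i in range(len(a))])
print(b)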