I am trying to take the average of every fifth and every sixth row of column A in a dataframe, and put the result in a new column B. But the result shows NaN. Could it be that I am not returning the values correctly?
Here is the sample data:
PID A
1 0
1 3
1 2
1 6
1 0
1 2
2 3
2 3
2 1
2 4
2 0
2 4
Expected results:
PID A B
1 0 1
1 3 1
1 2 1
1 6 1
1 0 1
1 2 1
2 3 2
2 3 2
2 1 2
2 4 2
2 0 2
2 4 2
My code:
lst1 = df.iloc[5::6, :]
lst2 = df.iloc[4::6, :]
df['B'] = (lst1['A'] + lst2['A'])/2
print(df['B'])
The script runs without error, but column B only shows NaN.
Thanks for your help!
The problem is that the data are not aligned: the two slices have different indexes, so the addition produces NaNs.
print(lst1)
PID A
5 1 2
11 2 4
print(lst2)
PID A
4 1 0
10 2 0
print (lst1['A'] + lst2['A'])
4 NaN
5 NaN
10 NaN
11 NaN
Name: A, dtype: float64
The solution is to use values to add the Series to a numpy array, which bypasses index alignment:
print (lst1['A'] + (lst2['A'].values))
5 2
11 4
Name: A, dtype: int64
Or you can sum two numpy arrays:
print (lst1['A'].values + (lst2['A'].values))
[2 4]
It seems you need the following; the computed means land on index 5 and 11 (the last row of each group), so bfill propagates them backwards over the earlier rows:
df['B'] = (lst1['A'] + lst2['A'].values).div(2)
df['B'] = df['B'].bfill()
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
But if you need the mean of the 5th and 6th value per PID group, use groupby with transform:
df['B'] = df.groupby('PID').transform(lambda x: x.iloc[[4, 5]].mean())
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 1
A straightforward way: take the mean of the 5th and 6th positions within each group defined by 'PID'.
df.assign(B=df.groupby('PID').transform(lambda x: x.values[[4, 5]].mean()))
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 2
A fun way using join, assuming there are exactly 6 rows for each 'PID' (a quick check for that assumption follows the output below).
df.join(
    df.set_index('PID').A
      .pipe(lambda d: (d.iloc[4::6] + d.iloc[5::6]) / 2)  # mean of the 5th and 6th row per PID
      .rename('B'),
    on='PID'
)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
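If you want to guard Option 2's exactly-6-rows assumption, a quick check (a suggestion, not part of the original answer):
# Fail fast if any PID does not have exactly 6 rows
assert df.groupby('PID').size().eq(6).all(), "each PID needs exactly 6 rows"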
Related
I have the following pandas dataframe:
SEC POS DATA
1 1 4
2 1 4
3 1 5
4 1 5
5 2 2
6 3 4
7 3 2
8 4 2
9 4 2
10 1 8
11 1 6
12 2 5
13 2 5
14 2 4
15 2 6
16 3 2
17 4 1
Now I want to know the mean value of DATA and the first value of SEC for every block of the POS column.
So like this:
SEC POS DATA
1 1 4.5
5 2 2
6 3 3
8 4 2
10 1 7
12 2 5
16 3 2
17 4 1
Additionally, I want to subtract the DATA value of POS=4 from its 3 prior DATA values, i.e. where POS = [1,2,3].
Obtaining the following:
SEC POS DATA
1 1 2.5
5 2 0
6 3 1
8 4 2
10 1 6
12 2 4
16 3 1
17 4 1
I figured out how to do this by splitting the dataframe into many smaller dataframes with a for loop, taking the mean, and then subtracting across the resulting dataframes (a sketch of that kind of approach is below). However, this is very slow, so I'm wondering if there's a faster way to do this. Can anyone help?
Thanks!
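For reference, a minimal sketch of the loop-based approach described above (the original code was not shown, so this is illustrative only):
import pandas as pd

# Illustrative only: split the frame into blocks of consecutive equal POS
# values with a Python-level loop, then aggregate each block.
# This is slow for large frames because it iterates in pure Python.
blocks, start = [], 0
for i in range(1, len(df) + 1):
    if i == len(df) or df['POS'].iat[i] != df['POS'].iat[start]:
        block = df.iloc[start:i]
        blocks.append({'SEC': block['SEC'].iat[0],
                       'POS': block['POS'].iat[0],
                       'DATA': block['DATA'].mean()})
        start = i
result = pd.DataFrame(blocks)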
Another solution:
# Start a new group whenever POS changes from the previous row
diff_to_previous = df.POS != df.POS.shift(1)
df = df.groupby(diff_to_previous.cumsum(), as_index=False).agg({'SEC': 'first', 'POS': 'first', 'DATA': 'mean'})
# Label each run of blocks that ends with a POS == 4 row
df['tmp'] = (df['POS'] == 4).astype(int).shift(fill_value=0).cumsum()
# Within each run, subtract the last block's DATA from all earlier blocks
df['DATA'] = df.groupby('tmp')['DATA'].transform(lambda x: [*(x[x.index[:-1]] - x[x.index[-1]]), x[x.index[-1]]])
df = df.drop(columns='tmp')
print(df)
Prints:
SEC POS DATA
0 1 1 2.5
1 5 2 0.0
2 6 3 1.0
3 8 4 2.0
4 10 1 6.0
5 12 2 4.0
6 16 3 1.0
7 17 4 1.0
For your first problem, we can use:
# Consecutive rows with equal POS form one block
grps = df['POS'].ne(df['POS'].shift()).cumsum()
dfg = df.groupby(grps).agg(
POS=('POS', 'min'),
SEC=('SEC', 'min'),
DATA=('DATA', 'mean')
).reset_index(drop=True)
POS SEC DATA
0 1 1 4.5
1 2 5 2.0
2 3 6 3.0
3 4 8 2.0
4 1 10 7.0
5 2 12 5.0
6 3 16 2.0
7 4 17 1.0
For your second problem:
# A new cycle starts whenever POS decreases
grps2 = dfg['POS'].lt(dfg['POS'].shift()).cumsum()
m = (
dfg.groupby(grps2)
.apply(lambda x: x.loc[x['POS'].isin([1,2,3]), 'DATA']
- x.loc[x['POS'].eq(4), 'DATA'].iat[0])
.droplevel(0)
)
dfg['DATA'].update(m)
POS SEC DATA
0 1 1 2.5
1 2 5 0.0
2 3 6 1.0
3 4 8 2.0
4 1 10 6.0
5 2 12 4.0
6 3 16 1.0
7 4 17 1.0
I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column. I need to fill them with median values, according to group and subgroup.
I made a pivot table with a double index and the median of the target column, but I don't understand how to get these values and put them back into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1
1 2 1 A 3
2 3 3 B 8
3 4 2 C 1
4 5 3 A 3
5 6 2 C 6
6 7 1 B 2
7 8 1 C 3
8 9 2 A 7
9 10 3 C 4
10 11 2 B 6
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby.transform, then fillna. For demonstration, take a frame where the value at index 1 is NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # <- NaN value to fill
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
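As an aside, if you would rather reuse the pivot table df_struct you already built, one possible sketch (not from the original answer) maps each row's (group, subgroup) pair onto the pivot's index and fills from the result:
# Look up each row's (group, subgroup) median in the pivot table built above,
# then use the looked-up values to fill the NaNs.
medians = df.set_index(['group', 'subgroup']).index.map(df_struct['value'])
df['value'] = df['value'].fillna(pd.Series(medians, index=df.index))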
I cannot figure out a nice pandas-ish way to fill in the NaN values produced by a left join by sampling from the right table, e.g.:
joined_left = left.merge(right, how="left", left_on=[attr1], right_on=[attr2])
where left and right are:
0 1 2
0 1 1 1
1 2 2 2
2 3 3 3
3 9 9 9
4 1 3 2
0 1 2
0 1 2 2
1 1 2 3
2 3 2 2
3 3 2 9
4 3 2 2
produces something like:
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 NaN NaN
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 NaN NaN
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
How do I sample a row from a right table instead of filling NaNs?
This is what I have tried so far:
import numpy as np
import pandas as pd

left = [[1,1,1], [2,2,2], [3,3,3], [9,9,9], [1,3,2]]
right = [[1,2,2], [1,2,3], [3,2,2], [3,2,9], [3,2,2]]
left = np.asarray(left)
right = np.asarray(right)
left = pd.DataFrame(left)
right = pd.DataFrame(right)
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0])
while joined_left.isnull().values.any():
    right_sample = right.sample().drop(0, axis=1)
    joined_left.fillna(value=right_sample, limit=1)
print(joined_left)
Basically, I sample randomly and use fillna() with limit=1 to fill the first occurrence of a NaN value... but for some reason I get no output.
Thank you!
One possible output could be:
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 2.0 2.0
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 2.0 9.0
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
with sampled rows 3 2 2 and 3 2 9.
Using sample with fillna
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0], indicator=True)  # add an indicator column
joined_left
Out[705]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 NaN NaN left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 NaN NaN left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
nnull = joined_left['_merge'].eq('left_only').sum()  # count rows with no match in right
s = right.sample(nnull)  # sample that many rows from right
s.index = joined_left.index[joined_left['_merge'].eq('left_only')]  # align the sample's index with the rows holding NaN
joined_left.fillna(s.rename(columns={1: '1_y', 2: '2_y'}))
Out[706]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 2.0 2.0 left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 2.0 3.0 left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
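Two notes (additions, not part of the original answer): the loop in the question printed nothing because fillna returns a new frame unless it is assigned back, so joined_left never changed and the while condition never became false. Also, DataFrame.sample draws without replacement by default, so the call above assumes right has at least nnull rows; for more rows, or for reproducible draws, you can write:
s = right.sample(nnull, replace=True, random_state=0)  # with replacement, reproducible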
My dataset looks like this (the first row is the header):
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
I want to select a range of columns in each row based on the value in column 5, e.g.:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
Let's try:
df.apply(lambda x: x[:x.iloc[5]], axis=1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Or recreate your dataframe:
df=pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0
How do I count the number of unique strings in a rolling window of a pandas dataframe?
a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
Output, same as original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric, by factorize or by rank. The min_periods parameter is also necessary to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
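As an aside (a sketch, not from the original answers): you can also count the distinct strings directly, without converting to numeric first, by slicing trailing windows in a list comprehension and calling Series.nunique; this skips the fast rolling machinery, at some cost in speed:
import pandas as pd

a = pd.DataFrame(['a', 'b', 'a', 'a', 'b', 'c', 'd', 'e', 'e', 'e', 'e'])
window = 3
# For each position, count the distinct values in the trailing window
b = pd.Series([a[0].iloc[max(0, i - window + 1): i + 1].nunique()
               for i in range(len(a))])
print(b)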