Ignore NaN elements in a list using loc pandas - python

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the values of df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this with the .loc indexer, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, then the indices get messed up (for example, in tdf['b'] index 0 should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?

Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
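For reference, here is a minimal end-to-end sketch of this approach, with the frames and lookback reconstructed from the question (the name tdf is the question's):
import pandas as pd

df1 = pd.DataFrame({'a': [10, 2, 3, 1, 7, 6]})
df2 = pd.DataFrame({'a': [1, 2, 4, 3, 20, 5]})

lookback = 3
# index of the max within each rolling window (NaN until the window is full)
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
# map looks up df2['a'] by index and propagates NaN, so alignment is preserved
tdf = pd.DataFrame({'a': s, 'b': s.map(df2['a'])})
print(tdf)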

Related

How to access common values from two or more columns?

I need to find the number of common values in a column wrt another column.
For example:
There are two columns X , Y.
X:
a
b
c
a
d
a
b
b
a
a
Y:
NaN
2
4
NaN
NaN
6
4
NaN
5
4
So how do I count the NaN values grouped by a, b, c, d?
For example:
a has 2 NaN values.
b has 1 NaN value.
Per my comment, I have transposed your dataframe with df.set_index(0).T to get the following starting point.
In[1]:
0 X Y
1 a NaN
2 b 2
3 c 4
4 a NaN
5 d NaN
6 a 6
7 b 4
8 b NaN
9 a 5
10 a 4
From there, you can filter for null values with .isnull(). Then, you can use .groupby('X').size() to return the count of null values per group:
df[df['Y'].isnull()].groupby('X').size()
X
a 2
b 1
d 1
dtype: int64
Or, you could use value_counts() to achieve the same thing:
df[df['Y'].isnull()]['X'].value_counts()
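For reference, a self-contained version of both approaches, with the sample data reconstructed from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': ['a', 'b', 'c', 'a', 'd', 'a', 'b', 'b', 'a', 'a'],
    'Y': [np.nan, 2, 4, np.nan, np.nan, 6, 4, np.nan, 5, 4],
})

# count the NaN values of Y per value of X
print(df[df['Y'].isnull()].groupby('X').size())
print(df[df['Y'].isnull()]['X'].value_counts())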

Division of multiple dimension data in pandas using groupby

Since pandas can't work in multi-dimensions, I usually stack the data row-wise and use a dummy column to mark the data dimensions. Now, I need to divide one dimension by another.
For example, given this dataframe where key define the dimensions
index key value
0 a 10
1 b 12
2 a 20
3 b 15
4 a 8
5 b 9
I want to achieve this:
index key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.33333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Is there a way to do it using groupby?
You don't really need (and should not use) groupby for this:
# interpolate the b values
s = df['value'].where(df['key'].eq('b')).bfill()
# mask the a values and divide
# change to df['key'].ne('b') if you have many values of a
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
Output:
index key value ratio
0 0 a 10 0.833333
1 1 b 12 NaN
2 2 a 20 1.333333
3 3 b 15 NaN
4 4 a 8 0.888889
5 5 b 9 NaN
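A runnable sketch of this answer, assuming strictly alternating a/b rows as in the question:
import pandas as pd

df = pd.DataFrame({'key': list('ababab'),
                   'value': [10, 12, 20, 15, 8, 9]})

# pull each b value back onto the a row above it
s = df['value'].where(df['key'].eq('b')).bfill()
# keep only the a values and divide by the aligned b values
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
print(df)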
Using eq, cumsum and GroupBy.apply with shift.
We use .eq to get a boolean mask where the value is a, then cumsum to build a unique identifier for each a, b pair.
Then we group by that identifier and divide each value by the value one row below, using shift.
s = df['key'].eq('a').cumsum()
df['ratio_a_b'] = df.groupby(s)['value'].apply(lambda x: x.div(x.shift(-1)))
Output
key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
This is what s returns, our unique identifier for each a,b pair:
print(s)
0 1
1 1
2 2
3 2
4 3
5 3
Name: key, dtype: int32
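The same pairing end to end, as a minimal sketch; group_keys=False is added so the apply result keeps the original index on newer pandas versions:
import pandas as pd

df = pd.DataFrame({'key': list('ababab'),
                   'value': [10, 12, 20, 15, 8, 9]})

# each a starts a new pair, so the cumulative sum labels the pairs 1, 2, 3, ...
s = df['key'].eq('a').cumsum()
df['ratio_a_b'] = (df.groupby(s, group_keys=False)['value']
                     .apply(lambda x: x.div(x.shift(-1))))
print(df)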

Fill missing data based on the other columns same data [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I want to use the keys of columns one and two: where the key pair matches and column three is not entirely NaN, impute the existing value of column three into the other rows with the same keys.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that key pair (1, 3) does not get any value, because no existing value is available for it.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill, which gives me a rather strange result where it forward fills column two instead. I am using this code for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group, which needs apply with a lambda:
df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .apply(lambda x: x.ffill().bfill()))
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if there are multiple values per group and you need to replace NaN by some statistic - e.g. the mean per group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
You can sort the data by the column with missing values, then groupby and forward fill (use sort_index() afterwards if you need the original row order back):
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
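A self-contained sketch of the ffill/bfill approach; on newer pandas, transform is a drop-in replacement for the apply call and always keeps the original index alignment:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1, 1, 1, 1, 1, 1, 1, 1],
    'two':   [1, 1, 1, 2, 2, 2, 3, 3],
    'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# fill forward then backward within each (one, two) group
df['three'] = (df.groupby(['one', 'two'], sort=False)['three']
                 .transform(lambda x: x.ffill().bfill()))
print(df)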

Get only two values from 4 specified columns and merge valid values into 2 columns

df:
index a b c d
-
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
df expect:
index one two
-
0 1 2
1 2 3
2 5 6
3 1 5
The output example above is self-explanatory: I just need to shift the two non-NaN values from columns [a, b, c, d] into another set of two columns ["one", "two"].
Back fill the missing values along the rows and select the first 2 columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
Or combine_first (pop already removes the columns, so no separate drop is needed):
df['two'] = df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns = ['index', 'one', 'two']
Or fillna:
df['two'] = df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns = ['index', 'one', 'two']
In both cases, print(df) gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output indexed like the first answer's, add the following (it works for both cases):
df=df.set_index('index')
And then print(df) gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0
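A self-contained version of the bfill approach, with the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 5, 1],
    'b': [2, np.nan, np.nan, np.nan],
    'c': [np.nan, 3, 6, np.nan],
    'd': [np.nan, np.nan, np.nan, 5],
})

# after back filling along each row, the first two columns hold the two non-NaN values
out = df.bfill(axis=1).iloc[:, :2].astype(int)
out.columns = ['one', 'two']
print(out)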

How to append two pandas.DataFrame with different numbers of columns

Directly appending two dataframes with different numbers of columns raises an error such as pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 242, saw 5. How can I avoid this error with pandas?
I have figured out one naive approach: preprocess the original data so that the numbers of columns are equal.
Can it be done more elegantly? I think the missing columns could be filled with np.nan after appending.
You should be able to concat the dataframes as shown; the missing column is filled with NaN.
You will need to rename the columns to suit your needs.
import pandas as pd

df1 = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[1,2,3,4]})
df2 = pd.DataFrame({'a':[1,2,3,4],'c':[1,2,3,4]})
df = pd.concat([df1, df2])
print('df1')
print(df1)
print('\ndf2')
print(df2)
print('\ndf')
print(df)
Output:
df1
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
df2
a c
0 1 1
1 2 2
2 3 3
3 4 4
df
a b c
0 1 1.0 1
1 2 2.0 2
2 3 3.0 3
3 4 4.0 4
0 1 NaN 1
1 2 NaN 2
2 3 NaN 3
3 4 NaN 4
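If the repeated 0-3 row labels are unwanted, pd.concat can renumber the result:
df = pd.concat([df1, df2], ignore_index=True)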
