Understanding how pandas join works - python

Can somebody please explain this result to me? In particular, I don't know where the NaNs come from in the result. Also, I don't know how the join will decide what row to match with what row in this case.
import pandas as pd

left_df = pd.DataFrame.from_dict({'unique_l': [0, 1, 2, 3, 4], 'join': ['a', 'a', 'b', 'b', 'c']})
right_df = pd.DataFrame.from_dict({'unique_r': [10, 11, 12, 13, 14], 'join': ['a', 'b', 'b', 'c', 'c']})
join unique_l
0 a 0
1 a 1
2 b 2
3 b 3
4 c 4
join unique_r
0 a 10
1 b 11
2 b 12
3 c 13
4 c 14
print(left_df.join(right_df, on='join', rsuffix='_r'))
join unique_l join_r unique_r
0 a 0 NaN NaN
1 a 1 NaN NaN
2 b 2 NaN NaN
3 b 3 NaN NaN
4 c 4 NaN NaN

The join method aligns on the right DataFrame's index by default: with on='join', the values in left_df['join'] ('a', 'b', 'c') are looked up in right_df's index, which is just the integers 0-4, so nothing matches and every right-hand column comes back as NaN. What you want here is merge:
In [6]: left_df.merge(right_df, on="join", suffixes=("_l", "_r"))
Out[6]:
join unique_l unique_r
0 a 0 10
1 a 1 10
2 b 2 11
3 b 2 12
4 b 3 11
5 b 3 12
6 c 4 13
7 c 4 14
Here is a related (but, IMO, not quite a duplicate) question that explains the difference between join and merge in more detail.
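If you do want to stick with join, a minimal sketch (using the left_df/right_df defined above) is to move the key into the right frame's index first, since that index is what join aligns on:

# move the key column into the right frame's index, then join on it
result = left_df.join(right_df.set_index('join'), on='join')
print(result)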

Related

Pandas interpolation adding rows by group with different ranges for each group

I am trying to add rows to a DataFrame by interpolating the values of one column within each group, filling all other columns with missing values. My data looks something like this:
import pandas as pd
import random

random.seed(42)
data = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
        'value': [1, 2, 5, 3, 4, 5, 7, 4, 7, 9],
        'other': random.sample(range(1, 100), 10)}
df = pd.DataFrame(data)
print(df)
group value other
0 a 1 82
1 a 2 15
2 a 5 4
3 b 3 95
4 b 4 36
5 b 5 32
6 b 7 29
7 c 4 18
8 c 7 14
9 c 9 87
What I am trying to achieve is something like this:
group value other
a 1 82
a 2 15
a 3 NaN
a 4 NaN
a 5 NaN
b 3 95
b 4 36
b 5 32
b 6 NaN
b 7 29
c 4 18
c 5 NaN
c 6 NaN
c 7 14
c 8 NaN
c 9 87
For example, group a has a range from 1 to 5, b from 3 to 7, and c from 4 to 9.
The issue I'm having is that each group has a different range. I found an approach that works if you assume a single range for all groups: use the global min and max and then drop the extra rows in each group. But since my data is fairly large, adding that many rows per group quickly becomes infeasible.
>>> import numpy as np
>>> (df.groupby('group')
...    .apply(lambda x: x.set_index('value')
...                      .reindex(np.arange(x['value'].min(), x['value'].max() + 1)))
...    .drop(columns='group')
...    .reset_index())
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0
We group on the group column and then re-index each group with the range from the min to the max of the value column.
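If groupby.apply gets slow on larger data, a hedged alternative sketch (same idea, assuming the df defined in the question) builds each group's range explicitly and concatenates the pieces:

import numpy as np

pieces = []
for name, grp in df.groupby('group'):
    # full value range for this group only, so no global min/max is needed
    full_range = pd.Index(np.arange(grp['value'].min(), grp['value'].max() + 1), name='value')
    piece = grp.set_index('value').reindex(full_range)
    piece['group'] = name            # restore the group label on the new rows
    pieces.append(piece.reset_index())
result = pd.concat(pieces, ignore_index=True)[['group', 'value', 'other']]
print(result)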
One option is the complete function from pyjanitor, which is helpful for exposing explicitly missing rows (and for abstracting away the reshaping process):
# pip install pyjanitor
import pandas as pd
import janitor
new_value = {'value' : lambda df: range(df.min(), df.max()+1)}
# expose the missing values per group via the `by` parameter
df.complete(new_value, by='group', sort = True)
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0

Drop specific column and indexes in pandas DataFrame

DataFrame:
A B C
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
Is it possible to drop the values from index 2 to 4 in column B, or to replace them with NaN?
In this case the values [8, 9, 10] should be removed.
I tried df.drop(columns=['B'], index=[8, 9, 10]), but then the whole column B is removed.
Dropping individual values does not really make sense in a DataFrame, because every column must keep the same length. You can set the values to NaN instead, using .loc / .iloc to select the rows and column:
>>> df
A B C
a 1 6 11
b 2 7 12
c 3 8 13
d 4 9 14
e 5 10 15
import numpy as np
# By name:
df.loc['c':'e', 'B'] = np.nan
# By number (column B is at position 1):
df.iloc[2:5, 1] = np.nan
Read the Indexing and selecting data section of the docs carefully.
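Applied to the question's original frame (default integer index 0 to 4), the same label-based idea looks like this; a small sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, 8, 9, 10],
                   'C': [11, 12, 13, 14, 15]})
# with a default RangeIndex, .loc[2:4] covers the labels 2, 3 and 4
df.loc[2:4, 'B'] = np.nan
print(df)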
import pandas as pd

data = [
    ['A', 'B', 'C'],
    [1, 6, 11],
    [2, 7, 12],
    [3, 8, 13],
    [4, 9, 14],
    [5, 10, 15],
]
df = pd.DataFrame(data=data[1:], columns=data[0])
# shift B down by 3: the values 8, 9 and 10 drop off and the first three rows become NaN
df['B'] = df['B'].shift(3)
print(df)
A B C
0 1 NaN 11
1 2 NaN 12
2 3 NaN 13
3 4 6.0 14
4 5 7.0 15

How to set a selection of values in a pandas DataFrame with a selection from another pd.DataFrame using labels

Assume 2 DataFrames:
import pandas as pd

df = pd.DataFrame(columns=['A','B','C'], data=[[0,1,2],[3,4,5],[6,7,8]])
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df1 = pd.DataFrame(columns=['A','B','C'], data=[[0,1,2],[3,4,5],[6,7,8]], index=[3,4,5])
df1
A B C
3 0 1 2
4 3 4 5
5 6 7 8
I want to set part of df using values from df1 according to column labels, without changing anything in df except the selected fields. The desired result would be:
df
A B C
0 4 5 2
1 7 8 5
2 6 7 8
I tried:
df.loc[df['A'].isin([0,3]), ['A', 'B']] = df1.loc[df1['A'].isin([7,6]), ['B', 'C']]
But the result is:
A B C
0 NaN NaN 2.0
1 NaN NaN 5.0
2 6.0 7.0 8.0
I guess this is because it still requires the indices to match. This feels like a pretty basic task, so I'm wondering if there is a simple way of doing it.
I also looked into merge and join, but those functions seem to serve a different purpose.
One possible solution, if both selections have the same number of rows and columns, is to convert the right-hand side to a numpy array so that no index alignment takes place:
df.loc[df['A'].isin([0, 3]), ['A', 'B']] = df1.loc[df1['A'].isin([3, 6]), ['B', 'C']].to_numpy()
print(df)
A B C
0 4 5 2
1 7 8 5
2 6 7 8
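An alternative sketch (hedged, assuming a fresh df/df1 as defined in the question) keeps pandas' label alignment but makes the labels match explicitly, which also illustrates why the original attempt produced NaN:

rhs = df1.loc[df1['A'].isin([3, 6]), ['B', 'C']]
rhs.index = df.index[df['A'].isin([0, 3])]   # align the row labels with the target rows
rhs.columns = ['A', 'B']                     # align the column labels with the target columns
df.loc[df['A'].isin([0, 3]), ['A', 'B']] = rhs
print(df)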

Get Index of matching string from Two dataframe

I have two DataFrames. I need to look up each string in DataFrame 1 to find which row of DataFrame 2 it matches, and replace the string with that index.
So I want a third DataFrame containing, for every cell of DataFrame 1, the index of the matching string in DataFrame 2.
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3), columns=['a','b','c'])
a b c
0 A B C
1 D AA AB
2 AC AD BA
3 BB BC AD
Y = pd.DataFrame(np.array(['A','AA','AC','D','B','AB','C','AD','BC','BB']).reshape(10,1),columns=['X'])
X
0 A
1 AA
2 AC
3 D
4 B
5 AB
6 C
7 AD
8 BC
9 BB
Resulting DataFrame:
a b c
0 0 4 6
1 3 1 5
2 2 7 NA
3 9 8 7
Someone suggested the following code to me, but it does not seem to work:
t = pd.merge(df1.stack().reset_index(), df2.reset_index(), left_on = 0, right_on = "0")
res = t.set_index(["level_0", "level_1"]).drop([0, "0"], axis=1).unstack()
print(res)
Use apply with map:
Y = Y.reset_index().set_index('X')['index']
X = X.apply(lambda x: x.map(Y))
print(X)
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0
Step 1: create a mapping from Y:
mapping = {value: key for key, value in Y.T.to_dict("records")[0].items()}
mapping
{'A': 0,
'AA': 1,
'AC': 2,
'D': 3,
'B': 4,
'AB': 5,
'C': 6,
'AD': 7,
'BC': 8,
'BB': 9}
Step 2: stack X, map the mapping onto the stacked Series, and unstack to get back to the original shape:
X.stack().map(mapping).unstack()
a b c
0 0.0 4.0 6.0
1 3.0 1.0 5.0
2 2.0 7.0 NaN
3 9.0 8.0 7.0
Alternatively, you can avoid the stack/unstack step and use replace together with pd.to_numeric:
X.replace(mapping).apply(pd.to_numeric, errors="coerce")
No tests done, just my gut feeling that mapping should be faster.
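Since no timings were given, a quick hedged way to check (assuming the X and mapping objects defined above) is to time both expressions; no particular result is claimed here:

import timeit

# time the stack/map/unstack approach against replace + to_numeric
t_map = timeit.timeit(lambda: X.stack().map(mapping).unstack(), number=1000)
t_replace = timeit.timeit(lambda: X.replace(mapping).apply(pd.to_numeric, errors="coerce"), number=1000)
print(f"stack/map/unstack: {t_map:.3f}s  replace/to_numeric: {t_replace:.3f}s")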
Short solution based on applymap:
X.applymap(lambda x: Y[Y.X==x].index.max())
result:
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0
Y = pd.Series(Y.index, index=Y.X).sort_index()
will give you a more easily searchable object... then something like
flat = X.to_numpy().flatten()
Y = Y.reindex(np.unique(flat))  # all items need to be in the index to be able to use .loc with a list
res = pd.DataFrame(Y.loc[flat].to_numpy().reshape(X.shape), columns=X.columns)
Let us do
X = X.where(X.isin(Y.X.tolist())).replace(dict(zip(Y.X,Y.index)))
Out[15]:
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0
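In every variant above the NA entries force the mapped columns to float. A small hedged follow-up (assuming a recent pandas version and the X/mapping defined above) converts them to pandas' nullable integer dtype so the matched indices stay whole numbers:

# NaN forces float64, so convert to nullable Int64; missing matches become <NA>
res = X.stack().map(mapping).unstack()
res = res.astype('Int64')
print(res)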

possible bug in pandas sort with NaN values

If I make a dataframe like the following:
In [128]: test = pd.DataFrame({'a':[1,4,2,7,3,6], 'b':[2,2,2,1,1,1], 'c':[2,6,np.NaN, np.NaN, 1, np.NaN]})
In [129]: test
Out[129]:
a b c
0 1 2 2
1 4 2 6
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
Basic sorts perform as expected. Sorting on column c appropriately segregates the NaN values, and doing a multi-level sort on columns b and a orders them as expected:
In [133]: test.sort(columns='c', ascending=False)
Out[133]:
a b c
5 6 1 NaN
3 7 1 NaN
2 2 2 NaN
1 4 2 6
0 1 2 2
4 3 1 1
In [134]: test.sort(columns=['b', 'a'], ascending=False)
Out[134]:
a b c
1 4 2 6
2 2 2 NaN
0 1 2 2
3 7 1 NaN
5 6 1 NaN
4 3 1 1
But doing a multi-level sort with columns b and c does not give the expected result:
In [135]: test.sort(columns=['b', 'c'], ascending=False)
Out[135]:
a b c
1 4 2 6
0 1 2 2
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
And, in fact, even sorting just on column c but using the multi-level sort nomenclature fails:
In [136]: test.sort(columns=['c'], ascending=False)
Out[136]:
a b c
1 4 2 6
0 1 2 2
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
I would have thought this should give exactly the same result as In [133] above. Is this a pandas bug, or is there something I'm not getting? (FYI: pandas v0.11.0, numpy v1.7.1, Python 2.7.2.5 32-bit on Windows 7.)
This is an interesting corner case. Note that even vanilla Python doesn't get this "correct":
>>> nan = float('nan')
>>> a = [ 6, 2, nan, nan, 1, nan]
>>> sorted(a)
[2, 6, nan, nan, 1, nan]
The reason is that NaN compares as neither greater nor less than the other elements, so there is no strict ordering defined. Because of this, Python leaves them where they are.
>>> nan > 6
False
>>> nan < 6
False
Pandas must make an explicit check in the single-column case -- probably using np.argsort or np.sort, since starting with numpy 1.4, np.sort puts NaN values at the end.
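A quick illustration of that NumPy behaviour (a minimal sketch, assuming numpy is imported as np):

import numpy as np

a = np.array([6., 2., np.nan, np.nan, 1., np.nan])
print(np.sort(a))   # [ 1.  2.  6. nan nan nan] -- the NaNs are pushed to the end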
Thanks for the heads up above. I guess this is already a known issue. One stopgap solution I came up with is:
test['c2'] = test.c.fillna(value=test.c.min() - 1)
test = test.sort(['b', 'c2'])
test = test.drop('c2', axis=1)
This method wouldn't work in regular numpy since .min() would return nan, but in pandas it works fine.
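For what it's worth, in modern pandas the deprecated DataFrame.sort was replaced by sort_values, which sorts NaN keys consistently and exposes an na_position parameter; a minimal sketch, assuming a current pandas version:

import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [1, 4, 2, 7, 3, 6],
                     'b': [2, 2, 2, 1, 1, 1],
                     'c': [2, 6, np.nan, np.nan, 1, np.nan]})
# NaN sort keys are grouped together; na_position controls where they land
print(test.sort_values(['b', 'c'], ascending=False, na_position='first'))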
