pandas: unexpected join behavior results in NaN [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two dataframes that I'm trying to join in pandas (version 0.18.1).
test1 = pd.DataFrame({'id': range(1,6), 'place': ['Kent','Lenawee','Washtenaw','Berrien','Ottawa']})
   id      place
0   1       Kent
1   2    Lenawee
2   3  Washtenaw
3   4    Berrien
4   5     Ottawa
test2 = pd.DataFrame({'id_2': range(6,11), 'id_parent': range(1,6)})
id_2 id_parent
0 6 1
1 7 2
2 8 3
3 9 4
4 10 5
Yet when I join the two tables, the places shift by one row and the last row, because it's a left join, comes back as NaN.
df = test2.join(test1, on='id_parent', how='left')
   id_2  id_parent   id      place
0     6          1  2.0    Lenawee
1     7          2  3.0  Washtenaw
2     8          3  4.0    Berrien
3     9          4  5.0     Ottawa
4    10          5  NaN        NaN
This doesn't make sense to me: id_parent and id are the keys on which to join the two tables, and they hold the same values. Both columns have the same dtype (int64). What's going on here?

join joins primarily on indices; use merge for this:
In [18]:
test2.merge(test1, left_on='id_parent', right_on='id')
Out[18]:
id_2 id_parent id place
0 6 1 1 Kent
1 7 2 2 Lenawee
2 8 3 3 Washtenaw
3 9 4 4 Berrien
4 10 5 5 Ottawa
You get the NaN because join uses the rhs's index: test1's index runs 0-4, and id_parent = 5 has no matching index entry, so that row comes back as NaN.

Here I quote the pandas documentation: "join takes an optional on argument which may be a column or multiple column names, which specifies that the passed DataFrame is to be aligned on that column in the DataFrame."
So in your case, you are matching the id_parent column of test2 against the index of test1.
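If you want join itself to work here, a minimal sketch (using the column names from the question) is to move test1's key into its index first:

test2.join(test1.set_index('id'), on='id_parent', how='left')

Now id_parent is matched against test1's id values rather than against its default 0-4 index, and every row finds a partner.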

Pandas add dataframe to another row-wise by columns setting columns not available in the other as "nan" [duplicate]

This question already has answers here:
Append multiple pandas data frames at once
(5 answers)
Closed 1 year ago.
Say we have two dataframes, A with columns a,b,c and B with columns a,b,d and some values
A =
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
and B =
   a  b  d
0  1  2  3
1  4  5  6
2  7  8  9
Is there a pandas function which can combine the two so that
C = f(A,B) =
   a  b    c    d
0  1  2  3.0  NaN
1  4  5  6.0  NaN
2  7  8  9.0  NaN
3  1  2  NaN  3.0
4  4  5  NaN  6.0
5  7  8  NaN  9.0
In other words, a column that exists in one dataframe but not the other should be filled with NaN for the rows coming from the frame that lacks it, while the columns common to both keep their values. I've tried join, concat and merge, but it seems they don't work this way, or I've used them wrong. Does anyone have suggestions?
Use pd.concat([A, B], axis=0, ignore_index=True)
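For reference, a minimal runnable sketch (the values are taken from the tables above):

import pandas as pd

A = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
B = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'd': [3, 6, 9]})

# Stack the rows; columns missing from one frame are filled with NaN,
# and ignore_index=True gives the result a fresh 0..n-1 index.
C = pd.concat([A, B], axis=0, ignore_index=True)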

summing columns from different dataframes Pandas

I have 3 DataFrames, all with over 100 rows and 1000 columns. I am trying to combine them into one DataFrame in such a way that columns common to several DataFrames are summed up. I understand there is a summation method, pd.DataFrame.sum(), but remember, I have over 1000 columns and I cannot add each common column manually. I am attaching sample DataFrames and the result I want. Help will be appreciated.
#Sample DataFrames.
df_1 = pd.DataFrame({'a':[1,2,3],'b':[2,1,0],'c':[1,3,5]})
df_2 = pd.DataFrame({'a':[1,1,0],'b':[2,1,4],'c':[1,0,2],'d':[2,2,2]})
df_3 = pd.DataFrame({'a':[1,2,3],'c':[1,3,5], 'x':[2,3,4]})
#Result.
df_total = pd.DataFrame({'a':[3,5,6],'b':[4,2,4],'c':[3,6,12],'d':[2,2,2], 'x':[2,3,4]})
df_total
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Let us do pd.concat then sum:
out = pd.concat([df_1, df_2, df_3], axis=1).sum(level=0, axis=1)
Out[7]:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
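Note that sum(level=..., axis=1) only works on older pandas; it was removed in pandas 2.0. A sketch of an equivalent for recent versions (transposing so the grouping runs over the row axis):

out = pd.concat([df_1, df_2, df_3], axis=1)
out = out.T.groupby(level=0).sum().T  # collapse duplicate column names by summing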
You can add with fill_value=0:
df_1.add(df_2, fill_value=0).add(df_3, fill_value=0).astype(int)
Output:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Note: pandas intrinsically aligns most operations along indexes (index and column headers).
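If the frames sit in a list, the same add pattern generalizes without writing out each step, e.g. a sketch with functools.reduce:

from functools import reduce

dfs = [df_1, df_2, df_3]
df_total = reduce(lambda x, y: x.add(y, fill_value=0), dfs).astype(int)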

Using pandas extract regex with multiple groups

I am trying to extract a number from a pandas series of strings. For example consider this series:
s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])
0 a-b-1
1 a-b-2
2 c1-d-5
3 c1-d-9
4 e-10-f-1-3.xl
5 e-10-f-2-7.s
dtype: object
There are 6 rows, and three string formats/templates (known). The goal is to extract a number for each of the rows depending on the string. Here is what I came up with:
s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
and this correctly extracts the numbers that I want from each row:
0 1 2
0 1 NaN NaN
1 2 NaN NaN
2 NaN 5 NaN
3 NaN 9 NaN
4 NaN NaN 3
5 NaN NaN 7
However, since I have three groups in the regex, I get 3 columns, and here comes the question:
Can I write a regex with a single group that generates a single column, or do I need to coalesce the columns into one? If the latter, how can I do that without a loop?
Desired outcome would be a series like:
0 1
1 2
2 5
3 9
4 3
5 7
Simplest thing to do is bfill/ffill across the columns:
(s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
.bfill(axis=1)
[0]
)
Output:
0 1
1 2
2 5
3 9
4 3
5 7
Name: 0, dtype: object
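Note the extracted values are still strings (hence dtype: object); chain .astype(int) on the result if you need actual numbers.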
Another way is to use optional non-capturing group:
s.str.extract('(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])')
Output:
0
0 1
1 2
2 5
3 9
4 3
5 7
You could use a single capturing group at the end, and put the 3 prefixes in a non-capturing group (?:...).
As they all end with a hyphen, you can move that to after the non-capturing group to shorten it a bit:
(?:a-b|c1-d|e-10-f-[0-9])-([0-9])
s.str.extract('(?:a-b|c1-d|e-10-f-[0-9])-([0-9])')
Output:
0
0 1
1 2
2 5
3 9
4 3
5 7
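If you would rather get a Series than a one-column DataFrame, str.extract accepts expand=False when the pattern has a single group; a sketch (the astype is only needed if you want numbers rather than strings):

s.str.extract(r'(?:a-b|c1-d|e-10-f-[0-9])-([0-9])', expand=False).astype(int)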

Using Pandas join to fill in columns

I have two DataFrames that roughly look like
(ID) (Category) (Value1) (Value2)
111 1 5 7
112 1 3 8
113 2 6 9
114 3 2 6
and
(Category) (Value1 Average for Category) (Value2 Average for Category)
1 4 5
2 6 7
3 9 2
Ultimately, I'd like to join the two DataFrames so that each ID has the average values for its category in the same row. I'm having trouble finding the right way to join/merge/etc. that will fill in columns by looking up the category in the other DataFrame. Does anyone have any idea where to start?
You are simply looking for a join; in pandas we use pd.merge for that, like the following:
df3 = pd.merge(df1, df2, on='Category')
ID Category Value1 Value2 Value 1 Average Value 2 Average
0 111 1 5 7 4 5
1 112 1 3 8 4 5
2 113 2 6 9 6 7
3 114 3 2 6 9 2
Official documentation of pandas on merging:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Here is a good explanation on joins:
Pandas Merging 101
Just do:
df1.groupby('Category')[['Value1', 'Value2']].transform('mean')
on the first dataframe to get the per-category averages aligned with each row. Group by Category alone: grouping by ID as well would put every row in its own group and simply hand back the original values.

what is the difference between with or without .loc when using groupby + transform in Pandas

I am new to Python. Here is my question, and the behavior seems really weird to me.
A simple data frame looks like:
a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})
I need to group a1 by Hash, count how many rows are in each group, and then add a column to a1 holding that count. So, I want to use groupby + transform.
When I use:
a1['CustomerCount'] = a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is correct:
Card Hash CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 3 2 3
5 3 3 1
6 4 4 2
7 4 4 2
But when I use:
a1.loc[:, 'CustomerCount'] = a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is:
Card Hash CustomerCount
0 1 1 NaN
1 1 1 NaN
2 2 2 NaN
3 2 2 NaN
4 3 2 NaN
5 3 3 NaN
6 4 4 NaN
7 4 4 NaN
So, why does this happen?
As far as I know, loc and iloc (like a1.loc[:,'CustomerCount']) are preferred over plain bracket indexing (like a1['CustomerCount']), so loc and iloc are usually recommended. But why does this happen?
Also, I have tried loc and iloc many times to generate a new column in a data frame, and they usually work. So does this have something to do with groupby + transform?
The difference is how loc deals with assigning a DataFrame object to a single column. The groupby result here is a DataFrame whose only column is named Card; loc assignment aligns on both the index and the column names, and since 'CustomerCount' does not line up with 'Card', you get NaNs. Direct column assignment (a1['CustomerCount'] = ...) sees one column being assigned to another and just does it, ignoring the column name.
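To see the alignment at work, inspect what the transform actually returns (a quick sketch):

counts = a1.groupby('Hash').transform(lambda x: x.shape[0])
counts.columns  # Index(['Card'], dtype='object')
# a1.loc[:, 'CustomerCount'] = counts looks for a column named
# 'CustomerCount' in counts, finds none, and fills with NaN.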
Reduce to a single column
You can resolve this by reducing the result of the groupby operation to just one column, so there is no column name to mis-align.
a1.loc[:,'CustomerCount'] = a1.groupby(['Hash']).Card.transform('size')
a1
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
Rename the column
Don't really do this; the option above is far simpler.
a1.loc[:, 'CustomerCount'] = a1.groupby('Hash').transform(len).rename(
    columns={'Card': 'CustomerCount'})
a1
pd.factorize and np.bincount
What I'd actually do
import numpy as np

f, u = pd.factorize(a1.Hash)               # integer label per distinct Hash value
a1['CustomerCount'] = np.bincount(f)[f]    # count each label, broadcast back to rows
a1
Or inline making a copy
a1.assign(CustomerCount=(lambda f: np.bincount(f)[f])(pd.factorize(a1.Hash)[0]))
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
