Using Pandas join to fill in columns - python

I have two DataFrames that roughly look like
(ID)  (Category)  (Value1)  (Value2)
111   1           5         7
112   1           3         8
113   2           6         9
114   3           2         6
and
(Category)  (Value1 Average for Category)  (Value2 Average for Category)
1           4                              5
2           6                              7
3           9                              2
Ultimately, I'd like to join the two DataFrames so that each ID's row also carries the average values for its category. I'm having trouble finding the right way to join/merge/etc. that will fill in columns by looking up the category in the other DataFrame. Does anyone have any idea where to start?

You are simply looking for a join; in pandas we use pd.merge for that, like the following:
df3 = pd.merge(df1, df2, on='Category')
    ID  Category  Value1  Value2  Value1 Average  Value2 Average
0  111         1       5       7               4               5
1  112         1       3       8               4               5
2  113         2       6       9               6               7
3  114         3       2       6               9               2
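For a self-contained run, here is a sketch that rebuilds the sample frames from the question (the shortened average column names are my assumption):
import pandas as pd

df1 = pd.DataFrame({'ID': [111, 112, 113, 114],
                    'Category': [1, 1, 2, 3],
                    'Value1': [5, 3, 6, 2],
                    'Value2': [7, 8, 9, 6]})
df2 = pd.DataFrame({'Category': [1, 2, 3],
                    'Value1 Average': [4, 6, 9],
                    'Value2 Average': [5, 7, 2]})

# how='left' keeps every ID even if its category has no average row
df3 = pd.merge(df1, df2, on='Category', how='left')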
Official documentation of pandas on merging:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Here is a good explanation on joins:
Pandas Merging 101

Just do:
df1.groupby('Category')[['Value1', 'Value2']].transform('mean')
on the first dataframe to get the per-category averages. (Note: grouping by ['ID', 'Category'] would put each row in its own group and simply return the original values. Also, this computes the averages from df1 itself; if df2's averages come from elsewhere, use the merge above.)
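A short usage sketch of that approach, assuming the averages really should be derived from df1 (the new column names are mine):
# transform('mean') broadcasts each category's mean back to every row of the group
df1[['Value1 Avg', 'Value2 Avg']] = (
    df1.groupby('Category')[['Value1', 'Value2']].transform('mean')
)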

Related

summing columns from different dataframes Pandas

I have 3 DataFrames, all with over 100 rows and 1000 columns. I am trying to combine them into one in such a way that common columns from each DataFrame are summed up. I understand there is a summation method, pd.DataFrame.sum(), but remember, I have over 1000 columns and cannot add each common column manually. I am attaching sample DataFrames and the result I want. Help will be appreciated.
#Sample DataFrames.
df_1 = pd.DataFrame({'a':[1,2,3],'b':[2,1,0],'c':[1,3,5]})
df_2 = pd.DataFrame({'a':[1,1,0],'b':[2,1,4],'c':[1,0,2],'d':[2,2,2]})
df_3 = pd.DataFrame({'a':[1,2,3],'c':[1,3,5], 'x':[2,3,4]})
#Result.
df_total = pd.DataFrame({'a':[3,5,6],'b':[4,2,4],'c':[3,6,12],'d':[2,2,2], 'x':[2,3,4]})
df_total
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Let us do pd.concat then sum over the repeated column labels:
out = pd.concat([df_1,df_2,df_3],axis=1).sum(level=0,axis=1)
Out[7]:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
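Note that sum(level=..., axis=...) was deprecated and later removed from pandas; on recent versions an equivalent sketch groups the duplicated column labels explicitly:
# Transpose so the duplicated labels become the index, group and sum, transpose back
out = pd.concat([df_1, df_2, df_3], axis=1).T.groupby(level=0).sum().T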
You can add with fill_value=0:
df_1.add(df_2, fill_value=0).add(df_3, fill_value=0).astype(int)
Output:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Note: pandas intrinsically aligns most operations along indexes (index and column headers).
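If there are more than a handful of frames, the same add pattern generalizes with functools.reduce (a sketch under the same sample data):
from functools import reduce

dfs = [df_1, df_2, df_3]
# fill_value=0 treats a missing column/row as zero; the result is float, hence astype
df_total = reduce(lambda a, b: a.add(b, fill_value=0), dfs).astype(int)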

Map two pandas dataframe and add a column to the first dataframe

I have posted two sample dataframes. I would like to map one column of a dataframe against the index of a column in another dataframe and place the values back into the first dataframe, as shown below.
A = np.array([0,1,1,3,5,2,5,4,2,0])
B = np.array([55,75,86,98,100,111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
Below is the first dataframe, df1:
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And below is the second dataframe, df2:
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
Below is the output needed (mapped with respect to the index of df2):
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would kindly like to know how one can achieve this using some pandas functions like map.
Looking forward to some answers. Many thanks in advance.
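A minimal sketch, assuming the lookup key is df2's index: when Series.map is given another Series, it matches the caller's values against that Series' index.
# Each value in 'data' is looked up as an index label in df2's column
df1['new_data'] = df1['data'].map(df2['values_for_replacement'])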

dataframe with two conditions on two different columns

I want to filter a dataframe based on two conditions on two different columns. In the example below, I want to keep only the rows whose uid has at least 2 rows with val greater than or equal to 4.
df = pd.DataFrame({'uid':[1,1,1,2,2,3,3,4,4,4],'iid':[11,12,13,12,13,13,14,14,11,12], 'val':[3,4,5,3,5,4,5,4,3,4]})
For this dataframe, my output should be
df
uid iid val
0 1 11 3
1 1 12 4
2 1 13 5
5 3 13 4
6 3 14 5
7 4 14 4
8 4 11 3
9 4 12 4
Here, I filtered out uid 2 because the number of rows with uid == 2 and val >= 4 is less than 2. I want to keep only the uids for which the number of rows with val greater than or equal to 4 is at least 2.
You need groupby.transform with sum on a mask checking where val is greater than or equal (ge) to 4, then check that the per-group total is ge 2 to use it as a boolean filter on df:
print (df[df['val'].ge(4).groupby(df['uid']).transform(sum).ge(2)])
uid iid val
0 1 11 3
1 1 12 4
2 1 13 5
5 3 13 4
6 3 14 5
7 4 14 4
8 4 11 3
9 4 12 4
EDIT: another way to avoid groupby.transform is to loc the rows where val is ge 4 and select the uid column, use value_counts on it, and get True where the count is ge 2. Then map back to the uid column to create the boolean filter on df. Same result and potentially faster:
df[df['uid'].map(df.loc[df['val'].ge(4), 'uid'].value_counts().ge(2))]
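A sketch wrapping that second approach with the thresholds as parameters (the function name is mine, not a pandas API):
def filter_by_group_count(df, group_col, mask, min_count=2):
    # Count qualifying rows per group, keep only groups meeting the threshold
    counts = df.loc[mask, group_col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[group_col].isin(keep)]

out = filter_by_group_count(df, 'uid', df['val'].ge(4))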

Pandas: compare one column's values to another dataframe's column, find matching rows

I am bringing a SQL table of events and alarms from a database into df1, and I have a txt file of alarm codes and properties (df2) to watch for. I want to cross-check each value of one column from df2 against an entire column in df1, and output the full rows of any matches into another dataframe, df3.
df1
     A   B  C  D
0  100  20  1  1
1  101  30  1  1
2  102  21  2  3
3  103  15  2  3
4  104  40  2  3
df2
    0  1    2    3    4
0  21  2    2    3    3
1  40  0  NaN  NaN  NaN
I want to output into df3 the entire rows from df1 whose column B matches any of df2's column 0 values.
df3
     A   B  C  D
0  102  21  2  3
1  104  40  2  3
I was able to get single results using:
df1[df1['B'] == df2.iloc[0,0]]
But I need something that will do this on a larger scale.
Method 1: merge
Use merge on B and '0', then select only the df1 columns:
df1.merge(df2, left_on='B', right_on='0')[df1.columns]
A B C D
0 102 21 2 3
1 104 40 2 3
Method 2: loc
Alternatively use loc to find rows in df1 where B has a match in df2 column 0 using .isin:
df1.loc[df1.B.isin(df2['0'])]
A B C D
2 102 21 2 3
4 104 40 2 3
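A self-contained sketch of the isin route; I assume the txt file's columns were read in as strings, hence the '0' label:
import pandas as pd

df1 = pd.DataFrame({'A': [100, 101, 102, 103, 104],
                    'B': [20, 30, 21, 15, 40],
                    'C': [1, 1, 2, 2, 2],
                    'D': [1, 1, 3, 3, 3]})
df2 = pd.DataFrame({'0': [21, 40], '1': [2, 0]})

# Keep df1 rows whose B value appears anywhere in df2's column '0'
df3 = df1.loc[df1['B'].isin(df2['0'])].reset_index(drop=True)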

pandas: unexpected join behavior results in NaN [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two dataframes that I'm trying to join in pandas (version 0.18.1).
test1 = pd.DataFrame({'id': range(1,6), 'place': ['Kent','Lenawee','Washtenaw','Berrien','Ottawa']})
   id      place
0   1       Kent
1   2    Lenawee
2   3  Washtenaw
3   4    Berrien
4   5     Ottawa
test2 = pd.DataFrame({'id_2': range(6,11), 'id_parent': range(1,6)})
id_2 id_parent
0 6 1
1 7 2
2 8 3
3 9 4
4 10 5
Yet when I join the two tables, the last row doesn't join properly and, because it's a left join, results in NaN.
df = test2.join(test1,on='id_parent',how='left')
   id_2  id_parent   id      place
0     6          1    2    Lenawee
1     7          2    3  Washtenaw
2     8          3    4    Berrien
3     9          4    5    Ottawa
4    10          5  NaN        NaN
This doesn't make sense to me: id_parent and id are the keys on which to join the two tables, and they both have the same values. Both columns have the same dtype (int64). What's going on here?
join joins primarily on indices; use merge for this:
In [18]:
test2.merge(test1,left_on='id_parent', right_on='id')
Out[18]:
id_2 id_parent id place
0 6 1 1 Kent
1 7 2 2 Lenawee
2 8 3 3 Washtenaw
3 9 4 4 Berrien
4 10 5 5 Ottawa
You get the NaN because join aligns the on column against the right-hand side's index; test1's index runs from 0 to 4, so id_parent value 5 has no matching entry and produces NaN (and every other row is shifted by one, since id_parent 1 matched index 1 rather than the row with id 1).
Here I quote the pandas documentation: "join takes an optional on argument which may be a column or multiple column names, which specifies that the passed DataFrame is to be aligned on that column in the DataFrame."
So in your case, you are matching the id_parent column of test2 against the index of test1.
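If you do want join, a sketch that makes it work is to move the key into the right frame's index first (note that the id column then disappears into the index):
# Align test1's index with the join key before calling join
df = test2.join(test1.set_index('id'), on='id_parent', how='left')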
