DataFrame merge on column gives NaN - python

I have two DataFrames. The first, df:
indegree  interrupts  Subject
1         2           Weather
2         3           Weather
4         5           Weather
And the second, join:
Subject  interrupts_mean  indegree_mean
weather  2                3
But the second is a lot shorter, since it holds the means of all the different subjects from the first dataframe.
When I merge both DataFrames with
pd.merge(df, join, left_index=True, right_index=True, how='left')
it merges, but the columns that come from the second DataFrame are all NaN in the result. I suppose this is because the DataFrames are not the same length. How can I merge on Subject instead, so that the values from the second DataFrame are repeated for every matching row in the new DataFrame?
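One way to do this (a minimal sketch based only on the sample data above, not an answer from the original thread) is to merge on the Subject column rather than on the index. The keys must match exactly, so the capitalised 'Weather' in df and the lower-case 'weather' in join would first need the same casing:
import pandas as pd

df = pd.DataFrame({'indegree': [1, 2, 4],
                   'interrupts': [2, 3, 5],
                   'Subject': ['Weather', 'Weather', 'Weather']})
# The post shows 'weather' in lower case; it is capitalised here so the keys line up
join = pd.DataFrame({'Subject': ['Weather'],
                     'interrupts_mean': [2],
                     'indegree_mean': [3]})

# Merge on the shared key column; every df row with that Subject receives the means
merged = pd.merge(df, join, on='Subject', how='left')
print(merged)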

Related

Copying (assembling) the column from smaller data frames into the bigger data frame with pandas

I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column to the big data frame (all participants) based on secondary data frames (each covering a partial list of participants).
When I merge a couple of times (merging each new data frame into the existing one), it creates duplicates of the column instead of a single column.
As the dataframes differ in size, I cannot compare them directly.
I tried
# df1 - main, bigger dataframe; df2 - smaller dataframe containing one group of df1
for i in range(len(df1)):
    # checking indices to place the data with the correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly though. Can you please help with the correct way of assembling the column?
Update: where the index of the initial dataframe and the "index" column of the calculation dataframe are the same, copy the rate value from the calculation into the main df.
main dataframe df1
index  rate
1      0
2      0
3      0
4      0
5      0
6      0
dataframe with calculated values
index  rate
1      value
4      value
6      value
output df
index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
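For reference, a small end-to-end sketch of that answer using the sample frames from the question (the 'value' cells are stand-ins, replaced here with 0.5, 0.7 and 0.9):
import pandas as pd

df1 = pd.DataFrame({'rate': [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])
df2 = pd.DataFrame({'index': [1, 4, 6], 'rate': [0.5, 0.7, 0.9]}).set_index('index')

# Align on the index; take df2's rate where it exists, otherwise keep df1's
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
print(df[["rate"]])  # rows 1, 4 and 6 get the calculated rate, the rest keep 0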

Python join dataframes by index

I'm working with multiple dataframes in Python and I'm looking to map one onto the other based on a common column (similar to INDEX/MATCH in Excel). I want to join the date column of one dataframe to the index of the other dataframe (where the date is stored as the index). How would I call out the index? For reference, I want to subtract the ROI of dataframe 1 (S&P 500) from the ROI of dataframe 2 (awk_price). The dataframes are shown below.
I currently have a merged dataframe using
pd.merge(awk_price,sp_500, left_index=True, right_on='Date')
I would love to just add a column to df2 that subtracts the ROI of dataframe 1 from the ROI of dataframe 2, but I can't figure out how to "map" the Date column from dataframe 1 to the index of dataframe 2.
Dataframe 2 (awk_price)
Dataframe 1 (sp_500)
You can use reset_index(), and then rename the column:
df = df1.reset_index().rename(columns={"index": "Date"})
df
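Building on that answer, a hedged sketch of the original goal: move the date index of awk_price into a Date column, merge with sp_500 on Date, and take the difference of the two ROI columns. The 'ROI' column names and the default 'index' label after reset_index are assumptions here, since the actual frames are only shown as images in the post:
import pandas as pd

# Assumption: awk_price has its dates as the index, sp_500 has a 'Date' column,
# and both carry an 'ROI' column (placeholder names, not confirmed by the post).
awk_reset = awk_price.reset_index().rename(columns={"index": "Date"})
merged = pd.merge(awk_reset, sp_500, on="Date", suffixes=("_awk", "_sp"))
merged["ROI_diff"] = merged["ROI_awk"] - merged["ROI_sp"]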

Find Mismatched Data Between Two Different Dataframe Columns

I am trying to compare two columns, both from different dataframes. The issue is that these columns are not in the same order.
Given the two dataframes below:
df1
index Letter
1 C,D,X
2 E,F
3 A,B
df2
index Letter
1 A
2 C,D
3 F
I want to identify all data that these two dataframes do not have in common. For example, one dataframe has 'A,B' while the other one has only 'A'. I am trying to flag the missing 'B' in df2. Would the best approach be to split each row of 'Letter' into a list of lists and then create a dictionary to compare?
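The extract above shows no answer for this question; one possible approach (an assumption on my part) is to split the comma-separated cells and compare the overall sets of letters, since the rows are not in the same order:
import pandas as pd

df1 = pd.DataFrame({'Letter': ['C,D,X', 'E,F', 'A,B']}, index=[1, 2, 3])
df2 = pd.DataFrame({'Letter': ['A', 'C,D', 'F']}, index=[1, 2, 3])

# Split the comma-separated cells and flatten each frame into one set of letters
letters1 = set(df1['Letter'].str.split(',').explode())
letters2 = set(df2['Letter'].str.split(',').explode())

print(letters1 - letters2)  # in df1 but missing from df2: {'B', 'E', 'X'}
print(letters2 - letters1)  # in df2 but missing from df1: set()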

assign value to pandas column based on data in another dataframe

I have 2 dataframes
df1
ID ID2 NUMBER
1 2 null
df2
ID ID2 NUMBER
1 2 1
1 2 2
1 2 3
So when merging df1 and df2 using ID and ID2 I get duplicated rows, because df1 has 3 matches in df2. I'd like to assign a random number to df1 and use it for merging, so that I always get a 1-to-1 merge.
The problem is that my dataset is rather big, and sometimes I have only 1 row in df2 (so the merge works properly) and sometimes I have 10+ rows in df2. I'd like to assign a number to df1 using something like:
rand(1, len(df1[(df1.ID == 1) & (df1.ID2 == 2)]))
I think I found a solution. I'm posting it here so others can tell me if there is a better way.
import random

def select_random_row(grp):
    ID = grp.ID.iloc[0]
    ID2 = grp.ID2.iloc[0]
    return random.randint(1, len(df1[(df1.ID == ID) & (df1.ID2 == ID2)]))

df2['g'] = df2.groupby(['ID', 'ID2']).apply(select_random_row)
EDIT:
This is way too slow on a large dataset... I decided to just use drop_duplicates before merging and keep the first record. It isn't random like I wanted, but it is better than nothing.
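As a side note (my suggestion, not part of the original post), newer pandas versions (1.1 and later) can pick one random row per group without a slow Python-level apply, which keeps the merge 1-to-1 while staying random:
# Keep one randomly chosen row per (ID, ID2) group, then merge 1-to-1
df2_one = df2.groupby(['ID', 'ID2']).sample(n=1)
merged = df1.merge(df2_one, on=['ID', 'ID2'], how='left')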

How to copy one DataFrame column into another DataFrame if their index values are the same

After creating a DataFrame with some duplicated cell values in the column named 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames which are the consolidated versions of the original DataFrame df. Those newly created DataFrames will have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys', axis=0).sum().reset_index()
df_mean = df.groupby('keys', axis=0).mean().reset_index()
As you can see, the df_sum['values'] cells were all summed together, while the df_mean['values'] cells were averaged with the mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The Photoshopped image below illustrates the dataframe I would like to create, with both 'sums' and 'means' columns merged into a single DataFrame:
There are several ways to do this. Using the DataFrame's merge method is the most efficient.
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
   keys  sums  means
0     1     1    1.0
1     2     5    2.5
2     3    22    5.5
I think pandas.merge() is the function you are looking for, e.g. pd.merge(df_sum, df_mean, on="keys"). Besides, the same result can also be produced with a single agg call on the original df:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
#    keys  sum  mean
# 0     1    1   1.0
# 1     2    5   2.5
# 2     3   22   5.5
