Merging two dataframes based on index - python

I've been on this all night, and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
I have four fields: Employee ID, Name, Station, and Shift (ID is a non-null integer; the rest are strings or null).
I have about 10 dataframes, all indexed by ID, each containing only two columns: either (Name and Station) or (Name and Shift).
Now, of course, I want to combine all of this into one dataframe with a unique row for each ID.
But I'm really frustrated by it at this point (especially because I can't find a way to directly check how many unique indices my final dataframe ends up with).
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID: when I check in Excel, the indices look like Table1/1234, Table2/1234, etc. One row has the shift, the other has the station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe, with exactly one row per ID? Preferably without using 9 different merge statements, as I have to scale up later.

If I understand your question correctly, this is what you want.
For example, with these 3 dataframes:
In [1]: df1
Out[1]:
0 1 2
0 3.588843 3.566220 6.518865
1 7.585399 4.269357 4.781765
2 9.242681 7.228869 5.680521
3 3.600121 3.931781 4.616634
4 9.830029 9.177663 9.842953
5 2.738782 3.767870 0.925619
6 0.084544 6.677092 1.983105
7 5.229042 4.729659 8.638492
8 8.575547 6.453765 6.055660
9 4.386650 5.547295 8.475186
In [2]: df2
Out[2]:
0 1
0 95.013170 90.382886
2 1.317641 29.600709
4 89.908139 21.391058
6 31.233153 3.902560
8 17.186079 94.768480
In [3]: df
Out[3]:
0 1 2
0 0.777689 0.357484 0.753773
1 0.271929 0.571058 0.229887
2 0.417618 0.310950 0.450400
3 0.682350 0.364849 0.933218
4 0.738438 0.086243 0.397642
5 0.237481 0.051303 0.083431
6 0.543061 0.644624 0.288698
7 0.118142 0.536156 0.098139
8 0.892830 0.080694 0.084702
9 0.073194 0.462129 0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
0 1 2 0 1 2 0 1
0 0.777689 0.357484 0.753773 3.588843 3.566220 6.518865 95.013170 90.382886
1 0.271929 0.571058 0.229887 7.585399 4.269357 4.781765 NaN NaN
2 0.417618 0.310950 0.450400 9.242681 7.228869 5.680521 1.317641 29.600709
3 0.682350 0.364849 0.933218 3.600121 3.931781 4.616634 NaN NaN
4 0.738438 0.086243 0.397642 9.830029 9.177663 9.842953 89.908139 21.391058
5 0.237481 0.051303 0.083431 2.738782 3.767870 0.925619 NaN NaN
6 0.543061 0.644624 0.288698 0.084544 6.677092 1.983105 31.233153 3.902560
7 0.118142 0.536156 0.098139 5.229042 4.729659 8.638492 NaN NaN
8 0.892830 0.080694 0.084702 8.575547 6.453765 6.055660 17.186079 94.768480
9 0.073194 0.462129 0.015707 4.386650 5.547295 8.475186 NaN NaN
For more details you might want to see the documentation for pd.concat.
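For your actual layout, where each frame holds (Name, Station) or (Name, Shift), here is a minimal sketch, assuming dfs is a hypothetical stand-in for your list of ~10 frames, all indexed by ID. combine_first aligns on the index, takes the union of the columns, and keeps the first non-null value, so folding it across the list collapses everything to one row per ID:

import pandas as pd
from functools import reduce

# dfs is a hypothetical stand-in for your list of ~10 dataframes, all indexed by ID
result = reduce(lambda a, b: a.combine_first(b), dfs)

# quick checks that each ID ended up on exactly one row
print(result.index.is_unique)   # True if there is one row per ID
print(result.index.nunique())   # how many unique IDs you ended up with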
Just a tip: putting simple, illustrative data in your question always helps in getting an answer.

Related

Pandas reorder rows of dataframe

I stumbled upon a very peculiar problem in Pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 up to 7 and then repeats, and that the time column advances in sequential steps (which implies that the previous row's time should be smaller than or equal to the current one's).
I would like to reorder rows of the dataframe as it is below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result.
Please note that I need to reorder the rows of the dataframe based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the dataframe should be reordered so that the time column runs from smallest to largest, and the same holds for the rest of the columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. Moreover, I tried to use groupby, but I failed.
Could you help me solve this problem? Any suggestions are welcome.
P.S.
I have pasted the dataframe so it can be read easily with the clipboard function, in order to be easily reproducible. I am attaching a pic as well.
What did you try when sorting by several columns? For example:
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
How about sorting instead, with id first, then the grouping columns, and time last:
sort_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
df = df.sort_values(['id'] + sort_cols + ['time']).reset_index(drop=True)
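If you want to reproduce this from the CSV text pasted in the question, something like the following should work (a sketch; csv_text is a hypothetical stand-in for the CSV block above):

import pandas as pd
from io import StringIO

# csv_text stands in for the CSV block pasted in the question;
# index_col=0 absorbs the unnamed leading column
df = pd.read_csv(StringIO(csv_text), index_col=0)
sort_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
df = df.sort_values(['id'] + sort_cols + ['time']).reset_index(drop=True)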

Pandas: Dynamically replace NaN values with the average of previous and next non-missing values

I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).
If you look closely at the logic behind your average, it is a geometric progression, which has a closed form:
# running count of NaNs seen so far in each column (the n in the formula below)
s = df.isnull().cumsum()
# t1: the last non-null value before each column's gap (the x in the formula)
t1 = df[(s == 1).shift(-1).fillna(False)].stack().reset_index(level=0, drop=True)
# t2: the first non-null value after each column's gap (the y in the formula);
# note: DataFrame.lookup was removed in pandas 2.0, so this needs an older version
t2 = df.lookup(s.idxmax() + 1, s.idxmax().index)
# closed form of the recurrence: x/2**n + y*(1 - (1/2)**n)
df.fillna(t1 / 2**s + t2 * (1 - 0.5**s))
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation:
1st NaN: (x + y) / 2 = 1st
2nd NaN: (1st + y) / 2 = 2nd
3rd NaN: (2nd + y) / 2 = 3rd
Unrolling this gives x/(2**n) + y*(1 - (1/2)**n) for the nth NaN; this is the key.
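As a quick sanity check (not from the original answer), the closed form can be compared against the recurrence using the example values from column A above:

x, y = -0.120211, -0.788073  # last value before the gap, first value after it

a = x
for n in range(1, 3):
    a = (a + y) / 2                       # the recurrence: average of previous fill and next value
    closed = x / 2**n + y * (1 - 0.5**n)  # the closed form
    print(n, round(a, 6), round(closed, 6))  # both give ~-0.454142 at n=1 and ~-0.621108 at n=2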
Got a similar problem.
The following code worked for me:
import numpy as np
import pandas as pd

def fill_nan_with_mean_from_prev_and_next(df):
    # positions of the rows that contain at least one NaN
    nan_rows = np.where(df.isnull().any(axis=1))[0]
    null_df = df.isnull()
    for row in nan_rows:
        for col in range(df.shape[1]):
            if null_df.iloc[row, col]:
                # average of the values directly above and below
                df.iloc[row, col] = (df.iloc[row - 1, col] + df.iloc[row + 1, col]) / 2
    return df

Maybe it helps someone too.
As Ben.T has mentioned above, if you have another group of NaNs in the same column, you can consider this lazy solution :)
import numpy as np

for column in df:
    for ind, row in df[[column]].iterrows():
        if not np.isnan(row[column]):
            previous = row[column]
        else:
            # scan forward to the next non-NaN value in this column
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            nxt = df.loc[indx, column]
            # fill with the running average and carry it forward
            previous = df.loc[ind, column] = (previous + nxt) / 2

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, as this concept is going to be expanded to a dataset with hundreds of thousands of entries, and speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
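If you'd rather route the update through your add5 function explicitly, Series.mask gives the same result (a sketch; add5 here is a stand-in for your function):

def add5(s):
    # elementwise on a Series, so it is already vectorised
    return s + 5

# where the condition holds, take add5(...); elsewhere keep the original values
df1[3] = df1[2].mask(df1[2] > 3, add5(df1[2]))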

Drop pandas dataframe row based on max value of a column

I have a Dataframe like so:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 0.540616
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
How do I get rid of the fourth row, given that it has the max value of sq_resid? Note: the max will change from dataset to dataset, so just removing the 4th row isn't enough.
I have tried several things; for example, I can remove the max value itself, which leaves the dataframe as below, but I haven't been able to remove the whole row.
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 NaN
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
You could just filter the df like so:
In [255]:
df.loc[df['sq_resid']!=df['sq_resid'].max()]
Out[255]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
or drop using idxmax, which will return the row label of the max value:
In [257]:
df.drop(df['sq_resid'].idxmax())
Out[257]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
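One caveat (not from the original answer, but standard pandas behaviour): if the max value is tied across several rows, idxmax returns only the first matching label, so drop removes just that one row, whereas the boolean filter removes every tied row:
df.drop(df['sq_resid'].idxmax())                # drops only the first row holding the max
df.loc[df['sq_resid'] != df['sq_resid'].max()]  # drops all rows tied at the max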

Pandas variable creation using multiple If-else

Need help with Pandas multiple IF-ELSE statements. I have a test dataset (titanic) as follows:
ID Survived Pclass Name Sex Age
1 0 3 Braund male 22
2 1 1 Cumings, Mrs. female 38
3 1 3 Heikkinen, Miss. Laina female 26
4 1 1 Futrelle, Mrs. female 35
5 0 3 Allen, Mr. male 35
6 0 3 Moran, Mr. male
7 0 1 McCarthy, Mr. male 54
8 0 3 Palsson, Master male 2
where Id is the passenger id. I want to create a new flag variable in this data frame which has the following rule:
if Sex=="female" or (Pclass==1 and Age <18) then 1 else 0.
Now to do this, I tried a few approaches. This is how I approached it first:
df = pd.read_csv('data.csv')
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == 'female' or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        df['Prediction'] = 1
    else:
        df['Prediction'] = 0
The problem with the above code is that it creates a Prediction variable in df, but with all values as 0.
However, if I use the same code but instead write the output to a dictionary, it gives the right answer, as shown below:
prediction = {}
df = pd.read_csv('data.csv')
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == 'female' or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        prediction[passenger['ID']] = 1
    else:
        prediction[passenger['ID']] = 0
This gives a dict prediction with keys as ID and values as 1 or 0 based on the above logic.
So why does the df variable get filled in wrongly? I even tried defining a function first and then calling it; that gave the same answer as the first approach.
So, how can we do this in pandas?
Secondly, I guess the same can be done if we can just use multiple if-else statements. I know np.where, but it does not allow an 'and' condition. So here is what I was trying:
df['Prediction'] = np.where(df['Sex']=="female", 1, np.where((df['Pclass']==1 and df['Age']<18), 1, 0))
The above gave an error for the 'and' keyword in where.
So can someone help? Solutions with multiple approaches, using np.where (simple if-else like) or some function (applymap, etc.), or modifications to what I wrote earlier, would be really appreciated.
Also, how do we do the same using an applymap or apply/map method of df?
Instead of looping through the rows using df.iterrows (which is relatively slow), you can assign the desired values to the Prediction column in one assignment:
In [27]: df['Prediction'] = ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18))).astype('int')
In [29]: df['Prediction']
Out[29]:
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
Name: Prediction, dtype: int32
For your first approach, remember that df['Prediction'] represents an entire column of df, so df['Prediction']=1 assigns the value 1 to each row in that column. Since df['Prediction']=0 was the last assignment, the entire column ended up being filled with zeros.
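As a side note, if you did want the row-by-row version to work, assign to a single cell with .loc instead of rebinding the whole column on each iteration. A sketch of the corrected loop:
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == 'female' or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        # write only this row's cell, not the whole column
        df.loc[passenger_index, 'Prediction'] = 1
    else:
        df.loc[passenger_index, 'Prediction'] = 0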
For your second approach, note that you need to use & not and to perform an elementwise logical-and operation on two NumPy arrays or Pandas NDFrames. Thus, you could use
In [32]: np.where(df['Sex']=='female', 1, np.where((df['Pclass']==1)&(df['Age']<18), 1, 0))
Out[32]: array([0, 1, 1, 1, 0, 0, 0, 0])
though I think it is much simpler to just use | for logical-or and & for logical-and:
In [34]: ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18)))
Out[34]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
dtype: bool
