I have a logic problem to solve. I have two dataframes.
Dataframe_one has the following columns:
[Id, workflowprofile_A, workflow_profile_B, important_string_info]
Dataframe_two has the following columns:
[workflowprofile, option, workflow]
My problem is that workflowprofile from Dataframe_two can match workflowprofile_A or workflow_profile_B (or both) from Dataframe_one. How can I get a merged dataframe whose columns look like this?
dataframe_three:
[Id, workflowprofile_A, workflowprofile_fromA, option_fromA, workflow_fromA, important_string_info_fromA, workflow_profile_B, workflowprofile_fromB, option_fromB, workflow_fromB, important_string_info_fromB]
You can create a new column with fillna or combine_first, because one of the two values is always NaN, and then merge on that column:
df1['workflowprofile'] = df1['workflowprofile_A'].fillna(df1['workflow_profile_B'])
#alternative
#df1['workflowprofile'] = df1['workflowprofile_A'].combine_first(df1['workflow_profile_B'])
df3 = pd.merge(df1, df2, on='workflowprofile')
Sample:
print (df1)
Id workflowprofile_A workflow_profile_B important_string_info
0 1 7.0 NaN 8
1 2 NaN 5.0 1
print (df2)
workflowprofile option workflow
0 7 0 0
1 5 9 0
2 7 0 0
3 4 1 2
df1['workflowprofile'] = df1['workflowprofile_A'].fillna(df1['workflow_profile_B'])
df3 = pd.merge(df1, df2, on='workflowprofile')
print (df3)
Id workflowprofile_A workflow_profile_B important_string_info \
0 1 7.0 NaN 8
1 1 7.0 NaN 8
2 2 NaN 5.0 1
workflowprofile option workflow
0 7 0 0
1 7 0 0
2 5 9 0
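If you also need the separate _fromA and _fromB columns from the question, a minimal sketch (my addition, not part of the original answer, assuming the original df1/df2 above) is to merge df2 twice with add_suffix, once per profile column:
df3 = (df1
       .merge(df2.add_suffix('_fromA'), how='left',
              left_on='workflowprofile_A', right_on='workflowprofile_fromA')
       .merge(df2.add_suffix('_fromB'), how='left',
              left_on='workflow_profile_B', right_on='workflowprofile_fromB'))
important_string_info stays a single column here, since it comes from df1 and is not duplicated by the merges.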
I have a data frame that has the form:
index predicted
1 1
2 1
3 0
4 0
5 1
And another that has the form:
index actual
2 1
4 0
I want the data frame:
index predicted actual
1 1 nan
2 1 1
3 0 nan
4 0 0
5 1 nan
I've tried pd.concat([df1,df2], on="index", how="left") and pd.merge(df1, df2, axis=1)
Both give the dataframe:
index predicted actual
1 1 1
2 1 0
3 0 nan
4 0 nan
5 1 nan
How can I get the data frame I need? Thanks in advance.
You can use pd.merge(), setting the parameters left_index=True and right_index=True:
import pandas as pd
df1 = pd.DataFrame({'predicted': [1,1,0,0,1]}, index = (1,2,3,4,5))
df2 = pd.DataFrame({'actual': [1,0]}, index = (2,4))
pd.merge(df1, df2, how = 'left', left_index=True, right_index=True)
This merges the two dataframes on their indexes and produces the required result:
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
If you make sure that your index column is actually the df.index, pd.concat should work:
import pandas as pd
left = pd.DataFrame({"predicted": [1, 1, 0, 0, 1]}, index=[1, 2, 3, 4, 5])
right = pd.DataFrame({"actual": [1, 0]}, index=[2, 4])
out = pd.concat([left, right], axis=1)
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
If they're just columns instead, e.g. after:
left = left.reset_index(names="index")
right = right.reset_index(names="index")
then you can use:
left.merge(right, on="index", how="left")
index predicted actual
0 1 1 NaN
1 2 1 1.0
2 3 0 NaN
3 4 0 0.0
4 5 1 NaN
Turn the index into a temporary column, left join on that column, then set it back as the index:
predict_df = pd.DataFrame({'predicted': [1,1,0,0,1]}, index=range(1,6))
actual_df = pd.DataFrame({'actual': [1,0]}, index=[2,4])
pd.merge(
    left=predict_df.reset_index(),
    right=actual_df.reset_index(),
    how='left',
    on='index'
).set_index('index')
predicted actual
index
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
Or simply use join, which aligns on the index and performs a left join by default:
df1.join(df2)
Output:
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
I am curious why a simple concatenation of two dataframes in pandas:
initId.shape # (66441, 1)
initId.isnull().sum() # 0
ypred.shape # (66441, 1)
ypred.isnull().sum() # 0
of the same shape and both without NaN values
foo = pd.concat([initId, ypred], join='outer', axis=1)
foo.shape # (83384, 2)
foo.isnull().sum() # 16943
can result in a lot of NaN values if joined.
How can I fix this problem and prevent NaN values being introduced?
Trying to reproduce it like
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
pd.concat([aaa, bbb], axis=1)
failed, i.e. it worked just fine and no NaN values were introduced.
I think the problem is different index values; where concat cannot align them, you get NaN:
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
print(aaa)
prediction
4 0
5 1
8 0
7 1
10 0
12 0
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 NaN 0.0
1 NaN 0.0
2 NaN 1.0
3 NaN 0.0
4 0.0 1.0
5 1.0 1.0
7 1.0 NaN
8 0.0 NaN
10 0.0 NaN
12 0.0 NaN
The solution is reset_index, if the index values are not needed:
aaa.reset_index(drop=True, inplace=True)
bbb.reset_index(drop=True, inplace=True)
print(aaa)
prediction
0 0
1 1
2 0
3 1
4 0
5 0
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 0 0
1 1 0
2 0 1
3 1 0
4 0 1
5 0 1
EDIT: If you need the same index as aaa and the DataFrames have the same length, use:
bbb.index = aaa.index
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
4 0 0
5 1 0
8 0 1
7 1 0
10 0 1
12 0 1
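To spot the mismatch before concatenating, you can compare the indexes directly (a quick diagnostic sketch of mine, using aaa/bbb as originally defined, before the EDIT):
print(aaa.index.equals(bbb.index))               # False: the labels differ
print(aaa.index.symmetric_difference(bbb.index)) # labels present in only one frame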
You can do something like this:
concatenated_dataframes = pd.concat(
    [
        dataframe_1.reset_index(drop=True),
        dataframe_2.reset_index(drop=True),
        dataframe_3.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)
concatenated_dataframes_columns = [
    list(dataframe_1.columns),
    list(dataframe_2.columns),
    list(dataframe_3.columns),
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
This concatenates multiple DataFrames while keeping the column names and avoiding NaN.
As jezrael pointed out, this is due to different index labels. concat matches on index, so if they are not the same, this problem will occur. For a straightforward horizontal concatenation, you must "coerce" the index labels to be the same. One way is via the set_axis method, which makes the second dataframe's index the same as the first's:
joined_df = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
or just reset the index of both frames
joined_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
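For illustration (my example, reusing jezrael's aaa/bbb frames from above, not part of this answer), the set_axis variant reproduces the output of the EDIT:
joined_df = pd.concat([aaa, bbb.set_axis(aaa.index)], axis=1)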
I have two dataframes; for each id in df1 I need to pick the rows from df2 with the same id, keep those where df2.ApplicationDate < df1.ApplicationDate, and count how many such rows exist.
Here is how I am doing it currently:
counts = []
for i, row in df1.iterrows():
    count = len(df2[(df2['PersonalId'] == row['PersonalId'])
                    & (df2['ApplicationDate'] < row['ApplicationDate'])])
    counts.append(count)
This approach works, but it's hellishly slow on large dataframes. Is there any way to accelerate it?
Edit: added sample input with expected output
df1:
Id ApplicationDate
0 1 5-12-20
1 2 6-12-20
2 3 7-12-20
3 4 8-12-20
4 5 9-12-20
5 6 10-12-20
df2:
Id ApplicationDate
0 1 4-11-20
1 1 4-12-20
2 3 5-12-20
3 3 8-12-20
4 5 1-12-20
expected counts for each id: [2, 0, 1, 0, 1, 0]
You can left join both tables:
df3 = df1.merge(df2, on='Id', how='left')
Result:
Id ApplicationDate_x ApplicationDate_y
0 1 2020-05-12 2020-04-11
1 1 2020-05-12 2020-04-12
2 2 2020-06-12 NaT
3 3 2020-07-12 2020-05-12
4 3 2020-07-12 2020-08-12
5 4 2020-08-12 NaT
6 5 2020-09-12 2020-01-12
7 6 2020-10-12 NaT
Then you can compare dates, group by 'Id' and count True values per group:
df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()
Result:
Id
1 2
2 0
3 1
4 0
5 1
6 0
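A self-contained version of this answer (my sketch; it assumes the sample dates parse as month-day-year, which matches the result table above):
import pandas as pd

df1 = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'ApplicationDate': pd.to_datetime(
        ['5-12-20', '6-12-20', '7-12-20', '8-12-20', '9-12-20', '10-12-20'],
        format='%m-%d-%y'),
})
df2 = pd.DataFrame({
    'Id': [1, 1, 3, 3, 5],
    'ApplicationDate': pd.to_datetime(
        ['4-11-20', '4-12-20', '5-12-20', '8-12-20', '1-12-20'],
        format='%m-%d-%y'),
})
df3 = df1.merge(df2, on='Id', how='left')
# NaT from unmatched ids compares as False, so those ids count as 0
counts = df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()
print(counts.tolist())  # [2, 0, 1, 0, 1, 0]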
df1.merge(df2, on="Id", how="left").assign(
    temp=lambda x: x.ApplicationDate_y.notna(),
    tempr=lambda x: x.ApplicationDate_x > x.ApplicationDate_y,
    counter=lambda x: x.temp & x.tempr,
).groupby("Id").counter.sum()
Id
1 2
2 0
3 1
4 0
5 1
6 0
Name: counter, dtype: int64
The code above merges the dataframes and then sums the boolean conditions per group to get the count.
I have a Pandas dataset with 3 columns. I need to group by the ID column while finding the sum and count of the other two columns. Also, I have to ignore the zeroes in the columns 'A' and 'B'.
The dataset looks like -
ID A B
1 0 5
2 10 0
2 20 0
3 0 30
What I need -
ID A_Count A_Sum B_Count B_Sum
1 0 0 1 5
2 2 30 0 0
3 0 0 1 30
I have tried this using one column but wasn't able to get both the aggregations in the final dataset.
(df.groupby('ID').agg({'A':'sum', 'A':'count'}).reset_index().rename(columns = {'A':'A_sum', 'A': 'A_count'}))
If you don't pass it columns specifically, it will aggregate the numeric columns by itself.
Since you don't want to count 0, replace the zeros with NaN first:
import numpy as np

df.replace(0, np.nan, inplace=True)
print(df)
ID A B
0 1 NaN 5.0
1 2 10.0 NaN
2 2 20.0 NaN
3 3 NaN 30.0
df = df.groupby('ID').agg(['count', 'sum'])
print(df)
A B
count sum count sum
ID
1 0 0.0 1 5.0
2 2 30.0 0 0.0
3 0 0.0 1 30.0
To remove the MultiIndex columns, you can use a list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print(df)
A_count A_sum B_count B_sum
ID
1 0 0.0 1 5.0
2 2 30.0 0 0.0
3 0 0.0 1 30.0
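Alternatively (my sketch, starting again from the original df), named aggregation produces the flat column names in one step:
out = (df.replace(0, np.nan)
         .groupby('ID')
         .agg(A_Count=('A', 'count'), A_Sum=('A', 'sum'),
              B_Count=('B', 'count'), B_Sum=('B', 'sum'))
         .reset_index())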
df:
index a b c d
-
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
df expect:
index one two
-
0 1 2
1 2 3
2 5 6
3 1 5
The output example above is self-explanatory. Basically, I just need to shift the two non-NaN values from columns [a, b, c, d] into another set of two columns ["one", "two"].
Back fill the missing values along the rows and select the first 2 columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
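For reference, an equivalent sketch without bfill (my illustration; it assumes the original df with columns a, b, c, d and exactly two non-NaN values per row):
import numpy as np

arr = df[['a', 'b', 'c', 'd']].to_numpy(dtype=float)
# keep each row's non-NaN values, in order, and take the first two
picked = [row[~np.isnan(row)][:2] for row in arr]
out = pd.DataFrame(picked, columns=['one', 'two'], index=df.index).astype(int)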
Or combine_first (pop already removes the source columns, so no extra drop is needed):
df['two'] = df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns = ['index', 'one', 'two']
Or fillna:
df['two'] = df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns = ['index', 'one', 'two']
In both cases:
print(df)
gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output like jezrael's, add (works in both cases):
df=df.set_index('index')
And then:
print(df)
gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0