I'm new to Python. I have 2 dataframes each with a single column. I want to join them together and keep the values based on their respective positions in each of the tables.
My code looks something like this:
huh = pd.DataFrame(columns=['result'], data=['a','b','c','d'])
huh2 = pd.DataFrame(columns=['result2'], data=['aa','bb','cc','dd'])
huh2 = huh2.sort_values('result2', ascending=False)
tmp = pd.concat([huh,huh2], ignore_index=True, axis=1)
tmp
From the documentation it looks like the ignore_index flag and axis=1 should be sufficient to achieve this but the results obviously disagree.
Current Output:
0 1
0 a aa
1 b bb
2 c cc
3 d dd
Desired Output:
result result2
0 a dd
1 b cc
2 c bb
3 d aa
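ignore_index only resets the labels along the concatenation axis. If you concatenate the DataFrames horizontally (axis=1), the column names are discarded; if you concatenate vertically (axis=0), the index labels are discarded. You can only ignore one or the other, not both, and with axis=1 the rows are still aligned on their index labels.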
In your case, I would recommend setting the index of "huh2" to be the same as that of "huh".
pd.concat([huh, huh2.set_index(huh.index)], axis=1)
result result2
0 a dd
1 b cc
2 c bb
3 d aa
If you aren't dealing with custom indices, reset_index will suffice.
pd.concat([huh, huh2.reset_index(drop=True)], axis=1)
result result2
0 a dd
1 b cc
2 c bb
3 d aa
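To see why the original attempt kept the unsorted order, it may help to look at the index labels after sorting. The sketch below reuses the question's huh and huh2; the names aligned and positional are only illustrative.

```python
import pandas as pd

huh = pd.DataFrame(columns=["result"], data=["a", "b", "c", "d"])
huh2 = pd.DataFrame(columns=["result2"], data=["aa", "bb", "cc", "dd"])
huh2 = huh2.sort_values("result2", ascending=False)

# Sorting reorders the rows, but every row keeps its original index label.
print(huh2.index.tolist())

# axis=1 aligns rows on those labels, so the sort is effectively undone.
aligned = pd.concat([huh, huh2], axis=1)

# Dropping the old labels first makes the concat positional.
positional = pd.concat([huh, huh2.reset_index(drop=True)], axis=1)
print(positional)
```

The second concat gives the desired pairing because both frames then share the default 0..3 index.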
I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, that answer does not explain what to do when the primary df has other columns (such as column B) that should neither be part of the index nor take part in the operation.
Is somebody please able to advise?
I originally used a loop that finds the value in the other df and deducts it, but this takes too long to run with the size of data I am working with.
The idea is to specify the column(s) for matching and the column(s) to subtract, move every column that is not in cols into a MultiIndex, and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or index only by the matching column and fill the values that could not be subtracted (such as column B) back in from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
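A different route to the same result, not taken from the answer above, is to look up Df2's rows in the key order of Df1 and subtract the raw values. This is only a sketch: it assumes the keys in column A are unique in Df2, and it treats keys missing from Df2 as a subtraction of 0. The name lookup is illustrative.

```python
import pandas as pd

Df1 = pd.DataFrame({"A": list("xyz"), "B": list("jkl"),
                    "C": [5, 7, 9], "D": [2, 3, 4]})
Df2 = pd.DataFrame({"A": list("zxy"), "B": list("opq"),
                    "C": [1, 2, 3], "D": [1, 1, 1]})

cols = ["C", "D"]

# Reorder Df2's C and D so that row i belongs to the key in Df1's row i.
# Assumption: keys in A are unique in Df2; missing keys subtract 0.
lookup = Df2.set_index("A")[cols].reindex(Df1["A"]).fillna(0)

Df3 = Df1.copy()
# Subtract position by position; column B is never touched.
Df3[cols] = Df1[cols].to_numpy() - lookup.to_numpy()
print(Df3)
```

Because only the columns listed in cols are assigned, any other column in Df1 is carried over unchanged.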
This is the dataframe I have with three rows and three columns.
a d aa
b e bb
c f cc
What I want is to remove the second column and add its values as new rows in the first column, each paired with its respective value from the third column.
This is the expected result:
a aa
b bb
c cc
d aa
e bb
f cc
First, concatenate the two column pairs:
df1 = pd.concat([df[df.columns[[0,2]]], df[df.columns[[1,2]]]])
Then what you obtain is:
0 1 2
0 a NaN aa
1 b NaN bb
2 c NaN cc
0 NaN d aa
1 NaN e bb
2 NaN f cc
Now, just replace the NaN values in [0] with the corresponding values from [1].
df1[0] = df1[0].fillna(df1[1])
Output:
0 1 2
0 a NaN aa
1 b NaN bb
2 c NaN cc
0 d d aa
1 e e bb
2 f f cc
Here, you may only need [0] and [2] columns.
df1[[0,2]]
Final Output:
0 2
0 a aa
1 b bb
2 c cc
0 d aa
1 e bb
2 f cc
Here are 4 steps: split into 2 dataframes; make column names the same; concatenate; reindex.
import pandas as pd
df = pd.DataFrame({'col1':['a','b','c'],'col2':['c','d','e'],'col3':['aa','bb','cc']})
df2 = df[['col1','col3']] # split into 2 dataframes
df3 = df[['col2','col3']]
df3.columns = df2.columns # make column names the same
df_final = pd.concat([df2, df3]) # concatenate (DataFrame.append was removed in pandas 2.0)
df_final.index = range(len(df_final.index)) # reindex
print(df_final)
pd.concat([df[df.columns[[0, 2]]], df[df.columns[[1, 2]]]])
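The same reshaping can also be written with melt, which is not what the answers above use but avoids the intermediate NaN column. A minimal sketch, assuming the default integer column names 0, 1, 2 from the first answer; stacked and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([["a", "d", "aa"],
                   ["b", "e", "bb"],
                   ["c", "f", "cc"]])

# Keep column 2 as the identifier and stack columns 0 and 1 on top of each other.
stacked = df.melt(id_vars=[2], value_vars=[0, 1], value_name="value")

# Drop the helper "variable" column and put the stacked values first.
result = stacked[["value", 2]].reset_index(drop=True)
print(result)
```

melt repeats the identifier column once per stacked column, which is exactly the pairing the question asks for.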
I'm trying to ffill() values in two columns in a df based on a separate column. I'm hoping to continue filling until a condition is met. Using the df below, where Val1 and Val2 are equal to C, I want to fill subsequent rows until strings in Code begin with either ['FR','GE','GA'].
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Code' : ['CA','GA','YA','GE','XA','CA','YA','FR','XA'],
'Val1' : ['A','B','C','A','B','C','A','B','C'],
'Val2' : ['A','B','C','A','B','C','A','B','C'],
})
mask = (df['Val1'] == 'C') & (df['Val2'] == 'C')
cols = ['Val1', 'Val2']
df[cols] = np.where(mask, df[cols].ffill(), df[cols])
Intended output:
Code Val1 Val2
0 CA A A
1 GA B B
2 YA C C
3 GE A A
4 XA B B
5 CA C C
6 YA C C
7 FR B B
8 XA C C
Note: Strings in Code are shortened to be two characters but are longer in my dataset, so I'm hoping to use startswith
This is similar to a start/stop signal problem that I have answered before, but I couldn't find that answer. So here's the solution, along with the other things you mentioned:
# check for C
is_C = df.Val1.eq('C') & df.Val2.eq('C')
# check for start substring with regex
startswith = df.Code.str.match("^(FR|GE|GA)")
# merge the two series
# startswith is 0, is_C is 1
mask = np.select((startswith,is_C), (0,1), np.nan)
# update mask with ffill
# rows after an `is_C` and before a `startswith` will be marked with 1
mask = pd.Series(mask, df.index).ffill().fillna(0).astype(bool)
# update the dataframe
df.loc[mask, ['Val1','Val2']] = 'C'
Output
Code Val1 Val2
0 CA A A
1 GA B B
2 YA C C
3 GE A A
4 XA B B
5 CA C C
6 YA C C
7 FR B B
8 XA C C
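The start/stop idea can be wrapped in a small reusable function. This is a sketch only: fill_until_stop and its parameters are hypothetical names, not part of pandas, and the prefixes are escaped and passed to str.match, which anchors at the start of the string and so behaves like a startswith test.

```python
import re

import numpy as np
import pandas as pd


def fill_until_stop(df, cols, value, stop_prefixes, code_col="Code"):
    """Hypothetical helper: carry `value` forward in `cols` from every row
    where all of `cols` equal `value` until a row whose `code_col` starts
    with one of `stop_prefixes`."""
    start = df[cols].eq(value).all(axis=1)
    stop_pat = "|".join(re.escape(p) for p in stop_prefixes)
    stop = df[code_col].str.match(stop_pat)

    # 1 = start signal, 0 = stop signal, NaN = no signal on this row.
    signal = pd.Series(np.nan, index=df.index)
    signal[start] = 1.0
    signal[stop] = 0.0  # a stop row wins over a start row

    active = signal.ffill().fillna(0).astype(bool)
    out = df.copy()
    out.loc[active, cols] = value
    return out


df = pd.DataFrame({
    "Code": ["CA", "GA", "YA", "GE", "XA", "CA", "YA", "FR", "XA"],
    "Val1": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "Val2": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
})
result = fill_until_stop(df, ["Val1", "Val2"], "C", ["FR", "GE", "GA"])
print(result)
```

The function returns a copy, so the original df is left unchanged.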
I want to merge rows of the two dataframes below whenever a string in the Test1 column of DF2 contains a value from the Test1 column of DF1 as a substring.
DF1 = pd.DataFrame({'Test1':list('ABC'),
'Test2':[1,2,3]})
print (DF1)
Test1 Test2
0 A 1
1 B 2
2 C 3
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
'Test2':[1,2,3,4]})
print (DF2)
Test1 Test2
0 ee 1
1 bA 2
2 cCc 3
3 D 4
With str.contains I can already find which values of DF1.Test1 occur as substrings in DF2.Test1:
INPUT:
for i in DF1.Test1:
    ok = DF2[DF2.Test1.str.contains(i)]
    print(ok)
OUTPUT:
Now I would like the output to also include the matching DF1 rows, i.e. a merge of each DF1.Test1 value with the DF2 rows whose Test1 string contains it.
OUTPUT:
I tried pd.merge together with an if statement, but I have not found the right code yet.
Do you have suggestions please?
for i in DF1.Test1:
    if DF2.Test1.str.contains(i) == 'True':
        ok = pd.merge(DF1, DF2, on= ['Test1'[i]], how='outer')
        print(ok)
Thank you for your ideas :)
I could not respond to jezrael's comment because of my reputation, but I turned his answer into a function that merges case-insensitively by lowercasing the text first.
def str_merge(part_string_df,full_string_df, merge_column):
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('('+ pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df, left_on= merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x',merge_column_lower + '_y','Test3'],axis=1)
    return DF
Used with example:
DF1 = pd.DataFrame({'Test1':list('ABC'),
'Test2':[1,2,3]})
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
'Test2':[1,2,3,4]})
print(str_merge(DF1,DF2, 'Test1'))
Test1_x Test2_x Test1_y Test2_y
0 B 2 bA 2
1 C 3 cCc 3
I believe you need to extract the matched values into a new column, merge on it, and finally remove the helper column Test3:
pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('('+ pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on= 'Test1', right_on='Test3').drop('Test3', axis=1)
print (DF)
Test1_x Test2_x Test1_y Test2_y
0 A 1 bA 2
1 C 3 cCc 3
Detail:
print (DF2)
Test1 Test2 Test3
0 ee 1 NaN
1 bA 2 A
2 cCc 3 C
3 D 4 NaN
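Note that str.extract keeps only the first match per string, so a DF2 value containing several DF1 values is paired with just one of them. If every pairing is wanted, one alternative is a cross join followed by a substring filter. This is a sketch under assumptions: it needs pandas 1.2 or later for how="cross", and it builds every DF1 x DF2 pair, so it only suits small frames. The names pairs and is_sub are illustrative.

```python
import pandas as pd

DF1 = pd.DataFrame({"Test1": list("ABC"), "Test2": [1, 2, 3]})
DF2 = pd.DataFrame({"Test1": ["ee", "bA", "cCc", "D"], "Test2": [1, 2, 3, 4]})

# Every DF1 row paired with every DF2 row; overlapping names get _x / _y.
pairs = DF1.merge(DF2, how="cross")

# Keep the pairs where the DF1 value is a plain substring of the DF2 value.
is_sub = [x in y for x, y in zip(pairs["Test1_x"], pairs["Test1_y"])]
result = pairs[is_sub].reset_index(drop=True)
print(result)
```

Using Python's in operator instead of a regular expression also sidesteps escaping problems when the DF1 values contain regex metacharacters.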
>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since if you have two columns whose values represent "coordinates" and a third column representing values, and you want to convert that to a grid, then pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
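Since the two columns of df act as coordinates, the lookup can also be done directly with NumPy integer-array indexing instead of melt and merge. This is an alternative sketch, not the method described above; following the comments in the question, column 0 of df is taken as the column position in df1 and column 1 as the row position. The names rows, cols and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([["A", "B", "C"], ["D", "E", "F"]])

rows = df[1].to_numpy()  # row position in df1
cols = df[0].to_numpy()  # column position in df1

out = df.copy()
# Paired integer arrays pick one element per (row, col) coordinate.
out[3] = df1.to_numpy()[rows, cols]
print(out)
```

This relies on df1 having default positional labels; with custom labels the coordinates would first have to be translated to positions.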