I want to merge 2 dataframes and first is dm.shape = (21184, 34), second is po.shape = (21184, 6). I want to merge them then it will be 40 columns. I write as this
dm = dm.merge(po, left_index=True, right_index=True)
then it is dm.shape = (4554, 40) my rows decreased.
P.s po is the PolynomialFeatures of numerical data of dm.
Problem is different index values, so convert them to default RangeIndex in both DataFrames:
df = dm.reset_index(drop=True).merge(po.reset_index(drop=True),
left_index=True,
right_index=True)
Solution with concat - by default outer join, but if same index values in both working same:
df = pd.concat([dm.reset_index(drop=True), po.reset_index(drop=True)], axis=1)
Or use:
dm = pd.DataFrame([dm.values.flatten().tolist(), po.values.flatten().tolist()]).rename(index=dict(zip(range(2),[*po.columns.tolist(), *dm.columns.tolist()]))).T
You can use the method join and set the parameter on to the index of the joined dataframe:
df1 = pd.DataFrame({'col1': [1, 2]}, index=[1,2])
df2 = pd.DataFrame({'col2': [3, 4]}, index=[3,4])
df1.join(df2, on=df2.index)
Output:
col1 col2
1 1 3
2 2 4
The joined dataframe must not contain duplicated indices.
Related
I would like to assign agent_code to specific number of rows in df2.
df1
df2
Thank you.
df3 (Output)
First make sure in both DataFrames is default index by DataFrame.reset_index with drop=True, then repeat agent_code, convert to default index and last use concat:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
The goal of following code is to go through each row in df_label, extract app1 and app2 names, filter df_all using those two names, concatenate the result and return it as a dataframe. Here is the code:
def create_dataset(se):
# extracting the names of applications
app1 = se.app1
app2 = se.app2
# extracting each application from df_all
df1 = df_all[df_all.workload == app1]
df1.columns = df1.columns + '_0'
df2 = df_all[df_all.workload == app2]
df2.columns = df2.columns + '_1'
# combining workloads to create the pairs dataframe
df3 = pd.concat([df1, df2], axis=1)
display(df3)
return df3
df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all dataframes returned from apply. However, while display(df3) shows the correct dataframe, when returned from function, it's not a dataframe anymore and it's a series. A series with one element and that element seems to be the whole dataframe. Any ideas what I am doing wrong?
When you select a single column, you'll get a Series instead of a DataFrame so df1 and df2 will both be series.
However, concatenating them on axis=1 should produce a DataFrame (whereas combining them on axis=0 would produce a series). For example:
df = pd.DataFrame({'a':[1,2],'b':[3,4]})
df1 = df['a']
df2 = df['b']
>>> pd.concat([df1,df2],axis=1)
a b
0 1 3
1 2 4
>>> pd.concat([df1,df2],axis=0)
0 1
1 2
0 3
1 4
dtype: int64
When we merge two dataframes using pandas merge function, is it possible to ensure the key(s) based on which the two dataframes are merged is not repeated twice in the result? For e.g., I tried to merge two DFs with a column named 'isin_code' in the left DF and a column named 'isin' in the right DF. Even though the column/header names are different, the values of both the columns are same. In, the eventual result though, I get to see both 'isin_code' column and 'isin' column, which I am trying to avoid.
Code used:
result = pd.merge(df1,df2[['isin','issue_date']],how='left',left_on='isin_code',right_on = 'isin')
Either rename the columns to match before merge to uniform the column names and specify only on:
result = pd.merge(
df1,
df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
on='isin_code',
how='left'
)
OR drop the duplicate column after merge:
result = pd.merge(
df1,
df2[['isin', 'issue_date']],
how='left',
left_on='isin_code',
right_on='isin'
).drop(columns='isin')
Sample DataFrames and output:
import pandas as pd
df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})
df1:
isin_code a
0 1 4
1 2 5
2 3 6
df2:
isin issue_date
0 1 2021-01-02
1 3 2021-03-04
result:
isin_code a issue_date
0 1 4 2021-01-02
1 2 5 NaN
2 3 6 2021-03-04
I’ve got two data frames :-
Df1
Time V1 V2
02:00 D3F3 0041
02:01 DD34 0040
Df2
FileName V1 V2
1111.txt D3F3 0041
2222.txt 0000 0040
Basically I want to compare the v1 v2 columns and if they match print the row time from df1 and the row from df2 filename. So far all i can find is the
isin()
, which simply gives you a boolean output.
So the output would be :
1111.txt 02:00
I started using dataframes because i though i could query the two df's on the V1 / V2 values but I can't see a way. Any pointers would be much appreciated
Use merge on the dataframe columns that you want to have the same values. You can then drop the rows with NaN values, as those will not have matching values. From there, you can print the merged dataframes values however you see fit.
df1 = pd.DataFrame({'Time': ['8a', '10p'], 'V1': [1, 2], 'V2': [3, 4]})
df2 = pd.DataFrame({'fn': ['8.txt', '10.txt'], 'V1': [3, 2], 'V2': [3, 4]})
df1.merge(df2, on=['V1', 'V2'], how='outer').dropna()
=== Output: ===
Time V1 V2 fn
1 10p 2 4 10.txt
The most intuitive solution is:
1) iterate the V1 column in DF1;
2) for each item in this column, check if this item exists in the V1 column of DF2;
3) if the item exists in DF2's V1, then find the index of that item in the DF2 and then you would be able to find the file name.
You can try using pd.concat.
On this case it would be like:
pd.concat([df1, df2.reindex(df1.index)], axis=1)
It will create a new dataframe with all the values, but in case there are some values that doesn't match in both dataframes, it'll return NaN. If you doesn't want this to happen you must use this:
pd.concat([df1, df4], axis=1, join='inner')
If you wanna learn a bit more, use pydata: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
You can use merge option with inner join
df2.merge(df1,how="inner",on=["V1","V2"])[["FileName","Time"]]
While I think Eric's solution is more pythonic, if your only aim is to print the rows on which df1 and df2 have v1 and v2 values the same, provided the two dataframes are of the same length, you can do the following:
for row in range(len(df1)):
if (df1.iloc[row,1:] == df2.iloc[row,1:]).all() == True:
print(df1.iloc[row], df2.iloc[row])
Try this:
client = boto3.client('s3')
obj = client.get_object(Bucket='', Key='')
data = obj['Body'].read()
df1 = pd.read_excel(io.BytesIO(data), sheet_name='0')
df2 = pd.read_excel(io.BytesIO(data), sheet_name='1')
head = df2.columns[0]
print(head)
data = df1.iloc[[8],[0]].values[0]
print(data)
print(df2)
df2.columns = df2.iloc[0]
df2 = df2.drop(labels=0, axis=0)
df2['Head'] = head
df2['ID'] = pd.Series([data,data])
print(df2)
df2.to_csv('test.csv',index=False)
I'm searching and haven't found an answer to this question, can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL merge using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes have the synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1. Then, creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 in list1. This would give me the rows from df2 I need but, that still wouldn't put me in position to merge df1 and df2 matching the appropriate rows. I think if there an OR merge option, I can step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
#will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'), df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')]