I am trying to merge two dataframes:
this are tables I am trying to merge
and this is the result I want to achieve
So, I want merged values not to be on the same row.
When I try to merge like that:
pd.merge(a,b,how='inner', on=['date_install','device_os'])
it merges values in one raw.
Please,suggest how it can be resolved
You can use append():
df1.append(df2, ignore_index = True)
or by using concat():
pd.concat([df,df1], ignore_index = True)
or using merge():
df.merge(df1, how = 'outer')
Reference taken from: https://stackoverflow.com/a/24391268/3748167 (probable duplicate)
The only difference here is concatenating both dataframes first and then applying rest of the logic.
def sjoin(x):
return ';'.join(x[x.notnull()].astype(str))
pd.concat([a,b]).groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
Related
I have a pandas dataframe with the following data: (in csv)
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output: (in csv)
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same. I think calling the whole data frame of ~4000 entries just to add ~100 more will be a little pointless and cumbersome. How can I make it in such a way that it picks list 1 (or dataframe 1) and picks the highest poke_id; and just does i + 1 to the later entries in list 2.
Your solution is good, is possible simplify:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
use indexes to get what data you want from the dataframe, although this is not effective if you want large amounts of data from the dataframe, this method allows you to take specific amounts of data from the dataframe
I have three dataframes with Date_Final as column under them all. I want to merge them all and Currently, my code is. I want to merge these dataframes into a single dataframe where firstly I create a new dataframe from any stock's dataframe.
What should be the optimized code?
df_stockanlaysis = stock1[['Date_final']]
df_stockanlaysis = pd.merge(df_stockanlaysis, stock1, how='left', on ="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock2, how='left', on ="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock3, how='left', on ="Date_final")
You could use a reduce, pandas and list comprehension.
Put those dataframes in a list like
dfs = [df_stockanlaysis1, df_stockanlaysis2, df_stockanlaysis3]
and the common column in this case i assume its 'Date_final'
df_final = reduce(lambda left,right: pd.merge(left,right,on='Date_final'), dfs)
I am trying to join two dataframes with the following data:
df1
df2
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL then the join should occur only on 'column1' of df1 and 'col1' of df2 but if it is not NULL/blank then the join should occur on two conditions, i.e. 'column1', 'column2' of df1 with 'col1', 'col2' of df2 respectively.
For reference the final dataframe that I wish to obtain is:
My current approach is that I'm trying to slice these 2 dataframes into 4 and then joining them seperately based on the condition. Is there any way to do this without slicing them or maybe a better way that I'm missing out??
Idea is rename columns before left join by both columns first and then replace missing value by matching by column1, here is necessary remove duplicates by DataFrame.drop_duplicates before Series.map for unique values in col1:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
EDIT: General solution working with multiple columns - first part is same, is used left join, in second part is used merge by one column with DataFrame.combine_first for replace missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
I have been trying to merge multiple dataframes using reduce() function mentioned in this link pandas three-way joining multiple dataframes on columns.
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
However, in my case the join columns are different for the related dataframes. Therefore I would need to use different left_on and right_on values on every merge.
I have come up with a workaround, which is not efficient or elegant in any way, but for now it works. I would like to know if the same can be achieved using reduce() or may be other efficient alternatives. I am foreseeing that there would be many dataframes I would need to join down-the-line.
import pandas as pd
...
...
# xml files - table1.xml, table2.xml and table3.xml are converted to <dataframe11>, <dataframe2>, <dataframe3> respectively.
_df = {
'table1' : '<dataframe1>',
'table2' : '<dataframe2>',
'table3' : '<dataframe3>'
}
# variable that tells column1 of table1 is related to column2 of table2, which can be used as left_on/right_on while merging dataframes
_relationship = {
'table1': {
'table2': ['NAME', 'DIFF_NAME']},
'table2': {
'table3': ['T2_ID', 'T3_ID']}
}
def _join_dataframes(_rel_pair):
# copy
df_temp = dict(_df)
for ele in _rel_pair:
first_table = ele[0]
second_table = ele[1]
lefton = _onetomany[first_table][second_table][0]
righton = _onetomany[first_table][second_table][1]
_merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
left_on=lefton, right_on=righton, how="inner")
df_temp[ele[1]] = _merged_df
return _merged_df
# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]
_join_dataframes(_rel_pair)
Why don't you just rename the columns of all the dataframes first?
df0.rename({'commonname': 'old_column_name0'}, inplace=True)
.
.
.
.
dfN.rename({'commonname': 'old_column_nameN'}, inplace=True)
dfs = [df0, df1, df2, ... , dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
Try using the concat function, instead of reduce.
A simple trick I like to use when merging DFs is setting the index on the columns I want to use as a guide when merging. Example:
# note different column names 'B' and 'C'
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B']
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D']
df = pd.concat([dfA, dfB], axis=1)
You will need unique indexes / multiindexes for this to work, but I think this should be no problem for most cases. Never tried a large concat, but this approach should theoretically work for N concats.
Alternatively, you can use merge instead, as it provide left_on and right_on parameters specially for those situations where column names differ between dataframes. An example:
dfA.merge(dfB, left_on='name', right_on='username')
A more complete explanation on how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work exactly because it produces two name columns. I need to then somehow combine the two name columns so that missing names from one name column are filled in by the missing name from the other name column.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use coallesce function which returns the first not-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it. Just a simple change from original poster's solution:
df1.join(df2, 'name', 'outer')
df3 = df1.join(df2, ['name'], 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html