I am trying to combine two dataframes but am not getting the right result. I have two files: one contains the column names and the other contains the data. I want to merge them.
I am using '\001' as the delimiter.
Example:
df1:
56447MNEMILY 2703546.742893.9553218262930LP2018-11-21 09:18:46.040618
62872ILOPDYKE 1708138.269688.8052618165922LP2018-11-21 09:18:46.040618
04925MECARATUNK 2302545.231369.9861207221305LP2018-11-21 09:18:46.040618
df2:
meli_zip_cd_basemeli_stt_provncdmeli_city_nmmeli_typmeli_cntry_fipsmeli_latimeli_longimeli_area_cdmeli_fin_cdmeli_last_lnmeli_facmeli_msa_cdmeli_pmsa_cdmeli_dma_cdload_dt
Expected final result:
df_final:
meli_zip_cd_basemeli_stt_provncdmeli_city_nmmeli_typmeli_cntry_fipsmeli_latimeli_longimeli_area_cdmeli_fin_cdmeli_last_lnmeli_facmeli_msa_cdmeli_pmsa_cdmeli_dma_cdload_dt
56447MNEMILY 2703546.742893.9553218262930LP2018-11-21 09:18:46.040618
62872ILOPDYKE 1708138.269688.8052618165922LP2018-11-21 09:18:46.040618
04925MECARATUNK 2302545.231369.9861207221305LP2018-11-21 09:18:46.040618
If I understand you correctly, you want the first (and only) row from df2 to become the header of the first (and only) column in df1:
df1.columns = df2.iloc[0].values
I think I got the solution:
import pandas as pd
# header=None keeps the first data row from being consumed as a header
df1 = pd.read_csv('/medaff/eureka/CDP/DMN_MELI_ZIP/DMN_MELI_ZIP.txt', delimiter='\001', header=None)
df2 = pd.read_csv('/medaff/eureka/CDP/HEADERS/DMN_MELI_ZIP_HEADER.txt', delimiter='\001')
df1.columns = df2.columns
df1.to_csv('/medaff/eureka/CDP/HEADERS/test.txt', sep='\001', index=False)
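For anyone who wants to try this without the files, here is a minimal sketch that uses in-memory strings in place of the data file and the header file. The two sample rows and the three column names (zip_cd, state_cd, city_nm) are made up for illustration; only the '\001' delimiter and the header-assignment step come from the thread.

```python
import io
import pandas as pd

SEP = "\x01"  # the same character as '\001'

# Hypothetical stand-ins for the data file (no header row) and the header file.
data_txt = SEP.join(["56447", "MN", "EMILY"]) + "\n" + SEP.join(["04925", "ME", "CARATUNK"]) + "\n"
header_txt = SEP.join(["zip_cd", "state_cd", "city_nm"]) + "\n"

# header=None treats every line as data; dtype=str keeps leading zeros in codes.
df1 = pd.read_csv(io.StringIO(data_txt), sep=SEP, header=None, dtype=str)
# The header file has a single line, so this yields an empty frame with named columns.
df2 = pd.read_csv(io.StringIO(header_txt), sep=SEP)

df1.columns = df2.columns

# With no path argument, to_csv returns the text instead of writing a file.
out = df1.to_csv(sep=SEP, index=False)
```

Reading with dtype=str is my own choice here, not something the thread asks for; it stops a ZIP code such as 04925 from being parsed as the integer 4925.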
Related
I am trying to join two dataframes with the following data:
df1
df2
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL then the join should occur only on 'column1' of df1 and 'col1' of df2 but if it is not NULL/blank then the join should occur on two conditions, i.e. 'column1', 'column2' of df1 with 'col1', 'col2' of df2 respectively.
For reference the final dataframe that I wish to obtain is:
My current approach is to slice these 2 dataframes into 4 and then join them separately based on the condition. Is there a way to do this without slicing them, or a better way that I'm missing?
The idea is to rename the columns and do a left join on both columns first, then fill the missing values by matching on column1 alone. Before Series.map, duplicates must be removed with DataFrame.drop_duplicates so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
EDIT: A general solution that works with multiple columns. The first part is the same left join; the second part merges on one column and uses DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.rstrip('_'))).drop(cols, axis=1)
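Since the sample frames from the question did not survive, here is a small sketch of the two-step join on made-up data. Note one deliberate difference, which is my reading of the question rather than what the answer above does: the fallback lookup is restricted to df2 rows whose col2 is missing, so a row that has a col2 value can only match on both columns. Blank strings are not handled here and would need converting to missing values first.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the thread.
df1 = pd.DataFrame({"column1": ["a", "a", "b", "c"],
                    "column2": ["x", "q", "y", "z"]})
df2 = pd.DataFrame({"col1": ["a", "b"],
                    "col2": ["x", None],
                    "col3": [10, 20]})

# Step 1: strict left join on both key columns.
df22 = df2.rename(columns={"col1": "column1", "col2": "column2"})
result = df1.merge(df22, on=["column1", "column2"], how="left")

# Step 2: fallback lookup keyed on col1 alone, built only from the df2 rows
# where col2 is missing (assumption: those are the "join on one column" rows).
fallback = (df2[df2["col2"].isna()]
            .drop_duplicates("col1")
            .set_index("col1")["col3"])

# Fill only the rows that step 1 left unmatched.
result["col3"] = result["col3"].fillna(result["column1"].map(fallback))
```

Here ('a', 'x') matches on both columns, ('b', 'y') falls back to the col1-only match, and ('a', 'q') and ('c', 'z') stay unmatched.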
So I have 2 csv files with the same number of columns. The first csv file has its columns named (age, sex). The second file doesn't name its columns, but its data corresponds to the columns of the first csv file. How can I concat them properly?
First csv.
Second csv.
This is how I read my files:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header=None)
I tried using concat() like this, but I get 4 columns as a result:
df = pd.concat([df1, df2])
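You can also use the append function, although DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions pd.concat is the replacement. Be careful to give both frames the same column names, otherwise you will end up with 4 columns.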
Check this link, I found it very useful.
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df = pd.concat([df1, df2], ignore_index=True)
I found a solution. After reading the second file I added
df2.columns = df1.columns
Works just like I wanted to. I guess I better research more next time :). Thanks
Final code:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
df = pd.concat([df1, df2])
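To see why the column assignment matters, here is a self-contained sketch with in-memory text standing in for input1.csv and input2.csv. The sample values are invented; only the age and sex column names come from the question.

```python
import io
import pandas as pd

# Hypothetical file contents: the first has a header row, the second does not.
csv_with_header = "age,sex\n30,M\n25,F\n"
csv_without_header = "41,F\n19,M\n"

df1 = pd.read_csv(io.StringIO(csv_with_header))
df2 = pd.read_csv(io.StringIO(csv_without_header), header=None)

# df2's columns are labelled 0 and 1 at this point, so concat aligns nothing
# and produces four columns: age, sex, 0, 1.
misaligned = pd.concat([df1, df2])

# Give df2 the same labels, then concat stacks the rows as intended.
df2.columns = df1.columns
df = pd.concat([df1, df2], ignore_index=True)
```

Passing ignore_index=True is optional; it just renumbers the rows 0..3 instead of repeating 0 and 1 from each input.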
I'm trying to execute a merge with pandas. The two files have a common key ("KEY_PLA") which I'm trying to use with a left join. But unfortunately, all columns which are transferred from the second file to the first file have NaN values.
Here is what I have done so far:
df_1 = pd.read_excel(path1, skiprows=1)
df_2 = pd.read_excel(path2, skiprows=1)
df_1.columns = ["Index", "KEY", "KEY_PLA", "INFO1", "INFO2"]
df_2.columns = ["Index", "KEY_PLA", "INFO4"]
df_1.drop(["Index"], axis=1, inplace=True)
df_2.drop(["Index"], axis=1, inplace=True)
# Merge all dataframes
df_merge = pd.DataFrame()
df_merge = df_1.merge(df_2, left_on="KEY_PLA", right_on="KEY_PLA", how="left")
print(df_merge)
This is the result:
Here are the excel files:
Excel1
Excel2
What is wrong with the code? I also checked the types and even converted the columns to strings, but nothing works.
I think the problem is that the joined KEY_PLA columns have different types: one holds integers and the other strings.
The solution is to cast both to the same type, e.g. to ints:
print (df_1['KEY_PLA'].dtype)
object
print (df_2['KEY_PLA'].dtype)
int64
df_1['KEY_PLA'] = df_1['KEY_PLA'].astype(int)
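A small sketch of the cast-then-merge step on made-up frames, since the Excel files are not available. The key values and the INFO columns' contents are hypothetical; the column names follow the question.

```python
import pandas as pd

# Hypothetical data: the key is stored as text in df_1 and as integers in df_2.
df_1 = pd.DataFrame({"KEY_PLA": ["101", "102", "103"],
                     "INFO1": ["a", "b", "c"]})
df_2 = pd.DataFrame({"KEY_PLA": [101, 103],
                     "INFO4": ["x", "z"]})

# Inspect the key dtypes first; they must agree for values to match.
print(df_1["KEY_PLA"].dtype, df_2["KEY_PLA"].dtype)

# Cast the text key to integers so both sides share one dtype.
df_1["KEY_PLA"] = df_1["KEY_PLA"].astype("int64")

df_merge = df_1.merge(df_2, on="KEY_PLA", how="left")
```

If the text keys contain anything that is not a clean integer, the cast will raise; in that case casting both sides to str is the other direction to try.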
I have two Excel files, loaded as df1 and df2.
df1.columns : url, content, ortheryy
df2.columns : url, content, othterxx
Some contents in df1 are empty, and df1 and df2 share some urls (not all).
What I want to do is fill df1's empty content from df2 if the row has the same url.
I tried
ndf = pd.merge(df1, df2[['url', 'content']], on='url', how='left')
# how='inner' gives the same result
This results in two columns: content_x and content_y.
I know it can be solved by looping through df1 and df2, but I'd like to do it the pandas way.
I think you need Series.combine_first or Series.fillna:
df1['content'] = df1['content'].combine_first(ndf['content_y'])
Or:
df1['content'] = df1['content'].fillna(ndf['content_y'])
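It works because the left join keeps df1's row order, so ndf has the same index values as df1. This assumes df1 has the default 0..n-1 index and that each url appears at most once in df2; otherwise the rows will not line up.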
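If you would rather not depend on the index lining up, a variant is to look the content up by url with Series.map. This is a sketch on invented data, not the method from the answer above; the url and content column names come from the question.

```python
import pandas as pd

# Hypothetical data; df1 deliberately has a non-default index.
df1 = pd.DataFrame({"url": ["u1", "u2", "u3"],
                    "content": ["c1", None, None]},
                   index=[10, 11, 12])
df2 = pd.DataFrame({"url": ["u2", "u4"],
                    "content": ["from_df2", "unused"]})

# One content value per url; drop_duplicates keeps the lookup index unique.
lookup = df2.drop_duplicates("url").set_index("url")["content"]

# map() returns a Series on df1's own index, so fillna aligns row by row.
df1["content"] = df1["content"].fillna(df1["url"].map(lookup))
```

Rows whose url is absent from df2 (u3 here) simply stay empty.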
Suppose I have the following dataframes in PySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work because it produces two name columns. I then need to somehow combine the two name columns so that names missing from one are filled in from the other.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use the coalesce function, which returns the first non-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
    .withColumn("name_", coalesce("df1.name", "df2.name"))
    .drop("name")
    .withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it. It is just a small change to the original poster's solution:
df1.join(df2, 'name', 'outer')
df3 = df1.join(df2, ['name'], 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html