I have two excel, named df1 and df2.
df1.columns : url, content, ortheryy
df2.columns : url, content, othterxx
Some contents in df1 are empty, and df1 and df2 share some urls(not all).
What I want to do is fill df1's empty content by df2 if that row has same url.
I tried
ndf = pd.merge(df1, df2[['url', 'content']], on='url', how='left')
# how='inner' result same
Which result:
two column: content_x and content_y
I know it can be solve by loop through df1 and df2, but I'd like to do is in pandas way.
I think need Series.combine_first or Series.fillna:
df1['content'] = df1['content'].combine_first(ndf['content_y'])
Or:
df1['content'] = df1['content'].fillna(ndf['content_y'])
It works, because left join create in ndf same index values as df1.
Related
I have 2 data frame from a basic web scrape using Pandas (below). The second table has less columns than the first, and I need to concat the dataframes. I have been manually inserting columns for a while but seeing as they change frequently I would like to have a function that can assess the columns in df2, check whether they are all in df2, and if not, add the column, with the data from df2.
import pandas as pd
link = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_French_presidential_election'
df = pd.read_html(link,header=0)
df1 = df[1]
df1 = df1.drop([0])
df1 = df1.drop('Abs.',axis=1)
df2 = df[2]
df2 = df2.drop([0])
df2 = df2.drop(['Abs.'],axis=1)
Many thanks,
#divingTobi's answer:
pd.concat([df1, df2]) does the trick.
I am trying to join two dataframes with the following data:
df1
df2
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL then the join should occur only on 'column1' of df1 and 'col1' of df2 but if it is not NULL/blank then the join should occur on two conditions, i.e. 'column1', 'column2' of df1 with 'col1', 'col2' of df2 respectively.
For reference the final dataframe that I wish to obtain is:
My current approach is that I'm trying to slice these 2 dataframes into 4 and then joining them seperately based on the condition. Is there any way to do this without slicing them or maybe a better way that I'm missing out??
Idea is rename columns before left join by both columns first and then replace missing value by matching by column1, here is necessary remove duplicates by DataFrame.drop_duplicates before Series.map for unique values in col1:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
EDIT: General solution working with multiple columns - first part is same, is used left join, in second part is used merge by one column with DataFrame.combine_first for replace missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
So I have 2 csv files with the same number of columns. The first csv file has its columns named (age, sex). The second file though doesn't name its columns like the first one but it's data corresponds to the according column of the first csv file. How can I concat them properly?
First csv.
Second csv.
This is how I read my files:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header=None)
I tried using concat() like this but I get 4 columns as a result..
df = pd.concat([df1, df2])
You can also use the append function. Be careful to have the same column names for both, otherwise you will end with 4 columns.
Check this link, I found it very useful.
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
df = df1.append(df2, ignore_index=True)
I found a solution. After reading the second file I added
df2.columns = df1.columns
Works just like I wanted to. I guess I better research more next time :). Thanks
Final code:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
df = pd.concat([df1, df2])
I am trying to concat two dataframes using the below code. df1 is a daily update of values to the indexes in df2, which is an ongoing monthly dataset. df3 is the result that is saved.
The problem I am experiencing is that when an index value is not in df1 (no values for that particular day), it gets deleted from df3 altogether. In other words, if the index value is not in df2, then it doesn't appear in df3 at all.
How can I keep the original index of df3, so that if the index value is not in df1, it doesn't delete it? I also cannot enter 0 values, as it is relevant to the data that it is empty.
import os
import pandas as pd
import glob
def Monthly_aggregation_merge(month, date):
# file to be merged
df1 = pd.read_csv(r'Data\{}\{}\Aggregated\Aggregated_Daily_All.csv'.format(month,date), usecols=['CU', 'Parameters', 'Total/Max/Min'], index_col =[0,1])
df1 = df1.rename(columns = {'Total/Max/Min':date}) # Change column name
# original file that data should be merged with
df2 = pd.read_csv(r'Data\{}\MonthlyData\July2017NEW.csv'.format(month), index_col = [0,1])
df3 = pd.concat([df2, df1], axis=1).reindex(df1.index)
df3.to_csv(r'Data\{}\MonthlyData\July2017NEW.csv'.format(month))
print 'Monthly Merge Done!'
Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work exactly because it produces two name columns. I need to then somehow combine the two name columns so that missing names from one name column are filled in by the missing name from the other name column.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use coallesce function which returns the first not-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it. Just a simple change from original poster's solution:
df1.join(df2, 'name', 'outer')
df3 = df1.join(df2, ['name'], 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html