I have a list of dataframes as:
[df1, df2, df3, ..., df100, oddDF]
Each dataframe dfi has DateTime as column1 and Temperature as column2. Except the dataframe oddDF which has DateTime as column1 and has temperature columns in column2 and column3.
I am looking to create a list of dataframe or one dataframe which has the common temperatures from each of df1, .. df100 and oddDF
I am trying the following:
dfs = [df0, df1, df2, .., df100, oddDF]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
But it produces df_final as empty
If however I do just:
dfs = [df0, df1, df2, .., df100]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
df_final produces the right answer.
How do I incorporate oddDF in the code also. I have checked to make sure that oddDF's DateTime column has the common dates with
df1, df2, .., df100
Related
I want to merge two dataframes like:
df1.columns = A, B, C, E, ..., D
df2.columns = A, B, C, F, ..., D
If I merge them, it merges on all columns. Also since the number of columns is high I don't want to specify them in on. I prefer to exclude the columns which I don't want to be merged. How can I do that?
mdf = pd.merge(df1, df2, exclude D)
I expect the result be like:
mdf.columns = A, B, C, E, F ..., D_x, D_y
You mentioned you mentioned you don't want to use on "since the number of columns is much".
You could still use on this way even if there are a lot of columns:
mdf = pd.merge(df1, df2, on=[i for i in df1.columns if i != 'D'])
Or
By using pd.Index.difference
mdf = pd.merge(df1, df2, on=df1.columns.difference(['D']).tolist())
Another solution can be:
mdf = pd.merge(df1, df2, on= df1.columns.tolist().remove('D')
What about dropping the unwanted column after the merge?
You can use pandas.DataFrame.drop:
mdf = pd.merge(df1, df2).drop('D', axis=1)
or dropping before the merge:
mdf = pd.merge(df1.drop('D', axis=1), df2.drop('D', axis=1))
One solution is using intersection and then difference on df1 and df2 columns:
mdf = pd.merge(df1, df2, on=df1.columns.intersection(df2.columns).difference(['D']).tolist())
The other solution could be renaming columns you want to exclude from merge:
df2.rename(columns={"D":"D_y"}, inplace=True)
mdf = pd.merge(df1, df2)
I have a:
Dataframe df1 with columns A, B and C. A is the index.
Dataframe df2 with columns D, E and F. D is the index.
What’s an efficient way to drop from df1 all rows where B is not found in df2 (in D the index)?
If need drop some not exist values it is same like select only existing values. So is possible use:
You can filter df1.B by index from df2 in Series.isin:
df3 = df1[df1.B.isin(df2.index)]
Or by DataFrame.merge with left join:
df3 = df1.merge(df2[[]], left_on='B', right_index=True, how='left')
The following 2 series of stocks in a single excel file:
Can be combined using the date as index?
The result should be like this:
You need a simple df.merge() here:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
OR
df = df1.join(df2, how='outer')
I am trying this:
df3 = pd.concat([df1, df2]).sort_values('Date').reset_index(drop=True)
or
df3 = df1.append(df2).sort_values('Date').reset_index(drop=True)
I have two dataframe df1, df2
df1.columns
['id','a','b']
df2.columns
['id','ab','cd','ab_test','mn_test']
Expected out column is ['id','a','b','ab_test','mn_test']
How to get the all the columns from df1, and columns which contain test in the column name
pseudocode > pd.merge(df1,df2,how='id')
You can merge and use filter one the second dataframe to keep the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:,(df2.columns=='id')|df2.columns.str.contains('test')], on='id')
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
I have a partitioned dataframe say df1. From df1 i will create df2 and df3..
df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
df1 =df1.repartition(400, "key")
df2 = df.groupBy("col1", "col2").agg(sum(colx))
df3 = df1.join(df2, ["col1", "col2"])
I want to know will df3 retain same partition of df1? or do i need to re-partition df3 again?.
Partitioning of df3 will be totally different comparing to df1. And (probably) df2 will have spark.sql.shuffle.partitions (default: 200) number of partitions, not 400.