Intersection of pandas dataframe with multiple columns


I have a list of dataframes:
[df1, df2, df3, ..., df100, oddDF]
Each dataframe dfi has DateTime as its first column and Temperature as its second. The exception is oddDF, which has DateTime as its first column and temperature columns as its second and third.
I want to create a list of dataframes, or a single dataframe, that holds the temperatures common to df1, ..., df100 and oddDF.
I am trying the following:
from functools import reduce
import pandas as pd
dfs = [df0, df1, df2, .., df100, oddDF]
df_final = reduce(lambda left, right: pd.merge(left, right, on='DateTime'), dfs)
But this produces an empty df_final.
If, however, I leave oddDF out:
dfs = [df0, df1, df2, .., df100]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
then df_final is correct.
How do I incorporate oddDF as well? I have checked that oddDF's DateTime column shares dates with df1, df2, .., df100.
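The question does not show the data, so the cause of the empty result cannot be confirmed. Two things worth checking are whether oddDF's DateTime column has the same dtype as the others (for example strings versus timestamps) and whether the repeated Temperature column names collide across merges. The following is a minimal sketch under those assumptions; the helper name intersect_on_datetime, the sample frames, and the column names TempA and TempB are hypothetical.

```python
from functools import reduce

import pandas as pd


def intersect_on_datetime(frames, key="DateTime"):
    """Inner-join all frames on `key` after coercing the key to datetime."""
    prepared = []
    for i, frame in enumerate(frames):
        frame = frame.copy()
        # Assumption: mismatched key dtypes (str vs. datetime) are the problem.
        frame[key] = pd.to_datetime(frame[key])
        # Give every value column a unique name so merge suffixes never collide.
        frame = frame.rename(
            columns={c: f"{c}_{i}" for c in frame.columns if c != key}
        )
        prepared.append(frame)
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="inner"), prepared
    )


# Hypothetical sample data standing in for df1, df2 and oddDF.
df1 = pd.DataFrame({
    "DateTime": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "Temperature": [1.0, 2.0, 3.0],
})
df2 = pd.DataFrame({
    "DateTime": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-04"]),
    "Temperature": [2.5, 3.5, 4.5],
})
# oddDF stores its dates as strings and has two temperature columns.
oddDF = pd.DataFrame({
    "DateTime": ["2020-01-03", "2020-01-02"],
    "TempA": [30.0, 20.0],
    "TempB": [31.0, 21.0],
})

df_final = intersect_on_datetime([df1, df2, oddDF])
print(df_final)
```

If the result is still empty after normalizing the key, comparing dfs[-1]['DateTime'].dtype with the other frames and checking for differing time-of-day components would be a reasonable next step.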

Related

pandas merge by excluding certain columns from merge

I want to merge two dataframes like:
df1.columns = A, B, C, E, ..., D
df2.columns = A, B, C, F, ..., D
If I merge them, pandas merges on all common columns. Because there are many columns, I don't want to list them in on; I would rather exclude the columns that should not be part of the merge key. How can I do that? In pseudocode:
mdf = pd.merge(df1, df2, exclude D)
I expect the result to look like:
mdf.columns = A, B, C, E, F ..., D_x, D_y

You mentioned that you don't want to use on because there are many columns.
You can still use on without listing the columns by hand:
mdf = pd.merge(df1, df2, on=[i for i in df1.columns if i != 'D'])
Or, using pd.Index.difference:
mdf = pd.merge(df1, df2, on=df1.columns.difference(['D']).tolist())

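Another solution is to drop the label from the column index (list.remove modifies the list in place and returns None, so its result cannot be passed to on directly):
mdf = pd.merge(df1, df2, on=df1.columns.drop('D').tolist())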

What about dropping the unwanted column after the merge?
You can use pandas.DataFrame.drop:
mdf = pd.merge(df1, df2).drop('D', axis=1)
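or drop it before the merge:
mdf = pd.merge(df1.drop('D', axis=1), df2.drop('D', axis=1))
Note that the first variant still uses D as a join key, and both variants leave D out of the result instead of producing D_x and D_y.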

One solution is using intersection and then difference on df1 and df2 columns:
mdf = pd.merge(df1, df2, on=df1.columns.intersection(df2.columns).difference(['D']).tolist())
Another option is to rename the columns you want to exclude from the merge, so that they no longer match:
df2.rename(columns={"D":"D_y"}, inplace=True)
mdf = pd.merge(df1, df2)
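As a small illustration of the intersection-then-difference approach, the sketch below uses made-up frames in which df1 has a column E that df2 lacks and df2 has a column F that df1 lacks. The sample values are assumptions chosen only to make the D_x and D_y suffixes visible.

```python
import pandas as pd

# Hypothetical frames: A and B are shared keys, D is shared but excluded.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "E": [0.1, 0.2], "D": [10, 20]})
df2 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "F": [7, 8], "D": [10, 99]})

# Shared columns minus the one to exclude; E and F drop out automatically
# because they are not present in both frames.
keys = df1.columns.intersection(df2.columns).difference(["D"]).tolist()

mdf = pd.merge(df1, df2, on=keys)
print(keys)
print(mdf)
```

Because D is not a key here, the second row is kept even though its D values differ, and both versions of D survive as D_x and D_y.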

How to drop dataframe rows not in another dataframe?

I have:
Dataframe df1 with columns A, B and C. A is the index.
Dataframe df2 with columns D, E and F. D is the index.
What's an efficient way to drop from df1 all rows where B is not found in the index of df2 (D)?

Dropping the rows whose values do not exist is the same as selecting only the rows whose values do exist.
You can test df1.B against the index of df2 with Series.isin:
df3 = df1[df1.B.isin(df2.index)]
Or use DataFrame.merge with an inner join (a left join would keep every row of df1):
df3 = df1.merge(df2[[]], left_on='B', right_index=True, how='inner')
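A minimal sketch of the Series.isin filter on made-up data follows. The index values and column contents are assumptions; only the column and index names follow the question.

```python
import pandas as pd

# df1 is indexed by A; df2 is indexed by D.
df1 = pd.DataFrame(
    {"B": ["x", "y", "z"], "C": [1, 2, 3]},
    index=pd.Index([10, 11, 12], name="A"),
)
df2 = pd.DataFrame(
    {"E": [0.1, 0.2], "F": [True, False]},
    index=pd.Index(["x", "z"], name="D"),
)

# Keep only the rows of df1 whose B value appears in df2's index.
df3 = df1[df1["B"].isin(df2.index)]
print(df3)
```

The boolean mask leaves df1's own index and columns untouched, which is usually what is wanted when the goal is only to filter rows.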

Combine series by date

I have two series of stock data in a single Excel file.
Can they be combined using the date as the index?

You need a simple df.merge() here:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
Or:
df = df1.join(df2, how='outer')
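The original sample data is not available, so the sketch below uses two made-up frames indexed by date (the column names StockA and StockB are hypothetical) to show what the outer join produces.

```python
import pandas as pd

# Two hypothetical stock series with partially overlapping dates.
df1 = pd.DataFrame(
    {"StockA": [1.0, 2.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02"]),
)
df2 = pd.DataFrame(
    {"StockB": [5.0, 6.0]},
    index=pd.to_datetime(["2021-01-02", "2021-01-03"]),
)

# The outer join keeps every date from both frames; gaps become NaN.
df = df1.join(df2, how="outer")
print(df)
```

Dates present in only one series are kept, with NaN in the other series' column.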

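I would try this (it assumes Date is a regular column, not the index):
df3 = pd.concat([df1, df2]).sort_values('Date').reset_index(drop=True)
or, on pandas versions older than 2.0 (DataFrame.append was removed in pandas 2.0):
df3 = df1.append(df2).sort_values('Date').reset_index(drop=True)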
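To make the behavior of the concat variant concrete, here is a sketch on made-up frames where Date is an ordinary column (StockA and StockB are hypothetical names). Note that it stacks rows rather than aligning them, so a date that occurs in both inputs appears twice.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-03"]),
    "StockA": [1.0, 3.0],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-02", "2021-01-03"]),
    "StockB": [2.0, 4.0],
})

# Stack the rows, then order them by date and renumber the index.
df3 = pd.concat([df1, df2]).sort_values("Date").reset_index(drop=True)
print(df3)
```

If one row per date is wanted, an outer merge or join on the date is likely the better fit.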

How to drop column from the target data frame, but the column(s) are required for the join in merge

I have two dataframes, df1 and df2:
df1.columns
['id','a','b']
df2.columns
['id','ab','cd','ab_test','mn_test']
The expected output columns are ['id','a','b','ab_test','mn_test'].
How can I get all the columns from df1, plus the columns from df2 whose names contain test?
Pseudocode: pd.merge(df1, df2, on='id')

You can merge and use filter on the second dataframe to keep only the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:,(df2.columns=='id')|df2.columns.str.contains('test')], on='id')
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
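The same idea on frames that actually contain rows, as a minimal sketch: the values are made up, and only the column names come from the question. It also checks that the regex filter and the boolean-mask selection pick the same columns.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})
df2 = pd.DataFrame({
    "id": [1, 2],
    "ab": [0, 0],
    "cd": [0, 0],
    "ab_test": [5, 6],
    "mn_test": [7, 8],
})

# Keep the join key plus every column whose name contains "test".
right = df2.filter(regex=r"^id$|test")

# Equivalent selection with a boolean mask over the column labels.
mask = (df2.columns == "id") | df2.columns.str.contains("test")
right_mask = df2.loc[:, mask]

out = df1.merge(right, on="id")
print(out)
```

Filtering df2 before the merge means the unwanted columns ab and cd never enter the result, so nothing has to be dropped afterwards.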

Pyspark OLD dataframe partition to New Dataframe

I have a partitioned dataframe, say df1. From df1 I will create df2 and df3:
from pyspark.sql.functions import concat, sum
df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
df1 = df1.repartition(400, "key")
df2 = df1.groupBy("col1", "col2").agg(sum(colx))
df3 = df1.join(df2, ["col1", "col2"])
Will df3 retain the same partitioning as df1, or do I need to repartition df3?

The partitioning of df3 will be completely different from that of df1: the join is on col1 and col2, not on key. In addition, df2 will (probably) have spark.sql.shuffle.partitions (default: 200) partitions, not 400.
