I have a dataframe called df1. I then create a filter like this:
df2 = df1.loc[(df1['unit'].str.contains('Ph'))]
How do I remove the rows identified in df2 from df1? Thanks!
Use ~, the NOT operator, for boolean indexing:
df3 = df1.loc[~(df1['unit'].str.contains('Ph'))]
Now, df3 is df1 minus df2.
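For instance, a minimal sketch with made-up unit values (the actual data is not shown in the question):
import pandas as pd

# made-up data, just to illustrate the pattern
df1 = pd.DataFrame({'unit': ['Ph1', 'Ph2', 'Kg1', 'Lb1'], 'value': [1, 2, 3, 4]})

df2 = df1.loc[df1['unit'].str.contains('Ph')]    # rows whose unit contains 'Ph'
df3 = df1.loc[~df1['unit'].str.contains('Ph')]   # everything else

print(df3)
#   unit  value
# 2  Kg1      3
# 3  Lb1      4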
I would like to assign agent_code to a specific number of rows in df2.
df1
df2
df3 (Output)
Thank you.
First make sure both DataFrames have the default index by using DataFrame.reset_index with drop=True, then repeat agent_code by number, reset its index as well, and finally use concat:
import pandas as pd

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# repeat each agent_code 'number' times, align by position and join column-wise
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
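A minimal sketch with hypothetical agent_code/number values and a hypothetical order_id column in df2 (the original tables are not shown), just to illustrate the repeat-and-concat step:
import pandas as pd

# hypothetical inputs: df1 holds each code and how many rows it should cover,
# df2 holds the rows the codes are assigned to
df1 = pd.DataFrame({'agent_code': ['A1', 'A2'], 'number': [2, 3]})
df2 = pd.DataFrame({'order_id': [10, 11, 12, 13, 14]})

s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
print(df3)
#    order_id agent_code
# 0        10         A1
# 1        11         A1
# 2        12         A2
# 3        13         A2
# 4        14         A2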
I have used a simple 'groupby' to condense rows in a Pandas dataframe:
df = df.groupby(['col1', 'col2', 'col3']).sum()
In the new DataFrame 'df', the three columns used in the 'groupby' are now fixed within the index and are no longer columns 0, 1 and 2; what was previously column 4 is now column 0.
How do I stop this from happening / reinclude the three 'groupby' columns along with the original data?
Try:
df = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
# or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
Try resetting the index
df = df.reset_index()
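Both variants keep col1, col2 and col3 as ordinary columns. A small sketch with made-up data and a hypothetical val column:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': ['x', 'x', 'y'],
                   'col3': [1, 1, 2],
                   'val':  [10, 20, 30]})   # 'val' is a hypothetical data column

out1 = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
out2 = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()

print(out1)
#   col1 col2  col3  val
# 0    a    x     1   30
# 1    b    y     2   30
# out2 is identical: the grouping keys stay as ordinary columns instead of becoming the index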
I have a:
Dataframe df1 with columns A, B and C. A is the index.
Dataframe df2 with columns D, E and F. D is the index.
What’s an efficient way to drop from df1 all rows where B is not found in df2's index D?
Dropping the values that do not exist is the same as selecting only the values that do exist, so you can do either of the following.
You can filter df1.B by the index of df2 with Series.isin:
df3 = df1[df1.B.isin(df2.index)]
Or use DataFrame.merge with an inner join (the default), which keeps only the rows of df1 whose B appears in the index of df2 (assuming that index is unique, otherwise matching rows are duplicated):
df3 = df1.merge(df2[[]], left_on='B', right_index=True, how='inner')
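A small sketch with hypothetical values, showing that the rows of df1 whose B is missing from the index of df2 are dropped:
import pandas as pd

# hypothetical frames: A is the index of df1, D is the index of df2
df1 = pd.DataFrame({'B': ['d1', 'd2', 'd9'], 'C': [1, 2, 3]},
                   index=pd.Index(['a1', 'a2', 'a3'], name='A'))
df2 = pd.DataFrame({'E': [10, 20], 'F': [30, 40]},
                   index=pd.Index(['d1', 'd2'], name='D'))

df3 = df1[df1.B.isin(df2.index)]
print(df3)
#      B  C
# A
# a1  d1  1
# a2  d2  2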
I have two dataframes, df1 and df2:
df1.columns
['id','a','b']
df2.columns
['id','ab','cd','ab_test','mn_test']
Expected output columns: ['id','a','b','ab_test','mn_test']
How do I get all the columns from df1, plus the columns from df2 that contain 'test' in the column name?
Pseudocode: pd.merge(df1, df2, on='id')
You can merge and use filter on the second dataframe to keep the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:, (df2.columns == 'id') | df2.columns.str.contains('test')], on='id')
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
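And with a couple of hypothetical rows, to show which values are carried across:
import pandas as pd

# hypothetical rows for the two frames above
df1 = pd.DataFrame({'id': [1, 2], 'a': ['a1', 'a2'], 'b': ['b1', 'b2']})
df2 = pd.DataFrame({'id': [1, 2], 'ab': [5, 6], 'cd': [7, 8],
                    'ab_test': [9, 10], 'mn_test': [11, 12]})

print(df1.merge(df2.filter(regex=r'^id$|test'), on='id'))
#    id   a   b  ab_test  mn_test
# 0   1  a1  b1        9       11
# 1   2  a2  b2       10       12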
I have a partitioned DataFrame, say df1. From df1 I will create df2 and df3:
df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
df1 = df1.repartition(400, "key")
df2 = df1.groupBy("col1", "col2").agg(sum("colx"))
df3 = df1.join(df2, ["col1", "col2"])
I want to know whether df3 will retain the same partitioning as df1, or whether I need to repartition df3 again.
The partitioning of df3 will be totally different from that of df1, and df2 will (probably) have spark.sql.shuffle.partitions (default: 200) partitions, not 400.
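A minimal sketch (assuming df1, df2 and df3 are built as above in an active SparkSession) for checking partition counts and repartitioning df3 explicitly when the original layout is needed:
# assumes df1, df2 and df3 were built exactly as above
print(df1.rdd.getNumPartitions())   # 400, from the explicit repartition on "key"
print(df2.rdd.getNumPartitions())   # usually spark.sql.shuffle.partitions (default 200)
print(df3.rdd.getNumPartitions())   # also governed by the join's shuffle, not by df1

# repartition df3 explicitly if downstream steps rely on the original layout
df3 = df3.repartition(400, "key")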