PySpark: old DataFrame partitioning to new DataFrame - python

I have a partitioned DataFrame, say df1. From df1 I will create df2 and df3:
from pyspark.sql.functions import concat, sum

df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
df1 = df1.repartition(400, "key")
df2 = df1.groupBy("col1", "col2").agg(sum("colx"))
df3 = df1.join(df2, ["col1", "col2"])
I want to know: will df3 retain the same partitioning as df1, or do I need to repartition df3 again?

The partitioning of df3 will be totally different from that of df1. And df2 will (probably) have spark.sql.shuffle.partitions (default: 200) partitions, not 400, because both the groupBy aggregation and the join introduce their own shuffle.
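A quick way to see this in practice (a minimal sketch, assuming a SparkSession named spark is in scope; getNumPartitions is a standard RDD method):
# Inspect the partition count after each step.
print(df1.rdd.getNumPartitions())  # 400, set explicitly by repartition(400, "key")
print(df2.rdd.getNumPartitions())  # sized by the shuffle setting, not by df1
print(df3.rdd.getNumPartitions())  # likewise set by the join's shuffle
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '200' unless overridden
So if df3 must be partitioned by key again, repartition it explicitly after the join.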

Related

Assign specific value from a column to specific number of rows

I would like to assign agent_code to a specific number of rows in df2. Thank you.
[df1, df2, and df3 (the expected output) were shown as tables that did not survive extraction; per the answer below, df1 has an agent_code column and a number column holding the row counts.]
First make sure both DataFrames have the default index, via DataFrame.reset_index with drop=True; then repeat agent_code, reset its index back to the default, and finally use concat:
import pandas as pd

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
# Repeat each agent_code 'number' times, then align with df2 by position.
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
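For illustration, with made-up data standing in for the missing tables:
df1 = pd.DataFrame({'agent_code': ['A1', 'A2'], 'number': [2, 3]})
df2 = pd.DataFrame({'order_id': [10, 11, 12, 13, 14]})
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
print(pd.concat([df2, s], axis=1))
#    order_id agent_code
# 0        10         A1
# 1        11         A1
# 2        12         A2
# 3        13         A2
# 4        14         A2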

How to drop column from the target data frame, but the column(s) are required for the join in merge

I have two DataFrames, df1 and df2:
df1.columns
# ['id', 'a', 'b']
df2.columns
# ['id', 'ab', 'cd', 'ab_test', 'mn_test']
The expected output columns are ['id', 'a', 'b', 'ab_test', 'mn_test'].
How do I get all the columns from df1, plus the df2 columns whose names contain "test"?
Pseudocode: pd.merge(df1, df2, how='id')
You can merge and use filter on the second DataFrame to keep only the columns of interest; the regex ^id$|test keeps id (needed as the join key) plus every column whose name contains test:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:,(df2.columns=='id')|df2.columns.str.contains('test')], on='id')
A quick check with empty frames confirms the resulting columns:
df1 = pd.DataFrame(columns=['id', 'a', 'b'])
df2 = pd.DataFrame(columns=['id', 'ab', 'cd', 'ab_test', 'mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
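Of the two, the filter(regex=...) form is usually the cleaner choice: it expresses the column selection in one readable expression rather than a hand-built boolean mask.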

How to append two completely different data sets in python?

I have to append two data sets. They have completely different rows and columns. I have tried the command:
df1 = pd.merge(df1, df2)
but it gives an error.
[Data Frame 1 and Data Frame 2 were shown as tables that did not survive extraction.]
If they have the same number of columns and the columns are in the same order, you can do:
df2.columns = df1.columns
df_concat = pd.concat([df1, df2], ignore_index=True)
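For example, with made-up frames standing in for the missing tables:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})
df2 = pd.DataFrame({'label': ['c'], 'amount': [3]})
df2.columns = df1.columns  # rename so the columns line up
print(pd.concat([df1, df2], ignore_index=True))
#   name  value
# 0    a      1
# 1    b      2
# 2    c      3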

How do I remove the rows identified in df2 from df1?

I have a dataframe called df1. I then create a filter like this:
df2 = df1.loc[(df1['unit'].str.contains('Ph'))]
How do I remove the rows identified in df2 from df1? Thanks!
Use ~, the boolean NOT operator, in boolean indexing:
df3 = df1.loc[~(df1['unit'].str.contains('Ph'))]
Now, df3 is df1 minus df2.
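An equivalent approach (a small sketch, relying on the fact that df1.loc[...] preserves df1's index labels in df2) is to drop df2's rows by index:
df3 = df1.drop(df2.index)  # removes exactly the rows that were selected into df2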

Intersection of pandas dataframe with multiple columns

I have a list of DataFrames:
[df1, df2, df3, ..., df100, oddDF]
Each DataFrame dfi has DateTime as its first column and Temperature as its second, except oddDF, which has DateTime as its first column and temperature columns in its second and third.
I am looking to create a list of DataFrames, or a single DataFrame, holding the common temperatures from each of df1, ..., df100 and oddDF.
I am trying the following:
dfs = [df0, df1, df2, .., df100, oddDF]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
But it produces an empty df_final.
If, however, I do just:
dfs = [df0, df1, df2, .., df100]
df_final = reduce(lambda left,right: pd.merge(left,right,on='DateTime'), dfs)
then df_final produces the right answer.
How do I incorporate oddDF into the code as well? I have checked that oddDF's DateTime column shares common dates with df1, df2, ..., df100.
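One common cause of an inner merge coming back empty is a dtype mismatch on the join key; a minimal check (pandas imported as pd; the dtype mismatch is only a guess here, not a confirmed diagnosis):
# Identical-looking keys still fail to match when one side is
# object (string) and the other datetime64.
print(df1['DateTime'].dtype, oddDF['DateTime'].dtype)
oddDF['DateTime'] = pd.to_datetime(oddDF['DateTime'])  # normalize if they differ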
