How to append two completely different data sets in Python?

I have to append two data sets. They have completely different rows and columns. I have tried the command:
df1 = pd.merge(df1, df2)
but it gives an error.
[Data Frame 1 and Data Frame 2 were shown as images in the original post.]

If they have the same number of columns and the columns are in the same order, you can do:
df2.columns = df1.columns
df_concat = pd.concat([df1, df2], ignore_index=True)
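For instance, a minimal runnable sketch with two made-up frames (all names and values here are hypothetical):
import pandas as pd

# two frames with the same column count but different labels
df1 = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})
df2 = pd.DataFrame({'label': ['c'], 'amount': [3]})

df2.columns = df1.columns  # align the labels so the rows stack cleanly
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
#   name  value
# 0    a      1
# 1    b      2
# 2    c      3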

Related

Assign a specific value from a column to a specific number of rows

I would like to assign agent_code to a specific number of rows in df2.
[df1, df2, and the expected output df3 were shown as tables in the original post.]
Thank you.
First make sure both DataFrames have a default index by calling DataFrame.reset_index with drop=True, then repeat agent_code, reset its index as well, and finally use concat:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
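A minimal sketch with hypothetical data (the agent_code and number columns are assumed from the question; the sale column is made up):
import pandas as pd

# df1 says how many rows in df2 each agent should cover
df1 = pd.DataFrame({'agent_code': ['A1', 'A2'], 'number': [2, 3]})
df2 = pd.DataFrame({'sale': [10, 20, 30, 40, 50]})

s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
print(df3)
#    sale agent_code
# 0    10         A1
# 1    20         A1
# 2    30         A2
# 3    40         A2
# 4    50         A2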

Pandas apply returns a Series instead of a DataFrame

The goal of the following code is to go through each row in df_label, extract the app1 and app2 names, filter df_all using those two names, concatenate the result, and return it as a dataframe. Here is the code:
def create_dataset(se):
    # extracting the names of applications
    app1 = se.app1
    app2 = se.app2
    # extracting each application from df_all
    df1 = df_all[df_all.workload == app1]
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == app2]
    df2.columns = df2.columns + '_1'
    # combining workloads to create the pairs dataframe
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    return df3

df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all the dataframes returned from apply. However, while display(df3) shows the correct dataframe, what comes back from the function is not a dataframe anymore but a Series: a Series with one element, and that element seems to be the whole dataframe. Any ideas what I am doing wrong?
When you select a single column, you'll get a Series instead of a DataFrame, so df1 and df2 will both be Series.
However, concatenating them on axis=1 should produce a DataFrame (whereas concatenating them on axis=0 would produce a Series). For example:
df = pd.DataFrame({'a':[1,2],'b':[3,4]})
df1 = df['a']
df2 = df['b']
>>> pd.concat([df1,df2],axis=1)
a b
0 1 3
1 2 4
>>> pd.concat([df1,df2],axis=0)
0 1
1 2
0 3
1 4
dtype: int64

Perform concat in pandas without including common columns twice

From the pandas documentation:
result = pd.concat([df1, df4], axis=1, join='inner')
How can I perform a concat operation without including the common columns twice? I only want to include them once. In this example, columns B and D appear twice after concat but have the same values.
Select the columns from df4 that are not in df1:
result = pd.concat([df1, df4[['F']]], axis=1, join='inner')
Or:
complementary = [c for c in df4 if c not in df1]
result = pd.concat([df1, df4[complementary]], axis=1, join='inner')
The latter expression will choose the complementary columns automatically.
P.S. If the columns with the same name are different in df1 and df4 (as it seems to be in your case), you can apply the same trick symmetrically and select only the complementary columns from df1.
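To see the column selection in action, here is a small sketch with hypothetical frames loosely modeled on the docs example (B and D overlap with the same values, F exists only in df4):
import pandas as pd

# B and D appear in both frames with identical values; F is unique to df4
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'],
                    'C': ['C0', 'C1'], 'D': ['D0', 'D1']})
df4 = pd.DataFrame({'B': ['B0', 'B1'], 'D': ['D0', 'D1'], 'F': ['F0', 'F1']})

complementary = [c for c in df4 if c not in df1]  # ['F']
result = pd.concat([df1, df4[complementary]], axis=1, join='inner')
print(list(result.columns))
# ['A', 'B', 'C', 'D', 'F']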

How to drop columns from the target data frame when the column(s) are required for the join in merge

I have two dataframes, df1 and df2:
df1.columns
['id','a','b']
df2.columns
['id','ab','cd','ab_test','mn_test']
The expected output columns are ['id','a','b','ab_test','mn_test'].
How can I get all the columns from df1, plus the columns from df2 that contain test in the column name?
Pseudocode: pd.merge(df1,df2,how='id')
You can merge, and use filter on the second dataframe to keep only the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:,(df2.columns=='id')|df2.columns.str.contains('test')], on='id')
Checking the resulting columns with empty frames:
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')

PySpark: old dataframe partitioning to new dataframe

I have a partitioned dataframe, say df1. From df1 I will create df2 and df3:
df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
df1 = df1.repartition(400, "key")
df2 = df1.groupBy("col1", "col2").agg(sum(colx))
df3 = df1.join(df2, ["col1", "col2"])
I want to know: will df3 retain the same partitioning as df1, or do I need to repartition df3 again?
The partitioning of df3 will be totally different from df1's. And df2 will (probably) have spark.sql.shuffle.partitions (default: 200) partitions, not 400.
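If you want to verify this yourself, a quick check (assuming an active SparkSession named spark and the df1/df2/df3 frames from the question; the counts in the comments are typical, not guaranteed) is:
print(spark.conf.get("spark.sql.shuffle.partitions"))  # usually '200'
print(df1.rdd.getNumPartitions())  # 400 after the explicit repartition
print(df2.rdd.getNumPartitions())  # set by the aggregation's shuffle
print(df3.rdd.getNumPartitions())  # set by the join's shuffle, not by df1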
