Can you merge elements of Pandas dataframes into tuples? - python

If you have two Pandas dataframes in Python with identical axes, is there a function to merge the elements as tuples so that they maintain their positions? If there is a better way to combine these dataframes without duplicating the number of indices or columns, that works as well.
Expected logic:
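Presumably each cell of the result should pair the two values at the same position, i.e. something like:
result.iloc[i, j] == (df1.iloc[i, j], df2.iloc[i, j])  # for every row i and column j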

Input:
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]})

You can do this in pure pandas:
(pd.concat([df1, df2])
   .stack()
   .groupby(level=[0, 1])
   .apply(tuple)
   .unstack()
)
Output:
        A        B
0  (1, 7)  (4, 10)
1  (2, 8)  (5, 11)
2  (3, 9)  (6, 12)

The operation you're looking for seems like "zip". That is, match elements of two sequences together into a sequence of tuples. If you look at each column in your dataframes and zip them together you will have a result that is a list of lists of tuples - what you want to be in your result dataframe. You can then construct a dataframe with the same columns and index out of that data. In code, that looks like this:
data = [list(zip(df1[col], df2[col])) for col in df1]  # one list of tuples per column
pd.DataFrame(dict(zip(df1.columns, data)), index=df1.index)
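With the df1 and df2 from the question, this should reproduce the tuple frame shown in the first answer.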

You can maybe use something like this to achieve what you want:
df3 = pd.DataFrame({x: list(zip(df1[x], df2[x])) for x in df1.columns})  # list() because zip is a lazy iterator in Python 3
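With the question's data this should give the same result; note that the dict comprehension rebuilds a default RangeIndex, so pass index=df1.index to pd.DataFrame if the inputs share a custom index.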

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]})

def add_dfs(df1, df2):
    # wrap every element in a 1-tuple so that + concatenates them elementwise
    for col in df1.columns:
        df1[col] = df1[col].apply(lambda x: (x,))
    for col in df2.columns:
        df2[col] = df2[col].apply(lambda x: (x,))
    df = df1 + df2  # using the + operator, which technically satisfies the question
    return df

df = add_dfs(df1, df2)
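Calling it on the question's data should give the same tuple frame as the first answer; note that add_dfs mutates df1 and df2 in place:
print(df)
#         A        B
# 0  (1, 7)  (4, 10)
# 1  (2, 8)  (5, 11)
# 2  (3, 9)  (6, 12)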

Related

Pyspark replace values on array column based on another dataframe

I have two dataframes. The first simply has some unique ids with associated names, like so:
Id  name
0   name_a
1   name_b
2   name_c
The second dataframe contains the ids from the first dataframe, stored in an array in each row:
Row_1  Row_2
0      [0, 2]
1      [1, 0]
My question: is it possible to replace the arrays in the second dataframe with the corresponding names from the first df, looked up by id? Like so:
Row_1  Row_2
0      [name_a, name_c]
1      [name_b, name_a]
It seems too time consuming to create a map of the first df and add it to the second df with a UDF. Any help on how to approach this is much appreciated.
Join using the array_contains function, then groupBy and collect_list:
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(0, "name_a"), (1, "name_b"), (2, "name_c")], ["Id", "name"])
df2 = spark.createDataFrame([(0, [0, 2]), (1, [1, 0])], ["Row_1", "Row_2"])

result = df2.join(
    df1, on=F.array_contains("Row_2", F.col("Id")), how="left"
).groupBy("Row_1").agg(
    F.collect_list("name").alias("Row_2")
)

result.show()
# +-----+----------------+
# |Row_1|           Row_2|
# +-----+----------------+
# |    0|[name_a, name_c]|
# |    1|[name_a, name_b]|
# +-----+----------------+
You can try using the explode function to convert the array into rows, then join the exploded ids with the initial dataframe, and in the last step do a groupBy and .agg(collect_list()):
from pyspark.sql.functions import explode, collect_list

# one row per (Row_1, id) pair; alias the exploded column so it can be joined on
df3 = df2.select(df2.Row_1, explode(df2.Row_2).alias("array_id"))
df4 = df3.join(df1, df3.array_id == df1.Id).select(df3.Row_1, df1.name)
df5 = df4.groupBy('Row_1').agg(collect_list('name').alias('name'))
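df5.show() should then give something like this (collect_list does not guarantee element order after the shuffle):
+-----+----------------+
|Row_1|            name|
+-----+----------------+
|    0|[name_a, name_c]|
|    1|[name_b, name_a]|
+-----+----------------+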
Reference links:
https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/
https://www.owenrumney.co.uk/pyspark-opposite-of-explode/

rename columns according to list

I have 3 lists of data frames and I want to add a suffix to each column according to which list of data frames it belongs to. It's all in order, so the first item in the suffix list should be appended to the columns of the data frames in the first list of data frames, etc. My attempt is below, but it's adding each item in the suffix list to each column.
In the expected output:
all columns in dfs in cat_a need group1 appended
all columns in dfs in cat_b need group2 appended
all columns in dfs in cat_c need group3 appended
data and code are here
df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('g', 'h')))
cat_a = [df1, df2]
cat_b = [df3, df4, df2]
cat_c = [df1]
suffix = ['group1', 'group2', 'group3']
dfs = [cat_a, cat_b, cat_c]

for x, y in enumerate(dfs):
    for i in y:
        suff = suffix
        i.columns = i.columns + '_' + suff[x]
thanks for taking a look!
Brian Joseph's answer is great, but I'd like to point out that you were very close; you just weren't renaming the columns correctly. Your last line should be like this:
i.columns = [col + '_' + suff[x] for col in i.columns]
instead of this:
i.columns = i.columns + '_' + suff[x]
Assuming you want to have multiple suffixes for some dataframes, I think this is what you want:
suffix_mapper = {
    'group1': [df1, df2],
    'group2': [df3, df4, df2],
    'group3': [df1]
}

for suffix, dfs in suffix_mapper.items():
    for df in dfs:
        df.columns = [f"{col}_{suffix}" for col in df.columns]
I think the issue is that you're not taking a copy of each dataframe, so the cat lists reference the same underlying df objects multiple times.
Try:
cat_a = [df1.copy(), df2.copy()]
cat_b = [df3.copy(), df4.copy(), df2.copy()]
cat_c = [df1.copy()]
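A quick demo of the aliasing, as a sketch using two of the question's frames (group names taken from the question):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b'))
df2 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd'))

cat_a = [df1, df2]  # no copies: both lists point at the same objects
cat_c = [df1]

for suffix, cat in [('group1', cat_a), ('group3', cat_c)]:
    for df in cat:
        df.columns = [f"{col}_{suffix}" for col in df.columns]

print(df1.columns.tolist())  # ['a_group1_group3', 'b_group1_group3'] - suffixed twice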

how to iterate over list of dataframes?

Basically, I have 5 pd.DataFrames, named df0, df1, df2, df3, df4. What I would like to do is use a for loop to add data to these 5 dataframes, something like:
for i, dataset in enumerate([df0, df1, df2, df3, df4]):
    dataset = pd.concat([dataset, NEW_DATA])
However, when you do it like this (or when you use a plain list instead of enumerate), 'dataset' refers to the dataset object rather than the name (i.e. df0). How can I solve this? For example, the second iteration should effectively be:
for i, dataset in enumerate([df0, df1, df2, df3, df4]):
    df1 = pd.concat([df1, NEW_DATA])
edit: I have also tried dictionaries, such as {'df0': df0, ...}; however, it again prints the dataset rather than the dataset's 'variable name'.
You can re-assign the new df into your list:
# setup example
df0 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df2 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))

# then
lst = [df0, df1, df2]
for i, df in enumerate(lst):
    newdata = pd.DataFrame([[0, 0], [0, 0]])  # (say)
    lst[i] = pd.concat([df, newdata])  # DataFrame.append was removed in pandas 2.0
df0, df1, df2 = lst

>>> df0
   0  1
0  8  7
1  9  1
2  5  6
0  0  0
1  0  0
But, BTW, it might be better to store your DataFrames collection in a dict instead of a list, if you want to refer to them by name instead of by index.
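For instance, a minimal sketch of the dict variant (new_data is a stand-in for whatever you are appending):
dfs = {'df0': df0, 'df1': df1, 'df2': df2}
new_data = pd.DataFrame([[0, 0], [0, 0]])
for name in dfs:
    dfs[name] = pd.concat([dfs[name], new_data])
print(dfs['df0'])  # each frame is reachable by name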
Edit: Rewriting the solution to provide some proper practice.
So the problem is you have a bunch of values that need to be updated through reassignment. There's a stylistic thing going on where if you have df1, df2, ..., maybe you'd much rather have them in a list.
Using a list in any case is also how I'd address the issue.
dfs = [df0, df1, df2, ...]
dfs = [pd.concat([df, NEW_DATA]) for df in dfs]
[df0, df1, df2, ...] = dfs
See how, if you'd just use dfs in general and refer to dfs[0] instead of df0, this solution could almost come for free?

How to concatenate two dataframes with different indices along column axis

I want to merge two dataframes: the first is dm with dm.shape = (21184, 34), and the second is po with po.shape = (21184, 6). After merging them there should be 40 columns. I wrote this:
dm = dm.merge(po, left_index=True, right_index=True)
but then dm.shape = (4554, 40): my rows decreased.
P.S. po holds the PolynomialFeatures of the numerical data of dm.
The problem is different index values, so convert them to the default RangeIndex in both DataFrames:
df = dm.reset_index(drop=True).merge(po.reset_index(drop=True),
                                     left_index=True,
                                     right_index=True)
A solution with concat - it does an outer join by default, but with identical index values in both it works the same way:
df = pd.concat([dm.reset_index(drop=True), po.reset_index(drop=True)], axis=1)
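If the row counts really do match, a quick sanity check on either result (shapes taken from the question):
print(df.shape)  # expected: (21184, 40)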
Or use:
dm = pd.DataFrame([dm.values.flatten().tolist(),
                   po.values.flatten().tolist()]
                  ).rename(index=dict(zip(range(2),
                                          [*po.columns.tolist(), *dm.columns.tolist()]))).T
You can use the join method and set the parameter on to the index of the joined dataframe:
df1 = pd.DataFrame({'col1': [1, 2]}, index=[1, 2])
df2 = pd.DataFrame({'col2': [3, 4]}, index=[3, 4])
df1.join(df2, on=df2.index)
Output:
   col1  col2
1     1     3
2     2     4
The joined dataframe must not contain duplicated indices.

Pandas: Assign MultiIndex Column from DataFrame

I have a DataFrame with MultiIndex columns. Suppose it is this:
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame({'col': np.arange(1.0, 5.0)}, index=index)
df = df.unstack(1)
(I know this definition could be more direct.) I now want to set a new level-0 column based on a DataFrame. For example:
df['col2'] = df['col'].applymap(lambda x: int(x < 3))
This does not work. The only method I have found so far is to add each column separately (see Pandas: add a column to a multiindex column dataframe), or some sort of convoluted joining process.
The desired result is a new level-0 column 'col2' with two level-1 subcolumns: 'a' and 'b'.
Any help would be much appreciated. Thank you.
I believe you need a solution with no unstack and stack: filter by boolean indexing, rename the values to avoid duplicates, and last append the rows (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df2 = df[df['col'] < 3].rename({'one': 'one1', 'two': 'two1'}, level=0)
print(df2)
        col
one1 a  1.0
     b  2.0

df = pd.concat([df, df2])
print(df)
        col
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
one1 a  1.0
     b  2.0
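For the layout the question actually asks for (a new level-0 column 'col2' with level-1 subcolumns 'a' and 'b'), a sketch that works directly on the unstacked frame, building the MultiIndex column block by hand (my construction, not from the answer above):
new_block = (df['col'] < 3).astype(int)  # DataFrame with columns 'a' and 'b'
new_block.columns = pd.MultiIndex.from_product([['col2'], new_block.columns])
df = pd.concat([df, new_block], axis=1)  # df now has ('col2', 'a') and ('col2', 'b')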
