Create combination sets for elements in a DataFrame - python

I am creating a "design of experiments" matrix from a DataFrame that represents the possible choices for each element.
I would like to create a column for each unique combination of elements in a DataFrame, which will represent one experimental set.
Constraints: Elements are not all the same size.
Input:
index Column1 Column2 Column3
a a1
b b1 b2 b3
c c1 c2
d d1
Desired Output:
index Column1 Column2 Column3 Column4 Column5 Column6
a a1 a1 a1 a1 a1 a1
b b1 b2 b3 b1 b2 b3
c c1 c1 c1 c2 c2 c2
d d1 d1 d1 d1 d1 d1
I have looked at zipping lists, but I am hoping to find a more elegant way.

Maybe some itertools action? :-)
from itertools import cycle, islice
import pandas as pd

idx = ['a', 'b', 'c', 'd']
df = pd.DataFrame([['a1', None, None], ['b1', 'b2', 'b3'], ['c1', 'c2', None], ['d1', None, None]],
                  index=idx,
                  columns=['Column1', 'Column2', 'Column3'])
NUM_OF_COLUMNS = 6
result = []
for r in df.values:
    # Filter None or other types of "empty" values you have:
    filtered = [x for x in r if x is not None]
    # Create a row by repeating the elements:
    rep_list = list(islice(cycle(filtered), NUM_OF_COLUMNS))
    result.append(rep_list)
res_df = pd.DataFrame(result,
                      index=idx,
                      columns=['Column' + str(i) for i in range(1, NUM_OF_COLUMNS + 1)])
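If the goal is every unique combination of row values (as the desired output suggests), one option is itertools.product. A minimal sketch, assuming the same df as above; note the ordering of the generated columns may differ from the example, but each unique combination appears exactly once:

```python
from itertools import product
import pandas as pd

idx = ['a', 'b', 'c', 'd']
df = pd.DataFrame([['a1', None, None], ['b1', 'b2', 'b3'],
                   ['c1', 'c2', None], ['d1', None, None]],
                  index=idx, columns=['Column1', 'Column2', 'Column3'])

# Keep only the non-empty choices in each row.
choices = [[x for x in row if x is not None] for row in df.values]
# Cartesian product: one tuple per experimental set (1 * 3 * 2 * 1 = 6).
combos = list(product(*choices))
# One column per combination, indexed like the original frame.
res_df = pd.DataFrame(combos, columns=idx).T
res_df.columns = ['Column' + str(i) for i in range(1, len(combos) + 1)]
print(res_df)
```

Because the number of columns is derived from the product of the row lengths, this also adapts automatically if the choice lists change size.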

Split column based on input string into multiple columns in pandas python

I have the pandas data frame below and I am trying to split col1 into multiple columns based on the split_format string.
Inputs:
split_format = 'id-id1_id2|id3'
data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
df = pd.DataFrame(data).style.hide_index()
df
col1 col2
a-a1_a2|a3 20
b-b1_b2|b3 21
c-c1_c2|c3 19
d-d1_d2|d3 18
Expected Output:
id id1 id2 id3 col2
a a1 a2 a3 20
b b1 b2 b3 21
c c1 c2 c3 19
d d1 d2 d3 18
Note: the special characters and the column names in split_format can be changed.
I think I was able to figure it out:
import re

col_name = re.split('[^0-9a-zA-Z]+', split_format)
df[col_name] = df['col1'].str.split('[^0-9a-zA-Z]+', expand=True)
del df['col1']
df
col2 id id1 id2 id3
0 20 a a1 a2 a3
1 21 b b1 b2 b3
2 19 c c1 c2 c3
3 18 d d1 d2 d3
I parse the separator symbols and then recursively split the strings on each token in turn, flattening the resulting list at every step until all the symbols have been consumed.
import pandas as pd

split_format = 'id-id1_id2|id3'
data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
df = pd.DataFrame(data)

# Collect the non-alphanumeric separator symbols from the format string.
symbols = [x for x in split_format if not x.isalnum()]

result = []
def parseTree(stringlist, symbols, result):
    if len(symbols) == 0:
        result.extend(stringlist)
        return
    token = symbols.pop(0)
    elements = [item.split(token) for item in stringlist]
    # Flatten one level, then recurse on the remaining symbols.
    flat_list = [item for sublist in elements for item in sublist]
    parseTree(flat_list, symbols, result)

df2 = pd.DataFrame(columns=['id', 'id1', 'id2', 'id3'])
for key, item in df.iterrows():
    symbols2 = symbols.copy()
    parseTree([item['col1']], symbols2, result)
    # Append the parsed row (DataFrame.append was removed in pandas 2.0).
    df2.loc[len(df2)] = result
    result.clear()
df2['col2'] = df['col2']
print(df2)
output:
id id1 id2 id3 col2
0 a a1 a2 a3 20
1 b b1 b2 b3 21
2 c c1 c2 c3 19
3 d d1 d2 d3 18

Check if value of one column exists in another column, put a value in another column in pandas

Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column with predefined values if a value is found in other columns.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The in_B, in_C could be any other string of choice. If values are present in multiple columns, then F holds multiple entries, as in rows 3 and 4 of column F (row 3 has two values and row 4 has three). So far, I have tried the below:
DF.F = np.where(DF.A.isin(DF.B), DF.A, 'in_B')
But it does not give the expected result. Any help?
STEPS:
Stack the dataframe.
Check for duplicate values.
Unstack to get the same structure back.
Use dot to get the required result.
df['new_col'] = df.stack().duplicated().unstack().dot(
    'In ' + df.columns + ',').str.strip(',')
OUTPUT:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D
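Unpacked into intermediate steps, the chain above works like this (a sketch on the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a4'],
                   'B': ['b1', 'a1', 'a1', 'a1'],
                   'C': ['c1', 'c2', 'a2', 'a2'],
                   'D': ['d1', 'd2', 'd3', 'a3'],
                   'E': ['e1', 'e2', 'e3', 'e4']})

stacked = df.stack()               # one long Series of every cell value
dup = stacked.duplicated()         # True where the value appeared earlier
mask = dup.unstack()               # back to the original row/column shape
labels = 'In ' + df.columns + ','  # 'In A,', 'In B,', ...
# Boolean-times-string dot product concatenates the labels of True cells.
df['new_col'] = mask.dot(labels).str.strip(',')
print(df)
```

The dot trick works because True * 'In B,' is 'In B,' while False * 'In B,' is an empty string, so each row sums to the joined labels of its duplicated columns.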

How to compute each cell as a function of index and column?

I have a use-case where it naturally fits to compute each cell of a pd.DataFrame as a function of the corresponding index and column i.e.
import pandas as pd
import numpy as np
data = np.empty((3, 3))
data[:] = np.nan
df = pd.DataFrame(data=data, index=[1, 2, 3], columns=['a', 'b', 'c'])
print(df)
> a b c
>1 NaN NaN NaN
>2 NaN NaN NaN
>3 NaN NaN NaN
and I'd like (this is only a mock example) to get a result that is a function f(index, column):
> a b c
>1 a1 b1 c1
>2 a2 b2 c2
>3 a3 b3 c3
In order to accomplish this I need something different from apply or applymap, where the lambda gets the coordinates in terms of the index and column, i.e.
def my_cell_map(ix, col):
    return col + str(ix)
Here it is possible to use numpy - add the index values to the column names with broadcasting and pass the result to the DataFrame constructor:
a = df.columns.to_numpy() + df.index.astype(str).to_numpy()[:, None]
df = pd.DataFrame(a, index=df.index, columns=df.columns)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT: For processing by column names it is possible to use x.name with the index values:
def f(x):
    return x.name + x.index.astype(str)

df = df.apply(f)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT1: For your function it is necessary to use another lambda to loop over the index values:
def my_cell_map(ix, col):
    return col + str(ix)

def f(x):
    return x.index.map(lambda y: my_cell_map(y, x.name))

df = df.apply(f)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT2: It is also possible to loop over the index and column values and set each cell by loc; for a large DataFrame performance will be slow:
for c in df.columns:
    for i in df.index:
        df.loc[i, c] = my_cell_map(i, c)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3

How to compare two data frames with same columns but different number of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows I would just do this:
newDF = df1['D']-df2['D']
However there are times when the number of rows are different. I want a result Dataframe which shows a dataframe like this.
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if the 1st row of A,B,C in df1 matches the 1st row of A,B,C in df2, then and only then compare the 1st row of column D in each dataframe; similarly, repeat for every row.
Use merge and df.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
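If rows that exist in only one frame also matter, a variant worth sketching (assuming the same df1/df2 as above) is an outer merge with indicator=True: the _merge column flags where each row came from, and unmatched rows get NaN in Diff:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'], 'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'], 'D': [2, 1]})

# how='outer' keeps rows present in only one frame; _merge says which one.
res = df1.merge(df2, on=['A', 'B', 'C'], suffixes=['_df1', '_df2'],
                how='outer', indicator=True)
res['Diff'] = res['D_df1'] - res['D_df2']
print(res)
```

Filtering on _merge == 'both' recovers the inner-merge result from the accepted answer.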

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each df to total in main_df, matching on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it with a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Is there another way to solve it?
Thank you!
First you should align your numeric column names. In this case:
df_main = df_main.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
        .groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
        .reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
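The float totals in the reduce version come from fill_value promoting the dtype during alignment. A small follow-up sketch (reconstructing the sample frames above as assumed dicts) casts back to int once the sum is complete:

```python
from functools import reduce
import pandas as pd

df_main = pd.DataFrame({'Cri1': ['A1', 'B1', 'C1'], 'Cri2': ['A2', 'B2', 'C2'],
                        'Cri3': ['A3', 'B3', 'C3'], 'value': [4, 5, 6]})
df_1 = pd.DataFrame({'Cri1': ['A1', 'B1'], 'Cri2': ['A2', 'B2'],
                     'Cri3': ['A3', 'B3'], 'value': [1, 2]})
df_2 = pd.DataFrame({'Cri1': ['A1', 'C1'], 'Cri2': ['A2', 'C2'],
                     'Cri3': ['A3', 'C3'], 'value': [9, 10]})
df_3 = pd.DataFrame({'Cri1': ['B1', 'C1'], 'Cri2': ['B2', 'C2'],
                     'Cri3': ['B3', 'C3'], 'value': [15, 17]})

dfs = [d.set_index(['Cri1', 'Cri2', 'Cri3']) for d in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs)
# fill_value promotes to float; cast back to int after the sum is complete.
res = res.astype(int).reset_index()
print(res)
```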
