Create combination sets for elements in a DataFrame - python

I am creating a "design of experiments" matrix from a DataFrame that represents the possible choices for each element.
I would like to create a column for each unique combination of elements in a DataFrame, which will represent one experimental set.
Constraints: Elements are not all the same size.
Input:
index Column1 Column2 Column3
a a1
b b1 b2 b3
c c1 c2
d d1
Desired Output:
index Column1 Column2 Column3 Column4 Column5 Column6
a a1 a1 a1 a1 a1 a1
b b1 b2 b3 b1 b2 b3
c c1 c1 c1 c2 c2 c2
d d1 d1 d1 d1 d1 d1
I have looked at zipping lists, but I am hoping to find a more elegant way.

Maybe some itertools action? :-)
from itertools import cycle, islice
import pandas as pd

idx = ['a', 'b', 'c', 'd']
df = pd.DataFrame([['a1', None, None], ['b1', 'b2', 'b3'], ['c1', 'c2', None], ['d1', None, None]],
                  index=idx,
                  columns=['Column1', 'Column2', 'Column3'])
NUM_OF_COLUMNS = 6
result = []
for r in df.values:
    # Filter None or other types of "empty" values you have:
    filtered = [x for x in r if x is not None]
    # Create a row by repeating the elements:
    rep_list = list(islice(cycle(filtered), NUM_OF_COLUMNS))
    result.append(rep_list)
res_df = pd.DataFrame(result,
                      index=idx,
                      columns=['Column' + str(i) for i in range(1, NUM_OF_COLUMNS + 1)])
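If the goal is every unique combination of row values (as the desired output suggests), one option is itertools.product. A minimal sketch, assuming the same df as above; note the ordering of the generated columns may differ from the example, but each unique combination appears exactly once:

```python
from itertools import product
import pandas as pd

idx = ['a', 'b', 'c', 'd']
df = pd.DataFrame([['a1', None, None], ['b1', 'b2', 'b3'],
                   ['c1', 'c2', None], ['d1', None, None]],
                  index=idx, columns=['Column1', 'Column2', 'Column3'])

# Keep only the non-empty choices in each row.
choices = [[x for x in row if x is not None] for row in df.values]
# Cartesian product: one tuple per experimental set (1 * 3 * 2 * 1 = 6).
combos = list(product(*choices))
# One column per combination, indexed like the original frame.
res_df = pd.DataFrame(combos, columns=idx).T
res_df.columns = ['Column' + str(i) for i in range(1, len(combos) + 1)]
print(res_df)
```

Because the number of columns is derived from the product of the row lengths, this also adapts automatically if the choice lists change size.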

Split column based on input string into multiple columns in pandas python

I have the pandas data frame below and I am trying to split col1 into multiple columns based on the split_format string.
Inputs:
split_format = 'id-id1_id2|id3'
data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
df = pd.DataFrame(data).style.hide_index()
df
col1 col2
a-a1_a2|a3 20
b-b1_b2|b3 21
c-c1_c2|c3 19
d-d1_d2|d3 18
Expected Output:
id id1 id2 id3 col2
a a1 a2 a3 20
b b1 b2 b3 21
c c1 c2 c3 19
d d1 d2 d3 18
Note: the special characters and the column names in split_format can be changed.
I think I was able to figure it out:
import re

col_name = re.split('[^0-9a-zA-Z]+', split_format)
df[col_name] = df['col1'].str.split('[^0-9a-zA-Z]+', expand=True)
del df['col1']
df
col2 id id1 id2 id3
0 20 a a1 a2 a3
1 21 b b1 b2 b3
2 19 c c1 c2 c3
3 18 d d1 d2 d3
I parse the separator symbols and then recursively split the strings on each token in turn, flattening the resulting list at every step until all the symbols have been consumed.
import pandas as pd

split_format = 'id-id1_id2|id3'
data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
df = pd.DataFrame(data)

# Collect the non-alphanumeric separator symbols from the format string.
symbols = [x for x in split_format if not x.isalnum()]

result = []
def parseTree(stringlist, symbols, result):
    if len(symbols) == 0:
        result.extend(stringlist)
        return
    token = symbols.pop(0)
    elements = [item.split(token) for item in stringlist]
    # Flatten one level, then recurse on the remaining symbols.
    flat_list = [item for sublist in elements for item in sublist]
    parseTree(flat_list, symbols, result)

df2 = pd.DataFrame(columns=['id', 'id1', 'id2', 'id3'])
for key, item in df.iterrows():
    symbols2 = symbols.copy()
    parseTree([item['col1']], symbols2, result)
    # Append the parsed row (DataFrame.append was removed in pandas 2.0).
    df2.loc[len(df2)] = result
    result.clear()
df2['col2'] = df['col2']
print(df2)
output:
id id1 id2 id3 col2
0 a a1 a2 a3 20
1 b b1 b2 b3 21
2 c c1 c2 c3 19
3 d d1 d2 d3 18

Check if value of one column exists in another column, put a value in another column in pandas

Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column with predefined values if a value is found in other columns.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The in_B, in_C could be any other string of choice. If values are present in multiple columns, then F holds multiple entries, as in rows 3 and 4 of column F (row 3 has two values and row 4 has three). So far, I have tried the below:
DF.F = np.where(DF.A.isin(DF.B), DF.A, 'in_B')
But it does not give the expected result. Any help?
STEPS:
Stack the dataframe.
Check for duplicate values.
Unstack to get the same structure back.
Use dot to get the required result.
df['new_col'] = df.stack().duplicated().unstack().dot(
    'In ' + df.columns + ',').str.strip(',')
OUTPUT:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D
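Unpacked into intermediate steps, the chain above works like this (a sketch on the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a4'],
                   'B': ['b1', 'a1', 'a1', 'a1'],
                   'C': ['c1', 'c2', 'a2', 'a2'],
                   'D': ['d1', 'd2', 'd3', 'a3'],
                   'E': ['e1', 'e2', 'e3', 'e4']})

stacked = df.stack()               # one long Series of every cell value
dup = stacked.duplicated()         # True where the value appeared earlier
mask = dup.unstack()               # back to the original row/column shape
labels = 'In ' + df.columns + ','  # 'In A,', 'In B,', ...
# Boolean-times-string dot product concatenates the labels of True cells.
df['new_col'] = mask.dot(labels).str.strip(',')
print(df)
```

The dot trick works because True * 'In B,' is 'In B,' while False * 'In B,' is an empty string, so each row sums to the joined labels of its duplicated columns.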

How to compute each cell as a function of index and column?

I have a use-case where it naturally fits to compute each cell of a pd.DataFrame as a function of the corresponding index and column i.e.
import pandas as pd
import numpy as np
data = np.empty((3, 3))
data[:] = np.nan
df = pd.DataFrame(data=data, index=[1, 2, 3], columns=['a', 'b', 'c'])
print(df)
> a b c
>1 NaN NaN NaN
>2 NaN NaN NaN
>3 NaN NaN NaN
and I'd like (this is only a mock example) to get a result that is a function f(index, column):
> a b c
>1 a1 b1 c1
>2 a2 b2 c2
>3 a3 b3 c3
In order to accomplish this I need something different from apply or applymap, where the lambda gets the coordinates in terms of the index and column, i.e.
def my_cell_map(ix, col):
    return col + str(ix)
Here it is possible to use numpy - add the index values to the column names with broadcasting and pass the result to the DataFrame constructor:
a = df.columns.to_numpy() + df.index.astype(str).to_numpy()[:, None]
df = pd.DataFrame(a, index=df.index, columns=df.columns)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT: For processing by column names it is possible to use x.name with the index values:
def f(x):
    return x.name + x.index.astype(str)

df = df.apply(f)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT1: For your function it is necessary to use another lambda to loop over the index values:
def my_cell_map(ix, col):
    return col + str(ix)

def f(x):
    return x.index.map(lambda y: my_cell_map(y, x.name))

df = df.apply(f)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT2: It is also possible to loop over the index and column values and set each cell by loc; for a large DataFrame performance will be slow:
for c in df.columns:
    for i in df.index:
        df.loc[i, c] = my_cell_map(i, c)
print(df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3

How to compare two data frames with same columns but different number of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows I would just do this:
newDF = df1['D']-df2['D']
However there are times when the number of rows are different. I want a result Dataframe which shows a dataframe like this.
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if the 1st row of A,B,C in df1 matches the 1st row of A,B,C in df2, then and only then compare the 1st row of column D in each dataframe; similarly, repeat for every row.
Use merge and df.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
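If rows that exist in only one frame also matter, a variant worth sketching (assuming the same df1/df2 as above) is an outer merge with indicator=True: the _merge column flags where each row came from, and unmatched rows get NaN in Diff:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'], 'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'], 'D': [2, 1]})

# how='outer' keeps rows present in only one frame; _merge says which one.
res = df1.merge(df2, on=['A', 'B', 'C'], suffixes=['_df1', '_df2'],
                how='outer', indicator=True)
res['Diff'] = res['D_df1'] - res['D_df2']
print(res)
```

Filtering on _merge == 'both' recovers the inner-merge result from the accepted answer.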

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each df to total in main_df, matching on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it with a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Is there another way to solve it?
Thank you!
First you should align your numeric column names. In this case:
df_main = df_main.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
        .groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
        .reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
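The float totals in the reduce version come from fill_value promoting the dtype during alignment. A small follow-up sketch (reconstructing the sample frames above as assumed dicts) casts back to int once the sum is complete:

```python
from functools import reduce
import pandas as pd

df_main = pd.DataFrame({'Cri1': ['A1', 'B1', 'C1'], 'Cri2': ['A2', 'B2', 'C2'],
                        'Cri3': ['A3', 'B3', 'C3'], 'value': [4, 5, 6]})
df_1 = pd.DataFrame({'Cri1': ['A1', 'B1'], 'Cri2': ['A2', 'B2'],
                     'Cri3': ['A3', 'B3'], 'value': [1, 2]})
df_2 = pd.DataFrame({'Cri1': ['A1', 'C1'], 'Cri2': ['A2', 'C2'],
                     'Cri3': ['A3', 'C3'], 'value': [9, 10]})
df_3 = pd.DataFrame({'Cri1': ['B1', 'C1'], 'Cri2': ['B2', 'C2'],
                     'Cri3': ['B3', 'C3'], 'value': [15, 17]})

dfs = [d.set_index(['Cri1', 'Cri2', 'Cri3']) for d in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs)
# fill_value promotes to float; cast back to int after the sum is complete.
res = res.astype(int).reset_index()
print(res)
```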
