Looping through multiple arrays & concatenating values in pandas - python

I have a dataframe with lists of items separated by commas, as below.
+----------------------+
| Items                |
+----------------------+
| X1,Y1,Z1             |
| X2,Z3                |
| X3                   |
| X1,X2                |
| Y2,Y4,Z2,Y5,Z3       |
| X2,X3,Y1,Y2,Z2,Z4,X1 |
+----------------------+
I also have 3 arrays, which group all of the items above, as below:
X = ['X1','X2','X3','X4','X5']
Y = ['Y1','Y2','Y3','Y4','Y5']
Z = ['Z1','Z2','Z3','Z4','Z5']
My task is to split each value in the dataframe, check the individual items against the 3 arrays, and, if an item is found in any of the arrays, concatenate the names of the groups in which it was found, separated with &. Also, if several items fall into the same group/array, the number of occurrences should be included.
My desired output is below; refer to the Category column.
+----------------------+--------------+
| Items                | Category     |
+----------------------+--------------+
| X1,Y1,Z1             | X & Y & Z    |
| X2,Z3                | X & Z        |
| X3                   | X            |
| X1,X2                | 2X           |
| Y2,Y4,Z2,Y5,Z3       | 3Y & 2Z      |
| X2,X3,Y1,Y2,Z2,Z4,X1 | 3X & 2Y & 2Z |
+----------------------+--------------+
X, Y, and Z are the names of the arrays.
How should I start solving this with pandas? Please guide.

Assuming a column of lists, explode the lists, then this is a simple isin check that we sum along the original index. I'd suggest a different output, which gets across the same information but is much easier to work with in the future.
Example
import pandas as pd

df = pd.DataFrame({'Items': [['X1', 'Y1', 'Z1'], ['X2', 'Z3'], ['X3'],
                             ['X1', 'X2'], ['Y2', 'Y4', 'Z2', 'Y5', 'Z3'],
                             ['X2', 'X3', 'Y1', 'Y2', 'Z2', 'Z4', 'X1']]})
X = ['X1','X2','X3','X4','X5']
Y = ['Y1','Y2','Y3','Y4','Y5']
Z = ['Z1','Z2','Z3','Z4','Z5']
s = df.explode('Items')['Items']
# `.sum(level=0)` was removed in pandas 2.0; grouping on the original
# index level is the equivalent spelling
pd.concat([s.isin(l).groupby(level=0).sum().rename(name)
           for name, l in [('X', X), ('Y', Y), ('Z', Z)]], axis=1).astype(int)
#   X  Y  Z
#0  1  1  1
#1  1  0  1
#2  1  0  0
#3  2  0  0
#4  0  3  2
#5  3  2  2
To get your output, mask the 0s and add the column names after the values. Then we string-join to get the result. Here I use an apply for simplicity, alignment and NaN handling, but there are other slightly faster alternatives.
res = pd.concat([s.isin(l).groupby(level=0).sum().rename(name)
                 for name, l in [('X', X), ('Y', Y), ('Z', Z)]], axis=1).astype(int)
res = res.astype(str).replace('1', '').where(res.ne(0))
res = res.add(res.columns, axis=1)

# Aligns on index due to the `groupby(level=0).sum()` above
df['Category'] = res.apply(lambda x: ' & '.join(x.dropna()), axis=1)
#                          Items      Category
#0                  [X1, Y1, Z1]     X & Y & Z
#1                      [X2, Z3]         X & Z
#2                          [X3]             X
#3                      [X1, X2]            2X
#4          [Y2, Y4, Z2, Y5, Z3]       3Y & 2Z
#5  [X2, X3, Y1, Y2, Z2, Z4, X1]  3X & 2Y & 2Z
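One of the apply-free alternatives mentioned above could be sketched like this (my own variation, not part of the original answer): stack drops the masked NaNs, then a groupby join rebuilds the string per original row. The `res` frame here is rebuilt by hand to match the intermediate counts shown earlier.

```python
import pandas as pd

# Rebuilt counts frame matching the answer's intermediate result
res = pd.DataFrame({'X': [1, 1, 1, 2, 0, 3],
                    'Y': [1, 0, 0, 0, 3, 2],
                    'Z': [1, 1, 0, 0, 2, 2]})
# same masking trick as in the answer: '' for count 1, 'nX' otherwise
labels = res.astype(str).replace('1', '').where(res.ne(0)).add(res.columns)
# stack to long form, drop the NaNs, join per original row index
category = labels.stack().dropna().groupby(level=0).agg(' & '.join)
print(category.tolist())
```

This avoids a Python-level loop over rows, at the cost of being a little less obvious to read.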

Setup
df = pd.DataFrame(
    [['X1,Y1,Z1'],
     ['X2,Z3'],
     ['X3'],
     ['X1,X2'],
     ['Y2,Y4,Z2,Y5,Z3'],
     ['X2,X3,Y1,Y2,Z2,Z4,X1']],
    columns=['Items']
)
X = ['X1', 'X2', 'X3', 'X4', 'X5']
Y = ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']
Z = ['Z1', 'Z2', 'Z3', 'Z4', 'Z5']
Counter
from collections import Counter
M = {**dict.fromkeys(X, 'X'), **dict.fromkeys(Y, 'Y'), **dict.fromkeys(Z, 'Z')}
num = lambda x: {1: ''}.get(x, x)
cat = ' & '.join
fmt = lambda c: cat(f'{num(v)}{k}' for k, v in c.items())
cnt = lambda x: Counter(map(M.get, x.split(',')))
df.assign(Category=[*map(fmt, map(cnt, df.Items))])
                  Items      Category
0              X1,Y1,Z1     X & Y & Z
1                 X2,Z3         X & Z
2                    X3             X
3                 X1,X2            2X
4        Y2,Y4,Z2,Y5,Z3       3Y & 2Z
5  X2,X3,Y1,Y2,Z2,Z4,X1  3X & 2Y & 2Z
OLD STUFF
pandas.Series.str.get_dummies and groupby
First convert the definitions of X, Y, and Z into one dictionary, then use that as the argument for groupby on axis=1
M = {**dict.fromkeys(X, 'X'), **dict.fromkeys(Y, 'Y'), **dict.fromkeys(Z, 'Z')}
counts = df.Items.str.get_dummies(',').groupby(M, axis=1).sum()
counts

   X  Y  Z
0  1  1  1
1  1  0  1
2  1  0  0
3  2  0  0
4  0  3  2
5  3  2  2
Add the desired column
Work in Progress: I don't like this solution
def fmt(row):
    a = [f'{"" if v == 1 else v}{k}' for k, v in row.items() if v > 0]
    return ' & '.join(a)

df.assign(Category=counts.apply(fmt, axis=1))
                  Items      Category
0              X1,Y1,Z1     X & Y & Z
1                 X2,Z3         X & Z
2                    X3             X
3                 X1,X2            2X
4        Y2,Y4,Z2,Y5,Z3       3Y & 2Z
5  X2,X3,Y1,Y2,Z2,Z4,X1  3X & 2Y & 2Z
NOT TO BE TAKEN SERIOUSLY
Because I'm leveraging a quirk of your contrived example: there is no way you should depend on the first character of your values being the thing that differentiates them.
from operator import itemgetter
df.Items.str.get_dummies(',').groupby(itemgetter(0), axis=1).sum()
   X  Y  Z
0  1  1  1
1  1  0  1
2  1  0  0
3  2  0  0
4  0  3  2
5  3  2  2

Create your dataframe
import pandas as pd
df = pd.DataFrame({'Items': [['X1', 'Y1', 'Z1'],
                             ['X2', 'Z3'],
                             ['X3'],
                             ['X1', 'X2'],
                             ['Y2', 'Y4', 'Z2', 'Y5', 'Z3'],
                             ['X2', 'X3', 'Y1', 'Y2', 'Z2', 'Z4', 'X1']]})
explode
df_exp = df.explode('Items')

def check_if_in_set(item, item_set):
    return 1 if item in item_set else 0

# renamed from `dict`/`set` to avoid shadowing the built-ins
groups = {'X': {'X1', 'X2', 'X3', 'X4', 'X5'},
          'Y': {'Y1', 'Y2', 'Y3', 'Y4', 'Y5'},
          'Z': {'Z1', 'Z2', 'Z3', 'Z4', 'Z5'}}
for label, s in groups.items():
    df_exp[label] = df_exp.apply(lambda row: check_if_in_set(row['Items'], s), axis=1)
groupby
df_exp.groupby(df_exp.index).agg(
    Items_list=('Items', list),
    X_count=('X', 'sum'),
    Y_count=('Y', 'sum'),
    Z_count=('Z', 'sum')
)

                     Items_list  X_count  Y_count  Z_count
0                  [X1, Y1, Z1]        1        1        1
1                      [X2, Z3]        1        0        1
2                          [X3]        1        0        0
3                      [X1, X2]        2        0        0
4          [Y2, Y4, Z2, Y5, Z3]        0        3        2
5  [X2, X3, Y1, Y2, Z2, Z4, X1]        3        2        2
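This answer stops at the counts; as a hedged follow-up (my own addition, not part of the answer), the count columns can be turned into the asker's Category string. `agg_df` below stands in for the aggregated result and is rebuilt by hand.

```python
import pandas as pd

# Rebuilt aggregation result (count columns only)
agg_df = pd.DataFrame({'X_count': [1, 1, 1, 2, 0, 3],
                       'y_count': [1, 0, 0, 0, 3, 2],
                       'Z_count': [1, 1, 0, 0, 2, 2]})
counts = agg_df.rename(columns=lambda c: c[0].upper())  # X_count -> X, ...

def to_category(row):
    # keep non-zero groups, prefix counts greater than 1
    return ' & '.join(f'{v if v > 1 else ""}{k}' for k, v in row.items() if v > 0)

print(counts.apply(to_category, axis=1).tolist())
```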

Related

how to get top n values in pandas dataframe if it has repeated values

I have a pandas dataframe say:
   x  y  z
0  1  a  x
1  1  b  y
2  1  c  z
3  2  a  x
4  2  b  x
5  3  a  y
6  4  a  z
If I wanted the top 2 values by x (that is, by the x column), it gives:
   x  y  z
0  1  a  x
1  1  b  y
2  1  c  z
3  2  a  x
4  2  b  x
If I wanted the top 2 values by y (that is, by the y column), it gives:
   x  y  z
0  1  a  x
1  1  b  y
3  2  a  x
4  2  b  x
5  3  a  y
6  4  a  z
How can I achieve this?
You can use:
>>> df[df['x'].isin(df['x'].value_counts().head(2).index)]
   x  y  z
0  1  a  x
1  1  b  y
2  1  c  z
3  2  a  x
4  2  b  x
>>> df[df['y'].isin(df['y'].value_counts().head(2).index)]
   x  y  z
0  1  a  x
1  1  b  y
3  2  a  x
4  2  b  x
5  3  a  y
6  4  a  z
def select_top_k(df, col, top_k):
    grouping_df = df.groupby(col)
    # note: this takes the first top_k groups in sorted key order,
    # not the top_k most frequent values
    gr_list = list(grouping_df.groups)[:top_k]
    temp = grouping_df.filter(lambda x: x[col].iloc[0] in gr_list)
    return temp

data = {'x': [1, 1, 1, 2, 2, 3, 4],
        'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
        'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']}
df = pd.DataFrame(data)

col = 'x'
top_k = 2
select_top_k(df, col, top_k)
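The value_counts idea from the first answer can be wrapped into a reusable helper (a sketch; the name `top_k_rows` is my own). Note that it ranks values by frequency, whereas select_top_k takes the first k groups in sorted key order; the two only coincide when the most frequent keys happen to sort first.

```python
import pandas as pd

def top_k_rows(df, col, k):
    # keep rows whose value in `col` is among the k most frequent values
    top = df[col].value_counts().head(k).index
    return df[df[col].isin(top)]

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 3, 4],
                   'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
                   'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']})
print(top_k_rows(df, 'x', 2))
```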

Group By Sum Multiple Columns in Pandas (Ignoring duplicates)

I have the following code, where my dataframe contains 3 columns to be summed:
  toBeSummed toBeSummed2 toBesummed3  someColumn
0          X           X           Y         NaN
1          X           Y           Z         NaN
2          Y           Y           Z         NaN
3          Z           Z           Z         NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem here is that with the code I have, it sums up all instances of each value. Meaning calling final would result in
  Letter  Sum
0      X    3
1      Y    4
2      Z    5
However, I want it to not count a value more than once if it appears several times in the same row (i.e. in the first row there are two X's, so X should only be counted once).
Meaning the desired output is
  Letter  Sum
0      X    2
1      Y    3
2      Z    3
I can update or add more comments if this is confusing.
Given df:
  toBeSummed toBeSummed2 toBesummed3  someColumn
0          X           X           Y         NaN
1          X           Y           Z         NaN
2          Y           Y           Z         NaN
3          Z           Z           Z         NaN
Doing:
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
# unique per row (axis=1), so repeats within a row count only once
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))
Output:
   Sum
Y    3
Z    3
X    2
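The same per-row de-duplication can also be spelled with stack instead of a row-wise apply (a sketch of my own; equivalent result, and sometimes faster on wide frames):

```python
import pandas as pd

df = pd.DataFrame({'toBeSummed':  ['X', 'X', 'Y', 'Z'],
                   'toBeSummed2': ['X', 'Y', 'Y', 'Z'],
                   'toBesummed3': ['Y', 'Z', 'Z', 'Z']})
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
s = df[sum_cols].stack().droplevel(1)  # one letter per (row, column) pair
# unique letters per original row, then count across rows
out = s.groupby(level=0).unique().explode().value_counts()
print(out.to_frame('Sum'))
```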

how to solve pandas multi-column explode issue?

I am trying to explode multi-columns at a time systematically.
Such that:
[input dataframe shown as an image in the original post; see the Setup below]
and I want the final output as:
[exploded dataframe shown as an image in the original post]
I tried
df=df.explode('sauce', 'meal')
but this only explodes the first column (sauce, in this case); the second one was not exploded.
I also tried:
df=df.explode(['sauce', 'meal'])
but this code provides
ValueError: column must be a scalar
error.
I tried this approach, and also this one; neither worked.
Note: I cannot apply this to the index; there are some non-unique values in the fruits column.
Prior to pandas 1.3.0 use:
df.set_index(['fruits', 'veggies'])[['sauce', 'meal']].apply(pd.Series.explode).reset_index()
Output:
  fruits veggies sauce meal
0     x1      y1     a    d
1     x1      y1     b    e
2     x1      y1     c    f
3     x2      y2     g    k
4     x2      y2     h    l
Many columns? Try:
df.set_index(df.columns.difference(['sauce', 'meal']).tolist())\
.apply(pd.Series.explode).reset_index()
Output:
  fruits veggies sauce meal
0     x1      y1     a    d
1     x1      y1     b    e
2     x1      y1     c    f
3     x2      y2     g    k
4     x2      y2     h    l
Update your version of Pandas
# Setup
df = pd.DataFrame({'fruits': ['x1', 'x2'],
                   'veggies': ['y1', 'y2'],
                   'sauce': [list('abc'), list('gh')],
                   'meal': [list('def'), list('kl')]})
print(df)
# Output
  fruits veggies      sauce       meal
0     x1      y1  [a, b, c]  [d, e, f]
1     x2      y2     [g, h]     [k, l]
Explode (Pandas 1.3.5):
out = df.explode(['sauce', 'meal'])
print(out)
# Output
  fruits veggies sauce meal
0     x1      y1     a    d
0     x1      y1     b    e
0     x1      y1     c    f
1     x2      y2     g    k
1     x2      y2     h    l

Map values based off matched columns - Python

I want to map values based on how two columns match. For instance, the df below contains different labels, A or B. I want to assign a new column that describes these labels by comparing columns Z L and Z P. Z L will always contain values from ['X1','X2','X3','X4'], while Z P will correspondingly contain values from ['LA','LB','LC','LD'].
These will always be in ascending order or reverse order: ascending order means X1 corresponds to LA, X2 corresponds to LB, etc.; reverse order means X1 corresponds to LD, X2 corresponds to LC, etc.
If ascending order, I want to map an R. If reverse order, I want to map an L.
X = ['X1','X2','X3','X4']
R = ['LA','LB','LC','LD']
L = ['LD','LC','LB','LA']
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Period': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'labels': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Z L': [np.nan, np.nan, 'X3', 'X2', 'X4', np.nan, 'X2', 'X3', 'X3', 'X1'],
    'Z P': [np.nan, np.nan, 'LC', 'LC', 'LD', np.nan, 'LC', 'LC', 'LB', 'LA'],
})
df = df.dropna()
This is the output dataset to determine the combinations. I have a large df with repeated combinations so I'm not too concerned with returning all of them. I'm mainly concerned with all unique Mapped values for each Period.
  Period labels Z L Z P
2      1      A  X3  LC
3      1      B  X2  LC
4      1      A  X4  LD
6      2      A  X2  LC
7      2      B  X3  LC
8      2      A  X3  LB
9      2      B  X1  LA
Attempt:
labels = df['labels'].unique().tolist()
I = df.loc[df['labels'] == labels[0]]
J = df.loc[df['labels'] == labels[1]]
I['Map'] = ((I['Z L'].isin(X)) | (I['Z P'].isin(R))).map({True:'R', False:'L'})
J['Map'] = ((J['Z L'].isin(X)) | (J['Z P'].isin(R))).map({True:'R', False:'L'})
If I drop duplicates from period and labels the intended df is:
  Period labels Map
0      1      A   R
1      1      B   L
2      2      A   L
3      2      B   R
Here's my approach:
# the ascending orders
lst1,lst2 = ['X1','X2','X3','X4'], ['LA','LB','LC','LD']
# enumerate the orders
d1, d2 = ({v:k for k,v in enumerate(l)} for l in (lst1, lst2))
# check if the enumerations in `Z L` and `Z P` are the same
df['Map'] = np.where(df['Z L'].map(d1)== df['Z P'].map(d2), 'R', 'L')
Output:
  Period labels Z L Z P Map
2      1      A  X3  LC   R
3      1      B  X2  LC   L
4      1      A  X4  LD   R
6      2      A  X2  LC   L
7      2      B  X3  LC   R
8      2      A  X3  LB   L
9      2      B  X1  LA   R
and df.drop_duplicates(['Period', 'labels']):
  Period labels Z L Z P Map
2      1      A  X3  LC   R
3      1      B  X2  LC   L
6      2      A  X2  LC   L
7      2      B  X3  LC   R
You said your data is always either in ascending or reverse order, so you only need to define a fixed mapping between Z L and Z P for the R case and check against it: if it matches, the row is R, else L. I may be wrong, but I think the solution can be reduced to this:
r_dict = dict(zip(['X1','X2','X3','X4'], ['LA','LB','LC','LD']))
df1['Map'] = (df1['Z L'].map(r_dict) == df1['Z P']).map({True: 'R', False: 'L'})
Out[292]:
  Period labels Z L Z P Map
2      1      A  X3  LC   R
3      1      B  X2  LC   L
4      1      A  X4  LD   R
6      2      A  X2  LC   L
7      2      B  X3  LC   R
8      2      A  X3  LB   L
9      2      B  X1  LA   R
For the bottom desired output, just drop_duplicates as in QuangHoang's answer.
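Spelling out that drop_duplicates step as a sketch (`df1` is rebuilt here by hand, with the Map column already computed as in the output above):

```python
import pandas as pd

# Rebuilt frame with the Map column already assigned
df1 = pd.DataFrame({'Period': [1, 1, 1, 2, 2, 2, 2],
                    'labels': ['A', 'B', 'A', 'A', 'B', 'A', 'B'],
                    'Map':    ['R', 'L', 'R', 'L', 'R', 'L', 'R']})
# one row per (Period, labels) pair, keeping the first occurrence
out = df1.drop_duplicates(['Period', 'labels']).reset_index(drop=True)
print(out)
```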

Pandas Slice by Pairwise Attributes

I have two lists, namely;
x = [3, 7, 9, ...] and y = [13, 17, 19, ...]
And I have a dataframe like this:
df =

   x   y     z
0  0  10  0.54
1  1  11  0.68
2  2  12  0.75
3  3  13  0.23
4  4  14  0.52
5  5  15  0.14
6  6  16  0.23
.  .   .    ..
.  .   .    ..
What I want to do is slice the dataframe given the pairwise combos in an efficient manner, as so:
df_slice = df[ ((df.x == x[0]) & (df.y == y[0])) |
               ((df.x == x[1]) & (df.y == y[1])) |
               ....
               ((df.x == x[-1]) & (df.y == y[-1])) ]
df_slice =

   x   y     z
3  3  13  0.23
7  7  17  0.74
9  9  19  0.24
.  ..  ..   ...
Is there any way to do this programmatically and quickly?
Create a helper DataFrame and use DataFrame.merge with no on parameter, so it merges on all intersecting columns, here x and y:
x = [3, 7, 9]
y = [13, 17, 19]
df1 = pd.DataFrame({'x':x, 'y':y})
df2 = df.merge(df1)
print (df2)
   x   y     z
0  3  13  0.23
Or get interesection of MultiIndexes by Index.isin and filter by boolean indexing:
mux = pd.MultiIndex.from_arrays([x, y])
df2 = df[df.set_index(['x','y']).index.isin(mux)]
print (df2)
   x   y     z
3  3  13  0.23
Your solution should be changed with list comprehension of zipped lists and np.logical_or.reduce:
mask = np.logical_or.reduce([(df.x == a) & (df.y == b) for a, b in zip(x, y)])
df2 = df[mask]
print (df2)
   x   y     z
3  3  13  0.23
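A minimal end-to-end run of the merge-based approach, on a small made-up df since the asker's full data isn't shown (a sketch; the row values here are my own):

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2, 3, 7, 9],
                   'y': [10, 11, 12, 13, 17, 19],
                   'z': [0.54, 0.68, 0.75, 0.23, 0.74, 0.24]})
pairs = pd.DataFrame({'x': [3, 7, 9], 'y': [13, 17, 19]})
out = df.merge(pairs)  # joins on the shared columns x and y
print(out)
```

Only rows whose (x, y) pair appears in `pairs` survive the merge, which is exactly the pairwise slice the question asks for.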
