how to solve pandas multi-column explode issue? - python

I am trying to explode multiple columns at a time systematically.
The dataframe has the scalar columns fruits and veggies plus two list-valued columns, sauce and meal (the data is shown in the Setup of the answers below), and I want the final output to have the lists in sauce and meal expanded into rows in parallel.
I tried
df=df.explode('sauce', 'meal')
but this only explodes the first column (sauce, in this case); the second one was not exploded.
I also tried:
df=df.explode(['sauce', 'meal'])
but this code provides
ValueError: column must be a scalar
error.
I tried this approach, and also this one; neither worked.
Note: I cannot set these as the index, because there are some non-unique values in the fruits column.

Prior to pandas 1.3.0 use:
df.set_index(['fruits', 'veggies'])[['sauce', 'meal']].apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Many columns? Try:
df.set_index(df.columns.difference(['sauce', 'meal']).tolist())\
.apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l

Update your version of Pandas
# Setup
df = pd.DataFrame({'fruits': ['x1', 'x2'],
                   'veggies': ['y1', 'y2'],
                   'sauce': [list('abc'), list('gh')],
                   'meal': [list('def'), list('kl')]})
print(df)
# Output
fruits veggies sauce meal
0 x1 y1 [a, b, c] [d, e, f]
1 x2 y2 [g, h] [k, l]
Explode (Pandas 1.3.5):
out = df.explode(['sauce', 'meal'])
print(out)
# Output
fruits veggies sauce meal
0 x1 y1 a d
0 x1 y1 b e
0 x1 y1 c f
1 x2 y2 g k
1 x2 y2 h l
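If the repeated index values (0, 0, 0, 1, 1) in the output above are unwanted, explode also accepts ignore_index to renumber the result. A minimal sketch, still assuming pandas >= 1.3:
out = df.explode(['sauce', 'meal'], ignore_index=True)
# same rows as above, but the index now runs 0..4 instead of repeating 0 and 1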

Related

Sampling with fixed column ratio in pandas

I have this dataframe:
record = {
'F1': ['x1', 'x2','x3', 'x4','x5','x6','x7'],
'F2': ['a1', 'a2','a3', 'a4','a5','a6','a7'],
'Sex': ['F', 'M','F', 'M','M','M','F'] }
# Creating a dataframe
df = pd.DataFrame(record)
I would like to create for example 2 samples of this dataframe while keeping a fixed ratio of 50-50 on the Sex column.
I tried like this:
df_dict = {}
for i in range(2):
    df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)
But the output I get does not seem to match my expectation:
df_dict["df0"]
# Output:
F1 F2 Sex
1 x2 a2 M
3 x4 a4 M
4 x5 a5 M
0 x1 a1 F
Any help ?
Might not be the best idea, but I believe it might help you to solve your problem somehow:
n = 2
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
fDf.append(mDf)
Output
F1 F2 Sex
0 x1 a1 F
2 x3 a3 F
5 x6 a6 M
1 x2 a2 M
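Note that DataFrame.append was deprecated and later removed (pandas 2.0); on newer versions the equivalent here would be:
pd.concat([fDf, mDf])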
This should also work
n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))
Don't use frac, which will give you a fraction of each group; use n, which will give you a fixed number of rows per group:
df.groupby('Sex').sample(n=2)
example output:
F1 F2 Sex
2 x3 a3 F
0 x1 a1 F
3 x4 a4 M
4 x5 a5 M
using a custom ratio
ratios = {'F':0.4, 'M':0.6} # sum should be 1
# total number desired
total = 4
# note that the exact number in the output depends
# on the rounding method to convert to int
# round should give the correct number but floor/ceil might
# under/over-sample
# see below for an example
s = pd.Series(ratios)*total
# convert to integer (chose your method, ceil/floor/round...)
s = np.ceil(s).astype(int)
df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)
example output:
F1 F2 Sex
0 x1 a1 F
6 x7 a7 F
4 x5 a5 M
3 x4 a4 M
1 x2 a2 M
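To get back to the original goal of producing several stratified samples at once, a minimal sketch combining the loop from the question with groupby().sample (n=2 rows per Sex value; assumes pandas >= 1.1, where GroupBy.sample was added):
import pandas as pd

record = {'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
          'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
          'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F']}
df = pd.DataFrame(record)

# one stratified sample per random_state; vary random_state to vary the draw
df_dict = {f'df{i}': df.groupby('Sex').sample(n=2, random_state=i)
           for i in range(2)}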

Python Pandas dataframes merge update

My problem is a bit tricky (similar to an SQL merge/update), and I don't understand how to fix it (I am giving a small sample of the dataframes below).
I have two dataframes :
dfOld
A B C D E
x1 x2 g h r
q1 q2 x y s
t1 t2 h j u
p1 p2 r s t
AND
dfNew
A B C D E
x1 x2 a b c
s1 s2 p q r
t1 t2 h j u
q1 q2 x y z
We want to merge the dataframes with the following rules (we can think of Col A & Col B as the keys):
For any Col A & Col B combination, if C/D/E match exactly, take the values from either dataframe; if any value in Col C/D/E has changed, take the values from the new dataframe; if a new Col A/Col B combination appears in dfNew, take those values; and if a Col A/Col B combination does not exist in dfNew, take the values from dfOld.
So my OutPut should be like:
A B C D E
x1 x2 a b c
q1 q2 x y z
t1 t2 h j u
p1 p2 r s t
s1 s2 p q r
I was trying :
mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
dfO = pd.concat([mydfL,mydfR])
dfO.drop("_merge", axis=1, inplace=True)
My output looks like this (I kept the index for clarity):
A B C D E
0 x1 x2 a b c
2 s1 s2 p q r
3 q1 q2 x y z
0 x1 x2 g h r
2 q1 q2 x y s
3 p1 p2 r s t
However, this output does not serve my purpose. First and foremost, it does not include the row that is completely identical between dfOld and dfNew, which consists of:
t1 t2 h j u
Next, for the Col A/Col B combinations (x1, x2) and (q1, q2), it includes the rows from both dataframes, whereas I just wanted the updated values in Col C/D/E from the new dataframe (dfNew).
So can I get some help as to what I am missing and what may be a better and more elegant way to do this? Thanks in advance.
You can use combine_first using A/B as temporary index:
out = (dfNew.set_index(['A', 'B'])
            .combine_first(dfOld.set_index(['A', 'B']))
            .reset_index()
       )
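A self-contained sketch with the sample frames from the question; note that combine_first builds the union of the two A/B indexes, which comes back sorted, so the row order can differ from the desired output shown above:
import pandas as pd

dfOld = pd.DataFrame({'A': ['x1', 'q1', 't1', 'p1'], 'B': ['x2', 'q2', 't2', 'p2'],
                      'C': ['g', 'x', 'h', 'r'], 'D': ['h', 'y', 'j', 's'],
                      'E': ['r', 's', 'u', 't']})
dfNew = pd.DataFrame({'A': ['x1', 's1', 't1', 'q1'], 'B': ['x2', 's2', 't2', 'q2'],
                      'C': ['a', 'p', 'h', 'x'], 'D': ['b', 'q', 'j', 'y'],
                      'E': ['c', 'r', 'u', 'z']})

out = (dfNew.set_index(['A', 'B'])
            .combine_first(dfOld.set_index(['A', 'B']))
            .reset_index())
print(out)
#     A   B  C  D  E
# 0  p1  p2  r  s  t
# 1  q1  q2  x  y  z
# 2  s1  s2  p  q  r
# 3  t1  t2  h  j  u
# 4  x1  x2  a  b  c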

Looping through multiple arrays & concatenating values in pandas

I have a dataframe with lists of items separated by commas, as below.
+----------------------+
| Items |
+----------------------+
| X1,Y1,Z1 |
| X2,Z3 |
| X3 |
| X1,X2 |
| Y2,Y4,Z2,Y5,Z3 |
| X2,X3,Y1,Y2,Z2,Z4,X1 |
+----------------------+
I also have 3 lists (arrays) which have all the items above clubbed into specific groups, as below:
X = ['X1','X2','X3','X4','X5']
Y = ['Y1','Y2','Y3','Y4','Y5']
Z = ['Z1','Z2','Z3','Z4','Z5']
My task is to split each value in the dataframe and check the individual items against the 3 arrays. If an item is in any of the arrays, the name of the group it was found in should be concatenated, separated by ' & '. Also, if several items fall in the same group/array, the number of occurrences should be included as well.
My desired output is as below; refer to the Category column.
+----------------------+--------------+
| Items | Category |
+----------------------+--------------+
| X1,Y1,Z1 | X & Y & Z |
| X2,Z3 | X & Z |
| X3 | X |
| X1,X2 | 2X |
| Y2,Y4,Z2,Y5,Z3 | 3Y & 2Z |
| X2,X3,Y1,Y2,Z2,Z4,X1 | 3X & 2Y & 2Z |
+----------------------+--------------+
X, Y, and Z are the names of the arrays.
How shall I start solving this using pandas? Please guide.
Assuming a column of lists, explode the lists, then this is a simple isin check that we sum along the original index. I'd suggest a different output, which gets across the same information but is much easier to work with in the future.
Example
import pandas as pd
df = pd.DataFrame({'Items': [['X1', 'Y1', 'Z1'], ['X2', 'Z3'], ['X3'],
                             ['X1', 'X2'], ['Y2', 'Y4', 'Z2', 'Y5', 'Z3'],
                             ['X2', 'X3', 'Y1', 'Y2', 'Z2', 'Z4', 'X1']]})
X = ['X1','X2','X3','X4','X5']
Y = ['Y1','Y2','Y3','Y4','Y5']
Z = ['Z1','Z2','Z3','Z4','Z5']
s = df.explode('Items')['Items']
pd.concat([s.isin(l).sum(level=0).rename(name)
for name, l in [('X', X), ('Y', Y), ('Z', Z)]], axis=1).astype(int)
# X Y Z
#0 1 1 1
#1 1 0 1
#2 1 0 0
#3 2 0 0
#4 0 3 2
#5 3 2 2
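One caveat: Series.sum(level=0) was deprecated in pandas 1.3 and has since been removed; on newer versions the equivalent groupby form would be (a sketch of the same computation):
pd.concat([s.isin(l).groupby(level=0).sum().rename(name)
           for name, l in [('X', X), ('Y', Y), ('Z', Z)]], axis=1).astype(int)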
To get your output, mask the 0s and add the column names after the values. Then we join the strings to get the result. Here I use an apply for simplicity, alignment and NaN handling, but there are other slightly faster alternatives.
res = pd.concat([s.isin(l).sum(level=0).rename(name)
for name, l in [('X', X), ('Y', Y), ('Z', Z)]], axis=1).astype(int)
res = res.astype(str).replace('1', '').where(res.ne(0))
res = res.add(res.columns, axis=1)
# Aligns on index due to `.sum(level=0)`
df['Category'] = res.apply(lambda x: ' & '.join(x.dropna()), axis=1)
# Items Category
#0 [X1, Y1, Z1] X & Y & Z
#1 [X2, Z3] X & Z
#2 [X3] X
#3 [X1, X2] 2X
#4 [Y2, Y4, Z2, Y5, Z3] 3Y & 2Z
#5 [X2, X3, Y1, Y2, Z2, Z4, X1] 3X & 2Y & 2Z
Setup
df = pd.DataFrame(
[['X1,Y1,Z1'],
['X2,Z3'],
['X3'],
['X1,X2'],
['Y2,Y4,Z2,Y5,Z3'],
['X2,X3,Y1,Y2,Z2,Z4,X1']],
columns=['Items']
)
X = ['X1', 'X2', 'X3', 'X4', 'X5']
Y = ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']
Z = ['Z1', 'Z2', 'Z3', 'Z4', 'Z5']
Counter
from collections import Counter
M = {**dict.fromkeys(X, 'X'), **dict.fromkeys(Y, 'Y'), **dict.fromkeys(Z, 'Z')}
num = lambda x: {1: ''}.get(x, x)
cat = ' & '.join
fmt = lambda c: cat(f'{num(v)}{k}' for k, v in c.items())
cnt = lambda x: Counter(map(M.get, x.split(',')))
df.assign(Category=[*map(fmt, map(cnt, df.Items))])
Items Category
0 X1,Y1,Z1 X & Y & Z
1 X2,Z3 X & Z
2 X3 X
3 X1,X2 2X
4 Y2,Y4,Z2,Y5,Z3 3Y & 2Z
5 X2,X3,Y1,Y2,Z2,Z4,X1 3X & 2Y & 2Z
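To see what each piece does, here is one row walked through the pipeline (values taken from the last row of the example):
cnt('X2,X3,Y1,Y2,Z2,Z4,X1')       # Counter({'X': 3, 'Y': 2, 'Z': 2})
fmt(cnt('X2,X3,Y1,Y2,Z2,Z4,X1'))  # '3X & 2Y & 2Z'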
OLD STUFF
pandas.Series.str.get_dummies and groupby
First convert the definitions of X, Y, and Z into one dictionary, then use that as the argument for groupby on axis=1
M = {**dict.fromkeys(X, 'X'), **dict.fromkeys(Y, 'Y'), **dict.fromkeys(Z, 'Z')}
counts = df.Items.str.get_dummies(',').groupby(M, axis=1).sum()
counts
X Y Z
0 1 1 1
1 1 0 1
2 1 0 0
3 2 0 0
4 0 3 2
5 3 2 2
Add the desired column
Work in Progress I don't like this solution
def fmt(row):
    a = [f'{"" if v == 1 else v}{k}' for k, v in row.items() if v > 0]
    return ' & '.join(a)
df.assign(Category=counts.apply(fmt, axis=1))
Items Category
0 X1,Y1,Z1 X & Y & Z
1 X2,Z3 X & Z
2 X3 X
3 X1,X2 2X
4 Y2,Y4,Z2,Y5,Z3 3Y & 2Z
5 X2,X3,Y1,Y2,Z2,Z4,X1 3X & 2Y & 2Z
NOT TO BE TAKEN SERIOUSLY
Because I'm leveraging the character of your contrived example, and there is no way you should depend on the first character of your values to be the thing that differentiates them.
from operator import itemgetter
df.Items.str.get_dummies(',').groupby(itemgetter(0), axis=1).sum()
X Y Z
0 1 1 1
1 1 0 1
2 1 0 0
3 2 0 0
4 0 3 2
5 3 2 2
Create your dataframe
import pandas as pd
df = pd.DataFrame({'Items': [['X1', 'Y1', 'Z1'],
['X2', 'Z3'],
['X3'],
['X1', 'X2'],
['Y2', 'Y4', 'Z2', 'Y5', 'Z3'],
['X2', 'X3', 'Y1', 'Y2', 'Z2', 'Z4', 'X1']]})
explode
df_exp = df.explode('Items')
def check_if_in_set(item, group):
    return 1 if item in group else 0

groups = {'X': set(['X1','X2','X3','X4','X5']),
          'Y': set(['Y1','Y2','Y3','Y4','Y5']),
          'Z': set(['Z1','Z2','Z3','Z4','Z5'])}
for l, s in groups.items():
    df_exp[l] = df_exp.apply(lambda row: check_if_in_set(row['Items'], s), axis=1)
groupby
df_exp.groupby(df_exp.index).agg(
Items_list = ('Items', list),
X_count = ('X', 'sum'),
y_count = ('Y', 'sum'),
Z_count = ('Z', 'sum')
)
Items_list X_count y_count Z_count
0 [X1, Y1, Z1] 1 1 1
1 [X2, Z3] 1 0 1
2 [X3] 1 0 0
3 [X1, X2] 2 0 0
4 [Y2, Y4, Z2, Y5, Z3] 0 3 2
5 [X2, X3, Y1, Y2, Z2, Z4, X1] 3 2 2
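To finish with the Category column the question asked for, a minimal sketch built on df_exp from above (the aggregation names and formatting helper are my own, reusing the fmt idea from the earlier answers):
counts = df_exp.groupby(df_exp.index).agg(X=('X', 'sum'), Y=('Y', 'sum'), Z=('Z', 'sum'))
df['Category'] = counts.apply(
    lambda row: ' & '.join(f'{"" if v == 1 else v}{k}' for k, v in row.items() if v > 0),
    axis=1)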

Map values based off matched columns - Python

I want to map values based on how two columns are matched. For instance, the df below contains different labels, A or B. I want to assign a new column that describes these labels. This is done by comparing columns Z L and Z P. Z L will always contain values from ['X1','X2','X3','X4'], while Z P will correspondingly contain values from ['LA','LB','LC','LD'].
These will always be in ascending order or reverse order. Ascending order means X1 corresponds to LA, X2 corresponds to LB, etc. Reverse order means X1 corresponds to LD, X2 corresponds to LC, etc.
If ascending order, I want to map an R. If reverse order, I want to map an L.
X = ['X1','X2','X3','X4']
R = ['LA','LB','LC','LD']
L = ['LD','LC','LB','LA']
df = pd.DataFrame({
'Period' : [1,1,1,1,1,2,2,2,2,2],
'labels' : ['A','B','A','B','A','B','A','B','A','B'],
'Z L' : [np.nan,np.nan,'X3','X2','X4',np.nan,'X2','X3','X3','X1'],
'Z P' : [np.nan,np.nan,'LC','LC','LD',np.nan,'LC','LC','LB','LA'],
})
df = df.dropna()
This is the output dataset to determine the combinations. I have a large df with repeated combinations so I'm not too concerned with returning all of them. I'm mainly concerned with all unique Mapped values for each Period.
Period labels Z L Z P
2 1 A X3 LC
3 1 B X2 LC
4 1 A X4 LD
6 2 A X2 LC
7 2 B X3 LC
8 2 A X3 LB
9 2 B X1 LA
Attempt:
labels = df['labels'].unique().tolist()
I = df.loc[df['labels'] == labels[0]]
J = df.loc[df['labels'] == labels[1]]
I['Map'] = ((I['Z L'].isin(X)) | (I['Z P'].isin(R))).map({True:'R', False:'L'})
J['Map'] = ((J['Z L'].isin(X)) | (J['Z P'].isin(R))).map({True:'R', False:'L'})
If I drop duplicates from Period and labels, the intended df is:
Period labels Map
0 1 A R
1 1 B L
2 2 A L
3 2 B R
Here's my approach:
# the ascending orders
lst1,lst2 = ['X1','X2','X3','X4'], ['LA','LB','LC','LD']
# enumerate the orders
d1, d2 = ({v:k for k,v in enumerate(l)} for l in (lst1, lst2))
# check if the enumerations in `Z L` and `Z P` are the same
df['Map'] = np.where(df['Z L'].map(d1)== df['Z P'].map(d2), 'R', 'L')
Output:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
and df.drop_duplicates(['Period', 'labels']):
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
6 2 A X2 LC L
7 2 B X3 LC R
You said your data is always either in ascending or reversed order, so you only need to define a fixed mapping between Z L and Z P for the ascending (R) order and check against it: if a row matches the mapping it is R, else L. I may be wrong, but I think the solution may be reduced to this:
r_dict = dict(zip(['X1','X2','X3','X4'], ['LA','LB','LC','LD']))
df1['Map'] = (df1['Z L'].map(r_dict) == df1['Z P']).map({True: 'R', False: 'L'})
Out[292]:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
For the bottom desired output, you just drop_duplicates as QuangHoang showed.
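For completeness, a minimal sketch of that final step (the column subset is just for presentation):
df1.drop_duplicates(['Period', 'labels'])[['Period', 'labels', 'Map']].reset_index(drop=True)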

Rearrange data in csv with Python

I have a .csv file with the following format:
A B C D E F
X1 X2 X3 X4 X5 X6
Y1 Y2 Y3 Y4 Y5 Y6
Z1 Z2 Z3 Z4 Z5 Z6
What I want:
A X1
B X2
C X3
D X4
E X5
F X6
A Y1
B Y2
C Y3
D Y4
E Y5
F Y6
A Z1
B Z2
C Z3
D Z4
E Z5
F Z6
I am unable to wrap my mind around the built-in transpose functions in order to achieve the final result. Any help would be appreciated.
You can simply melt your dataframe using pandas:
import pandas as pd
df = pd.read_csv(csv_filename)
>>> pd.melt(df)
variable value
0 A X1
1 A Y1
2 A Z1
3 B X2
4 B Y2
5 B Z2
6 C X3
7 C Y3
8 C Z3
9 D X4
10 D Y4
11 D Z4
12 E X5
13 E Y5
14 E Z5
15 F X6
16 F Y6
17 F Z6
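Note that melt orders the result column by column (all the A values, then B, and so on), while the desired output in the question keeps each original row together. If that row-major order matters, a small sketch using stack (the column names are chosen to mirror the melt output):
out = df.stack().reset_index(level=1)   # row-major: A..F for row 0, then row 1, ...
out.columns = ['variable', 'value']
out = out.reset_index(drop=True)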
A pure python solution would be as follows:
file_out_delimiter = ','  # Use '\t' for tab delimited.
with open(filename, 'r') as f, open(filename_out, 'w') as f_out:
    headers = f.readline().split()
    for row in f:
        for pair in zip(headers, row.split()):
            f_out.write(file_out_delimiter.join(pair) + '\n')
resulting in the following file contents:
A,X1
B,X2
C,X3
D,X4
E,X5
F,X6
A,Y1
B,Y2
C,Y3
D,Y4
E,Y5
F,Y6
A,Z1
B,Z2
C,Z3
D,Z4
E,Z5
F,Z6
