Python Pandas dataframes merge update

My problem is a bit tricky (similar to an SQL merge/update), and I don't understand how to fix it. (I am giving a small sample of the dataframes below.)
I have two dataframes :
dfOld
A B C D E
x1 x2 g h r
q1 q2 x y s
t1 t2 h j u
p1 p2 r s t
AND
dfNew
A B C D E
x1 x2 a b c
s1 s2 p q r
t1 t2 h j u
q1 q2 x y z
We want to merge the dataframes with the following rules (think of Col A & Col B as keys):
For any Col A & Col B combination, if C/D/E match exactly, take the values from either dataframe; if any value in Col C/D/E has changed, take the values from the new dataframe; if a new Col A/Col B combination appears in dfNew, take those values; and if a Col A/Col B combination does not exist in dfNew, take the values from dfOld.
So my OutPut should be like:
A B C D E
x1 x2 a b c
q1 q2 x y z
t1 t2 h j u
p1 p2 r s t
s1 s2 p q r
I was trying :
mydfL = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
mydfR = df1.merge(df, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
dfO = pd.concat([mydfL, mydfR])
dfO.drop("_merge", axis=1, inplace=True)
My output looks like: ( I kept the index for clarity)
A B C D E
0 x1 x2 a b c
2 s1 s2 p q r
3 q1 q2 x y z
0 x1 x2 g h r
2 q1 q2 x y s
3 p1 p2 r s t
However, this output does not serve my purpose. First and foremost, it does not include the fully identical row (shared between dfOld & dfNew), which is:
t1 t2 h j u
and next, for the Col A/Col B combinations x1/x2 and q1/q2, it includes rows from both dataframes, where I just wanted the updated Col C/D/E values from the new dataframe (dfNew).
So can I get some help as to what I am missing, and what would be a better and more elegant way to do this? Thanks in advance.

You can use combine_first with A/B as a temporary index:
out = (dfNew.set_index(['A', 'B'])
.combine_first(dfOld.set_index(['A', 'B']))
.reset_index()
)
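As a runnable sketch (reconstructing the sample frames from the question): for keys present in both frames, dfNew's non-null values win; keys present in only one frame survive unchanged. Note that combine_first also sorts the result by the A/B index.

```python
import pandas as pd

dfOld = pd.DataFrame({'A': ['x1', 'q1', 't1', 'p1'],
                      'B': ['x2', 'q2', 't2', 'p2'],
                      'C': ['g', 'x', 'h', 'r'],
                      'D': ['h', 'y', 'j', 's'],
                      'E': ['r', 's', 'u', 't']})
dfNew = pd.DataFrame({'A': ['x1', 's1', 't1', 'q1'],
                      'B': ['x2', 's2', 't2', 'q2'],
                      'C': ['a', 'p', 'h', 'x'],
                      'D': ['b', 'q', 'j', 'y'],
                      'E': ['c', 'r', 'u', 'z']})

# Keys in both frames take dfNew's (non-null) values;
# keys in only one frame are kept as-is.
out = (dfNew.set_index(['A', 'B'])
            .combine_first(dfOld.set_index(['A', 'B']))
            .reset_index())
print(out)
```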

Related

SQL sorting for INSERT

I have a data set like this:

LIST   ENAME  EDESC  NUM
ALPHA  A      aa     1
ALPHA  B      bb     2
ALPHA  C      cc     3
GAMMA  M      mm     1
GAMMA  N      nn     2
OMEGA  Y      yy     0
OMEGA  Z      zz     1
NUM is the priority of the 'E's or elements of each LIST
The logic for NUM (which I can't control) is that it orders elements smallest to largest, but the first element does not have to be NUM = 0. For example, you can see that OMEGA's elements are Y=0 and Z=1 while GAMMA's are M=1 and N=2 - the system considers these orderings equivalent.
This data comes from a query from a system whose data structure I can't control, but I want to put this data into my own database but in the following format:
LIST   E1NAME  E1DESC  E2NAME  E2DESC  E3NAME  E3DESC
ALPHA  A       aa      B       bb      C       cc
GAMMA  M       mm      N       nn
OMEGA  Y       yy      Z       zz
Can this be done using SQL INSERT into a MySQL/MariaDB database?
The data is being pulled from the source via Python, and the result is parsed using Pandas - perhaps there is a way to do it in Pandas such that the subsequent SQL INSERT into my DB isn't as complicated?
In SQL, you can use conditional aggregation:
select list,
       max(case when seqnum = 1 then ename end) as ename_1,
       max(case when seqnum = 1 then edesc end) as edesc_1,
       max(case when seqnum = 2 then ename end) as ename_2,
       max(case when seqnum = 2 then edesc end) as edesc_2,
       max(case when seqnum = 3 then ename end) as ename_3,
       max(case when seqnum = 3 then edesc end) as edesc_3
from (select d.*,
             row_number() over (partition by list order by num) as seqnum
      from dataset d
     ) d
group by list;
In pandas, you can use factorize:
num = df.groupby('LIST').NUM.transform(lambda x: pd.factorize(x, sort=True)[0]) + 1
new_df = (df.assign(NUM=num)
            .set_index(['LIST', 'NUM'])
            .unstack()
            .swaplevel(0, 1, axis=1)
            .sort_index(axis=1, ascending=[True, False]))
new_df.columns = [f'{b[0]}{a}{b[1:]}' for a, b in new_df.columns]
new_df
E1NAME E1DESC E2NAME E2DESC E3NAME E3DESC
LIST
ALPHA A aa B bb C cc
GAMMA M mm N nn NaN NaN
OMEGA Y yy Z zz NaN NaN
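A self-contained version of this approach, reconstructing the sample data from the question (the column renames assume every value column starts with 'E', as ENAME/EDESC do):

```python
import pandas as pd

df = pd.DataFrame({'LIST':  ['ALPHA', 'ALPHA', 'ALPHA', 'GAMMA', 'GAMMA', 'OMEGA', 'OMEGA'],
                   'ENAME': ['A', 'B', 'C', 'M', 'N', 'Y', 'Z'],
                   'EDESC': ['aa', 'bb', 'cc', 'mm', 'nn', 'yy', 'zz'],
                   'NUM':   [1, 2, 3, 1, 2, 0, 1]})

# Normalize NUM to a per-LIST rank starting at 1, then pivot wide.
num = df.groupby('LIST').NUM.transform(lambda x: pd.factorize(x, sort=True)[0]) + 1
wide = (df.assign(NUM=num)
          .set_index(['LIST', 'NUM'])
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1, ascending=[True, False]))
wide.columns = [f'{b[0]}{a}{b[1:]}' for a, b in wide.columns]
print(wide)
```

Missing elements (GAMMA and OMEGA have no third element) come out as NaN, which maps naturally to NULL on the subsequent INSERT.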

Map values based off matched columns - Python

I want to map values based on how two columns match. For instance, the df below contains different labels, A or B, and I want to assign a new column that describes these labels. This works by comparing columns Z L and Z P. Z L will always contain values from ['X1','X2','X3','X4'], while Z P will correspondingly contain values from ['LA','LB','LC','LD'].
These will always be in ascending order or reverse order. Ascending order means X1 corresponds to LA, X2 corresponds to LB, etc. Reverse order means X1 corresponds to LD, X2 corresponds to LC, etc.
For ascending order I want to map an R; for reverse order I want to map an L.
X = ['X1','X2','X3','X4']
R = ['LA','LB','LC','LD']
L = ['LD','LC','LB','LA']
df = pd.DataFrame({
    'Period': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'labels': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Z L': [np.nan, np.nan, 'X3', 'X2', 'X4', np.nan, 'X2', 'X3', 'X3', 'X1'],
    'Z P': [np.nan, np.nan, 'LC', 'LC', 'LD', np.nan, 'LC', 'LC', 'LB', 'LA'],
})
df = df.dropna()
This is the output dataset used to determine the combinations. I have a large df with repeated combinations, so I'm not too concerned with returning all of them; I'm mainly concerned with the unique Map values for each Period.
Period labels Z L Z P
2 1 A X3 LC
3 1 B X2 LC
4 1 A X4 LD
6 2 A X2 LC
7 2 B X3 LC
8 2 A X3 LB
9 2 B X1 LA
Attempt:
labels = df['labels'].unique().tolist()
I = df.loc[df['labels'] == labels[0]]
J = df.loc[df['labels'] == labels[1]]
I['Map'] = ((I['Z L'].isin(X)) | (I['Z P'].isin(R))).map({True:'R', False:'L'})
J['Map'] = ((J['Z L'].isin(X)) | (J['Z P'].isin(R))).map({True:'R', False:'L'})
If I drop duplicates from period and labels the intended df is:
Period labels Map
0 1 A R
1 1 B L
2 2 A L
3 2 B R
Here's my approach:
# the ascending orders
lst1,lst2 = ['X1','X2','X3','X4'], ['LA','LB','LC','LD']
# enumerate the orders
d1, d2 = ({v:k for k,v in enumerate(l)} for l in (lst1, lst2))
# check if the enumerations in `Z L` and `Z P` are the same
df['Map'] = np.where(df['Z L'].map(d1) == df['Z P'].map(d2), 'R', 'L')
Output:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
and df.drop_duplicates(['Period', 'labels']):
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
6 2 A X2 LC L
7 2 B X3 LC R
You said your data is always either in ascending or reversed order, so you only need to define a fixed mapping between Z L and Z P as the R case and check against that mapping: if True it is R, else L. I may be wrong, but I think the solution may be reduced to this:
r_dict = dict(zip(['X1','X2','X3','X4'], ['LA','LB','LC','LD']))
df1['Map'] = (df1['Z L'].map(r_dict) == df1['Z P']).map({True: 'R', False: 'L'})
Out[292]:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
For the bottom desired output, you just drop_duplicates as QuangHoang showed.
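A runnable sketch of this fixed-mapping idea, using the rows from the question's output dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Period': [1, 1, 1, 2, 2, 2, 2],
                   'labels': ['A', 'B', 'A', 'A', 'B', 'A', 'B'],
                   'Z L':    ['X3', 'X2', 'X4', 'X2', 'X3', 'X3', 'X1'],
                   'Z P':    ['LC', 'LC', 'LD', 'LC', 'LC', 'LB', 'LA']})

# The ascending pairing defines the 'R' case; anything else is 'L'.
r_dict = dict(zip(['X1', 'X2', 'X3', 'X4'], ['LA', 'LB', 'LC', 'LD']))
df['Map'] = np.where(df['Z L'].map(r_dict) == df['Z P'], 'R', 'L')

# One row per unique Period/labels pair, as in the desired output.
unique_maps = df.drop_duplicates(['Period', 'labels'])[['Period', 'labels', 'Map']]
print(unique_maps)
```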

Filtering pandas Dataframe on Parent Child Condition

I have a Dataframe df having 3 columns.
_id parent_id type
A B Subcategory_level
B null Main_Level
D A Product_Level
M N Product_Level
X Y Subcategory_Level
Z X Subcategory_Level
L Z Product_Level
What I want as my output is :
_id parent_id type
D A product_level
M N product_level
L X product_level
What I tried is: drop all of the rows having type equal to main_level, then
df1 = df
df1.rename(columns={'_id': 'parent_id', 'parent_id': '_id'},
           index=str, inplace=True)
Then Natural join of df1 with df:
final_df=pd.merge(df,df1,on='parent_id', how='inner')
But the problem with this natural join is that it does not work when there is more than one level of hierarchy; e.g. the relation between X and L spans two levels, and in that case it fails.
Is that what you're saying?
df[df.type == 'product_level']
_id parent_id type
D A product_level
M N product_level
L X product_level
# Maybe I don't understand what you mean. I thought it was.
In [2]: df = pd.DataFrame({"a":[1,2,3,4], "b":["x","t","s","g"], "x":["l1", "l3", "l1", "l2"]})
In [3]: df
Out[3]:
a b x
0 1 x l1
1 2 t l3
2 3 s l1
3 4 g l2
In [4]: df[df.x=="l1"]
Out[4]:
a b x
0 1 x l1
2 3 s l1
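The multi-level case the question actually asks about (resolving L up through Z to X) is not handled by a plain filter or a single self-join. One way is to walk the parent chain with a dictionary; this is a sketch under the assumption that the wanted parent is the last Subcategory_Level ancestor in the chain (or the immediate parent if there is none):

```python
import pandas as pd

df = pd.DataFrame({'_id':       ['A', 'B', 'D', 'M', 'X', 'Z', 'L'],
                   'parent_id': ['B', None, 'A', 'N', 'Y', 'X', 'Z'],
                   'type':      ['Subcategory_Level', 'Main_Level', 'Product_Level',
                                 'Product_Level', 'Subcategory_Level',
                                 'Subcategory_Level', 'Product_Level']})

parent = dict(zip(df['_id'], df['parent_id']))
subcats = set(df.loc[df['type'] == 'Subcategory_Level', '_id'])

def resolved_parent(node):
    # Climb while the parent is a subcategory that itself has a
    # subcategory parent; return the last subcategory ancestor,
    # or the immediate parent if the chain never enters one.
    p = parent.get(node)
    while p in subcats and parent.get(p) in subcats:
        p = parent[p]
    return p

out = df[df['type'] == 'Product_Level'].copy()
out['parent_id'] = out['_id'].map(resolved_parent)
print(out)
```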

Compare Excel cells Python

I would like to compare two parts of two different columns from an Excel file that have different numbers of elements. The comparison should be made between a part of Column 3 and a part of Column 2. The Column 3 part has a length of j elements and the Column 2 part has a length of k elements (k > j). The Column 2 part starts from row j+1 and the Column 3 part starts from row 1. If an element from the Column 3 part matches an element from the Column 2 part, it should then check whether the element from Column 1 (before row j, with the same index as the matched item from the Column 3 part) matches the element from the Column 1 part between j+1 and k (with the same index as the matched item from the Column 2 part). If yes, the element from Column 4 with the same index as the matched element from the Column 2 part should be written to a new Excel sheet.
Example: Column3[1]==Column2[2](which represents element 'A') => Column1[1]==Column1[j+2](which represents element 'P') => Column4[j+2] should be written in a new sheet.
Column 1 Column 2 Column 3 Column 4
P F A S
B G X T
C H K V
D I M W
P B R B
P A R D
C D H E
D E J k
E M K W
F F L Q
Q F K Q
For reading the Excel sheet cells from the original sheet, I have used df27.ix[:j-1, 1].
One part of the code which reads the values of the mentioned parts from column 3 and column 2 might be:
for j in range(1, j):
    c3 = sheet['B' + str(j)].value
for k in range(j, j + k):
    c2 = sheet['B' + str(k)].value
Any hint how I can accomplish this?
UPDATED
I have tried new code which takes into consideration that we have '-', as joaquin mentioned in his example.
Joaquin's example:
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
New code:
from pandas import DataFrame as df
import pandas as pd
import openpyxl

wb = openpyxl.load_workbook('/media/sf_vboxshared/x.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
C13 = []
C12 = []
C1 = []
C2 = []
C3 = []
for s in range(2, sheet.max_row + 1):
    C1second = sheet['A' + str(s)].value
    C2second = sheet['B' + str(s)].value
    C3second = sheet['C' + str(s)].value
    C1.append(C1second)
    C2.append(C2second)
    C3.append(C3second)
C1 = [x.encode('UTF8') for x in C1]
for y in C2:
    if y is not None:
        C2 = [x.encode('UTF8') if x is not None else None for x in C2]
for z in C3:
    if z is not None:
        C3 = [x.encode('UTF8') if x is not None else None for x in C3]
for x in C1:
    C13.append(x)
for x in C3:
    C13.append(x)
for x in C1:
    C12.append(x)
for x in C2:
    C12.append(x)
tosave = pd.DataFrame()
df[C13] = pd.DataFrame(C13)
df[C12] = pd.DataFrame(C12)
for item in df[C13]:
    if '-' in item: continue
    new = df[df[C12] == item]
    tosave = tosave.append(new)
But I still get the following error: df[C13]=pd.DataFrame(C13) TypeError: 'type' object does not support item assignment. Any idea what is wrong?
Many thanks in advance,
Dan
Given your df is
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
then I combine C1 with C3, and C1 with C2:
df['C13'] = df.apply(lambda x: x['C1'] + x['C3'], axis=1)
df['C12'] = df.apply(lambda x: x['C1'] + x['C2'], axis=1)
and compare which rows have the same pair of characters in columns C13 and C12, and save them in tosave
tosave = pd.DataFrame()
for item in df['C13']:
    if '-' in item:
        continue
    new = df[df['C12'] == item]
    tosave = tosave.append(new)
this gives you a tosave dataframe with the rows matching:
C1 C2 C3 C4 C13 C12
5 P A - D P- PA
That can be directly saved as it is or you can save just column C4
UPDATE: If you have data in every row, then you cannot use the '-' detection (or any other kind of detection based on the difference between empty and filled columns). On the other hand, if j and k are not defined (for any j and k), your problem actually reduces to finding, for each row, identical pairs below that row. In consequence, this:
tosave = pd.DataFrame()
for idx, item in enumerate(df['C13']):
    new = df[df['C12'] == item]
    tosave = tosave.append(new.loc[idx + 1:])
solves the problem, given your labels and data are like:
C1 C2 C3 C4
0 P F A S
1 B G X T
2 C H K V
3 D I M W
4 P B R B
5 P A R D
6 C D H E
7 D E J k
8 E M K W
9 F F L Q
10 Q F K Q
This code also produces the same output as before:
C1 C2 C3 C4 C13 C12
5 P A R D PR PA
Note this probably needs some refinement (e.g. when a row produces two matches, the second row will produce one match, and you will need to remove duplicates from the final output).
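The second approach can be sketched end-to-end with the sample data above; this version uses pd.concat instead of the since-removed DataFrame.append, but the logic is the same:

```python
import pandas as pd

df = pd.DataFrame({'C1': list('PBCDPPCDEFQ'),
                   'C2': ['F', 'G', 'H', 'I', 'B', 'A', 'D', 'E', 'M', 'F', 'F'],
                   'C3': ['A', 'X', 'K', 'M', 'R', 'R', 'H', 'J', 'K', 'L', 'K'],
                   'C4': ['S', 'T', 'V', 'W', 'B', 'D', 'E', 'k', 'W', 'Q', 'Q']})

# Pair each row's C1 with its C3 value and with its C2 value.
df['C13'] = df['C1'] + df['C3']
df['C12'] = df['C1'] + df['C2']

# For each row, keep any later row whose C12 pair equals this row's C13 pair.
matches = []
for idx, item in enumerate(df['C13']):
    later = df.loc[idx + 1:]
    matches.append(later[later['C12'] == item])
tosave = pd.concat(matches)
print(tosave)
```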

pandas.DataFrame - how to reindex by group?

Can a new index be applied to a DataFrame according to a grouping made with groupby? Specifically: is there an elegant way to do that, and can the original DataFrame be modified through groupby groups at all?
UPD:
My data looks like this:
A B C
0 a x 0.903343
1 a z 0.982050
2 g x 0.274823
3 g y 0.334491
4 c z 0.756728
5 f z 0.697841
6 d z 0.505845
7 b z 0.768199
8 b y 0.743012
9 e x 0.697212
I group by columns 'A' and 'B', and I want every unique pair of values in those columns to have the same index value in the original DF. Also, the original DF can be big, and I'm trying to figure out how to do such a reindex without inefficiently building a whole new DF.
Currently I'm using this solution:
from string import ascii_lowercase
import random
import pandas as pd

df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in xrange(10)],
                   'B': [random.choice(['x', 'y']) for _ in xrange(10)],
                   'C': [random.random() for _ in xrange(10)]})
df['id'] = None
new_df = pd.DataFrame()
for i, (n, g) in enumerate(df.groupby(['A', 'B'])):
    g['id'] = i
    new_df = new_df.append(g)
new_df.set_index('id', inplace=True)
You can do this quickly with some internal functions in pandas. Create a test DataFrame first:
import pandas as pd
import random
from string import ascii_lowercase

random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in xrange(10)],
                   'B': [random.choice(['x', 'y']) for _ in xrange(10)],
                   'C': [random.random() for _ in xrange(10)]})
If you want the new ids in the same order as columns A & B:
m = pd.MultiIndex.from_arrays((df.A, df.B))
df.index = pd.factorize(pd.lib.fast_zip(m.labels), sort=True)[0]
print df
The output is:
A B C
1 a y 0.025446
7 e x 0.541412
6 d y 0.939149
2 b x 0.381204
3 c x 0.216599
4 c y 0.422117
5 d x 0.029041
6 d y 0.221692
1 a y 0.437888
0 a x 0.495812
If you don't care about the order of the new ids:
m = pd.MultiIndex.from_arrays((df.A, df.B))
la, lb = m.labels
df.index = pd.factorize(la*len(lb)+lb)[0]
print df
The output is:
A B C
0 a y 0.025446
1 e x 0.541412
2 d y 0.939149
3 b x 0.381204
4 c x 0.216599
5 c y 0.422117
6 d x 0.029041
2 d y 0.221692
0 a y 0.437888
7 a x 0.495812
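The internals used above (pd.lib.fast_zip, MultiIndex.labels) no longer exist in modern pandas. A sketch of the equivalent on current versions uses groupby(...).ngroup(), which numbers each A/B group directly:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'g', 'g', 'c', 'a'],
                   'B': ['x', 'z', 'x', 'y', 'z', 'a'],
                   'C': range(6)})

# ngroup() assigns one integer per (A, B) group; with sort=True the
# ids follow the sorted order of the keys, like factorize(..., sort=True).
df.index = df.groupby(['A', 'B'], sort=True).ngroup()
print(df)
```

Rows with the same (A, B) pair end up sharing an index value, without building a new DataFrame group by group.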
