SQL sorting for INSERT - python

I have a data set like this:
LIST
ENAME
EDESC
NUM
ALPHA
A
aa
1
ALPHA
B
bb
2
ALPHA
C
cc
3
GAMMA
M
mm
1
GAMMA
N
nn
2
OMEGA
Y
yy
0
OMEGA
Z
zz
1
NUM is the priority of the 'E's or elements of each LIST
The logic for NUM (which I can't control) is that it orders elements smallest to largest, but the first element does not have to be NUM = 0. For example, you can see that OMEGA's elements are Y=0 and Z=1 while GAMMA's are M=1 and N=2 - the system considers these orderings equivalent.
This data comes from a query from a system whose data structure I can't control, but I want to put this data into my own database but in the following format:
LIST
E1NAME
E1DESC
E2NAME
E2DESC
E3NAME
E3DESC
ALPHA
A
aa
B
bb
C
cc
GAMMA
M
mm
N
nn
OMEGA
Y
yy
Z
zz
Can this be done with using SQL INSERT into a MySQL/MariaDB database?
The data is being pulled from the source via Python, and the result is parsed using Pandas - perhaps there is a way to do it in Pandas such that the subsequent SQL INSERT into my DB isn't as complicated?

In SQL, you can use conditional aggregation:
select list,
max(case when seqnum = 1 then ename end) as ename_1,
max(case when seqnum = 1 then edesc end) as edesc_1,
max(case when seqnum = 2 then ename end) as ename_2,
max(case when seqnum = 2 then edesc end) as edesc_2,
max(case when seqnum = 3 then ename end) as ename_3,
max(case when seqnum = 3 then edesc end) as edesc_3
from (select d.*,
row_number() over (partition by list order by num) as seqnum
from dataset d
) d
group by list;

factorize
num = df.groupby('LIST').NUM.transform(lambda x: pd.factorize(x, sort=True)[0]) + 1
new_df = df.assign(NUM=num).set_index(['LIST', 'NUM']) \
.unstack().swaplevel(0, 1, 1).sort_index(1, ascending=[True, False])
new_df.columns = [f'{b[0]}{a}{b[1:]}' for a, b in new_df.columns]
new_df
E1NAME E1DESC E2NAME E2DESC E3NAME E3DESC
LIST
ALPHA A aa B bb C cc
GAMMA M mm N nn NaN NaN
OMEGA Y yy Z zz NaN NaN

Related

Python Pandas dataframes merge update

My problem is kind of a bit tricky (similar to sql merge/update), and not understanding how to fix: ( I am giving a small sample of the dataframes below)
I have two dataframes :
dfOld
A B C D E
x1 x2 g h r
q1 q2 x y s
t1 t2 h j u
p1 p2 r s t
AND
dfNew
A B C D E
x1 x2 a b c
s1 s2 p q r
t1 t2 h j u
q1 q2 x y z
We want to merge the dataframes with the following rule : ( we can think Col A & ColB as keys)
For any ColA & ColB combination if C/D/E are exact match then it takes value from any dataframe, however if any value has changed in Col C/D/E , it takes the value from new dataframe and if a new ColA/Col B combination is in DfNew then it takes those values and if the ColA/ColB combination does not exist in dfNew then it takes the value from dfOld:
So my OutPut should be like:
A B C D E
x1 x2 a b c
q1 q2 x y z
t1 t2 h j u
p1 p2 r s t
s1 s2 p q r
I was trying :
mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
dfO = pd.concat([mydfL,mydfR])
dfO.drop("_merge", axis=1, inplace=True)
My output looks like: ( I kept the index for clarity)
A B C D E
0 x1 x2 a b c
2 s1 s2 p q r
3 q1 q2 x y z
0 x1 x2 g h r
2 q1 q2 x y s
3 p1 p2 r s t
However, this output does not serve my purpose. First and foremost it does not include the totally identical row (between dfOld & dfnew) which consists of :
t1 t2 h j u
and next it includes all the rows where for the ColA/Col x, y and q1, q2, where I just wanted the updated values in ColC/D/E in the new data frame ( dfNew). It includes data from both.
So can I get some help as to what am I missing and what may be a better and elegant way to do this. Thanks in advance.
You can use combine_first using A/B as temporary index:
out = (dfNew.set_index(['A', 'B'])
.combine_first(dfOld.set_index(['A', 'B']))
.reset_index()
)

pd.Dataframe.update puts the result at the top of the dataframe

Lets say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(n)
I want to update the zeroes in the z column in the nf dataframe with the values from the z column in the mf dataframe only in the rows with keys from the column x
when i call
nf.update(mf)
i get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000
To answer your problem, you need to match the indexes of both dataframes, here how you can do it :
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()

Map values based off matched columns - Python

I want to map values based how two columns are matched. For instance, the df below contains different labels, A or B. I want to assign a new column that describes these labels. How this occurs is comparing columns Z L and Z P. Z L will always contain either ['X1','X2','X3','X4']. While Z P will correspondingly contain ['LA','LB','LC','LD'].
These will always be in acceding order or reverse order. As in ascending order will mean X1 corresponds to LA, X2 corresponds to LB etc. Reverse order means X1 corresponds to LD, X2 corresponds to LC etc.
If ascending order I want to map an R. If reverse order I want to map an L.
X = ['X1','X2','X3','X4']
R = ['LA','LB','LC','LD']
L = ['LD','LC','LB','LA']
df = pd.DataFrame({
'Period' : [1,1,1,1,1,2,2,2,2,2],
'labels' : ['A','B','A','B','A','B','A','B','A','B'],
'Z L' : [np.nan,np.nan,'X3','X2','X4',np.nan,'X2','X3','X3','X1'],
'Z P' : [np.nan,np.nan,'LC','LC','LD',np.nan,'LC','LC','LB','LA'],
})
df = df.dropna()
This is the output dataset to determine the combinations. I have a large df with repeated combinations so I'm not too concerned with returning all of them. I'm mainly concerned with all unique Mapped values for each Period.
Period labels Z L Z P
2 1 A X3 LC
3 1 B X2 LC
4 1 A X4 LD
6 2 A X2 LC
7 2 B X3 LC
8 2 A X3 LB
9 2 B X1 LA
Attempt:
labels = df['labels'].unique().tolist()
I = df.loc[df['labels'] == labels[0]]
J = df.loc[df['labels'] == labels[1]]
I['Map'] = ((I['Z L'].isin(X)) | (I['Z P'].isin(R))).map({True:'R', False:'L'})
J['Map'] = ((J['Z L'].isin(X)) | (J['Z P'].isin(R))).map({True:'R', False:'L'})
If I drop duplicates from period and labels the intended df is:
Period labels Map
0 1 A R
1 1 B L
2 2 A L
3 2 B R
Here's my approach:
# the ascending orders
lst1,lst2 = ['X1','X2','X3','X4'], ['LA','LB','LC','LD']
# enumerate the orders
d1, d2 = ({v:k for k,v in enumerate(l)} for l in (lst1, lst2))
# check if the enumerations in `Z L` and `Z P` are the same
df['Map'] = np.where(df['Z L'].map(d1)== df['Z P'].map(d2), 'R', 'L')
Output:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
and df.drop_duplicates(['Period', 'labels']):
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
6 2 A X2 LC L
7 2 B X3 LC R
You said your data is always either in ascending or reversed order. You only need to define a fix mapping between Z L and Z P as the R and check on this mapping. If True it is R, else L. I may be wrong, but I think solution may be reduced to this
r_dict = dict(zip(['X1','X2','X3','X4'], ['LA','LB','LC','LD']))
df1['Map'] = (df1['Z L'].map(r_dict) == df1['Z P']).map({True: 'R', False: 'L'})
Out[292]:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
For the bottom desired output, you just drop_duplicates as QuangHoang.

Compare Excel cells Python

I would like to compare two parts of two different columns from an Excel file that have a different number of elements. The comparison should be made between a part of Column 3 and a part of Column 2. Column 3 part has a length of j elements and Column 2 has a length of k elements(k>j). Column 2 part starts from row "j+1" and column 3 part starts from row 1. If an element from column 3 part is matching an element from column 2 part, then should check if the element from column1, before the j row, which has the same index as matched item from column 3 part is matching with the element from Column 1 part between j+1 and k, which has the same index as matched item from column 2 part. If yes, then should be written the element from Column 4 with the same index as matched element from column 2 part in a new Excel sheet.
Example: Column3[1]==Column2[2](which represents element 'A') => Column1[1]==Column1[j+2](which represents element 'P') => Column4[j+2] should be written in a new sheet.
Column 1 Column 2 Column 3 Column 4
P F A S
B G X T
C H K V
D I M W
P B R B
P A R D
C D H E
D E J k
E M K W
F F L Q
Q F K Q
For reading the Excel sheet cells from original sheet, I have used the df27.ix[:j-1,1].
One part of the code which reads the values of the mention part from column 3 and column 2 might be:
for j in range(1,j):
c3=sheet['B'+str(j)].value
for k in range(j,j+k):
c2=sheet['B'+str(k)].value
Any hint how I can accomplish this?
UPDATED
I have tried a new code which takes in consideration that we have '-', like joaquin mentioned in his example.
Joaquin's example:
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
New code:
from pandas import DataFrame as df
import pandas as pd
import openpyxl
wb=openpyxl.load_workbook('/media/sf_vboxshared/x.xlsx')
sheet=wb.get_sheet_by_name('Sheet1')
C13=[]
C12=[]
C1=[]
C2=[]
C3=[]
for s in range(2, sheet.max_row+1):
C1second=sheet['A'+str(s)].value
C2second=sheet['B'+str(s)].value
C3second=sheet['C'+str(s)].value
C1.append(C1second)
C2.append(C2second)
C3.append(C3second)
C1=[x.encode('UTF8') for x in C1]
for y in C2:
if y is not None:
C2=[x.encode('UTF8') if x is not None else None for x in C2]
for z in C3:
if z is not None:
C3=[x.encode('UTF8') if x is not None else None for x in C3]
for x in C1:
C13.append(x)
for x in C3:
C13.append(x)
for x in C1:
C12.append(x)
for x in C2:
C12.append(x)
tosave = pd.DataFrame()
df[C13]=pd.DataFrame(C13)
df[C12]=pd.DataFrame(C12)
for item in df[C13]:
if '-' in item: continue
new = df[df[C12] == item]
tosave = tosave.append(new)
But I still get the following error: df[C13]=pd.DataFrame(C13) TypeError: 'type' object does not support item assignment. Any idea what is wrong?
Many thanks in advance,
Dan
Given your df is
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
then, I combine C1 and C3 and C1 and C2
df['C13'] = df.apply(lambda x: x['C1'] + x['C3'], axis=1)
df['C12'] = df.apply(lambda x: x['C1'] + x['C2'], axis=1)
and compare which rows have the same pair of characters in columns C13 and C12, and save them in tosave
tosave = p.DataFrame()
for item in df['C13']:
if '-' in item: continue
new = df[df['C12'] == item]
tosave = tosave.append(new)
this gives you a tosave dataframe with the rows matching:
C1 C2 C3 C4 C13 C12
5 P A - D P- PA
That can be directly saved as it is or you can save just column C4
UPDATE: If you have data on each row, then you can not use the '-' detection (or any other kind of detection based on the differences between empty and filled columns). On the other hand, if j,k are not defined (for any j and k), your problem is actually reduced to find, for each row, identical pairs below that row. In consecuence, this:
tosave = p.DataFrame()
for idx, item in enumerate(df['C13']):
new = df[df['C12'] == item]
tosave = tosave.append(new.loc[idx+1:])
solves the problem given your labels and data is like:
C1 C2 C3 C4
0 P F A S
1 B G X T
2 C H K V
3 D I M W
4 P B R B
5 P A R D
6 C D H E
7 D E J k
8 E M K W
9 F F L Q
10 Q F K Q
This code also produces the same output as before:
C1 C2 C3 C4 C13 C12
5 P A R D PR PA
Note this probably needs some refinenment (p.e when a row produces 2 matches, the second row with produce 1 match, and you will need to remove replicates from the final output).

How can I find start and end occurrence of character in Python

I have a dataframe df with the following ids (in Col).
The last occurrence of A/B/C represents the start, and the last occurrence of X is the end. I should ignore any other A,B,C between start and end (e.g. rows 8 and 9).
I have to find start and end records from this data and assign a number to each of these occurrences. The column count is my desired output:
Col ID
P
Q
A
A
A 1
Q 1
Q 1
B 1
C 1
S 1
S 1
X 1
X 1
X 1
Q
Q
R
R
C
C 2
D 2
E 2
B 2
K 2
D 2
E 2
E 2
X 2
X 2
This code:
lc1 = df.index[df.Col.eq('A') & df.Col.ne(df.Col.shift(-1))]
would give me an array of all the last occurrences of Index values of 'A', in this case [5].
lc1 = df.index[df.Col.eq('C') & df.Col.ne(df.Col.shift(-1))] # [20]
lc2 = df.index[df.Col.eq('X') & df.Col.ne(df.Col.shift(-1))] # [14,29]
I would use iloc to print the count values:
df.iloc[5:14]['count'] = 1
df.iloc[20:29]['count'] = 2
How can I find the indices of A/B/C together and print the count values of each start and end occurrence?
To find your indices of A, B, and C you can do:
df[(df.Col =='A')|(df.Col =='B')|(df.Col =='C')].index
Print your start counts:
df1 = df[df['count'] != df['count'].shift(+1)]
print df1[df1['count'] != 0]['count']
Print your end counts:
df2 = df[df['count'] != df['count'].shift(-1)]
print df2[df2['count'] != 0]['count']
On a sidenote, calling a column count is a bad idea because count is a method of the DataFrame and then you get ambiguity when doing df.count.
EDIT: Corrected since I was answering to a wrong question.

Categories

Resources