Converting PL/SQL procedures to PySpark - Python

BEGIN
  open v_refcur for
    SELECT A.LGCY_LNDR_NO
         , A.LGCY_LNDR_BR_NO
         , A.LNDR_NM
         , B.ADDR_LINE1_TXT
         , B.ADDR_LINE2_TXT
         , B.CITY_NM
         , B.ST_CD
         , B.POSTAL_CD
         , C.FAX_NO
      FROM LNDR_CUST_XREF A
      LEFT OUTER JOIN LNDR_CUST_ADDR B
        ON A.LNDR_ID = B.LNDR_ID
       AND B.ADDR_TYP_CD = 'MAIL'
      LEFT OUTER JOIN LNDR_CUST_ADDR C
        ON A.LNDR_ID = C.LNDR_ID
       AND C.ADDR_TYP_CD = 'SITE'
     WHERE A.LGCY_LNDR_NO = LNDR_NO
       AND A.LGCY_LNDR_BR_NO = BRN_NO
       AND A.TA_CUST_FLG = 'Y';
  SQL_CD := SWV_SQLCODE;
END;
What would be the line-by-line conversion of the above code? I don't have the databases on hand, so what would be the most appropriate gist of the PL/SQL code in PySpark?

This statement can be re-written something like below. Note that the joins go through .join(...) and the column references through F.col(...), since bare A/B/C names aren't defined on the Python side:
from pyspark.sql import functions as F

df = (df_LNDR_CUST_XREF.alias('A')
      .join(df_LNDR_CUST_ADDR.alias('B'),
            (F.col("A.LNDR_ID") == F.col("B.LNDR_ID")) & (F.col("B.ADDR_TYP_CD") == 'MAIL'),
            "left")
      .join(df_LNDR_CUST_ADDR.alias('C'),
            (F.col("A.LNDR_ID") == F.col("C.LNDR_ID")) & (F.col("C.ADDR_TYP_CD") == 'SITE'),
            "left")
      .where((F.col("A.LGCY_LNDR_NO") == LNDR_NO)
             & (F.col("A.LGCY_LNDR_BR_NO") == BRN_NO)
             & (F.col("A.TA_CUST_FLG") == 'Y'))
      .select(F.col("A.LGCY_LNDR_NO"),
              F.col("A.LGCY_LNDR_BR_NO"),
              F.col("A.LNDR_NM"),
              F.col("B.ADDR_LINE1_TXT"),
              F.col("B.ADDR_LINE2_TXT"),
              F.col("B.CITY_NM"),
              F.col("B.ST_CD"),
              F.col("B.POSTAL_CD"),
              F.col("C.FAX_NO"))
)
I haven't tested it, though.
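For completeness: OPEN v_refcur FOR just exposes the result set, so returning (or collecting) df plays that role, and SQL_CD := SWV_SQLCODE has no direct PySpark equivalent; error handling would be a Python try/except around the job instead. As for where the two DataFrames come from, here is a minimal sketch, assuming an active SparkSession named spark and that the tables are registered in the Spark catalog (a JDBC read would be the alternative if they still live in Oracle):

# Hypothetical setup: the table names come from the query above, but how the
# data is exposed to Spark is an assumption.
df_LNDR_CUST_XREF = spark.table("LNDR_CUST_XREF")
df_LNDR_CUST_ADDR = spark.table("LNDR_CUST_ADDR")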

Related

Modify the code to loop over another dataset

I am using a haversine_distance function to calculate the distance between each coordinate pair in a dataset and a specific coordinate. [start_lat, start_lon = 40.6976637, -74.1197643]
def haversine_distance(lat1, lon1, lat2, lon2):
    r = 6371
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2)**2
    res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
    return np.round(res, 2)

start_lat, start_lon = 40.6976637, -74.1197643

distances_km = []
for row in pandas_df.itertuples(index=False):
    distances_km.append(
        haversine_distance(start_lat, start_lon, row.lat, row.lon)
    )

pandas_df['Distance'] = distances_km
pandas_df
This successfully creates a column in my dataset measuring the distance from the given point.
Now I want to modify this code so that, instead of the single point [start_lat, start_lon = 40.6976637, -74.1197643], it uses another dataset containing cities.
How can I modify the existing code so that it creates a column for every city, using that city's coordinates instead?
The desired output shows a different column per city name, with the distance calculated as above.
Any help is appreciated; I'm new to Python!
Cities array, as requested in the comments:
[['Nanaimo' -123.9364 49.1642]
['Prince Rupert' -130.3271 54.3122]
['Vancouver' -123.1386 49.2636]
['Victoria' -123.3673 48.4275]
['Edmonton' -113.4909 53.5445]
['Winnipeg' -97.1392 49.8994]
['Sarnia' -82.4065 42.9746]
['Sarnia' -82.4065 42.9746]
['North York' -79.4112 43.7598]
['Kingston' -76.4812 44.2305]
['St. Catharines' -79.2333 43.1833]
['Thunder Bay' -89.2461 48.3822]
['Gaspé' -64.4833 48.8333]
['Cap-aux-Meules' -61.8607 47.3801]
['Kangiqsujuaq' -71.9667 61.6]
['Montreal' -73.5534 45.5091]
['Quebec City' -71.2074 46.8142]
['Rimouski' -68.524 48.4489]
['Sept-Îles' -66.3833 50.2167]
['Bathurst' -65.6497 47.6186]
['Charlottetown' -63.1399 46.24]
['Corner Brook' -57.9711 48.9411]
['Dartmouth' -63.5714 44.6715]
['Lewisporte' -55.0667 49.2333]
['Port Hawkesbury' -61.3642 45.6153]
['Saint John' -66.0628 45.2796]
["St. John's" -52.7072 47.5675]
['Sydney' -60.1947 46.1381]
['Yarmouth' -66.1175 43.8361]]
The beauty of Python is that you can use the same code to do different things.
To get a distance column per city, each with its own [start_lat, start_lon], you can keep the code you have now. All you need to do is define start_lat and start_lon as arrays:
# --------------------- Array Initialization ---------------------
import pandas as pd
import numpy as np
np.random.seed(0)
pandas_df = pd.DataFrame(data = {'lat': np.random.rand(100),
'lon': np.random.rand(100)})
start_cities = pd.DataFrame([['Nanaimo' , -123.9364 , 49.1642], ['Prince Rupert' , -130.3271 , 54.3122],
['Vancouver' , -123.1386 , 49.2636], ['Victoria' , -123.3673 , 48.4275],
['Edmonton' , -113.4909 , 53.5445], ['Winnipeg' , -97.1392 , 49.8994],
['Sarnia' , -82.4065 , 42.9746], ['Sarnia' , -82.4065 , 42.9746],
['North York' , -79.4112 , 43.7598], ['Kingston' , -76.4812 , 44.2305],
['St. Catharines' , -79.2333 , 43.1833], ['Thunder Bay' , -89.2461 , 48.3822],
['Gaspé' , -64.4833 , 48.8333], ['Cap-aux-Meules' , -61.8607 , 47.3801],
['Kangiqsujuaq' , -71.9667 , 61.6 ], ['Montreal' , -73.5534 , 45.5091],
['Quebec City' , -71.2074 , 46.8142], ['Rimouski' , -68.524 , 48.4489],
['Sept-Îles' , -66.3833 , 50.2167], ['Bathurst' , -65.6497 , 47.6186],
['Charlottetown' , -63.1399 , 46.24 ], ['Corner Brook' , -57.9711 , 48.9411],
['Dartmouth' , -63.5714 , 44.6715], ['Lewisporte' , -55.0667 , 49.2333],
['Port Hawkesbury' , -61.3642 , 45.6153], ['Saint John' , -66.0628 , 45.2796],
["St. John's" , -52.7072 , 47.5675], ['Sydney' , -60.1947 , 46.1381],
['Yarmouth' , -66.1175 , 43.8361]])
# Note: each row above is [name, longitude, latitude], so label the columns accordingly
start_cities.columns = 'names', 'start_lon', 'start_lat'
start_lat = start_cities.start_lat
start_lon = start_cities.start_lon
# --------------------- Same code as before (as promised) ---------------------
def haversine_distance(lat1, lon1, lat2, lon2):
    r = 6371
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2)**2
    res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
    return np.round(res, 2)

distances_km = []
for row in pandas_df.itertuples(index=False):
    distances_km.append(
        haversine_distance(start_lat, start_lon, row.lat, row.lon))
# --------------------- Store data ---------------------
distances_km = np.array(distances_km)
for ind, name in enumerate(start_cities.names):
    pandas_df['distance_km_' + name] = distances_km[:, ind]
# print(pandas_df.keys())
# ["lat" , "lon" ,
# "distance_km_Nanaimo" , "distance_km_Prince Rupert" ,
# "distance_km_Vancouver" , "distance_km_Victoria" ,
# "distance_km_Edmonton" , "distance_km_Winnipeg" ,
# "distance_km_Sarnia" , "distance_km_North York" ,
# "distance_km_Kingston" , "distance_km_St. Catharines" ,
# "distance_km_Thunder Bay" , "distance_km_Gaspé" ,
# "distance_km_Cap-aux-Meules" , "distance_km_Kangiqsujuaq" ,
# "distance_km_Montreal" , "distance_km_Quebec City" ,
# "distance_km_Rimouski" , "distance_km_Sept-Îles" ,
# "distance_km_Bathurst" , "distance_km_Charlottetown" ,
# "distance_km_Corner Brook" , "distance_km_Dartmouth" ,
# "distance_km_Lewisporte" , "distance_km_Port Hawkesbury",
# "distance_km_Saint John" , "distance_km_St. John's" ,
# "distance_km_Sydney" , "distance_km_Yarmouth" ]

False option returning in np.select?

I made this np.select call, but the AND operators don't work!
df = pd.DataFrame({'A': [2107], 'B': [76380700]})
cond = [(df["A"]==2107)|(df["A"]==6316)&(df['B']>=10000000)&(df['B']<=19969999),
(df["A"]==2107)|(df["A"]==6316)&(df['B']>=1000000)&(df['B']<=99999999)]
choices =["Return 1", "Return 2"]
df["C"] = np.select(cond, choices, default = df["A"])
np.select returns "Return 1" but the correct option is "Return 2":
>>df["C"]
0 Return 1
because this line returns False:
>>df["B"]<=19969999
False
How can I solve this problem?
It's an operator precedence issue. Here's what you wrote:
cond = [
    (df["A"]==2107) |
    (df["A"]==6316) &
    (df['B']>=10000000) &
    (df['B']<=19969999),

    (df["A"]==2107) |
    (df["A"]==6316) &
    (df['B']>=1000000) &
    (df['B']<=99999999)
]
Here's how that is interpreted:
cond = [
    (df["A"]==2107) |
    (
        (df["A"]==6316) &
        (df['B']>=10000000) &
        (df['B']<=19969999)
    ),

    (df["A"]==2107) |
    (
        (df["A"]==6316) &
        (df['B']>=1000000) &
        (df['B']<=99999999)
    )
]
You need parens around the "or" clause:
cond = [
    ( (df["A"]==2107) | (df["A"]==6316) ) &
    (df['B']>=10000000) &
    (df['B']<=19969999),

    ( (df["A"]==2107) | (df["A"]==6316) ) &
    (df['B']>=1000000) &
    (df['B']<=99999999)
]
And, by the way, there is absolutely nothing wrong with writing the expressions the way I did there. Isn't it much clearer what's going on when they're spaced out like that?
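To see the precedence difference in isolation, here is a minimal, self-contained check using the question's own values:

import pandas as pd

df = pd.DataFrame({'A': [2107], 'B': [76380700]})
# & binds tighter than |, so without the extra parens the B range check
# only constrains the A == 6316 branch:
ungrouped = (df["A"] == 2107) | (df["A"] == 6316) & (df["B"] <= 19969999)
grouped = ((df["A"] == 2107) | (df["A"] == 6316)) & (df["B"] <= 19969999)
print(ungrouped.iloc[0])  # True: A == 2107 alone satisfies the test
print(grouped.iloc[0])    # False: the B check now applies to both A values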
I think you were missing parentheses around (df["A"]==2107)|(df["A"]==6316). In your script, the condition for "Return 1" was effectively checking (df["A"]==2107)|((df["A"]==6316)&(df['B']>=10000000)&(df['B']<=19969999)), which means A==2107 OR (A==6316 AND B >= ... AND B <= ...). That's why np.select returns "Return 1": that condition is True.
df = pd.DataFrame({'A': [2107], 'B': [76380700]})
cond = [((df["A"]==2107)|(df["A"]==6316))&(df['B']>=10000000)&(df['B']<=19969999),
        ((df["A"]==2107)|(df["A"]==6316))&(df['B']>=1000000)&(df['B']<=99999999)]
choices =["Return 1", "Return 2"]
df["C"] = np.select(cond, choices, default = df["A"])

Python - Create a nested join query using SQLAlchemy

Any ideas for how I could do a nested join using SQLAlchemy? Here is the raw query I am trying to recreate.
SELECT "tblLinkDivisions"."DivisionCoLinkID", 5 AS Score, "tblCompanies"."CompanyID"
FROM "tblS_AppliedStrategy"
INNER JOIN
(
(
"tblLinkIndustry" INNER JOIN "tblLinkDivisions" ON "tblLinkIndustry"."IndustryCoLinkID" = "tblLinkDivisions"."IndustryCoLinkID"
)
INNER JOIN "tblLinkAppliedStrategy"
ON "tblLinkDivisions"."DivisionCoLinkID" = "tblLinkAppliedStrategy"."DivisionCoLinkID"
)
ON "tblS_AppliedStrategy"."AppStrategyCode" = "tblLinkAppliedStrategy"."AppStrategyCode"
INNER JOIN "tblCompanies" ON ("tblCompanies"."CompanyID" = "tblLinkIndustry"."CompanyID" AND "tblCompanies"."CompanyID" != %s)
WHERE "tblS_AppliedStrategy"."AppStrategyCode" IN %s
I have tried a few different solutions with this being the closest outcome:
companies = (
    session.query(tblLinkDivisions.DivisionCoLinkID, tblCompanies.CompanyID)
    .select_from(tblS_AppliedStrategy)
    .join(tblLinkIndustry, tblLinkDivisions.IndustryCoLinkID == tblLinkIndustry.IndustryCoLinkID)
    .join(tblLinkAppliedStrategy, tblLinkDivisions.DivisionCoLinkID == tblLinkAppliedStrategy.DivisionCoLinkID)
    .join(tblLinkAppliedStrategy, tblS_AppliedStrategy.AppStrategyCode == tblLinkAppliedStrategy.AppStrategyCode)
    .join(tblCompanies, and_(tblCompanies.CompanyID == tblLinkIndustry.CompanyID, tblCompanies.CompanyID != company_id))
    .filter(tblS_AppliedStrategy.AppStrategyCode.in_(strategy))
    .all()
)
Any help or suggestions would be greatly appreciated!
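One untested possibility, assuming the mapped classes shown above: since these are all INNER JOINs, the nested grouping in the raw SQL affects only how the joins are written, not the result set, so the query can be flattened into a single chain. The key fixes are joining tblLinkAppliedStrategy only once and ordering the chain so each ON clause references tables that are already joined:

from sqlalchemy import and_, literal

companies = (
    session.query(tblLinkDivisions.DivisionCoLinkID,
                  literal(5).label("Score"),
                  tblCompanies.CompanyID)
    .select_from(tblS_AppliedStrategy)
    # tblS_AppliedStrategy -> tblLinkAppliedStrategy (the outer ON clause)
    .join(tblLinkAppliedStrategy,
          tblS_AppliedStrategy.AppStrategyCode == tblLinkAppliedStrategy.AppStrategyCode)
    # tblLinkAppliedStrategy -> tblLinkDivisions
    .join(tblLinkDivisions,
          tblLinkDivisions.DivisionCoLinkID == tblLinkAppliedStrategy.DivisionCoLinkID)
    # tblLinkDivisions -> tblLinkIndustry (the innermost nested join)
    .join(tblLinkIndustry,
          tblLinkIndustry.IndustryCoLinkID == tblLinkDivisions.IndustryCoLinkID)
    # tblLinkIndustry -> tblCompanies, excluding the given company
    .join(tblCompanies,
          and_(tblCompanies.CompanyID == tblLinkIndustry.CompanyID,
               tblCompanies.CompanyID != company_id))
    .filter(tblS_AppliedStrategy.AppStrategyCode.in_(strategy))
    .all()
)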

'int' object is not subscriptable (function calling error)

I'm attempting to break a Vigenère cipher without knowing the key, and I'm struggling to figure out my error, since the caesar_break function seems to work perfectly fine on its own.
This is the code for the function I'm attempting to call:
def caesar_break( cipher, frequencies ): # help!
    alpha = ["A" , "B" , "C" , "D" , "E" , "F" , "G" , "H" , "I" , "J" , "K" , "L" , "M" ,
             "N" , "O" , "P" , "Q" , "R" , "S" , "T" , "U" , "V" , "W" , "X" , "Y" , "Z" ]
    dist_ls = []
    for key in alpha:
        poss_plaintxt = caesar_dec( cipher, key )
        counts = letter_counts( poss_plaintxt )
        observed = normalize( counts )
        dist = distance( observed, frequencies )
        dist_ls.append( (dist , poss_plaintxt , key) )
    can = dist_ls[ 0 ][ 0 ]
    can_t = 0
    for t in dist_ls:
        if t[ 0 ] < can:
            can = t[ 0 ]
            can_t = t
    return [ can_t[ 2 ], can_t[ 1 ] ]
This is what I have so far for my current function. I'm not completely done with it, but I just need to figure out this error in order to move forward:
def vig_break( c, maxlen, frequencies ):
    for i in range( 1, maxlen ):
        break_ls = list(vig_break_for_length( c, i, frequencies ))
        print( break_ls ) # this print is unneeded, I just like to test my code as I go
For specificity, this is the traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "ciphers.py", line 310, in vig_break
break_ls = list(vig_break_for_length( c, i, frequencies ))
File "ciphers.py", line 288, in vig_break_for_length
split_break = caesar_break( ciphertext, frequencies )
File "ciphers.py", line 192, in caesar_break
return [ can_t[ 2 ], can_t[ 1 ] ]
TypeError: 'int' object is not subscriptable
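For what it's worth, the traceback points at a likely cause inside caesar_break rather than vig_break: can_t starts out as the integer 0 and is only replaced when a strictly smaller distance is found, so whenever dist_ls[0] is already the minimum, can_t is still 0 at the return, and can_t[2] raises exactly this TypeError. A sketch of the fix (same logic, seeded with the first tuple):

can = dist_ls[0][0]
can_t = dist_ls[0]  # seed with the first candidate tuple, not the int 0
for t in dist_ls:
    if t[0] < can:
        can = t[0]
        can_t = t
return [can_t[2], can_t[1]]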

Can JModelica print results directly to file?

I am running the following JModelica script:
#!/usr/local/jmodelica/bin/jm_python.sh
import pyjmi
op = pyjmi.transfer_optimization_problem("BatchReactor", "model.mop")
opt_opts = op.optimize_options()
opt_opts['n_e'] = 40 # Number of elements
opt_opts['IPOPT_options']['tol'] = 1e-10
opt_opts['IPOPT_options']['print_level'] = 8
opt_opts['IPOPT_options']['output_file'] = '/z/out'
res = op.optimize(options=opt_opts)
I had hoped that the results (e.g. time, x1, x2, &c.) would be printed to the file /z/out. But the file only contains IPOPT verbose debugging/status info.
Is there a way to print the information that would be stored in res directly to a file? Either by somehow writing res itself or, preferably, having IPOPT/JModelica write the results without having to go through Python?
There is a way to print the information directly to a file. The following accomplishes this. Note that result_file_name is the key to making this happen.
#!/usr/local/jmodelica/bin/jm_python.sh
import pyjmi
op = pyjmi.transfer_optimization_problem("BatchReactor", "model.mop")
opt_opts = op.optimize_options()
opt_opts['n_e'] = 40 # Number of elements
opt_opts['result_file_name'] = '/z/out'
opt_opts['IPOPT_options']['tol'] = 1e-10
opt_opts['IPOPT_options']['print_level'] = 0
res = op.optimize(options=opt_opts)
Unfortunately, the contents of the file are somewhat mysterious.
You may find that using result_file_name per another answer here results in an output file which is difficult to understand.
The following produces a nicer format:
import StringIO
import numpy as np

def PrintResToFile(filename, result):
    def StripMX(x):
        return str(x).replace('MX(', '').replace(')', '')

    varstr = '#Variable Name={name: <10}, Unit={unit: <7}, Val={val: <10}, Col={col:< 5}, Comment="{comment}"\n'

    with open(filename, 'w') as fout:
        # Print all variables at the top of the file, along with relevant
        # information about them.
        for var in result.model.getAllVariables():
            if not result.is_variable(var.getName()):
                val = result.initial(var.getName())
                col = -1
            else:
                val = "Varies"
                col = result.get_column(var.getName())
            unit = StripMX(var.getUnit())
            if not unit:
                unit = "X"
            fout.write(varstr.format(
                name    = var.getName(),
                unit    = unit,
                val     = val,
                col     = col,
                comment = StripMX(var.getAttribute('comment'))
            ))
        # Ensure that the time variable is printed
        fout.write(varstr.format(
            name    = 'time',
            unit    = 's',
            val     = 'Varies',
            col     = 0,
            comment = 'None'
        ))
        # The data matrix contains only time-varying variables. So fetch all of
        # these, couple them in tuples with their column number, sort by column
        # number, and then extract the name of the variable again. This results
        # in a list of variable names which are guaranteed to be in the same
        # order as the data matrix.
        vkeys_in_order = map(lambda x: x[1], sorted([(result.get_column(x), x) for x in result.keys() if result.is_variable(x)]))
        for vk in vkeys_in_order:
            fout.write("{0:>13},".format(vk))
        fout.write("\n")
        sio = StringIO.StringIO()
        np.savetxt(sio, result.data_matrix, delimiter=',', fmt='%13.5f')
        fout.write(sio.getvalue())
which looks like this:
#Variable Name=S0 , Unit=kg , Val=2.0 , Col=-1 , Comment="Solid Mass"
#Variable Name=F0 , Unit=kg , Val=0.0 , Col=-1 , Comment="Fluid Mass"
#Variable Name=a , Unit=Hz , Val=0.2 , Col=-1 , Comment="None"
#Variable Name=b , Unit=kg/s , Val=1.0 , Col=-1 , Comment="None"
#Variable Name=f , Unit=kg/s , Val=0.05 , Col=-1 , Comment="None"
#Variable Name=h , Unit=1/g , Val=0.05 , Col=-1 , Comment="None"
#Variable Name=der(F) , Unit=X , Val=Varies , Col= 1 , Comment="None"
#Variable Name=F , Unit=kg , Val=Varies , Col= 3 , Comment="None"
#Variable Name=der(S) , Unit=X , Val=Varies , Col= 2 , Comment="None"
#Variable Name=S , Unit=kg , Val=Varies , Col= 4 , Comment="None"
#Variable Name=u , Unit=X , Val=Varies , Col= 5 , Comment="None"
#Variable Name=startTime , Unit=X , Val=0.0 , Col=-1 , Comment="None"
#Variable Name=finalTime , Unit=X , Val=100.0 , Col=-1 , Comment="None"
#Variable Name=time , Unit=s , Val=Varies , Col= 0 , Comment="None"
time, der(F), der(S), F, S, u,
0.00000, 0.97097, -0.97097, 0.00000, 2.00000, 0.97097
0.38763, 1.07704, -1.05814, 0.38519, 1.61698, 1.00000
1.61237, 0.88350, -0.80485, 1.70714, 0.35885, 0.65862
2.50000, 0.00000, 0.09688, 2.14545, 0.00000, 0.00000
2.88763, 0.09842, -0.00000, 2.18330, 0.00000, 0.06851
4.11237, 0.10342, 0.00000, 2.30688, 0.00000, 0.07077
5.00000, 0.10716, 0.00000, 2.40033, 0.00000, 0.07240
5.38763, 0.10882, -0.00000, 2.44219, 0.00000, 0.07311
6.61237, 0.11421, 0.00000, 2.57875, 0.00000, 0.07535
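If only a handful of trajectories are needed, a plain-Python dump straight from res is also an option. A sketch, assuming the usual pyjmi dict-style access by variable name (x1 and x2 are the names mentioned in the question):

import numpy as np
# Each res['name'] lookup returns the trajectory array for that variable.
np.savetxt('/z/out.csv',
           np.column_stack([res['time'], res['x1'], res['x2']]),
           delimiter=',', header='time,x1,x2')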
