How can I translate this UDF to a Pandas UDF - python

I'm facing some performance issues with this function, which aims to return True if any string in the string array starts with the val parameter. I would like to translate it into a Pandas UDF.
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def list_contains(val):
    # Perform what ListContains generated
    def list_contains_udf(column_list):
        for element in column_list:
            if element.startswith(val):
                return True
        return False
    return udf(list_contains_udf, BooleanType())
How could I achieve this?

Inspired by #jxc's comment, try the SQL below in a Databricks cell.
%sql
SELECT exists(column_list, element -> substr(element, 1, length(val)) == val)
As I understand it, the SQL equivalent of element.startswith(val) is to take the first length(val) characters of element using substr and check whether they equal val itself.
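If you want the same predicate from the DataFrame API rather than a %sql cell, expr can wire it in. A minimal sketch, assuming a DataFrame df with an array<string> column named column_list, and the literal prefix 'abc' standing in for val:
from pyspark.sql import functions as F

# 'abc' stands in for val; its length (3) is what substr compares against
result = df.withColumn(
    "has_match",
    F.expr("exists(column_list, element -> substr(element, 1, 3) == 'abc')")
)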
Otherwise, please refer to the pyspark.sql.UDFRegistration(sparkSession) class in the PySpark documentation to register similar functions as UDFs and combine their use.
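As for the Pandas UDF that was actually asked about, here is a minimal sketch, assuming Spark 3.x (for the type-hinted pandas_udf style) with PyArrow installed; list_contains_pandas and its inner names are illustrative, not from the original code:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType

def list_contains_pandas(val):
    # Same closure pattern as the original list_contains
    @pandas_udf(BooleanType())
    def contains(column_list: pd.Series) -> pd.Series:
        # Each Series element is one row's whole string array
        return column_list.apply(
            lambda arr: any(s.startswith(val) for s in arr)
        )
    return contains

# Usage: df.withColumn("has_match", list_contains_pandas("abc")("column_list"))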

Related

Check if each value within list is present in the given Django Model Table in a SINGLE query

So let's say I want to implement this generic function:
def do_exist(key: str, values: list[Any], model: django.db.models.Model) -> bool
Which checks if all the given values exist within a given model's column named key in a SINGLE query.
I've implemented something like this
from django.db.models import Exists

def do_exist(key, values, model):
    # One EXISTS subquery per value, keyed on the given column name
    chained_exists = (
        Exists(model.objects.filter(**{key: value})) for value in values
    )
    qs = model.objects.filter(*chained_exists).values("pk")[:1]
    # Limit and values() used for the sake of less payload
    return len(qs) > 0
It generates a pretty valid SQL query, but the thing that scares me is that if I append an evaluating method call to qs, like .first() instead of [:1], or .exists(), the MySQL connection drops.
Does anyone have a more elegant way of solving this?
If you know you're passing in N pks, then a count() query filtered by those pks should have exactly N results.
def do_exist(model, pks):
    return model.objects.filter(pk__in=pks).count() == len(pks)
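The same counting idea can be stretched to the generic key/values signature from the question. A sketch, assuming the values are hashable; distinct() guards against the target column containing duplicate values, and the whole check is still a single query:
def do_exist(key, values, model):
    wanted = set(values)
    found = (
        model.objects.filter(**{f"{key}__in": wanted})
        .values_list(key, flat=True)
        .distinct()
        .count()
    )
    return found == len(wanted)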
qs = MyModel.objects.filter(id__in=pks)
This gives you a queryset to which you can apply .all(), etc.
In Django, you can use "entry__in" to filter down a queryset based on a list of entries.
results = Model.objects.filter(id__in=pks)

How to Query a String in Pandas

I am currently practicing pandas.
I am using some pokemon data as practice: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6
I want to make a program that lets the user input their queries, and I will return the result that they need.
Since I do not know how many parameters the user will input, I wrote some code that breaks the input up and puts it into a format that pandas can understand. But when I try to execute my code, it just returns None.
What's wrong with my code?
Thank you.
import pandas as pd

df = pd.read_csv(r'PATH HERE')
column_heads = df.columns
print(f'''
This is a basic searcher
Input your search query as follows:
<Head1>:<Value1>, <Head2>:<Value2> etc..
Example:
Type 1:Bug,Type2:Steel,Legendary:False
Heads:
{column_heads}
''')
usr_inp = input('Enter Query: ')
queries = usr_inp.split(',')
parameters = {}
for query in queries:
    head, value = query.split(':')
    parameters[head] = value
print('Your search parameters:', parameters)
df_query = 'df.loc['
for key, value in parameters.items():
    df_query += f'''(df['{key}'] == '{value}')&'''
df_query = df_query[:-1] + ']'
exec('''print(exec(df_query))''')
There's no need to use exec or eval—though, if you must, you should use eval instead of exec, as in print(eval(df_query)); eval will return the value of the expression (i.e. the result of the query), while exec just executes a statement, returning None.
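A quick toy illustration of that difference:
expr = "1 + 1"
print(eval(expr))  # 2 - eval returns the value of the expression
print(exec(expr))  # None - exec executes a statement and returns nothing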
You could do something like
import numpy as np
from functools import reduce
df[reduce(np.logical_and, (df[col] == val for col, val in parameters.items()))]
Step by step:
Collect a list of "conditions" (boolean Series) of the form df[column] == value, given the search query parameters:
conditions = [df[column] == value for column, value in parameters.items()]
Combine all conditions together using the and operation. With pandas Series/numpy arrays this is done with the bitwise & operator, which is represented by the binary function operator.and_ (operator is a module in the Python standard library). reduce just means applying a binary operator to the first pair of elements, then to that result and the third element, and so on, until only one element is left; in this particular case: conditions[0] & conditions[1], then (conditions[0] & conditions[1]) & conditions[2], etc.
import operator
mask = reduce(operator.and_, conditions)
Alternatively, it might be clearer (and less error-prone) to use np.logical_and, which represents the "proper" boolean and operation:
mask = reduce(np.logical_and, conditions)
Index the dataframe with the combined mask:
df[mask]
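Putting the steps together, a minimal end-to-end sketch; the tiny frame below is made-up stand-in data, not the pokemon CSV:
import numpy as np
import pandas as pd
from functools import reduce

df = pd.DataFrame({
    'Type 1': ['Bug', 'Fire', 'Bug'],
    'Type 2': ['Steel', 'Flying', 'Poison'],
    'Legendary': ['False', 'False', 'True'],
})
parameters = {'Type 1': 'Bug', 'Legendary': 'False'}

conditions = [df[column] == value for column, value in parameters.items()]
mask = reduce(np.logical_and, conditions)
print(df[mask])  # only the first row satisfies both conditions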

How does string formatting work in a spark.sql statement in PySpark?

I'm working with Pyspark and am writing a query using spark.sql. I want to choose values from an array declared somewhere else to avoid having to type in names of all rows again.
Here is my attempt, but it does not work.
array_fields = ["cat", "dog"]
ans= spark.sql("""select {} from <table_name>.format(",".join[array_fields]) """)
I've also tried
ans= spark.sql("""select {} from <table_name> """).format(",".join[array_fields])
What am I doing wrong here?
Assuming your examples are exactly what you have tried, your use of format and join is not quite right.
Try:
array_fields = ["cat", "dog"]
ans = spark.sql("""select {} from <table_name> """.format(",".join(array_fields)))
The differences are:
The format method is applied to the string you are wanting to format.
The join method is a function call - its argument should be in round brackets, not square brackets (your 2nd example).
The join method is not part of the string (your 1st example).
You might also - in the first instance - try using print rather than calling spark.sql directly. That is:
array_fields = ["cat", "dog"]
print("""select {} from <table_name> """.format(",".join(array_fields)))
That way you can see what you will ultimately be passing to Spark. When you are ready, simply replace print with ans = spark.sql and away you go.
Is format really required? Try an f-string:
f"""SELECT {",".join(array_fields)} FROM <table_name>"""

Python pandas if statement based off of boolean qualifier

I am trying to write an IF statement that keeps my currency pairs in alphabetical ordering (i.e. USD/EUR would flip to EUR/USD because E alphabetically comes before U, whereas CHF/JPY would stay the same because C comes alphabetically before J). Initially I was going to write code specific to that, but realized there were other fields I'd need to flip (mainly changing a sign from positive to negative or vice versa).
So what I did was write a function to create a new column and make a boolean identifier as to whether or not the field needs action (True) or not (False).
def flipFx(ccypair):
    first = ccypair[:3]
    last = ccypair[-3:]
    if first > last:
        return True
    else:
        return False

brsPosFwd['Flip?'] = brsPosFwd['Currency Pair'].apply(flipFx)
This works great and does what I want it to.
Then I try to write an IF statement that uses that field to create two new columns:
if brsPosFwd['Flip?'] is True:
    brsPosFwd['CurrencyFlip'] = brsPosFwd['Sec Desc'].apply(lambda x:
        x.str[-3:] + "/" + x.str[:3])
    brsPosFwd['NotionalFlip'] = -brsPosFwd['Current Face']
else:
    brsPosFwd['CurrencyFlip'] = brsPosFwd['Sec Desc']
    brsPosFwd['NotionalFlip'] = brsPosFwd['Current Face']
However, this is not working properly. It's creating the two new fields, CurrencyFlip and NotionalFlip but treating every record like it is False and just pasting what came before it.
Does anyone have any ideas?
Pandas uses vectorised functions. You are performing operations on entire series objects as if they were single elements.
You can use numpy.where to vectorise your calculations:
import numpy as np
brsPosFwd['CurrencyFlip'] = np.where(brsPosFwd['Flip?'],
                                     brsPosFwd['Sec Desc'].str[-3:] + '/' + brsPosFwd['Sec Desc'].str[:3],
                                     brsPosFwd['Sec Desc'])
brsPosFwd['NotionalFlip'] = np.where(brsPosFwd['Flip?'],
                                     -brsPosFwd['Current Face'],
                                     brsPosFwd['Current Face'])
Note also that pd.Series.apply should be used as a last resort, since it is a thinly veiled, inefficient loop. Here you can simply use the .str accessor.
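Following that advice, even the 'Flip?' column can be built without apply; a sketch using the same .str accessor on the original 'Currency Pair' column:
pair = brsPosFwd['Currency Pair']
# Vectorised equivalent of flipFx: compare the first and last three characters
brsPosFwd['Flip?'] = pair.str[:3] > pair.str[-3:]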

filter SqlAlchemy column value by number of resulting characters

How can I filter an SQLAlchemy column by the trailing characters of its value?
Here is the kind of implementation I am looking at:
query = query.filter(Take_Last_7_Characters(column_1) == '0321334')
Where "Take_Last_7_Characters" fetches the last 7 characters from the resulting value of column_1
So How can I implement Take_Last_7_Characters(column_1) ??
Use sqlalchemy.sql.expression.func to generate SQL functions. Check the SQLAlchemy documentation for more info.
Please use func to generate SQL functions, as directed by #tuxuday.
Note that the code is RDBMS-dependent. The code below runs on SQLite, which offers the SUBSTR and LENGTH functions. Your actual database might have different names for them (LEN, SUBSTRING, LEFT, RIGHT, etc.).
from sqlalchemy import func

qry = session.query(Test)
qry = qry.filter(func.substr(Test.column_1, func.length(Test.column_1) - 6, 7) == '0321334')
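Since the comparison value here is itself exactly seven characters long, a portable alternative is suffix matching with endswith, which SQLAlchemy renders as a LIKE; a sketch under that assumption:
# Renders roughly as: column_1 LIKE '%0321334'
qry = session.query(Test).filter(Test.column_1.endswith('0321334'))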
