How to Query a String in Pandas - python

I am currently practicing pandas.
I am using some Pokémon data for practice: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6
I want to make a program that allows the user to input their queries, and I will return the result that they need.
Since I do not know how many parameters the user will input, I wrote some code that breaks the input up and puts it in a format that pandas can understand, but when I execute my code, it just returns None.
What's wrong with my code?
Thank you
import pandas as pd

df = pd.read_csv(r'PATH HERE')
column_heads = df.columns
print(f'''
This is a basic searcher
Input your search query as follows:
<Head1>:<Value1>, <Head2>:<Value2> etc..
Example:
Type 1:Bug,Type2:Steel,Legendary:False
Heads:
{column_heads}
''')
usr_inp = input('Enter Query: ')
queries = usr_inp.split(',')
parameters = {}
for query in queries:
    head, value = query.split(':')
    parameters[head] = value
print('Your search parameters:', parameters)
df_query = 'df.loc['
for key, value in parameters.items():
    df_query += f'''(df['{key}'] == '{value}')&'''
df_query = df_query[:-1] + ']'
exec('''print(exec(df_query))''')

There's no need to use exec or eval here. If you must, use eval instead of exec, as in print(eval(df_query)): eval returns the value of the expression (i.e. the result of the query), while exec just executes a statement and returns None.
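For instance, a minimal illustration of the difference:
result = eval('1 + 1')  # result is 2
result = exec('1 + 1')  # result is None; exec discards the expression's value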
You could do something like
import numpy as np
from functools import reduce
df[reduce(np.logical_and, (df[col] == val for col, val in parameters.items()))]
Step by step:
Collect a list of "conditions" (boolean Series) of the form df[column] == value, given the search query parameters:
conditions = [df[column] == value for column, value in parameters.items()]
Combine all the conditions using the "and" operation. With pandas Series / NumPy arrays, this is done with the bitwise & operator, which is represented by the binary function operator.and_ (operator is a module in the Python standard library). reduce applies a binary operator to the first pair of elements, then to that result and the third element, and so on, until only one value remains; so, in this particular case: conditions[0] & conditions[1], then (conditions[0] & conditions[1]) & conditions[2], etc.:
mask = reduce(operator.and_, conditions)
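As a quick illustration of how reduce folds a binary operator over a list:
from functools import reduce
import operator
reduce(operator.and_, [True, True, False])  # (True & True) & False -> False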
Alternatively, it might be clearer (and less error-prone) to use np.logical_and, which represents the "proper" boolean and operation:
mask = reduce(np.logical_and, conditions)
Index the dataframe with the combined mask:
df[mask]
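Putting it all together, a minimal runnable sketch (the CSV path and the example parameters here are placeholders, assuming the same input parsing as in the question):
import operator
from functools import reduce
import pandas as pd

df = pd.read_csv(r'PATH HERE')
parameters = {'Type 1': 'Bug', 'Type 2': 'Steel'}  # e.g. parsed from the user's input

# one boolean Series per search parameter, combined with &
conditions = [df[column] == value for column, value in parameters.items()]
mask = reduce(operator.and_, conditions)
print(df[mask])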

Parse URL Parameters into separate Columns

I have a dataframe with a column of URLs that I would like to parse into new columns, with row values based on a specified parameter if it is present in the URL. I am using a function that loops through each row in the dataframe column and parses the specified URL parameter, but when I try to select the column after the function has finished, I get a KeyError. Should I be setting the value of this new column in a different manner? Is there a more effective approach than looping through the values in my table and running this process?
Error:
KeyError: 'utm_source'
Example URLs (df['landing_page_url']):
https://lp.example.com/test/lp
https://lp.example.com/test/ny/?utm_source=facebook&ref=test&utm_campaign=ny-newyork_test&utm_term=nice
https://lp.example.com/test/ny/?utm_source=facebook
NaN
https://lp.example.com/test/la/?utm_term=lp-test&utm_source=facebook
Code:
import pandas as pd
import numpy as np
import math
from urllib.parse import parse_qs, urlparse

def get_query_field(url, field):
    if isinstance(url, str):
        try:
            return parse_qs(urlparse(url).query)[field][0]
        except KeyError:
            return ''
    else:
        return ''

for i in df['landing_page_url']:
    print(i)  # returns URL
    print(get_query_field(i, 'utm_source'))  # returns proper values
    df['utm_source'] == get_query_field(i, 'utm_source')
    df['utm_campaign'] == get_query_field(i, 'utm_campaign')
    df['utm_term'] == get_query_field(i, 'utm_term')
I don't think your for loop will work. It looks like each iteration will overwrite the entire column you are trying to set. I wanted to test the speed against my method, but I'm nearly certain this will be faster than iterating.
# Simplify the function here as recommended by Nick
def get_query_field(url, field):
    if isinstance(url, str):
        return parse_qs(urlparse(url).query).get(field, [''])[0]
    return ''

# Use apply to create new columns based on the url
df['utm_source'] = df['landing_page_url'].apply(get_query_field, args=['utm_source'])
df['utm_campaign'] = df['landing_page_url'].apply(get_query_field, args=['utm_campaign'])
df['utm_term'] = df['landing_page_url'].apply(get_query_field, args=['utm_term'])
Instead of
try:
    return parse_qs(urlparse(url).query)[field][0]
except KeyError:
    return ''
You can just do:
return parse_qs(urlparse(url).query).get(field, [''])[0]
The trick here is my_dict.get(key, default) instead of my_dict[key]: the default is returned if the key doesn't exist.
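For instance, a minimal illustration:
d = {'utm_source': ['facebook']}
d.get('utm_source', [''])[0]  # 'facebook'
d.get('utm_term', [''])[0]    # '' -- no KeyError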
Is there a more effective approach than looping through the values in my table and running this process?
Not really. Looping through each URL is going to have to be done either way. Right now, though, you are overwriting the entire column for every URL, meaning that if two different URLs have different sources in the query, the last one in the list will win. I have no idea whether this is intentional or not.
Also note: this line
df['utm_source'] == get_query_field(i, 'utm_source')
is not actually doing anything: == is a comparison operator ("does the left side match the right side?"). You probably meant to use = or df.append({'utm_source': get_query_field(..)}).
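A minimal illustration of the difference:
df['utm_source'] == 'facebook'  # comparison: returns a boolean Series, changes nothing
df['utm_source'] = 'facebook'   # assignment: overwrites the whole column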

How can I translate this UDF to a Pandas UDF

I am facing performance issues with this function, which aims to return True if any string in the string array starts with the val parameter. I would like to translate it into a Pandas UDF.
def list_contains(val):
    # Perform what ListContains generated
    def list_contains_udf(column_list):
        for element in column_list:
            if element.startswith(val):
                return True
        return False
    return udf(list_contains_udf, BooleanType())
How could I achieve this?
Inspired by @jxc's comment, try using the SQL below in a Databricks cell.
%sql
SELECT exists(column_list, element -> substr(element, 1, length(val)) == val)
The SQL equivalent of element.startswith(val) takes the first length(val) characters of element using substr and checks whether that prefix equals val itself.
Otherwise, please refer to the pyspark.sql.UDFRegistration(sparkSession) class in the PySpark documentation to register similar functions as UDFs and use them in combination.
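If you prefer to stay in the DataFrame API, here is a sketch using the built-in higher-order function pyspark.sql.functions.exists (available in Spark 3.1+; the column name column_list and the prefix val below are assumptions for illustration):
import pyspark.sql.functions as F

val = 'abc'  # hypothetical prefix to search for

# exists() is True if any array element satisfies the predicate,
# mirroring the original UDF without Python-level row iteration
df = df.withColumn(
    'has_match',
    F.exists('column_list', lambda element: element.startswith(val))
)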

How to pass a "Take All" parameter in pandas loc filter condition?

I have a function with a parameter (in this case "department") that filters specific data out of my dataset via df.loc[df['A'] == department]. In one case, I want to use this specific function, but instead of filtering the data, I want to get all of it.
Is there a way to pass a parameter which would result in something like
df.loc[(df['A'] == *) or
df.loc[(df['A'] == %)
# Write the data to the table
def table_creation(table, department, status):
    def condition_to_value(df, kpi):
        performance_indicator = df.loc[(df['A'] == department) & (df['C'] == kpi) & (df['B'] == status), 'D'].values[0]
        return performance_indicator
One way I can think of is, instead of using df['A'] == 'department', to use df['A'].isin(['department']). The two yield the same result.
Once you do that, then you can pass the "Take All" parameter like so:
df['A'].isin(df['A'].unique())
where df['A'].unique() is a list of all the unique values in this column, so it will return all True.
Or you can pass multiple parameters like so:
df['A'].isin(['department', 'string_2', 'string_3'])
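A minimal sketch of how the mask behaves (with a hypothetical toy column):
import pandas as pd

df = pd.DataFrame({'A': ['sales', 'hr', 'it']})
df[df['A'].isin(['sales'])]         # one department
df[df['A'].isin(df['A'].unique())]  # "take all": every row matches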
Building on Newskooler's answer: since you know the name of the column you'll be searching over, you could add that solution inside the function and process '*' accordingly.
It would look something like this:
# Write the data to the table
def table_creation(table, department, status):
    def condition_to_value(df, kpi):
        # work on a local copy so we don't rebind the closed-over parameter
        dept = department
        # use '*' to identify all departments
        if isinstance(dept, str) and dept == '*':
            dept = df['A'].unique()  # all departments, so isin() matches every row
        # make the function work with string or list inputs
        if isinstance(dept, str):
            dept = [dept, ]
        # notice the addition of isin as recommended by Newskooler
        performance_indicator = df.loc[(df['A'].isin(dept)) & (df['C'] == kpi) & (df['B'] == status), 'D'].values[0]
        return performance_indicator
I realize there are missing parts here, as there are in the initial question, but these changes should work without having to change how you call your function now, and will include the benefits listed in the previous answer.
I don't think you can do it by passing a parameter like in an SQL query. You'll have to re-write your function a bit to take this condition into consideration.

In pandas Series data, how do you get the keys based on data the function returns?

I have a working script that creates an array of each line of text in a file. This data is passed to a pandas Series(). The function startswith("\n") is used to return boolean True or False for each string, to determine if it begins with \n (a blank line).
I am currently using a counter i and a conditional statement to iterate through and match the position that the startswith() function is returning.
import pandas as pd
import numpy as np

f = open('list-of-strings.txt','r')
lines = []
for line in f.xreadlines():
    lines.append(line)
s = pd.Series(lines)

i = 0
for b in s.str.startswith("\n"):
    if b == 0:
        print s[i],; i += 1
    else:
        i += 1
I've realized I am looking at this from two different approaches. One is to directly handle each item as it is evaluated by the startswith() function. Since startswith() returns boolean values, it is possible to handle the data directly based on the values returned. Something like: for each item in startswith(), if the value returned is True, index = current_index, print s[index].
In addition to being able to print only the strings that are evaluated as False by startswith(), how would I get the current key value from startswith()?
References:
https://www.tutorialspoint.com/python_pandas/python_pandas_series.htm
https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm
Your question seems actually simpler than the one in the title. You're trying to get the indices for the values for which some predicate evaluated positively, not pass the index to a function.
In Pandas, the last block
i = 0
for b in s.str.startswith("\n"):
    if b == 0:
        print s[i],; i += 1
    else:
        i += 1
is equivalent to
print(s[~s.str.startswith('\n')].values)
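And if you also want the keys (index labels) rather than just the values, the same boolean mask works on the index:
mask = s.str.startswith('\n')
print(s.index[~mask])  # the keys of the lines that do not start with '\n'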
Moreover, you don't need Pandas for this at all:
print(''.join([l for l in open('list-of-strings.txt','r') if not l.startswith('\n')]))
should replace your entire block of code from the question.

Python pandas if statement based off of boolean qualifier

I am trying to write an IF statement that keeps my currency pairs in alphabetical ordering (i.e. USD/EUR would flip to EUR/USD because E comes alphabetically before U, while CHF/JPY would stay the same because C comes alphabetically before J). Initially I was going to write code specific to that, but I realized there were other fields I'd need to flip (mainly changing a sign from positive to negative or vice versa).
So what I did was write a function to create a new column with a boolean identifier indicating whether the field needs action (True) or not (False).
def flipFx(ccypair):
    first = ccypair[:3]
    last = ccypair[-3:]
    if first > last:
        return True
    else:
        return False

brsPosFwd['Flip?'] = brsPosFwd['Currency Pair'].apply(flipFx)
This works great and does what I want it to.
Then I try and write an IF statement to use that field to create two new columns:
if brsPosFwd['Flip?'] is True:
    brsPosFwd['CurrencyFlip'] = brsPosFwd['Sec Desc'].apply(lambda x: x.str[-3:]+"/"+x.str[:3])
    brsPosFwd['NotionalFlip'] = -brsPosFwd['Current Face']
else:
    brsPosFwd['CurrencyFlip'] = brsPosFwd['Sec Desc']
    brsPosFwd['NotionalFlip'] = brsPosFwd['Current Face']
However, this is not working properly. It creates the two new fields, CurrencyFlip and NotionalFlip, but treats every record as if it were False and just copies what came before.
Does anyone have any ideas?
Pandas uses vectorised functions. You are performing operations on entire series objects as if they were single elements.
You can use numpy.where to vectorise your calculations:
import numpy as np
brsPosFwd['CurrencyFlip'] = np.where(brsPosFwd['Flip?'],
                                     brsPosFwd['Sec Desc'].str[-3:]+'/'+brsPosFwd['Sec Desc'].str[:3],
                                     brsPosFwd['Sec Desc'])
brsPosFwd['NotionalFlip'] = np.where(brsPosFwd['Flip?'],
                                     -brsPosFwd['Current Face'],
                                     brsPosFwd['Current Face'])
Note also that pd.Series.apply should be used as a last resort, since it is a thinly veiled, inefficient loop. Here you can simply use the .str accessor.
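Following the same idea, the Flip? column itself can be computed without apply; a sketch, assuming the same column names as above:
brsPosFwd['Flip?'] = (brsPosFwd['Currency Pair'].str[:3]
                      > brsPosFwd['Currency Pair'].str[-3:])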
