I have a dataframe with a column of URLs that I would like to parse into new columns, row by row, based on the value of a specified parameter if it is present in the URL. I am using a function that loops through each row in the dataframe column and parses the specified URL parameter, but when I try to select the new column after the function has finished, I get a KeyError. Should I be setting the value of this new column in a different manner? Is there a more effective approach than looping through the values in my table and running this process?
Error:
KeyError: 'utm_source'
Example URLs (df['landing_page_url']):
https://lp.example.com/test/lp
https://lp.example.com/test/ny/?utm_source=facebook&ref=test&utm_campaign=ny-newyork_test&utm_term=nice
https://lp.example.com/test/ny/?utm_source=facebook
NaN
https://lp.example.com/test/la/?utm_term=lp-test&utm_source=facebook
Code:
import pandas as pd
import numpy as np
import math
from urllib.parse import parse_qs, urlparse
def get_query_field(url, field):
    if isinstance(url, str):
        try:
            return parse_qs(urlparse(url).query)[field][0]
        except KeyError:
            return ''
    else:
        return ''

for i in df['landing_page_url']:
    print(i)  # returns URL
    print(get_query_field(i, 'utm_source'))  # returns proper values
    df['utm_source'] == get_query_field(i, 'utm_source')
    df['utm_campaign'] == get_query_field(i, 'utm_campaign')
    df['utm_term'] == get_query_field(i, 'utm_term')
I don't think your for loop will work. It looks like each time through, it will overwrite the entire column you are trying to set. I wanted to test the speed against my method, but I'm nearly certain this will be faster than iterating.
# Simplify the function here as recommended by Nick
def get_query_field(url, field):
    if isinstance(url, str):
        return parse_qs(urlparse(url).query).get(field, [''])[0]
    return ''

# Use apply to create new columns based on the URL
df['utm_source'] = df['landing_page_url'].apply(get_query_field, args=['utm_source'])
df['utm_campaign'] = df['landing_page_url'].apply(get_query_field, args=['utm_campaign'])
df['utm_term'] = df['landing_page_url'].apply(get_query_field, args=['utm_term'])
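If you would rather not parse every URL three times, a possible refinement (a sketch under the same assumptions as the code above, not part of the original answer) is to parse once per row and expand the fields of interest into several columns at once; apply returning a Series produces one column per key:

fields = ['utm_source', 'utm_campaign', 'utm_term']

def get_query_fields(url):
    # Parse the query string once and pull out every field of interest.
    params = parse_qs(urlparse(url).query) if isinstance(url, str) else {}
    return pd.Series({f: params.get(f, [''])[0] for f in fields})

df[fields] = df['landing_page_url'].apply(get_query_fields)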
Instead of
try:
    return parse_qs(urlparse(url).query)[field][0]
except KeyError:
    return ''
You can just do:
return parse_qs(urlparse(url).query).get(field, [''])[0]
The trick here is my_dict.get(key, default) instead of my_dict[key]. The default will be returned if the key doesn't exist.
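For example, a minimal sketch of the difference:

from urllib.parse import parse_qs

params = parse_qs('utm_source=facebook')
params['utm_source'][0]          # 'facebook'
params.get('utm_term', [''])[0]  # '' -- the missing key falls back to the default
# params['utm_term'][0]          # would raise KeyError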
Is there a more effective approach than looping through the values in my table and running this process?
Not really. Looping through each URL is going to have to be done either way. Right now, though, you are overwriting the entire column for every URL, meaning that if two different URLs have different sources in the query, the last one in the list will win. I have no idea whether this is intentional.
Also note: this line
df['utm_source'] == get_query_field(i, 'utm_source')
is not actually doing anything: == is a comparison operator ("does the left side match the right side?"). You probably meant to use = or df.append({'utm_source': get_query_field(..)}).
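A minimal sketch of the difference between the two operators:

df['utm_source'] == 'facebook'  # builds a boolean Series and discards it
df['utm_source'] = 'facebook'   # assigns, creating or overwriting the column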
Related
I am reading data from nested JSON with this code:
data = json.loads(json_file.json)

for nodesUni in data["data"]["queryUnits"]['nodes']:
    try:
        tm = nodesUni['sql']['busData'][0]['engine']['engType']
    except:
        tm = ''
    try:
        to = nodesUni['sql']['carData'][0]['engineData']['producer']['engName']
    except:
        to = ''
    json_output_for_one_GU_owner = {
        "EngineType": tm,
        "EngineName": to,
    }
I am having an issue with a NoneType error (e.g. nodesUni['sql']['busData'][0]['engine']['engType'] may not exist at all because there is no data), so I am using try/except. But my code is more complex, and having a try/except for every value is crazy. Is there any other way to deal with this?
Error: "TypeError: 'NoneType' object is not subscriptable"
This is non-trivial, as your requirement is to traverse the dictionaries without errors and end up with an empty string, all in a very simple expression like cascaded [] operators.
First method
My approach is to add a hook when loading the JSON file, so that it creates default dictionaries nested to any depth:
import collections, json

def superdefaultdict():
    return collections.defaultdict(superdefaultdict)

def hook(s):
    c = superdefaultdict()
    c.update(s)
    return c

data = json.loads('{"foo":"bar"}', object_hook=hook)
print(data["x"][0]["zzz"])  # doesn't exist
print(data["foo"])          # exists
prints:
defaultdict(<function superdefaultdict at 0x000001ECEFA47160>, {})
bar
When accessing some combination of keys that don't exist (at any level), superdefaultdict recursively creates a defaultdict of itself (this is a nice pattern; you can read more about it in Is there a standard class for an infinitely nested defaultdict?), allowing any number of non-existing key levels.
Now the only drawback is that it returns a defaultdict(<function superdefaultdict at 0x000001ECEFA47160>, {}) which is ugly. So
print(data["x"][0]["zzz"] or "")
prints empty string if the dictionary is empty. That should suffice for your purpose.
Use it like this in your context:
def superdefaultdict():
    return collections.defaultdict(superdefaultdict)

def hook(s):
    c = superdefaultdict()
    c.update(s)
    return c

data = json.loads(json_file.json, object_hook=hook)

for nodesUni in data["data"]["queryUnits"]['nodes']:
    tm = nodesUni['sql']['busData'][0]['engine']['engType'] or ""
    to = nodesUni['sql']['carData'][0]['engineData']['producer']['engName'] or ""
Drawbacks:
It creates a lot of empty dictionaries in your data object. Shouldn't be a problem (except if you're very low in memory) as the object isn't dumped to a file afterwards (where the non-existent values would appear)
If a value already exists, trying to access it as a dictionary crashes the program
Also, if some value is 0 or an empty list, the or operator will pick "". This can be worked around with another wrapper that tests whether the object is an empty superdefaultdict instead. Less elegant, but doable; a sketch follows.
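A sketch of such a wrapper (the helper name safe_get is mine, not part of the original code): only an empty superdefaultdict means the path didn't exist, so falsy values like 0 or [] pass through unchanged:

def safe_get(value):
    # An *empty* defaultdict is what superdefaultdict returns for a missing path;
    # anything else, including 0 and [], is a real value and is returned as-is.
    if isinstance(value, collections.defaultdict) and not value:
        return ""
    return value

tm = safe_get(nodesUni['sql']['busData'][0]['engine']['engType'])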
Second method
Express the successive dictionary accesses as a string (for instance, just quote your expression, like "['sql']['busData'][0]['engine']['engType']"), parse it, and loop over the keys to get the data. If there's an exception, stop and return an empty string.
import json, re

def get(key, data):
    # Split "['sql']['busData'][0]..." into string keys and integer indices.
    key_parts = [x.strip("'") if x.startswith("'") else int(x)
                 for x in re.findall(r"\[([^\]]*)\]", key)]
    try:
        for k in key_parts:
            data = data[k]
        return data
    except (KeyError, IndexError, TypeError):
        return ""
testing with some simple data:
data = json.loads('{"foo":"bar","hello":{"a":12}}')
print(get("['sql']['busData'][0]['engine']['engType']",data))
print(get("['hello']['a']",data))
print(get("['hello']['a']['e']",data))
we get: an empty string (some keys are missing), 12 (the path is valid), and an empty string (we tried to traverse an existing value that isn't a dict).
The syntax could be simplified (ex: "sql"."busData".0."engine"."engType") but would still have to retain a way to differentiate keys (strings) from indices (integers).
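For illustration, a hedged sketch of that simplified syntax (the helper name get_dotted is hypothetical, not part of the original answer): every all-digit segment is treated as a list index, everything else as a dictionary key:

import json

def get_dotted(path, data):
    # "sql.busData.0.engine.engType" -> traverse keys and indices in turn
    for part in path.split('.'):
        key = int(part) if part.isdigit() else part
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            return ""
    return data

print(get_dotted("hello.a", json.loads('{"hello":{"a":12}}')))  # 12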
The second approach is probably the most flexible one.
I have a CSV table with a column (tags) full of lists of strings. To convert it to a pandas Series I used
def flatten(series):
    return pd.Series(series.dropna().sum())

tags_sorted = flatten(df['tags'])
Now I want to search the series for a string within one of the lists so that it returns the number of times that string occurs within the column. I found this function:
def find(series, tag):
    for i in series.index:
        if series[i] == tag:
            return i
    return None
and used it on my series:
print(find(tags_sorted, 'romance'))
but it keeps returning None even though the string is definitely in multiple lists.
I also tried
print(tags_sorted[tags_sorted == "romance"])
and
print(tags_sorted.loc[tags_sorted == 'romance'])
but those only return [].
I believe you need to change the find function to the following, if you want to count how many times the specific string occurred:
def find(series, tag):
    times_occurred = 0
    for i in series.index:
        if series[i] == tag:
            times_occurred += 1
    return times_occurred
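A hedged alternative, since the flattened tags are already a pandas Series: a vectorized comparison plus sum counts the matches directly (a sketch, assuming tags_sorted holds one tag string per element):

count = (tags_sorted == 'romance').sum()  # True counts as 1 when summed
print(count)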
I am currently practicing pandas.
I am using some Pokemon data as practice: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6
I want to make a program that allows the user to input queries, and I will return the result they need.
Since I do not know how many parameters the user will input, I made some code that breaks the input up and puts it in a format that pandas can understand. But when I try to execute my code, it just returns None.
What's wrong with my code?
Thank you.
import pandas as pd

df = pd.read_csv(r'PATH HERE')
column_heads = df.columns
print(f'''
This is a basic searcher
Input your search query as follows:
<Head1>:<Value1>, <Head2>:<Value2> etc..
Example:
Type 1:Bug,Type2:Steel,Legendary:False
Heads:
{column_heads}
''')
usr_inp = input('Enter Query: ')
queries = usr_inp.split(',')
parameters = {}
for query in queries:
    head, value = query.split(':')
    parameters[head] = value
print('Your search parameters:', parameters)
df_query = 'df.loc['
for key, value in parameters.items():
    df_query += f'''(df['{key}'] == '{value}')&'''
df_query = df_query[:-1] + ']'
exec('''print(exec(df_query))''')
There's no need to use exec or eval here. Though, if you must, you should use eval instead of exec, as in print(eval(df_query)): eval returns the value of the expression (i.e. the result of the query), while exec just executes a statement and returns None.
You could do something like
import numpy as np
from functools import reduce
df[reduce(np.logical_and, (df[col] == val for col, val in parameters.items()))]
Step by step:
Collect a list of "conditions" (boolean Series) of the form df[column] == value, given the search query parameters:
conditions = [df[column] == value for column, value in parameters.items()]
Combine all the conditions using the and operator. With pandas Series/numpy arrays, this is done with the bitwise & operator, which is represented by the binary function operator.and_ (operator is a module in the Python standard library). reduce just means applying a binary operator to the first pair of elements, then to the result of that and the third element, and so on, until only one element is left; so, in this particular case: conditions[0] & conditions[1], then (conditions[0] & conditions[1]) & conditions[2], etc.
import operator
mask = reduce(operator.and_, conditions)
Alternatively, it might be clearer (and less error-prone) to use np.logical_and, which represents the "proper" boolean and operation:
mask = reduce(np.logical_and, conditions)
Index the dataframe with the combined mask:
df[mask]
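Putting the steps together, a minimal end-to-end sketch (the toy DataFrame and parameters are illustrative, not from the question's dataset):

import pandas as pd
import numpy as np
from functools import reduce

df = pd.DataFrame({'Type 1': ['Bug', 'Fire', 'Bug'],
                   'Legendary': ['False', 'False', 'True']})
parameters = {'Type 1': 'Bug', 'Legendary': 'False'}

conditions = [df[col] == val for col, val in parameters.items()]
mask = reduce(np.logical_and, conditions)
print(df[mask])  # only the rows matching every parameter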
I have a QTableWidget populated with QTableWidgetItems.
I want a search bar where I can type, and in response the table should refresh and only show the items that partially match the string in the search field.
I'm using findItems for that, but I want the search to use only one column. How can I do that?
Iterate the table manually.
columnOfInterest = 1  # or whatever
valueOfInterest = "foo"

for rowIndex in range(self.myTable.rowCount()):
    twItem = self.myTable.item(rowIndex, columnOfInterest)
    if twItem.text() == valueOfInterest:
        self.myTable.setRowHidden(rowIndex, False)
    else:
        self.myTable.setRowHidden(rowIndex, True)
You will have to implement better matching criteria. You can use string methods like str.find and str.startswith, among others, if you want to do it yourself; a partial-match sketch follows.
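For the partial matching asked about, a hedged sketch (self.searchField is an assumed search-bar widget, not from the question's code; the in operator does a case-insensitive substring test):

needle = self.searchField.text().lower()
for rowIndex in range(self.myTable.rowCount()):
    twItem = self.myTable.item(rowIndex, columnOfInterest)
    matches = twItem is not None and needle in twItem.text().lower()
    self.myTable.setRowHidden(rowIndex, not matches)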
I am trying to compare values from 2 dictionaries in Python. I want to know if a value from one dictionary exists anywhere in another dictionary. If it exists, I want to return True, else False. Here is what I have so far.
The code I have is close, but not working right.
I'm using VS2012 with Python Plugin
I'm passing both Dictionary items into the functions.
def NameExists(best_guess, line):
    return all(line in best_guess.values()  # Getting GeneratorExit error here on values
               for value in line['full name'])
Also, I want to see if there are duplicates within best_guess itself.
def CheckDuplicates(best_guess, line):
    if len(set(best_guess.values())) != len(best_guess):
        return True
    else:
        return False
As the error is about generator exit, I guess you're using Python 3.x. So best_guess.values() is a generator-like view, which is exhausted at the first value in line['full name'] for which a match is not found.
Also, I guess the all usage is incorrect if you're looking for any value to exist (I'm not sure from which dictionary, though).
You can use something like the following, provided line is the second dictionary:
def NameExists(best_guess, line):
    vals = set(best_guess.values())
    return bool(set(line.values()).intersection(vals))
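A quick usage sketch with toy dictionaries (illustrative data of my own, not from the question):

best_guess = {1: 'John Smith', 2: 'Jane Doe'}
line = {'full name': 'Jane Doe', 'city': 'Boston'}
print(NameExists(best_guess, line))  # True -- 'Jane Doe' appears in both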
The syntax in NameExists seems wrong: you aren't using value, and best_guess.values() returns an iterator, so in will only work once unless we convert it to a list or a set (you are using Python 3.x, aren't you?). I believe this is what you meant:
def NameExists(best_guess, line):
    vals = set(best_guess.values())
    return all(value in vals for value in line['full name'])
And the CheckDuplicates function can be written in a shorter way like this:
def CheckDuplicates(best_guess, line):
    return len(set(best_guess.values())) != len(best_guess)
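For example (illustrative data of my own; the unused line parameter is kept only to preserve the original signature):

best_guess = {1: 'John Smith', 2: 'Jane Doe', 3: 'John Smith'}
print(CheckDuplicates(best_guess, None))  # True -- 'John Smith' appears twice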