Access string in a pandas series - python

I have a csv table with a column (tags) full of lists of strings. To convert it to a pandas Series I used

def flatten(series):
    return pd.Series(series.dropna().sum())

tags_sorted = flatten(df['tags'])
Now I want to search the series for a string within one of the lists so that it returns the number of times that string occurs within the column. I found this function:
def find(series, tag):
    for i in series.index:
        if series[i] == tag:
            return i
    return None
and used it on my series:
print(find(tags_sorted, 'romance'))
but it keeps returning None even though the string is definitely in multiple lists.
I also tried
print(tags_sorted[tags_sorted == "romance"])
and
print(tags_sorted.loc[tags_sorted == 'romance'])
but those only return [].

I believe you need to change the find function to the following if you want to count how many times the specific string occurs:

def find(series, tag):
    times_occurred = 0
    for i in series.index:
        if series[i] == tag:
            times_occurred += 1
    return times_occurred
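As a side note, a vectorized sketch of the same count (assuming tags_sorted is the flattened Series from the question; the sample data below is made up purely for illustration):

import pandas as pd

tags_sorted = pd.Series(['romance', 'action', 'romance', 'drama'])  # stand-in data

# Comparing the whole Series at once gives a boolean mask; summing it counts the matches
print((tags_sorted == 'romance').sum())   # 2

# value_counts() gives the count for every tag in one call
print(tags_sorted.value_counts())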

Related

make a copy of a string column and cut the string based on certain value

I have a DataFrame with a column with installation KKS-codes in Python.
The KKS-codes look like this:
1BLA43AA030
1BOR53AR021
1BHY28UI021
I want to make a new column where the string only has the relevant information. Sometimes the code requires a number, but usually it doesn't. The required number is given after the 3-letter code which specifies the certain object, like this:
BLA
BOR
BHY2
I cut the full KKS-codes with
df_1['KKS'] = df_1.Object.str[1:4]
but for certain strings I need it to be
df_1['KKS'] = df_1.Object.str[1:5]
My if-statements don't work, please help
I don't fully understand what you mean by

The required number is given after the 3-letter code which specifies the certain object.
If you can explain this further with examples I can help more. Otherwise, this is how you can apply a function to a row in a dataframe:
import pandas as pd

def test_for_four(s: str) -> bool:
    # Look at the character right after the three-letter code (position 4 in the full KKS-code)
    third_digit_letter = s[4]
    if third_digit_letter != "2":
        return True
    return False

def split_kks_code(s: str) -> str:
    # Keep three characters by default, four when the extra digit is needed
    if test_for_four(s):
        return s[1:4]
    return s[1:5]

df = pd.DataFrame([{'KKS-Code': '1BLA43AA030'},
                   {'KKS-Code': '1BOR53AR021'},
                   {'KKS-Code': '1BHY28UI021'}])
df['KKS'] = df['KKS-Code'].apply(split_kks_code)
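If that rule (take four characters whenever the character after the three-letter code is a "2") is really what you need, a vectorized sketch with numpy.where avoids apply entirely; it assumes the same df and 'KKS-Code' column as above:

import numpy as np

df['KKS'] = np.where(df['KKS-Code'].str[4] == '2',
                     df['KKS-Code'].str[1:5],   # keep the extra digit
                     df['KKS-Code'].str[1:4])   # default three-letter code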

Parse URL Parameters into separate Columns

I have a dataframe with a column of URLs that I would like to parse into new columns, with rows based on the value of a specified parameter if it is present in the URL. I am using a function that loops through each row in the dataframe column and parses the specified URL parameter, but when I try to select the column after the function has finished I get a KeyError. Should I be setting the value of this new column in a different manner? Is there a more effective approach than looping through the values in my table and running this process?
Error:
KeyError: 'utm_source'
Example URLs (df['landing_page_url']):
https://lp.example.com/test/lp
https://lp.example.com/test/ny/?utm_source=facebook&ref=test&utm_campaign=ny-newyork_test&utm_term=nice
https://lp.example.com/test/ny/?utm_source=facebook
NaN
https://lp.example.com/test/la/?utm_term=lp-test&utm_source=facebook
Code:
import pandas as pd
import numpy as np
import math
from urllib.parse import parse_qs, urlparse

def get_query_field(url, field):
    if isinstance(url, str):
        try:
            return parse_qs(urlparse(url).query)[field][0]
        except KeyError:
            return ''
    else:
        return ''

for i in df['landing_page_url']:
    print(i)                                 # prints the URL
    print(get_query_field(i, 'utm_source'))  # prints the proper values
    df['utm_source'] == get_query_field(i, 'utm_source')
    df['utm_campaign'] == get_query_field(i, 'utm_campaign')
    df['utm_term'] == get_query_field(i, 'utm_term')
I don't think your for loop will work. It looks like each time through, it will overwrite the entire column you are trying to set. I wanted to test the speed against my method, but I'm nearly certain this will be faster than iterating.
# Simplify the function here as recommended by Nick
def get_query_field(url, field):
    if isinstance(url, str):
        return parse_qs(urlparse(url).query).get(field, [''])[0]
    return ''

# Use apply to create new columns based on the url
df['utm_source'] = df['landing_page_url'].apply(get_query_field, args=['utm_source'])
df['utm_campaign'] = df['landing_page_url'].apply(get_query_field, args=['utm_campaign'])
df['utm_term'] = df['landing_page_url'].apply(get_query_field, args=['utm_term'])
Instead of

try:
    return parse_qs(urlparse(url).query)[field][0]
except KeyError:
    return ''

You can just do:

return parse_qs(urlparse(url).query).get(field, [''])[0]
The trick here is my_dict.get(key, default) instead of my_dict[key]. The default will be returned if the key doesn't exist.
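A quick illustrative snippet of the difference (the query string here is just a made-up example):

from urllib.parse import parse_qs

params = parse_qs('utm_source=facebook&ref=test')
print(params.get('utm_source', [''])[0])   # 'facebook'
print(params.get('utm_term', [''])[0])     # '' -- missing key falls back to the default
# params['utm_term'][0] would raise KeyError instead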
Is there a more effective approach than looping through the values in my table and running this process?
Not really. Looping through each URL is going to have to be done either way. Right now, though, you are overwriting the column for every URL, meaning that if two different URLs have different sources in the query, the last one in the list will win. I have no idea if this is intentional or not.
Also note: this line

df['utm_source'] == get_query_field(i, 'utm_source')

is not actually doing anything. == is a comparison operator ("does the left side match the right side?"). You probably meant to use = or df.append({'utm_source': get_query_field(..)})
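If you did want to keep an explicit loop instead of apply, a per-row assignment sketch (rather than assigning to the whole column each time) could look like this; it assumes the same df and get_query_field as above:

for idx, url in df['landing_page_url'].items():
    df.loc[idx, 'utm_source'] = get_query_field(url, 'utm_source')
    df.loc[idx, 'utm_campaign'] = get_query_field(url, 'utm_campaign')
    df.loc[idx, 'utm_term'] = get_query_field(url, 'utm_term')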

In pandas Series data, how do you get the keys based on data the function returns?

I have a working script that creates an array of each line of text in a file. This data is passed to a pandas Series(). The function startswith("\n") is used to return boolean True or False for each string, to determine if it begins with \n (a blank line).
I am currently using a counter i and a conditional statement to iterate through and match the position that the startswith() function is returning.
import pandas as pd
import numpy as np

f = open('list-of-strings.txt','r')
lines = []
for line in f.xreadlines():
    lines.append(line)

s = pd.Series(lines)

i = 0
for b in s.str.startswith("\n"):
    if b == 0:
        print s[i],; i += 1
    else:
        i += 1
I've realized I am looking at this from two different approaches. One being to directly handle each item as it is evaluated by the startswith() function. Since the startswith() function returns boolean values, it is possible to allow direct handling of data based on the values returned. Something like: for each item in startswith(), if the value returned is True, index = current_index, print s[index].
In addition to being able to print only the strings that are evaluated as False by startswith(), how would I get the current key value from startswith()?
References:
https://www.tutorialspoint.com/python_pandas/python_pandas_series.htm
https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm
Your question seems actually simpler than the one in the title. You're trying to get the indices for the values for which some predicate evaluated positively, not pass the index to a function.
In Pandas, the last block
i = 0
for b in s.str.startswith("\n"):
    if b == 0:
        print s[i],; i += 1
    else:
        i += 1
is equivalent to
print(s[~s.str.startswith('\n')].values)
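If you also want the keys (the index labels) of the matching rows rather than just the values, a small sketch using the same s and boolean mask:

mask = ~s.str.startswith('\n')
print(s.index[mask])   # the keys whose strings do not start with a newline
print(s[mask].values)  # the corresponding strings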
Moreover, you don't need Pandas for this at all:
print(''.join([l for l in open('list-of-strings.txt','r') if not l.startswith('\n')]))
should replace your entire block of code from the question.

return a list in a method that is referencing a PD Dataframe

Is there any way to return a list or tuple when referencing a pandas DF? get_df() is a pandas column with a couple hundred float values. The code below is asking to return the values greater than 6000 and less than 7000. Can I return a list to my method? (I know I can print this but that is not what I am trying to do)
def mass_needed(numb_one, numb_two):
    for i in get_df():
        if i > numb_one and i < numb_two:
            return(i)

print(mass_needed(6000, 7000))
What I am trying to accomplish: I want to be able to call mass_needed() and get a list of values that I can print or manipulate just like a normal list.
In case anyone cares, I figured it out. I had to append the values as they were being iterated through.
def mass_needed(numb_one, numb_two):
    li = []
    for i in get_df():
        if i > numb_one and i < numb_two:
            li.append(i)
    return li

x = pd.DataFrame(mass_needed(6000, 7000))
print(x)
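For what it's worth, a vectorized sketch of the same filter (assuming get_df() returns a pandas Series of floats, as described in the question) skips the manual loop:

def mass_needed(numb_one, numb_two):
    s = get_df()
    # Boolean indexing keeps only the values strictly between the two bounds
    return s[(s > numb_one) & (s < numb_two)].tolist()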

Django querying using array for column value

I'm making a search function for my website that breaks the entered text by spaces and then checks each word with __contains. I want to expand this so that I can pass in which columns I want it to check __contains against, such as "First name", "Last Name", etc.
What I have now:
def getSearchQuery(search, list, columns=None):
    """
    Breaks up the search string and makes a query list
    Filters the given list based on the query list
    """
    if not columns:
        columns = { name }
    search = search.strip('\'"').split(" ")
    for col in columns:
        queries = [Q(col__contains=value) for value in search]
        query = queries.pop()
        for item in queries:
            query |= item
    return list.filter(query)
The issue is that Q(col__contains=value) doesn't work, as "col" is not a column. Is there some way to tell Django that this is a variable and not the actual column name? I have tried googling this but honestly don't know how to phrase it without posting all my code.
Do it this way:
import operator
from functools import reduce
from django.db.models import Q

def getSearchQuery(search, list, columns=None):
    """
    Breaks up the search string and makes a query list
    Filters the given list based on the query list
    """
    if not columns:
        return list
    search = search.strip('\'"').split(" ")
    queries = []
    for col in columns:
        # Passing a (lookup, value) tuple lets the field name be built from a variable
        queries.extend([Q((col + '__icontains', value)) for value in search])
    return list.filter(reduce(operator.or_, queries))
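A hypothetical usage sketch (the Person model and field names here are assumptions for illustration, not from the question):

# Assumes a Django model Person with first_name and last_name fields
queryset = Person.objects.all()
results = getSearchQuery('john smith', queryset, columns=['first_name', 'last_name'])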
