How to combine queries with a single external variable using Pandas - python

I am trying to accept a variable input of many search terms separated by commas via an html form (@search) and query 2 columns of a dataframe.
Each column query works on its own, but I cannot get them to work together in an and/or way.
First column query:
filtered = df.query('`Drug Name` in @search')
Second column query:
filtered = df.query('BP.str.contains(@search, na=False)', engine='python')
edit
combining like this:
filtered = df.query("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")
Gives the following error, highlighting the python identifier in the engine argument
SyntaxError: Python keyword not valid identifier in numexpr query
edit 2
The dataframe is read from an excel file, with columns:
Drug Name (containing a single drug name), BP, U&E (with long descriptive text entries)
The search terms will be input via html form:
search = request.values.get('searchinput').replace(" ","").split(',')
as a list of drugs which a patient may be on, sometimes with the addition of specific conditions relating to medication use. Sample user input:
Captopril, Paracetamol, kidney disease, chronic
I want the list to be checked against specific drug names and also to check other columns such as BP and U&E for any mention of the search terms.
edit 3
Apologies, but trying to implement the answers given gives me stacks of errors. What I have below gives me 90% of what I'm after, letting me search both columns, including the whole contents of 'BP'. But I can only search a single term via the terminal; if I comment out and swap the lines which collect the user input (taking it from the html form as opposed to the terminal) I get:
TypeError: unhashable type: 'list'
@app.route('/', methods=("POST", "GET"))
def html_table():
    searchterms = []
    # searchterms = request.values.get('searchinput').replace(" ","").split(',')
    searchterms = input("Enter drug...")
    filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
    return render_template('drugsafety.html', tables=[filtered.to_html(classes='data')], titles=['na', 'Drug List'])
<form action="" method="post">
<p><label for="search">Search</label>
<input type="text" name="searchinput"></p>
<p><input type="submit"></p>
</form>
Sample data
The contents of the BP column can be quite long, descriptive and variable but an example is:
Every 12 months – Patients with CKD every 3 to 6 months.
Drug Name        BP                           U&E
Perindopril      Every 12 months              Not needed
Alendronic Acid  Not needed                   Every 12 months
Allopurinol      Whilst titrating - 3 months  Not needed
With this line:
searchterms = request.values.get('searchinput')
Entering 'months' into the html form outputs:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – 3 months Not needed
All good.
Entering 'Alendronic Acid' into the html form outputs:
13 Alendronic Acid Not needed Every 12 months
Also good, but entering 'Perindopril, Allopurinol' returns nothing.
If I change the line to:
searchterms = request.values.get('searchinput').replace(" ","").split(',')
I get TypeError: unhashable type: 'list' when the page reloads.
However - If I then change:
filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
to:
filtered = df.query('`Drug Name` in @searchterms')
Then the unhashable type error goes and entering 'Perindopril, Allopurinol'
returns:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – Every 3 months Not needed
But I'm now no longer searching the BP column for the searchterms.
Just thought that maybe it's because searchterms is a list '[]', so I changed it to a tuple '()'. Didn't change anything.
Any help is much appreciated.

I am assuming you want to query 2 columns and return the row if either query matches.
In the line below, the issue is that engine='python' is inside the query string.
filtered = df.query("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")
It should be
df.query("BP.str.contains(@search, na=False)", engine='python')
If you do searchterms = request.values.get('searchinput').replace(" ","").split(','), it converts your string to a list of words, which causes the unhashable type: 'list' error, because str.contains expects a str as input.
What you can do is use regex to search for search terms in list, it will look something like this:
df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')
What this does is search for all the individual words using regex: '|'.join(@search) produces "searchterm_1|searchterm_2|...", and "|" means "or" in regex, so it looks for searchterm_1 or searchterm_2 in the BP column value.
To combine the outputs of both queries, you can run those separately and concatenate the results
pd.concat([df.query("`Drug Name` in @search", engine='python'),
           df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')])
Also, any string-based matching requires your strings to match exactly, including case, so you may want to lowercase everything in both the dataframe and the query. Similarly, watch out for space-separated terms: if you do searchterms = request.values.get('searchinput').replace(" ","").split(',') on Every 12 months, it gets converted to "Every12months". So you may want to drop the .replace() part and just use searchterms = request.values.get('searchinput').split(',')
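Putting that together, here is a minimal runnable sketch (the sample rows are invented from the question's data; note the regex pattern is precomputed outside the query string, a small variation that avoids relying on method calls inside the expression):

```python
import pandas as pd

# sample data modelled on the question (assumed, for illustration)
df = pd.DataFrame({'Drug Name': ['Perindopril', 'Alendronic Acid', 'Allopurinol'],
                   'BP': ['Every 12 months', 'Not needed', 'Whilst titrating - 3 months']})

search = ['Perindopril', 'months']   # parsed search terms
pattern = '|'.join(search)           # regex alternation: 'Perindopril|months'

# keep a row if the drug name is in the list OR the BP text mentions any term
filtered = df.query('`Drug Name` in @search or BP.str.contains(@pattern, na=False)',
                    engine='python')
print(filtered)
```

This returns the Perindopril and Allopurinol rows (matched on name and on "months" respectively) and drops Alendronic Acid.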

Use sets. You can change the text columns to sets and check for intersection with the input. The rest is pure pandas. I never use .query because it is slow.
# change your search from list to set
search = set(request.values.get('searchinput').replace(" ","").split(','))
filtered = df.loc[df['Drug Name'].str.split().map(lambda x: bool(set(x).intersection(search)))
                  & df['BP'].str.split().map(lambda x: bool(set(x).intersection(search)))]
print(filtered)
Demo:
import pandas as pd

search = set(["apple", "banana", "orange"])
df = pd.DataFrame({
    "Drug Name": ["I am eating an apple", "We are banana", "nothing is here"],
    "BP": ["apple is good", "nothing is here", "nothing is there"],
    "Just": [1, 2, 3]
})
filtered = df.loc[df['Drug Name'].str.split().map(lambda x: bool(set(x).intersection(search)))
                  & df['BP'].str.split().map(lambda x: bool(set(x).intersection(search)))]
print(filtered)
# Drug Name BP Just
# 0 I am eating an apple apple is good 1
Updated:
I would want the results to also show We are banana, nothing is here and 2
That requires or, which is pandas' |, instead of and, which is pandas' &
filtered = df.loc[df['Drug Name'].str.split().map(lambda x: bool(set(x).intersection(search)))
                  | df['BP'].str.split().map(lambda x: bool(set(x).intersection(search)))]
print(filtered)
# Drug Name BP Just
# 0 I am eating an apple apple is good 1
# 1 We are banana nothing is here 2
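Note that the set-intersection test is exact and case-sensitive. A minimal sketch of a case-insensitive variant (demo data in the same spirit as above, lowercasing both sides before intersecting):

```python
import pandas as pd

# lowercase the search terms once, up front
search = {s.lower() for s in ["Apple", "BANANA"]}

df = pd.DataFrame({
    "Drug Name": ["I am eating an apple", "We are banana", "nothing is here"],
    "BP": ["apple is good", "nothing is here", "nothing is there"],
})

# lowercase each cell before splitting into word sets
mask = (df['Drug Name'].str.lower().str.split().map(lambda x: bool(set(x) & search))
        | df['BP'].str.lower().str.split().map(lambda x: bool(set(x) & search)))
filtered = df.loc[mask]
print(filtered)
```

Here "Apple" and "BANANA" still match the lowercase rows, so the first two rows are returned.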

If you want to search for text in all columns, you can first join all columns, and then check for search terms in each row using str.contains and the regular expression pattern that matches at least one of the terms (term1|term2|...|termN). I've also added flags=re.IGNORECASE to make the search case-insensitive:
# search function (requires `import re` for the IGNORECASE flag)
import re

def search(searchterms):
    return df.loc[df.apply(' '.join, axis=1)             # join text in all columns
                    .str.contains(                       # check if it contains
                        '|'.join([                       # regex pattern
                            x.strip()                    # strip spaces
                            for x in searchterms.split(',')  # split by ','
                        ]), flags=re.IGNORECASE)]        # case-insensitive
# test search terms
for s in ['Alendronic Acid', 'months', 'Perindopril, Allopurinol']:
    print(f'Search terms: "{s}"')
    print(search(s))
    print('-'*70)
Output:
Search terms: "Alendronic Acid"
Drug Name BP U&E
1 Alendronic Acid Not needed Every 12 months
----------------------------------------------------------------------
Search terms: "months"
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
1 Alendronic Acid Not needed Every 12 months
2 Allopurinol Whilst titrating - 3 months Not needed
----------------------------------------------------------------------
Search terms: "Perindopril, Allopurinol"
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
2 Allopurinol Whilst titrating - 3 months Not needed
----------------------------------------------------------------------
P.S. If you want to limit search to specific columns, here's a version that does that (with the default of searching all columns for convenience):
# search function
def search(searchterms, cols=None):
    # search columns (if None, searches in all columns)
    if cols is None:
        cols = df.columns
    return df.loc[df[cols].apply(' '.join, axis=1)       # join text in cols
                    .str.contains(                       # check if it contains
                        '|'.join([                       # regex pattern
                            x.strip()                    # remove spaces
                            for x in searchterms.split(',')  # split by ','
                        ]), flags=re.IGNORECASE)]        # case-insensitive
Now if I search for months only in Drug Name and BP, it will not return Alendronic Acid where months is only found in U&E:
search('months', ['Drug Name', 'BP'])
Output:
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
2 Allopurinol Whilst titrating - 3 months Not needed

Without having sample input data, I used a randomly generated dataset as a showcase:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Drug_Name': ['Drug1', 'Drug2', 'Drug3', 'Drug2', 'Drug5', 'Drug3']*4,
                   'Inv_Type': ['X', 'Y']*12,
                   'Quant': np.random.randint(2, 20, size=24)})

# Search 1
search = "Drug3"
df.query('Drug_Name == @search')

# Search 2
search2 = "Y"
df.query('Inv_Type.str.contains(@search2, na=False)', engine='python')

# Combined (use bitwise operators such as & or | instead of and / or)
df.query('Drug_Name == @search & Inv_Type.str.contains(@search2, na=False)')
Please note that engine='python' should be avoided as stated in the documentation:
Likewise, you can pass engine='python' to evaluate an expression using
Python itself as a backend. This is not recommended as it is
inefficient compared to using numexpr as the engine.
That said, if you are hell-bent on using it, you can do it like this:
mask = df["Inv_Type"].str.contains(search2, na=False)
df.query('Drug_Name == @search & @mask')
Alternatively, you can achieve the same without using .query() at all:
df[(df['Drug_Name']==search) & df['Inv_Type'].str.contains(search2, na=False)]

Related

How to use writeStream from a pyspark streaming dataframe to chop the values into different columns?

I am trying to ingest some files and each of them is being read as a single column string (which is expected since it is a fixed-width file) and I have to split that single value into different columns. This means that I must access the dataframe but I must use writeStream since it is a streaming dataframe. This is an example of the input:
"64 Apple 32.32128Orange12.1932 Banana 2.45"
Expected dataframe:
64, Apple, 32.32
128, Orange, 12.19
32, Banana, 2.45
Notice how every column has a fixed width (3, 6 and 5 characters respectively; this is what META_SIZES holds). Therefore each row has 14 characters (the sum of the column widths).
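The fixed-width slicing step itself can be sketched in plain Python, independent of Spark (META_SIZES and the sample string are taken from above):

```python
s = "64 Apple 32.32128Orange12.1932 Banana 2.45"
META_SIZES = [3, 6, 5]       # fixed widths of the three columns
row_len = sum(META_SIZES)    # 14 characters per logical row

rows = []
for start in range(0, len(s), row_len):
    chunk = s[start:start + row_len]   # one logical row
    fields, pos = [], 0
    for size in META_SIZES:
        fields.append(chunk[pos:pos + size].strip())  # slice and trim padding
        pos += size
    rows.append(fields)

print(rows)
# [['64', 'Apple', '32.32'], ['128', 'Orange', '12.19'], ['32', 'Banana', '2.45']]
```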
I tried using forEach as the following example but it is not doing anything:
two_d = []

streamingDF = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

def process_row(string):
    rows = round(len(string) / chars_per_row)
    for i in range(rows):
        current_index = 0
        two_d.append([])
        for j in range(len(META_SIZES)):
            two_d[i].append(string[(i*chars_per_row + current_index):(i*chars_per_row + current_index + META_SIZES[j])].strip())
            current_index += META_SIZES[j]
        print(two_d[i])

query = streamingDF.writeStream.foreach(process_row).start()
I will probably do a withColumn to add them instead of the list or use that list and make it a streaming dataframe if possible and better.
Edit: I added an input example and explained META_SIZES
Assuming the inputs are something like the following.
...
"64 Apple 32.32"
"128 Orange 12.19"
"32 Banana 2.45"
...
You can do this.
from pyspark.sql import functions

streamingDF = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

# remove this line if strings are already utf-8
lines = streamingDF.select(streamingDF['value'].cast('string'))

lengths = (lines.withColumn('Count', functions.split(lines['value'], ' ').getItem(0))
                .withColumn('Fruit', functions.split(lines['value'], ' ').getItem(1))
                .withColumn('Price', functions.split(lines['value'], ' ').getItem(2)))
Note that "value" is set as the default column name when reading a string using readStream. If cloudfiles_config contains anything changing the column name of the input, you will have to alter the column name in the code above.

Argument 'string' has incorrect type (expected str, got list) Spacy NLP

I want to calculate cosine similarity, but I got an error message after converting the dataframe column to a list: Argument 'string' has incorrect type (expected str, got list).
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")
df = [['24, Single, Consultant, Canada, I am interested in visiting Isreal again'],
      ['18, Single, Student, I want to go back Costa Rica again'],
      ['45, Married, Unemployed, I want to take my family to Florida for the summer vacation']]
df = pd.DataFrame(df, columns=['Free Text'])
df["N_Application"] = range(0, len(df))
# convert dataframe column to list
data = df['Free Text'].tolist()
df_spacy = nlp(data)
I'd appreciate someone helping me fix it. Thank you.
The way you get a function to operate across an entire pd.Series is to use .apply(). And you can chain .apply() calls.
Example:
# changing to strings instead of nested list
l = ['24, Single, Consultant, Canada, I am interested in visiting Isreal again',
     '18, Single, Student, I want to go back Costa Rica again',
     '45, Married, Unemployed, I want to take my family to Florida for the summer vacation']

# remove stop words and punctuation for later similarity calculations
df_spacy = df['Free Text'].apply(nlp)\
    .apply(lambda doc: nlp(' '.join(str(t)
                                    for t in doc
                                    if not t.is_stop
                                    and not t.is_punct)))
Edit: per your comment, here is a similarity calculation between each row and all other rows:
df_spacy.apply(lambda row: df_spacy
               .apply(lambda doc: row.similarity(doc) if row != doc else None))
Resulting similarity matrix:
0 1 2
0 NaN 0.776098 0.716560
1 0.776098 NaN 0.705024
2 0.716560 0.705024 NaN

spacy stemming on pandas df column not working

How to apply stemming on Pandas Dataframe column
I am using this function for stemming, which works perfectly on a string:
xx = 'kenichan dived times ball managed save 50 rest'

def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = str(token.lemma_)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    print(" ".join(x_list))

make_to_base(xx)
But when I apply this function to my pandas dataframe column, it doesn't work, and it doesn't give any error either:
x = list(df['text'])  # my df column
x = str(x)  # converting into string, otherwise it gives an error
make_to_base(x)
I've tried different things but nothing works, e.g.:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
my dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value built inside the make_to_base method; use:
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma = str(token.lemma_)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))

parsing dataframe columns containing functions

Python/pandas newbie here. The csv file I'm trying to work with has been populated with data that looks something like this:
A B C D
Option1(item1=12345, item12='string', item345=0.123) 2020-03-16 1.234 Option2(item4=123, item56=234, item678=345)
I'd like it to look like this:
item1 item12 item345 B C item4 item56 item678
12345 'string' 0.123 2020-03-16 1.234 123 234 345
In other words, I want to replace columns A and D with new columns headed by what's on the left of the equal sign, using what's to the right of the equal sign as the corresponding value, and with the Option1() and Option2() parts and the commas stripped out. The columns that don't contain functions should be left as is.
Is there an elegant way to do this?
Actually, at this point, I'd settle for any old way, elegant or not; I've found various ways of dealing with this situation if, say, there were dicts populating columns, but nothing to help me pick it apart if there are functions there. Trying to search for the answer only gives me a bunch of results for how to apply functions to dataframes.
As long as your functions always have the same arguments, this should work.
You can read the csv with (if separators are 2 or more spaces, that's what I get when I paste your question example):
df = pd.read_csv('test.csv', sep=r'\s{2,}', index_col=False, engine='python')
If your dataframe is df:
# break out both sides of the equal sign in function into columns
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')
# get rid of the multi-index and put the values after '=' into columns
A_converted = A_vals.unstack(level=-1)[1]
# set column names to values before '='
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])
# same thing for 'D'
D_vals = df['D'].str.extractall(r'([\w\d]+)=([^,\)]*)')
D_converted = D_vals.unstack(level=-1)[1]
D_converted.columns = list(D_vals.unstack(level=-1)[0].values[0])
# join everything together
df = A_converted.join(df.drop(['A','D'], axis=1)).join(D_converted)
Some clarification on the regex '([\w\d]+)=([^,\)]*)' has two capture groups (each part in parens):
Group 1 ([\w\d]+) is one or more characters (+) that are word characters \w or numbers \d.
= between groups.
Group 2 ([^,\)]*) is 0 or more characters (*) that are not (^) a comma , or paren \).
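As a quick sanity check of the extract-and-unstack steps on a single made-up row (the cell content is taken from the question's example):

```python
import pandas as pd

df = pd.DataFrame({'A': ["Option1(item1=12345, item12='string', item345=0.123)"]})

# each 'name=value' pair becomes one match row (group 0 = name, group 1 = value)
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')

# pivot the matches into columns and keep the values (group 1)
A_converted = A_vals.unstack(level=-1)[1]
# rename the columns to the names (group 0) from the first row
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])

print(A_converted)
# columns item1, item12, item345 holding 12345, 'string', 0.123 (as strings)
```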
I believe you're looking for something along these lines:
contracts = ["Option(conId=384688665, symbol='SPX', lastTradeDateOrContractMonth='20200116', strike=3205.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 200117P03205000', tradingClass='SPX')",
             "Option(conId=12345678, symbol='DJX', lastTradeDateOrContractMonth='20200113', strike=1205.0, right='P', multiplier='200', exchange='SMART', currency='USD', localSymbol='DJXX 333117Y13205000', tradingClass='DJX')"]

new_conts = []
columns = []
for i in range(len(contracts)):
    mod = contracts[i].replace('Option(', '').replace(')', '')
    contracts[i] = mod
    new_cont = contracts[i].split(',')
    new_conts.append(new_cont)

for contract in new_conts:
    column = []
    for i in range(len(contract)):
        mod = contract[i].split('=')
        contract[i] = mod[1]
        column.append(mod[0])
    columns.append(column)

print(len(columns[0]))
df = pd.DataFrame(new_conts, columns=columns[0])
df
Output:
conId symbol lastTradeDateOrContractMonth strike right multiplier exchange currency localSymbol tradingClass
0 384688665 'SPX' '20200116' 3205.0 'P' '100' 'SMART' 'USD' 'SPX 200117P03205000' 'SPX'
1 12345678 'DJX' '20200113' 1205.0 'P' '200' 'SMART' 'USD' 'DJXX 333117Y13205000' 'DJX'
Obviously you can then delete unwanted columns, change names, etc.

Use contains to merge data frame

I have two separate files: one from our service providers and the other internal (HR).
The service providers write the names of our employees in different ways: some write firstname lastname, some the first letter of the first name plus the last name, some lastname firstname... while the HR file stores the first and last names separately.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
& df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advise on that please?
Regards
If you are just checking whether a name exists or not, this could be useful:
Because it is rare for two people to have exactly the same family name, I recommend splitting your Df1 and comparing family names first; then, to be sure, you can compare first names too.
You can easily do it with a for loop:
for i in range('your index'):
    if df1_splitted[i].str.contains('family you are searching for'):
        print("yes")
if you need to compare in other aspects just let me know
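A minimal sketch of that idea across both frames (data copied from the question; the matching rule, case-insensitive last name plus first initial, is the asker's own, so misspellings like Nickolson vs Nicklson simply won't match):

```python
import pandas as pd

df1 = pd.DataFrame({'Full Name': ['B.pitt', 'Mr Nickolson Jacl', 'Johnny, Deep', 'Streep Meryl']})
df2 = pd.DataFrame({'First': ['Brad', 'Jack', 'Johnny', 'Streep'],
                    'Last': ['Pitt', 'Nicklson', 'Deep', 'Meryl']})

matches = {}
for _, row in df2.iterrows():
    initial = row['First'][0].lower()   # first letter of the first name
    last = row['Last'].lower()          # full last name
    names = df1['Full Name'].str.lower()
    # a candidate must contain the last name and the first initial somewhere
    hit = df1.loc[names.str.contains(last, regex=False)
                  & names.str.contains(initial, regex=False), 'Full Name']
    if not hit.empty:
        matches[row['First'] + ' ' + row['Last']] = hit.iloc[0]

print(matches)
# {'Brad Pitt': 'B.pitt', 'Johnny Deep': 'Johnny, Deep', 'Streep Meryl': 'Streep Meryl'}
```

Jack Nicklson finds no match because of the spelling difference, which is exactly the gap the nameparser and fuzzy-matching answers below this one try to close.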
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames :
from nameparser import HumanName
import pandas as pd

df1 = pd.DataFrame({'Full Name': ['B.pitt', 'Mr Nickolson Jack', 'Johnny, Deep', 'Streep Meryl']})
df2 = pd.DataFrame({'First': ['Brad', 'Jack', 'Johnny', 'Streep'],
                    'Last': ['Pitt', 'Nicklson', 'Deep', 'Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(str(row['First']) + " " + str(row['Last'])) for i, row in df2.iterrows()]
After that you can try comparing HumanName instances, which have parsed fields. It looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: '' ]
I have used this approach for processing thousands of names and merging them to same names from other documents and results were good.
More about module can be found at https://nameparser.readthedocs.io/en/latest/
Hey, you could use fuzzy string matching with fuzzywuzzy (pip install fuzzywuzzy, then from fuzzywuzzy import fuzz).
First create Full Name for df2
df2_ = df2[['First', 'Last']].agg(lambda a: a[0] + ' ' + a[1], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio so you get the similarity
merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
