Pandas: remove characters based on conditions in a DataFrame - python

I have a DF that looks like this:
ids
-----------
cat-1,paws
dog-2,paws
bird-1,feathers,fish
cows-2,bird_3
.
.
.
I need to remove all the ids that have a - or _ in the dataframe. So the final data frame should be:
ids
-----------
paws
paws
feathers,fish
.
.
.
I've tried using lambda like this:
df['ids'] = df['ids'].apply(lambda x: x.replace('cat-1', '').replace('dog-2', '' )...)
But this is not a scalable solution and I would need to add all the ids with dashes and underscores into the above. What would be a more scalable/efficient solution?

You can use a regex pattern:
df.ids.str.replace(r'\w*[-_]\w*,?', '', regex=True)
Output:
0 paws
1 paws
2 feathers,fish
3
Name: ids, dtype: object
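For a self-contained run of that answer, here is a minimal sketch; regex=True is spelled out because recent pandas versions no longer treat the str.replace pattern as a regular expression by default:
import pandas as pd

df = pd.DataFrame({'ids': ['cat-1,paws', 'dog-2,paws',
                           'bird-1,feathers,fish', 'cows-2,bird_3']})

# drop every token that contains '-' or '_', together with a trailing comma
df['ids'] = df['ids'].str.replace(r'\w*[-_]\w*,?', '', regex=True)
print(df)
#                  ids
# 0               paws
# 1               paws
# 2      feathers,fish
# 3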

Related

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Where each two-letter code preceding each bubbled section ( ) is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the bubbles. I want the data to look like:
pn    io   ta   pt    cn    cs
2021  302  Yes  Blue  John  Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got an error: missing ), unterminated subpattern at position 2. I assume it has something to do with regular expressions.
Is there an easy way to split these? Thanks.
Use Series.str.extract with named capturing groups:
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then construct a dataframe out of it. It can be built in one nested comprehension:
import re
rows = [{r[0]: r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values
Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
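For reference, a quick run of that idea on a two-row frame (the second row is an assumption, mirroring the extractall answer below):
import re
import pandas as pd

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
                                "pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})

# non-greedy capture of each two-letter code and its parenthesised value
rows = [{k: v for k, v in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
print(subtable)
#      pn   io   ta    pt    cn   cs
# 0  2021  302  Yes  Blue  John  Doe
# 1  2020  301   No   Red  Jane  Doe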
David
Try with extractall:
names = df["Creative"].str.extractall("(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall("\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs:
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, with a different way of packaging the final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall(r'([^(]+)\(([^)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])
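Running that snippet should produce a one-row frame:
print(df)
#      pn   io   ta    pt    cn   cs
# 0  2021  302  Yes  Blue  John  Doe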

How to combine queries with a single external variable using Pandas

I am trying to accept a variable input of many search terms separated by commas via an html form (#search) and query 2 columns of a dataframe.
Each column query works on its own but I cannot get them to work together in a and/or way.
First column query:
filtered = df.query('`Drug Name` in @search')
Second column query:
filtered = df.query('BP.str.contains(@search, na=False)', engine='python')
edit
combining like this:
filtered = df.query ("('`Drug Name` in #search') and ('BP.str.contains(#search, na=False)', engine='python')")
This gives the following error, highlighting the python identifier in the engine argument:
SyntaxError: Python keyword not valid identifier in numexpr query
edit 2
The dataframe is read from an excel file, with columns:
Drug Name (containing a single drug name), BP, U&E (with long descriptive text entries)
The search terms will be input via html form:
search = request.values.get('searchinput').replace(" ","").split(',')
as a list of drugs which a patient may be on, sometimes with the addition of specific conditions relating to medication use. Sample user input:
Captopril, Paracetamol, kidney disease, chronic
I want the list to be checked against specific drug names and also to check other columns such as BP and U&E for any mention of the search terms.
edit 3
Apologies, but trying to implement the answers given is giving me stacks of errors. What I have below is giving me 90% of what I'm after, letting me search both columns including the whole contents of 'BP'. But I can only search a single term via the terminal; if I comment out and swap the lines which collect the user input (taking it from the html form as opposed to the terminal) I get:
TypeError: unhashable type: 'list'
@app.route('/', methods=("POST", "GET"))
def html_table():
    searchterms = []
    #searchterms = request.values.get('searchinput').replace(" ","").split(',')
    searchterms = input("Enter drug...")
    filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
    return render_template('drugsafety.html', tables=[filtered.to_html(classes='data')], titles=['na', 'Drug List'])
<form action="" method="post">
<p><label for="search">Search</label>
<input type="text" name="searchinput"></p>
<p><input type="submit"></p>
</form>
Sample data
The contents of the BP column can be quite long, descriptive and variable but an example is:
Every 12 months – Patients with CKD every 3 to 6 months.
Drug Name        BP                           U&E
Perindopril      Every 12 months              Not needed
Alendronic Acid  Not needed                   Every 12 months
Allopurinol      Whilst titrating - 3 months  Not needed
With this line:
searchterms = request.values.get('searchinput')
Entering 'months' into the html form outputs:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – 3 months Not needed
All good.
Entering 'Alendronic Acid' into the html form outputs:
13 Alendronic Acid Not needed Every 12 months
Also good, but entering 'Perindopril, Allopurinol' returns nothing.
If I change the line to:
searchterms = request.values.get('searchinput').replace(" ","").split(',')
I get TypeError: unhashable type: 'list' when the page reloads.
However - If I then change:
filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
to:
filtered = df.query('`Drug Name` in @searchterms')
Then the unhashable type error goes away, and entering 'Perindopril, Allopurinol'
returns:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – Every 3 months Not needed
But I'm now no longer searching the BP column for the searchterms.
Just thought that maybe it's because searchterms is a list '[]', so I changed it to a tuple '()'. That didn't change anything.
Any help is much appreciated.
I am assuming you want to query 2 columns and want to return the row if either query matches.
In this line, the issue is that engine='python' is inside the query string.
filtered = df.query ("('`Drug Name` in #search') and ('BP.str.contains(#search, na=False)', engine='python')")
It should be
df.query("BP.str.contains(#search, na=False)", engine='python')
If you do searchterms = request.values.get('searchinput').replace(" ","").split(','), it converts your string to a list of words, which will cause the unhashable type: 'list' error because str.contains expects a str as input.
What you can do is use regex to search for the search terms in the list; it will look something like this:
df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')
What this does is search for all the individual words using regex ('|'.join(@search) will be "searchterm_1|searchterm_2|...", and "|" represents or in regex, so it looks for searchterm_1 or searchterm_2 in the BP column value).
To combine the outputs of both queries, you can run them separately and concatenate the results:
pd.concat([df.query("`Drug Name` in @search", engine='python'), df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')])
Also, any string-based matching will require your strings to match exactly, including case, so you may want to lowercase everything in both the dataframe and the query. Similarly, note that your preprocessing removes spaces from space-separated terms:
if you do searchterms = request.values.get('searchinput').replace(" ","").split(',') on Every 12 months, it will get converted to "Every12months", so you may want to remove the .replace() part and just use searchterms = request.values.get('searchinput').split(',').
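Putting those suggestions together, a rough sketch (the sample frame and the hard-coded form value are assumptions standing in for the Flask request):
import pandas as pd

df = pd.DataFrame({
    "Drug Name": ["Perindopril", "Alendronic Acid", "Allopurinol"],
    "BP": ["Every 12 months", "Not needed", "Whilst titrating - 3 months"],
    "U&E": ["Not needed", "Every 12 months", "Not needed"],
})

raw = "Perindopril, Allopurinol, months"          # stand-in for the form value
search = [t.strip() for t in raw.split(',')]      # keep internal spaces, strip edges
pattern = '|'.join(search)                        # regex alternation of the terms

name_hits = df.query("`Drug Name` in @search")
bp_hits = df.query("BP.str.contains(@pattern, na=False)", engine='python')
filtered = pd.concat([name_hits, bp_hits]).drop_duplicates()
print(filtered)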
Use sets. You can change the text columns to sets and check for intersection with the input. The rest is pure pandas. I never use .query because it is slow.
# change your search from list to set
search = set(request.values.get('searchinput').replace(" ","").split(','))
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  & (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
Demo:
import pandas as pd
search = set(["apple", "banana", "orange"])
df = pd.DataFrame({
    "Drug Name": ["I am eating an apple", "We are banana", "nothing is here"],
    "BP": ["apple is good", "nothing is here", "nothing is there"],
    "Just": [1, 2, 3]
})
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  & (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
# Drug Name BP Just
# 0 I am eating an apple apple is good 1
Updated:
I would want the results to also show We are banana, nothing is here and 2
That requires or, which is pandas' |, instead of and, which is pandas' &.
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  | (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
# Drug Name BP Just
# 0 I am eating an apple apple is good 1
# 1 We are banana nothing is here 2
If you want to search for text in all columns, you can first join all columns, and then check for search terms in each row using str.contains and the regular expression pattern that matches at least one of the terms (term1|term2|...|termN). I've also added flags=re.IGNORECASE to make the search case-insensitive:
import re

# search function
def search(searchterms):
    return df.loc[df.apply(' '.join, axis=1)              # join text in all columns
                    .str.contains(                        # check if it contains
                        '|'.join([                        # regex pattern
                            x.strip()                     # strip spaces
                            for x in searchterms.split(',')  # split by ','
                        ]), flags=re.IGNORECASE)]         # case-insensitive

# test search terms
for s in ['Alendronic Acid', 'months', 'Perindopril, Allopurinol']:
    print(f'Search terms: "{s}"')
    print(search(s))
    print('-'*70)
Output:
Search terms: "Alendronic Acid"
Drug Name BP U&E
1 Alendronic Acid Not needed Every 12 months
----------------------------------------------------------------------
Search terms: "months"
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
1 Alendronic Acid Not needed Every 12 months
2 Allopurinol Whilst titrating - 3 months Not needed
----------------------------------------------------------------------
Search terms: "Perindopril, Allopurinol"
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
2 Allopurinol Whilst titrating - 3 months Not needed
----------------------------------------------------------------------
P.S. If you want to limit search to specific columns, here's a version that does that (with the default of searching all columns for convenience):
# search function
def search(searchterms, cols=None):
    # search columns (if None, searches in all columns)
    if cols is None:
        cols = df.columns
    return df.loc[df[cols].apply(' '.join, axis=1)        # join text in cols
                    .str.contains(                        # check if it contains
                        '|'.join([                        # regex pattern
                            x.strip()                     # remove spaces
                            for x in searchterms.split(',')  # split by ','
                        ]), flags=re.IGNORECASE)]         # make search case-insensitive
Now if I search for months only in Drug Name and BP, it will not return Alendronic Acid where months is only found in U&E:
search('months', ['Drug Name', 'BP'])
Output:
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
2 Allopurinol Whilst titrating - 3 months Not needed
Without having sample input data, I used a randomly generated dataset as a showcase:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Drug_Name': ['Drug1','Drug2','Drug3','Drug2','Drug5','Drug3']*4,
                   'Inv_Type': ['X', 'Y']*12,
                   'Quant': np.random.randint(2, 20, size=24)})
# Search 1
search = "Drug3"
df.query('Drug_Name == @search')
# Search 2
search2 = "Y"
df.query('Inv_Type.str.contains(@search2, na=False)', engine='python')
# Combined (use boolean operators such as & or | instead of and or or)
df.query('(Drug_Name == @search) & Inv_Type.str.contains(@search2, na=False)')
Please note that engine='python' should be avoided as stated in the documentation:
Likewise, you can pass engine='python' to evaluate an expression using
Python itself as a backend. This is not recommended as it is
inefficient compared to using numexpr as the engine.
That said, if you are hell-bent on using it, you can do it like this:
mask = df["Inv_Type"].str.contains(search2, na=False)
df.query('(Drug_Name == @search) & @mask')
Alternatively, you can achieve the same without using .query() at all:
df[(df['Drug_Name']==search) & df['Inv_Type'].str.contains(search2, na=False)]
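If the external variable is a list of terms rather than a single string (as in the original question), the same no-query approach could be written roughly like this; the isin call for exact matches and the example term list are assumptions building on the showcase above:
terms = ['Drug3', 'Drug5']      # list of drug names, standing in for the form input
search2 = "Y"

mask = df['Drug_Name'].isin(terms) & df['Inv_Type'].str.contains(search2, na=False)
print(df[mask])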

Use contains to merge data frame

I have two separate files, one from our service providers and the other is internal (HR).
The service providers write the names of our employees in different ways: some use the firstname lastname format, some the first letter of the first name plus the last name, others lastname firstname... while the HR file stores the first and last name separately.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
& df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advice on that please?
Regards
If you are just checking whether it exists or not, this could be useful:
Because it is rare to have two identical family names, I recommend just splitting your DF1 and comparing the family names; then, to be sure, you can compare the first names too.
You can easily do it with a for loop:
for i in range(len(df1_splitted)):
    if df1_splitted[i].str.contains('family you are searching for').any():
        print("yes")
If you need to compare other aspects, just let me know.
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames:
from nameparser import HumanName
import pandas as pd
df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(f"{row['First']} {row['Last']}") for _, row in df2.iterrows()]
After that you can try comparing HumanName instances, which have parsed fields. It looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: '' ]
I have used this approach for processing thousands of names and merging them with the same names from other documents, and the results were good.
More about module can be found at https://nameparser.readthedocs.io/en/latest/
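For illustration, one (assumed) way to line the two frames up with the parsed fields is to match on last name plus the first initial, which covers spellings like "B.pitt"; whether it catches every entry depends on how HumanName parses each string and on misspellings such as "Nicklson", so treat it as a starting point:
def key(name):
    # crude matching key: first initial + lowercased last name
    return (name.first or '')[:1].lower(), (name.last or '').lower()

lookup = {key(n): i for i, n in enumerate(names2)}
for i, n in enumerate(names1):
    j = lookup.get(key(n))
    if j is not None:
        print(df1.loc[i, 'Full Name'], '->', df2.loc[j, 'First'], df2.loc[j, 'Last'])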
Hey, you could use fuzzy string matching with fuzzywuzzy (from fuzzywuzzy import fuzz).
First create Full Name for df2
df2_ = df2[['First', 'Last']].agg(lambda a: a[0] + ' ' + a[1], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio so you get the similarity
merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
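If the two files are not guaranteed to be row-aligned (the index merge above assumes they are), fuzzywuzzy's process.extractOne can pick the best candidate for each name; the cutoff of 80 is just an assumption:
from fuzzywuzzy import process

choices = df1['Full Name'].tolist()
best = df2_['Full Name'].map(lambda name: process.extractOne(name, choices))
df2_['best_match'], df2_['similarity'] = zip(*best)
matched = df2_[df2_['similarity'] >= 80]
print(matched)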

Pandas Drop Specified Duplicates After Concat

I'm trying to write a python script that concatenates two csv files and then drops the duplicate rows. Here is an example of the csv's I'm concatenating:
csv_1
type state city date estimate id
lux tx dal 2019/08/15 .8273452 10
sed ny ny 2019/05/12 .624356 10
cou cal la 2013/04/24 .723495 10
. . . . . .
. . . . . .
csv_2
type state city date estimate id
sed col den 2013/05/02 .7234957 232
sed mi det 2015/11/17 .4249357 232
lux nj al 2009/02/29 .627234 232
. . . . . .
. . . . . .
As of now, my code to concat these two together looks like this:
csv_1 = pd.read_csv('csv_1.csv')
csv_2 = pd.read_csv('csv_2.csv')
union_df = pd.concat([csv_1, csv_2])
union_df.drop_duplicates(subset=['type', 'state', 'city', 'date'], inplace=True, keep='first')
Is there any way I can ensure only rows with id = 232 are deleted and none with id = 10 are? Just a way to specify only rows from the second csv are removed from the concatenated csv?
Thank you
Use duplicated and boolean logic:
union_df.loc[~(union_df.duplicated(subset=['type','state','city','date'], keep='first') & (union_df['id'] == 232))]
Instead of directly dropping the duplicates using the drop_duplicates method, I would recommend using the duplicated method. The latter works the same way as the first, but it returns a boolean vector indicating which rows are duplicated. Once you call it, you can combine its output with the id to achieve your purpose. Take a look below.
csv_1 = pd.read_csv('csv_1.csv')
csv_2 = pd.read_csv('csv_2.csv')
union_df = pd.concat([csv_1, csv_2])
union_df["dups"]= union_df.duplicated(subset=['type', 'state', 'city', 'date'],
inplace=True, keep='first')
union_df = union_df.loc[lambda d: ~((d.dups) & (d.id==232))]
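A minimal sketch with made-up rows (the overlapping lux/tx/dal row and its .9 estimate are assumptions for the demo) shows that only the id=232 copy of the duplicate is dropped:
import pandas as pd

csv_1 = pd.DataFrame({'type': ['lux', 'sed'], 'state': ['tx', 'ny'],
                      'city': ['dal', 'ny'], 'date': ['2019/08/15', '2019/05/12'],
                      'estimate': [.8273452, .624356], 'id': [10, 10]})
csv_2 = pd.DataFrame({'type': ['lux', 'sed'], 'state': ['tx', 'col'],
                      'city': ['dal', 'den'], 'date': ['2019/08/15', '2013/05/02'],
                      'estimate': [.9, .7234957], 'id': [232, 232]})

union_df = pd.concat([csv_1, csv_2])
dups = union_df.duplicated(subset=['type', 'state', 'city', 'date'], keep='first')
print(union_df.loc[~(dups & (union_df['id'] == 232))])
# all id=10 rows remain; only the duplicated id=232 row is removed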

Pandas: read_table remove comment lines with '##' but not '#<string>'?

I have some large tab separated data sets that have long commented sections, followed by the table header, formatted like this:
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 2985885 . c G . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/0:0/0:202:36,166,0,0:0,202,0,0:255:225:0:36:60:60:0:. 0/1:0/1:321:29,108,37,147:0,137,184,0:228:225:228:36,36:60:60,60:2:225
chr1 3312963 . C T . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/1:0/1:80:36,1,43,0:0,37,0,43:80:195:80:36,31:60:60,60:1:. 0/0:0/0:143:138,5,0,0:0,143,0,0:255:195:255:36:60:60:3:57
Everything that starts with ## is a comment that needs to be stripped out, but I need to keep the header that starts with #CHROM. Is there any way to do this? The only options I am seeing for Pandas read_table allow only a single character for the comment string, and I do not see options for regular expressions.
The code I am using is this:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',comment='#')
This removes all lines that start with #, including the header I want to keep
EDIT: For clarification, the header region starting with ## is of variable length. In bash this would simply be grep -Ev '^##'.
You can easily calculate the number of header lines that must be skipped when reading your file:
fn = '/path/to/file.csv'
skip_rows = 0
with open(fn, 'r') as f:
    for line in f:
        if line.startswith('##'):
            skip_rows += 1
        else:
            break
df = pd.read_table(fn, sep='\t', skiprows=skip_rows)
The first part will read only the header lines - so it should be very fast
use skiprows as a workaround:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',skiprows=3)
df
Out[13]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Then rename your first column to remove the #.
Update:
As you said, the length of your ## section varies, so I know the hard-coded skiprows is not a feasible solution, but you can drop all rows starting with # and then pass the column headers as a list, since your columns don't change:
name=['CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO' ,'FORMAT','NORMAL','TUMOR']
df=pd.read_table(SS_txt_file,sep='\t',comment='#',names=name)
df
Out[34]:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
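Since the question mentions grep -Ev '^##', another option (not from the answers above, just a sketch) is to filter the lines yourself and hand the result to read_table via io.StringIO; the #CHROM header survives because only lines starting with ## are dropped:
import io
import pandas as pd

fn = '/path/to/file.txt'   # placeholder path
with open(fn) as f:
    kept = ''.join(line for line in f if not line.startswith('##'))

df = pd.read_table(io.StringIO(kept), sep='\t')
df = df.rename(columns=lambda c: c.lstrip('#'))   # '#CHROM' -> 'CHROM'
print(df.columns.tolist())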
