Check presence of multiple keywords and create another column using python - python

I have a data frame as shown below
df = pd.DataFrame({'meds': ['Calcium Acetate','insulin GLARGINE -- LANTUS - inJECTable','amoxicillin 1 g + clavulanic acid 200 mg ','digoxin - TABLET'],
'details':['DOSE: 667 mg - TDS with food - Inject','DOSE: 12 unit(s) - ON - SC (SubCutaneous)','-- AUGMENTIN - inJECTable','DOSE: 62.5 mcg - Every other morning - PO'],
'extracted':['Calcium Acetate 667 mg Inject','insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)','amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN','digoxin - TABLET 62.5 mcg PO/Tube']})
df['concatenated'] = df['meds'] + " "+ df['details']
What I would like to do is:
a) Check whether all of the individual keywords from the extracted column are present in the concatenated column.
b) If they are all present, assign 1 to the output column, else 0.
c) Store any keyword that was not found in the issue column, as shown below.
So, I was trying something like below
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
#the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ') #split them into keywords
def value_present(row): #check whether each of the keyword is present in `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0
df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()
If you think it's useful to clean the concatenated column as well, that's fine. I am only interested in finding the presence of all keywords.
Is there any efficient and elegant approach to do this on 7-8 million records?
I expect my output to look like the one shown below. Red indicates a term that is missing between the extracted and concatenated columns, so the output is 0 and the keyword is stored in the issue column.

Let us zip the columns extracted and concatenated and map each pair to a function f, which computes the set difference of their whitespace-separated tokens and returns the result accordingly:
import numpy as np

def f(x, y):
    # tokens that occur in `extracted` but not in `concatenated`
    s = set(x.split()) - set(y.split())
    return [0, ', '.join(s)] if s else [1, np.nan]

df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]
   output    issue
0       1      NaN
1       1      NaN
2       1      NaN
3       0  PO/Tube
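For reference, here is a self-contained sketch of the same set-difference idea using the sample data from the question. The helper name find_missing is hypothetical, and the missing tokens are sorted only so that the order inside the issue column is reproducible (a plain set has no guaranteed order).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "meds": ["Calcium Acetate",
             "insulin GLARGINE -- LANTUS - inJECTable",
             "amoxicillin 1 g + clavulanic acid 200 mg ",
             "digoxin - TABLET"],
    "details": ["DOSE: 667 mg - TDS with food - Inject",
                "DOSE: 12 unit(s) - ON - SC (SubCutaneous)",
                "-- AUGMENTIN - inJECTable",
                "DOSE: 62.5 mcg - Every other morning - PO"],
    "extracted": ["Calcium Acetate 667 mg Inject",
                  "insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)",
                  "amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN",
                  "digoxin - TABLET 62.5 mcg PO/Tube"],
})
df["concatenated"] = df["meds"] + " " + df["details"]


def find_missing(extracted, concatenated):
    """Hypothetical helper: tokens of `extracted` absent from `concatenated`."""
    return sorted(set(extracted.split()) - set(concatenated.split()))


# One list of missing tokens per row (empty list means everything was found).
missing = [find_missing(e, c) for e, c in zip(df["extracted"], df["concatenated"])]
df["output"] = [0 if m else 1 for m in missing]
df["issue"] = [", ".join(m) if m else np.nan for m in missing]
```

A plain list comprehension over zip avoids the per-row overhead of DataFrame.apply, which is likely to matter on several million rows.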

Related

retrieve cell string values in a column between two unknown indexes based on substrings location

I need to locate the first position where the word 'then' appears in the Words table. I'm trying to write code that concatenates all strings in the 'text' column from this position up to the first text containing the substring '666' or '999' (in this case a combination of their, stoma22, fe156, sligh334, pain666; the desired subtrings_output = 'theirfe156sligh334pain666').
I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition for 666_999_loc to match either substring, 666 or 999. Slicing the index between the two variables also raises an error. Many thanks.
Words table:
page no  text      font
1        they      0
1        ate       0
1        apples    0
2        and       0
2        then      1
2        their     0
2        stoma22   0
2        fe156     1
2        sligh334  0
2        pain666   1
2        given     0
2        the       1
3        fruit     0
You just need to add one to the end of the slice (slices exclude their end position), and add an "or" condition to the np.where call for the 666-or-999 location using the | operator.
import numpy as np

text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
                                   text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
    Return the row label of the maximum value.
    If multiple values equal the maximum, the first row label with that value is returned.
So, assuming Words is your dataframe (with the default integer index), try this:
their_loc = Words["text"].str.contains("their").idxmax()
# "666|999" is a regex alternation, so either substring matches
_666_999_loc = Words["text"].str.contains("666|999").idxmax()
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])
Output:
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
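Both answers look for the end marker anywhere in the column. A minimal sketch of a variant that only searches from the start position onward is shown here; the function name join_between and its behavior of returning an empty string when either marker is missing are assumptions made for illustration, not part of either answer.

```python
import pandas as pd

words = pd.DataFrame({
    "text": ["they", "ate", "apples", "and", "then", "their", "stoma22",
             "fe156", "sligh334", "pain666", "given", "the", "fruit"],
})


def join_between(text, start_pattern, end_pattern):
    """Hypothetical helper: join strings from the first match of
    start_pattern through the first later match of end_pattern."""
    is_start = text.str.contains(start_pattern, na=False).to_numpy()
    if not is_start.any():
        return ""
    start = int(is_start.argmax())  # position of the first True

    # Only look for the end marker at or after the start position.
    is_end = text.iloc[start:].str.contains(end_pattern, na=False).to_numpy()
    if not is_end.any():
        return ""
    end = start + int(is_end.argmax())

    # iloc slices are end-exclusive, hence the + 1.
    return "".join(text.iloc[start:end + 1].dropna())


result = join_between(words["text"], "their", "666|999")
```

Working with positions (iloc) rather than labels keeps the slice correct even when the dataframe does not have the default integer index.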

Update Python dictionary with appended list value

I have a dataframe with price quotes for a variety of parts and makers. ~10k parts and 10 makers, so my dataset contains up to 100k rows, looking roughly like this:
Part  Maker    Price
1     Alpha    1.00
2     Alpha    1.30
3     Alpha    1.25
1     Bravo    1.10
2     Bravo    1.02
3     Bravo    1.15
4     Bravo    1.19
1     Charlie  .99
2     Charlie  1.10
3     Charlie  1.12
4     Charlie  1.19
I want to build two dictionaries based on the best price: one mapping Part to Maker and one mapping Part to Price. My main issue is when two makers have the same best price.
I want my result to end up like this:
1:.99
2:1.1
3: 1.02
4:1.19
and the second one to be:
1:Charlie
2: Charlie
3: Bravo
4: [Bravo, Charlie]
The first dictionary is easy. Second one is what I'm stuck on. Here's what I have so far:
winning_price_dict={}
winning_mfg_dict={}
for index, row in quote_df.iterrows():
    if row['Part'] not in winning_price_dict:
        winning_price_dict[row['Part']] = row['Proposed Quote']
        winning_mfg_dict[row['Part']] = list(row['Maker'])
    if winning_price_dict[row['Part']]>row['Proposed Quote']:
        winning_price_dict[row['Part']] = row['Proposed Quote']
        winning_mfg_dict[row['Part']] = row['Maker']
    if winning_price_dict[row['Part']]==row['Proposed Quote']:
        winning_price_dict[row['Part']] = row['Proposed Quote']
        winning_mfg_dict[row['Part']] = winning_mfg_dict[row['Part']].append(row['Maker']) #this is the only line that I don't believe works
When I run it as is, it says 'str' object has no attribute 'append'. However, I thought that it should be a list because of the list(row['Maker']) command.
When I change the relevant lines to this:
for index, row in quote_df.iterrows():
    if row['Part'] not in winning_price_dict:
        winning_mfg_dict[row['Part']] = list(row['Mfg'])
    if winning_price_dict[row['Part']]>row['Proposed Quote']:
        winning_mfg_dict[row['Part']] = list(row[['Mfg']])
    if winning_price_dict[row['Part']]==row['Proposed Quote']:
        winning_mfg_dict[row['Part']] = list(winning_mfg_dict[row['Part']]).append(row['Mfg'])
The winning_mfg_dict is all the part numbers and NoneType values, not the maker names.
What do I need to change to get it to return the list of suitable makers?
Thanks!
In your original code, the actual problem was on line 9 of the first fragment: you set the value to a string, not to a list. Also, calling list(some_string) does not do what you expect: it creates a list of single characters, not [some_string]. In the second fragment, list.append modifies the list in place and returns None, which is why assigning its result filled the dictionary with NoneType values.
I took the liberty of improving the overall readability by extracting common keys into variables and merging the two branches that had the same body. Something like this should work:
winning_price_dict = {}
winning_mfg_dict = {}
for index, row in quote_df.iterrows():
    # Extract variables, saving a few accesses and reducing line lengths
    part = row['Part']
    quote = row['Proposed Quote']
    maker = row['Maker']
    if part not in winning_price_dict or winning_price_dict[part] > quote:
        # First time here or a lower price found - reset to initial
        winning_price_dict[part] = quote
        winning_mfg_dict[part] = [maker]
    elif winning_price_dict[part] == quote:
        # Add one more maker with the same price
        # Not updating winning_price_dict - we already know it's proper
        winning_mfg_dict[part].append(maker)
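The two pitfalls behind the original errors can be checked in isolation. This is a small illustrative snippet with made-up variable names, not part of the answer's code:

```python
maker = "Bravo"

# list() iterates over its argument, so a string is split into characters.
as_chars = list(maker)

# A list literal wraps the whole string as a single element.
as_single_item = [maker]

# append() mutates the list in place and returns None, so its result
# should never be assigned back to the dictionary entry.
result_of_append = as_single_item.append("Charlie")
```

After running it, as_chars holds the five letters of "Bravo", as_single_item holds both maker names, and result_of_append is None.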
You can use groupby to get all quotes for one part:
best_quotes = quote_df.groupby("part").apply(lambda df: df[df.price == df.price.min()])
Then you get a dataframe with the part number and the previous index as a MultiIndex. The lambda function selects only the quotes with the minimum price.
You can get the first dictionary with
winning_price_dict = {part : price for (part, _), price in best_quotes.price.items()}
(Series.items replaces iteritems, which was removed in pandas 2.0) and the second one with
winning_mfg_dict = {part:list(best_quotes.loc[part]["maker"]) for part in best_quotes.index.get_level_values("part")}
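As a self-contained sketch of the groupby idea, the snippet below uses the sample table from the question, with the column names Part, Maker and Price taken from that table (the question's loop uses 'Proposed Quote' instead, so adjust the names to your data). It swaps apply for transform("min"), which is an alternative technique and not the code from the answer above.

```python
import pandas as pd

quote_df = pd.DataFrame({
    "Part":  [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4],
    "Maker": ["Alpha", "Alpha", "Alpha", "Bravo", "Bravo", "Bravo", "Bravo",
              "Charlie", "Charlie", "Charlie", "Charlie"],
    "Price": [1.00, 1.30, 1.25, 1.10, 1.02, 1.15, 1.19, 0.99, 1.10, 1.12, 1.19],
})

# Broadcast each part's minimum price back onto its rows.
best_price = quote_df.groupby("Part")["Price"].transform("min")

# Keep only the rows that match the minimum (ties keep several rows).
best_quotes = quote_df[quote_df["Price"] == best_price]

# Part -> best price
winning_price_dict = best_quotes.groupby("Part")["Price"].first().to_dict()

# Part -> list of makers offering that price (always a list, even for one maker)
winning_mfg_dict = best_quotes.groupby("Part")["Maker"].agg(list).to_dict()
```

Storing every value as a list, including single winners, keeps the dictionary uniform so downstream code does not have to distinguish strings from lists.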

How to read and modify a (.gct) file using python?

Which libraries would help me read a .gct file in Python and edit it, for example by removing the rows with NaN values? And how would the following code change if I applied it to a .gct file?
data = pd.read_csv('PAAD1.csv')
new_data = data.dropna(axis = 0, how ='any')
print("Old data frame length:", len(data), "\nNew data frame length:",
      len(new_data), "\nNumber of rows with at least 1 NA value: ",
      (len(data)-len(new_data)))
new_data.to_csv('EditedPAAD.csv')
You should use the cmapPy package for this. Compared to read_csv, it gives you more freedom and domain-specific utilities. For example, suppose your *.gct file looks like this:
#1.2
22215 2
Name Description Tumor_One Normal_One
1007_s_at na -0.214548 -0.18069
1053_at "RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2|" 0.868853 -1.330921
117_at na 1.124814 0.933021
121_at PAX8 : paired box gene 8 |#PAX8| -0.825381 0.102078
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.734896 -0.184104
1294_at UBE1L : ubiquitin-activating enzyme E1-like |#UBE1L| -0.366741 -1.209838
1316_at "THRA : thyroid hormone receptor, alpha (erythroblastic leukemia viral (v-erb-a) oncogene homolog, avian) |#THRA|" -0.126108 1.486972
1320_at "PTPN21 : protein tyrosine phosphatase, non-receptor type 21 |#PTPN21|" 3.083681 -0.086705
...
You can extract only rows with a desired probeset id (row id), e.g. ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at UBE1L']
So to read a file, remove the nan in the description and save it again, do:
from cmapPy.pandasGEXpress.parse_gct import parse
from cmapPy.pandasGEXpress.write_gct import write
data = parse('example.gct', rid=['1007_s_at', '1053_at',
                                 '117_at', '121_at',
                                 '1255_g_at', '1294_at UBE1L'])
# remove nan values from row_metadata (description column)
data.row_metadata_df.dropna(inplace=True)
# remove the entries of .data_df where nan values are in row_metadata
data.data_df = data.data_df.loc[data.row_metadata_df.index]
# Can only write GCT version 1.3
write(data, 'new_example.gct')
The new_example.gct looks then like this:
#1.3
3 2 1 0
id Description Tumor_One Normal_One
1053_at RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2| 0.8689 -1.3309
121_at PAX8 : paired box gene 8 |#PAX8| -0.8254 0.1021
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.7349 -0.1841
Quick search in google will give you the following:
https://pypi.org/project/cmapPy/
Regarding the code, if you don't care about the metadata in the first two rows, it seems to work for your purpose, but you should first indicate that the delimiter is TAB and skip the first two rows: pandas.read_csv(PATH_TO_GCT_FILE, sep='\t', skiprows=2)
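A minimal pandas-only sketch of that suggestion follows. It reads a tiny made-up GCT 1.2 text from memory instead of a real file, so the sample rows and the variable names are assumptions. Note that pandas does not treat the literal text "na" in the Description column as missing by default, so only rows with an empty or NaN cell are dropped here.

```python
import io

import pandas as pd

# Made-up GCT 1.2 content: version line, "rows<TAB>samples" line, header, data.
gct_text = (
    "#1.2\n"
    "3\t2\n"
    "Name\tDescription\tTumor_One\tNormal_One\n"
    "1007_s_at\tna\t-0.214548\t-0.18069\n"
    "1053_at\tRFC2\t0.868853\t\n"          # Normal_One is missing in this row
    "117_at\tna\t1.124814\t0.933021\n"
)

# Skip the two metadata lines and parse the rest as a tab-separated table.
data = pd.read_csv(io.StringIO(gct_text), sep="\t", skiprows=2)
new_data = data.dropna(axis=0, how="any")

# Write the result back with a refreshed two-line header so that the
# row and sample counts match the filtered table.
n_rows = len(new_data)
n_samples = new_data.shape[1] - 2   # all columns except Name and Description
out = io.StringIO()
out.write("#1.2\n")
out.write(f"{n_rows}\t{n_samples}\n")
new_data.to_csv(out, sep="\t", index=False)
gct_out = out.getvalue()
```

With a real file, replace io.StringIO(gct_text) with the path to the .gct file and write gct_out to disk with an ordinary open(..., "w") call.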

How to consider other rows in a dataFrame when filtering?

I am trying to filter (and consequently change) certain rows in pandas that depend on values in other columns. Say my dataFrame looks like this:
SENT ID WORD POS HEAD
1 1 I PRON 2
1 2 like VERB 0
1 3 incredibly ADV 4
1 4 brown ADJ 5
1 5 sugar NOUN 2
2 1 Here ADV 2
2 2 appears VERB 0
2 3 my PRON 5
2 4 next ADJ 5
2 5 sentence NOUN 0
The structure is such that the 'HEAD' column points to the index of the word on which the row depends. For example, if 'brown' depends on 'sugar', then the head of 'brown' is 5, because the index of 'sugar' is 5.
I need to extract a df of all the rows in which the POS is ADV and whose head's POS is VERB, so 'Here' will be in the new df but not 'incredibly' (and potentially make changes to their WORD entry).
At the moment I'm doing it with a loop, but I don't think it's the pandas way and it also creates problems further down the road. Here is my current code (the split("-") is from another story - ignore it):
def get_head(df, dependent):
    head = dependent
    target_index = int(dependent['HEAD'])
    if target_index == 0:
        return dependent
    else:
        if target_index < int(dependent['INDEX']):
            # 1st int in cell
            while (int(head['INDEX'].split("-")[0]) > target_index):
                head = data.iloc[int(head.name) - 1]
        elif target_index > int(dependent['INDEX']):
            while int(head['INDEX'].split("-")[0]) < target_index:
                head = data.iloc[int(head.name) + 1]
        return head
A difficulty I had when I wrote this function is that I didn't (at the time) have a column 'SENTENCE' so I had to manually find the nearest head. I hope that adding the SENTENCE column should make things somewhat easier, though it is important to note that as there are hundreds of such sentences in the df, simply searching for an index '5' won't do, since there are hundreds of rows where df['INDEX']=='5'.
Here is an example of how I use get_head():
def change_dependent(extract_col, extract_value, new_dependent_pos, head_pos):
    name = 0
    sub_df = df[df[extract_col] == extract_value] #this is another condition on the df.
    for i, v in sub_df.iterrows():
        if (get_head(df, v)['POS'] == head_pos):
            df.at[v.name, 'POS'] = new_dependent_pos
    return df

change_dependent('POS', 'ADV', 'ADV:VERB', 'VERB')
Can anyone here think of a more elegant/efficient/pandas way in which I can get all the ADV instances whose head is VERB?
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame([[1,1,'I','PRON',2],
                   [1,2,'like','VERB',0],
                   [1,3,'incredibly','ADV',4],
                   [1,4,'brown','ADJ',5],
                   [1,5,'sugar','NOUN',2],
                   [2,1,'Here','ADV',2],
                   [2,2,'appears','VERB',0],
                   [2,3,'my','PRON',5],
                   [2,4,'next','ADJ',5],
                   [2,5,'sentence','NOUN',0]],
                  columns=['SENT','ID','WORD','POS','HEAD'])
# All adverbs
adv=df[df['POS']=='ADV']
# Join each verb to the adverbs in the same sentence whose HEAD equals the verb's ID
temp=df[df['POS']=='VERB'][['SENT','ID','POS']].merge(adv,left_on=['SENT','ID'],right_on=['SENT','HEAD'])
temp['WORD']
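If the goal is also to change the matching rows in the original dataframe, one possible sketch is a left self-merge that attaches each word's head POS while keeping the row order, so the result can be used as a mask. The column name HEAD_POS and the variable names are assumptions introduced for this example.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 'I', 'PRON', 2],
                   [1, 2, 'like', 'VERB', 0],
                   [1, 3, 'incredibly', 'ADV', 4],
                   [1, 4, 'brown', 'ADJ', 5],
                   [1, 5, 'sugar', 'NOUN', 2],
                   [2, 1, 'Here', 'ADV', 2],
                   [2, 2, 'appears', 'VERB', 0],
                   [2, 3, 'my', 'PRON', 5],
                   [2, 4, 'next', 'ADJ', 5],
                   [2, 5, 'sentence', 'NOUN', 0]],
                  columns=['SENT', 'ID', 'WORD', 'POS', 'HEAD'])

# Lookup table: for every (SENT, ID) pair, the POS of that word,
# renamed so that it can be joined on the dependents' HEAD column.
heads = df[['SENT', 'ID', 'POS']].rename(columns={'ID': 'HEAD', 'POS': 'HEAD_POS'})

# A left merge keeps one row per original row, in the original order.
# Rows with HEAD == 0 have no match and get NaN in HEAD_POS.
with_head = df.merge(heads, on=['SENT', 'HEAD'], how='left')

mask = ((with_head['POS'] == 'ADV') & (with_head['HEAD_POS'] == 'VERB')).to_numpy()

# Copy of the matching rows, taken before the update below.
adv_with_verb_head = df[mask].copy()

# In-place update of the original dataframe, as change_dependent() did.
df.loc[mask, 'POS'] = 'ADV:VERB'
```

Joining on both SENT and HEAD is what keeps identical word indices in different sentences apart, which is the problem the loop-based version had to work around.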

new column based on specific string info from two different columns Python Pandas

I have 2 columns, I need to take specific string information from each column and create a new column with new strings based on this.
In column "Name" I have well names. I need to look at the last 4 characters of each well name and, if they contain "H", label the well "HZ" in a new column.
I need to do the same thing if the column "WELLTYPE" contains specific words.
Using the data analysis program Spotfire, I can do this all in one simple expression (see below).
case
    When right([UWI],4)~="H" Then "HZ"
    When [WELLTYPE]~="Horizontal" Then "HZ"
    When [WELLTYPE]~="Deviated" Then "D"
    When [WELLTYPE]~="Multilateral" Then "ML"
    else "V"
End
What would be the best way to do this in Python Pandas?
Is there a simple, clean way to do this all at once, like in the Spotfire expression above?
Here is the datatable with the two columns and my hopeful outcome column. (it did not copy very well into this), I also provide the code for the table below.
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
Here is the code to create the dataframe
# Dataframe with hopeful outcome
raw_data = {'Name': ['HH-001HST2', 'HH-001HST', 'HB-002H', 'HB-002', 'HB-002','HB-004','HB-005','BB-007'],
'WELLTYPE':['Oil Horizontal', 'Oil_Horizontal', 'Oil', 'Water_Deviated', 'Oil_Multilateral','Oil','Source','Water'],
'What I Want': ['HZ', 'HZ', 'HZ', 'D', 'ML','V','V','V']}
df = pd.DataFrame(raw_data, columns = ['Name','WELLTYPE','What I Want'])
df
Nested 'where' variant:
import numpy as np

df['What I Want'] = np.where(df.Name.str[-4:].str.contains('H'), 'HZ',
                    np.where(df.WELLTYPE.str.contains('Horizontal'), 'HZ',
                    np.where(df.WELLTYPE.str.contains('Deviated'), 'D',
                    np.where(df.WELLTYPE.str.contains('Multilateral'), 'ML',
                             'V'))))
Using apply by row:
def criteria(row):
    # `in` also matches at position 0, which `find(...) > 0` would miss
    if 'H' in row['Name'][-4:]:
        return 'HZ'
    elif 'Horizontal' in row['WELLTYPE']:
        return 'HZ'
    elif 'Deviated' in row['WELLTYPE']:
        return 'D'
    elif 'Multilateral' in row['WELLTYPE']:
        return 'ML'
    else:
        return 'V'

df['want'] = df.apply(criteria, axis=1)
This feels more natural to me. Obviously subjective
from_name = df.Name.str[-4:].str.contains('H').map({True: 'HZ'})
regex = '(Horizontal|Deviated|Multilateral)'
m = dict(Horizontal='HZ', Deviated='D', Multilateral='ML')
from_well = df.WELLTYPE.str.extract(regex, expand=False).map(m)
df['What I Want'] = from_name.fillna(from_well).fillna('V')
print(df)
         Name          WELLTYPE What I Want
0  HH-001HST2    Oil Horizontal          HZ
1   HH-001HST    Oil_Horizontal          HZ
2     HB-002H               Oil          HZ
3      HB-002    Water_Deviated           D
4      HB-002  Oil_Multilateral          ML
5      HB-004               Oil           V
6      HB-005            Source           V
7      BB-007             Water           V
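Another option that reads much like the Spotfire case expression is numpy.select, which takes a list of conditions and a parallel list of results and picks the first condition that holds for each row. This is a sketch using the sample data from the question; the output column name well_class is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['HH-001HST2', 'HH-001HST', 'HB-002H', 'HB-002',
             'HB-002', 'HB-004', 'HB-005', 'BB-007'],
    'WELLTYPE': ['Oil Horizontal', 'Oil_Horizontal', 'Oil', 'Water_Deviated',
                 'Oil_Multilateral', 'Oil', 'Source', 'Water'],
})

# Conditions are checked in order, like the When branches of a case expression.
# na=False makes missing values count as "no match".
conditions = [
    df['Name'].str[-4:].str.contains('H', na=False),
    df['WELLTYPE'].str.contains('Horizontal', na=False),
    df['WELLTYPE'].str.contains('Deviated', na=False),
    df['WELLTYPE'].str.contains('Multilateral', na=False),
]
choices = ['HZ', 'HZ', 'D', 'ML']

# default plays the role of the else branch.
df['well_class'] = np.select(conditions, choices, default='V')
```

Compared with nested np.where calls, adding another rule only means appending one entry to each list.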
