Which libraries would help me read a .gct file in Python and edit it, for example by removing the rows with NaN values? And how would the following code change if I applied it to a .gct file?
data = pd.read_csv('PAAD1.csv')
new_data = data.dropna(axis = 0, how ='any')
print("Old data frame length:", len(data), "\nNew data frame length:",
len(new_data), "\nNumber of rows with at least 1 NA value: ",
(len(data)-len(new_data)))
new_data.to_csv('EditedPAAD.csv')
You should use the cmapPy package for this. Compared to read_csv, it gives you more freedom and domain-specific utilities. For example, if your *.gct looks like this:
#1.2
22215 2
Name Description Tumor_One Normal_One
1007_s_at na -0.214548 -0.18069
1053_at "RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2|" 0.868853 -1.330921
117_at na 1.124814 0.933021
121_at PAX8 : paired box gene 8 |#PAX8| -0.825381 0.102078
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.734896 -0.184104
1294_at UBE1L : ubiquitin-activating enzyme E1-like |#UBE1L| -0.366741 -1.209838
1316_at "THRA : thyroid hormone receptor, alpha (erythroblastic leukemia viral (v-erb-a) oncogene homolog, avian) |#THRA|" -0.126108 1.486972
1320_at "PTPN21 : protein tyrosine phosphatase, non-receptor type 21 |#PTPN21|" 3.083681 -0.086705
...
You can extract only rows with a desired probeset id (row id), e.g. ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at']
So, to read the file, remove the rows with NaN in the description, and save it again, do:
from cmapPy.pandasGEXpress.parse_gct import parse
from cmapPy.pandasGEXpress.write_gct import write
data = parse('example.gct', rid=['1007_s_at', '1053_at',
                                 '117_at', '121_at',
                                 '1255_g_at', '1294_at'])
# remove nan values from row_metadata (description column)
data.row_metadata_df.dropna(inplace=True)
# remove the entries of .data_df where nan values are in row_metadata
data.data_df = data.data_df.loc[data.row_metadata_df.index]
# Can only write GCT version 1.3
write(data, 'new_example.gct')
The new_example.gct then looks like this:
#1.3
4 2 1 0
id Description Tumor_One Normal_One
1053_at RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2| 0.8689 -1.3309
121_at PAX8 : paired box gene 8 |#PAX8| -0.8254 0.1021
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.7349 -0.1841
1294_at UBE1L : ubiquitin-activating enzyme E1-like |#UBE1L| -0.3667 -1.2098
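(If you want to process the whole file rather than a subset of probesets, you should be able to call parse('example.gct') without the rid argument to read all rows.)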
A quick Google search will turn up the package page:
https://pypi.org/project/cmapPy/
As for your code: if you don't care about the metadata in the first two rows, it works for your purpose, but you should first indicate that the delimiter is a tab and skip those two rows: pandas.read_csv(PATH_TO_GCT_FILE, sep='\t', skiprows=2). See the sketch below.
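Putting that together, a minimal sketch of your original snippet adapted this way (assuming a hypothetical tab-delimited GCT 1.2 input file named 'PAAD1.gct'):
import pandas as pd

# Skip the "#1.2" version line and the dimensions line so that the
# third line ("Name  Description  ...") becomes the column header
data = pd.read_csv('PAAD1.gct', sep='\t', skiprows=2)
new_data = data.dropna(axis=0, how='any')
print("Old data frame length:", len(data),
      "\nNew data frame length:", len(new_data),
      "\nNumber of rows with at least 1 NA value:",
      len(data) - len(new_data))
# Note: to_csv writes a plain table; it will not re-emit the two GCT header lines
new_data.to_csv('EditedPAAD.tsv', sep='\t', index=False)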
I have the following dataframe
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,NaN,3,no
e,Emily,9.0,2,no
I am trying to use the pandas map function to update the name column to a test value of 99 where the name is either James or Emily.
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes)
dff
I am getting the following output -
index,name,score,attempts,qualify
a,NaN,12.5,1,yes
b,NaN,9.0,3,no
c,NaN,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Note that the name values James and Emily have been updated to 99, but the rest of the name values have been mapped to NaN.
How can we ignore indexes which are not intended to be mapped?
The issue is that map sends every value in the 'name' column through the dictionary, and any value without a matching key becomes NaN. To get around this, you can use the replace method instead:
dff['name'] = dff['name'].replace({'James':'99','Emily':'99'})
This will replace only the specified values and leave the others unchanged.
I believe you may be looking for replace instead of map.
import pandas as pd
names = pd.Series([
"Anastasia",
"Dima",
"Katherine",
"James",
"Emily"
])
names.replace({"James": "99", "Emily": "99"})
# 0 Anastasia
# 1 Dima
# 2 Katherine
# 3 99
# 4 99
# dtype: object
If you're really set on using map, then you have to provide a function that knows how to handle every single name it might encounter.
codes = {"James": "99", "Emily": "99"}
# If the lookup into `codes` fails,
# return the name that was used for the lookup
names.map(lambda name: codes.get(name, name))
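The result matches the replace output, since unmapped names fall back to themselves:
# 0    Anastasia
# 1         Dima
# 2    Katherine
# 3           99
# 4           99
# dtype: object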
codes = {'James':'99',
'Emily':'99'}
dff['name'] = dff['name'].replace(codes)
dff
replace() satisfies the requirement -
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Another way to achieve it is to map and then fill the unmapped values back in:
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
dff
index name score attempts qualify
0 a Anastasia 12.5 1 yes
1 b Dima 9.0 3 no
2 c Katherine 16.5 2 yes
3 d 99 NaN 3 no
4 e 99 9.0 2 no
To summarize as concisely as I can: I have a data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map function to obtain the properties of each compound (there is one compound per row), using the "CID" number as an identifier. The parameters of pubchempy.get_properties are an identifier (a "CID" number in this case) and the property of the chemical that you want to obtain from the PubChem website (molecular weight in this case).
This is the code that I have written currently:
import pandas as pd
import pubchempy
import numpy as np
df = pd.read_csv("Data.tsv.txt", sep="\t")
from pubchempy import get_properties
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))
df = df.drop(df[df.CID=='nan'].index)
df = df.drop( df.index.to_list()[5:] ,axis = 0 )
df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)
df = df.rename(columns={'CID.': 'MolecularWeight'})
print(df)
This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):
MolecularWeight
[{'CID': 5339, 'MolecularWeight': '398.4'}]
[{'CID': 3889, 'MolecularWeight': '520.5'}]
[{'CID': 2788, 'MolecularWeight': '305.50'}]
[{'CID': 1422517, 'MolecularWeight': '440.5'}]
.
.
.
Now, the code was somewhat working in that it provided the molecular weight of the compound (398.4), but I didn't want all that extra text, nor did I want the quote marks around the molecular weight number (both get in the way of the next bit of code that I plan to write).
So I then added this bit of code:
df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')
This is the output that I am now getting:
MolecularWeight
398.4
398.4
398.4
398.4
.
.
.
What I want to do is pretty much exactly the same; it's just that instead of taking the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want the molecular weight value of each individual row in that column as the output.
What I was hoping to get is something like this:
MolecularWeight
398.4
520.5
305.50
440.5
.
.
.
Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!
A few lines of the text file:
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments
1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25) R2|R2|R25|R46| A
2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32) R25| A
3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H R18|R26|R27| A
If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float).
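If you'd rather fix the extraction itself, here is a sketch (assuming each cell holds a list like [{'CID': ..., 'MolecularWeight': '398.4'}], as in your first output) that pulls the weight out of each row's own cell instead of repeating row 0:
import pandas as pd

# Extract each row's own weight from its list-of-dicts, then cast to float;
# rows where the lookup returned nothing become missing values
df['MolecularWeight'] = df['MolecularWeight'].map(
    lambda cell: float(cell[0]['MolecularWeight'])
    if isinstance(cell, list) and cell else pd.NA)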
It appears that you may want multiple properties from each CID:
props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
print(df2)
Output:
CID MolecularWeight HBondDonorCount RotatableBondCount
0 5339 398.4 3 6
1 3889 520.5 4 9
2 2788 305.50 1 0
You can then merge this information onto the original dataframe:
df = df.merge(df2) # df = df.merge(pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props)))
print(df)
...
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments MolecularWeight HBondDonorCount RotatableBondCount
0 1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]... O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18... NaN R2|R2|R25|R46| A NaN 398.4 3 6
1 2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]... COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)... 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8... NaN R25| A NaN 520.5 4 9
2 3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1... NaN R18|R26|R27| A NaN 305.50 1 0
I have 2 datasets (dataframes), one called source and the other crossmap. I am trying to find rows where a specific column value starts with "999"; if one is found, I need to look up the complete value of that column (e.g. "99912345") in the crossmap dataframe and return the value from a column on that row in the crossmap.
# Source Dataframe
0 1 2 3 4
------ -------- -- --------- -----
0 303290 544981 2 408300622 85882
1 321833 99910722 1 408300902 85897
2 323241 99902978 3 408056001 95564
# Cross Map Dataframe
ID NDC ID DIN(NDC) GTIN NAME PRDID
------- ------ -------- -------------- ---------------------- -----
44563 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
69281 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
6002800 323241 99902978 75402850039706 EPINEPHRINE (A) 95564
8001116 323241 99902978 99902978000000 EPINEPHRINE (A) 95564
The 'straw dog' logic I am working with is this:
search source file and find '999' entries in column 1
df_source[df_source['Column1'].str.contains('999')]
iterate through the rows returned, search for the value from column 1 in the crossmap dataframe column DIN(NDC), and return the corresponding PRDID
update the source dataframe with the PRDID, and write the updated file
It is these last two logic pieces that I am struggling with. I appreciate any direction/guidance anyone can provide.
Is there maybe a better/easier means of doing this using python but not pandas/dataframes?
So, if I have understood you correctly: we are looking for values starting with 999 in the first column of the source dataframe. Next, we find these values in the Cross Map column 'DIN(NDC)' and take the values of the 'PRDID' column on those lines.
If everything is correct, then I am not sure what further steps you need, but here is that lookup:
import pandas as pd
import more_itertools as mit
Cross_Map = pd.DataFrame({'DIN(NDC)': [99910722, 99910722, 99902978, 99902978],
'PRDID': [90367, 90367, 95564, 95564]})
df = pd.DataFrame({0: [303290, 321833, 323241], 1: [544981, 99910722, 99902978], 2: [2, 1, 3],
3: [408300622, 408300902, 408056001], 4: [85882, 85897, 95564]})
m = [i for i in df[1] if str(i)[:3] == '999'] #find the values in column 1
index = list(mit.locate(list(Cross_Map['DIN(NDC)']), lambda x: x in m)) #get the indexes of the matched column values DIN(NDC)
print(Cross_Map['PRDID'][index])
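For the 'last two logic pieces' (pulling PRDID back onto the source rows), a merge-based sketch using the same toy frames as above:
# Keep only the source rows whose column 1 starts with '999',
# then bring PRDID over from the crossmap with a left merge
nines = df[df[1].astype(str).str.startswith('999')]
merged = nines.merge(Cross_Map.drop_duplicates('DIN(NDC)'),
                     left_on=1, right_on='DIN(NDC)', how='left')
print(merged[[1, 'PRDID']])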
I have a data frame as shown below
df = pd.DataFrame({'meds': ['Calcium Acetate','insulin GLARGINE -- LANTUS - inJECTable','amoxicillin 1 g + clavulanic acid 200 mg ','digoxin - TABLET'],
'details':['DOSE: 667 mg - TDS with food - Inject','DOSE: 12 unit(s) - ON - SC (SubCutaneous)','-- AUGMENTIN - inJECTable','DOSE: 62.5 mcg - Every other morning - PO'],
'extracted':['Calcium Acetate 667 mg Inject','insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)','amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN','digoxin - TABLET 62.5 mcg PO/Tube']})
df['concatenated'] = df['meds'] + " "+ df['details']
What I would like to do is
a) Check whether all of the individual keywords from the extracted column are present in the concatenated column.
b) If present, assign 1 to the output column, else 0.
c) Assign the not-found keyword to the issue column, as shown below.
So, I was trying something like below
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
#the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ') #split them into keywords
def value_present(row): #check whether each of the keyword is present in `concatenated` column
if isinstance(row['keywords'], list):
for keyword in row['keywords']:
return 1
else:
return 0
df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()
If you think it's useful to clean the concatenated column as well, that's fine. I am only interested in finding the presence of all keywords.
Is there any efficient and elegant approach to do this on 7-8 million records?
I expect my output to be as shown below. A missing term between the extracted and concatenated columns means the row is assigned 0 and the missing keyword is stored in the issue column.
Let us zip the columns extracted and concatenated and for each pair map it to a function f which computes the set difference and returns the result accordingly:
def f(x, y):
s = set(x.split()) - set(y.split())
return [0, ', '.join(s)] if s else [1, np.nan]
df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]
output issue
0 1 NaN
1 1 NaN
2 1 NaN
3 0 PO/Tube
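Note that the comparison is exact on whitespace-split tokens. If you also want to ignore case and symbols, a hedged variant of f (a sketch; adjust the regex to your own cleaning rules):
import re
import numpy as np

def normalize(text):
    # Lowercase, drop everything except alphanumerics and whitespace,
    # then split into a set of keywords
    return set(re.sub(r'[^a-z0-9\s]', ' ', text.lower()).split())

def f(x, y):
    s = normalize(x) - normalize(y)
    return [0, ', '.join(sorted(s))] if s else [1, np.nan]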
My collaborator wants me to process an input text file into a structured table:
The raw input text file looks like
PMID 22224631
Title -765 G_C and -1195 A_G promoter variants of the cyclooxygenase-2 gene decrease the risk for preeclampsia.
Found 8 gene(s)
Gene CRP Mentions
Gene GC Mentions
Gene PTGS2 Mentions
Found 1 variant(s)
Variant I399V URL
Gene PTGS1 Mentions
Found 2 variant(s)
Variant L255L URL
Variant V255V URL
Gene CT49 Mentions
Gene GAA Mentions
Found 1 variant(s)
Variant Q255H URL
Gene CGA Mentions
Gene TAT Mentions
PMID 16076618
Title 1166C mutation of angiotensin II type 1 receptor gene is correlated with umbilical blood flow velocimetry in women with preeclampsia.
Found 13 gene(s)
Gene AGTR2 Mentions
Gene QTRT1 Mentions
Gene SLC25A10 Mentions
Gene ATM Mentions
Gene PIH Mentions
Gene CCL14 Mentions
Gene AGT Mentions
Gene REN Mentions
Gene ASAH1 Mentions
Gene AGTR1 Mentions
Gene SSD Mentions
Gene TAT Mentions
Found 1 variant(s)
Variant D389A URL
Gene ACE Mentions
Found 2 variant(s)
Variant D389A URL
Variant H389P URL
You can see that for each PMID (an ID for a scientific publication) there is some information about genes, and for each gene there may be some information about variants. The input text reads pretty much like "print" output instead of a table. Each PMID block is separated by an empty line.
The final table the collaborator wants is a long-format table (.csv) comprised of three layers: PMID, gene, and variant. A PMID contains genes, and the genes contain (or not) variants. Take the example from the above input file:
PMID | Gene | Variant
22224631 | CRP | No
22224631 | GC | No
22224631 | PTGS2 | I399V
22224631 | PTGS1 | L255L
22224631 | PTGS1 | V255V
22224631 | CT49 | No
22224631 | GAA | Q255H
....... | .....
I do not have much experience processing raw text files into tables in Python.
My thinking is to use regex to strip redundant words first. I read in this text file, which generates a big list of strings in which each string is a line of the input file:
with open ("gene and variants.txt", "r") as myfile:
data=myfile.readlines()
data2 = [x for x in data if not x.startswith('Title') and not
x.startswith('Found')]
data3 = [x for x in data2 if x != " \t\n"]
data4 = [x.strip(" Mentions\n") for x in data3]
data4 = [x.strip(" URL") for x in data4]
data4 = [x.replace("Gene\t", "Gene") for x in data4]
data4 = [x.replace("PMID\t", "PMID ") for x in data4]
data4 = [x.replace("Variant\t", "Variant") for x in data4]
Luckily, I am able to strip most of the unnecessary information and finally get to a list of strings, one cleaned line per element.
Then I got stuck... what should I do next to convert this list of strings to my target table? I was thinking of using Pandas, but it seems to only take each string as a row in a dataframe with a single column.
Am I on the right path? If so, what should I do next?
If not, do you have any suggestion on how should I approach this problem?
You can follow these steps to convert your text file into a Pandas dataframe with the desired format:
Use read_csv() to import the text file. To test this out, I copied the raw input text you pasted above into a new text file and saved it as raw_input.txt:
df = pd.read_csv('raw_input.txt', header=None)
The dataframe will contain a bunch of rows formatted like this:
0
0 PMID 22224631
1 Title -765 G_C and -1195 A_G promoter varia...
2 Found 8 gene(s)
3 Gene CRP Mentions
4 Gene GC Mentions
5 Gene PTGS2 Mentions
6 Found 1 variant(s)
7 Variant I399V URL
8 Gene PTGS1 Mentions
...
Our next step is to create a dictionary that stores the info for each PMID:
# Get the indices of each row that has a new PMID header
pmid_idxs = df[df[0].str.contains('PMID')].index
# Now construct the dictionary, using each PMID as a key and
# filling the entry for each key with the PMID's gene info.
pmid_dict = {}
for i, val in enumerate(pmid_idxs.values):
if pmid_idxs.values[-1] != val:
nxt_pmid_idx = pmid_idxs.values[i+1]
pmid_dict[df[0].iloc[val]] = df[0].iloc[val+1:nxt_pmid_idx].reset_index(drop=True)
else: # if last PMID
pmid_dict[df[0].iloc[val]] = df[0].iloc[val+1:].reset_index(drop=True)
Now for the main part -- this is the logic that will loop through each entry in the dictionary, extract and format each PMID's gene info into a small dataframe, and add that dataframe to a list:
df_list = []
for key, value in pmid_dict.items():
pmid_num = ''.join(c for c in key if c not in 'PMID ')
series = value
next_rows = series.shift(-1).fillna('placeholder')
df_dict = {'PMID': [],
'Gene': [],
'Variant': []}
gene = ''
variant = ''
for i, row in series.items():
if 'Gene' in row:
gene = row[4:-9].strip(' ')
if i <= (len(series)) and 'variant' not in next_rows.iloc[i].lower():
df_dict['PMID'].append(pmid_num)
df_dict['Gene'].append(gene)
df_dict['Variant'].append('No')
elif i == len(series) + 1:
df_dict['PMID'].append(pmid_num)
df_dict['Gene'].append(gene)
df_dict['Variant'].append('No')
if 'Variant' in row:
variant = row[8:-4].strip(' ')
df_dict['PMID'].append(pmid_num)
df_dict['Gene'].append(gene)
df_dict['Variant'].append(variant)
df = pd.DataFrame(df_dict)
df_list.append(df)
The final output dataframe will merely be a concatenation of each small dataframe we created above:
output_df = pd.concat(df_list).reset_index(drop=True)
And that's it. The output dataframe looks like this, which I believe is your desired format:
PMID Gene Variant
0 22224631 CRP No
1 22224631 GC No
2 22224631 PTGS2 I399V
3 22224631 PTGS1 L255L
4 22224631 PTGS1 V255V
5 22224631 CT49 No
6 22224631 GAA Q255H
7 22224631 CGA No
8 22224631 TAT No
9 16076618 AGTR2 No
10 16076618 QTRT1 No
11 16076618 SLC25A10 No
12 16076618 ATM No
13 16076618 PIH No
14 16076618 CCL14 No
15 16076618 AGT No
16 16076618 REN No
17 16076618 ASAH1 No
18 16076618 AGTR1 No
19 16076618 SSD No
20 16076618 TAT D389A
21 16076618 ACE D389A
22 16076618 ACE H389P
I am not REALLY experienced in Python, but my approach would be to create tuples.
First one created manually, to make that first PMID | Gene | Variant part,
then using regex to strip unnecessary text and adding those tuples to a single list.
Then printing them all using string formatting.
Or, you could make 3 lists: one for PMID, one for Gene, one for Variant.
Then iterate over them with a for loop and print them out to create that table.
Sorry for not being able to give specific tips.
Best of wishes!
You could work with dictionaries.
For example:
fileDict = {'Gene': [], 'Variant': [], 'PMID': []}
Iterate through the list, check whether each line is a Gene, Variant, or PMID entry, and append the values.
You can then do something like:
for x in fileDict['Gene']:
print(x)
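For completeness, a more direct single-pass sketch (assuming the raw format shown in the question, with lines starting with PMID, Gene, or Variant) that builds the long-format table without the intermediate dictionary:
import pandas as pd

rows = []
pmid, gene, gene_has_variant = None, None, False

def flush():
    # Record the current gene with 'No' if it never received a variant
    global gene, gene_has_variant
    if gene is not None and not gene_has_variant:
        rows.append((pmid, gene, 'No'))
    gene, gene_has_variant = None, False

with open('gene and variants.txt') as fh:
    for line in fh:
        parts = line.split()
        if not parts or parts[0] in ('Title', 'Found'):
            continue
        if parts[0] == 'PMID':
            flush()  # close out the last gene of the previous PMID
            pmid = parts[1]
        elif parts[0] == 'Gene':
            flush()  # close out the previous gene
            gene = parts[1]
        elif parts[0] == 'Variant':
            rows.append((pmid, gene, parts[1]))
            gene_has_variant = True

flush()  # don't forget the final gene in the file

out = pd.DataFrame(rows, columns=['PMID', 'Gene', 'Variant'])
out.to_csv('output.csv', index=False)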