My collaborator wants me to process a raw input text file into a structured table.
The raw input text file looks like this:
PMID 22224631
Title -765 G_C and -1195 A_G promoter variants of the cyclooxygenase-2 gene decrease the risk for preeclampsia.
Found 8 gene(s)
Gene CRP Mentions
Gene GC Mentions
Gene PTGS2 Mentions
Found 1 variant(s)
Variant I399V URL
Gene PTGS1 Mentions
Found 2 variant(s)
Variant L255L URL
Variant V255V URL
Gene CT49 Mentions
Gene GAA Mentions
Found 1 variant(s)
Variant Q255H URL
Gene CGA Mentions
Gene TAT Mentions
PMID 16076618
Title 1166C mutation of angiotensin II type 1 receptor gene is correlated with umbilical blood flow velocimetry in women with preeclampsia.
Found 13 gene(s)
Gene AGTR2 Mentions
Gene QTRT1 Mentions
Gene SLC25A10 Mentions
Gene ATM Mentions
Gene PIH Mentions
Gene CCL14 Mentions
Gene AGT Mentions
Gene REN Mentions
Gene ASAH1 Mentions
Gene AGTR1 Mentions
Gene SSD Mentions
Gene TAT Mentions
Found 1 variant(s)
Variant D389A URL
Gene ACE Mentions
Found 2 variant(s)
Variant D389A URL
Variant H389P URL
As you can see, for each PMID (an ID for a scientific publication) there is some information about genes, and for each gene there may be some information about variants. The input text looks much like the output of a "print" function rather than a table, and each PMID block is separated by an empty line.
The final table my collaborator wants is a long-format table (.csv) comprised of three layers: PMID, gene and variant. A PMID contains genes, and each gene contains (or does not contain) variants. Taking the example from the input file above:
PMID | Gene | Variant
22224631 | CRP | No
22224631 | GC | No
22224631 | PTGS2 | I399V
22224631 | PTGS1 | L255L
22224631 | PTGS1 | V255V
22224631 | CT49 | No
22224631 | GAA | Q255H
....... | .....
I do not have much experience processing raw text files into tables in Python.
My thinking is to use regex to strip the redundant words first. When I read in this text file, it generates a big list of strings, in which each string is a line of the input file:
with open("gene and variants.txt", "r") as myfile:
    data = myfile.readlines()

data2 = [x for x in data if not x.startswith('Title') and not x.startswith('Found')]
data3 = [x for x in data2 if x != " \t\n"]
data4 = [x.strip(" Mentions\n") for x in data3]
data4 = [x.strip(" URL") for x in data4]
data4 = [x.replace("Gene\t", "Gene") for x in data4]
data4 = [x.replace("PMID\t", "PMID ") for x in data4]
data4 = [x.replace("Variant\t", "Variant") for x in data4]
Luckily, I am able to strip most of the unnecessary information and finally get to a cleaned list of strings.
Then I got stuck: what should I do next to convert this list of strings into my target table? I was thinking of using Pandas, but it seems to take each string as a row of a dataframe with a single column.
Am I on the right path? If so, what should I do next?
If not, do you have any suggestions on how I should approach this problem?
You can follow these steps to convert your text file into a Pandas dataframe with the desired format:
Use read_csv() to import the text file. To test this out, I copied the raw input text you pasted above into a new text file and saved it as raw_input.txt:
import pandas as pd

df = pd.read_csv('raw_input.txt', header=None)
The dataframe will contain a bunch of rows formatted like this:
0
0 PMID 22224631
1 Title -765 G_C and -1195 A_G promoter varia...
2 Found 8 gene(s)
3 Gene CRP Mentions
4 Gene GC Mentions
5 Gene PTGS2 Mentions
6 Found 1 variant(s)
7 Variant I399V URL
8 Gene PTGS1 Mentions
...
Our next step is to create a dictionary that stores the info for each PMID:
# Get the indices of each row that has a new PMID header
pmid_idxs = df[df[0].str.contains('PMID')].index
# Now construct the dictionary, using each PMID as a key and
# filling the entry for each key with the PMID's gene info.
pmid_dict = {}
for i, val in enumerate(pmid_idxs.values):
    if pmid_idxs.values[-1] != val:
        nxt_pmid_idx = pmid_idxs.values[i+1]
        pmid_dict[df[0].iloc[val]] = df[0].iloc[val+1:nxt_pmid_idx].reset_index(drop=True)
    else:  # if last PMID
        pmid_dict[df[0].iloc[val]] = df[0].iloc[val+1:].reset_index(drop=True)
Now for the main part -- this is the logic that will loop through each entry in the dictionary, extract and format each PMID's gene info into a small dataframe, and add that dataframe to a list:
df_list = []
for key, value in pmid_dict.items():
    pmid_num = ''.join(c for c in key if c not in 'PMID ')
    series = value
    next_rows = series.shift(-1).fillna('placeholder')
    df_dict = {'PMID': [],
               'Gene': [],
               'Variant': []}
    gene = ''
    variant = ''
    for i, row in series.iteritems():
        if 'Gene' in row:
            gene = row[4:-9].strip(' ')
            if i <= (len(series)) and 'variant' not in next_rows.iloc[i].lower():
                df_dict['PMID'].append(pmid_num)
                df_dict['Gene'].append(gene)
                df_dict['Variant'].append('No')
            elif i == len(series) + 1:
                df_dict['PMID'].append(pmid_num)
                df_dict['Gene'].append(gene)
                df_dict['Variant'].append('No')
        if 'Variant' in row:
            variant = row[8:-4].strip(' ')
            df_dict['PMID'].append(pmid_num)
            df_dict['Gene'].append(gene)
            df_dict['Variant'].append(variant)
    df = pd.DataFrame(df_dict)
    df_list.append(df)
The final output dataframe will merely be a concatenation of each small dataframe we created above:
output_df = pd.concat(df_list).reset_index(drop=True)
And that's it. The output dataframe looks like this, which I believe is your desired format:
PMID Gene Variant
0 22224631 CRP No
1 22224631 GC No
2 22224631 PTGS2 I399V
3 22224631 PTGS1 L255L
4 22224631 PTGS1 V255V
5 22224631 CT49 No
6 22224631 GAA Q255H
7 22224631 CGA No
8 22224631 TAT No
9 16076618 AGTR2 No
10 16076618 QTRT1 No
11 16076618 SLC25A10 No
12 16076618 ATM No
13 16076618 PIH No
14 16076618 CCL14 No
15 16076618 AGT No
16 16076618 REN No
17 16076618 ASAH1 No
18 16076618 AGTR1 No
19 16076618 SSD No
20 16076618 TAT D389A
21 16076618 ACE D389A
22 16076618 ACE H389P
I am not REALLY experienced in Python, but my approach would be to create tuples.
The first one would be created manually, to make that first PMID | Gene | Variant part; then, using regex to strip the unnecessary text, I would keep adding those tuples to a single list and finally print them all using string formatting.
Or, you could make 3 lists: one for PMID, one for Gene, one for Variant.
Then iterate over them with a for loop and print them out to create that table.
Sorry for not being able to give specific tips.
Best wishes!
You could work with dictionaries.
For example:
fileDict = {'Gene': [], 'Variant': [], 'PMID': []}
Iterate through the list, check whether each line is a Gene, Variant or PMID, and append the values.
You can then do something like:
for x in fileDict['Gene']:
    print(x)
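For what it's worth, here is a minimal sketch of that dictionary idea, assuming the raw "gene and variants.txt" file from the question; the "No" placeholder comes from the desired output, and the output file name genes_variants.csv is just a hypothetical choice:

import csv

fileDict = {'Gene': [], 'Variant': [], 'PMID': []}
current_pmid = None

with open("gene and variants.txt") as fh:
    for line in fh:
        parts = line.split()
        if not parts:                               # blank line between PMID blocks
            continue
        if parts[0] == "PMID":
            current_pmid = parts[1]
        elif parts[0] == "Gene":
            fileDict['PMID'].append(current_pmid)
            fileDict['Gene'].append(parts[1])
            fileDict['Variant'].append("No")        # placeholder until a variant shows up
        elif parts[0] == "Variant":
            if fileDict['Variant'][-1] == "No":
                fileDict['Variant'][-1] = parts[1]  # first variant replaces the placeholder
            else:                                   # extra variants get their own row
                fileDict['PMID'].append(fileDict['PMID'][-1])
                fileDict['Gene'].append(fileDict['Gene'][-1])
                fileDict['Variant'].append(parts[1])

with open("genes_variants.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["PMID", "Gene", "Variant"])
    writer.writerows(zip(fileDict['PMID'], fileDict['Gene'], fileDict['Variant']))

Title and "Found ..." lines simply fall through the if/elif chain and are ignored, which is what keeps the line-by-line pass short.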
Related
I have a file which contains user data in rows, stored in some cryptic format. I want to decode that and create a dataframe.
sample row -- AN04N010105SANDY0205SMITH030802031989
Note:
AN04N01 is a standard 7-letter string at the start to denote that this row is valid.
Here 0105SANDY refers to the 1st column (name) having length 5:
01 -> 1st column (which is the name column)
05 -> length of the name (Sandy)
Similarly, 0205SMITH refers to:
02 -> 2nd column (which is the surname column)
05 -> length of the surname (Smith)
Similarly, 030802031989 refers to:
03 -> 3rd column (DOB)
08 -> length of the DOB
I want a data frame like --
| name | surname | DOB |
|Sandy | SMITH | 02031989 |
I was trying to use regex, but I don't know how to put this into a data frame after identifying the names. Also, how would you find the number of characters to read?
Rather than using regex for groups that might be out of order and varying length, it might be simpler to consume the string in a serial manner.
With the following, you track an index i through the string and consume two characters for the code, two for the length, and finally the variable number of characters given by the length. Then you store the values in a dict, append the dicts to a list, and turn that list of dicts into a dataframe. Bonus: it works with the elements in any order.
import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
]

code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7
    d = {}
    while i < len(s):
        code, i = s[i:i+2], i+2             # read code
        length, i = int(s[i:i+2]), i+2      # read length
        val, i = s[i:i+length], i + length  # read value
        d[code_map[code]] = val             # store value
    return d

ds = []
for s in test_strings:
    if not s.startswith("AN04N01"):
        continue
    ds.append(parse(s))

df = pd.DataFrame(ds)
df contains:
name surname DOB
0 ALICE ADAMS 02031989
1 BOB SMITH 210876
2 FRED OWEN 010101
Try:
def fn(x):
    rv, x = [], x[7:]
    while x:
        _, n, x = x[:2], x[2:4], x[4:]
        value, x = x[: int(n)], x[int(n) :]
        rv.append(value)
    return rv
m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)
Prints:
row NAME SURNAME DOB
0 AN04N010105SANDY0205SMITH030802031989 SANDY SMITH 02031989
1 AN04N010105BANDY0205BMITH030802031989 BANDY BMITH 02031989
2 AN04N010105CANDY0205CMITH030802031989 CANDY CMITH 02031989
3 XXXXXXX0105DANDY0205DMITH030802031989 NaN NaN NaN
Dataframe used:
row
0 AN04N010105SANDY0205SMITH030802031989
1 AN04N010105BANDY0205BMITH030802031989
2 AN04N010105CANDY0205CMITH030802031989
3 XXXXXXX0105DANDY0205DMITH030802031989
Here is a regex pattern for this:
(\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d+\w{5})(\d+)
Or use this pattern:
(\D{5})\d+(\D+)\d+(02\d+)
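In case it helps to see the second pattern in action, here is a rough sketch using pandas' str.extract; the two-row frame is hypothetical (reusing the "row" column from the answer above), and note that the pattern hard-codes a 5-character name and a DOB starting with 02, so it only fits rows shaped like the sample:

import pandas as pd

df = pd.DataFrame({"row": [
    "AN04N010105SANDY0205SMITH030802031989",
    "XXXXXXX0105DANDY0205DMITH030802031989",
]})

valid = df["row"].str.startswith("AN04N01")                   # keep only valid rows

parsed = df["row"].str.extract(r"(\D{5})\d+(\D+)\d+(02\d+)")  # one column per capture group
parsed.columns = ["name", "surname", "DOB"]
parsed[~valid] = None                                         # blank out the invalid rows

df = df.join(parsed)
print(df)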
I have a data frame as shown below
df = pd.DataFrame({'meds': ['Calcium Acetate', 'insulin GLARGINE -- LANTUS - inJECTable', 'amoxicillin 1 g + clavulanic acid 200 mg ', 'digoxin - TABLET'],
                   'details': ['DOSE: 667 mg - TDS with food - Inject', 'DOSE: 12 unit(s) - ON - SC (SubCutaneous)', '-- AUGMENTIN - inJECTable', 'DOSE: 62.5 mcg - Every other morning - PO'],
                   'extracted': ['Calcium Acetate 667 mg Inject', 'insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)', 'amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN', 'digoxin - TABLET 62.5 mcg PO/Tube']})

df['concatenated'] = df['meds'] + " " + df['details']
What I would like to do is:
a) Check whether all of the individual keywords from the extracted column are present in the concatenated column.
b) If present, assign 1 to the output column, else 0.
c) Assign the not-found keyword to the issue column as shown below.
So, I was trying something like this:
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
#the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ') #split them into keywords
def value_present(row):  #check whether each of the keyword is present in `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0
df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()
If you think it's useful to clean the concatenated column as well, that's fine. I am only interested in finding the presence of all keywords.
Is there any efficient and elegant approach to do this on 7-8 million records?
I expect my output to be as shown below. Red color indicates a missing term between the extracted and concatenated columns, so it's assigned 0 and the keyword is stored in the issue column.
Let us zip the columns extracted and concatenated, and map each pair to a function f which computes the set difference and returns the result accordingly:
import numpy as np

def f(x, y):
    s = set(x.split()) - set(y.split())
    return [0, ', '.join(s)] if s else [1, np.nan]
df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]
output issue
0 1 NaN
1 1 NaN
2 1 NaN
3 0 PO/Tube
Which libraries would help me read a .gct file in Python and edit it, for example by removing the rows with NaN values? And how would the following code change if I apply it to a .gct file?
data = pd.read_csv('PAAD1.csv')
new_data = data.dropna(axis = 0, how ='any')
print("Old data frame length:", len(data), "\nNew data frame length:",
len(new_data), "\nNumber of rows with at least 1 NA value: ",
(len(data)-len(new_data)))
new_data.to_csv('EditedPAAD.csv')
You should use the cmapPy package for this. Compared to read_csv it gives you more freedom and domain-specific utilities. E.g., if your *.gct looks like this:
#1.2
22215 2
Name Description Tumor_One Normal_One
1007_s_at na -0.214548 -0.18069
1053_at "RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2|" 0.868853 -1.330921
117_at na 1.124814 0.933021
121_at PAX8 : paired box gene 8 |#PAX8| -0.825381 0.102078
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.734896 -0.184104
1294_at UBE1L : ubiquitin-activating enzyme E1-like |#UBE1L| -0.366741 -1.209838
1316_at "THRA : thyroid hormone receptor, alpha (erythroblastic leukemia viral (v-erb-a) oncogene homolog, avian) |#THRA|" -0.126108 1.486972
1320_at "PTPN21 : protein tyrosine phosphatase, non-receptor type 21 |#PTPN21|" 3.083681 -0.086705
...
You can extract only rows with a desired probeset id (row id), e.g. ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at UBE1L']
So to read the file, remove the NaN values in the description and save it again, do:
from cmapPy.pandasGEXpress.parse_gct import parse
from cmapPy.pandasGEXpress.write_gct import write
data = parse('example.gct', rid=['1007_s_at', '1053_at',
                                 '117_at', '121_at',
                                 '1255_g_at', '1294_at UBE1L'])
# remove nan values from row_metadata (description column)
data.row_metadata_df.dropna(inplace=True)
# remove the entries of .data_df where nan values are in row_metadata
data.data_df = data.data_df.loc[data.row_metadata_df.index]
# Can only write GCT version 1.3
write(data, 'new_example.gct')
The new_example.gct then looks like this:
#1.3
3 2 1 0
id Description Tumor_One Normal_One
1053_at RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2| 0.8689 -1.3309
121_at PAX8 : paired box gene 8 |#PAX8| -0.8254 0.1021
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.7349 -0.1841
A quick search on Google will give you the following:
https://pypi.org/project/cmapPy/
Regarding the code, if you don't care about the metadata in the first 2 rows, it seems to work for your purpose, but you should first indicate that the delimiter is TAB and skip the first 2 rows: pandas.read_csv(PATH_TO_GCT_FILE, sep='\t', skiprows=2)
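As a rough sketch of that route (the file names PAAD1.gct and EditedPAAD.gct are just stand-ins based on the question; the real file must be tab-separated with the two metadata lines on top, as in the sample above):

import pandas as pd

# Skip the "#1.2" version line and the dimensions line; fields are TAB-separated.
data = pd.read_csv('PAAD1.gct', sep='\t', skiprows=2)

# Drop every row with at least one NaN value, as in the original snippet.
new_data = data.dropna(axis=0, how='any')

print("Old data frame length:", len(data),
      "\nNew data frame length:", len(new_data),
      "\nNumber of rows with at least 1 NA value:", len(data) - len(new_data))

# Write back out as tab-separated text (note: this will not restore the two
# metadata lines a .gct file normally starts with).
new_data.to_csv('EditedPAAD.gct', sep='\t', index=False)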
I am doing some natural language processing on some Twitter data. I managed to successfully load and clean up some tweets and placed them into the data frame below.
id text
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t
The problem is that I am trying to construct a term frequency matrix where each row is a tweet and each column holds the number of times a given word occurs in that row. My only problem is that the other posts I found only mention term frequency distributions for text files. Here is the code I used to generate the data frame above:
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
df_tweetText = df_tweet
#Makes a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '').str.lower())
#Removing Stop words
#nltk.download('stopwords')
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
#Remove the https linkes
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}",'',regex=True, inplace=False)
#Tokenize the words
df_tweetText
At first I tried to use the function word_dist = nltk.FreqDist(df_tweetText['text']), but it would end up counting the whole sentence as one value instead of each word in the row.
Another thing I tried was to tokenize each word using df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then call FreqDist again, but that gives me an error saying unhashable type: 'list'.
1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
Is there some alternative way to construct this term frequency matrix? Ideally, I want my data to look something like this:
id |collusion | president |
------------------------------------------
1104159474368024599 | 1 | 0 |
1104155456019357703 | 0 | 2 |
EDIT 1: So I decided to take a look at the textmining library and recreated one of their examples. The only problem is that it creates the Term Document Matrix with one row containing every single tweet:
import textmining
#Creates Term Matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
    # print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
EDIT 2: So I tried sklearn, and that sort of worked, but the problem is that I'm finding Chinese/Japanese characters in my columns, which should not exist. Also, my columns are showing up as numbers for some reason:
from sklearn.feature_extraction.text import CountVectorizer
corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)
00 007cigarjoe 08 10 100 1000 10000 100000 1000000 10000000 \
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
Probably not optimal since it iterates over each row, but it works. Mileage may vary based on how long the tweets are and how many tweets are being processed.
import pandas as pd
from collections import Counter
# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]
# result dataframe
df2 = pd.DataFrame()
for i, row in df.iterrows():
    df2 = df2.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
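If you would rather have 0 than NaN for words that never occur in a given tweet, a small follow-up step (my addition, not part of the loop above) would be:

df2 = df2.fillna(0).reset_index(drop=True)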
I am trying to read an ASCII table into a NumPy/Pandas/Astropy array/dataframe/table in Python. Each row in the table looks something like this:
329444.6949 0.0124 -6.0124 3 97.9459 15 32507 303 7 3 4 8 2 7 HDC-13-O
The problem is that there is no clear separator/delimiter between the columns, so for some rows there is no space between two columns, like this:
332174.9289 0.0995 -6.3039 3 1708.1601219 30501 30336 333 37 136 H2CO
The web page says these are called "card images". The table format is described as follows:
The catalog data files are composed of 80-character card images, with
one card image per spectral line. The format of each card image is:
FREQ, ERR, LGINT, DR, ELO, GUP, TAG, QNFMT, QN', QN" (F13.4,F8.4,
F8.4, I2,F10.4, I3, I7, I4, 6I2, 6I2)
I would really like a way where I can just use the format specifier given above. The only thing I found was NumPy's genfromtxt function. However, the following does not work:
np.genfromtxt('tablename', dtype='f13.4,f8.4,f8.4,i2,f10.4,i3,i7,i4,6i2,6i2')
Does anyone know how I could read this table into Python using the format specification of each column that was given?
You can use the fixed-width reader in Astropy. See: http://astropy.readthedocs.org/en/latest/io/ascii/fixed_width_gallery.html#fixedwidthnoheader. This does still require you to count the columns, but you could probably write a simple parser for the dtype expression you showed.
Unlike the pandas solution above (e.g. df['FREQ'] = df.data.str[0:13]), this will automatically determine the column type and give float and int columns in your case. The pandas version results in all str type columns, which is presumably not what you want.
To quote the doc example there:
>>> from astropy.io import ascii
>>> table = """
... #1 9 19 <== Column start indexes
... #| | | <== Column start positions
... #<------><--------><-------------> <== Inferred column positions
... John 555- 1234 192.168.1.10
... Mary 555- 2134 192.168.1.123
... Bob 555- 4527 192.168.1.9
... Bill 555-9875 192.255.255.255
... """
>>> ascii.read(table,
... format='fixed_width_no_header',
... names=('Name', 'Phone', 'TCP'),
... col_starts=(1, 9, 19),
... )
<Table length=4>
Name Phone TCP
str4 str9 str15
---- --------- ---------------
John 555- 1234 192.168.1.10
Mary 555- 2134 192.168.1.123
Bob 555- 4527 192.168.1.9
Bill 555-9875 192.255.255.255
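Building on that, here is a rough sketch (mine, not from the Astropy docs) of how the Fortran widths in the catalog description could be turned into col_starts/col_ends for the card-image file; 'tablename' comes from the question, the column names follow the catalog description (with QN1/QN2 standing in for QN' and QN"), and the two 6I2 groups are read as single 12-character text columns:

from itertools import accumulate

from astropy.io import ascii

# Field widths taken from "F13.4, F8.4, F8.4, I2, F10.4, I3, I7, I4, 6I2, 6I2"
widths = [13, 8, 8, 2, 10, 3, 7, 4, 12, 12]
names = ['FREQ', 'ERR', 'LGINT', 'DR', 'ELO', 'GUP', 'TAG', 'QNFMT', 'QN1', 'QN2']

ends = list(accumulate(widths))                     # cumulative end of each field
col_starts = [e - w for e, w in zip(ends, widths)]  # 0-based start positions
col_ends = [e - 1 for e in ends]                    # inclusive end positions

table = ascii.read('tablename',
                   format='fixed_width_no_header',
                   names=names,
                   col_starts=col_starts,
                   col_ends=col_ends)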