Question:
How do I apply the same Python code to multiple columns of data?
I am just learning Python and I have written a script to reformat my data.
Data Format:
My current format starts with 4 descriptive columns followed by many columns of data (e.g., 1/1):
#CHROM POS REF ALT IND_1 IND_2 IND_3 IND_4
2L 6631 A G 1/1 0/0 0/0 0/0
2L 6633 T C 0/0 1/0 0/0 0/0
2L 6637 C G 1/1 0/0 0/0 0/0
I am trying to change the 0s and 1s to the values in the REF and ALT columns, respectively, with the desired end format looking like:
2L 6631 A G G/G A/A A/A A/A
2L 6633 T C T/T C/T T/T T/T
2L 6637 C G G/G C/C C/C C/C
What I have so far:
I have written a script that will do this for a single column, but I have 100+ columns of data, so I was wondering if there is a way to apply this script to multiple columns instead of writing it out specifically for each one.
for line in openfile:
    ## skip header
    if line.startswith("#CHROM"):
        continue
    columns = line.rstrip().split("\t")
    CHROM = columns[0]
    POS = columns[1]
    REF = columns[2]
    ALT = columns[3]
    ALLELES1 = columns[4].replace("0",REF).replace("1",ALT).replace(".","0")
    ALLELES2 = columns[5].replace("0",REF).replace("1",ALT).replace(".","0")
    print CHROM, POS, REF, ALT, ALLELES1, ALLELES2
Here is my solution:
def read_data(filename):
    with open(filename, "r") as file_handle:
        for line in file_handle:
            # skip header
            if line.startswith("#CHROM"):
                continue
            columns = line.rstrip().split("\t")
            CHROM = columns[0]
            POS = columns[1]
            REF = columns[2]
            ALT = columns[3]
            ALLELS = [value.replace("0", REF).replace("1", ALT).replace(".", "0") for value in columns[4:]]
            print("\t".join(columns[0:4] + ALLELS))
You call it like this:
read_data("file.txt")
[value.replace("0", REF).replace("1", ALT).replace(".", "0") for value in columns[4:]] is called a "list comprehension". It looks at every value of a list and does something with it. See the Python documentation on list comprehensions.
columns[4:] means: look at all my columns and get me the columns starting at index 4 through the last column.
sep="\t" in the print statement means, that all the elements you pass to the print function should be printed with a TAB in between them.
"\t".join(columns[0:4] + ALLELS) returns a single string in which all elements are joined by a TAB. See Stephen Rauch.
I would suggest implementing this using a list comprehension:
for line in f.readlines():
    ## skip header
    if line.startswith("#CHROM"):
        continue
    columns = line.rstrip().split("\t")
    REF, ALT = columns[2:4]
    modified = [c.replace("0", REF).replace("1", ALT).replace(".", "0")
                for c in columns[4:]]
    print('\t'.join(columns[0:4] + modified))
Three changes to your code:
REF, ALT = columns[2:4]
Which is a clean way to grab two elements from the list.
modified = [c.replace("0", REF).replace("1", ALT).replace(".", "0")
            for c in columns[4:]]
Which is a list comprehension to do the replace across all of the fields at once. And then
print('\t'.join(columns[0:4] + modified))
which reassembles everything at once.
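Putting that together with the file handling (the filename here is just a placeholder), a complete sketch would look like:
with open("genotypes.txt") as f:   # hypothetical filename
    for line in f:
        # skip header
        if line.startswith("#CHROM"):
            continue
        columns = line.rstrip().split("\t")
        REF, ALT = columns[2:4]
        modified = [c.replace("0", REF).replace("1", ALT).replace(".", "0")
                    for c in columns[4:]]
        print('\t'.join(columns[0:4] + modified))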
I have a data frame as shown below:
import pandas as pd

df = pd.DataFrame({'meds': ['Calcium Acetate', 'insulin GLARGINE -- LANTUS - inJECTable',
                            'amoxicillin 1 g + clavulanic acid 200 mg ', 'digoxin - TABLET'],
                   'details': ['DOSE: 667 mg - TDS with food - Inject', 'DOSE: 12 unit(s) - ON - SC (SubCutaneous)',
                               '-- AUGMENTIN - inJECTable', 'DOSE: 62.5 mcg - Every other morning - PO'],
                   'extracted': ['Calcium Acetate 667 mg Inject', 'insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)',
                                 'amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN', 'digoxin - TABLET 62.5 mcg PO/Tube']})

df['concatenated'] = df['meds'] + " " + df['details']
What I would like to do is
a) Check whether all of the individual keywords from the extracted column are present in the concatenated column.
b) If present, assign 1 to the output column, else 0.
c) Assign the keywords that were not found to the issue column, as shown below.
So, I was trying something like below
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
#the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ') #split them into keywords
def value_present(row):  # check whether each of the keyword is present in `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0

df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()
If you think it's useful to clean the concatenated column as well, that's fine. I am only interested in finding the presence of all the keywords.
Is there any efficient and elegant approach to do this on 7-8 million records?
I expect my output to be as shown below. Red color indicates a term that is present in the extracted column but missing from the concatenated column; such a row is assigned 0 and the keyword is stored in the issue column.
Let us zip the columns extracted and concatenated and for each pair map it to a function f which computes the set difference and returns the result accordingly:
import numpy as np

def f(x, y):
    s = set(x.split()) - set(y.split())
    return [0, ', '.join(s)] if s else [1, np.nan]

df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]
output issue
0 1 NaN
1 1 NaN
2 1 NaN
3 0 PO/Tube
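If you also want to strip the symbols before comparing (as mentioned for clean_extract), a minimal sketch, assuming that keeping only letters, digits and spaces is acceptable:
import re

def clean(text):
    # replace everything that is not a letter, digit or whitespace with a space
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

pairs = zip(df['extracted'].map(clean), df['concatenated'].map(clean))
df[['output', 'issue']] = [f(*p) for p in pairs]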
I have a pandas dataframe as shown below. There are many more columns in that frame that are not important for this task. The column id shows the sentence ID, while the columns e1 and e2 contain entities (= words) of the sentence, with their relationship in the column r.
id e1 e2 r
10 a-5 b-17 A
10 b-17 a-5 N
17 c-1 a-23 N
17 a-23 c-1 N
17 d-30 g-2 N
17 g-20 d-30 B
I also created a graph for each sentence. The graph is created from a list of edges that looks somewhat like this
[('wordB-5', 'wordA-1'), ('wordC-8', 'wordA-1'), ...]
All of those edges are in one list (of lists). Each element in that list contains all the edges of each sentence. Meaning list[0] has the edges of sentence 0 and so on.
Now I want to perform operations like these:
graph = nx.Graph(graph_edges[i])
shortest_path = nx.shortest_path(graph, source="e1", target="e2")
result_length = len(shortest_path)
result_path = shortest_path
For each row in the data frame, I'd like to calculate the shortest path (from the entity in e1 to the entity in e2) and save all of the results in a new column of the DataFrame, but I have no idea how to do that.
I tried using constructions such as these
e1 = DF["e1"].tolist()
e2 = DF["e2"].tolist()
for id in Df["sentenceID"]:
graph = nx.Graph(graph_edges[id])
shortest_path = nx.shortest_path(graph,source=e1, target=e2)
result_length = len(shortest_path)
result_path = shortest_path
to create the data but it says the target is not in the graph.
The new df should look like this:
id e1 e2 r length path
10 a-5 b-17 A 4 ..
10 b-17 a-5 N 4 ..
17 c-1 a-23 N 3 ..
17 a-23 c-1 N 3 ..
17 d-30 g-2 N 7 ..
17 g-20 d-30 B 7 ..
Here's one way to do what you are trying to do, in three distinct steps so that it is easier to follow along.
Step 1: From a list of edges, build the networkx graph object.
Step 2: Create a data frame with 2 columns (for each row in this DF, we want the shortest distance and path from the entity in the e1 column to the entity in e2)
Step 3: Row by row for the DF, calculate shortest path and length. Store them in the DF as new columns.
Step 1: Build the graph and add edges, one by one
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
elist = [[('a-5', 'b-17'), ('b-17', 'c-1')],    # sentence 1
         [('c-1', 'a-23'), ('a-23', 'c-1')],    # sentence 2
         [('b-17', 'g-2'), ('g-20', 'c-1')]]    # sentence 3

graph = nx.Graph()
for sentence_edges in elist:
    for fromnode, tonode in sentence_edges:
        graph.add_edge(fromnode, tonode)

nx.draw(graph, with_labels=True, node_color='lightblue')
plt.show()  # needed to display the plot when running as a script
Step 2: Create a data frame of desired distances
#Create a data frame to store distances from the element in column e1 to e2
DF = pd.DataFrame({"e1": ['c-1', 'a-23', 'c-1', 'g-2'],
                   "e2": ['b-17', 'a-5', 'g-20', 'g-20']})
DF
Step 3: Calculate Shortest path and length, and store in the data frame
This is the final step. Calculate shortest paths and store them.
pathlist, len_list = [], []  # placeholders
for row in DF.itertuples():
    so, tar = row[1], row[2]
    path = nx.shortest_path(graph, source=so, target=tar)
    length = nx.shortest_path_length(graph, source=so, target=tar)
    pathlist.append(path)
    len_list.append(length)

# Add these lists as new columns in the DF
DF['length'] = len_list
DF['path'] = pathlist
Which produces the desired resulting data frame:
Hope this helps you.
For anyone who is interested, here is the solution I ended up with (thanks to Ram Narasimhan):
pathlist, len_list = [], []
so, tar = DF["e1"].tolist(), DF["e2"].tolist()
id = DF["id"].tolist()

for _, s, t in zip(id, so, tar):
    graph = nx.Graph(graph_edges[_])  # constructing each graph
    try:
        path = nx.shortest_path(graph, source=s, target=t)
        length = nx.shortest_path_length(graph, source=s, target=t)
        pathlist.append(path)
        len_list.append(length)
    except nx.NetworkXNoPath:
        path = "No Path"
        length = "No Pathlength"
        pathlist.append(path)
        len_list.append(length)

# Add these lists as new columns in the DF
DF['length'] = len_list
DF['path'] = pathlist
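Note that for this loop to work, graph_edges has to be indexable by the sentence id from the id column (10, 17, ...); a hypothetical minimal mapping would look like:
# hypothetical structure: sentence id -> list of edges for that sentence
graph_edges = {
    10: [('a-5', 'b-17'), ('b-17', 'c-1')],
    17: [('c-1', 'a-23'), ('d-30', 'g-20'), ('g-20', 'c-1')],
}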
I have some large tab separated data sets that have long commented sections, followed by the table header, formatted like this:
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 2985885 . c G . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/0:0/0:202:36,166,0,0:0,202,0,0:255:225:0:36:60:60:0:. 0/1:0/1:321:29,108,37,147:0,137,184,0:228:225:228:36,36:60:60,60:2:225
chr1 3312963 . C T . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/1:0/1:80:36,1,43,0:0,37,0,43:80:195:80:36,31:60:60,60:1:. 0/0:0/0:143:138,5,0,0:0,143,0,0:255:195:255:36:60:60:3:57
Everything that starts with ## is a comment that needs to be stripped out, but I need to keep the header that starts with #CHROM. Is there any way to do this? The only options I am seeing for Pandas read_table allow only a single character for the comment string, and I do not see options for regular expressions.
The code I am using is this:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',comment='#')
This removes all lines that start with #, including the header I want to keep.
EDIT: For clarification, the header region starting with ## is of variable length. In bash this would simply be grep -Ev '^##'.
You can easily calculate the number of header lines that must be skipped when reading your file:
fn = '/path/to/file.csv'
skip_rows = 0
with open(fn, 'r') as f:
    for line in f:
        if line.startswith('##'):
            skip_rows += 1
        else:
            break

df = pd.read_table(fn, sep='\t', skiprows=skip_rows)
The first loop reads only the leading comment lines, so it should be very fast.
Use skiprows as a workaround:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',skiprows=3)
df
Out[13]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Then rename your first column to remove the #.
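For example (assuming the parsed header kept the leading # on the first column name):
SS_txt_df = SS_txt_df.rename(columns={'#CHROM': 'CHROM'})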
Update:
As you said, the length of your ## header varies, so I know the hard-coded skiprows is not a feasible solution, but you can drop all rows starting with # and then pass the column headers as a list, since your columns don't change:
name = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'NORMAL', 'TUMOR']
df = pd.read_table(SS_txt_file, sep='\t', comment='#', names=name)
df
Out[34]:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
I am trying to perform some operations on a file containing a list of TAB separated values, using Python 2.7.2. For more information, the file format is called BED, and represents a list of genes where each gene is represented by a line. The first three fields of each line represent coordinates. Another field contains a description that might be identical for more than one line.
Within lines with an identical description, I need to group together all lines with overlapping coordinates and somehow name these subgroups unequivocally. All lines connected by a chain of pairwise overlaps should end up in the same subgroup, e.g.:
chr1 1 3 geneA 1000 +
chr1 3 5 geneA 1000 +
chr1 4 6 geneA 1000 +
chr1 8 9 geneA 1000 +
should subgroup the genes as follows:
chr1 1 3 geneA 1000 +
chr1 3 5 geneA 1000 +
chr1 4 6 geneA 1000 +
and
chr1 8 9 geneA 1000 +
Eventually the goal would be to output a single new (merged) line for each subgroup, e.g.:
chr1 1 6 geneA 1000 +
chr1 8 9 geneA 1000 +
The value of the first field (chr) is variable; subgroups should only be built from lines with the same chr value.
Until now, I tried to solve the problem with this (wrong) approach:
# key = description
# values = list of lines (genes) with the same description
# self.raw_groups_of_overlapping = attribute (dict) containing, for a given description, all genes whose description matches the key
# self.picked_cherries = attribute (dict) in which I would like to store, for a given unique identifier, all genes in a specific subgroup (sub-grouping lines according to the aforementioned rule)
# self.__overlappingGenes__(j, k) = method evaluating if lines (genes) j and k overlap

for key, values in self.raw_groups_of_overlapping.items():
    for j in values:
        # Remove gene j from the list:
        is_not_j = lambda x: x is not j
        other_genes = filter(is_not_j, values)
        for k in other_genes:
            if self.__overlappingGenes__(j, k):
                intersection = [x for x in j.overlaps_with if x in k.overlaps_with]
                identifier = ''
                for gene in intersection:
                    identifier += gene.chr.replace('chr', '') + str(gene.start) + str(gene.end) + gene.description + gene.strand.replace('\n', '')
                try:
                    self.picked_cherries[identifier].append(j)
                except:
                    self.picked_cherries[identifier] = []
                    self.picked_cherries[identifier].append(j)
                break
I understand that somehow I am not considering all genes together, and I would appreciate your input.
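Not the asker's class-based code, but a minimal sketch of one way to do the merge: group the lines by (chr, description, strand), sort each group by the start coordinate, and keep extending the current interval while the next line starts at or before its end. This assumes six tab-separated fields per line, as in the example.
from collections import defaultdict

def merge_bed_lines(lines):
    # group parsed lines by (chr, description, strand)
    groups = defaultdict(list)
    for line in lines:
        chrom, start, end, desc, score, strand = line.rstrip().split("\t")
        groups[(chrom, desc, strand)].append((int(start), int(end), score))

    merged = []
    for (chrom, desc, strand), intervals in groups.items():
        intervals.sort()
        cur_start, cur_end, cur_score = intervals[0]
        for start, end, score in intervals[1:]:
            if start <= cur_end:       # overlaps (or touches) the current subgroup -> extend it
                cur_end = max(cur_end, end)
            else:                      # gap -> emit the finished subgroup and start a new one
                merged.append((chrom, cur_start, cur_end, desc, cur_score, strand))
                cur_start, cur_end, cur_score = start, end, score
        merged.append((chrom, cur_start, cur_end, desc, cur_score, strand))
    return merged
On the four geneA lines above, this yields (chr1, 1, 6, geneA, 1000, +) and (chr1, 8, 9, geneA, 1000, +).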
I have written a program in Python that uses a hash table to read data from a file and then sum the values in the last column of the file, grouped by the values in the 2nd column. For example, for all entries in column 2 with the same value, the corresponding last-column values are added together.
I have implemented the above successfully. Now I want to sort the table in descending order according to the summed last-column values and print these values together with the corresponding 2nd-column (key) values. I am not able to figure out how to do this. Can anyone please help?
The pmt.txt file is of the form:
0.418705 2 3 1985 20 0
0.420657 4 5 119 3849 5
0.430000 2 3 1985 20 500
and so on...
So, for example, for the number 2 in column 2, I have added up all the last-column values corresponding to every '2' in the 2nd column. The same process continues for the next set of numbers in column 2, like 4, 5, etc.
I'm using Python 3.
import math

source_ip = {}
f = open("pmt.txt", "r", 1)
lines = f.readlines()

for line in lines:
    s_ip = line.split()[1]
    bit_rate = int(line.split()[-1]) + 40
    if s_ip in source_ip.keys():
        source_ip[s_ip] = source_ip[s_ip] + bit_rate
        print(source_ip[s_ip])
    else:
        source_ip[s_ip] = bit_rate

f.close()

for k in source_ip.keys():
    print(str(k) + ": " + str(source_ip[k]))

print("-----------")
It sounds like you want to use the sorted function with a key parameter that gets the value from each key/value tuple; pass reverse=True as well, since you want descending order:
sorted_items = sorted(source_ip.items(), key=lambda x: x[1], reverse=True)
You could also use itemgetter from the operator module, rather than a lambda function:
import operator
sorted_items = sorted(source_ip.items(), key=operator.itemgetter(1), reverse=True)
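A short usage sketch, assuming source_ip already holds the summed values from your loop:
for key, total in sorted_items:
    print(str(key) + ": " + str(total))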
How about something like this?
#!/usr/local/cpython-3.4/bin/python

import collections

source_ip = collections.defaultdict(int)

with open("pmt.txt", "r", 1) as file_:
    for line in file_:
        fields = line.split()
        s_ip = fields[1]
        bit_rate = int(fields[-1]) + 40
        source_ip[s_ip] += bit_rate
        print(source_ip[s_ip])

# sort by the summed value, descending, to match the question
for key, value in sorted(source_ip.items(), key=lambda kv: kv[1], reverse=True):
    print('{}: {}'.format(key, value))

print("-----------")