I have file.txt with five columns and 50 lines. I want to name each column
1 5
2 4.2
. .
. .
Here is the Python code that generates the file:
f = open("test_01.txt", "a")
for i in xrange(len(end_zapstart_zap)):
    f.write(" {} {} {} {} {} \n".format(Leavestart_zap[i],end_zapUDP[i],end_zapstart_zap[i],Leave_join[i],UDP_join[i]))
f.close()
I would like to name the first column Leavestart_zap, the second column end_zapUDP, etc.
Here is my file test_01.txt:
0:00:00.672511 0:00:02.615662 0:00:03.433344 0:00:00.119777 0:00:00.025394 0:00:00.002278
0:00:00.263144 0:00:03.184893 0:00:03.541187 0:00:00.090872 0:00:00.025394 0:00:00.002278
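One way to get named columns is to write a header line before the data rows when the file is generated. The following is only a minimal sketch under assumptions: the column names come from the question, the row values are placeholders with five fields each, and `io.StringIO` stands in for the real file so the example is self-contained.

```python
import io

# Column names taken from the question; the row values are placeholders.
names = ["Leavestart_zap", "end_zapUDP", "end_zapstart_zap", "Leave_join", "UDP_join"]
rows = [
    ("0:00:00.672511", "0:00:02.615662", "0:00:03.433344", "0:00:00.119777", "0:00:00.025394"),
    ("0:00:00.263144", "0:00:03.184893", "0:00:03.541187", "0:00:00.090872", "0:00:00.025394"),
]

out = io.StringIO()  # stands in for open("test_01.txt", "w")
out.write(" ".join(names) + "\n")  # header line first
for row in rows:
    out.write(" ".join(row) + "\n")  # then one line per record

text = out.getvalue()
print(text)
```

Note that opening the real file in "a" mode on every run would append a new header each time, so "w" is probably the safer mode if a header is written.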
How can I convert the following .vcf data into a pandas dataframe?
GDrive Link To .txt File
Ideally I would like it in the form:
Thus far I have only been able to get the headers:
import pandas as pd
f = open('clinvar_final.txt',"r")
for line in f.readlines():
    if line[:5] == 'CHROM':
        vcf_header = line.strip().split('\t')
        df = pd.DataFrame
        df.header = vcf_header
There is no need to read line by line.
Pandas has an option called comment which can be used to skip unwanted lines.
You can directly load VCF files into pandas by running the following line.
In [9]: pd.read_csv('clinvar_final.txt', sep="\t", comment='#')
Out[9]:
CHROM POS ID REF ALT FILTER QUAL INFO
0 1 1014042 475283 G A . . AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1 1 1014122 542074 C T . . AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2 1 1014143 183381 C T . . ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:...
3 1 1014179 542075 C T . . ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:...
4 1 1014217 475278 C T . . AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
... ... ... ... .. .. ... ... ...
102316 3 179210507 403908 A G . . ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orpha...
102317 3 179210511 526648 T C . . ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orpha...
102318 3 179210515 526640 A C . . AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGe...
102319 3 179210516 246681 A G . . AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGe...
102320 3 179210538 259958 A T . . AF_EXAC=0.00001;ALLELEID=251013;CLNDISDB=MedGe...
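To make the behaviour of the comment option concrete, here is a small sketch on made-up data (the metadata lines and the five-column layout are assumptions, not the real clinvar file). It relies on the header line not starting with "#"; a header written as "#CHROM" would be dropped along with the metadata, and any "#" inside a data field would truncate that line.

```python
import io
import pandas as pd

# Made-up miniature file: two "##" metadata lines, then a header without "#".
text = (
    "##fileformat=VCFv4.1\n"
    "##source=example\n"
    "CHROM\tPOS\tID\tREF\tALT\n"
    "1\t1014042\t475283\tG\tA\n"
    "1\t1014122\t542074\tC\tT\n"
)

# comment="#" discards everything from a "#" to the end of the line,
# so lines that start with "#" are skipped entirely.
df = pd.read_csv(io.StringIO(text), sep="\t", comment="#")
print(df)
```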
GATK VariantsToTable is what you need to avoid any issues caused by the flexibility of the VCF format. Once you have the resulting table file, import it into pandas. I would say that this is the most robust way to do this.
https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable
import pandas as pd
with open(filename, "r") as f:
    lines = f.readlines()
# index of the "#CHROM" header line; everything before it is metadata
chrom_index = [i for i, line in enumerate(lines) if line.strip().startswith("#CHROM")]
data = lines[chrom_index[0]:]
header = data[0].strip().split("\t")
informations = [d.strip().split("\t") for d in data[1:]]
vcf = pd.DataFrame(informations, columns=header)
I wish to write a Python script which reads from a CSV. The CSV consists of 2 columns. I want the script to read through the first column row by row and find the corresponding value in the second row. If it finds the value in the second row I want it to input a value into a third column.
example of output
Any help with this would be much appreciated and I hope my aim is clear. Apologies in advance if it is too vague.
This script reads the test.csv file, parses it, and writes the result to OUTPUT.txt:
f = open("test.csv","r")
d={}  # ID -> first value seen for that ID
s={}  # ID -> all later values for that ID, each prefixed with ";"
for line in f:
    l=line.split(",")
    if l[0] not in d:
        d[l[0]]=l[1].rstrip()
        s[l[0]]=''
    else:
        s[l[0]]+=";"+l[1].rstrip()
w=open("OUTPUT.txt","w")
w.write("%-10s %-10s %-10s\r\n" % ("ID","PARENTID","Atachment"))
for i in d.keys():
    w.write("%-10s %-10s %-10s\r\n" % (i,d[i],s[i]))
f.close()
w.close()
example:
input:
1,123
2,456
1,333
3,
1,asas
1,333
000001,sasa
1,ss
1023265,333
0221212,
000001,sasa2
000001,sas4
OUTPUT:
ID PARENTID Atachment
000001 sasa ;sasa2;sas4
1023265 333
1 123 ;333;asas;333;ss
3
2 456
0221212
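The same grouping can be sketched with the csv module and lists instead of string concatenation. This is an illustrative variant, not the script above: the input is a shortened, hypothetical sample, `io.StringIO` stands in for test.csv, and the later values are joined with ";" without the leading separator.

```python
import csv
import io

# Hypothetical input in the same "ID,value" shape as the example above.
sample = io.StringIO("1,123\n2,456\n1,333\n3,\n1,asas\n")

first = {}  # ID -> first value seen (PARENTID column)
extra = {}  # ID -> list of later values (attachment column)
for row in csv.reader(sample):
    key, value = row[0], row[1].strip()
    if key not in first:
        first[key] = value
        extra[key] = []
    else:
        extra[key].append(value)

for key in first:
    print("%-10s %-10s %-10s" % (key, first[key], ";".join(extra[key])))
```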
I have some large tab separated data sets that have long commented sections, followed by the table header, formatted like this:
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 2985885 . c G . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/0:0/0:202:36,166,0,0:0,202,0,0:255:225:0:36:60:60:0:. 0/1:0/1:321:29,108,37,147:0,137,184,0:228:225:228:36,36:60:60,60:2:225
chr1 3312963 . C T . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/1:0/1:80:36,1,43,0:0,37,0,43:80:195:80:36,31:60:60,60:1:. 0/0:0/0:143:138,5,0,0:0,143,0,0:255:195:255:36:60:60:3:57
Everything that starts with ## is a comment that needs to be stripped out, but I need to keep the header that starts with #CHROM. Is there any way to do this? The only options I am seeing for Pandas read_table allow only a single character for the comment string, and I do not see options for regular expressions.
The code I am using is this:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',comment='#')
This removes all lines that start with #, including the header I want to keep
EDIT: For clarification, the header region starting with ## is of variable length. In bash this would simply be grep -Ev '^##'.
You can easily calculate the number of header lines that must be skipped when reading your CSV file:
import pandas as pd

fn = '/path/to/file.csv'
skip_rows = 0
with open(fn, 'r') as f:
    for line in f:
        if line.startswith('##'):
            skip_rows += 1
        else:
            break
df = pd.read_table(fn, sep='\t', skiprows=skip_rows)
The first part reads only the header lines, so it should be very fast.
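A Python counterpart to the grep -Ev '^##' idea from the question is to filter the lines before handing them to pandas. The sketch below uses a made-up miniature file held in memory; for the large data sets mentioned in the question, counting the lines to skip is likely lighter, since this version keeps every remaining line in a list.

```python
import io
import pandas as pd

# Made-up miniature file: a "##" block of any length, then the "#CHROM" header.
text = (
    "##FORMAT=<ID=SS,Number=1,Type=Integer>\n"
    "##FORMAT=<ID=SSC,Number=1,Type=Integer>\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t2985885\t.\tc\tG\n"
    "chr1\t3312963\t.\tC\tT\n"
)

# Keep every line that does not start with "##" (like grep -Ev '^##').
kept = [line for line in io.StringIO(text) if not line.startswith("##")]
df = pd.read_csv(io.StringIO("".join(kept)), sep="\t")
print(df.columns.tolist())
```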
Use skiprows as a workaround:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',skiprows=3)
df
Out[13]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Then rename your first column to remove the #.
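The rename step could look like this; a minimal sketch on a stand-in DataFrame rather than the real file:

```python
import pandas as pd

# Tiny stand-in frame with the "#CHROM" column name left over from the header.
df = pd.DataFrame({"#CHROM": ["chr1", "chr1"], "POS": [2985885, 3312963]})

# Strip the leading "#" from that one column name.
df = df.rename(columns={"#CHROM": "CHROM"})
print(df.columns.tolist())
```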
Update:
As you said, the number of ## lines varies. I know this is not an ideal solution, but you can drop all rows starting with # and then pass the column headers as a list, since your columns don't change:
name=['CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO' ,'FORMAT','NORMAL','TUMOR']
df=pd.read_table(SS_txt_file,sep='\t',comment='#',names=name)
df
Out[34]:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Hello, I am looking for some help doing something like an INDEX/MATCH in Excel. I am very new to Python, but my data sets are now far too large for Excel.
I will simplify my question as much as possible, because the data contains a lot of information that is irrelevant to this problem.
CSV A (has 3 Basic columns)
Name, Date, Value
CSV B (has 2 columns)
Value, Score
CSV C (I want to create this using python; 2 columns)
Name, Score
All I want to do is enter a date and have it look up all rows in CSV A which match that "date", then look up in CSV B the "score" associated with the "value" from that row of CSV A, and return it in CSV C along with the name of the person. Rinse and repeat through every row.
Any help is much appreciated; I don't seem to be getting very far.
Here is a working script using Python's csv module:
It prompts the user to input a date (format is m-d-yy), then reads csvA row by row to check if the date in each row matches the inputted date.
If yes, it checks if the value that corresponds to the date from the current row of A matches any of the rows in csvB.
If there are matches, it will write the name from csvA and the score from csvB to csvC.
import csv

date = input('Enter date: ').strip()
A = csv.reader(open('csvA.csv', newline=''), delimiter=',')
matches = 0
# reads each row of csvA
for row_of_A in A:
    # removes whitespace before and after of each string in each row of csvA
    row_of_A = [string.strip() for string in row_of_A]
    # if date of row in csvA has equal value to the inputted date
    if row_of_A[1] == date:
        B = csv.reader(open('csvB.csv', newline=''), delimiter=',')
        # reads each row of csvB
        for row_of_B in B:
            # removes whitespace before and after of each string in each row of csvB
            row_of_B = [string.strip() for string in row_of_B]
            # if value of row in csvA is equal to the value of row in csvB
            if row_of_A[2] == row_of_B[0]:
                # if csvC.csv does not exist, create it with a header row
                try:
                    open('csvC.csv', 'r').close()
                except IOError:
                    with open('csvC.csv', 'a') as C:
                        print('Name,', 'Score', file=C)
                # writes name from csvA and score from csvB to csvC
                with open('csvC.csv', 'a') as C:
                    print(row_of_A[0] + ', ' + row_of_B[1], file=C)
                # counts the match so the summary below is correct
                matches += 1
m = 'matches' if matches > 1 else 'match'
print('Found', matches, m)
Sample csv files:
csvA.csv
Name, Date, Value
John, 2-6-15, 10
Ray, 3-5-14, 25
Khay, 4-4-12, 30
Jake, 2-6-15, 100
csvB.csv
Value, Score
10, 500
25, 200
30, 300
100, 250
Sample Run:
>>> Enter date: 2-6-15
Found 2 matches
csvC.csv (generated by script)
Name, Score
John, 500
Jake, 250
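Since csvB is re-read for every matching row of csvA, a possible alternative is to load csvB into a dictionary once and look values up directly. The sketch below is a hedged variant rather than the script above: `io.StringIO` objects stand in for the two files, the date is hard-coded instead of coming from input(), and the result is collected in a list instead of being written to csvC.csv.

```python
import csv
import io

# Stand-ins for csvA.csv and csvB.csv, using the sample data above.
csv_a = io.StringIO(
    "Name, Date, Value\nJohn, 2-6-15, 10\nRay, 3-5-14, 25\n"
    "Khay, 4-4-12, 30\nJake, 2-6-15, 100\n"
)
csv_b = io.StringIO("Value, Score\n10, 500\n25, 200\n30, 300\n100, 250\n")
date = "2-6-15"  # would come from input() in a real script

# Build the Value -> Score lookup once.
reader_b = csv.reader(csv_b, skipinitialspace=True)
next(reader_b)  # skip the header row
score_by_value = {value: score for value, score in reader_b}

# Scan csvA once and keep (name, score) for rows matching the date.
reader_a = csv.reader(csv_a, skipinitialspace=True)
next(reader_a)  # skip the header row
result = [
    (name, score_by_value[value])
    for name, row_date, value in reader_a
    if row_date == date and value in score_by_value
]
print(result)
```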
If you are using Unix, you can do this with the shell script below.
I am also assuming that you are appending the search output to file_C and that there are no duplicates in either source file.
while true
do
    echo "enter date ..."
    read date
    value_one=$(grep "$date" file_A | cut -d',' -f1)
    tmp1=$(grep "$date" file_A | cut -d',' -f3)
    value_two=$(grep "$tmp1" file_B | cut -d',' -f2)
    echo "${value_one},${value_two}" >> file_C
    echo "want to search more dates ... press y|Y, press any other key to exit"
    read ch
    if [ "$ch" = "y" ] || [ "$ch" = "Y" ]
    then
        continue
    else
        exit
    fi
done
I am trying to extract two columns from a text file, here datapoint and index, and I want both columns to be written to a text file as columns. I made a small program that does roughly what I want, but it's not working completely.
Any suggestions on this, please?
My program is:
f = open ('infilename', 'r')
header1= f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    j=float(columns[1])
    i=columns[3]
    print i, j
f.close()
It is also giving an error:
j=float(columns[1])
IndexError: list index out of range
Sample data:
datapoint index
66.199748 200 0.766113 0 1
66.295962 200 0.826375 1 0
66.295962 200 0.762582 1 1
66.318076 200 0.850936 2 0
66.318076 200 0.751474 2 1
66.479436 200 0.821261 3 0
66.479436 200 0.765673 3 1
66.460284 200 0.869779 4 0
66.460284 200 0.741051 4 1
66.551778 200 0.841143 5 0
66.551778 200 0.765198 5 1
66.303606 200 0.834398 6 0
. . . . .
. . . . .
. . . . .
. . . . .
69.284336 200 0.926158 19998 0
69.284336 200 0.872788 19998 1
69.403861 200 0.943316 19999 0
69.403861 200 0.884889 19999 1
The following code will allow you to do all of the file writing through Python. Redirecting through the command line like you were doing works fine; this will just be self-contained instead.
f = open ('in.txt', 'r')
out = open("out.txt", "w")
header1= f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    # skip blank or short lines, which would otherwise raise IndexError
    if len(columns) > 3:
        j=float(columns[1])
        i=columns[3]
        out.write("%s %s\n" %(i, j))
f.close()
out.close()
Warning: This will always overwrite "out.txt". If you would simply like to add to the end of it if it already exists, or create a new file if it doesn't, you can change the "w" to "a" when you open out.
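The IndexError in the question most likely comes from blank or short lines in the input, so the key step is checking the number of fields before indexing. Here is a self-contained sketch of that check, with made-up input that includes a blank line; the `io.StringIO` objects stand in for the real input and output files, which in real use would be opened in a `with` block so they are closed automatically.

```python
import io

# Stand-in for the input file: one header line, then five columns per row,
# plus a blank line of the kind that would trigger the IndexError.
infile = io.StringIO(
    "datapoint index\n"
    "66.199748 200 0.766113 0 1\n"
    "66.295962 200 0.826375 1 0\n"
    "\n"
)
out = io.StringIO()  # stands in for open("out.txt", "w")

next(infile)  # skip the header line
for line in infile:
    columns = line.split()
    if len(columns) < 4:  # skip blank or short lines instead of failing
        continue
    out.write("%s %s\n" % (columns[3], float(columns[1])))

result = out.getvalue()
print(result)
```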