How to parse data from .TX0 file into dataframe - python

Hi, I'm trying to parse a .TX0 file exported from a chromatogram. The file is just a bunch of results, including retention times etc. I eventually want to pick certain pieces of data from multiple files and do some analysis. So far I have:
filename = 'filepath'
with open(filename, 'r') as f:
    lines = f.readlines()
print(lines)
My output is:
Out[29]:
[....................................................
'"condensate analysis (HP4890 Optic - FID)"\n',
'"Peak","Component","Time","Area","Height","BL"\n',
'"#","Name","[min]","[uV*sec]","[uV]",""\n',
'------,------,------,------,------,------\n',
'1,"Methane",0.689,5187666.22,994337.57,*BB\n',
'2,"Ethane",1.061,1453339.93,729285.09,*BB\n',
'3,"Propane",1.715,193334.09,63398.74,*BB\n',
'4,"i-Butane",2.792,157630.92,29233.56,*BV\n',
'5,"n-Butane",3.240,98943.96,15822.72,*VB\n',
'"","","",------,------,""\n',
'"","","",7090915.11,1.83e+06,""\n',
'"Missing Component Report"\n',
'"Component","Expected Retention (Calibration File)"\n',
'------,------\n',
'"All components were found"\n',
'"Report stored in ASCII file :","...
"\n'.......................]
Now, the problem I'm having: I can't get this output into a structured DataFrame using pandas... =/ I've tried the following, and it just gives me a single column:
pd.DataFrame(lines)
out:
Out[26]:
0
0 "=============================================...
1 "Software Version:",6.3.2.0646,"Date:","08/06/...
2 "Reprocess Number:","vma2: ......................
.......................
10 ""\n
11 ""\n
12 "condensate analysis (HP4890 Optic - FID)"\n
13 "Peak","Component","Time","Area","Height","BL"\n
14 "#","Name","[min]","[uV*sec]","[uV]",""\n
15 ------,------,------,------,------,------\n
16 1,"Methane",0.689,5187666.22,994337.57,*BB\n
17 2,"Ethane",1.061,1453339.93,729285.09,*BB\n
18 3,"Propane",1.715,193334.09,63398.74,*BB\n
19 4,"i-Butane",2.792,157630.92,29233.56,*BV\n
20 5,"n-Butane",3.240,98943.96,15822.72,*VB\n
21 "","","",------,------,""\n
22 "","","",7090915.11,1.83e+06,""\n
23 "Missing Component Report"\n
24 "Component","Expected Retention (Calibration F...
25 ------,------\n
26 "All components were found"\n
27 "Report stored in ASCII file :","C:\Shared Fol...
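One way to pull out just the peak table is to locate its header row and read only the numbered data rows. A sketch, assuming the layout shown above (a header row starting with "Peak", a units row, a ------ separator, then data rows terminated by a quoted footer row); the column names are taken from the sample output:

```python
import io
import pandas as pd

def parse_tx0(path):
    """Extract the peak table from a .TX0 report with the layout shown above."""
    with open(path) as f:
        lines = f.readlines()
    # Locate the header row of the peak table
    start = next(i for i, ln in enumerate(lines) if ln.startswith('"Peak"'))
    # Data rows begin after the units row and the ------ separator
    data = []
    for ln in lines[start + 3:]:
        if ln.startswith('"'):  # footer rows like "","","",------ end the table
            break
        data.append(ln)
    cols = ["Peak", "Component", "Time", "Area", "Height", "BL"]
    return pd.read_csv(io.StringIO("".join(data)), names=cols)
```

Called on one of the files, this should give a DataFrame with one row per peak, which can then be concatenated across files for the analysis.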

Related

Skip all rows containing strings and keep only rows with floats

I have a log file from a mathematical simulation. I tried to parse it in Python, but I'm not quite satisfied with the result. Is there an elegant way to loop over each line and keep only the lines with physical values, ditching the rest?
The goal is to perform various analyses using numpy. Knowing that the lines I need contain only numerical values, is there a way to tell Python to keep only the rows/lines with numerical values and ditch all the rows containing strings? Thank you for your help. A sample of the log file is attached.
5 Host 1 -- hnode146 -- Ranks 20-39
6 Host 2 -- hnode147 -- Ranks 40-59
7 Host 3 -- hnode148 -- Ranks 60-79
8 Process rank 0 hnode145 36210
9 Total number of processes : 80
10
11 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8)
12 License build date: 10 February 2015
13 This version of the code requires license version 2017.02 or greater.
14 Checking license file:
15 Checking license file:
16 Unable to list features for license file
17 1 copy of ccmppower checked out from
18 Feature ccmppower expires in
19 Thu Apr 19 17:22:54 2018
20
21 Server::start -host h
22 Loading object database:
23 Loading module: StarMeshing
24 Loading module: MeshingSurfaceRepair
25 Loading module: CadModeler
26 Started Parasolid modeler version 29.01.131
27 Loading module: StarResurfacer
28 Loading module: StarTrimmer
29 Loading module: SegregatedFlowModel
30 Loading module: KwTurbModel
31 Loading module: StarDualMesher
32 Loading module: StarBodyFittedMesher
33 Simulation database saved by:
34 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Serial
35 Loading into:
36 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Np=80
37 Object database load completed.
39 A Zeit und Datum : 2018.04.19 at 17:23:11
40
41 Startzeit: 1524151391534
42
43 Loading/configuring connectivity (old|new partitions: 1|80)
44 Domain (index 1): 1889922 cells, 5614862 faces, 1990686 verts.
45 Configuring finished
46 Reading material property database "/sw/apps/cd-adapco/12.02.011-R8/STAR-CCM+12.02.011-R8/star/props.mdb"...
47 Re-partitioning
48 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00
52 2004 3.684017e-02 1.322581e-01 1.350187e-01 8.827220e-02 9.023783e-04 1.039251e+01 -3.914011e+00 -1.213340e+00 -2.700671e+00
53 2005 3.224797e-02 1.093365e-01 1.059148e-01 7.461911e-02 6.307195e-04 6.650742e+00 -3.745949e+00 -1.217353e+00 -2.528596e+00
54 2006 2.788050e-02 9.180507e-02 8.311817e-02 6.417279e-02 4.603072e-04 4.256107e+00 -3.658613e+00 -1.224046e+00 -2.434567e+00
55 2007 2.332397e-02 7.688239e-02 6.222694e-02 4.860232e-02 3.534658e-04 2.723686e+00 -3.608431e+00 -1.231574e+00 -2.376857e+00
56 2008 1.916130e-02 6.201947e-02 4.645780e-02 3.654489e-02 2.833177e-04 1.743055e+00 -3.575486e+00 -1.237352e+00 -2.338134e+00
57 2009 1.600865e-02 4.780234e-02 3.909247e-02 2.959689e-02 2.370245e-04 1.115506e+00 -3.548365e+00 -1.240938e+00 -2.307427e+00
58 2010 1.389765e-02 3.570659e-02 3.492423e-02 2.537285e-02 2.055279e-04 7.138997e-01 -3.527530e+00 -1.242749e+00 -2.284781e+00
59 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
60 2011 1.253570e-02 2.591702e-02 3.089287e-02 2.209728e-02 1.814997e-04 4.568718e-01 -3.511034e+00 -1.242906e+00 -2.268128e+00
61 2012 1.141436e-02 1.992464e-02 2.745902e-02 1.922942e-02 1.636478e-04 2.923702e-01 -3.498876e+00 -1.243006e+00 -2.255870e+00
62 2013 1.024511e-02 1.621655e-02 2.544053e-02 1.687660e-02 1.492828e-04 1.870937e-01 -3.489288e+00 -1.242425e+00 -2.246863e+00
63 2014 9.067693e-03 1.359007e-02 2.320886e-02 1.481687e-02 1.371763e-04 1.197299e-01 -3.482323e+00 -1.242027e+00 -2.240295e+00
64 2015 7.906450e-03 1.159567e-02 2.073906e-02 1.306014e-02 1.265825e-04 7.662597e-02 -3.479134e+00 -1.243537e+00 -2.235597e+00
65 2016 6.889290e-03 1.010569e-02 1.787383e-02 1.258395e-02 1.171344e-04 4.903984e-02 -3.479042e+00 -1.246677e+00 -2.232364e+00
66 2017 5.982303e-03 8.872579e-03 1.576665e-02 1.141871e-02 1.086443e-04 3.138620e-02 -3.480301e+00 -1.249988e+00 -2.230313e+00
67 2018 5.191895e-03 7.958489e-03 1.446382e-02 9.796685e-03 1.009937e-04 2.009149e-02 -3.482459e+00 -1.253255e+00 -2.229204e+00
68 2019 4.614927e-03 7.193031e-03 1.279295e-02 8.818100e-03 9.411761e-05 1.286594e-02 -3.484886e+00 -1.256002e+00 -2.228885e+00
69 2020 4.159939e-03 6.571088e-03 1.146195e-02 7.756150e-03 8.794392e-05 8.241197e-03 -3.487597e+00 -1.258382e+00 -2.229214e+00
70 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
71 2021 3.779168e-03 5.961164e-03 1.034847e-02 6.969454e-03 8.240903e-05 5.278791e-03 -3.490138e+00 -1.260061e+00 -2.230078e+00
72 2022 3.414811e-03 5.350398e-03 9.329119e-03 6.398522e-03 7.743586e-05 3.381806e-03 -3.491624e+00 -1.260241e+00 -2.231384e+00
Read each line, split it on whitespace, and attempt to convert each field to a float. If the conversion fails, the line isn't kept. There's certainly a way to do this with a regex, but this works off the top of my head:
lines_to_keep = []
for line in f.readlines():
    try:
        # Throws ValueError if any field can't be converted to float
        [float(x) for x in line.split()]
        # If the above line didn't throw a ValueError, keep it
        lines_to_keep.append(line)
    except ValueError:
        continue
import re

list_to_keep = []
pattern = re.compile(r'[0-9 ]+[e.\-+][0-9]*', re.IGNORECASE)
with open(f) as logfile:  # f is the path to the log file
    for row in logfile:
        if pattern.match(row):
            list_to_keep.append(row)
You can use a regex to find the rows you want and keep them in a list.
If you'd like a regex: this one matches runs of digits separated by the numeric symbols '+-.e'.
import re
r = re.compile(r'([0-9 ]+[e.\-+]*)+\n')
lines = [line for line in open('a.log') if r.fullmatch(line)]
# all the useful lines are ...
# 49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
# 50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
# 51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00

How to add column numbers to each column in a large text file

I would like to add column numbers to 128 columns in a text file
E.g.
My file
12 13 14 15
20 21 23 14
34 56 67 89
Required output
1:12 2:13 3:14 4:15
1:20 2:21 3:23 4:14
1:34 2:56 3:67 4:89
Can this be done using awk or python?
I tried the paste command to join two files: one with the values, the other with the column numbers, typed manually. Since the file is very large, manual typing didn't work.
To my knowledge, the existing answers only cover adding a single column to the end of a text file.
Thanks for the suggestions
awk to the rescue!
$ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file
should do.
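If you'd rather stay in Python, a minimal equivalent of that awk one-liner (assuming whitespace-separated columns) could look like:

```python
def number_columns(line):
    """Prefix each whitespace-separated field with its 1-based column index."""
    return " ".join("{0}:{1}".format(i, v)
                    for i, v in enumerate(line.split(), start=1))

# To process a whole file (path is a placeholder):
# with open("file") as src:
#     for line in src:
#         print(number_columns(line))

print(number_columns("12 13 14 15"))  # -> 1:12 2:13 3:14 4:15
```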

How to split one column into two columns in python?

I have a contig file loaded in pandas like this:
>NODE_1_length_4014_cov_1.97676
1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
...
8540 >NODE_2518_length_56_cov_219
8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542 >NODE_2519_length_56_cov_174
8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544 >NODE_2520_length_56_cov_131
8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546 >NODE_2521_length_56_cov_118
8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548 >NODE_2522_length_56_cov_96
8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550 >NODE_2523_length_56_cov_74
8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552 >NODE_2524_length_56_cov_70
8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554 >NODE_2525_length_56_cov_59
8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556 >NODE_2526_length_56_cov_48
8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558 >NODE_2527_length_56_cov_44
8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560 >NODE_2528_length_56_cov_42
8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562 >NODE_2529_length_56_cov_38
8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564 >NODE_2530_length_56_cov_29
8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566 >NODE_2531_length_56_cov_26
8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568 >NODE_2532_length_56_cov_25
8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...
How can I split this one column into two columns, with >NODE_...... in one column and the corresponding sequence in another? Another issue is that the sequences span multiple lines; how can I join each one into a single string? The expected result looks like this:
contig sequence
NODE_1_length_4014_cov_1.97676 AAAAAAAAAAAAAAA
NODE_........ TTTTTTTTTTTTTTT
Thank you very much.
I can't reproduce your example, but my guess is that you are loading a file with pandas that is not in a tabular format. From your example, it looks like your file is formatted:
>Identifier
sequence
>Identifier
sequence
You would have to parse the file before you can put the information into a pandas DataFrame. The logic is to loop through each line of the file: if the line starts with '>', store it as an identifier; otherwise, concatenate it onto the current sequence. Something like this:
import pandas as pd

testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')
identifiers = []
sequences = []
current_sequence = ''
for line in testfile:
    if line.startswith('>'):
        if identifiers:                    # flush the previous record
            sequences.append(current_sequence)
        identifiers.append(line.lstrip('>'))
        current_sequence = ''
    else:
        current_sequence += line.strip('\n')
sequences.append(current_sequence)         # don't forget the last record
df = pd.DataFrame({'identifiers': identifiers,
                   'sequences': sequences})
Whether this code works depends on the details of your input which you didn't provide, but that might get you started.

How to combine header files with data files with python?

I have separate files. One set contains only header info, like the example shown below:
~content of "header1.txt"~
a 3
b 2
c 4
~content of "header2.txt"~
a 4
b 3
c 5
~content of "header3.txt"~
a 1
b 7
c 6
And the other set contains only data, as shown below:
~content of "data1.txt"~
10 20 30 40
20 14 22 33
~content of "data2.txt"~
11 21 31 41
21 24 12 23
~content of "data3.txt"~
21 22 11 31
10 26 14 33
After combining the corresponding files, the results should look like the examples below:
~content of "asc1.txt"~
a 3
b 2
c 4
10 20 30 40
20 14 22 33
~content of "asc2.txt"~
a 4
b 3
c 5
11 21 31 41
21 24 12 23
~content of "asc3.txt"~
a 1
b 7
c 6
21 22 11 31
10 26 14 33
Can anyone give me some help in writing this in python? Thanks!
If you really want it in Python, here is one way to do it:
for i in range(1, 4):
    # 'w' (not 'a') so that re-running doesn't append duplicate content
    with open('header{0}.txt'.format(i)) as h, \
         open('data{0}.txt'.format(i)) as d, \
         open('asc{0}.txt'.format(i), 'w') as a:
        a.writelines(h.readlines() + d.readlines())
Of course, this assumes there are three of each kind of file and that all follow the naming convention you used.
Try this (written in Python 3.4 IDLE). It's pretty long, but it should be easier to understand:
# can start by creating a function to read contents of
# each file and return the contents as a string
def readFile(file):
    contentsStr = ''
    for line in file:
        contentsStr += line
    return contentsStr
# Read all the header files header1, header2, header3
header1 = open('header1.txt','r')
header2 = open('header2.txt','r')
header3 = open('header3.txt','r')
# Read all the data files data1, data2, data3
data1 = open('data1.txt','r')
data2 = open('data2.txt','r')
data3 = open('data3.txt','r')
# Open/create output files asc1, asc2, asc3
asc1_outFile = open('asc1.txt','w')
asc2_outFile = open('asc2.txt','w')
asc3_outFile = open('asc3.txt','w')
# read contents of each header file and data file into string variables
header1_contents = readFile(header1)
header2_contents = readFile(header2)
header3_contents = readFile(header3)
data1_contents = readFile(data1)
data2_contents = readFile(data2)
data3_contents = readFile(data3)
# Append the contents of each data file contents to its
# corresponding header file
asc1_contents = header1_contents + '\n' + data1_contents
asc2_contents = header2_contents + '\n' + data2_contents
asc3_contents = header3_contents + '\n' + data3_contents
# now write the necessary results to asc1.txt, asc2.txt, and
# asc3.txt output files respectively
asc1_outFile.write(asc1_contents)
asc2_outFile.write(asc2_contents)
asc3_outFile.write(asc3_contents)
# close the file streams
header1.close()
header2.close()
header3.close()
data1.close()
data2.close()
data3.close()
asc1_outFile.close()
asc2_outFile.close()
asc3_outFile.close()
# done!
By the way, ensure that the header files and data files are in the same directory as the python script. Otherwise, you can simply edit the file paths of these files accordingly in the code above. The output files asc1.txt, asc2.txt, and asc3.txt will be created in the same directory as your python source file.
This works if the number of header files equals the number of data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []
# Traverse the files and collect their contents; sorted() is needed
# because glob returns file names in arbitrary order
for files1 in sorted(glob.glob("directory/header*.txt")):
    header.append(open(files1, "r").read())
for files2 in sorted(glob.glob("directory/data*.txt")):
    data.append(open(files2, "r").read())
# Write the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
Edit
This method only works if the header and data files live in two separate folders and those folders contain no files other than the header or data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []
# Traverse the files and collect their contents; sorted() is needed
# because glob returns file names in arbitrary order
for files1 in sorted(glob.glob("directory1/*.txt")):
    header.append(open(files1, "r").read())
for files2 in sorted(glob.glob("directory2/*.txt")):
    data.append(open(files2, "r").read())
# Write the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()

How to write dictionary values to a csv file using Python

I have a dictionary of class objects. I want to write the member values (timepoints, fitted, measured) of the class to a csv file using Python.
My Class:
class PlotReadingCurves:
def __init__(self, timepoints, fitted, measured):
self.timepoints = timepoints
self.fitted = fitted
self.measured = measured
obj = PlotReadingCurves(mTimePoints,mFitted,mMeasured)
PlotReadingCurvesList[csoId] = obj
E.g.: timepoints: 1 2 3 4 5
fitted: 6 7 8 9 10
measured: 11 12 13 14 15
Expected results:
timepoints fitted measured fitted measured
1 6 11 .. ..
2 7 12
3 8 13
4 9 14
5 10 15
Try my mini wrapper library pyexcel. Although it is not as powerful as pandas, it is sufficient to write a dict to an excel file in a few lines of code:
>>> import pyexcel as pe
>>> your_dict = { "timepoints": [1,2,3], "fitted":[6,7,8]} # more columns omitted
>>> sheet = pe.Sheet(pe.utils.dict_to_array(your_dict))
>>> sheet.save_as("your_file_name.csv") # done
With pyexcel, you can easily write your data into other excel formats: xls, xlsx and even ods. The documentation can be found here
Try pandas; here is the feature from its documentation that addresses your problem:
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
It's very convenient and powerful.
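To flesh that out, a minimal pandas sketch using the column names from the question (the values and output file name are illustrative):

```python
import pandas as pd

# One dict key per output column; values taken from the question's example
data = {
    "timepoints": [1, 2, 3, 4, 5],
    "fitted": [6, 7, 8, 9, 10],
    "measured": [11, 12, 13, 14, 15],
}
df = pd.DataFrame(data)
df.to_csv("reading_curves.csv", index=False)  # header row + 5 data rows
```

For the dictionary of PlotReadingCurves objects, the same idea applies: build one column per member (obj.timepoints, obj.fitted, obj.measured) for each object, then concatenate the resulting frames side by side.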
