How to parse data from .TX0 file into dataframe - python
Hi, I'm trying to convert a .TX0 file exported from a chromatogram into a DataFrame. The file is just a bunch of results, including retention times, etc. Eventually I want to pick certain pieces of data from multiple files and do some analysis. So far I have:
    filename = 'filepath'
    with open(filename, 'r') as f:
        lines = f.readlines()
    print(lines)
My output is:
Out[29]:
[....................................................
'"condensate analysis (HP4890 Optic - FID)"\n',
'"Peak","Component","Time","Area","Height","BL"\n',
'"#","Name","[min]","[uV*sec]","[uV]",""\n',
'------,------,------,------,------,------\n',
'1,"Methane",0.689,5187666.22,994337.57,*BB\n',
'2,"Ethane",1.061,1453339.93,729285.09,*BB\n',
'3,"Propane",1.715,193334.09,63398.74,*BB\n',
'4,"i-Butane",2.792,157630.92,29233.56,*BV\n',
'5,"n-Butane",3.240,98943.96,15822.72,*VB\n',
'"","","",------,------,""\n',
'"","","",7090915.11,1.83e+06,""\n',
'"Missing Component Report"\n',
'"Component","Expected Retention (Calibration File)"\n',
'------,------\n',
'"All components were found"\n',
'"Report stored in ASCII file :","...
"\n'.......................]
Now, the problem I'm having: I can't get this output into a structured DataFrame using pandas. I've tried, and it just dumps everything into a single column:

    pd.DataFrame(lines)
out:
Out[26]:
0
0 "=============================================...
1 "Software Version:",6.3.2.0646,"Date:","08/06/...
2 "Reprocess Number:","vma2: ......................
.......................
10 ""\n
11 ""\n
12 "condensate analysis (HP4890 Optic - FID)"\n
13 "Peak","Component","Time","Area","Height","BL"\n
14 "#","Name","[min]","[uV*sec]","[uV]",""\n
15 ------,------,------,------,------,------\n
16 1,"Methane",0.689,5187666.22,994337.57,*BB\n
17 2,"Ethane",1.061,1453339.93,729285.09,*BB\n
18 3,"Propane",1.715,193334.09,63398.74,*BB\n
19 4,"i-Butane",2.792,157630.92,29233.56,*BV\n
20 5,"n-Butane",3.240,98943.96,15822.72,*VB\n
21 "","","",------,------,""\n
22 "","","",7090915.11,1.83e+06,""\n
23 "Missing Component Report"\n
24 "Component","Expected Retention (Calibration F...
25 ------,------\n
26 "All components were found"\n
27 "Report stored in ASCII file :","C:\Shared Fol...
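One way to get such a report into a structured DataFrame (a sketch of my own, using a hypothetical excerpt modeled on the output above) is to keep only the peak rows, i.e. the lines whose first comma-separated field is an integer peak number, and hand just those to pandas.read_csv:

```python
import io
import pandas as pd

# Hypothetical excerpt of the peak table embedded in a .TX0 report,
# mimicking the lines shown above
raw = '''"condensate analysis (HP4890 Optic - FID)"
"Peak","Component","Time","Area","Height","BL"
"#","Name","[min]","[uV*sec]","[uV]",""
------,------,------,------,------,------
1,"Methane",0.689,5187666.22,994337.57,*BB
2,"Ethane",1.061,1453339.93,729285.09,*BB
"","","",------,------,""
'''

# Keep only the peak rows: lines whose first field is an integer
data_lines = [ln for ln in raw.splitlines(keepends=True)
              if ln.split(',')[0].strip().isdigit()]

df = pd.read_csv(io.StringIO(''.join(data_lines)),
                 names=['Peak', 'Component', 'Time', 'Area', 'Height', 'BL'])
print(df[['Component', 'Time', 'Area']])
```

The same filter applied to the real file's `lines` list would collect every peak row, regardless of how many header and footer lines surround the table.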
Skip all rows containing strings and keep only rows with floats
I have a log file from a mathematical simulation. I tried to parse it in Python, but I am not quite satisfied with the result. Is there any "elegant" way to loop over each line and keep only the lines with physical values, ditching the rest? The goal is to perform various analyses using numpy. Knowing that the lines I need contain only numerical values, is there a way to "tell" Python to keep only the rows/lines with numerical values and ditch all the rows containing strings? Thank you for your help. A sample of the log file is attached:

    5 Host 1 -- hnode146 -- Ranks 20-39
    6 Host 2 -- hnode147 -- Ranks 40-59
    7 Host 3 -- hnode148 -- Ranks 60-79
    8 Process rank 0 hnode145 36210
    9 Total number of processes : 80
    10
    11 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8)
    12 License build date: 10 February 2015
    13 This version of the code requires license version 2017.02 or greater.
    14 Checking license file:
    15 Checking license file:
    16 Unable to list features for license file
    17 1 copy of ccmppower checked out from
    18 Feature ccmppower expires in
    19 Thu Apr 19 17:22:54 2018
    20
    21 Server::start -host h
    22 Loading object database:
    23 Loading module: StarMeshing
    24 Loading module: MeshingSurfaceRepair
    25 Loading module: CadModeler
    26 Started Parasolid modeler version 29.01.131
    27 Loading module: StarResurfacer
    28 Loading module: StarTrimmer
    29 Loading module: SegregatedFlowModel
    30 Loading module: KwTurbModel
    31 Loading module: StarDualMesher
    32 Loading module: StarBodyFittedMesher
    33 Simulation database saved by:
    34 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Serial
    35 Loading into:
    36 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Np=80
    37 Object database load completed.
    39 A Zeit und Datum : 2018.04.19 at 17:23:11
    40
    41 Startzeit: 1524151391534
    42
    43 Loading/configuring connectivity (old|new partitions: 1|80)
    44 Domain (index 1): 1889922 cells, 5614862 faces, 1990686 verts.
    45 Configuring finished
    46 Reading material property database "/sw/apps/cd-adapco/12.02.011-R8/STAR-CCM+12.02.011-R8/star/props.mdb"...
    47 Re-partitioning
    48 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
    49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
    50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
    51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00
    52 2004 3.684017e-02 1.322581e-01 1.350187e-01 8.827220e-02 9.023783e-04 1.039251e+01 -3.914011e+00 -1.213340e+00 -2.700671e+00
    53 2005 3.224797e-02 1.093365e-01 1.059148e-01 7.461911e-02 6.307195e-04 6.650742e+00 -3.745949e+00 -1.217353e+00 -2.528596e+00
    54 2006 2.788050e-02 9.180507e-02 8.311817e-02 6.417279e-02 4.603072e-04 4.256107e+00 -3.658613e+00 -1.224046e+00 -2.434567e+00
    55 2007 2.332397e-02 7.688239e-02 6.222694e-02 4.860232e-02 3.534658e-04 2.723686e+00 -3.608431e+00 -1.231574e+00 -2.376857e+00
    56 2008 1.916130e-02 6.201947e-02 4.645780e-02 3.654489e-02 2.833177e-04 1.743055e+00 -3.575486e+00 -1.237352e+00 -2.338134e+00
    57 2009 1.600865e-02 4.780234e-02 3.909247e-02 2.959689e-02 2.370245e-04 1.115506e+00 -3.548365e+00 -1.240938e+00 -2.307427e+00
    58 2010 1.389765e-02 3.570659e-02 3.492423e-02 2.537285e-02 2.055279e-04 7.138997e-01 -3.527530e+00 -1.242749e+00 -2.284781e+00
    59 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
    60 2011 1.253570e-02 2.591702e-02 3.089287e-02 2.209728e-02 1.814997e-04 4.568718e-01 -3.511034e+00 -1.242906e+00 -2.268128e+00
    61 2012 1.141436e-02 1.992464e-02 2.745902e-02 1.922942e-02 1.636478e-04 2.923702e-01 -3.498876e+00 -1.243006e+00 -2.255870e+00
    62 2013 1.024511e-02 1.621655e-02 2.544053e-02 1.687660e-02 1.492828e-04 1.870937e-01 -3.489288e+00 -1.242425e+00 -2.246863e+00
    63 2014 9.067693e-03 1.359007e-02 2.320886e-02 1.481687e-02 1.371763e-04 1.197299e-01 -3.482323e+00 -1.242027e+00 -2.240295e+00
    64 2015 7.906450e-03 1.159567e-02 2.073906e-02 1.306014e-02 1.265825e-04 7.662597e-02 -3.479134e+00 -1.243537e+00 -2.235597e+00
    65 2016 6.889290e-03 1.010569e-02 1.787383e-02 1.258395e-02 1.171344e-04 4.903984e-02 -3.479042e+00 -1.246677e+00 -2.232364e+00
    66 2017 5.982303e-03 8.872579e-03 1.576665e-02 1.141871e-02 1.086443e-04 3.138620e-02 -3.480301e+00 -1.249988e+00 -2.230313e+00
    67 2018 5.191895e-03 7.958489e-03 1.446382e-02 9.796685e-03 1.009937e-04 2.009149e-02 -3.482459e+00 -1.253255e+00 -2.229204e+00
    68 2019 4.614927e-03 7.193031e-03 1.279295e-02 8.818100e-03 9.411761e-05 1.286594e-02 -3.484886e+00 -1.256002e+00 -2.228885e+00
    69 2020 4.159939e-03 6.571088e-03 1.146195e-02 7.756150e-03 8.794392e-05 8.241197e-03 -3.487597e+00 -1.258382e+00 -2.229214e+00
    70 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
    71 2021 3.779168e-03 5.961164e-03 1.034847e-02 6.969454e-03 8.240903e-05 5.278791e-03 -3.490138e+00 -1.260061e+00 -2.230078e+00
    72 2022 3.414811e-03 5.350398e-03 9.329119e-03 6.398522e-03 7.743586e-05 3.381806e-03 -3.491624e+00 -1.260241e+00 -2.231384e+00
Read each line, split it on whitespace, and attempt to convert each token to a float. If the conversion fails, the line isn't kept. There's certainly a way to do this with a regex, but this should work off the top of my head.

    lines_to_keep = []
    for line in f.readlines():
        try:
            # Throws ValueError if `x` can't be converted to float
            [float(x) for x in line.split()]
            # If the above line didn't throw a ValueError, keep it
            lines_to_keep.append(line)
        except ValueError:
            continue
You can use a regex to find the rows you want and keep them in a list:

    import re

    list_to_keep = []
    pattern = re.compile(r'[0-9 ]+[e.\-+][0-9]*', re.IGNORECASE)
    with open(f, 'r') as logfile:
        for row in logfile:
            if pattern.match(row):
                list_to_keep.append(row)
If you'd like regex: this matches continuous digits separated by numeric symbols like '+-.e'.

    import re

    r = re.compile(r'([0-9 ]+[e.\-+]*)+\n')
    lines = [line for line in open('a.log') if r.fullmatch(line)]

    # all the useful lines are ...
    # 49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
    # 50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
    # 51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00
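Whichever filter is used, the surviving lines can go straight into a numpy array for the analyses mentioned in the question. A minimal sketch, assuming the "every token converts to float" approach and a hypothetical two-row excerpt of the residual table:

```python
import io
import numpy as np

# Hypothetical excerpt mixing a text line with numeric residual rows
log = """47 Re-partitioning
49 2001 1.076589e-01 9.570364e-01
50 2002 5.987195e-02 4.004615e-01
"""

numeric = []
for line in io.StringIO(log):
    try:
        numeric.append([float(x) for x in line.split()])
    except ValueError:
        continue  # a token was not a number, so drop the whole line

data = np.array(numeric)  # two kept rows of four values each
```

From here, column slices such as `data[:, 2]` give the residual histories directly.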
How to add column numbers to each column in a large text file
I would like to add column numbers to the 128 columns in a text file. E.g. my file:

    12 13 14 15
    20 21 23 14
    34 56 67 89

Required output:

    1:12 2:13 3:14 4:15
    1:20 2:21 3:23 4:14
    1:34 2:56 3:67 4:89

Can this be done using awk/python? I tried the paste command for joining two files: one with the values, the other with the column numbers, manually typed. Since the file is very large, manual typing didn't work. As far as I know, I could only find answers for adding a single column to the end of a text file. Thanks for the suggestions.
awk to the rescue!

    $ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file

should do.
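If a Python version is preferred, the same transformation can be sketched in a few lines (my own example, not part of the awk answer):

```python
# Stand-in for the file's rows; in practice these would be read from the file
rows = ["12 13 14 15", "20 21 23 14", "34 56 67 89"]

# Prefix each whitespace-separated field with its 1-based column number
numbered = [" ".join(f"{i}:{v}" for i, v in enumerate(row.split(), start=1))
            for row in rows]
print(numbered[0])  # 1:12 2:13 3:14 4:15
```

Since the numbering is computed per field, it works unchanged for 4 columns or 128.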
How to split one column into two columns in python?
I have a contig file loaded in pandas like this:

    >NODE_1_length_4014_cov_1.97676
    1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
    2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
    3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
    4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
    5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
    6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
    7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
    8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
    9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
    10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
    11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
    12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
    13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
    14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
    15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
    16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
    17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
    18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
    19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
    20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
    21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
    22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
    23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
    24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
    25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
    26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
    27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
    28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
    29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
    ...
    8540 >NODE_2518_length_56_cov_219
    8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
    8542 >NODE_2519_length_56_cov_174
    8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
    8544 >NODE_2520_length_56_cov_131
    8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
    8546 >NODE_2521_length_56_cov_118
    8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
    8548 >NODE_2522_length_56_cov_96
    8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
    8550 >NODE_2523_length_56_cov_74
    8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
    8552 >NODE_2524_length_56_cov_70
    8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
    8554 >NODE_2525_length_56_cov_59
    8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
    8556 >NODE_2526_length_56_cov_48
    8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
    8558 >NODE_2527_length_56_cov_44
    8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
    8560 >NODE_2528_length_56_cov_42
    8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
    8562 >NODE_2529_length_56_cov_38
    8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
    8564 >NODE_2530_length_56_cov_29
    8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
    8566 >NODE_2531_length_56_cov_26
    8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
    8568 >NODE_2532_length_56_cov_25
    8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...

How can I split this one column into two columns, with the >NODE_...... identifiers in one column and the corresponding sequence in the other? Another issue: the sequences span multiple lines; how can I join each one into a single string? The expected result looks like this:

    contig                          sequence
    NODE_1_length_4014_cov_1.97676  AAAAAAAAAAAAAAA
    NODE_........                   TTTTTTTTTTTTTTT

Thank you very much.
I can't reproduce your example, but my guess is that you are loading a file with pandas that is not in a tabular format. From your example it looks like your file is formatted:

    >Identifier
    sequence
    >Identifier
    sequence

You would have to parse the file before you can put the information into a pandas DataFrame. The logic would be to loop through each line of your file: if the line starts with '>', store it as an identifier; if not, concatenate it onto the current sequence value. Something like this:

    import pandas as pd

    testfile = ('>NODE_1_length_4014_cov_1.97676\n'
                'AAAAAAAATTTTTTCCCCCCCGGGGGG\n'
                '>NODE_2518_length_56_cov_219\n'
                'AAAAAAAAGCCCTTTTT').split('\n')

    identifiers = []
    sequences = []
    current_sequence = ''
    for line in testfile:
        if line.startswith('>'):
            if identifiers:
                # flush the sequence collected for the previous identifier
                sequences.append(current_sequence)
            identifiers.append(line)
            current_sequence = ''
        else:
            current_sequence += line.strip('\n')
    # flush the sequence belonging to the last identifier
    sequences.append(current_sequence)

    df = pd.DataFrame({'identifiers': identifiers, 'sequences': sequences})

Whether this code works depends on the details of your input, which you didn't provide, but it might get you started.
How to combine header files with data files with python?
I have separate files. One set contains only header info, like the examples below:

    ~content of "header1.txt"~
    a 3
    b 2
    c 4

    ~content of "header2.txt"~
    a 4
    b 3
    c 5

    ~content of "header3.txt"~
    a 1
    b 7
    c 6

The other set contains only data, as shown below:

    ~content of "data1.txt"~
    10 20 30 40
    20 14 22 33

    ~content of "data2.txt"~
    11 21 31 41
    21 24 12 23

    ~content of "data3.txt"~
    21 22 11 31
    10 26 14 33

After combining the corresponding files, the results should look like the examples below:

    ~content of "asc1.txt"~
    a 3
    b 2
    c 4
    10 20 30 40
    20 14 22 33

    ~content of "asc2.txt"~
    a 4
    b 3
    c 5
    11 21 31 41
    21 24 12 23

    ~content of "asc3.txt"~
    a 1
    b 7
    c 6
    21 22 11 31
    10 26 14 33

Can anyone give me some help writing this in Python? Thanks!
If you really want it in Python, here is a way to do it:

    for i in range(1, 4):
        with open('header{0}.txt'.format(i)) as h, \
             open('data{0}.txt'.format(i)) as d, \
             open('asc{0}.txt'.format(i), 'w') as a:
            a.writelines(h.readlines() + d.readlines())

Of course, this assumes that there are 3 files of each kind and that all follow the same naming convention you used.
Try this (written in Python 3.4 IDLE). It is pretty long, but should be easy to understand:

    # start by creating a function to read the contents of
    # each file and return the contents as a string
    def readFile(file):
        contentsStr = ''
        for line in file:
            contentsStr += line
        return contentsStr

    # Open all the header files header1, header2, header3
    header1 = open('header1.txt', 'r')
    header2 = open('header2.txt', 'r')
    header3 = open('header3.txt', 'r')

    # Open all the data files data1, data2, data3
    data1 = open('data1.txt', 'r')
    data2 = open('data2.txt', 'r')
    data3 = open('data3.txt', 'r')

    # Open/create the output files asc1, asc2, asc3
    asc1_outFile = open('asc1.txt', 'w')
    asc2_outFile = open('asc2.txt', 'w')
    asc3_outFile = open('asc3.txt', 'w')

    # read the contents of each header file and data file into string variables
    header1_contents = readFile(header1)
    header2_contents = readFile(header2)
    header3_contents = readFile(header3)
    data1_contents = readFile(data1)
    data2_contents = readFile(data2)
    data3_contents = readFile(data3)

    # Append the contents of each data file to its
    # corresponding header file's contents
    asc1_contents = header1_contents + '\n' + data1_contents
    asc2_contents = header2_contents + '\n' + data2_contents
    asc3_contents = header3_contents + '\n' + data3_contents

    # now write the results to the asc1.txt, asc2.txt, and
    # asc3.txt output files respectively
    asc1_outFile.write(asc1_contents)
    asc2_outFile.write(asc2_contents)
    asc3_outFile.write(asc3_contents)

    # close the file streams
    header1.close()
    header2.close()
    header3.close()
    data1.close()
    data2.close()
    data3.close()
    asc1_outFile.close()
    asc2_outFile.close()
    asc3_outFile.close()

    # done!

By the way, ensure that the header files and data files are in the same directory as the Python script. Otherwise, you can simply edit the file paths accordingly in the code above. The output files asc1.txt, asc2.txt, and asc3.txt will be created in the same directory as your Python source file.
This works if the number of header files is equal to the number of data files:

    # glob is imported to get file names matching the given pattern
    import glob

    header = []
    data = []

    # Traverse the files and collect their contents; sorting keeps
    # header1 paired with data1, header2 with data2, and so on
    for file1 in sorted(glob.glob("directory/header*.txt")):
        with open(file1, "r") as fh:
            header.append(fh.read())
    for file2 in sorted(glob.glob("directory/data*.txt")):
        with open(file2, "r") as fh:
            data.append(fh.read())

    # Write the content into the output files
    for i in range(1, len(data) + 1):
        with open("directory/asc" + str(i) + ".txt", "w") as writer:
            writer.write(header[i - 1] + "\n\n" + data[i - 1])

Edit

This method will only work if the header and data files are in different folders, and there are no files other than the header or data files in those folders:

    # glob is imported to get file names matching the given pattern
    import glob

    header = []
    data = []

    # Traverse the files and collect their contents
    for file1 in sorted(glob.glob("directory1/*.txt")):
        with open(file1, "r") as fh:
            header.append(fh.read())
    for file2 in sorted(glob.glob("directory2/*.txt")):
        with open(file2, "r") as fh:
            data.append(fh.read())

    # Write the content into the output files
    for i in range(1, len(data) + 1):
        with open("directory/asc" + str(i) + ".txt", "w") as writer:
            writer.write(header[i - 1] + "\n\n" + data[i - 1])
How to write dictionary values to a csv file using Python
I have a dictionary of class objects. I want to write the member values (timepoints, fitted, measured) of the class to a csv file using Python. My class:

    class PlotReadingCurves:
        def __init__(self, timepoints, fitted, measured):
            self.timepoints = timepoints
            self.fitted = fitted
            self.measured = measured

    obj = PlotReadingCurves(mTimePoints, mFitted, mMeasured)
    PlotReadingCurvesList[csoId] = obj

E.g.:

    timepoints: 1 2 3 4 5
    fitted:     6 7 8 9 10
    measured:   11 12 13 14 15

Expected results:

    timepoints  fitted  measured  fitted  measured
    1           6       11        ..      ..
    2           7       12
    3           8       13
    4           9       14
    5           10      15
Try my mini wrapper library pyexcel. Although it is not as powerful as pandas, it is sufficient to write a dict to an excel file in a few lines of code:

    >>> import pyexcel as pe
    >>> your_dict = {"timepoints": [1, 2, 3], "fitted": [6, 7, 8]}  # more columns omitted
    >>> sheet = pe.Sheet(pe.utils.dict_to_array(your_dict))
    >>> sheet.save_as("your_file_name.csv")  # done

With pyexcel, you can easily write your data into other excel formats: xls, xlsx and even ods. The documentation can be found here.
Try pandas; here is the pandas feature description that covers your problem:

"Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format."

It's very convenient and powerful.
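A minimal sketch of that pandas route, using a stand-in object with the values from the question (the exact column names, such as `fitted_cso1`, are my assumption based on the expected layout):

```python
import pandas as pd

class PlotReadingCurves:
    def __init__(self, timepoints, fitted, measured):
        self.timepoints = timepoints
        self.fitted = fitted
        self.measured = measured

# Hypothetical dictionary of objects, keyed by csoId as in the question
curves = {'cso1': PlotReadingCurves([1, 2, 3, 4, 5],
                                    [6, 7, 8, 9, 10],
                                    [11, 12, 13, 14, 15])}

# One fitted/measured column pair per object, after a shared timepoints column
columns = {'timepoints': next(iter(curves.values())).timepoints}
for cso_id, obj in curves.items():
    columns['fitted_' + cso_id] = obj.fitted
    columns['measured_' + cso_id] = obj.measured

df = pd.DataFrame(columns)
csv_text = df.to_csv(index=False)
print(csv_text.splitlines()[0])  # timepoints,fitted_cso1,measured_cso1
```

Passing a file path to `to_csv` instead of capturing the string writes the csv file directly.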