Skip all rows containing strings and keep only rows with floats - python

I have a log file from a mathematical simulation. I tried to parse it in Python, but I am not quite satisfied with the result. Is there any "elegant" way to loop each line and sort it in order to keep only lines with physical values and ditch the rest?
The goal is to perform various analyses using numpy. Knowing that the lines I need only contain numerical values, is there a way to "tell" python to keep only rows / lines with numerical values and ditch all the rows containing string? Thank your for your help. A sample of the log file is attached.
5 Host 1 -- hnode146 -- Ranks 20-39
6 Host 2 -- hnode147 -- Ranks 40-59
7 Host 3 -- hnode148 -- Ranks 60-79
8 Process rank 0 hnode145 36210
9 Total number of processes : 80
10
11 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8)
12 License build date: 10 February 2015
13 This version of the code requires license version 2017.02 or greater.
14 Checking license file:
15 Checking license file:
16 Unable to list features for license file
17 1 copy of ccmppower checked out from
18 Feature ccmppower expires in
19 Thu Apr 19 17:22:54 2018
20
21 Server::start -host h
22 Loading object database:
23 Loading module: StarMeshing
24 Loading module: MeshingSurfaceRepair
25 Loading module: CadModeler
26 Started Parasolid modeler version 29.01.131
27 Loading module: StarResurfacer
28 Loading module: StarTrimmer
29 Loading module: SegregatedFlowModel
30 Loading module: KwTurbModel
31 Loading module: StarDualMesher
32 Loading module: StarBodyFittedMesher
33 Simulation database saved by:
34 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Serial
35 Loading into:
36 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Np=80
37 Object database load completed.
39 A Zeit und Datum : 2018.04.19 at 17:23:11
40
41 Startzeit: 1524151391534
42
43 Loading/configuring connectivity (old|new partitions: 1|80)
44 Domain (index 1): 1889922 cells, 5614862 faces, 1990686 verts.
45 Configuring finished
46 Reading material property database "/sw/apps/cd-adapco/12.02.011-R8/STAR-CCM+12.02.011-R8/star/props.mdb"...
47 Re-partitioning
48 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00
52 2004 3.684017e-02 1.322581e-01 1.350187e-01 8.827220e-02 9.023783e-04 1.039251e+01 -3.914011e+00 -1.213340e+00 -2.700671e+00
53 2005 3.224797e-02 1.093365e-01 1.059148e-01 7.461911e-02 6.307195e-04 6.650742e+00 -3.745949e+00 -1.217353e+00 -2.528596e+00
54 2006 2.788050e-02 9.180507e-02 8.311817e-02 6.417279e-02 4.603072e-04 4.256107e+00 -3.658613e+00 -1.224046e+00 -2.434567e+00
55 2007 2.332397e-02 7.688239e-02 6.222694e-02 4.860232e-02 3.534658e-04 2.723686e+00 -3.608431e+00 -1.231574e+00 -2.376857e+00
56 2008 1.916130e-02 6.201947e-02 4.645780e-02 3.654489e-02 2.833177e-04 1.743055e+00 -3.575486e+00 -1.237352e+00 -2.338134e+00
57 2009 1.600865e-02 4.780234e-02 3.909247e-02 2.959689e-02 2.370245e-04 1.115506e+00 -3.548365e+00 -1.240938e+00 -2.307427e+00
58 2010 1.389765e-02 3.570659e-02 3.492423e-02 2.537285e-02 2.055279e-04 7.138997e-01 -3.527530e+00 -1.242749e+00 -2.284781e+00
59 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
60 2011 1.253570e-02 2.591702e-02 3.089287e-02 2.209728e-02 1.814997e-04 4.568718e-01 -3.511034e+00 -1.242906e+00 -2.268128e+00
61 2012 1.141436e-02 1.992464e-02 2.745902e-02 1.922942e-02 1.636478e-04 2.923702e-01 -3.498876e+00 -1.243006e+00 -2.255870e+00
62 2013 1.024511e-02 1.621655e-02 2.544053e-02 1.687660e-02 1.492828e-04 1.870937e-01 -3.489288e+00 -1.242425e+00 -2.246863e+00
63 2014 9.067693e-03 1.359007e-02 2.320886e-02 1.481687e-02 1.371763e-04 1.197299e-01 -3.482323e+00 -1.242027e+00 -2.240295e+00
64 2015 7.906450e-03 1.159567e-02 2.073906e-02 1.306014e-02 1.265825e-04 7.662597e-02 -3.479134e+00 -1.243537e+00 -2.235597e+00
65 2016 6.889290e-03 1.010569e-02 1.787383e-02 1.258395e-02 1.171344e-04 4.903984e-02 -3.479042e+00 -1.246677e+00 -2.232364e+00
66 2017 5.982303e-03 8.872579e-03 1.576665e-02 1.141871e-02 1.086443e-04 3.138620e-02 -3.480301e+00 -1.249988e+00 -2.230313e+00
67 2018 5.191895e-03 7.958489e-03 1.446382e-02 9.796685e-03 1.009937e-04 2.009149e-02 -3.482459e+00 -1.253255e+00 -2.229204e+00
68 2019 4.614927e-03 7.193031e-03 1.279295e-02 8.818100e-03 9.411761e-05 1.286594e-02 -3.484886e+00 -1.256002e+00 -2.228885e+00
69 2020 4.159939e-03 6.571088e-03 1.146195e-02 7.756150e-03 8.794392e-05 8.241197e-03 -3.487597e+00 -1.258382e+00 -2.229214e+00
70 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
71 2021 3.779168e-03 5.961164e-03 1.034847e-02 6.969454e-03 8.240903e-05 5.278791e-03 -3.490138e+00 -1.260061e+00 -2.230078e+00
72 2022 3.414811e-03 5.350398e-03 9.329119e-03 6.398522e-03 7.743586e-05 3.381806e-03 -3.491624e+00 -1.260241e+00 -2.231384e+00

Read each line. Split on whitespace, attempt to convert each entity to a float. If the conversion fails, the line isn't kept. There's certainly a way to do this with a regex, but this should work off the top of my head.
lines_to_keep = []
for line in f.readlines():
try:
# Throws ValueError if `x` can't be converted to float
[float(x) for x in line.split()]
# If the above line didn't throw a ValueError, keep it
lines_to_keep.append(line)
except ValueError:
continue

import re
list_to_keep=[]
pattern= re.compile(r'[0-9 ]+[e.\-+][0-9]*',re.IGNORECASE)
with open(f, 'rb') as csvfile:
reader = csv.reader(csvfile, delimiter='\n')
for row in reader:
if(pattern.match(str(row))):
list_to_keep.append(row)
Can use regex to find your row and keep it in list.

If you'd like regex. This matches continuous digits separated by numeric symbols like '+-.e'.
import re
r = re.compile(r'([0-9 ]+[e.\-+]*)+\n')
lines = [line for line in open('a.log') if r.fullmatch(line)]
# all the useful lines are ...
# 49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
# 50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
# 51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00

Related

Best way to read/analysis reoccurring data in a file

I want to know what is the best/easiest way to read and analysis reoccurring data from a file with python. E.g. file:
some text
some more text
x y z
23 23 45
45 67 89
23 64 67
some text
some more text
x y z
23 23 45
45 67 89
23 64 67
.
.
.
.
I want to read the xyz data and determine the change over the occurrences. I'm wondering if there is a better way then reading line by line and using an if statement to find and separate the numerical data. If the data was in different files I would loop over the file names and use np.loadtxt. For some reason I find it harder to deal with reoccurring data in a file then multiple files. Thanks for any help.

For loop store outcome in variable

I'm learning to webscrape data so I can use it to practice data visualization, and I am following a tutorial but I can't get the same results. The problem is that I have a for loop, but cant seem to store the data in a variable. When I run the for loop and try to store the results in a variable i will only get one result, but when i immediately print the for loop results i get all the data.
Can someone explain what I'm doing wrong?
for age in team_riders:
print(age.find('div', class_='age').text)
Results:
30
28
28
22
34
28
25
30
30
30
34
32
33
32
24
27
23
26
22
27
30
28
24
26
21
26
36
26
27
22
32
30
for age in team_riders:
age = age.find('div', class_='age').text
print(age)
prints:
30
Define an empty list before your loop and append() the single results from the loop to this list:
lst = []
for age in team_riders:
lst.append(age.find('div', class_='age').text)
You can also use this oneliner:
lst= [age.find('div', class_='age').text for age in team_riders]
Because you are storing de data as a variable, in this case you are saving the last one (and for every cycle you overwrite the last before), if you want to store every records you need to use a datastructure: List for example, and using append()
to add the record to the list.
So your code will be:
ages = []
for age in team_riders: ages.append(age.find('div', class_='age').text)

How to add column numbers to each column in a large text file

I would like to add column numbers to 128 columns in a text file
E.g.
My file
12 13 14 15
20 21 23 14
34 56 67 89
Required output
1:12 2:13 3:14 4:15
1: 20 2:21 3: 23 4:14
1: 34 2:56 3:67 4:89
Can this be done using awk / python
I tried paste command for joining two files : one with the values other file with column numbers, manually typed. Since the file size is very large manual typing didnt work.
As of my knowledge I could find answers for adding only one column to the end of a text file.
Thanks for the suggestions
awk to the rescue!
$ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file
should do.

How to split one column into two columns in python?

I have a contig file loaded in pandas like this:
>NODE_1_length_4014_cov_1.97676
1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
...
8540 >NODE_2518_length_56_cov_219
8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542 >NODE_2519_length_56_cov_174
8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544 >NODE_2520_length_56_cov_131
8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546 >NODE_2521_length_56_cov_118
8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548 >NODE_2522_length_56_cov_96
8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550 >NODE_2523_length_56_cov_74
8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552 >NODE_2524_length_56_cov_70
8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554 >NODE_2525_length_56_cov_59
8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556 >NODE_2526_length_56_cov_48
8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558 >NODE_2527_length_56_cov_44
8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560 >NODE_2528_length_56_cov_42
8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562 >NODE_2529_length_56_cov_38
8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564 >NODE_2530_length_56_cov_29
8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566 >NODE_2531_length_56_cov_26
8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568 >NODE_2532_length_56_cov_25
8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...
How to split this one column into two columns, making >NODE_...... in one column and the corresponding sequence in another column? Another issue is the sequences are in multiple lines, how to make them into one string? The result is expected like this:
contig sequence
NODE_1_length_4014_cov_1.97676 AAAAAAAAAAAAAAA
NODE_........ TTTTTTTTTTTTTTT
Thank you very much.
I can't reproduce your example, but my guess is that you are loading file with pandas that is not formatted in a tabular format. From your example it looks like your file is formatted:
>Identifier
sequence
>Identifier
sequence
You would have to parse the file before you can put the information into a pandas dataframe. The logic would be to loop through each line of your file, if the line starts with '>Node' you store the line as an identifier. If not you concatenate them to the sequence value. Something like this:
testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')
identifiers = []
sequences = []
current_sequence = ''
for line in testfile:
if line.startswith('>'):
identifiers.append(line)
sequences.append(current_sequence)
current_sequence = ''
else:
current_sequence += line.strip('\n')
df = pd.DataFrame({'identifiers' = identifiers,
'sequences' = sequences})
Whether this code works depends on the details of your input which you didn't provide, but that might get you started.

How to parse data from .TX0 file into dataframe

Hi I'm trying to basically convert a .TX0 file from a chromatogram file. the file is just a bunch of results including retention times etc... I want to eventually pick certain pieces of data from multiple files and do some analysis. So far I have:
filename = 'filepath'
f = open(filename, 'r')
lines = f.readlines()
print lines
my output is:
Out[29]:
[....................................................
'"condensate analysis (HP4890 Optic - FID)"\n',
'"Peak","Component","Time","Area","Height","BL"\n',
'"#","Name","[min]","[uV*sec]","[uV]",""\n',
'------,------,------,------,------,------\n',
'1,"Methane",0.689,5187666.22,994337.57,*BB\n',
'2,"Ethane",1.061,1453339.93,729285.09,*BB\n',
'3,"Propane",1.715,193334.09,63398.74,*BB\n',
'4,"i-Butane",2.792,157630.92,29233.56,*BV\n',
'5,"n-Butane",3.240,98943.96,15822.72,*VB\n',
'"","","",------,------,""\n',
'"","","",7090915.11,1.83e+06,""\n',
'"Missing Component Report"\n',
'"Component","Expected Retention (Calibration File)"\n',
'------,------\n',
'"All components were found"\n',
'"Report stored in ASCII file :","...
"\n'.......................]
Now, the problem i'm having. I can't get this output into a structured dataframe using pandas... =/ I've tried and it just gives me two columns...
pd.DataFrame(filename)
out:
Out[26]:
0
0 "=============================================...
1 "Software Version:",6.3.2.0646,"Date:","08/06/...
2 "Reprocess Number:","vma2: ......................
.......................
10 ""\n
11 ""\n
12 "condensate analysis (HP4890 Optic - FID)"\n
13 "Peak","Component","Time","Area","Height","BL"\n
14 "#","Name","[min]","[uV*sec]","[uV]",""\n
15 ------,------,------,------,------,------\n
16 1,"Methane",0.689,5187666.22,994337.57,*BB\n
17 2,"Ethane",1.061,1453339.93,729285.09,*BB\n
18 3,"Propane",1.715,193334.09,63398.74,*BB\n
19 4,"i-Butane",2.792,157630.92,29233.56,*BV\n
20 5,"n-Butane",3.240,98943.96,15822.72,*VB\n
21 "","","",------,------,""\n
22 "","","",7090915.11,1.83e+06,""\n
23 "Missing Component Report"\n
24 "Component","Expected Retention (Calibration F...
25 ------,------\n
26 "All components were found"\n
27 "Report stored in ASCII file :","C:\Shared Fol...

Categories

Resources