Best way to read/analyze recurring data in a file - python

I want to know the best/easiest way to read and analyze recurring data from a file with Python. E.g. file:
some text
some more text
x y z
23 23 45
45 67 89
23 64 67
some text
some more text
x y z
23 23 45
45 67 89
23 64 67
.
.
.
.
I want to read the xyz data and determine the change over the occurrences. I'm wondering if there is a better way than reading line by line and using an if statement to find and separate the numerical data. If the data were in different files I would loop over the file names and use np.loadtxt. For some reason I find it harder to deal with recurring data in one file than with multiple files. Thanks for any help.
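Since the structure is regular, one way is to scan line by line, use the "x y z" header to delimit blocks, and stack each block with numpy afterwards. A minimal sketch, assuming the file is named 'data.txt' and each block starts with exactly that header (both are placeholders based on the example above):
import numpy as np

blocks = []      # one list of rows per occurrence of the x y z table
current = None
with open('data.txt') as f:           # 'data.txt' is a placeholder name
    for line in f:
        parts = line.split()
        if parts == ['x', 'y', 'z']:  # a header starts a new block
            current = []
            blocks.append(current)
        elif current is not None and parts:
            try:
                current.append([float(p) for p in parts])
            except ValueError:        # a non-numeric line ends the block
                current = None

# Stack each block into an array and look at the change between occurrences
arrays = [np.array(b) for b in blocks]
deltas = [b - a for a, b in zip(arrays, arrays[1:])]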

Related

For loop store outcome in variable

I'm learning to web-scrape data so I can use it to practice data visualization, and I am following a tutorial but I can't get the same results. The problem is that I have a for loop, but I can't seem to store the data in a variable. When I run the for loop and try to store the results in a variable I only get one result, but when I print directly inside the for loop I get all the data.
Can someone explain what I'm doing wrong?
for age in team_riders:
    print(age.find('div', class_='age').text)
Results:
30
28
28
22
34
28
25
30
30
30
34
32
33
32
24
27
23
26
22
27
30
28
24
26
21
26
36
26
27
22
32
30
for age in team_riders:
    age = age.find('div', class_='age').text
print(age)
prints:
30
Define an empty list before your loop and append() the single results from the loop to this list:
lst = []
for age in team_riders:
    lst.append(age.find('div', class_='age').text)
You can also use this one-liner:
lst = [age.find('div', class_='age').text for age in team_riders]
Because you are storing the data in a single variable, you only keep the last value: every iteration overwrites the one before. If you want to store every record you need a data structure, for example a list, and append()
each record to it.
So your code will be:
ages = []
for age in team_riders:
    ages.append(age.find('div', class_='age').text)

Skip all rows containing strings and keep only rows with floats

I have a log file from a mathematical simulation. I tried to parse it in Python, but I am not quite satisfied with the result. Is there any "elegant" way to loop over each line and filter it, keeping only the lines with physical values and ditching the rest?
The goal is to perform various analyses using numpy. Knowing that the lines I need contain only numerical values, is there a way to "tell" Python to keep only rows/lines with numerical values and ditch all the rows containing strings? Thank you for your help. A sample of the log file is attached.
5 Host 1 -- hnode146 -- Ranks 20-39
6 Host 2 -- hnode147 -- Ranks 40-59
7 Host 3 -- hnode148 -- Ranks 60-79
8 Process rank 0 hnode145 36210
9 Total number of processes : 80
10
11 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8)
12 License build date: 10 February 2015
13 This version of the code requires license version 2017.02 or greater.
14 Checking license file:
15 Checking license file:
16 Unable to list features for license file
17 1 copy of ccmppower checked out from
18 Feature ccmppower expires in
19 Thu Apr 19 17:22:54 2018
20
21 Server::start -host h
22 Loading object database:
23 Loading module: StarMeshing
24 Loading module: MeshingSurfaceRepair
25 Loading module: CadModeler
26 Started Parasolid modeler version 29.01.131
27 Loading module: StarResurfacer
28 Loading module: StarTrimmer
29 Loading module: SegregatedFlowModel
30 Loading module: KwTurbModel
31 Loading module: StarDualMesher
32 Loading module: StarBodyFittedMesher
33 Simulation database saved by:
34 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Serial
35 Loading into:
36 STAR-CCM+ 12.02.011 (linux-x86_64-2.5/gnu4.8-r8) Fri Mar 10 20:03:37 UTC 2017 Np=80
37 Object database load completed.
39 A Zeit und Datum : 2018.04.19 at 17:23:11
40
41 Startzeit: 1524151391534
42
43 Loading/configuring connectivity (old|new partitions: 1|80)
44 Domain (index 1): 1889922 cells, 5614862 faces, 1990686 verts.
45 Configuring finished
46 Reading material property database "/sw/apps/cd-adapco/12.02.011-R8/STAR-CCM+12.02.011-R8/star/props.mdb"...
47 Re-partitioning
48 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00
52 2004 3.684017e-02 1.322581e-01 1.350187e-01 8.827220e-02 9.023783e-04 1.039251e+01 -3.914011e+00 -1.213340e+00 -2.700671e+00
53 2005 3.224797e-02 1.093365e-01 1.059148e-01 7.461911e-02 6.307195e-04 6.650742e+00 -3.745949e+00 -1.217353e+00 -2.528596e+00
54 2006 2.788050e-02 9.180507e-02 8.311817e-02 6.417279e-02 4.603072e-04 4.256107e+00 -3.658613e+00 -1.224046e+00 -2.434567e+00
55 2007 2.332397e-02 7.688239e-02 6.222694e-02 4.860232e-02 3.534658e-04 2.723686e+00 -3.608431e+00 -1.231574e+00 -2.376857e+00
56 2008 1.916130e-02 6.201947e-02 4.645780e-02 3.654489e-02 2.833177e-04 1.743055e+00 -3.575486e+00 -1.237352e+00 -2.338134e+00
57 2009 1.600865e-02 4.780234e-02 3.909247e-02 2.959689e-02 2.370245e-04 1.115506e+00 -3.548365e+00 -1.240938e+00 -2.307427e+00
58 2010 1.389765e-02 3.570659e-02 3.492423e-02 2.537285e-02 2.055279e-04 7.138997e-01 -3.527530e+00 -1.242749e+00 -2.284781e+00
59 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
60 2011 1.253570e-02 2.591702e-02 3.089287e-02 2.209728e-02 1.814997e-04 4.568718e-01 -3.511034e+00 -1.242906e+00 -2.268128e+00
61 2012 1.141436e-02 1.992464e-02 2.745902e-02 1.922942e-02 1.636478e-04 2.923702e-01 -3.498876e+00 -1.243006e+00 -2.255870e+00
62 2013 1.024511e-02 1.621655e-02 2.544053e-02 1.687660e-02 1.492828e-04 1.870937e-01 -3.489288e+00 -1.242425e+00 -2.246863e+00
63 2014 9.067693e-03 1.359007e-02 2.320886e-02 1.481687e-02 1.371763e-04 1.197299e-01 -3.482323e+00 -1.242027e+00 -2.240295e+00
64 2015 7.906450e-03 1.159567e-02 2.073906e-02 1.306014e-02 1.265825e-04 7.662597e-02 -3.479134e+00 -1.243537e+00 -2.235597e+00
65 2016 6.889290e-03 1.010569e-02 1.787383e-02 1.258395e-02 1.171344e-04 4.903984e-02 -3.479042e+00 -1.246677e+00 -2.232364e+00
66 2017 5.982303e-03 8.872579e-03 1.576665e-02 1.141871e-02 1.086443e-04 3.138620e-02 -3.480301e+00 -1.249988e+00 -2.230313e+00
67 2018 5.191895e-03 7.958489e-03 1.446382e-02 9.796685e-03 1.009937e-04 2.009149e-02 -3.482459e+00 -1.253255e+00 -2.229204e+00
68 2019 4.614927e-03 7.193031e-03 1.279295e-02 8.818100e-03 9.411761e-05 1.286594e-02 -3.484886e+00 -1.256002e+00 -2.228885e+00
69 2020 4.159939e-03 6.571088e-03 1.146195e-02 7.756150e-03 8.794392e-05 8.241197e-03 -3.487597e+00 -1.258382e+00 -2.229214e+00
70 Iteration Continuity X-momentum Y-momentum Z-momentum Tke Sdr Shear+Pressure (N) Pressure (N) Shear (N)
71 2021 3.779168e-03 5.961164e-03 1.034847e-02 6.969454e-03 8.240903e-05 5.278791e-03 -3.490138e+00 -1.260061e+00 -2.230078e+00
72 2022 3.414811e-03 5.350398e-03 9.329119e-03 6.398522e-03 7.743586e-05 3.381806e-03 -3.491624e+00 -1.260241e+00 -2.231384e+00
Read each line, split on whitespace, and attempt to convert each field to a float. If the conversion fails, the line isn't kept. There's certainly a way to do this with a regex, but off the top of my head this should work:
lines_to_keep = []
with open('a.log') as f:
    for line in f:
        try:
            # Throws ValueError if any field can't be converted to float
            [float(x) for x in line.split()]
            # If the above line didn't throw a ValueError, keep the line
            lines_to_keep.append(line)
        except ValueError:
            continue
import re

list_to_keep = []
pattern = re.compile(r'[0-9 ]+[e.\-+][0-9]*', re.IGNORECASE)
with open('a.log') as logfile:
    for row in logfile:
        if pattern.match(row):       # keep rows that start with numeric data
            list_to_keep.append(row)
You can use a regex to find your rows and keep them in a list.
If you'd like regex: this matches runs of digits separated by numeric symbols like '+-.e'.
import re
r = re.compile(r'([0-9 ]+[e.\-+]*)+\n')
lines = [line for line in open('a.log') if r.fullmatch(line)]
# all the useful lines are ...
# 49 2001 1.076589e-01 9.570364e-01 2.588931e-01 1.984590e-01 4.028215e-03 3.964344e+01 -6.468809e+00 -1.253867e+00 -5.214942e+00
# 50 2002 5.987195e-02 4.004615e-01 2.597862e-01 1.808196e-01 2.819456e-03 2.537490e+01 -5.154729e+00 -1.228644e+00 -3.926085e+00
# 51 2003 4.824863e-02 2.048600e-01 1.359121e-01 1.103614e-01 1.384044e-03 1.623916e+01 -4.277053e+00 -1.216038e+00 -3.061015e+00
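Either way, since the goal is analysis with numpy, the kept lines can then be stacked into an array. A short sketch, assuming the lines list from the snippet above (every kept line has the same number of whitespace-separated numeric fields):
import numpy as np

data = np.array([[float(x) for x in line.split()] for line in lines])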

How to add column numbers to each column in a large text file

I would like to add column numbers to 128 columns in a text file
E.g.
My file
12 13 14 15
20 21 23 14
34 56 67 89
Required output
1:12 2:13 3:14 4:15
1:20 2:21 3:23 4:14
1:34 2:56 3:67 4:89
Can this be done using awk / python?
I tried the paste command to join two files: one with the values, the other with the column numbers, typed manually. Since the file is very large, manual typing didn't work.
As far as I know, the answers I could find only cover adding one column to the end of a text file.
Thanks for the suggestions.
awk to the rescue!
$ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file
should do.
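If you'd rather stay in Python, a minimal equivalent sketch (assuming whitespace-separated columns and an input file named 'file', as in the awk example):
with open('file') as src:
    for line in src:
        # prefix each field with its 1-based column number
        print(' '.join(f'{i}:{v}' for i, v in enumerate(line.split(), start=1)))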

Add a numeric column to an SFrame in Glab

In Graphlab,
I have a csv file which contains some_ids like the following:
some_ids
10
92
85
352
...
65664
I imported the csv file into Glab as an sframe:
my_csv = gl.SFrame.read_csv('my_csv_file.csv')
I need to add another column to the SFrame which contains the row number; I call it 'item_id'. The output will look like the following:
item_id,some_ids
1 10
2 92
3 85
4 352
...
13373 65664
I would prefer not to create another csv, but rather to do this inside GraphLab. We can also use numpy if needed. How can this be done, please? Thanks
There's an inbuilt command for just that:
graphlab.SFrame.add_row_number(column_name, start)
You can find out more about it in the documentation here
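For example, applied to the SFrame above (start=1 is an assumption here, to match the 1-based numbering in the desired output; the default start is 0):
my_csv = my_csv.add_row_number(column_name='item_id', start=1)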

How to parse data from .TX0 file into dataframe

Hi, I'm basically trying to parse a .TX0 file produced by a chromatogram. The file is just a bunch of results including retention times etc... I want to eventually pick certain pieces of data from multiple files and do some analysis. So far I have:
filename = 'filepath'
f = open(filename, 'r')
lines = f.readlines()
print lines
my output is:
Out[29]:
[....................................................
'"condensate analysis (HP4890 Optic - FID)"\n',
'"Peak","Component","Time","Area","Height","BL"\n',
'"#","Name","[min]","[uV*sec]","[uV]",""\n',
'------,------,------,------,------,------\n',
'1,"Methane",0.689,5187666.22,994337.57,*BB\n',
'2,"Ethane",1.061,1453339.93,729285.09,*BB\n',
'3,"Propane",1.715,193334.09,63398.74,*BB\n',
'4,"i-Butane",2.792,157630.92,29233.56,*BV\n',
'5,"n-Butane",3.240,98943.96,15822.72,*VB\n',
'"","","",------,------,""\n',
'"","","",7090915.11,1.83e+06,""\n',
'"Missing Component Report"\n',
'"Component","Expected Retention (Calibration File)"\n',
'------,------\n',
'"All components were found"\n',
'"Report stored in ASCII file :","...
"\n'.......................]
Now, here's the problem I'm having: I can't get this output into a structured dataframe using pandas... =/ I've tried, and it just gives me two columns...
pd.DataFrame(lines)
out:
Out[26]:
0
0 "=============================================...
1 "Software Version:",6.3.2.0646,"Date:","08/06/...
2 "Reprocess Number:","vma2: ......................
.......................
10 ""\n
11 ""\n
12 "condensate analysis (HP4890 Optic - FID)"\n
13 "Peak","Component","Time","Area","Height","BL"\n
14 "#","Name","[min]","[uV*sec]","[uV]",""\n
15 ------,------,------,------,------,------\n
16 1,"Methane",0.689,5187666.22,994337.57,*BB\n
17 2,"Ethane",1.061,1453339.93,729285.09,*BB\n
18 3,"Propane",1.715,193334.09,63398.74,*BB\n
19 4,"i-Butane",2.792,157630.92,29233.56,*BV\n
20 5,"n-Butane",3.240,98943.96,15822.72,*VB\n
21 "","","",------,------,""\n
22 "","","",7090915.11,1.83e+06,""\n
23 "Missing Component Report"\n
24 "Component","Expected Retention (Calibration F...
25 ------,------\n
26 "All components were found"\n
27 "Report stored in ASCII file :","C:\Shared Fol...
