Extracting multiple values after an exact string using regular expressions - python

I have 100s of .txt/.sed files with lots of lines in each.
Sample input file:
Time: 10:34:51.49,15:21:39.24
Box Temperature (K): 32.82,8.88,-10.07
Silicon Temperature (K): 10.90,9.88
Voltage: 7.52,7.41
Dark Mode: AUTO,AUTO
Radiometric Calibration: RADIANCE
Units: W/m^2/sr/nm
GPS Time: n/a
Satellites: n/a
Channels: 1024
Desired output:
Time 15:21:39.24
Box Temp 32.82
8.88
-10.07
Si Temp 10.90
9.88
I was trying to write code that identifies the string and builds a list of the values; later I would tackle arranging them into a DataFrame and writing them to a .csv file.
Sample code
import re

testtxt = 'Temperature (K): 32.82,8.88,-10.07,32.66,8.94,-10.07'
exp = r'^Temperature (K):(\s*) ([0-9.]+)([0-9.]+), ([0-9.-]+) , (-[0-9-.]+),([0-9-.]+) , ([0-9-.]+),(-[0-9-.]+)'
regexp = re.compile(exp)
my_temp = regexp.search(testtxt)
print(my_temp.group(0))
ERROR:
AttributeError: 'NoneType' object has no attribute 'group'
Basically, it finds no match!
Clarification: I want an efficient way to only extract the Time and Temperature values, not the others. It would be great to be able to stop scanning the files once those are found since each file has over 500 lines and I have lots of them.

My suggestion would be to use string.startswith() to determine if the string starts with "Box Temperature (K)", or whatever. Once you find that, get the rest of the string, parse it as a CSV, and then validate each of the components. Trying to do this all with regular expressions is more trouble than it's worth.
If you want to have the code stop once it's found everything, just set flags for the things you want to find, and once all the flags are set you can exit. Something like:
found_time = found_box_temp = found_si_temp = False
with open(path) as fh:  # path: one of your .txt/.sed files
    for line in fh:
        if line.startswith("Box Temperature (K):"):
            # parse and output
            found_box_temp = True
        elif line.startswith("Time:"):
            # parse and output
            found_time = True
        elif line.startswith("Silicon Temperature (K):"):
            # parse and output
            found_si_temp = True
        if found_time and found_box_temp and found_si_temp:
            break  # stop scanning once everything has been found
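For the "parse and output" step, a minimal sketch of the CSV-style parsing (assuming the values are simply comma-separated after the first colon, as in the sample file):
def parse_values(line):
    # "Box Temperature (K): 32.82,8.88,-10.07" -> ['32.82', '8.88', '-10.07']
    _, _, rest = line.partition(":")
    return [v.strip() for v in rest.split(",")]
Calling parse_values("Time: 10:34:51.49,15:21:39.24") would then return ['10:34:51.49', '15:21:39.24'].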

How Can I Create Variables From Text and Convert Coordinates

I am aware there are similar questions to mine, but after trying numerous "answers" over several hours I thought my best next step is to submit my conundrum here. I respect your time.
After several hours with no success in understanding why my Python script won't work, I decided to see if someone could help me. Essentially, the goal is to use the astronomical program "Stellarium" as a "day and night sky" to practice Celestial Navigation (CelNav) while navigating the simulated world of Microsoft Flight Simulator X (FSX). The script actually writes a "startup.ssc" script which initializes Stellarium's date, time, and position.
The process is thus:
Use FSX and save a "flight." This creates a *.FLT file, a text file that saves the complete situation, including time and location.
Run FSXtoStellarium.py.
Locate the lines of date, time, latitude, longitude, and altitude in the *.FLT text.
Read the data into variables.
Convert the Degrees (°), Minutes ('), Seconds (") (DMS) to Decimal Degrees (DD).
Lastly, the script constructs a "startup.ssc" and opens Stellarium at the recorded time and place.
The Problem:
I have not been able to read the DMS into variable(s) correctly nor can I format the DMS into Decimal Degrees (DD). According to the "watches" I set in my IDE (PyScripter), the script is reading in an "int" value I can't decipher instead of the text string of the DMS (Example: W157° 27' 23.20").
Here are some excerpts of the file and script.
HMS Bounty.FLT
Various lines of data above...
[SimVars.0]
Latitude=N21° 20' 47.36"
Longitude=W157° 27' 23.20"
Altitude=+000004.93
Various lines of data below...
EOF
FSXtoStellarium.py
Various lines of script above...
# find lat & Lon in the file
start = content.find("SimVars.0")
latstart = content.find("Latitude=")
latend = content.find("Latitude=",latstart+1)
longstart = content.find("Longitude=",start)
longend = content.find(",",longstart)
# convert to dec deg
latitude = float(content[longend+1:latend])/120000
longitude = float(content[longstart+10:longend])/120000
Various lines of script below...
So, what am I missing?
FYI - I am an old man who gets confused. My professional career was in COBOL/DB2/CICS, but you can consider me a Python newbie (it shows, right?). :)
Your help is greatly appreciated, and I will gladly provide any additional information.
Calvin
Here is a way to get from the text file (with multiple input lines) all the way to Decimal Degrees in Python 2.7:
from __future__ import print_function
content='''
[SimVars.0]
Latitude=N21° 20' 47.36"
Longitude=W157° 27' 23.20"
'''
latKey = "Latitude="
longKey = "Longitude="
latstart = content.index(latKey) + len(latKey)
latend = content.find('"', latstart) + 1
longstart = content.find(longKey, latend) + len(longKey)
longend = content.find('"', longstart) + 1
lat = content[latstart:latend]
long = content[longstart:longend]
print()
print('lat ', lat)
print('long ', long)
deg, mnt, sec = [float(x[:-1]) for x in lat[1:].split()]
latVal = deg + mnt / 60 + sec / 3600
deg, mnt, sec = [float(x[:-1]) for x in long[1:].split()]
longVal = deg + mnt / 60 + sec / 3600
print()
print('latVal ', latVal)
print('longVal ', longVal)
Explanation:
we start with a multi-line string, content
the index() call finds the start position of the substring "Latitude=" within content, to which we add the length of "Latitude=" since what we care about is the characters following the = character
the first find() call searches for the 'seconds' character " (which marks the end of the latitude substring), to which we add one (for the length of the ")
the second find() call does for "Longitude=" something similar to what we did for "Latitude=", except it starts searching at position latend, since we expect the longitude string to follow the latitude string
the third find() call seeks the end of the longitude substring and is completely analogous to the find() call that located the end of the latitude substring
the assignment to lat uses square-bracket slice notation on the string content to extract the substring from the end of "Latitude=" to the subsequent " character
the assignment to long is analogous to the previous step
the first assignment to deg, mnt, sec is assigning a tuple of 3 values to these variables using a list comprehension:
split lat[1:], which is to say lat with the leading cardinal direction character N removed, into space-delimited tokens 21°, 20' and 47.36"
for each token, x[:-1] uses slice notation to drop the final character which gives strings 21, 20 and 47.36
float() converts these strings to numbers of type float
the assignment to latVal does the necessary arithmetic to calculate a quantity in decimal degrees using the degrees, minutes and seconds stored in deg, mnt, sec.
the treatment of long to get to longVal is completely analogous to that for lat and latVal above.
Output:
lat N21° 20' 47.36"
long W157° 27' 23.20"
latVal 21.34648888888889
longVal 157.45644444444443
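Note that the output above keeps the W longitude positive. If a signed decimal value is wanted (the usual convention is negative for S and W, though check what your startup.ssc actually expects), the same arithmetic can be wrapped in a small helper; this is just a sketch:
def dms_to_dd(dms):
    # e.g. N21° 20' 47.36" -> 21.346..., W157° 27' 23.20" -> -157.456...
    direction, rest = dms[0], dms[1:]
    deg, mnt, sec = [float(tok[:-1]) for tok in rest.split()]
    dd = deg + mnt / 60 + sec / 3600
    return -dd if direction in ('S', 'W') else dd

print(dms_to_dd(lat))    # 21.34648888888889
print(dms_to_dd(long))   # -157.45644444444443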

Applying function to pandas dataframe: is there a more efficient way of doing this?

I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
   Author             Title            Date        Category            Text                                                url
0  Amira Charfeddine  Wild Fadhila 01  2019-01-01  novel               الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ...  NaN
1  Amira Charfeddine  Wild Fadhila 02  2019-01-01  novel               في التزغريت، والعياط و الزمامر، ليوم نتيجة الب...  NaN
2  253826             1515368_7636953  2010-12-28  /forums/forums/91/  هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا...  https://www.tunisia-sat.com/forums/threads/151...
3  250442             1504416_7580403  2010-12-21  /forums/sports/     \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا...  https://www.tunisia-sat.com/forums/threads/150...
4  312628             1504416_7580433  2010-12-21  /forums/sports/     quel est le résultat final\n,,,,????                https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini Coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU-intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re
with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1 - sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    gini = compute_gini(count_list)
    spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    # gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
    # spelling_df[w] = gini  # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
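If what you ultimately want is a single corpus-wide Gini value per lemma (an assumption on my part), you can also collapse each per-row count Series to a scalar with .sum() before calling compute_gini, so the Gini arithmetic runs on a handful of numbers per lemma instead of on 900K-row Series; a minimal sketch using your own helpers:
gini_lst = []
for w, var in spelling_var.items():
    # one total count per spelling variant across the whole corpus
    totals = [count_word(spelling_df, v).sum() for v in var]
    gini_lst.append(compute_gini(totals))

df_lemma_gini = pd.DataFrame({"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})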

Clean text files - remove unwanted content in LOOP (R/python)

I want to clean out all the "waste" (content that makes the files unsuitable for analysis) in unstructured text files.
In this specific situation, one option for retaining only the wanted information is to keep all numbers above 250 (the text is a combination of strings, numbers, ...).
For a large number of text files, I want to perform the following action in R:
x <- x[which(x >= "250"),]
The code for one text file (above) works perfectly; when I try to do the same in a loop over the large number of text files, it fails (error: incorrect number of dimensions).
for(i in 1:length(files)){
  i <- i[which(i >= "250"),]
}
Does anyone have an idea how to solve this in R (or Python)?
Picture: a very simplified example of a text file; I want to retain everything between (START) and (END).
This makes no sense if it is 10K files; why are you even trying to do this in R or Python? Why not just a simple awk or bash command? Moreover, your image shows parsing info between START and END from the text files; I am not sure if it is a data frame with columns across (try to put in a simple dput rather than images).
All you are trying to do is grep between START and END across 10K files. I would do that in bash.
Something like this in bash should work:
for i in *.txt
do
  sed -n '/START/,/END/{//!p}' "$i" > "$i.edited.txt"
done
If the columns are standard across files, you can do the following in R (but I would not read 10K files into R memory).
Read the files as a list of data frames, then simply do an lapply:
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a ,b = b,c = c)
lapply(df_list, subset, col1 >= 250)
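If you would rather stay in Python, a minimal sketch of the same START/END filtering over all .txt files in a folder (the file-name pattern and output suffix are placeholders):
import glob

for path in glob.glob("*.txt"):
    keep, inside = [], False
    with open(path) as fh:
        for line in fh:
            if "START" in line:
                inside = True   # start keeping lines after this marker
                continue
            if "END" in line:
                inside = False  # stop keeping lines at this marker
                continue
            if inside:
                keep.append(line)
    with open(path + ".edited.txt", "w") as out:
        out.writelines(keep)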

Difficulty reading data from a text file and converting to a float

UPDATE:
My problem was due to the input file having an odd encoding. Changing my opening statement to "open(os.path.join(root, 'Report.TXT'), 'r', encoding='utf-16')" fixed my problem.
ORIGINAL TEXT
I'm trying to make a program that will allow me to more easily organize data from some lab equipment. This program recursively moves through folders, locates a file named Report.TXT, grabs a few numbers from it, and correctly organizes them in an Excel file. There's a lot of irrelevant information in this file, so I need to grab only a specific part of it (e.g. line 56, characters 72-95).
Here's an example of a part of one of these Report.TXT files containing information I want to grab (under the ng/uL column):
RetTime Type Area Amt/Area Amount Grp Name
[min] [nRIU*s] [ng/ul]
-------|------|----------|----------|----------|--|------------------
4.232 BB 6164.18262 1.13680e-5 7.00746e-1 Compound1
5.046 BV 2.73487e5 1.34197e-5 36.70109 Compound2
5.391 VB 3.10324e5 1.34678e-5 41.79371 Compound3
6.145 - - - Compound4
7.258 - - - Compound5
8.159 - - - Compound6
11.092 BB 3447.12158 2.94609e-5 1.01555 Compound7
Totals : 80.21110
This is only a portion of the Report.TXT; the actual "Compound1" is on line 54 of the real file.
I've managed to form something that will grab these values and insert them into an Excel file as strings:
for rootdir in range(1,tdirs+1):
    flask = 0
    for root, subFolders, files in os.walk(str(rootdir)):
        if 'Report.TXT' in files:
            flask += 1
            with open(os.path.join(root, 'Report.TXT'), 'r') as fin:
                print(root)
                for x in range(0,67):
                    line = fin.readline()
                    if x == 54:
                        if "-" in line[75:94]:
                            compound1 = 0
                        else:
                            compound1 = str(line[75:94].strip())
                        print(compound1)
                        datasheet.write(int(rootdir)+2,int(flask),compound1)
                    if x == 56:
                        if "-" in line[75:94]:
                            compound2 = 0
                        else:
                            compound2 = str(line[75:94].strip())
                        print(compound2)
                        datasheet.write(int(tdirs)+int(rootdir)+6,int(flask),compound2)
However, if I replace the str(line[75:94].strip()) with float(line[75:94].strip()), I get a "cannot convert string to float" error. The printing was just for my own troubleshooting, but it doesn't seem to give me any extra information.
Any ideas on what I can do to fix this?
Converting to float is not such a good idea in this case.
Since you are copying the values to a delimited file, it doesn't matter whether you convert them to float or not. Moreover, the floating-point representation issues in Python suggest not converting with the standard library float() method.
You will be better off writing the string values, since you want your lab results to be accurate.
Use numpy to convert the scientific-notation values to decimals if it is necessary.
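If a numeric type really is needed later, the standard-library decimal module is one alternative to numpy; a minimal sketch (assuming the slice holds either a value such as 7.00746e-1 or a dash placeholder):
from decimal import Decimal, InvalidOperation

raw = line[75:94].strip()
try:
    value = Decimal(raw)        # parses scientific notation exactly, e.g. Decimal('0.700746')
except InvalidOperation:        # raised for the '-' placeholders or an empty slice
    value = Decimal(0)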

Finding exon/ intron borders in a gene

I would like to go through a gene and get a list of 10 bp long sequences containing the exon/intron borders from each feature.type == 'mRNA'. It seems like I need to use CompoundLocation and the locations used in 'join', but I cannot figure out how to do it or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
This converts the "location" information into a string, and you can continue as below.
(There are ways to work directly with the location information without converting to a string; in particular, you can use "extract" to pull the spliced sequence directly out of the parent sequence. But the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re

jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)}"
jctP = re.compile(r"\[\d+\:\d+\]")
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
    (start, end) = jct.replace('[', '').replace(']', '').split(':')
    try:  # You need to account for going out of index, e.g. where start = 0
        start_20_20 = seq[int(start)-20:int(start)+20]
    except IndexError:
        pass  # do your alternatives, e.g. start = int(start)
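As a side note on the extract() route mentioned above, here is a minimal sketch of pulling the spliced sequence straight from the parent record (the GenBank file name is a placeholder):
from Bio import SeqIO

record = SeqIO.read("my_gene.gb", "genbank")   # placeholder file name
for f in record.features:
    if f.type == 'CDS':
        spliced = f.extract(record.seq)        # exons joined together, introns removed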
