Repeatedly extracting a substring between specific characters in a text file (Python)

I have several pieces of data stored in a text file. I am trying to extract each type of data into individual lists so that I can plot them and make various figures. There are thousands of values, so extracting them by hand isn't really an option.
An example of the text file is:
"G4WT7 > interaction in material = MATERIAL
G4WT7 > process PROCESSTYPE
G4WT7 > at position [um] = (x,y,z)
G4WT7 > with energy [keV] = 0.016
G4WT7 > track ID and parent ID = ,a,b
G4WT7 > with mom dir = (x,y,z)
G4WT7 > number of secondaries= c
G4WT1 > interaction in material = MATERIAL
G4WT1 > process PROCESSTYPE
G4WT1 > at position [um] = (x,y,z)
G4WT1 > with energy [keV] = 0.032
G4WT1 > track ID and parent ID = ,a,b
G4WT1 > with mom dir = (x,y,z)
G4WT1 > number of secondaries= c"
I would like to extract strings such as the string following "energy [keV] =" (so 0.016, 0.032, etc.) into a list. I hope to be able to separate all the data similarly to this.
So far I've tried to use regex, as follows:
import re
file = open('file.txt')
textfile = file.read()
Energy = re.findall('[keV] = ;(.*)G', textfile)
But it just generates an empty list: []
I'm a newbie to Python, so apologies if the answer is obvious; any help would be greatly appreciated.

You might want to escape the square brackets!
Energy = re.findall(r'\[keV\] = (.*)', textfile)
... or to be on the safe side you can also use re.escape to make sure all characters are properly escaped, e.g.:
Energy = re.findall(re.escape('[keV] = ') + '(.*)', textfile)
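Putting it together, a minimal end-to-end sketch (assuming the log is saved as file.txt and every energy line matches the sample format shown above):
import re

# A minimal sketch, assuming the log is in 'file.txt' and every energy
# line has the form "with energy [keV] = <number>" as in the sample.
with open('file.txt') as f:
    textfile = f.read()

# The raw string keeps the escaped brackets intact; the group captures
# the rest of the line after "[keV] = ".
energies = [float(e) for e in re.findall(r'\[keV\] = (.*)', textfile)]
print(energies)  # e.g. [0.016, 0.032]
The same pattern works for the other fields, e.g. r'interaction in material = (.*)' for the material names.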

Related

Highlighting occurrences of a string in a Tkinter textField

I have a regex pattern that returns a list of all the start and stop indices of an occurring string, and I want to be able to highlight each occurrence. It's extremely slow with my current setup: using a 133,000-line file, it takes about 8 minutes to highlight all occurrences.
Here's my current solution:
if IPv == 4:
    v4FoundUnique = v4FoundUnique + 1
    # highlight all regions found
    for j in range(qty):
        v4Found = v4Found + 1
        # don't highlight if they set the checkbox not to
        if highlightText:
            # get row.column coordinates of start and end of match
            # very slow
            startIndex = textField.index('1.0 + {} chars'.format(starts[j]))
            # compute end based on start, using the assumption that IP addresses
            # won't span lines; drastically faster than computing from the raw index
            endIndex = "{}.{}".format(startIndex.split(".")[0],
                                      int(startIndex.split(".")[1]) + stops[j] - starts[j])
            # apply tag
            textField.tag_add("{}v4".format("public" if isPublic else "private"),
                              startIndex, endIndex)
So, Tkinter has a pretty bad implementation of converting an "absolute location" to its row.column format:
startIndex = textField.index('1.0 + {} chars'.format(starts[j]))
it's actually faster to do it like this:
for address in v4check.finditer(filetxt):
    # address.group() returns matching text
    # address.span() returns the indices (start, stop)
    start, stop = address.span()
    ip = address.group()
    srow = filetxt.count("\n", 0, start) + 1
    scol = start - filetxt.rfind("\n", 0, start) - 1
    start = "{}.{}".format(srow, scol)
    stop = "{}.{}".format(srow, scol + len(ip))
which takes the regex results and the input file to get the data we need (row.column).
There could be a faster way of doing this, but this is the solution I found that works!
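As a further refinement (my own suggestion, not part of the answer above), the newline offsets can be precomputed once, so each match offset maps to a row.column index with a binary search instead of rescanning the text for every match:
import bisect

# Assumes filetxt from the answer above; build the newline offset table once.
newlines = [i for i, ch in enumerate(filetxt) if ch == "\n"]

def to_index(offset):
    # The number of newlines strictly before `offset` gives the 0-based row
    k = bisect.bisect_left(newlines, offset)
    col = offset - (newlines[k - 1] + 1) if k else offset
    return "{}.{}".format(k + 1, col)  # Tkinter rows are 1-based
to_index(start) and to_index(stop) can then replace the count/rfind arithmetic inside the loop.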

Isolating the Sentence in which a Term appears

I have the following script, which:
Extracts all text from a PowerPoint (all separated by a ":::")
Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
Creates a dataframe for the term + file which that term appeared
Iterates through each PowerPoint for the given folder
I am hoping to adjust this to also include the specific sentence in which the term appears (i.e., the entire content between the ::: before and the ::: after the term).
import os
import pandas as pd
from pptx import Presentation

end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]
files = []
text = []
for p in ppt:
    try:
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except:
        print("Failed: " + str(p))
agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()
terms = ['test', 'testing']
a = [(x, z, i) for x, z, y in zip(agg['File'], agg['Unstructured'], agg['Unstructured']) for i in terms if i in y]
# how do I also include the sentence where this term appears
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])  # will need to add a column here
onepager = onepager.drop_duplicates(keep="first")
A one-line sample of agg:
File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation
Adjustment based on input:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager['Sentence'] = onepager["Unstructured"].apply(
        lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t))])
To find the sentence containing the word "test", try:
>>> agg["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find("test")) + 3: x.find(":::", x.find("test"))])
Looping through your terms:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager[t] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t))])
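An alternative sketch (my own variant, not from the answer above): split each document on the ':::' delimiter and keep the first segment containing the term, which avoids the index arithmetic and its off-by-one risks:
def sentence_with(text, term):
    # Return the first ':::'-delimited segment that contains the term
    for segment in text.split(':::'):
        if term in segment:
            return segment
    return ''

onepager['Sentence'] = [
    sentence_with(text, term)
    for text, term in zip(onepager['Unstructured'], onepager['Term'])
]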

Multiplying numbers on lines starting with the word "size" by a constant across 181 text files

I have a folder of 181 text files, each containing numbers, but I only need to multiply the numbers on lines that have "size" by a constant such as 0.5. I did achieve this with: Search and replace math operations with the result in Notepad++.
But what I am trying to do is avoid special expressions and quotation marks, so that the rest of the community I am in can simply do the same without editing every file to meet the format needed to multiply each number.
For example:
farmers = {
    culture = armenian
    religion = coptic
    size = 11850
}
being multiplied by 0.5 to:
farmers = {
    culture = armenian
    religion = coptic
    size = 5925
}
I tried making a Python script, but it did not work, although I don't know much Python:
import operator

with open('*.txt', 'r') as file:
    data = file.readlines()
factor = 0.5
count = 0
for index, line in enumerate(data):
    try:
        first_word = line.split()[0]
    except IndexError:
        pass
    if first_word == 'size':
        split_line = line.split(' ')
        # print(' '.join(split_line))
        # print(split_line)
        new_line = split_line
        new_line[-1] = ("{0:.6f}".format(float(split_line[-1]) * factor))
        new_line = ' '.join(new_line) + '\n'
        # print(new_line)
        data[index] = new_line
        count += 1
    elif first_word == 'text_scale':
        split_line = line.split(' ')
        # print(split_line)
        # print(' '.join(split_line))
        new_line = split_line
        new_line[-1] = "{0:.2f}".format(float(split_line[-1]) * factor)
        new_line = ' '.join(new_line) + '\n'
        # print(new_line)
        data[index] = new_line
        count += 1
with open('*.txt', 'w') as file:
    file.writelines(data)
print("Lines changed:", count)
So are there any solutions to this? I'd rather not make people in my community reformat every single file to work with my solution. Anything could work; I just haven't found a simple solution that is quick and easy to understand for those who use Notepad++ or Sublime Text 3.
If you use EmEditor, you can use the Replace in Files feature of EmEditor. In EmEditor, select Replace in Files on the Search menu (or press Ctrl + Shift + H), and enter:
Find: (?<=\s)size(.*?)(\d+)
Replace with: \J "size\1" + \2 / 2
File Types: *.txt (or file extension you are looking for)
In Folder: (folder path you are searching in)
Set the Keep Modified Files Open and Regular Expressions options (and the Match Case option if size always appears in lower case), then click the Replace All button.
Alternatively, if you would like to use a macro, this is a macro for you (you need to edit the folder path):
editor.ReplaceInFiles("(?<=\\s)size(.*?)(\\d+)","\\J \x22size\\1\x22 + \\2 / 2","E:\\Test\\*.txt",eeFindReplaceCase | eeFindReplaceRegExp | eeReplaceKeepOpen,0,"","",0,0);
To run this, save this code as, for instance, Replace.jsee, and then select this file from Select... in the Macros menu. Finally, select Run Replace.jsee in the Macros menu.
Explanations:
\J in the replacement expression specifies JavaScript. In this example, \2 is the backreference for (\d+), so \2 / 2 represents the matched number divided by two.
References: EmEditor How to: Replace Expression Syntax
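If a pure Python version is still wanted, here is a hedged sketch of what the script in the question seems to be aiming for. Note that plain open('*.txt') does not expand wildcards, so glob is used instead; the size-line format is assumed to match the example above:
import glob
import re

factor = 0.5
# Process every .txt file in the current folder (adjust the pattern as needed)
for path in glob.glob('*.txt'):
    with open(path, 'r') as f:
        lines = f.readlines()
    count = 0
    for i, line in enumerate(lines):
        # match lines such as "    size = 11850", keeping the prefix intact
        m = re.match(r'(\s*size\s*=\s*)(\d+(?:\.\d+)?)\s*$', line)
        if m:
            lines[i] = '{}{:g}\n'.format(m.group(1), float(m.group(2)) * factor)
            count += 1
    with open(path, 'w') as f:
        f.writelines(lines)
    print(path, 'lines changed:', count)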

python structured array composition and transformation

I created a script that collects a large amount of data from a .txt file into an array in the format I want [3: 4: n], and the information is recorded as follows (I think). The .txt file is in this format:
1.000000e-01 1.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 3.000000e-01
3.000000e-01 2.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 4.000000e-01
and this repeats N times. I basically store the data four lines at a time (like a block), because I'm working with ASCII files from STL parts.
With that in mind, I have this code:
f = open("camaSTLfinalmente.txt", "r")
b_line = 0
Coord = []
Normal = []
Vertice_coord = []
Tri = []
block = []
for line in f:
    line = line.rstrip()
    if line:
        split = line.split()
        for axis in range(0, 3):
            if b_line == 0:  # normal
                Normal.append(split[axis])
            else:  # triangle
                Vertice_coord.append(split[axis])
        if b_line > 0:
            Tri.append(Vertice_coord)
            Vertice_coord = []
        if b_line == 3:
            block.append(Normal)
            block.append(Tri)
            Coord.append(block)
            block = []
            Normal = []
            Tri = []
            b_line = 0
        else:
            b_line += 1
print(Coord[0])  # prints the output shown below
The information is stored in this way:
[['1.000000e-01', '1.000000e-01', '1.000000e-01'], [['1.000000e-01', '2.000000e-01', '3.000000e-01'], ['3.000000e-01', '2.000000e-01', '1.000000e-01'], ['1.000000e-01', '2.000000e-01', '-4.000000e-01']]]
Is there any way to simplify it?
I would like to take this opportunity to ask: I wanted to convert this information into numbers, and ideally I would read the number after the exponent (e) and change the number accordingly, that is, 1.000000e-01 becomes 0.1 (in order to perform operations with a similar array where I store information from another .txt file with the same format).
Thanks for the attention,
Pedro
You can try changing the line split = line.split() to:
split = [float(x) for x in line.split()]
If you need the result to be a string rather than a float:
split = [str(float(x)) for x in line.split()]
I'm not 100% sure I fully understand what you want, but the following code produces the same Coord:
coord = []
with open('camaSTLfinalmente.txt', 'r') as f:
    content = [line.strip().split() for line in f]
for i in range(len(content) // 4):
    coord.append([content[4*i], content[(4*i + 1):(4*i + 4)]])
Regarding the second question, as remarked in another answer, the easiest way to handle strings containing a number is to convert them to a number and then format that number as a string.
s = '1.000000e-01'
n = float(s)
m = '{:.1f}'.format(n)
See the section about string formatting in the Python doc.
A couple of remarks:
Generally Stack Overflow doesn't like questions of the form "how do I improve this piece of code"; try to ask more specific questions.
The above assumes your file contains a multiple of 4 lines; change the integer division ...//4 accordingly if some lines are left at the end that do not form a block of 4.
Don't use capital letters for your variables. While style guides are not mandatory, it is good practice to follow them (look up PEP 8, pylint, ...).
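If numeric arrays are the end goal, a numpy sketch (an assumption on my part; numpy does not appear in the question) can replace the manual parsing entirely, provided the file holds only whitespace-separated floats in blocks of four rows:
import numpy as np

# loadtxt skips blank lines and parses the exponent notation as floats;
# reshape groups the rows into blocks of 4 rows x 3 columns.
arr = np.loadtxt('camaSTLfinalmente.txt').reshape(-1, 4, 3)
normal = arr[0, 0]     # first block's normal, e.g. [0.1, 0.1, 0.1]
vertices = arr[0, 1:]  # first block's three vertices as a 3x3 array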

Semantic Similarity between Sentences in a Text

I have used material from here and a previous forum page to write some code for a program that will automatically calculate the semantic similarity between consecutive sentences across a whole text. Here it is:
The code for the first part is copy-pasted from the first link; then I have the code below, which I put in after line 245. I removed everything after line 245.
with open ("File_Name", "r") as sentence_file:
while x and y:
x = sentence_file.readline()
y = sentence_file.readline()
similarity(x, y, true)
#boolean set to false or true
x = y
y = sentence_file.readline()
My text file is formatted like this:
Red alcoholic drink. Fresh orange juice. An English dictionary. The
Yellow Wallpaper.
In the end I want to display all the pairs of consecutive sentences with the similarity next to them, like this:
["Red alcoholic drink.", "Fresh orange juice.", 0.611],
["Fresh orange juice.", "An English dictionary.", 0.0]
["An English dictionary.", "The Yellow Wallpaper.", 0.5]
if norm(vec_1) > 0 and norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
elif norm(vec_1) < 0 and norm(vec_2) < 0:
    ???Move On???
This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to process two lines at a time is to set up an "infinite loop" and check the last line we've read to see if we've hit the end (readline() returns an empty string at the end of a file).
# You'll probably need the file extension (.txt or whatever) in open as well
with open("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []
    # Loop until we hit the end of the file
    while True:
        # Read two lines
        x = sentence_file.readline()
        y = sentence_file.readline()
        # Check if we've reached the end of the file; if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        else:
            # The .rstrip('\n') removes the newline character from each line
            x = x.rstrip('\n')
            y = y.rstrip('\n')
            try:
                # Calculate your similarity value
                similarity_value = similarity(x, y, True)
                # Add the two lines and similarity value to the results list
                results.append([x, y, similarity_value])
            except:
                print("Error when parsing lines:\n{}\n{}\n".format(x, y))
# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
Edit: Regarding the issues you're getting from similarity(), if you want to simply ignore the line pairs that are causing these errors (without looking at the source in depth, I really have no idea what's going on), you can add a try/except around the call to similarity().
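If the errors come from zero-norm vectors, as the fragment in the question suggests, a guarded cosine similarity is another option. This is a hedged sketch (safe_cosine is a hypothetical helper, since I don't know what similarity() does internally):
import numpy as np

def safe_cosine(vec_1, vec_2):
    # Return None when either vector has zero norm, so the caller
    # can skip ("move on" from) that pair instead of raising an error
    n1 = np.linalg.norm(vec_1)
    n2 = np.linalg.norm(vec_2)
    if n1 > 0 and n2 > 0:
        return float(np.dot(vec_1, vec_2) / (n1 * n2))
    return None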
