Semantic Similarity between Sentences in a Text - python

I have used material from here and a previous forum page to write a program that automatically calculates the semantic similarity between consecutive sentences across a whole text. Here it is:
The code for the first part is copy-pasted from the first link; then I added the code below after line 245 of that script and removed everything that followed line 245.
with open("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        similarity(x, y, true)
        # boolean set to false or true
        x = y
        y = sentence_file.readline()
My text file is formatted like this:
Red alcoholic drink. Fresh orange juice. An English dictionary. The
Yellow Wallpaper.
In the end I want to display all the pairs of consecutive sentences with the similarity next to each pair, like this:
["Red alcoholic drink.", "Fresh orange juice.", 0.611]
["Fresh orange juice.", "An English dictionary.", 0.0]
["An English dictionary.", "The Yellow Wallpaper.", 0.5]
if norm(vec_1) > 0 and if norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
elif norm(vec_1) < 0 and if norm(vec_2) < 0:
    ???Move On???
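For reference, the norm guard sketched above can be written as a small helper. This is only a sketch, assuming plain NumPy 1-D vectors; note that norms are never negative, so the only case to guard against is a zero vector:

```python
import numpy as np

def cosine_similarity(vec_1, vec_2):
    # Norms are >= 0 by definition; the only degenerate case is a zero vector.
    n1 = np.linalg.norm(vec_1)
    n2 = np.linalg.norm(vec_2)
    if n1 > 0 and n2 > 0:
        return float(np.dot(vec_1, vec_2) / (n1 * n2))
    # "Move on": report zero similarity when either vector has no magnitude.
    return 0.0
```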

This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to handle a pair of lines at a time is to set up an "infinite loop" and check the last line we've read to see if we've hit the end (readline() returns an empty string at the end of a file).
# You'll probably need the file extension (.txt or whatever) in open as well
with open("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []
    # Read the first line, then loop until we hit the end of the file
    x = sentence_file.readline()
    while True:
        # Read the next line so (x, y) is a pair of consecutive sentences
        y = sentence_file.readline()
        # Check if we've reached the end of the file, if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        else:
            # The .rstrip('\n') removes the newline character from each line
            x = x.rstrip('\n')
            y = y.rstrip('\n')
            try:
                # Calculate your similarity value
                similarity_value = similarity(x, y, True)
                # Add the two lines and similarity value to the results list
                results.append([x, y, similarity_value])
            except Exception:
                print("Error when parsing lines:\n{}\n{}\n".format(x, y))
            # Slide the window: the second sentence becomes the first of the next pair
            x = y

# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
Edit: Regarding the issues you're getting from similarity(): if you want to simply ignore the line pairs that are causing these errors (without looking at the source in depth I really have no idea what's going on), you can add a try/except around the call to similarity().


How to remove dash/ hyphen from each line in .txt file

I wrote a little program to turn pages from book scans into a .txt file. On some lines, words are hyphenated and continued on the next line. I wonder if there is any way to remove the dashes and merge the word halves with the syllables on the line below?
E.g.:
effects on the skin is fully under-
stood one fights
to:
effects on the skin is fully understood
one fights
or:
effects on the skin is fully
understood one fights
Or something like that. As long as it ends up connected. Python is my third language and so far I can't think of anything, so maybe someone will give me a hint.
Edit:
The point is that the last symbol, if it is a dash, is removed and merged with the rest of the word below
This is a generator which takes the input line-by-line. If it ends with a - it extracts the last word and holds it over for the next line. It then yields any held-over word from the previous line combined with the current line.
To combine the results back into a single block of text, you can join it against the line separator of your choice:
source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sin-
gle "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""

def reflow(text):
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            lin, _, e = line.rpartition(" ")
        else:
            lin, e = line, ""
        yield f"{holdover}{lin}"
        holdover = e[:-1]

print("\n".join(reflow(source)))
which prints:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
To read one file line-by-line and write directly to a new file:
def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source:
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]

if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
Here is one way to do it:
with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.rstrip('\n')  # remove the newline character at the end of the line
        ends_with_hyphen = item.endswith('-')  # check that the dash is the last character
        if merge_line:
            # continue the previous hyphenated line
            combined_strings[-1] += item[:-1] if ends_with_hyphen else item
        elif ends_with_hyphen:
            combined_strings.append(item[:-1])
        else:
            combined_strings.append(item)
        merge_line = ends_with_hyphen
If you just parse the input as a string, you can use the .split() function to move these kinds of items around:
words = "effects on the skin is fully under-\nstood one fights"

# splitting on the newlines
wordsSplit = words.split("\n")

# splitting on the word spaces
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")

# checking for the end-of-line hyphens
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        # only the last word of a line, and only when the hyphen is the final character
        if g == len(wordsSplit[i]) - 1 and wordsSplit[i][g].endswith("-") and i + 1 < len(wordsSplit):
            # setting the new word in the list and removing the hyphen
            wordsSplit[i][g] = wordsSplit[i][g][0:-1] + wordsSplit[i + 1][0]
            wordsSplit[i + 1][0] = ""

# recreating the string
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g] + " "
What this does is split on the newlines, which is where the trailing hyphens occur, and then split each line into a list of words. It then checks the last word of each line for a hyphen; when it finds one, it joins that word (minus the hyphen) with the first word of the next line and sets that first word to an empty string. Finally, it reconstructs the string into a variable called msg, skipping the empty strings so no stray spaces are added.
What about:
import re

a = '''effects on the skin is fully under-
stood one fights'''

re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~', '\n')
Explanation:
a.replace('\n', '~') joins the input into a single line, with ~ standing in for \n. (Pick a different placeholder if ~ can occur in your text.)
-~([a-zA-Z0-9]*) followed by a space then matches each hyphen/newline break; the () capture group saves the word fragment after the break, and r'\1\n' re-inserts it followed by a real newline.
.replace('~', '\n') finally turns the remaining ~ characters back into newlines.
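Running the one-liner on the sample confirms the merge:

```python
import re

a = '''effects on the skin is fully under-
stood one fights'''

result = re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~', '\n')
print(result)
```

One caveat: the pattern requires a space after the rejoined fragment, so a hyphenated word whose continuation ends the text (with no trailing space) is left unmerged.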

python structured array composition and transformation

I created a script that collects a huge amount of data from a .txt file into an array in the format I want [3: 4: n], and the information is recorded as follows (I think). The .txt file is in this format:
1.000000e-01 1.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 3.000000e-01
3.000000e-01 2.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 4.000000e-01
and this repeats N times. I basically store every 4 lines as a block, because I'm working with ASCII files from STL parts.
In this sense, I have this code:
f = open("camaSTLfinalmente.txt", "r")
b_line = 0
Coord = []
Normal = []
Vertice_coord = []
Tri = []
block = []
for line in f:
    line = line.rstrip()
    if(line):
        split = line.split()
        for axis in range(0, 3):
            if(b_line == 0):  # normal
                Normal.append(split[axis])
            else:  # triangulo
                Vertice_coord.append(split[axis])
        if(b_line > 0):
            Tri.append(Vertice_coord)
            Vertice_coord = []
        if(b_line == 3):
            block.append(Normal)
            block.append(Tri)
            Coord.append(block)
            block = []
            Normal = []
            Tri = []
            b_line = 0
        else:
            b_line += 1
print(Coord[0])  # prints the following line, which I wrote after the code
the information is store in the way:
[['1.000000e-01', '1.000000e-01', '1.000000e-01'], [['1.000000e-01', '2.000000e-01', '3.000000e-01'], ['3.000000e-01', '2.000000e-01', '1.000000e-01'], ['1.000000e-01', '2.000000e-01', '-4.000000e-01']]]
Is there any way to simplify it?
I would like to take this opportunity to ask: I wanted to convert this information into numbers, and the ideal would be to read the number after the exponential (e) and change the numbers accordingly, that is, 1.000000e-01 becomes 0.1 (in order to perform operations with a similar array where I store information from another .txt file with the same format).
Thanks for the attention,
Pedro
You can try changing the line split = line.split() to:
split = [float(x) for x in line.split()]
If you need the result as strings rather than floats:
split = [str(float(x)) for x in line.split()]
I'm not 100% sure if I fully understand what you want, but the following code produces the same Coord:
coord = []
with open('camaSTLfinalmente.txt', 'r') as f:
    content = [line.strip().split() for line in f]
for i in range(len(content) // 4):
    coord.append([content[4*i], content[(4*i + 1):(4*i + 4)]])
Regarding the second question, as remarked in another answer, the easiest way to handle strings containing a number is to convert them to a number and then format them as string.
s = '1.000000e-01'
n = float(s)
m = '{:.1f}'.format(n)
See the section about string formatting in the Python doc.
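Since the values are all numeric, NumPy can also parse and reshape the whole file in a couple of lines. A sketch, assuming the file really is blocks of 4 lines of 3 floats with no stray blank lines (io.StringIO stands in for the file here):

```python
import io
import numpy as np

sample = """1.000000e-01 1.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 3.000000e-01
3.000000e-01 2.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 4.000000e-01
"""

data = np.loadtxt(io.StringIO(sample))  # parses '1.000000e-01' -> 0.1
blocks = data.reshape(-1, 4, 3)         # one (4, 3) block per facet
normals = blocks[:, 0, :]               # first line of each block
triangles = blocks[:, 1:, :]            # remaining three lines of each block
```

With a real file, pass the filename to np.loadtxt instead of the StringIO object; arithmetic between two such arrays then works element-wise.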
A couple of remarks:
Generally Stack Overflow doesn't like questions of the form "how do I improve this piece of code"; try to ask more specific questions.
The above assumes your file contains exactly 4·k lines; change the integer division ...//4 accordingly if some lines at the end do not form a complete block of 4.
Don't use capital letters for your variables. While style guides are not mandatory, it is good practice to follow them (look up PEP 8, pylint, ...).

Number not Printing in python when returning amount

I have some code which reads from a text file and is meant to print the max and min altitudes, but the min altitude is not printed and there are no errors.
altitude = open("Altitude.txt", "r")
read = altitude.readlines()
count = 0
for line in read:
    count += 1
count = count - 1
print("Number of Different Altitudes: ", count)

def maxAlt(read):
    maxA = max(read)
    return maxA

def minAlt(read):
    minA = min(read)
    return minA

print()
print("Max Altitude:", maxAlt(read))
print("Min Altitude:", minAlt(read))
altitude.close()
I can include the Altitude text file if it is needed; once again, the minimum altitude is not printed.
I'm assuming your file contains numbers and line breaks (\n).
You are reading it here:
read = altitude.readlines()
At this point read is a list of strings.
Now, when you do:
minA = (min(read))
It's trying to get "the smallest string in read"
The smallest string is usually the empty string "" - which most probably exists at the end of your file.
So your minAlt is actually getting printed. But it happens to be the empty string.
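A quick illustration of the lexicographic comparison at work, with hypothetical file contents:

```python
lines = ["300\n", "25\n", "1000\n", ""]  # what readlines() might give you
print(min(lines))       # the empty string sorts before everything
print(min(lines[:-1]))  # lexicographic: "1000\n" wins because "1" < "2" < "3"
print(min(float(x) for x in lines if x.strip()))  # numeric comparison: 25.0
```

Note that even without the empty string, comparing strings gives the wrong answer for numbers: "1000" sorts before "25".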
You can fix it by parsing the lines you read into numbers, skipping any blank lines:
read = [float(a) for a in altitude.readlines() if a.strip()]
Try the solution below:
altitudeFile = open("Altitude.txt", "r")
Altitudes = [float(line) for line in altitudeFile if line.strip()]  # parse the file into a list of numbers
Max_Altitude = max(Altitudes)
Min_Altitude = min(Altitudes)
altitudeFile.close()
Change your code to this:
with open('numbers.txt') as nums:
    lines = nums.read().splitlines()
    results = list(map(int, lines))
    print(results)
print(max(results))
The first two lines read the file and store its lines as a list of strings. The third line converts that list to integers, and the last line returns the maximum; use min() for the minimum.

Merging lines in Python based on character position

I've a file with alternating lines, chords followed by lyrics:
C       G                 Am
See the stone set in your eyes,
         F                   C
see the thorn twist in your side,
  G         Am F
I wait for you
How could I merge subsequent lines in order to produce an output like the following, while keeping track of the character position:
(C)See the (G)stone set in your (Am)eyes,
see the t(F)horn twist in your s(C)ide,
I (G)wait for y(Am)ou(F)
From How do I read two lines from a file at a time using python it can be seen that iterating over the file 2 lines at a time can be done with
with open('lyrics.txt') as f:
    for line1, line2 in zip(f, f):
        ...  # process lines
but how can the lines be merged so that line 2 is split according to character positions (of chords) from line 1? A simple
chords = line1.split()
has no position information and
for i, c in enumerate(line1):
    ...
gives separate characters, not the chords.
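For what it's worth, re.finditer keeps exactly the position information that split() throws away, which is the approach the answer below builds on:

```python
import re

chord_line = "  G         Am F"
chords = [(m.start(), m.group()) for m in re.finditer(r"\S+", chord_line)]
print(chords)  # list of (position, chord) tuples
```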
You could use regexp match objects for extracting both position and content of chords from the 1st line. Care must be taken at the edges; the same chord may continue on the next line, and a line may contain chords with no matching lyrics. Both cases can be found in the example data.
import io
import re

# A chord is one or more consecutive non-whitespace characters
CHORD = re.compile(r'\S+')

def inline_chords(lyrics):
    for chords, words in zip(lyrics, lyrics):
        # Produce a list of (position, chord) tuples
        cs = [
            # Handles chords that continue to the next line.
            (0, None),
            # Unpack found chords with their positions.
            *((m.start(), m[0]) for m in CHORD.finditer(chords)),
            # Pair for the last chord. Slices the rest of the words string.
            (None, None),
        ]
        # Remove newline.
        words = words[:-1]
        # Zip chords pairwise in order to get ranges for slicing the lyrics.
        for (start, chord), (end, _) in zip(cs, cs[1:]):
            if start == end:
                continue
            # Extract the relevant lyrics.
            ws = words[start:end]
            if chord:
                yield f"({chord})"
            yield ws
        yield "\n"
The edges could be handled differently, for example by testing if the 1st chord begins at 0 or not before the loop, but I feel that the single for-loop makes for cleaner code.
Trying it out:
test = """\
C       G                 Am
See the stone set in your eyes,
         F                   C
see the thorn twist in your side,
  G         Am F
I wait for you
"""

if __name__ == '__main__':
    with io.StringIO(test) as f:
        print("".join(list(inline_chords(f))))
produces the desired format:
(C)See the (G)stone set in your (Am)eyes,
see the t(F)horn twist in your s(C)ide,
I (G)wait for y(Am)ou(F)

Parsing GenBank to FASTA with yield in Python (x, y)

For now I have tried to define and document my own function to do it, but I am encountering issues with testing the code and I have actually no idea if it is correct. I found some solutions with BioPython, re or other, but I really want to make this work with yield.
# generator for GenBank to FASTA
def parse_GB_to_FASTA(lines):
    # set default label
    curr_label = None
    # set default sequence
    curr_seq = ""
    # flag: are we inside the ORIGIN (sequence) region?
    in_origin = False
    for line in lines:
        # a line starting with ACCESSION begins a new record's label
        if line.startswith('ACCESSION'):
            # if a previous record is complete, output its label and sequence first
            if curr_label is not None:
                yield curr_label, curr_seq
                curr_seq = ""
            in_origin = False
            # field values start at column 13 in GenBank flat files
            curr_label = '>' + line[12:].strip()
        # add the organism name to the label line
        elif line.startswith('  ORGANISM'):
            curr_label = curr_label + " " + line[12:].strip()
        # ORIGIN marks the start of the sequence region
        elif line.startswith('ORIGIN'):
            in_origin = True
        # '//' marks the end of the record
        elif line.startswith('//'):
            in_origin = False
        elif in_origin:
            # drop the position numbers and spaces, keep only the bases
            curr_seq += line.upper().strip().translate(str.maketrans('', '', '1234567890 '))
    # if no more lines, give the last label and sequence
    if curr_label is not None:
        yield curr_label, curr_seq
I often work with very large GenBank files and found (years ago) that the BioPython parsers were too brittle to make it through hundreds of thousands of records (at the time) without crashing on an unusual record.
I wrote a pure python(2) function to return the next whole record from an open file, reading in 1k chunks, and leaving the file pointer ready to get the next record. I tied this in with a simple iterator that uses this function, and a GenBank Record class which has a fasta(self) method to get a fasta version.
YMMV, but the function that gets the next record is here and should be pluggable into any iterator scheme you want to use. As far as converting to fasta goes, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of sections (like ORIGIN) using:
sectionTitle = 'ORIGIN'
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
Subsections like ORGANISM require a left-side pad of 5 spaces.
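To illustrate the section-grabbing regex on a minimal, fabricated record (the accession and sequence here are made up for the example):

```python
import re

gbrText = """LOCUS       FAKE0001                  12 bp    DNA
ACCESSION   FAKE0001
ORIGIN
        1 acgtacgtac gt
//
"""

sectionTitle = 'ORIGIN'
# MULTILINE lets ^ match at each line start; DOTALL lets .+? span lines.
# The lazy match stops at the first line that begins with a non-space ('//').
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
print(sectionText)
```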
Here's my solution to the main issue:
def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize = 1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1, 1)
    except IOError:
        pass
    sPos = fHandle.tell()
    gbr = None
    while True:
        cPos = fHandle.tell()
        c = fHandle.read(cSize)
        if c == '':
            return None
        if not recFound:
            locusPos = c.find('\nLOCUS')
            if sPos == 0 and c.startswith('LOCUS'):
                locusPos = 0
            elif locusPos == -1:
                continue
            if locusPos > 0:
                locusPos += 1
            c = c[locusPos:]
            recFound = True
        else:
            locusPos = 0
        if (len(recChunks) > 0 and
                ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
                 or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
                 or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
                )):
            eorPos = 0
        else:
            eorPos = c.find('\n//\n', locusPos)
        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos + 4)])
            gbrText = ''.join(recChunks)
            fHandle.seek(cPos - locusPos + eorPos)
            return gbrText
