Merging lines in Python based on character position - python

I've a file with alternating lines, chords followed by lyrics:
C       G                 Am
See the stone set in your eyes,
         F                   C
see the thorn twist in your side,
  G         Am F
I wait for you
How could I merge subsequent lines in order to produce an output like the following, while keeping track of the character position:
(C)See the (G)stone set in your (Am)eyes,
see the t(F)horn twist in your s(C)ide,
I (G)wait for y(Am)ou(F)
From "How do I read two lines from a file at a time using python" it can be seen that iterating over the file 2 lines at a time can be done with
with open('lyrics.txt') as f:
    for line1, line2 in zip(f, f):
        ... # process lines
but how can the lines be merged so that line 2 is split according to character positions (of chords) from line 1? A simple
chords = line1.split()
has no position information and
for i, c in enumerate(line1):
    ...
gives separate characters, not the chords.

You could use regexp match objects for extracting both position and content of chords from the 1st line. Care must be taken at the edges; the same chord may continue on the next line, and a line may contain chords with no matching lyrics. Both cases can be found in the example data.
import io
import re
# A chord is one or more consecutive non whitespace characters
CHORD = re.compile(r'\S+')
def inline_chords(lyrics):
    # lyrics must be an iterator (e.g. a file object) so that
    # zip(lyrics, lyrics) consumes two lines per step.
    for chords, words in zip(lyrics, lyrics):
        # Produce a list of (position, chord) tuples
        cs = [
            # Handles chords that continue to next line.
            (0, None),
            # Unpack found chords with their positions.
            *((m.start(), m[0]) for m in CHORD.finditer(chords)),
            # Pair for the last chord. Slices rest of the words string.
            (None, None)
        ]
        # Remove the trailing newline, if there is one.
        words = words.rstrip("\n")
        # Zip chords in order to get ranges for slicing lyrics.
        for (start, chord), (end, _) in zip(cs, cs[1:]):
            if start == end:
                continue
            # Extract the relevant lyrics.
            ws = words[start:end]
            if chord:
                yield f"({chord})"
            yield ws
        yield "\n"
The edges could be handled differently, for example by testing if the 1st chord begins at 0 or not before the loop, but I feel that the single for-loop makes for cleaner code.
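For comparison, here is a rough sketch of that alternative, where the text before the first chord is handled before the loop. The helper name inline_chords_line is made up for this illustration, and it merges a single chord/lyrics pair rather than a whole file:

```python
import re

CHORD = re.compile(r"\S+")

def inline_chords_line(chords, words):
    """Merge one chord line into one lyrics line (hypothetical helper)."""
    matches = list(CHORD.finditer(chords))
    if not matches:
        return words
    # Edge case handled up front: lyrics before the first chord are kept as-is.
    parts = [words[:matches[0].start()]]
    # Each chord owns the lyrics up to the start of the next chord;
    # the last chord owns the rest of the line.
    ends = [m.start() for m in matches[1:]] + [None]
    for m, end in zip(matches, ends):
        parts.append(f"({m[0]})" + words[m.start():end])
    return "".join(parts)

# G at column 2, Am at column 12, F at column 15 (past the end of the lyrics)
chords = "  G" + " " * 9 + "Am F"
print(inline_chords_line(chords, "I wait for you"))
# I (G)wait for y(Am)ou(F)
```

It needs a special case for the leading text, which is what the (0, None) sentinel in the single-loop version avoids.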
Trying it out:
test = """\
C       G                 Am
See the stone set in your eyes,
         F                   C
see the thorn twist in your side,
  G         Am F
I wait for you
"""
if __name__ == '__main__':
    with io.StringIO(test) as f:
        print("".join(list(inline_chords(f))))
produces the desired format:
(C)See the (G)stone set in your (Am)eyes,
see the t(F)horn twist in your s(C)ide,
I (G)wait for y(Am)ou(F)

Related

How to remove dash/ hyphen from each line in .txt file

I wrote a little program to turn pages from book scans into a .txt file. On some lines, words are split across two lines. I wonder if there is any way to remove the dashes and merge them with the syllables on the line below?
E.g.:
effects on the skin is fully under-
stood one fights
to:
effects on the skin is fully understood
one fights
or:
effects on the skin is fully
understood one fights
Or something like that, as long as the word is joined. Python is my third language and so far I can't think of anything, so maybe someone can give me a hint.
Edit:
The point is that if the last symbol on a line is a dash, it is removed and the word is merged with the rest of the word on the line below.
This is a generator which takes the input line-by-line. If it ends with a - it extracts the last word and holds it over for the next line. It then yields any held-over word from the previous line combined with the current line.
To combine the results back into a single block of text, you can join it against the line separator of your choice:
source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sin-
gle "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""
def reflow(text):
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            # Split off the last (hyphenated) word and hold it over.
            lin, _, e = line.rpartition(" ")
        else:
            lin, e = line, ""
        yield f"{holdover}{lin}"
        # Drop the trailing hyphen from the held-over fragment.
        holdover = e[:-1]
print("\n".join(reflow(source)))
""" which is:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
"""
To read one file line-by-line and write directly to a new file:
def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source:
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]
if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
Here is one way to do it
with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.replace('\n', '') # remove new line character at end of line
        if merge_line:
            # the previous line ended with a hyphen: glue this line onto it
            combined_strings[-1] = combined_strings[-1] + item
        else:
            combined_strings.append(item)
        merge_line = item.endswith('-') # check that it is the last character
        if merge_line:
            combined_strings[-1] = combined_strings[-1][:-1] # drop the hyphen
If you just parse the line as a string then you can utilize the .split() function to move around these kinds of items
words = "effects on the skin is fully under-\nstood one fights"
#splitting among the newlines
wordsSplit = words.split("\n")
#splitting among the word spaces
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")
#checking for the end of line hyphens (the last line has no line below it)
for i in range(len(wordsSplit) - 1):
    if wordsSplit[i][-1].endswith("-"):
        #setting the new word in the list and removing the hyphen
        wordsSplit[i][-1] = wordsSplit[i][-1][0:-1]+wordsSplit[i+1][0]
        wordsSplit[i+1][0] = ""
#recreating the string
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g]+" "
What this does is split the text at the newlines, which is where the hyphens usually occur. Then it splits each line into a smaller list of words. Then it checks for the hyphens and, if it finds one, appends the first word of the next line to the hyphenated word and sets that first word to an empty string. Finally, it reconstructs the string into a variable called msg, skipping any entry in the split lists that is an empty string. Note that the result is a single line with the words separated by spaces.
What about
import re
a = '''effects on the skin is fully under-
stood one fights'''
re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~','\n')
Explanation
a.replace('\n', '~') joins the input into a single line, using ~ in place of \n. (Choose a different placeholder if ~ can occur in your text.)
-~([a-zA-Z0-9]*) followed by a space matches each hyphenated line break; the parentheses capture the continuation of the word, and the replacement '\1\n' writes it back followed by a newline.
.replace('~','\n') finally turns all remaining ~ characters back into newlines.
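As a variation on the same idea, the placeholder can probably be avoided by matching the newline directly. This is only a sketch: the function name join_hyphenated is made up, and the pattern assumes the continuation of the word is followed by at most one space or newline.

```python
import re

def join_hyphenated(text):
    # "-\n" followed by the rest of the broken word; the optional [ \n]
    # swallows the separator after that word so that a newline can be
    # written in its place.
    return re.sub(r"-\n(\S+)[ \n]?", r"\1\n", text)

a = "effects on the skin is fully under-\nstood one fights"
print(join_hyphenated(a))
# effects on the skin is fully understood
# one fights
```

Hyphens inside a line (as in "check-out") are left alone because they are not followed by a newline. If the broken word is the very last word of the text, the result ends with an extra newline.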

How to parse letter by letter and make a list with Python?

I have a text file I am attempting to parse. Fairly new to Python.
It contains an ID, a sequence, and frequency
SA1 GDNNN 12
SA2 TDGNNED 8
SA3 VGGNNN 3
Say the user wants to compare the frequency of the first two sequences. They would input the ID number. I'm having trouble figuring out how I would parse with python to make a list like
GD this occurs once in the two so it = 12
DN this also occurs once =12
NN occurs 3 times = 12 + 12 + 8 =32
TD occurs once in the second sequence = 8
DG ""
NE ""
ED ""
What do you recommend to parse letter by letter? In a sequence GD, then DN, then NN (without repeating it in the list), TD.. Etc.?
I currently have:
#Read File
def main():
    file = open("clonedata.txt", "r")
    lines = file.readlines()
    file.close()
class clone_data:
    def __init__(id, seq, freq):
        id.seq = seq
        id.freq = freq
def myfunc(id)
    id = input ("Input ID number to see frequency: ")
    for line in infile:
        line = line.strip().upper()
        line.find(id)
        #print('y')
I'm not entirely sure from the example, but it sounds like you're trying to look at each line in the file and determine if the ID is in a given line. If so, you want to add the number at the end of that line to the current count.
This can be done in Python with something like this:
def get_total_from_lines_for_id(id_string, lines):
    total = 0 #record the total at the end of each line
    #now loop over the lines searching for the ID string
    for line in lines:
        if id_string in line: #this will be true if the id_string is in the line and will only match once
            split_line = line.split(" ") #split the line at each space character into an array
            number_string = split_line[-1] #get the last item in the array, the number
            number_int = int(number_string) #make the string a number so we can add it
            total = total + number_int #increase the total
    return total
I'm honestly not sure what part of that task seems difficult to you, in part because I'm not sure what exactly is the task you're trying to accomplish.
Unless you expect the datafile to be enormous, the simplest way to start would be to read it all into memory, recording the id, sequence and frequency in a dictionary indexed by id: [Note 1]
with open('clonedata.txt') as file:
    data = { id : (sequence, int(frequency))
             for id, sequence, frequency in (
                 line.split() for line in file)}
With the sample data provided, that gives you: (newlines added for legibility)
>>> data
{'SA1': ('GDNNN', 12),
'SA2': ('TDGNNED', 8),
'SA3': ('VGGNNN', 3)}
and you can get an individual sequence and frequency with something like:
seq, freq = data['SA2']
Apparently, you always want to count the number of digrams (instances of two consecutive letters) in a sequence of letters. You can do that easily with collections.Counter: [Note 2]
from collections import Counter
# ...
seq, freq = data['SA1']
Counter(zip(seq, seq[1:]))
which prints
Counter({('N', 'N'): 2, ('G', 'D'): 1, ('D', 'N'): 1})
It would probably be most convenient to make that into a function:
def count(seq):
    return Counter(zip(seq, seq[1:]))
Also apparently, you actually want to multiply the counted frequency by the frequency extracted from the file. Unfortunately, Counter does not support multiplication (although you can, conveniently, add two Counters to get the sum of frequencies for each key, so there's no obvious reason why they shouldn't support multiplication.) However, you can multiply the counts afterwards:
def count_freq(seq, freq):
    retval = count(seq)
    for digram in retval:
        retval[digram] *= freq
    return retval
If you find tuples of pairs of letters annoying, you can easily turn them back into strings using ''.join().
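Putting the pieces together, here is a minimal sketch of how the weighted digram counts for several IDs might be summed, with string keys instead of tuples. The function name digram_totals is made up for this example, and data is assumed to have the id -> (sequence, frequency) shape built above.

```python
from collections import Counter

def digram_totals(data, ids):
    """Sum frequency-weighted digram counts over the selected ids."""
    totals = Counter()
    for key in ids:
        seq, freq = data[key]
        for pair in zip(seq, seq[1:]):
            # ''.join turns ('N', 'N') back into 'NN'
            totals["".join(pair)] += freq
    return totals

data = {"SA1": ("GDNNN", 12), "SA2": ("TDGNNED", 8), "SA3": ("VGGNNN", 3)}
totals = digram_totals(data, ["SA1", "SA2"])
print(totals["NN"])  # 32
print(totals["GD"])  # 12
print(totals["TD"])  # 8
```

With the sample data, NN comes out as 12 + 12 + 8 = 32, which matches the figure in the question.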
Notes:
1. That code is completely devoid of error checking; it assumes that your data file is perfect, and will throw an exception for any line with too few elements, including blank lines. You could handle the blank lines by changing for line in file to for line in file if line.strip() or some other similar test, but a fully bullet-proof version would require more work.
2. zip(a, a[1:]) is the idiomatic way of making an iterator out of overlapping pairs of elements of a list. If you want non-overlapping pairs, you can use something very similar, using the same list iterator twice:
def pairwise(a):
    it = iter(a)
    return zip(it, it)
(Or, javascript style: pairwise = lambda a: (lambda it:zip(it, it))(iter(a)).)
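To make the difference between the two idioms concrete, here is a small illustration on one of the sample sequences (the name overlapping is made up for the example):

```python
def overlapping(a):
    # Every element is paired with its successor.
    return list(zip(a, a[1:]))

def pairwise(a):
    # The same iterator is consumed twice per step, so pairs do not overlap.
    it = iter(a)
    return list(zip(it, it))

print(overlapping("GDNNN"))
# [('G', 'D'), ('D', 'N'), ('N', 'N'), ('N', 'N')]
print(pairwise("GDNNN"))
# [('G', 'D'), ('N', 'N')]
```

With an odd number of elements, the non-overlapping version silently drops the last one.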

Parse data from several equally structured blocks of a text file in python

I've got a text file that has several of these blocks of text in it:
Module Resistor_SMD:R_0402_1005Metric (layer B.Cu) (tedit 5B301BBD) (tstamp 5CC0A687)
(at 120.316179 97.92138 90)
(descr "Resistor SMD 0402 (1005 Metric), square (rectangular) end terminal, IPC_7351 nominal, (Body size source: http://www.tortai-tech.com/upload/download/2011102023233369053.pdf), generated with kicad-footprint-generator")
(tags resistor)
(path /610532D4)
(attr smd)
(fp_text reference R59 (at 0 1.17 90) (layer B.SilkS)
I want to pull out the following:
120.316179, 97.92138, 90 and R59
and store it somewhere...
Then, I want to take that collection of line items, and throw some away depending on the value(s) of the first two numbers....They're XY coordinates.
Then, write it to a list.
How can I do that with regular expressions?
I'm loading the file and trying to follow along here, but I'm getting lost in the addition of the pandas library.
IMO you don't need re for this task. You can iterate through the lines of your file and, depending on signal strings like '(at ' and 'fp_text reference', you can fill a list of lists of all your resistor data, e.g.:
with open('textfile.txt') as f:
    data = []
    row = []
    for line in f:
        if row:
            if '(fp_text ref' in line.strip():
                row.append(line.strip().split()[2])
                data.append(row)
                row = []
        else:
            if '(at ' in line.strip():
                row = line.strip()[:-1].split()[1:4]
print(data)
# [['120.316179', '97.92138', '90', 'R59']]
And if you want a pandas dataframe from this data:
import pandas as pd
df = pd.DataFrame(data, columns=['x', 'y', 'z', 'R'])
print(df)
# x y z R
# 0 120.316179 97.92138 90 R59
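The question also asks to throw some entries away depending on the X/Y coordinates. A minimal sketch of that step, assuming rows shaped like the data list above; the function name filter_by_position, the ranges and the second row are made up for illustration:

```python
def filter_by_position(rows, x_range, y_range):
    """Keep rows whose x and y (the first two fields) lie inside the ranges."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    kept = []
    for x, y, rotation, ref in rows:
        # The parsed fields are strings, so convert before comparing.
        if x_min <= float(x) <= x_max and y_min <= float(y) <= y_max:
            kept.append([float(x), float(y), float(rotation), ref])
    return kept

rows = [['120.316179', '97.92138', '90', 'R59'],
        ['10.0', '250.5', '0', 'R60']]  # second row is invented test data
print(filter_by_position(rows, (100, 150), (50, 100)))
# [[120.316179, 97.92138, 90.0, 'R59']]
```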
This RegEx might help you to capture your three desired strings:
([\d]+\.[\d]{5,}|R[0-9]+)
There are two simple patterns joined with | (OR):
the one on the left ([\d]+\.[\d]{5,}) matches your float numbers, requiring at least five digits after the decimal point, and
the one on the right (R[0-9]+) matches an R followed by one or more digits.
You can change these bounds however you wish, then read the captured text from group 1 ($1) in your code.
If necessary, escape language-specific metacharacters such as . with a \.
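In Python, the pattern could be tried out roughly like this. It is a sketch applied to the two relevant lines of the sample block; note that the rotation value 90 is not captured, because it has no decimal part.

```python
import re

PATTERN = re.compile(r'([\d]+\.[\d]{5,}|R[0-9]+)')

lines = [
    '(at 120.316179 97.92138 90)',
    '(fp_text reference R59 (at 0 1.17 90) (layer B.SilkS)',
]
for line in lines:
    # findall returns the text of group 1 for every match in the line
    print(PATTERN.findall(line))
# ['120.316179', '97.92138']
# ['R59']
```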

Semantic Similarity between Sentences in a Text

I have used material from here and a previous forum page to write some code for a program that will automatically calculate the semantic similarity between consecutive sentences across a whole text. Here it is;
The code for the first part is copy pasted from the first link, then I have this stuff below which I put in after the 245 line. I removed all excess after line 245.
with open ("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        similarity(x, y, true)
        #boolean set to false or true
        x = y
        y = sentence_file.readline()
My text file is formatted like this;
Red alcoholic drink. Fresh orange juice. An English dictionary. The
Yellow Wallpaper.
In the end I want to display all the pairs of consecutive sentences with the similarity next to it, like this;
["Red alcoholic drink.", "Fresh orange juice.", 0.611],
["Fresh orange juice.", "An English dictionary.", 0.0]
["An English dictionary.", "The Yellow Wallpaper.", 0.5]
if norm(vec_1) > 0 and if norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1)* np.linalg.norm(vec_2))
elif norm(vec_1) < 0 and if norm(vec_2) < 0:
    ???Move On???
This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to process two lines at a time is to set up an "infinite loop" and check the last line we've read to see if we've hit the end (readline() returns an empty string at the end of a file).
# You'll probably need the file extension (.txt or whatever) in open as well
with open("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []
    # Loop until we hit the end of the file
    while True:
        # Read two lines
        x = sentence_file.readline()
        y = sentence_file.readline()
        # Check if we've reached the end of the file, if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        else:
            # The .rstrip('\n') removes the newline character from each line
            x = x.rstrip('\n')
            y = y.rstrip('\n')
            try:
                # Calculate your similarity value
                similarity_value = similarity(x, y, True)
                # Add the two lines and similarity value to the results list
                results.append([x, y, similarity_value])
            except Exception:
                print("Error when parsing lines:\n{}\n{}\n".format(x, y))
# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
Edit: Regarding the issues you're getting from similarity(): if you simply want to ignore the line pairs that cause these errors (without looking at the source in depth I really have no idea what's going on), you can wrap the call to similarity() in a try/except block.
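Since the sample file holds several sentences per line and the desired output pairs each sentence with the next one, a different approach may fit better: split the whole text into sentences first and then pair neighbours with zip. The following is only a sketch. The helper names are made up, the splitting on periods is naive (abbreviations would break it), and similarity is passed in as a parameter standing for the function from the question.

```python
import re

def split_sentences(text):
    # Collapse line breaks, then split after each sentence-ending period.
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=\.)\s+", flat) if s]

def score_consecutive(text, similarity):
    sentences = split_sentences(text)
    # zip(s, s[1:]) yields overlapping pairs: (1st, 2nd), (2nd, 3rd), ...
    return [[a, b, similarity(a, b)] for a, b in zip(sentences, sentences[1:])]

text = ("Red alcoholic drink. Fresh orange juice. An English dictionary. The\n"
        "Yellow Wallpaper.\n")
# A stand-in scorer so the sketch runs on its own; replace with the real one.
for row in score_consecutive(text, lambda a, b: 0.0):
    print(row)
```

This prints three rows, one for each pair of neighbouring sentences, each with the placeholder score 0.0.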

How can I count the line number between two character in a file with python?

Hi
I'm new to python and I have a 3.2 python!
I have a file which has some sort of format like this:
Number of segment pairs = 108570; number of pairwise comparisons = 54234
'+' means given segment; '-' means reverse complement
Overlaps Containments No. of Constraints Supporting Overlap
******************* Contig 1 ********************
E_180+
E_97-
******************* Contig 2 ********************
E_254+
E_264+ is in E_254+
E_276+
******************* Contig 3 ********************
E_256-
E_179-
I want to count the number of non-empty lines between the *****contig#****
and I want to get a result like this
contig1=2
contig2=3
contig3=2
Probably, it's best to use regular expressions here. You can try the following:
import re
text = open(filename).read()  # filename: path to the input file
pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
pairs is a list of tuples, where the tuples have the form ('Contig x', '...')
The second component of each tuple contains the text after the mark
Afterwards, you could count the number of '\n' in those texts; most easily this can be done via a list comprehension:
[(contig, txt.count('\n')) for (contig,txt) in pairs]
(edit: if you don't want to count empty lines you can try:
[(contig, txt.count('\n')-txt.count('\n\n')) for (contig,txt) in pairs]
)
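Counting newline characters is a little fragile (for instance when the file does not end with a newline, or when there are several blank lines in a row). A possible alternative, sketched here with a made-up function name, is to split each captured block into lines and count the non-empty ones:

```python
import re

def count_contig_lines(text):
    pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
    # Count only lines that contain something other than whitespace.
    return [(contig, sum(1 for line in txt.splitlines() if line.strip()))
            for contig, txt in pairs]

sample = (
    "******************* Contig 1 ********************\n"
    "E_180+\n"
    "\n"
    "E_97-\n"
    "******************* Contig 2 ********************\n"
    "E_254+\n"
    "E_264+ is in E_254+\n"
    "E_276+\n"
)
print(count_contig_lines(sample))
# [('Contig 1', 2), ('Contig 2', 3)]
```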
def give(filename):
    with open(filename) as f:
        # Skip everything before the first Contig header.
        for line in f:
            if 'Contig' in line:
                category = line.strip('* \r\n')
                break
        cnt = 0
        aim = []
        for line in f:
            if 'Contig' in line:
                yield (category+'='+str(cnt),aim)
                category = line.strip('* \r\n')
                cnt = 0
                aim= []
            elif line.strip():
                cnt+=1
                if 'is in' in line:
                    aim.append(line.strip())
        yield (category+'='+str(cnt),aim)
for a,b in give('input.txt'):
    print(a)
    if b: print(b)
result
Contig 1=2
Contig 2=3
['E_264+ is in E_254+']
Contig 3=2
give() is not a normal function but a generator function. See the documentation on generators, and if you have questions, I will answer.
strip() is a method that removes characters from the beginning and the end of a string.
When called without an argument, strip() removes whitespace (that is, \f, \n, \r, \t, \v and the space character). When given a string argument, it removes from both ends every character that appears in that argument. The order of the characters in the argument doesn't matter: the argument designates a set of characters to remove, not a substring.
line.strip() is therefore a way to test whether a line contains any non-whitespace characters.
It matters that elif line.strip(): comes after if 'Contig' in line: and that it is written elif rather than if: otherwise line.strip() would also be true, and the line would be counted, for a header line such as
******** Contig 2 *********\n
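A few quick checks of this behaviour of strip(), as a small illustration on made-up sample strings:

```python
header = "******************* Contig 2 ********************\n"

# The argument is a set of characters: '*', ' ', '\r' and '\n' are removed
# from both ends, in any order, until another character is met.
print(header.strip('* \r\n'))       # Contig 2

# Without an argument, only whitespace is removed from both ends.
print(repr("  E_254+ \n".strip()))  # 'E_254+'

# A whitespace-only line strips down to the empty string, which is falsy.
print(bool("   \n".strip()))        # False
print(bool(header.strip()))         # True
```

The last line shows why the order of the tests matters: a Contig header is itself a non-empty line.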
I suppose that you will also be interested in the content of lines like this one:
E_264+ is in E_254+
because this is the kind of line that makes a difference in the counts.
So I edited my code so that give() also yields these lines.
