Remove transcript timestamps and join the lines to make paragraph - python

File: Plain Text Document
Content: Youtube timestamped transcript
I can separately remove each line's timestamp:
for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n', '')
        print(s)
I can also join the sentences if I don't remove the timestamps:
with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))
But when I tried to do both together (removing the timestamps and joining the lines), none of my attempts gave the right result:
with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence = line.replace('\n', ' ')
            print(" ".join(sentence.rstrip('\n')))
I also tried various forms of print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but the result is always one of the following:
How can I remove the timestamps and join the sentences to create a paragraph at once?

When you pass a single string to .join(), Python iterates over it character by character, so the separator ends up between every character of the sentence. Note also that print() adds a newline after each call by default, so printing inside the loop puts every sentence on its own line.
To get around this, you can save each modified sentence to a list, and then output the entire paragraph at once at the end using " ".join(). This gets around the newline issue described above, and gives you the ability to do additional processing on the paragraph afterwards, if desired.
with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence = line.replace('\n', '')
            sentences.append(sentence)
    print(' '.join(sentences))
(Made a small edit to the code -- the old version of the code produced a trailing space after the paragraph.)
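To see why the original attempt put a space between every letter, here is a minimal sketch comparing .join() on a single string with .join() on a list of strings. The sample words are made up for illustration and are not taken from the asker's file.

```python
# A single string is itself an iterable of characters,
# so the separator is inserted between every character.
word = "merry"
joined_chars = " ".join(word)

# A list of strings is an iterable of whole sentences,
# so the separator is inserted between the sentences.
sentences = ["merry christmas", "to you"]
joined_sentences = " ".join(sentences)

print(joined_chars)
print(joined_sentences)
```

This is why collecting the sentences in a list first, and joining once at the end, gives a readable paragraph.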

TL;DR: copy-paste solution using list-comprehension with if as filter and regex to match timestamp:
' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)]).
Explained
Suppose your text input given is:
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
Then you can ignore the timestamps with the regex \d{2}:\d{2} and append each remaining line, as a phrase, to a list. Trim each phrase using strip(), which removes leading/trailing whitespace. When you finally join all phrases into a paragraph, use a space as the delimiter:
import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # keep non-empty lines that do not start with a mm:ss timestamp
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)

t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''
paragraph = to_paragraph(t.splitlines())
print(paragraph)

with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))
Outputs:
00:00
00:03
00:05
00:07
00:09
('result:', "merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho")
The result is the same as what youtubetranscript.com returned for the given YouTube video.
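For completeness, here is a runnable sketch of the one-liner idea from the TL;DR. It is a variant, not the exact code above: it uses re.fullmatch on the stripped line, so a phrase that merely begins with something like "10:30" is kept, and it assumes timestamps look like mm:ss or h:mm:ss. The TIMESTAMP name and the sample lines are made up for illustration.

```python
import re

# Assumed timestamp shapes: "00:03" or "1:02:03" (an assumption, adjust as needed).
TIMESTAMP = re.compile(r"(\d{1,2}:)?\d{1,2}:\d{2}")

# Hypothetical sample, shaped like lines read from a transcript file.
transcript = [
    "00:00\n",
    "merry christmas it's our christmas video\n",
    "1:02:03\n",
    "to you i already regret this hat but if\n",
    "\n",
]

paragraph = " ".join(
    line.strip()
    for line in transcript
    # skip blank lines and lines that consist only of a timestamp
    if line.strip() and not TIMESTAMP.fullmatch(line.strip())
)
print(paragraph)
```

Filtering by pattern rather than by line position also copes with transcripts where a timestamp or a text line is missing.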

Related

How to find required word in novel in python?

I have a text file and a task in Python that involves reading it:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have any ideas about how it can be improved? (It gives me an error after some words. I guess the error happens because one of the Mr. is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run the following (after from collections import Counter):
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
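Putting those steps together, a minimal sketch might look like this. The function name count_mr_names and the sample text are made up for illustration. It also collapses all whitespace first, which is one possible way to handle a "Mr." that sits at the end of a line; that step is an addition, not part of the answer above.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the novel's text.
sample = (
    "Mr. Churchill met Mr. Frank Churchill. Later Mr.\n"
    "Churchill left."
)

def count_mr_names(text):
    # Collapse newlines and repeated spaces so "Mr.\nChurchill" becomes "Mr. Churchill".
    flat = " ".join(text.split())
    # Capture one or more capitalised words that follow "Mr.".
    names = re.findall(r"Mr\.((?: [A-Z][a-z]+)+)", flat)
    return dict(Counter(name.strip() for name in names))

counts = count_mr_names(sample)
print(counts)
```

Note that a capitalised word directly after the name (for example at the start of a quoted title) would be swallowed into the name, so the pattern may need tuning for a real novel.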
This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
import re
from collections import Counter
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file)  # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question

Count lines containing *both* of two strings, from a larger/multiline string in Python

I am looking at the entire transcript of the play Romeo and Juliet, and I want to see how many times 'Romeo' and 'Juliet' appear on the same line within the entire play. In other words, how many different lines in the play have both words 'Romeo' and 'Juliet' in them?
Note: 'gbdata' is the name of my data aka the entire transcript of the play. For purposes of testing, we might use:
gbdata = '''
Romeo and Juliet # this should count once
Juliet and Romeo, and Romeo, and Juliet # this also should count once
Romeo # this should not count at all
Juliet # this should not count at all
some other string # this should not count at all
'''
The correct answer should be 2, since only the first two lines contain both strings; and more matches within a line don't add to the total count.
This is what I have done so far:
gbdata.count('Romeo' and 'Juliet') # counts 'Juliet's, returning 4
and
gbdata.count('Romeo') + gbdata.count('Juliet') # combines individual counts, returning 8
How can I get the desired output for the above test string, 2?
You can't use str.count() here; it's not built for your purpose, since it doesn't have any concept of "lines". That said, given a string, you can break it down into a list of individual lines by splitting on '\n', the newline character.
A very terse approach might be:
count = sum((1 if ('Romeo' in l and 'Juliet' in l) else 0) for l in gbdata.split('\n'))
Expanding that out into a bunch of separate commands might look like:
count = 0
for line in gbdata.split('\n'):
    if 'Romeo' in line and 'Juliet' in line:
        count += 1
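The same idea can be wrapped in a small reusable function. This is only a sketch: the name count_lines_with_all is made up, and it uses str.splitlines() and all() instead of split('\n'), which is a stylistic variation on the code above.

```python
def count_lines_with_all(text, words):
    """Count the lines of text that contain every string in words."""
    return sum(
        1
        for line in text.splitlines()
        if all(word in line for word in words)
    )

# Test data modelled on the question's example (comments left out).
gbdata = """
Romeo and Juliet
Juliet and Romeo, and Romeo, and Juliet
Romeo
Juliet
some other string
"""

count = count_lines_with_all(gbdata, ["Romeo", "Juliet"])
print(count)
```

Keep in mind that `in` does substring matching, so a line containing "Romeos" would also count as containing "Romeo".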

How to return to original formatting

I have broken down lines of text file into individual words to check if they are in a dictionary. I now want to return/print the words back in the same lines.
I have tried editing the positions in my loop as I know I have the lines broken down already. I have thought that maybe I have to use a pop or remove function. I cannot use swap function.
def replace_mode(text_list, misspelling):
    for line in text_list:
        word = line.split(' ')
        for element in word:
            if element in misspelling.keys():
                print(misspelling[element], end=(' '))
            else:
                print(element, end=(' '))
It is printing in a single line:
"joe and his family went to the zoo the other day the zoo had many animals including an elephant the elephant was being too dramatic though after they walked around joe left the zoo"
I want the processed text to be back in its original format(4 lines):
joe and his family went to the zoo the other day
the zooo had many animals including an elofent
the elaphant was being too dramati though
after they walked around joe left the zo
Add this line, right after your last print(element, end=(' ')) statement, at the same level of indentation as for element in word::
print()
This will print a newline at the end of each of the original lines, right after you've finished processing every word from that line but before you've moved on to the next line.
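An alternative sketch, if you would rather build the corrected lines than print word by word: rebuild each line with " ".join() and return the list. This is a variant of the asker's function, not the fix described above, and the sample lines and misspelling dictionary are made up for illustration.

```python
def replace_mode(text_list, misspelling):
    """Return the corrected lines, one output line per input line."""
    corrected_lines = []
    for line in text_list:
        # split() with no argument also discards the trailing newline
        words = line.split()
        # look each word up; fall back to the word itself if it is not misspelled
        fixed = [misspelling.get(word, word) for word in words]
        corrected_lines.append(" ".join(fixed))
    return corrected_lines

# Hypothetical input: two lines and a small correction table.
lines = [
    "the zooo had many animals including an elofent\n",
    "after they walked around joe left the zo\n",
]
misspelling = {"zooo": "zoo", "elofent": "elephant", "zo": "zoo"}

corrected = replace_mode(lines, misspelling)
for line in corrected:
    print(line)
```

Returning the lines instead of printing them makes it easy to write the result back to a file later.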

How to split an element in a list into two elements?

I want to split elements of list, each element is currently made up of a movie and a date, however I now need to separate them so I can add them to a database
This is what I've tried
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
splitter=re.compile('(/(.+)').split
[part for img in movies for part in splitter(img) if part]
How do I solve this problem?
You were almost there ;D
import re
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
matcher = re.compile(r"^(.*)\((.*?)\)$").match
print([matcher(movie).groups() for movie in movies])
I suggest using RegExr to learn and test regular expressions.
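Building on the regex above, here is a sketch that also turns the trailing part into a real tuple. It assumes the date part is always a valid Python tuple literal of two strings, such as ('23rd', 'May'), so that ast.literal_eval can parse it. The function name split_movie is made up for illustration.

```python
import ast
import re

movies = ["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]

# Greedy title, then the last parenthesised group at the end of the string.
PATTERN = re.compile(r"^(.*)(\(.*?\))$")

def split_movie(entry):
    match = PATTERN.match(entry)
    if match is None:
        # entry does not follow the assumed format
        return entry, None
    title, date_text = match.groups()
    # Assumption: date_text is a tuple literal like "('23rd', 'May')".
    day, month = ast.literal_eval(date_text)
    return title.strip(), (day, month)

parsed = [split_movie(movie) for movie in movies]
print(parsed)
```

Each result is a (title, (day, month)) pair, which maps naturally onto database columns.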
I am not sure what format you were hoping to get the elements into, but you could home in on similarities, such as each date starting with "('".
movies = ["The Big Bad Fox and Other Tales (English subtitles) ('23rd','May')"]
titles, dates = [], []
for i in range(len(movies)):
    newTitle, newDate, sign, count = "", "", False, 0
    for char in movies[i]:
        if char == "(":
            sign = True
        elif sign == True:
            if char == "'":
                newDate += "(" + movies[i][count:]
                break
        else:
            newTitle += char
        count += 1
    titles.append(newTitle)
    dates.append(newDate)
print(titles)
print(dates)
Output:
['The Big Bad Fox and Other Tales ']
["('23rd','May')"]
Hope this helped!
We can use three important Python string operations for this problem:
replace(pattern, replacement)
string[start_position:end_position]
string.index(pattern)
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
First, make 2 patterns which denote the beginning and end of the date area:
date_start = "('"
date_end = "')"
Then, remove that part of the string for further analysis:
date_information = movies[0][movies[0].index(date_start):movies[0].index(date_end)]
At this point, "date information" should be ('23rd', 'May
Then, just trim the first 2 characters and replace the single quotations:
date_information = date_information[2:].replace("'", "")
This will give you a final string, "date_information" which should be the date and the month, separated by a comma:
23rd, May
Finally, you can split this string by comma (date_information.split(",")) to get it into a database.
Rather than using regex, you can use split
movies=["The Big Bad Fox and Other Tales (English subtitles)('23rd', 'May')"]
splitter= movies[0].split(')(')
movie_name = f"{splitter[0]})"
date = f"({splitter[1]}"
This is plain string parsing, so keep in mind that it will only work for entries in this standard format.

Get date from string by splitting

I have a batch of raw text files. Each file begins with Date>>month.day year News garbage.
garbage is a whole lot of text I don't need, and varies in length. The words Date>> and News always appear in the same place and do not change.
I want to copy month day year and insert this data into a CSV file, with a new line for every file in the format day month year.
How do I copy month day year into separate variables?
I tried to split a string after a known word and before a known word. I'm familiar with string[x:y], but I basically want to change x and y from numbers into actual words (i.e. string[Date>>:News])
import re, os, sys, fnmatch, csv
folder = raw_input('Drag and drop the folder > ')
for filename in os.listdir(folder):
    # First, avoid system files
    if filename.startswith("."):
        pass
    else:
        # Tell the script the file is in this directory and can be written
        file = open(folder+'/'+filename, "r+")
        filecontents = file.read()
        thestring = str(filecontents)
        print thestring[9:20]
An example text file:
Date>>January 2. 2012 News 122
5 different news agencies have reported the story of a man washing his dog.
Here's a solution using the re module:
import re
s = "Date>>January 2. 2012 News 122"
m = re.match(r"^Date>>(\S+)\s+(\d+)\.\s+(\d+)", s)
if m:
    month, day, year = m.groups()
    print("{} {} {}".format(month, day, year))
Outputs:
January 2 2012
Edit:
Actually, there's another nicer (imo) solution using re.split described in the link Robin posted. Using that approach you can just do:
month, day, year = re.split(r">>| |\. ", s)[1:4]
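To get from the extracted values to the CSV rows the question asks for (day month year, one row per file), a sketch along these lines may help. It writes to an in-memory buffer instead of a real file, and the names DATE_RE and extract_date and the sample texts are made up for illustration.

```python
import csv
import io
import re

# Same pattern as in the answer above: month name, day followed by ".", year.
DATE_RE = re.compile(r"^Date>>(\S+)\s+(\d+)\.\s+(\d+)")

def extract_date(text):
    """Return (day, month, year) or None if the text does not start with a date."""
    match = DATE_RE.match(text)
    if match is None:
        return None
    month, day, year = match.groups()
    return day, month, year

# Hypothetical file contents; in practice these would come from file.read().
samples = [
    "Date>>January 2. 2012 News 122\n5 different news agencies ...",
    "no date in this one",
]

buffer = io.StringIO()  # stand-in for an open CSV file
writer = csv.writer(buffer)
for text in samples:
    row = extract_date(text)
    if row is not None:
        writer.writerow(row)

print(buffer.getvalue())
```

When writing to a real file with the csv module in Python 3, open it with newline="" so the writer controls the line endings.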
You can use the string method .split(" ") to separate the text into a list of pieces split at the space character. Because year and month.day will always be in the same place, you can access them by their position in that list. To separate month and day, use .split() again, but this time on the "." character.
Example:
parts = theString.split(" ")  # named parts to avoid shadowing the built-in list
year = parts[1]
month = parts[0].split(".")[0]
day = parts[0].split(".")[1]
You could use string.split:
x = "A b c"
x.split(" ")
Or you could use regular expressions (which I see you import but don't use) with groups. I don't remember the exact syntax offhand, but the pattern is something like r'(.*)(Date>>)(.*)'. This pattern searches for the string "Date>>" in between two strings of any other type. The parentheses will capture them into numbered groups.
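The question's idea of string[Date>>:News] can also be approximated without regular expressions, using str.partition. This is a sketch only; the helper name between is made up for illustration.

```python
def between(text, start, end):
    """Return the text found between start and end, or None if either is missing."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle.strip()

s = "Date>>January 2. 2012 News 122"
date_text = between(s, "Date>>", "News")
print(date_text)

# Split the remaining text into its three parts and drop the "." after the day.
month, day, year = date_text.replace(".", "").split()
```

partition() splits at the first occurrence only, so later occurrences of "News" in the article body do not affect the result.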
