Creating a csv-file from an srt-file ("Friends" subtitles) in Python

Currently, I am trying to create a csv file containing the subtitles of NBC's "Friends" and their corresponding starting time. So basically I am trying to turn an srt-file into a csv-file in python.
For those of you that are unfamiliar with srt-files, they look like this:
1
00:00:47,881 --> 00:00:49,757
[CAR HORNS HONKING]
2
00:00:49,966 --> 00:00:52,760
There's nothing to tell.
It's just some guy I work with.
3
00:00:52,969 --> 00:00:55,137
Come on.
You're going out with a guy.
…
Now I have used readlines() to turn it into a list like this:
['\ufeff1\n', '00:00:47,881 --> 00:00:49,757\n', '[CAR HORNS HONKING]\n',
'\n', '2\n', '00:00:49,966 --> 00:00:52,760\n',
"There's nothing to tell.\n", "It's just some guy I work with.\n",
'\n', '3\n', '00:00:52,969 --> 00:00:55,137\n', 'Come on.\n',
"You're going out with a guy.\n", ...]
Is there a way to create a dict or dataframe from this list (or the file it is based on) that contains each starting time (the end time is not needed) and the lines that belong to it? I've been struggling because while sometimes just one line corresponds to a starting time, other times there are two. (There are at most two lines per starting time in this file; however, a solution that handles even more lines would be preferable.)
Lines that look like the first one ("[CAR HORNS HONKING]"), or others that simply say e.g. "CHANDLER:", and their starting times would ideally not be included, but that's not all that important right now.
Any help is very much appreciated!

I think this code covers your problem. The main idea is to use a regular expression to locate the starting time of each subtitle and extract its value and the corresponding lines. The code is not in the most polished form, but I think the main idea is well expressed. I hope it helps.
import re

with open('sub.srt', 'r') as h:
    sub = h.readlines()

re_pattern = r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} -->'
regex = re.compile(re_pattern)

# Get start times
start_times = list(filter(regex.search, sub))
start_times = [time.split(' ')[0] for time in start_times]

# Get lines
lines = [[]]
for sentence in sub:
    if re.match(re_pattern, sentence):
        lines[-1].pop()
        lines.append([])
    else:
        lines[-1].append(sentence)
lines = lines[1:]

# Merge results
subs = {start_time: line for start_time, line in zip(start_times, lines)}
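If you also want to drop sound cues like "[CAR HORNS HONKING]" and bare speaker tags, a single-pass variation along the same lines might look like this (a sketch, not the answer's exact code; `srt_to_dict` and the skip pattern are my own assumptions):

```python
import re

TIME_RE = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3}) --> \d{2}:\d{2}:\d{2},\d{3}')
SKIP_RE = re.compile(r'^\[.*\]$|^[A-Z]+:$')  # sound cues / bare speaker tags

def srt_to_dict(lines):
    subs = {}
    current = None
    for raw in lines:
        text = raw.strip().lstrip('\ufeff')
        m = TIME_RE.match(text)
        if m:
            # a timestamp line starts a new subtitle, keyed by its start time
            current = m.group(1)
            subs[current] = []
        elif text and not text.isdigit() and current:
            # an ordinary text line; skip cues and speaker tags
            if not SKIP_RE.match(text):
                subs[current].append(text)
    # drop entries whose every line was filtered out
    return {t: ls for t, ls in subs.items() if ls}

sample = ['\ufeff1\n', '00:00:47,881 --> 00:00:49,757\n', '[CAR HORNS HONKING]\n',
          '\n', '2\n', '00:00:49,966 --> 00:00:52,760\n',
          "There's nothing to tell.\n", "It's just some guy I work with.\n"]
print(srt_to_dict(sample))
# {'00:00:49,966': ["There's nothing to tell.", "It's just some guy I work with."]}
```

From the resulting dict, something like `pandas.DataFrame(list(subs.items()), columns=['start', 'lines'])` would get you the rest of the way to a dataframe/CSV.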

Related

How to get an unknown substring between two known substrings, within a giant string/file

I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1071552-character-long ndjson file consisting of a single line ("for line in file:" is pointless, since there's only one).
The best I found was that:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use this, the result obviously doesn't stop (at Month) and keeps writing the whole remaining of the file, same as if using partition()[2].
Just know that Month is only an example, customLabel has about 300 variants and they are not listed (I'm actually doing this to list them...)
To give some details here's my script so far:
with open("file.ndjson", "rt", encoding='utf-8') as ndjson:
    filedata = ndjson.read()
x = "customLabel"
count = filedata.count(x)
for i in range(count):
    if filedata.find(x) > 0:
        print("Found " + str(i+1))
So right now it properly tells me how many occurrences of customLabel there are. I'd like to get the substring that comes after customLabel":" instead (Month in the example) to put them all in a list, to locate them much more easily and enable the use of replace() for translations later on.
I'd guess regex are the solution but I'm pretty new to that, so I'll post that question by the time I learn about them...
If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module
import re

label_values = []
# character class fixed to include 0 and avoid the A-z range pitfall
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([0-9a-zA-Z\"]+)"
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        values = re.findall(regex_pattern, line)
        label_values.extend(values)
print(label_values)  # ['"Month"', '23525235']
# If you don't want the items to have quotations
label_values = [i.replace('"', "") for i in label_values]
print(label_values)  # ['Month', '23525235']
Note: If you're only talking about ndjson files and not nested searching, then it'd be better to use the json module to parse the lines and then easily get the value of your specific key which is customLabel.
import json

label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        line_json = json.loads(line)
        if line_json.get(label) is not None:
            label_values.append(line_json.get(label))
print(label_values)  # ['Month']

Parsing multiple lines from a txt output

I need to parse a part of my output file that looks like this (Image is also attached for clarity)
y,mz) = (-.504D-04,-.543D-04,-.538D-03)
The expected output is :
the code I have so far looks like below:
class NACParser(ParseSection):
    name = "coupling"
which is good but there are some issues:
It only prints from the very last line and this, I think, is because it overwrites due to similar other lines.
This code will only work for this specific molecule, and I want something that can work for any molecule. What I mean is: in this example I have a molecule with 15 atoms, where the first atom is c (carbon), the 5th atom is h (hydrogen) and the 11th atom is s (sulfur), but the total number of atoms (currently 15) and the names of the atoms can differ when I have a different molecule.
So I am wondering how I can write general code that works for any molecule. Any help?
This will do literally what you asked. Maybe you can use this as a basis. I just gather all the atom IDs when I find a line with "ATOM", and create the dict entries when I find a line with "d/d". I would show the output, but I just typed in faked data because I didn't want to retype all of that.
import re
from pprint import pprint

header = r"(\d+ [a-z]{1,2})"
atoms = []
gather = {}
for line in open('x.txt'):
    if len(line) < 5:
        continue
    if 'ATOM' in line:
        atoms = re.findall(header, line)
        atoms = [s.replace(' ', '') for s in atoms]
        continue
    if '/d' in line:
        parts = line.split()
        row = parts[0].replace('/', '')
        for at, val in zip(atoms, parts[1:]):
            gather[at + '_' + row] = val
pprint(gather)
Here's the output from your test data. I hope you realize that the cut-and-paste data doesn't match the image. The image uses d/dx, but the cut and paste uses dE/dx. I have assumed you want the "E" in the dict tag too, but that's easy to fix if you don't.
{'10c_dEdx': '0.8337613D-02',
'10c_dEdy': '-0.8171767D-01',
'10c_dEdz': '-0.4316928D-02',
'11s_dEdx': '0.3138990D-01',
'11s_dEdy': '0.3893252D-01',
'11s_dEdz': '0.2767787D-02',
'12h_dEdx': '0.8416159D-02',
'12h_dEdy': '0.3335059D-02',
'12h_dEdz': '0.1357569D-01',
'13h_dEdx': '0.1128067D-02',
'13h_dEdy': '-0.1457401D-01',
'13h_dEdz': '-0.7834375D-03',
'14h_dEdx': '0.8941240D-02',
'14h_dEdy': '0.4869915D-02',
'14h_dEdz': '-0.1273530D-01',
'15h_dEdx': '0.4292434D-03',
'15h_dEdy': '-0.1418384D-01',
'15h_dEdz': '-0.7764904D-03',
'1c_dEdx': '-0.1150239D-01',
'1c_dEdy': '0.4798462D-02',
'1c_dEdz': '0.6015413D-05',
'2c_dEdx': '0.2259669D-01',
'2c_dEdy': '0.5902019D-01',
'2c_dEdz': '0.3707704D-02',
'3c_dEdx': '-0.3153006D-02',
'3c_dEdy': '-0.4060517D-01',
'3c_dEdz': '-0.2306249D-02',
'4n_dEdx': '-0.2718508D-01',
'4n_dEdy': '0.3404657D-01',
'4n_dEdz': '0.1334956D-02',
'5h_dEdx': '-0.1064344D-01',
'5h_dEdy': '-0.1054522D-01',
'5h_dEdz': '-0.8032586D-03',
'6c_dEdx': '0.3017851D-01',
'6c_dEdy': '-0.2805275D-01',
'6c_dEdz': '-0.9413310D-03',
'7s_dEdx': '-0.2253417D-01',
'7s_dEdy': '0.1196856D-01',
'7s_dEdz': '0.2069422D-03',
'8n_dEdx': '-0.3195785D-01',
'8n_dEdy': '0.1888257D-01',
'8n_dEdz': '0.3914382D-03',
'9h_dEdx': '-0.4441489D-02',
'9h_dEdy': '0.1382483D-01',
'9h_dEdz': '0.6724659D-03'}

parsing a .srt file with regex

I am doing a small script in python, but since I am quite new I got stuck in one part:
I need to get timing and text from a .srt file. For example, from
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
and
Subtitles downloaded from www.OpenSubtitles.org.
I have already managed to make the regex for timing, but I am stuck on the text. I've tried to use a lookbehind with my timing regex:
(?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+))\w+
but with no effect. Personally, I think that using a lookbehind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:
an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line
... and repeat. Note the "one or more lines" part - you might have to capture 1, 2, or 20 lines of subtitle content after the time code.
So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.
from itertools import groupby

# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example on the SRT doc page, I get:
res
Out[60]:
[['1\n',
'00:02:17,440 --> 00:02:20,375\n',
"Senator, we're making\n",
'our final approach into Coruscant.\n'],
['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
And I could further transform that into a list of meaningful objects:
from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub  # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
Disagree with @roippi. Regex is a very nice solution to text matching. And the regex for this solution is not tricky.
import re

# Parse the file content
f = open(yoursrtfile)
content = f.read()
# Find all results in content.
# The first group (...) retrieves the timing, \s+ matches the whitespace
# in between, and (.+) retrieves the text content after that.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
number: ^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: [a-zA-Z]+
Hope this helps.
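Those per-line patterns could be combined into a small line classifier, for example (a sketch; the text case is simply "anything else non-blank", since matching arbitrary subtitle text with its own regex is fragile):

```python
import re

NUMBER = re.compile(r'^[0-9]+$')
TIME = re.compile(r'^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> '
                  r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$')

def classify(line):
    """Label one .srt line as 'number', 'time', 'text' or 'blank'."""
    line = line.strip()
    if not line:
        return 'blank'
    if NUMBER.match(line):
        return 'number'
    if TIME.match(line):
        return 'time'
    return 'text'

for l in ['1', '00:00:01,000 --> 00:00:04,074', 'Subtitles downloaded']:
    print(l, '->', classify(l))
```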
Thanks @roippi for this excellent parser.
It helped me a lot to write an srt-to-stl converter in less than 40 lines (in Python 2 though, as it has to fit in a larger project).
from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'
stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""

def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d" % (st[0], round(25 * float(st[1]) / 1000))

# load
with open(inputname, 'r') as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]  # py2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname, 'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n" % (converttime(sub.start), converttime(sub.end), "|".join(sub.content)))
for time:
pattern = ("(\d{2}:\d{2}:\d{2},\d{3}?.*)")
None of the pure regex solutions above worked for real-life srt files.
Let's take a look at the following SRT-patterned text:
1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line
2
00:02:20,476 --> 00:02:22,501
as well as a single line
3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは
Note that:
Text may contain unicode characters.
Text can consist of several lines.
Every cue starts with an integer value and ends with a blank line; both Unix-style and Windows-style CR/LF line endings are accepted.
Here is the working regex:
\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))
https://regex101.com/r/qICmEM/1
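For reference, this is roughly how that regex could be applied from Python (a sketch; `SRT_CUE` is my own name, the pattern is taken verbatim from above, and note that it relies on each text line ending with a newline):

```python
import re

# the answer's pattern: group 1 is the timing, group 2 the full text block
SRT_CUE = re.compile(
    r'\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]'
    r'((.+\r?\n)+(?=(\r?\n)?))'
)

text = ("1\n00:02:17,440 --> 00:02:20,375\n"
        "Some multi lined text\nThis is a second line\n\n"
        "2\n00:02:20,476 --> 00:02:22,501\n"
        "as well as a single line\n\n"
        "3\n00:03:20,476 --> 00:03:22,501\n"
        "should be able to parse unicoded text too\nこんにちは\n")

for m in SRT_CUE.finditer(text):
    print(m.group(1), '|', m.group(2).strip().replace('\n', ' / '))
```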

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (which is the file number that I extracted from my document), sentence ID (automatically generated) and each sentence of this abstract on a row.
I would want a table that looks like this
abstractID  SentenceID  Sentence
a9001755    0000001     Myxococcus xanthus development is regulated by... (1st sentence)
a9001755    0000002     The C signal appears to be the polypeptide product... (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How can I write the sentences (each on a row) to a table and assign a sentence ID as shown above?
This is my code:
import glob
import re
import json

org = "NSF Org"
fileNo = "File"
AbstractString = "Abstract"
abstractFlag = False
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt'
files = glob.glob(path)
for name in files:
    fileA = open(name, 'r')
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split()
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
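To illustrate the "search the whole file as one string" idea (a sketch only: the header layout and the "DMS" org code below are made up, since the actual file isn't shown):

```python
import re

# stand-in for the real file contents, which aren't shown in the question
header = ("File       : a9001755\n"
          "NSF Org    : DMS\n"
          "Abstract   :\n"
          "   Myxococcus xanthus development is regulated by ...\n")

# search the whole string once per field instead of looping over lines
m = re.search(r'^File\s*:\s*(\S+)', header, re.MULTILINE)
abs_id = m.group(1) if m else None
m = re.search(r'^NSF Org\s*:\s*(\S+)', header, re.MULTILINE)
nsf_org = m.group(1) if m else None
print(abs_id, nsf_org)  # a9001755 DMS
```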

How do I read from a file consisting of city names and coordinates/populations and create functions to get the coordinates and population?

I'm using Python, and I have a file which has city names and information such as names, coordinates of the city and population of the city:
Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410
Worcester, MA[4227,7180]161799
2964 1520 604
Wisconsin Dells, WI[4363,8977]2521
1149 1817 481 595
How can I create a function to take the city name and return a list containing the latitude and longitude of the given city?
fin = open("miles.dat", "r")
def getCoordinates
    cities = []
    for line in fin:
        cities.append(line.rstrip())
        for word in line:
            print line.split()
That's what I have tried so far; how can I get the coordinates of a city by its name, and how can I get whole words from each line rather than individual letters?
Any help will be much appreciated, thanks all.
I am feeling generous since you responded to my comment and made an effort to provide more info....
Your code example isn't even runnable right now, but from a purely pseudocode standpoint, you have at least the basic concept of the first part right. Normally I would want to parse out the information using a regex, but I think giving you an answer with a regex is beyond what you already know and won't really help you learn anything at this stage. So I will try and keep this example within the realm of the tools with which you seem to already be familiar.
def getCoordinates(filename):
    '''
    Pass in a filename.
    Return a parsed dictionary in the form of:
        {
            city: [lat, lon]
        }
    '''
    fin = open(filename, "r")
    cities = {}
    for line in fin:
        # this is going to split on the comma, and
        # only once, so you get the city, and the rest
        # of the line
        city, extra = line.split(',', 1)
        # we could do a regex, but again, I don't think
        # you know what a regex is and you seem to already
        # understand split. so let's just stick with that.
        # this splits on the '[' and we take the right side
        part = extra.split('[')[1]
        # now take the remaining string and split off the left
        # of the ']'
        part = part.split(']')[0]
        # we end up with something like: '4660, 12051'
        # so split that string on the comma into a list
        latLon = part.split(',')
        # associate the city with the latlon in the dictionary
        cities[city] = latLon
    return cities
Even though I have provided a full code solution for you, I am hoping that it will be more of a learning experience with the added comments. Eventually you should learn to do this using the re module and a regex pattern.
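For the record, a regex-based version of the same parse might look like this (a sketch; it assumes the City, ST[lat,lon]population layout shown in the question, and `get_coordinates`/`LINE_RE` are my own names). It also skips the bare-number distance lines, which the split-based version above would choke on:

```python
import re

# "Youngstown, OH[4110,8065]115436" -> city name, lat, lon
LINE_RE = re.compile(r'^(.+?)\[(\d+),(\d+)\]\d+')

def get_coordinates(lines):
    cities = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # distance-only lines like "966" don't match and are skipped
            cities[m.group(1)] = [int(m.group(2)), int(m.group(3))]
    return cities

sample = ['Youngstown, OH[4110,8065]115436',
          '966',
          'Yakima, WA[4660,12051]49826']
print(get_coordinates(sample))
# {'Youngstown, OH': [4110, 8065], 'Yakima, WA': [4660, 12051]}
```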
