Use Python xlsxwriter module to write srt data into an Excel file - python

This time I tried to use Python's xlsxwriter module to write data from an .srt file into an Excel file.
The subtitle file looks like this in Sublime Text:
but I want to write the data into an Excel file, so that it looks like this:
It's my first time writing Python for this, so I'm still at the trial-and-error stage. I tried to write some code like the below,
but I don't think it makes sense.
I'll keep trying, but if you know how to do it, please let me know. I'll read your code and try to understand it! Thank you! :)

The following breaks the problem into a few pieces:
Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}. The approach I took was to track which of three distinct states we're in:
seeking to next entry, for when we're looking for the next index number, which should match the regular expression ^\d*$ (nothing but digits; note that this pattern also matches an empty line, so ^\d+$ is the stricter choice),
looking for timestamp, when an index has been found and we expect a timestamp on the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm), and
reading subtitles, while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.
Writing the above records to a worksheet. write_dict_to_worksheet accepts a row and a worksheet, plus a record and a dictionary defining the Excel 0-indexed column numbers for each of the record's keys. It writes the data starting at that row and returns the next free row.
Organizing the overall conversion. convert accepts an input filename (e.g. 'Wildlife.srt') that will be opened and passed to the parse_subtitles function, and an output filename (e.g. 'Subtitle.xlsx') that will be created using xlsxwriter. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.
Logging statements are left in for self-commenting purposes, and because when reproducing your input file I fat-fingered a : to a ; in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!
I've put a text version of your source file, along with the below code, in this Gist
import xlsxwriter
import re
import logging


def parse_subtitles(lines):
    # \d+ rather than \d* so that a blank line is not mistaken for an index
    line_index = re.compile(r'^\d+$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_separator = re.compile(r'^\s*$')

    current_record = {'index': None, 'timestamp': None, 'subtitles': []}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_separator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index': None, 'timestamp': None, 'subtitles': []}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))

    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record


def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet.
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    # First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles':
            worksheet.write(current_row, colindex, keyed_data[colname])

    # Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row += 1

    return current_row


def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}

    next_available_row = 0
    records_processed = 0
    headings = {'index': "Entries", 'timestamp': "Timecodes", 'subtitles': ["Subtitles"]}
    next_available_row = write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(
        inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()


convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')
Edit: Updated to split multiline subtitles across multiple rows in output
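As a quick sanity check of the line patterns the parser relies on, the sketch below classifies a few sample lines on their own, without touching any file or worksheet. The sample lines and the classify helper are made up for illustration, and the index pattern uses \d+ (an assumption, so that an empty line never counts as an index). The last sample line has a ; where a : belongs, the same kind of typo mentioned above, and so is not recognized as a timestamp.

```python
import re

# Patterns as described in the answer; INDEX uses \d+ (an assumption made here)
INDEX = re.compile(r'^\d+$')
TIMESTAMP = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')


def classify(line):
    """Return 'index', 'timestamp', 'blank' or 'text' for one SRT line."""
    line = line.rstrip('\n')
    if not line.strip():
        return 'blank'
    if INDEX.match(line):
        return 'index'
    if TIMESTAMP.match(line):
        return 'timestamp'
    return 'text'


# Hypothetical sample lines; the last one has ';' instead of ':' in the timestamp
sample = [
    "1\n",
    "00:00:01,000 --> 00:00:04,000\n",
    "Hello there\n",
    "\n",
    "2\n",
    "00:00:05;000 --> 00:00:08,000\n",
]
kinds = [classify(s) for s in sample]
print(kinds)
```

A malformed timestamp falls through to 'text', which is the situation the logging.error calls in parse_subtitles are there to surface.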

Related

Load a CSV file with unquoted commas inside data cells into a pandas DataFrame

I want to read a comma-separated CSV file in Python in which some cells contain commas but are not quoted.
For example, the CSV file is in the format below:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,1
1142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
Here the count value is 1,000, but because it contains a comma it is split into two values. This should be rectified when the data is loaded into a DataFrame. The output should look like this:
product unit count alter denom
xyz kg 1,000 volume 1
1142 KG 1,000 L 910
I have used
df=pd.read_csv("filename.csv",sep=",")
The fundamental problem is that your input is not a valid .csv file. Either a comma is part of the data or it is a field delimiter. It can't be both.
The simplest approach is to go back to whoever or whatever supplied the file and complain that the format is invalid.
The producer of the file has several, usually easy, options to fix this: (1) Suppress the thousands separator. (2) Quote the field containing the comma, for example "1,000". (3) Choose a different field delimiter, such as ;. This is a very common approach in Europe because , frequently means a decimal point and so ignoring it is a bad idea.
You should not be in the position of having to clean up someone else's sloppy export.
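To make the ambiguity concrete, here is a small sketch using only the standard csv module. It counts the fields per row for the unquoted form, for the quoted form (option 2) and for a semicolon-delimited form (option 3). The three sample strings and the field_counts helper are illustrative assumptions, not part of the original file.

```python
import csv
import io

# The same record in three forms (sample data made up for illustration)
unquoted = 'product,unit,count,alter,denom\n1142,KG,1,000,L,910\n'
quoted = 'product,unit,count,alter,denom\n1142,KG,"1,000",L,910\n'
semicolon = 'product;unit;count;alter;denom\n1142;KG;1,000;L;910\n'


def field_counts(text, delimiter=','):
    """Return the number of fields in each row of the CSV text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


print(field_counts(unquoted))        # header and data row disagree
print(field_counts(quoted))          # quoting keeps '1,000' in one field
print(field_counts(semicolon, ';'))  # a different delimiter also works
```

Only the unquoted form gives a data row with more fields than the header, which is exactly the defect described above.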
However, since you have the file that you have, and don't seem in a position to take this advice, your only option is to reprocess the file so that it is valid.
The approach is to read the defective input file, check each row to see how many fields it has, and if it has one too many and the cause is a thousands separator comma masquerading as a field delimiter, then glue the two halves of the number back together; and then write out the modified file.
# fixit.py
# Program to accept an invalid csv file with an unescaped comma in column 3 and regularize it
# Use like this: python fixit.py < wrongfile.csv > rightfile.csv
import sys
import csv


def fix(row: list[str]) -> list[str]:
    """
    If there are 5 columns:
        return unchanged.
    If there are 6 columns
        and columns 2 and 3 can be interpreted as a number with a thousand separator:
        combine columns 2 and 3 and return the row.
    Otherwise return an empty list.
    """
    if len(row) == 5:
        return row
    if len(row) == 6 and row[2].isdigit() and row[3].isdigit():
        return row[:2] + [row[2] + row[3]] + row[4:]
    return []


def main(infile, outfile):
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if fixed_row := fix(row):
            writer.writerow(fixed_row)
        else:
            print(f"Line {reader.line_num} could not be fixed", file=sys.stderr)


if __name__ == '__main__':
    sys.stdout.reconfigure(newline="")
    # This is because module csv does its own thing with end-of-line and requires the file have newline=""
    main(sys.stdin, sys.stdout)
Given this input:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,11142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
you will see this output:
product,unit,count,alter,denom
1143,v,1000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1000,TO,1
11567,K,28,EA,100
11569,v,1000,TO,1
along with a warning written to the console about line 2.
Your question shows the data with a blank line between each row of data. I'm assuming that your data is not really like that and the blank lines are the result of your inexperience in formatting a Stack Overflow question properly. But if your data really is like that, the program will still work. You will just get a lot of warnings about blank lines. There won't be any blank lines in the output because pandas.read_csv() doesn't need them.
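The question's desired output keeps the value as 1,000 rather than 1000. If that matters, one possible variant (a sketch, not part of the program above) rejoins the two halves with the comma and lets csv.writer quote the field. The name fix_keep_separator and the in-memory sample are assumptions for illustration.

```python
import csv
import io


def fix_keep_separator(row):
    """Like fix(), but rejoin the split number with its comma (e.g. '1,000')."""
    if len(row) == 5:
        return row
    if len(row) == 6 and row[2].isdigit() and row[3].isdigit():
        return row[:2] + [row[2] + ',' + row[3]] + row[4:]
    return []


# Small in-memory sample instead of stdin/stdout
raw = "product,unit,count,alter,denom\n1143,v,1,000,L,910\n11144,K,1,EA,1\n"
out = io.StringIO(newline="")
writer = csv.writer(out)  # default quoting wraps fields that contain a comma
for row in csv.reader(io.StringIO(raw)):
    fixed = fix_keep_separator(row)
    if fixed:
        writer.writerow(fixed)

result = out.getvalue()
print(result)
```

The repaired field is written as "1,000" in quotes, so the file is valid CSV. pandas.read_csv would then load it as the string 1,000, or as a number if its thousands="," parameter is passed.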

Data Formatting and Fixing

I am trying to clean user reviews that were crawled from the web. When I read them with pandas, there is no warning or error. Then I print the length of the DataFrame.
Next I would like to apply a normalization step. But I am working with Turkish, so I cannot use a Python library; I will use third-party software.
For this purpose, I am trying to write the review column to a text file. When I write the data to a text file, the length of the sample is
and the target size is:
Basically I do this:
Note: As I mentioned, these are customer reviews, so as expected they are dirty and noisy. Some of the samples contain many newline characters; for example, approximately 56 of the samples contain "\n\n\n\n". I have tried to solve this problem in Python by cleaning the data, but every time I lose samples. I also tried to fix it in Excel, but that did not work.
Question: Do you have any suggestions for fixing the data?
It seems that you are producing two CSV files from your df and then reading them back as reviews and targets.
If you use pd.read_csv to read them back, note that it has the argument skip_blank_lines=True by default, which skips blank lines. If some rows of your original df contain nothing but '\n' characters, they end up as empty lines in your new CSV files, and those lines are skipped the next time the files are read.
You can verify this by setting up two counter variables for the total number of empty lines and checking whether they match the 'loss'.
num_empty_review = 0
num_empty_target = 0
for ..., ... in df.iterrows():
    review = ...replace('\n', '')
    target = ...replace('\n', '')
    if review.replace(' ', '') == '':
        num_empty_review += 1
    if target.replace(' ', '') == '':
        num_empty_target += 1
    ...
    ...
print(num_empty_review, num_empty_target)
Lastly, next time, please paste your code here in text form, as I did above :)
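If the goal is a text file with exactly one review per line, one possible approach is to collapse every run of whitespace (including the embedded newlines) into a single space before writing, and to put a marker where a review turns out to be empty. This is only a sketch: the to_single_line helper, the "<EMPTY>" placeholder and the sample reviews are assumptions, not something from the question.

```python
def to_single_line(text, placeholder="<EMPTY>"):
    """Collapse all whitespace runs (including newlines) into single spaces.

    `placeholder` is a hypothetical marker for reviews that end up empty,
    so the output keeps exactly one line per review.
    """
    flat = " ".join(str(text).split())
    return flat if flat else placeholder


# Made-up sample reviews, including one that is nothing but newlines
reviews = ["great\n\n\n\nproduct", "\n\n\n\n", "  fast   delivery  "]
lines = [to_single_line(r) for r in reviews]
print(lines)

# One line per review, so the count is preserved when written with "\n".join(lines)
text_for_file = "\n".join(lines)
```

Because no line is ever empty, a later read with skip_blank_lines=True has nothing to drop, and the review and target files stay the same length.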

How to remove the last 2 numbers from a string?

I am trying to take a Display Name / Keypad code from an Excel document and add it to my company's website. My problem is that when I parse the data from the Excel document, the document shows 4240, but when it is added to the website it is picked up as 4240.0. How can I remove the ".0" when I parse the data?
This is the code I currently have. The only problem with it is that for some reason it does not pick up a "0" if it is at the front or end of a code.
For example, if the code is 0420, it only picks up 42 and drops the leading and trailing 0. I tried changing the Excel format to text so that it isn't picked up as a number, but that didn't help either.
I think the best method would be to remove the last 2 characters by index?
def addCodesA():
    workbook = xlrd.open_workbook(path)
    sheet = workbook.sheet_by_index(0)
    for y in range(sheet.nrows):
        names = []
        codes = []
        convertedcodes = []
        names.append(str(sheet.cell_value(y, 0)))
        codes.append(str(sheet.cell_value(y, 1)))
        for strippedcode in codes:
            convertedcodes.append(strippedcode.strip('.0'))
        print(names)
        print(codes)
        driver.find_element_by_xpath('//*[@id="device_keypad_relay"][@value="0"]').click()
        time.sleep(1)
        codeadd = driver.find_element_by_name('keypad_code_1')
        nameadd = driver.find_element_by_name('keypad_code_1_display')
        codeadd.clear()
        nameadd.clear()
        codeadd.send_keys(convertedcodes)
        nameadd.send_keys(names)
        driver.find_element_by_class_name('btn-form-end').send_keys(Keys.SHIFT, Keys.ENTER)
        time.sleep(6)
        driver.get(customercodes)
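The behaviour described above comes from str.strip('.0'), which does not remove the suffix ".0". It removes every '.' and '0' character from both ends of the string, so '420.0' becomes '42'. A minimal sketch of one way around this is shown below. It assumes numeric cells arrive as floats (as the 4240.0 in the question suggests); the cell_to_code helper and its width option are hypothetical names introduced for illustration.

```python
def cell_to_code(value, width=None):
    """Turn a spreadsheet cell value into a keypad-code string.

    `width` is a hypothetical option: if given, the code is left-padded
    with zeros to that many digits.
    """
    if isinstance(value, float) and value.is_integer():
        text = str(int(value))      # 4240.0 -> '4240'
    else:
        text = str(value).strip()   # text cells such as '0420' pass through
    return text.zfill(width) if width else text


print("420.0".strip(".0"))           # shows why strip('.0') loses the zeros
print(cell_to_code(4240.0))
print(cell_to_code("0420"))
print(cell_to_code(420.0, width=4))
```

A leading zero survives only if the cell really is stored as text. If it is stored as the number 420, the zero is already gone by the time the value is read, and padding to a known width is one way to restore it.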

Problem skipping line whilst iterating using previous line and current line comparison

I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field and if all but the last field match then it will copy the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line (the one the field was copied from), so that only one of the two lines is left.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
            if part_duplicate_flag is False:
                f.write(previous_line)
        previous_line = line
At the moment the script adds the line but doesn't skip the next line. I've tried various placements of continue statements after part_duplicate_line is written to the file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields.
You can use a dict to aggregate the data:
# First we extract the key and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
# Output is ['field1,field2,field3,field4,something else']
# Each entry of this list can be output to the csv file
I propose to encapsulate what you want to do in a function where the important part obeys this logic:
either join the new info to the old record
or output the old record and forget it
of course at the end of the loop we have in any case a dangling old record to output
def join(inp_fname, out_fname):
    '''Input file contains sorted records, when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    # newline='' is what the csv module expects for the files it reads and writes
    with open(inp_fname, newline='') as finp, open(out_fname, 'w', newline='') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' '+new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets, as I have done before posting, in the environment where you have defined join you should get what you want.
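Since the records are already sorted, the same merge can also be expressed with itertools.groupby, which is a different technique from the loop above. The sketch below works on the in-memory list of lines from the question rather than on files. The name merge_last_field is made up, and the plain split(',') assumes no field contains a quoted comma.

```python
from itertools import groupby


def merge_last_field(lines):
    """Merge consecutive CSV lines that differ only in their last field."""
    rows = [line.rstrip('\n').split(',') for line in lines]
    merged = []
    # groupby only groups adjacent rows, so the input must already be sorted
    for key, group in groupby(rows, key=lambda r: r[:-1]):
        last_fields = [r[-1] for r in group]
        merged.append(','.join(key + [' '.join(last_fields)]))
    return merged


output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']
print(merge_last_field(output_csv))
```

Each group collapses into one output line, so there is no previous line to track and nothing to skip.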

Rewriting Single Words in a .txt with Python

I need to create a database using Python and a .txt file.
Creating new items is no problem. The inside of Database.txt looks like this:
Index Objektname Objektplace Username
e.g.:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV), and the other information comes from Tkinter Entry widgets. If an object is already saved in the database, you should be able to rewrite Objektplace and Username.
My problems now are the following:
If I scan the code with the index 6, how do I navigate to that entry, even if it's not on line 6, without causing a problem with the Room6?
How do I, for example, replace only the "Shed" from index 4 when that object is moved to, e.g., Room6?
The same goes for the usernames.
Up until now I've tried different methods, but nothing has worked so far.
The last attempt looked something like this:
def DBChange():
    #Removes unwanted bits from the scanned code
    data2 = data.replace("'", "")
    Index = data2.replace("b","")
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    #Adds a whitespace at the end of the Entrys to seperate them
    Userlen = len(User)
    User2 = User.ljust(Userlen)
    Einlagerungsortlen = len(Einlagerungsort)+1
    Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
    #Navigate to the exact line of the scanned Index and replace the words
    #for the place and the user ONLY in this line
    file = open("Datenbank.txt","r+")
    lines=file.readlines()
    for word in lines[Index].split():
        List.append(word)
    checkWords = (List[2],List[3])
    repWords = (Einlagerungsort2, User2)
    for line in file:
        for check, rep in zip(checkWords, repWords):
            line = line.replace(check, rep)
        file.write(line)
    file.close()
    Return()
Thanks in advance
I'd suggest using Pandas to read and write your text file. That way you can just use the index to select the appropriate line. And if there is no specific reason to use your text format, I would switch to csv for ease of use.
import pandas as pd


def DBChange():
    #Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    # Converted to int so that it matches the numeric Index column
    Indexnr = int(data2.replace("b",""))
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the lines here. This isn't necessary when using csv and Pandas
    # read in the csv file, using the Index column as the row labels
    df = pd.read_csv("Datenbank.csv", index_col="Index")
    # Select line with index and replace value
    df.loc[Indexnr, 'Username'] = User
    df.loc[Indexnr, 'Objektplace'] = Einlagerungsort
    # Write back to csv
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it. But something like this should work.
Edit
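To read and write the text file, use ' ' as the separator. (I assume that no value contains spaces, and that your text file uses exactly 1 space between values.) Passing index_col='Index' makes the Index column the row label, so it is written back as the first column rather than joined by an extra one.
Reading:
df = pd.read_csv('Datenbank.txt', sep=' ', index_col='Index')
Writing:
df.to_csv('Datenbank.txt', sep=' ')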
First of all, this is a terrible way to store data. My suggestion is not particularly good code, so don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry[0] == Index:
        #line now is the correct line
        #Index 2 is the place, index 0 the ID, etc
        entry[2] = Einlagerungsort2
    # keep every line, changed or not
    newlines.append(" ".join(entry))
# Now write newlines back to the file
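Putting the pieces together, a self-contained version of this idea might look like the sketch below. It reads the file, compares the whole first field against the scanned index (so index 6 can never match the 6 in Room6), replaces the place and user fields of that record only, and writes everything back. The function name update_record and its parameters are assumptions for illustration; it also assumes four whitespace-separated fields per record with no spaces inside a value.

```python
def update_record(path, index, place, user):
    """Rewrite place and user for the record whose first field equals `index`.

    Returns True if a matching record was found.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

    found = False
    for i, line in enumerate(lines):
        fields = line.split()
        # Compare the whole first field, so '6' never matches 'Room6'
        if len(fields) == 4 and fields[0] == str(index):
            fields[2], fields[3] = place, user
            lines[i] = " ".join(fields)
            found = True

    if found:
        # Only rewrite the file when something actually changed
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    return found
```

For example, update_record("Datenbank.txt", 4, "Shed", "Nic") would change only the record with index 4 and leave every other line as it was.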
