Hello data science community, I'm new to data science and Python programming.
Here is the structure of my txt file (note that there are many missing values):
#*Improved Channel Routing by Via Minimization and Shifting.
##Chung-Kuan Cheng
David N. Deutsch
#t1988
#cDAC
#index131751
#%133716
#%133521
#%134343
#!Channel routing area improvement by means of via minimization and via shifting in two dimensions (compaction) is readily achievable. Routing feature area can be minimized by wire straightening. The implementation of algorithms for each of these procedures has produced a solution for Deutsch's Difficult Example
the standard channel routing benchmark
that is more than 5% smaller than the best result published heretofore. Suggestions for possible future work are also given.
#*A fast simultaneous input vector generation and gate replacement algorithm for leakage power reduction.
##Lei Cheng
Liang Deng
Deming Chen
Martin D. F. Wong
#t2006
#cDAC
#index131752
#%132550
#%530568
#%436486
#%134259
#%283007
#%134422
#%282140
#%1134324
#!Input vector control (IVC) technique is based on the observation that the leakage current in a CMOS logic gate depends on the gate input state
and a good input vector is able to minimize the leakage when the circuit is in the sleep mode. The gate replacement technique is a very effective method to further reduce the leakage current. In this paper
we propose a fast algorithm to find a low leakage input vector with simultaneous gate replacement. Results on MCNC91 benchmark circuits show that our algorithm produces $14 %$ better leakage current reduction with several orders of magnitude speedup in runtime for large circuits compared to the previous state-of-the-art algorithm. In particular
the average runtime for the ten largest combinational circuits has been dramatically reduced from 1879 seconds to 0.34 seconds.
#*On the Over-Specification Problem in Sequential ATPG Algorithms.
##Kwang-Ting Cheng
Hi-Keung Tony Ma
#t1992
#cDAC
#index131756
#%455537
#%1078626
#%131745
#!The authors show that some ATPG (automatic test pattern generation) programs may err in identifying untestable faults. These test generators may not be able to find the test sequence for a testable fault
even allowed infinite run time
and may mistakenly claim it as untestable. The main problem of these programs is that the underlying combinational test generation algorithm may over-specify the requirements at the present state lines. A necessary condition that the underlying combinational test generation algorithm must satisfy is considered to ensure a correct sequential ATPG program. It is shown that the simple D-algorithm satisfies this condition while PODEM and the enhanced D-algorithm do not. The impact of over-specification on the length of the generated test sequence was studied. Over-specification caused a longer test sequence. Experimental results are presented
#*Device and architecture co-optimization for FPGA power reduction.
##Lerong Cheng
Phoebe Wong
Fei Li
Yan Lin
Lei He
#t2005
#cDAC
#index131759
#%214244
#%215701
#%214503
#%282575
#%214411
#%214505
#%132929
#!Device optimization considering supply voltage Vdd and threshold voltage Vt tuning does not increase chip area but has a great impact on power and performance in the nanometer technology. This paper studies the simultaneous evaluation of device and architecture optimization for FPGA. We first develop an efficient yet accurate timing and power evaluation method
called trace-based model. By collecting trace information from cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using the trace for different Vdd and Vt
we enable the device and architecture co-optimization for hundreds of combinations. Compared to the baseline FPGA which has the architecture same as the commercial FPGA used by Xilinx
and has Vdd suggested by ITRS but Vt optimized by our device optimization
architecture and device co-optimization can reduce energy-delay product by 20.5% without any chip area increase compared to the conventional FPGA architecture. Furthermore
considering power-gating of unused logic blocks and interconnect switches
our co-optimization method reduces energy-delay product by 54.7% and chip area by 8.3%. To the best of our knowledge
this is the first in-depth study on architecture and device co-optimization for FPGAs.
I want to convert it to a data frame using Python. Lines which start with ## are authors, #! are abstracts, #* are titles, #% are references and #c are venues.
Each article starts with its title; the problem may have to do with the abstracts.
I tried different approaches such as:
import csv

with open('names7.csv', 'w', encoding="utf-8") as csvfile:
    fieldnames = ["Venue", "Year", "Authors", "Title", "id", "ListCitation", "NbrCitations", "Abstract"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(r"C:\Users\lenovo\Downloads\1.txt", "r", encoding="utf-8") as f:
        cnt = 1
        for line in f:
            if line.startswith('#*'):
                writer.writerow({'Title': line})
                cnt += 1
            elif line.startswith('##'):
                writer.writerow({'Authors': line})
                cnt += 1
            elif line.startswith("#t"):
                writer.writerow({'Year': line})
                cnt += 1
            elif line.startswith("#!"):
                writer.writerow({'Abstract': line})
                cnt += 1
            elif line.startswith("#c"):
                writer.writerow({'Venue': line})
                cnt += 1
            elif line.startswith("#index"):
                writer.writerow({'id': line})
                cnt += 1
            else:
                writer.writerow({'ListCitation': line})
                cnt += 1
    f.close()
I tried this approach but it didn't work. I want to convert the file to a data frame with the said columns and store the result in a csv file. How can I do that?
[Screenshot: the output of the answer's code]
[Screenshot: the output that I want]
For example, in the abstract column there are line breaks between paragraphs, which caused me a problem, so this case must be taken into account; and in the reference column there are many references per article, so they must be taken into account as well.
#*Total power reduction in CMOS circuits via gate sizing and multiple threshold voltages.
##Feng Gao
John P. Hayes
#t2005
#cDAC
#index132139
#%437038
#%437006
#%436596
#%285977
#%1135402
#%132206
#%194016
#%143061
#!Minimizing power consumption is one of the most important objectives in IC design. Resizing gates and assigning different Vt's are common ways to meet power and timing budgets. We propose an automatic implementation of both these techniques using a mixedinteger linear programming model called MLP-exact
which minimizes a circuit's total active-mode power consumption. Unlike previous linear programming methods which only consider local optimality
MLP-exact
can find a true global optimum. An efficient
non-optimal way to solve the MLP model
called MLP-fast
is also described. We present a set of benchmark experiments which show that MLP-fast
is much faster than MLP-exact
while obtaining designs with only slightly higher power consumption. Furthermore
the designs generated by MLP-fast
consume 30% less power than those obtained by conventional
sensitivity-based methods.
csv.DictWriter().writerow() takes a dictionary representing the entire row for that specific sample. What's happening in your code right now is that every time you call writerow, it creates a completely new row instead of adding to the current row. Instead, what you should do is:
define a variable called row as a dictionary
store the values for the entire row in this variable
write to the csv using writerow for this dictionary
This will write the entire row to the csv instead of a new row for every new value.
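As a minimal sketch of that pattern (the values here are made up, just to illustrate building one dictionary per article and writing it once):

import csv

with open('names7.csv', 'w', encoding="utf-8") as csvfile:
    fieldnames = ["Venue", "Year", "Authors", "Title", "id", "ListCitation", "NbrCitations", "Abstract"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # collect everything belonging to ONE article in a single dictionary...
    row = {"Title": "Some title", "Year": "1988", "Venue": "DAC", "Authors": "A. Author, B. Author"}
    # ...and only then write it out as a single CSV row
    writer.writerow(row)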
Though this is not the only problem we must look at. Each line in the text document that does not start with a tag is treated as a new value when it is not. For example, the #!Channel... abstract is spread across three lines of text, yet only the first line will be recognised as the abstract, while the other two lines will be treated as something else.
Below is an improved version of the code, with documentation, using a dictionary that maps each column to its starting tag. To add new cases, just modify the dictionary keys and the fieldnames.
"""
## are authors
#! are abstracts
#* are titles
#% are references
#index are index
#c are venues using python Each article start by its title , the problem may have to do with abstracts
"""
# use dictionary to store fieldnames with corresponding id's/tags
keys = {
'Venue': '#c',
'Year':'#t',
'Authors':'##',
'Title':'#*',
'id': '#index',
'References': '#%',
'Abstract': '#!',
}
fieldnames = ["Venue", "Year", "Authors", "Title","id","Abstract", 'References']
outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file
import csv
with open(outFile, 'w', encoding="utf-8") as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
with open(inFile, "r", encoding="utf-8") as f:
row = dict()
stack = [] # (key, value) pair, used to store multiple values regarding rows
for line in f:
line = line.strip() # remove white space from beginning and end of line
prev = "" # store value of previous column
for col in fieldnames:
# if column is defined and doesnt start with any id, add to previous value
# this handles cases with results over one line or containing new lines
if prev in row and not any([line.startswith(prefix) for prefix in keys.items()]):
# remove prefix
prefix = keys[prev]
line = line[len(prefix):]
row[prev] += ' ' + line
# initate or append to current value. Handles (References #%)
elif col in keys and line.startswith(keys[col]):
# remove prefix
prefix = keys[col]
line = line[len(prefix):]
if col in row:
stack.append((col, line))
else:
row[col] = line
prev = col # define prev col if answer goes over one line
break # go to next line in text
writer.writerow(row)
for col, line in stack:
row[col] = line
writer.writerow(row)
f.close()
Result produced for the test case above (screenshot).
Updated previous answer; below is the version that produces the desired result for this specific text file (screenshot).
"""
## are authors
#! are abstracts
#* are titles
#% are references
#index are index
#c are venues using python Each article start by its title , the problem may have to do with abstracts
"""
# use dictionary to store fieldnames with corresponding id's/tags
keys = {
'#c': 'Venue',
'#t': 'Year',
'##': 'Authors',
'#*': 'Title',
'#index': 'id',
'#%': 'References',
'#!': 'Abstract'
}
fieldnames = ["Venue", "Year", "Authors", "Title", "NbrAuthor", "id", "ListCitation", "NbrCitation", "References", "NbrReferences", "Abstract"]
# fieldnames = ["Venue", "Year", "Authors", "Title", "NbrAuthor", "id", "ListCitation", "NbrCitation"]
# References and Authors store on one line
# Count number of authors and references
# We want to count the Authors, NbrAuthor, NbrCitations
outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file
import csv
import re
with open(outFile, 'w', encoding="utf-8") as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
with open(inFile, "r", encoding="utf-8") as f:
row = dict()
prev = ""
for line in f.readlines():
line = line.strip() # remove any leading or trailing whitespace
# remove any parentheses at the end of the string
query = re.findall(r'\([^)]*\)', line)
if len(query) > 0:
line = line.replace(query[-1], '')
# if none of the keys match, then belongs to previous key
if prev != "" and not any([line.startswith(k) for k in keys]):
if prev == 'Abstract':
row[prev] += " " + line
else:
row[prev] += ", " + line
else:
for k in keys:
prefix = ""
if line.startswith(k):
# remove prefix
prefix = k
line = line[len(prefix):]
if keys[k] in row:
if keys[k] == "References":
row[keys[k]] += ", " + line
else:
row[keys[k]] += " " + line
else:
row[keys[k]] = line
prev = keys[k]
# count number of references and Citations
row["NbrAuthor"] = row["Authors"].count(',') + 1
row["NbrCitation"] = 0
row["NbrReferences"] = row["References"].count(',') + 1
writer.writerow(row)
Edit: added clause to if statement
prefixes = {
    '#*': 'Title',
    '##': 'Authors',
    '#t': 'Year',
    '#c': 'Venue',
    '#index': 'id',
    '#%': 'References',
    '#!': 'Abstract',
}

outFile = 'names7.csv'  # path to csv output
inFile = r"1.txt"       # path to input text file

import csv
import re

with open(outFile, 'w', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(prefixes.values()) + ['NbrAuthor', 'NbrCitations', 'ListCitations'])
    writer.writeheader()
    with open(inFile, "r", encoding="utf-8") as f:
        row = dict()
        prev = ''
        for line in f.readlines():
            # remove leading and trailing whitespace
            line = line.strip()
            # remove close brackets at end of lines
            # query = re.findall(r'\([^)]*\)', line)
            # if len(query) > 0:
            #     line = line.replace(query[-1], '')
            for prefix, col in prefixes.items():
                if line.startswith(prefix):
                    line = line[len(prefix):]
                    if col in ("Authors", "Abstract", "References"):
                        # these columns can span several lines or repeat, so start them
                        # empty and let the special cases below append the value
                        row.setdefault(col, "")
                    else:
                        row[col] = line
                    prev = prefix
                    break
            # special cases: append the (prefix-stripped) value to the multi-value columns
            if line:
                try:
                    if prev == '##':
                        if row['Authors'] == "":
                            row['Authors'] = line
                        else:
                            row['Authors'] += ', ' + line
                    elif prev == '#%':
                        if row['References'] == "":
                            row['References'] = line
                        else:
                            row['References'] += ', ' + line
                    elif prev == '#!':
                        if row['Abstract'] == "":
                            row['Abstract'] = line
                        else:
                            row['Abstract'] += ' ' + line
                except Exception as e:
                    print(e)
            # a blank line marks the end of an article: finish and write the row
            if len(line) == 0 and row:
                row['NbrAuthor'] = row['Authors'].count(',') + 1
                row['NbrCitations'] = 0
                writer.writerow(row)
                prev = ''
                row = dict()
        # write the last article if the file does not end with a blank line
        if row:
            row['NbrAuthor'] = row.get('Authors', '').count(',') + 1
            row['NbrCitations'] = 0
            writer.writerow(row)
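Since the end goal is a DataFrame, the CSV produced by any of the versions above can then be loaded with pandas (a small sketch; names7.csv is the output path used in the snippets):

import pandas as pd

# load the generated CSV into a DataFrame
df = pd.read_csv('names7.csv')
print(df.head())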
Related
I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field and if all but the last field match then it will copy the current line's last field onto the previous line's last field.
I would like to as I've found and processed one of these matches skip the current line where the field was copied from thus only leaving one of the lines.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
            if part_duplicate_flag is False:
                f.write(previous_line)
        previous_line = line
At the moment the script adds the merged line but doesn't skip the next line. I've tried various renditions of continue statements after part_duplicate_line is written to the file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields
You can use a dict to aggregate data -
#First we extract the key and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))
#Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)
#Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
#Output is ['field1,field2,field3,field4,something else']
#Each entry of this list can be output to the csv file
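If you then want those aggregated entries in a csv file, a minimal sketch (reusing the output_table.csv name from the question) could be:

# Sketch: write each aggregated entry back out as one CSV line
with open('output_table.csv', 'w') as f:
    for entry in for_printing:
        f.write(entry + '\n')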
I propose to encapsulate what you want to do in a function where the important part obeys this logic:
either join the new info to the old record
or output the old record and forget it
of course at the end of the loop we have in any case a dangling old record to output
def join(inp_fname, out_fname):
    '''Input file contains sorted records; when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets, as I have done before posting, in the environment where you have defined join you should get what you want.
So I'm new to Python, besides some experience with tkinter (some GUI experiments).
I read an .mbox file and copy the plain/text part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with the columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = KATE/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain.. Ill split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose +"\t"+pmt
                            sleep(1)
        csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT 2:
@Yaniv
Thank you, I am still trying to understand every step, but I just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The amount of keywords in the emails is ~20 words. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
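If it helps, those pairs can be fed straight into the csv.DictWriter you already set up; a small sketch (the key-to-column mapping here is my assumption, adjust it to your real fieldnames):

import csv

# Sketch: map the parsed keys onto the CSV columns used in the question (mapping is an assumption)
column_for = {'Name_OF_Person': 'Name', 'Adress_HOME': 'Adress', 'Company_NAME': 'Company'}

with open('test.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["ID", "Subject", "Name", "Adress", "Company"])
    writer.writeheader()
    row = {column_for[key]: value for key, value in parse_funky_text(text)}
    writer.writerow(row)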
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process string data spread across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done:
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] # take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1] # take second element of the list
address = content.split("Company_NAME:")[0] # take content before "Company_NAME"
company = content.split("Company_NAME:")[1] # take second element of the list (the remainder), which is the company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make the pattern match across line breaks, you would use the re.DOTALL option so that . also matches newlines. So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL).
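To sketch that regex idea concretely (this assumes the three headers always appear in this order; the content string below is just the example data from the question collapsed onto one line):

import re

# Sketch of the regex route; `content` stands in for the flattened plain-text payload
content = "Name_OF_Person: Stefan Adress_HOME: London, Maple Street 45 Company_NAME: MultiVendor XXVideos"
match = re.search(r"Name_OF_Person:(.*)Adress_HOME:(.*)Company_NAME:(.*)", content, re.DOTALL)
if match:
    name, address, company = (part.strip() for part in match.groups())
    print(name, address, company)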
Below is a section from an app I have been working on. The section is used to update a text file with addValue. At first I thought it was working, but it seems to add extra lines and it is also very, very slow.
trakt_shows_seen is a dictionary of shows; one show's section looks like
{'episodes': [{'season': 1, 'playcount': 0, 'episode': 1}, {'season': 1, 'playcount': 0, 'episode': 2}, {'season': 1, 'playcount': 0, 'episode': 3}], 'title': 'The Ice Cream Girls'}
The section should search for each title, season and episode in the file and, when found, check whether the line has a watched marker (checkValue); if it does, it changes it to addValue, and if it does not, it should append addValue to the end of the line.
A line from the file
_F /share/Storage/NAS/Videos/Tv/The Ice Cream Girls/Season 01/The Ice Cream Girls - S01E01 - Episode 1.mkv _ai Episode 1 _e 1 _r 6.5 _Y 71 _s 1 _DT 714d861 _et Episode 1 _A 4379,4376,4382,4383 _id 2551 _FT 714d861 _v c0=h264,f0=25,h0=576,w0=768 _C T _IT 717ac9d _R GB: _m 1250 _ad 2013-04-19 _T The Ice Cream Girls _G d _U thetvdb:268910 imdb:tt2372806 _V HDTV
So my question: is there a better, faster way? Can I load the file into memory (the file is around 1 MB), change the required lines and then save the file, or can anyone suggest another method that will speed things up?
Thanks for taking the time to look.
EDIT
I have changed the code quite a lot and this does work a lot faster, but the output is not as expected; for some reason it writes lines_of_interest to the file even though there is no code to do this??
I also have not yet added any encoding options but as the file is in utf-8 I suspect there will be an issue with accented titles.
if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    replacevalue = "\t_w\t0\t"
    with open(OversightFile, 'rb') as infile:
        p = '\t_C\tT\t'
        for line in infile:
            if p in line:
                tv_offset = infile.tell() - len(line) - 1  # Find first TV in file, search from here
                break
        lines_of_interest = set()
        for show_dict in trakt_shows_seen:
            for episode in show_dict['episodes']:
                p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show_dict["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
                infile.seek(tv_offset)  # search from first TV show
                for line in infile:
                    if p.findall(line):
                        search_offset = infile.tell() - len(line) - 1
                        lines_of_interest.add(search_offset)  # all lines that need to be changed
    with open(OversightFile, 'rb+') as outfile:
        for lines in lines_of_interest:
            for change_this in outfile:
                outfile.seek(lines)
                if replacevalue in change_this:
                    change_this = change_this.replace(replacevalue, addValue)
                    outfile.write(change_this)
                    break  # Only check 1 line
                elif not addValue in change_this:
                    #change_this.extend(('_w', '1'))
                    change_this = change_this.replace("\t\n", addValue+"\n")
                    outfile.write(change_this)
                    break  # Only check 1 line
Aham -- you are opening, reading and rewriting your file in every repetition of your for loop - once for each episode of each show. Few things in the whole Multiverse could be slower than that.
You can go along the same lines - just read your whole "file" once, before the for loops,
iterate over the list you read, and write everything back to disk, just once -
more or less:
if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    print ' %s TV shows episodes playcount will be updated on Oversight' % len(trakt_shows_seen)
    myfile_list = open(file).readlines()
    for show in trakt_shows_seen:
        print ' --> ' + show['title'].encode('utf-8')
        for episode in show['episodes']:
            print '     Season %i - Episode %i' % (episode['season'], episode['episode'])
            p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show["title"])+')\t.*\t_e\t('+str(episode["episode"])+')\t')
            newList = []
            for line in myfile_list:
                if p.findall(line):
                    if checkvalue in line:
                        line = line.replace(checkvalue, addValue)
                    elif not addValue in line:
                        line = line.strip("\t\n") + addValue + "\n"
                newList.append(line)
            myfile_list = newList
    outref = open(file, 'w')
    outref.writelines(newList)
    outref.close()
This is still far from optimal - but it is the least amount of change to your code that stops what is slowing it down so much.
You're rereading and rewriting your entire file for every episode of every show you track - of course this is slow. Don't do that. Instead, read the file once. Parse out the show title and season and episode numbers from each line (probably using the csv built-in library with delimiter='\t'), and see if they're in the set you're tracking. Make your substitution if they are, and write the line either way.
It's going to look something like this:
title_index = # whatever column number has the show title
season_index = # whatever column number has the season number
episode_index = # whatever column number has the episode number

with open('somefile', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    modified_lines = []
    for line in reader:
        showtitle = line[title_index]
        if showtitle in trakt_shows_seen:
            season_number = int(line[season_index])
            episode_number = int(line[episode_index])
            if any(x for x in trakt_shows_seen[showtitle] if x['season'] == season_number and x['episode'] == episode_number):
                # line matches a tracked episode
                if '_w' in line:
                    watch_count_index = line.index('_w')
                    # possible check value found - you may be able to skip straight to assigning the next element to '1'
                    if line[watch_count_index + 1] == '0':
                        # check value found, replace
                        line[watch_count_index + 1] = '1'
                    elif line[watch_count_index + 1] != '1':
                        # not sure what you want to do if something like \t_w\t2\t is present
                        line[watch_count_index + 1] = '1'
                else:
                    line.extend(('_w', '1'))
        modified_lines.append(line)

with open('somefile', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(modified_lines)
The exact details will depend on how strict your file format is - the more you know about the structure of the line beforehand the better. If the indices of the title, season and episode fields vary, probably the best thing to do is iterate once through the list representing the line looking for the relevant markers.
I have skipped over error checking - depending on your confidence in the original file you might want to ensure that season and episode numbers can be converted to ints, or stringify your trakt_shows_seen values. The csv reader will return encoded bytestrings, so if show names in trakt_shows_seen are Unicode objects (which they don't appear to be in your pasted code) you should either decode the csv reader's results or encode the dictionary values.
I personally would probably convert trakt_shows_seen to a set of (title, season, episode) tuples, for more convenient checking to see if a line is of interest, at least if the field numbers for title, season and episode are fixed. I would also write to my output file (under a different filename) as I read the input file, rather than keeping a list of lines in memory; that would allow some sanity checking with, say, a shell's diff utility before overwriting the original input.
To create a set from your existing dictionary - to some extent it depends on exactly what format trakt_shows_seen uses. Your example shows an entry for one show, but doesn't indicate how it represents more than one show. For now I'm going to assume it's a list of such dictionaries, based on your attempted code.
shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict['title']
    for episode_dict in show_dict['episodes']:
        shows_of_interest.add((title, episode_dict['season'], episode_dict['episode']))
Then in the loop that reads the file:
# the rest as shown above
season_number = int(line[season_index])
episode_number = int(line[episode_index])
if (showtitle, season_number, episode_number) in shows_of_interest:
    # line matches a tracked episode
I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list if necessary.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches use string find/splice, itertools.groupby, and even regexes. I was thinking of doing a regex of '[A-Z]*:' to find where the headers are, but I'm not sure how to approach pulling out multiple lines afterwards until another header is reached (such as the multilined data following DEF in the first example entity).
I appreciate any suggestions.
I took the assumption that if you have a string spanning multiple lines, you want the newlines replaced with spaces (and any additional spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # Matches line header
    tmp = ''    # Stored/cached data for multiline string
    key = None  # Current key
    data = {}
    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)
            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''
            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue
            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue
            # No header, just append string -> here goes the assumption that you want to
            # remove newlines and trailing spaces and replace them with one single space
            tmp += ' ' + row
        # Missed row?
        if key:
            data[key] = tmp
        # Missed group?
        if data:
            yield data
This generator returns a dict with pairs like UI: T020 in each iteration (and always with at least one item).
Since it uses a generator and continuous reading, it should be efficient even on large files, and it won't read the whole file into memory at once.
Here's a little demo:
for data in process_file('data.txt'):
    print('-' * 20)
    for i in data:
        print('%s:' % (i), data[i])
    print()
And actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n")  # just emulating file
import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line)  # check if we hit the CODE: Content line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so - update the current_key with contents
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        current = current + line  # if it's not - it should be the continuation of previous key line
output[current_key] = current  # don't forget the last guy
print(output)
import re
from collections import namedtuple

def process(chunk):
    # split on the headers; the capturing group keeps the header names in the result
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    for i in xrange(1, len(split_chunk), 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1]
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)
should do. I think I'd just do the dict though -- why are you so attached to a namedtuple?
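For comparison, a dict-only variant of the same header-splitting idea might look like this (a sketch):

import re

def process_to_dict(chunk):
    # same splitting idea, but return a plain dict instead of a namedtuple
    parts = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)[1:]
    return {parts[i]: parts[i + 1].strip() for i in range(0, len(parts), 2)}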
The text file contains two columns: an index number (5 characters wide) and characters (30 characters wide).
It is arranged in lexicographic order. I want to perform a binary search to look up a keyword.
Here's an interesting way to do it with Python's built-in bisect module.
import bisect
import os

class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)
        searchable_file = FileSearcher(file_to_search)
        print "Located # line: ", bisect.bisect(searchable_file, wrapped_query)
Do you need to do a binary search? If not, try converting your flat file into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:
import cdb
# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()
In a separate script, run queries against it:
import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345
If you need to find a single keyword in a file:
line_with_keyword = next((line for line in open('file') if keyword in line), None)
if line_with_keyword is not None:
    print line_with_keyword # found
To find multiple keywords you could use set() as #kriegar suggested:
def extract_keyword(line):
    return line[5:35] # assuming the keyword starts at position 6 and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation
    if keyword in keywords: # O(1) search
        print(keyword)
You could use dict() above instead of set() to preserve index information.
Here's how you could do a binary search on a text file:
import bisect
lines = open('file').readlines() # O(n) list creation
keywords = list(map(extract_keyword, lines))
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
    print(lines[i]) # found
There is no advantage compared to the set() variant.
Note: all variants except the first one load the whole file into memory. FileSearcher() suggested by @Mahmoud Abdelkader doesn't require loading the whole file into memory.
I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)
Installation: pip install binary_file_search
Example file:
1,one
2,two_a
2,two_b
3,three
Usage:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted()  # test if the file is sorted.
    print(bfs.search(2))
Result: [[2, 'two_a'], [2, 'two_b']]
It is quite possible, with a slight loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range and reading forward past the line terminator. Here's what I do to look through a csv file with 2 header lines for a numeric value in the first field. Give it an open file and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line at offset zero will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers and are skipped.
Please excuse my lack of polished python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.
Hope this helps
========================================
import os

# See if the input loc is in file
def look1(f, loc):
    # Compute filesize of open file sent to us
    hi = os.fstat(f.fileno()).st_size
    lo = 0
    lookfor = int(loc)
    # print "looking for: ", lookfor
    while hi - lo > 1:
        # Find midpoint and seek to it
        loc = int((hi + lo) / 2)
        # print " hi = ", hi, " lo = ", lo
        # print "seek to: ", loc
        f.seek(loc)
        # Skip to beginning of line
        while f.read(1) != '\n':
            pass
        # Now skip past lines that are headers
        while 1:
            # read line
            line = f.readline()
            # print "read_line: ", line
            # Crude csv parsing, remove quotes, and split on ,
            row = line.replace('"', "")
            row = row.split(',')
            # Make sure 1st field is numeric
            if row[0].isdigit():
                break
        s = int(row[0])
        if lookfor < s:
            # Split into lower half
            hi = loc
            continue
        if lookfor > s:
            # Split into higher half
            lo = loc
            continue
        return row  # Found
    # If not found
    return False
Consider using a set instead of a binary search for finding a keyword in your file.
Set:
O(n) to create, O(1) to find, O(1) to insert/delete
If your input file is separated by a space then:
f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()
my_word in keywords
<returns True or False>
Dictionary:
f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()] ] )
f.close()
keywords[my_word]
<returns index of my_word>
Binary Search is:
O(n log n) create, O(log n) lookup
edit: for your case of 5 characters and 30 characters you can just use string slicing
f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()
my_word in keywords
or
f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()
keywords[my_word]