How to speed up file parsing in python? - python

Below is a section from an app I have been working on. The section is used to update a text file with addValue. At first I thought it was working, but it seems to add extra lines and it is also very, very slow.
trakt_shows_seen is a dictionary of shows, 1 show section looks like
{'episodes': [{'season': 1, 'playcount': 0, 'episode': 1}, {'season': 1, 'playcount': 0, 'episode': 2}, {'season': 1, 'playcount': 0, 'episode': 3}], 'title': 'The Ice Cream Girls'}
The section should search for each title, season and episode in the file and, when found, check whether it has a watched marker (checkValue); if it does, it changes it to addValue, and if it does not, it should add addValue to the end of the line.
A line from the file
_F /share/Storage/NAS/Videos/Tv/The Ice Cream Girls/Season 01/The Ice Cream Girls - S01E01 - Episode 1.mkv _ai Episode 1 _e 1 _r 6.5 _Y 71 _s 1 _DT 714d861 _et Episode 1 _A 4379,4376,4382,4383 _id 2551 _FT 714d861 _v c0=h264,f0=25,h0=576,w0=768 _C T _IT 717ac9d _R GB: _m 1250 _ad 2013-04-19 _T The Ice Cream Girls _G d _U thetvdb:268910 imdb:tt2372806 _V HDTV
So my question: is there a better, faster way? Can I load the file into memory (the file is around 1 MB), change the required lines and then save the file, or can anyone suggest another method that will speed things up?
Thanks for taking the time to look.
EDIT
I have changed the code quite a lot and this does work a lot faster, but the output is not as expected; for some reason it writes lines_of_interest to the file even though there is no code to do this??
I also have not yet added any encoding options but as the file is in utf-8 I suspect there will be an issue with accented titles.
if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    replacevalue = "\t_w\t0\t"
    with open(OversightFile, 'rb') as infile:
        p = '\t_C\tT\t'
        for line in infile:
            if p in line:
                tv_offset = infile.tell() - len(line) - 1  # Find first TV in file, search from here
                break

        lines_of_interest = set()
        for show_dict in trakt_shows_seen:
            for episode in show_dict['episodes']:
                p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show_dict["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
                infile.seek(tv_offset)  # search from first TV show
                for line in infile:
                    if p.findall(line):
                        search_offset = infile.tell() - len(line) - 1
                        lines_of_interest.add(search_offset)  # all lines that need to be changed

    with open(OversightFile, 'rb+') as outfile:
        for lines in lines_of_interest:
            for change_this in outfile:
                outfile.seek(lines)
                if replacevalue in change_this:
                    change_this = change_this.replace(replacevalue, addValue)
                    outfile.write(change_this)
                    break  # Only check 1 line
                elif not addValue in change_this:
                    #change_this.extend(('_w', '1'))
                    change_this = change_this.replace("\t\n", addValue+"\n")
                    outfile.write(change_this)
                    break  # Only check 1 line

Aha -- you are opening, reading and rewriting your file in every repetition of your for loop - once for each episode of each show. Few things in the whole Multiverse could be slower than that.
You can go along the same lines - just read your whole file once, before the for loops,
iterate over the list you read, and write everything back to disk just once -
more or less:
if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    print ' %s TV shows episodes playcount will be updated on Oversight' % len(trakt_shows_seen)
    myfile_list = open(file).readlines()
    for show in trakt_shows_seen:
        print ' --> ' + show['title'].encode('utf-8')
        for episode in show['episodes']:
            print '    Season %i - Episode %i' % (episode['season'], episode['episode'])
            p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show["title"])+')\t.*\t_e\t('+str(episode["episode"])+')\t')
            newList = []
            for line in myfile_list:
                if p.findall(line):
                    if checkvalue in line:
                        line = line.replace(checkvalue, addValue)
                    elif not addValue in line:
                        line = line.strip("\t\n") + addValue + "\n"
                newList.append(line)
            myfile_list = newList
    outref = open(file, 'w')
    outref.writelines(newList)
    outref.close()
This is still far from optimal - but it is the least amount of change in your code to stop what is slowing it down so much.

You're rereading and rewriting your entire file for every episode of every show you track - of course this is slow. Don't do that. Instead, read the file once. Parse out the show title and the season and episode numbers from each line (probably using the built-in csv library with delimiter='\t'), and see if they're in the set you're tracking. Make your substitution if they are, and write the line either way.
It's going to look something like this:
title_index = ...    # whatever column number has the show title
season_index = ...   # whatever column number has the season number
episode_index = ...  # whatever column number has the episode number

with open('somefile', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    modified_lines = []
    for line in reader:
        showtitle = line[title_index]
        if showtitle in trakt_shows_seen:
            season_number = int(line[season_index])
            episode_number = int(line[episode_index])
            if any(x for x in trakt_shows_seen[showtitle] if x['season'] == season_number and x['episode'] == episode_number):
                # line matches a tracked episode
                if '_w' in line:
                    watch_count_index = line.index('_w')
                    # possible check value found - you may be able to skip straight to assigning the next element to '1'
                    if line[watch_count_index + 1] == '0':
                        # check value found, replace
                        line[watch_count_index + 1] = '1'
                    elif line[watch_count_index + 1] != '1':
                        # not sure what you want to do if something like \t_w\t2\t is present
                        line[watch_count_index + 1] = '1'
                else:
                    line.extend(('_w', '1'))
        modified_lines.append(line)

with open('somefile', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(modified_lines)
The exact details will depend on how strict your file format is - the more you know about the structure of the line beforehand the better. If the indices of the title, season and episode fields vary, probably the best thing to do is iterate once through the list representing the line looking for the relevant markers.
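For instance, a minimal sketch of that one-pass marker scan (the field tags _T, _s and _e come from the sample line in the question; everything else here is an assumption):
def find_fields(fields):
    # scan a tab-split line once and pull out title, season and episode
    title = season = episode = None
    for i, token in enumerate(fields[:-1]):
        if token == '_T':
            title = fields[i + 1]
        elif token == '_s':
            season = int(fields[i + 1])
        elif token == '_e':
            episode = int(fields[i + 1])
    return title, season, episode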
I have skipped over error checking - depending on your confidence in the original file you might want to ensure that season and episode numbers can be converted to ints, or stringify your trakt_shows_seen values. The csv reader will return encoded bytestrings, so if show names in trakt_shows_seen are Unicode objects (which they don't appear to be in your pasted code) you should either decode the csv reader's results or encode the dictionary values.
I personally would probably convert trakt_shows_seen to a set of (title, season, episode) tuples, for more convenient checking to see if a line is of interest. At least if the field numbers for title, season and episode are fixed. I would also write to my outfile file (under a different filename) as I read the input file rather than keeping a list of lines in memory; that would allow some sanity checking with, say, a shell's diff utility before overwriting the original input.
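A minimal sketch of that write-as-you-read variant (the output filename somefile.new is just an example):
import csv

with open('somefile', 'rb') as infile, open('somefile.new', 'wb') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for line in reader:
        # ... modify line here exactly as in the loop above ...
        writer.writerow(line)
# afterwards, compare before overwriting the original, e.g.: diff somefile somefile.new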
To create a set from your existing dictionary - to some extent it depends on exactly what format trakt_shows_seen uses. Your example shows an entry for one show, but doesn't indicate how it represents more than one show. For now I'm going to assume it's a list of such dictionaries, based on your attempted code.
shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict['title']
    for episode_dict in show_dict['episodes']:
        shows_of_interest.add((title, episode_dict['season'], episode_dict['episode']))
Then in the loop that reads the file:
# the rest as shown above
season_number = int(line[season_index])
episode_number = int(line[episode_index])
if (showtitle, season_number, episode_number) in shows_of_interest:
    # line matches a tracked episode

Converting sequential data from a .txt file to a data frame

Hello Data science community, I'm new to data science and Python programming.
Here is the structure of my txt file, but there are many missing values:
#*Improved Channel Routing by Via Minimization and Shifting.
##Chung-Kuan Cheng
David N. Deutsch
#t1988
#cDAC
#index131751
#%133716
#%133521
#%134343
#!Channel routing area improvement by means of via minimization and via shifting in two dimensions (compaction) is readily achievable. Routing feature area can be minimized by wire straightening. The implementation of algorithms for each of these procedures has produced a solution for Deutsch's Difficult Example
the standard channel routing benchmark
that is more than 5% smaller than the best result published heretofore. Suggestions for possible future work are also given.
#*A fast simultaneous input vector generation and gate replacement algorithm for leakage power reduction.
##Lei Cheng
Liang Deng
Deming Chen
Martin D. F. Wong
#t2006
#cDAC
#index131752
#%132550
#%530568
#%436486
#%134259
#%283007
#%134422
#%282140
#%1134324
#!Input vector control (IVC) technique is based on the observation that the leakage current in a CMOS logic gate depends on the gate input state
and a good input vector is able to minimize the leakage when the circuit is in the sleep mode. The gate replacement technique is a very effective method to further reduce the leakage current. In this paper
we propose a fast algorithm to find a low leakage input vector with simultaneous gate replacement. Results on MCNC91 benchmark circuits show that our algorithm produces $14 %$ better leakage current reduction with several orders of magnitude speedup in runtime for large circuits compared to the previous state-of-the-art algorithm. In particular
the average runtime for the ten largest combinational circuits has been dramatically reduced from 1879 seconds to 0.34 seconds.
#*On the Over-Specification Problem in Sequential ATPG Algorithms.
##Kwang-Ting Cheng
Hi-Keung Tony Ma
#t1992
#cDAC
#index131756
#%455537
#%1078626
#%131745
#!The authors show that some ATPG (automatic test pattern generation) programs may err in identifying untestable faults. These test generators may not be able to find the test sequence for a testable fault
even allowed infinite run time
and may mistakenly claim it as untestable. The main problem of these programs is that the underlying combinational test generation algorithm may over-specify the requirements at the present state lines. A necessary condition that the underlying combinational test generation algorithm must satisfy is considered to ensure a correct sequential ATPG program. It is shown that the simple D-algorithm satisfies this condition while PODEM and the enhanced D-algorithm do not. The impact of over-specification on the length of the generated test sequence was studied. Over-specification caused a longer test sequence. Experimental results are presented
#*Device and architecture co-optimization for FPGA power reduction.
##Lerong Cheng
Phoebe Wong
Fei Li
Yan Lin
Lei He
#t2005
#cDAC
#index131759
#%214244
#%215701
#%214503
#%282575
#%214411
#%214505
#%132929
#!Device optimization considering supply voltage Vdd and threshold voltage Vt tuning does not increase chip area but has a great impact on power and performance in the nanometer technology. This paper studies the simultaneous evaluation of device and architecture optimization for FPGA. We first develop an efficient yet accurate timing and power evaluation method
called trace-based model. By collecting trace information from cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using the trace for different Vdd and Vt
we enable the device and architecture co-optimization for hundreds of combinations. Compared to the baseline FPGA which has the architecture same as the commercial FPGA used by Xilinx
and has Vdd suggested by ITRS but Vt optimized by our device optimization
architecture and device co-optimization can reduce energy-delay product by 20.5% without any chip area increase compared to the conventional FPGA architecture. Furthermore
considering power-gating of unused logic blocks and interconnect switches
our co-optimization method reduces energy-delay product by 54.7% and chip area by 8.3%. To the best of our knowledge
this is the first in-depth study on architecture and device co-optimization for FPGAs.
I want to convert it to a data frame using Python. For example, lines which start with ## are authors, #! are abstracts, #* are titles, #% are references and #c are venues.
Each article starts with its title; the problem may have to do with abstracts.
I tried different approaches such as
import csv
with open('names7.csv', 'w', encoding="utf-8") as csvfile:
    fieldnames = ["Venue", "Year", "Authors", "Title", "id", "ListCitation", "NbrCitations", "Abstract"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(r"C:\Users\lenovo\Downloads\1.txt", "r", encoding="utf-8") as f:
        cnt = 1
        for line in f:
            if line.startswith('#*'):
                writer.writerow({'Title': line})
                cnt += 1
            elif line.startswith('##'):
                writer.writerow({'Authors': line})
                cnt += 1
            elif line.startswith("#t"):
                writer.writerow({'Year': line})
                cnt += 1
            elif line.startswith("#!"):
                writer.writerow({'Abstract': line})
                cnt += 1
            elif line.startswith("#c"):
                writer.writerow({'Venue': line})
                cnt += 1
            elif line.startswith("#index"):
                writer.writerow({'id': line})
                cnt += 1
            else:
                writer.writerow({'ListCitation': line})
                cnt += 1
        f.close()
I tried this approach but it didn't work. I want to convert this file to a data frame with the said columns and store the result in a CSV file. How can I do that?
The output of the answer's code:
Here is the output that I want:
For example, for this case (the abstract column) there is space between paragraphs, which caused me a problem, so this case must be taken into account; and for the references column there are many references, so they must be taken into account as well:
#*Total power reduction in CMOS circuits via gate sizing and multiple threshold voltages.
##Feng Gao
John P. Hayes
#t2005
#cDAC
#index132139
#%437038
#%437006
#%436596
#%285977
#%1135402
#%132206
#%194016
#%143061
#!Minimizing power consumption is one of the most important objectives in IC design. Resizing gates and assigning different Vt's are common ways to meet power and timing budgets. We propose an automatic implementation of both these techniques using a mixedinteger linear programming model called MLP-exact
which minimizes a circuit's total active-mode power consumption. Unlike previous linear programming methods which only consider local optimality
MLP-exact
can find a true global optimum. An efficient
non-optimal way to solve the MLP model
called MLP-fast
is also described. We present a set of benchmark experiments which show that MLP-fast
is much faster than MLP-exact
while obtaining designs with only slightly higher power consumption. Furthermore
the designs generated by MLP-fast
consume 30% less power than those obtained by conventional
sensitivity-based methods.
csv.DictWriter().writerow() takes a dictionary representing the entire row for that specific sample. What's happening in your code right now is that every time you call writerow, it creates a completely new row instead of adding to the current row. Instead, what you should do is
define a variable called row as a dictionary
store the values for the entire row in this variable
write to the csv using writerow for this dictionary
This will write the entire row to the csv instead of a new row for every new value.
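For instance, a minimal sketch (the writer is assumed to be the csv.DictWriter from your code above, and the values are taken from the first article in your sample):
row = {}    # collect all the values for one article here
row['Title'] = 'Improved Channel Routing by Via Minimization and Shifting.'
row['Year'] = '1988'
row['Venue'] = 'DAC'
row['id'] = '131751'
# ... fill in the remaining columns the same way ...
writer.writerow(row)    # one call writes the whole article as a single row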
Though this is not the only problem we must look at. Each new line in the text document that does not start with a tag is treated as a value that it is not. For example, the #!Channel abstract is spread across 3 lines of text, yet only the first line will be treated as the abstract while the other 2 lines will be treated as something else.
Below is an improved version of the code, with documentation, using a dictionary to store the starting tags and corresponding column names. To add new cases, just modify the dictionary keys and fieldnames.
"""
## are authors
#! are abstracts
#* are titles
#% are references
#index are index
#c are venues using python Each article start by its title , the problem may have to do with abstracts
"""
# use dictionary to store fieldnames with corresponding id's/tags
keys = {
'Venue': '#c',
'Year':'#t',
'Authors':'##',
'Title':'#*',
'id': '#index',
'References': '#%',
'Abstract': '#!',
}
fieldnames = ["Venue", "Year", "Authors", "Title","id","Abstract", 'References']
outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file
import csv
with open(outFile, 'w', encoding="utf-8") as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
with open(inFile, "r", encoding="utf-8") as f:
row = dict()
stack = [] # (key, value) pair, used to store multiple values regarding rows
for line in f:
line = line.strip() # remove white space from beginning and end of line
prev = "" # store value of previous column
for col in fieldnames:
# if column is defined and doesnt start with any id, add to previous value
# this handles cases with results over one line or containing new lines
if prev in row and not any([line.startswith(prefix) for prefix in keys.items()]):
# remove prefix
prefix = keys[prev]
line = line[len(prefix):]
row[prev] += ' ' + line
# initate or append to current value. Handles (References #%)
elif col in keys and line.startswith(keys[col]):
# remove prefix
prefix = keys[col]
line = line[len(prefix):]
if col in row:
stack.append((col, line))
else:
row[col] = line
prev = col # define prev col if answer goes over one line
break # go to next line in text
writer.writerow(row)
for col, line in stack:
row[col] = line
writer.writerow(row)
f.close()
Result produced given test case above.
Updated previous answer, with the result produced for this specific text file:
"""
## are authors
#! are abstracts
#* are titles
#% are references
#index are index
#c are venues using python Each article start by its title , the problem may have to do with abstracts
"""
# use dictionary to store fieldnames with corresponding id's/tags
keys = {
'#c': 'Venue',
'#t': 'Year',
'##': 'Authors',
'#*': 'Title',
'#index': 'id',
'#%': 'References',
'#!': 'Abstract'
}
fieldnames = ["Venue", "Year", "Authors", "Title", "NbrAuthor", "id", "ListCitation", "NbrCitation", "References", "NbrReferences", "Abstract"]
# fieldnames = ["Venue", "Year", "Authors", "Title", "NbrAuthor", "id", "ListCitation", "NbrCitation"]
# References and Authors store on one line
# Count number of authors and references
# We want to count the Authors, NbrAuthor, NbrCitations
outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file
import csv
import re
with open(outFile, 'w', encoding="utf-8") as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
with open(inFile, "r", encoding="utf-8") as f:
row = dict()
prev = ""
for line in f.readlines():
line = line.strip() # remove any leading or trailing whitespace
# remove any parentheses at the end of the string
query = re.findall(r'\([^)]*\)', line)
if len(query) > 0:
line = line.replace(query[-1], '')
# if none of the keys match, then belongs to previous key
if prev != "" and not any([line.startswith(k) for k in keys]):
if prev == 'Abstract':
row[prev] += " " + line
else:
row[prev] += ", " + line
else:
for k in keys:
prefix = ""
if line.startswith(k):
# remove prefix
prefix = k
line = line[len(prefix):]
if keys[k] in row:
if keys[k] == "References":
row[keys[k]] += ", " + line
else:
row[keys[k]] += " " + line
else:
row[keys[k]] = line
prev = keys[k]
# count number of references and Citations
row["NbrAuthor"] = row["Authors"].count(',') + 1
row["NbrCitation"] = 0
row["NbrReferences"] = row["References"].count(',') + 1
writer.writerow(row)
Edit: added clause to if statement
prefixes = {
    '#*': 'Title',
    '##': 'Authors',
    '#t': 'Year',
    '#c': 'Venue',
    '#index': 'id',
    '#%': 'References',
    '#!': 'Abstract',
}
outFile = 'names7.csv'  # path to csv output
inFile = r"1.txt"       # path to input text file

import csv
import re

with open(outFile, 'w', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(prefixes.values()) + ['NbrAuthor', 'NbrCitations', 'ListCitations'])
    writer.writeheader()
    with open(inFile, "r", encoding="utf-8") as f:
        row = dict()
        prev = ''
        for line in f.readlines():
            # remove leading and trailing whitespace
            line = line.strip()
            # remove close brackets at end of lines
            # query = re.findall(r'\([^)]*\)', line)
            # if len(query) > 0:
            #     line = line.replace(query[-1], '')
            for prefix, col in prefixes.items():
                if line.startswith(prefix):
                    line = line[len(prefix):]
                    if col == "Authors" or col == 'Abstract':
                        row[col] = ""
                    elif col == 'References':
                        row.setdefault(col, "")  # references accumulate over many #% lines
                    else:
                        row[col] = line
                    prev = prefix
                    break
            # special cases: columns whose values build up across several lines
            try:
                if prev == '##':
                    if row['Authors'] == "":
                        row['Authors'] = line
                    else:
                        row['Authors'] += ', ' + line
                elif prev == '#%':
                    if row['References'] == "":
                        row['References'] = line
                    else:
                        row['References'] += ', ' + line
                elif prev == '#!':
                    row['Abstract'] += ' ' + line
            except Exception as e:
                print(e)
            if len(line) == 0:
                # a blank line ends one article: add the counts and write the row
                row['NbrAuthor'] = row['Authors'].count(',') + 1
                row['NbrCitations'] = 0
                writer.writerow(row)
                prev = ''
                row = dict()
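To get the data frame the question asks for, the CSV produced above can then be loaded with pandas (a short sketch; names7.csv is the output filename used in the code above):
import pandas as pd

df = pd.read_csv('names7.csv')  # read the CSV written above into a DataFrame
print(df.head())                # inspect the first few articles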

Write a program to read/close the file, and display the following output

I need to write a Python program to read/close the file (i.e., Stock.txt), and display the following output, using the split method of list. There is only one line in the Stock.txt which is the stock portfolio of an investor, consisting of the invested amount of four stocks.
Inside content of the file Stock.txt:
hsbc, 84564.24, boc, 46392.45, manulife, 34562.98, galaxy, 89321.23
I only know how to write related Python code to open to read/close the file. I really don't know what python code I should write to display the following expected output which the assignment requires me to do!
My current code:
infile = open("Stock.txt", 'r')
data = [line.rstrip() for line in infile]
infile.close()
But I am not sure whether my current code is right since I am a Python beginner.
Expected output of this assignment:
01234567890123456789012345678901234567890123456789
The amount invested in HSBC: 844563.24
The amount invested in BOC: 465392.46
The amount invested in MANULIFE: 345612.98
The amount invested in GALAXY: 893421.23
STOCK PERCENTAGE
---------------------
HSBC 33.13%
BOC 18.26%
MANULIFE 13.56%
GALAXY 35.05%
Total Amount Invested: $2,548,989.91
I don't think I'm allowed to fully solve it for you, but I can get you started.
first_line = data[0] # 'hsbc, 84564.24, boc, 46392.45, manulife, 34562.98, galaxy, 89321.23'
real_data = first_line.split(', ') # ['hsbc', '84564.24', 'boc', '46392.45', 'manulife', '34562.98', 'galaxy', '89321.23']
There is one line in our file, so we take the first line with data[0], then split into a list with .split(', ').
stock_names = real_data[::2] # ['hsbc', 'boc', 'manulife', 'galaxy']
stock_values = real_data[1::2] # ['84564.24', '46392.45', '34562.98', '89321.23']
The first line gets every second element of real_data starting from the 0th.
The second line gets every second element of real_data starting from the 1st.
Both use the list slicing syntax:
list[start:end:step]
Understanding slice notation
for name, value in zip(stock_names, stock_values):
print(name, value)
# perform calculations ect.
All together:
infile = open("stocks.txt", 'r')
data = [line.rstrip() for line in infile]
infile.close()
first_line = data[0]
real_data = first_line.split(', ')
stock_names = real_data[::2]
stock_values = real_data[1::2]
for name, value in zip(stock_names, stock_values):
    print("something")
    # perform calculations
# Good luck :)
Note I have called infile.close() as soon as the lines are read, since there is no need for the file to stay open any longer than that.
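If you are allowed to go further, the remaining arithmetic could look roughly like this (only a sketch; the exact column widths in the expected output are a guess):
values = [float(v) for v in stock_values]   # convert the amount strings to numbers
total = sum(values)

for name, value in zip(stock_names, values):
    print("The amount invested in {}: {:.2f}".format(name.upper(), value))

print("{:<15}{}".format("STOCK", "PERCENTAGE"))
print("-" * 21)
for name, value in zip(stock_names, values):
    print("{:<15}{:.2f}%".format(name.upper(), value / total * 100))

print("Total Amount Invested: ${:,.2f}".format(total))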

String Cutting with multiple lines

So I'm new to Python, apart from some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with the columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = KATE/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain.. Ill split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow({'ID': idea, ......})  # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
                        sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I am still trying to understand every step, but just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The amount of keywords in the emails is ~20 words. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" signs are still there.
Should I replace ["=", " = "] with ""?
I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress Home"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before
company = content.split("Company_NAME:")[1] #take second element of the list (the remainder) which is company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a regex "cut" across multiple lines, you would use the re.DOTALL option, so that . also matches newlines. It might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL).
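A minimal sketch of that regex route, assuming the three keywords always appear in this order (the group names are just examples):
import re

content = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""

# re.DOTALL lets '.' match newlines, so each group can span several lines
pattern = re.compile(
    r"Name_OF_Person:(?P<name>.*?)"
    r"Adress_HOME:(?P<address>.*?)"
    r"Company_NAME:(?P<company>.*)",
    re.DOTALL,
)

match = pattern.search(content)
if match:
    # collapse the line breaks inside each captured value
    fields = {k: " ".join(v.split()) for k, v in match.groupdict().items()}
    print(fields)
    # {'name': 'Stefan', 'address': 'London, Maple Street 45', 'company': 'MultiVendor XXVideos'}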

Python:Loop through .csv of urls and save it as another column

New to Python; I've read a bunch and watched a lot of videos. I can't get it to work and I'm getting frustrated.
I have a List of links like below:
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
I'm trying to get python to go to "URL" and save it in a folder named "location" as filename "API.las".
ex) ......"location"/Section/"API".las
C://.../T32S R29W/Sec.27/15-119-00164.las
The file has hundreds of rows and links to download. I also wanted to implement a sleep function so as not to bombard the servers.
What are some of the different ways to do this? I've tried pandas and a few other methods... any ideas?
You will have to do something like this
import urllib

for link, file_name in zip(links, file_names):
    u = urllib.urlopen(link)
    udata = u.read()
    f = open(file_name + ".las", "w")
    f.write(udata)
    f.close()
    u.close()
If the contents of your file are not what you wanted, you might want to look at a scraping library like BeautifulSoup for parsing.
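If you want to pull the links and target paths straight from the CSV and pause between downloads, a rough sketch might look like this (the input name wells.csv and the 5-second pause are assumptions; the folder layout follows the Location/Section/API pattern from the question):
import csv
import os
import time
import urllib

with open('wells.csv', 'rb') as f:            # Python 2: csv prefers binary mode
    rows = list(csv.DictReader(f))

for row in rows:
    # e.g. Location "T32S R29W, Sec. 27, SW SW NE" -> folder "T32S R29W/Sec. 27"
    location, section = [part.strip() for part in row['Location'].split(',')[:2]]
    folder = os.path.join(location, section)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    file_name = os.path.join(folder, row['API'] + '.las')
    data = urllib.urlopen(row['URL']).read()
    with open(file_name, 'wb') as out:
        out.write(data)
    time.sleep(5)                             # be polite to the server between downloads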
This might be a little dirty, but it's a first pass at solving the problem. This is all contingent on each value in the CSV being encompassed in double quotes. If this is not true, this solution will need heavy tweaking.
Code:
import os

csv = """
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
""".strip()  # trim excess space at top and bottom

root_dir = '/tmp/so_test'

lines = csv.split('\n')  # break CSV on newlines
header = lines[0].strip('"').split('","')  # grab first line and consider it the header

lines_d = []  # we're about to perform the core actions, and we're going to store it in this variable
for l in lines[1:]:  # we want all lines except the top line, which is a header
    line_broken = l.strip('"').split('","')  # strip off leading and trailing double-quote
    line_assoc = zip(header, line_broken)  # creates a tuple of tuples out of the line with the header at matching position as key
    line_dict = dict(line_assoc)  # turn this into a dict
    lines_d.append(line_dict)

    section_parts = [s.strip() for s in line_dict['Location'].split(',')]  # break Section value to get pieces we need

    file_out = os.path.join(root_dir, '%s%s%s%sAPI.las' % (section_parts[0], os.path.sep, section_parts[1], os.path.sep))  # format output filename the way I think is requested

    # stuff to show what's actually put in the files
    print file_out, ':'
    print ' ', '"%s"' % ('","'.join(header),)
    print ' ', '"%s"' % ('","'.join(line_dict[h] for h in header))
output:
~/so_test $ python so_test.py
/tmp/so_test/T32S R29W/Sec. 27/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
/tmp/so_test/T34S R26W/Sec. 2/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
~/so_test $
Approach 1:
Suppose your file has 1000 rows.
Create a masterlist which has the data stored in this form:
[row1, row2, row3 and so on]
Once done, loop through this masterlist. You will get a row in string format in every iteration.
Split it to make a list, slice out the last column (the URL, i.e. row[-1])
and append it to an empty list named result_url. Once it has run for all rows, save it to a file; you can easily create a directory using the os module and move your file over there. A rough sketch is given below.
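A rough sketch of Approach 1 (the filenames data.csv and result_url.txt are assumptions):
import csv

# Approach 1: read everything into a master list first, then pull out the URLs
with open('data.csv', 'r') as csvfile:
    masterlist = list(csv.reader(csvfile))

result_url = []
for row in masterlist[1:]:          # skip the header row
    result_url.append(row[-1])      # the URL is the last column

with open('result_url.txt', 'w') as out:
    out.write('\n'.join(result_url))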
Approach 2:
If the file is too huge, read it line by line in a try block and process your data (using the csv module you can get each row as a list, slice out the URL and write it to the file API.las each time).
Once your program moves past the 1001st line it will move to the except block, where you can just 'pass' or print something to get notified.
In approach 2, you are not saving all the data in any data structure; you only store a single row while processing it, so it is faster.
import csv, os

directory_creater = os.mkdir('Locations')
fme = open('./Locations/API.las', 'w+')

with open('data.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    print spamreader.next()
    while True:
        try:
            row = spamreader.next()
            get_url = row[-1]
            to_write = get_url + '\n'
            fme.write(to_write)
        except:
            print "Program has run. Check output."
            exit(1)
This code can do all that you mentioned efficiently in less time.

retrieving name from number ID

I have code that takes data from online, where items are referred to by a number ID, compares data about those items, and builds a list of item ID numbers based on some criteria. What I'm struggling with is taking this list of numbers and turning it into a list of names. I have a text file with the numbers and corresponding names, but I'm having trouble using it because it contains multi-word names and retains the \n at the end of each line when I try to parse the file in any way with Python. The text file looks like this:
number name\n
14 apple\n
27 anjou pear\n
36 asian pear\n
7645 langsat\n
I have tried split(), as well as replacing the whitespace in between with several different things, to no avail. I asked a question earlier which yielded a lot of progress but still didn't quite work. The two methods that were suggested were:
d = dict()
f = open('file.txt', 'r')
for line in f:
    number, name = line.split(None, 1)
    d[number] = name
This almost worked but still left me with the \n, so if I call d['14'] I get 'apple\n'. The other method was:
import re
f=open('file.txt', 'r')
fr=f.read()
r=re.findall("(\w+)\s+(.+)", fr)
This seemed to get rid of the \n at the end of every name, but leaves me with the problem of having a tuple with each number-name combo as a single entry, so if I were to say r[1] I would get ('14', 'apple'). I really don't want to delete each newline by hand on all ~8400 entries...
Any recommendations on how to get the corresponding name given a number from a file like this?
In your first method, change the line d[number] = name to d[number] = name[:-1]. This simply strips off the last character, and should remove your \n.
names = {}
with open("id_file.txt") as inf:
    header = next(inf, '')  # skip header row
    for line in inf:
        id, name = line.split(None, 1)
        names[int(id)] = name.strip()

names[27]  # => 'anjou pear'
Use this to modify your first approach:
raw_dict = dict()
cleaned_dict = dict()
Assuming you've imported the file into a dictionary:
raw_dict = {14:"apple\n",27:"anjou pear\n",36 :"asian pear\n" ,7645:"langsat\n"}
for keys in raw_dict:
    cleaned_dict[keys] = raw_dict[keys][:len(raw_dict[keys])-1]
So now, cleaned_dict is equal to:
{27: 'anjou pear', 36: 'asian pear', 7645: 'langsat', 14: 'apple'}
*Edited to add first sentence.
