Parsing with multiple identifiers - python

I was trying to implement the block of code from "Generator not working to split string by particular identifier. Python 2", but I found two bugs in it that I can't seem to fix.
Input:
#m120204
CTCT
+
~##!
#this_one_has_an_at_sign
CTCTCT
+
#jfik9
#thisoneisempty
+
#empty line after + and then empty line to end file (2 empty lines)
The two bugs are:
(i) when a # starts the line after the '+' line, as in the 2nd entry (#this_one_has_an_at_sign)
(ii) when the line following the #identification_line or the line following the '+' line is empty, as in the 3rd entry (#thisoneisempty)
I would like the output to be the same as in the post that I referenced:
yield (name, body, extra)
in the case of #this_one_has_an_at_sign
name= this_one_has_an_at_sign
body= CTCTCT
quality= #jfik9
in the case of #thisoneisempty
name= thisoneisempty
body= ''
quality= ''
I tried using flags but I can't seem to fix this issue. I know how to do it without using a generator, but I'm going to be using big files so I don't want to go down that path. My current code is:
def organize(input_file):
    name = None
    body = ''
    extra = ''
    for line in input_file:
        line = line.strip()
        if line.startswith('#'):
            if name:
                body, extra = body.split('+',1)
                yield name, body, extra
                body = ''
            name = line
        else:
            body = body + line
    body, extra = body.split('+',1)
    yield name, body, extra

for line in organize(file_path):
    print line
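One way to handle both cases is to stop splitting on '+' and instead track the position inside the record: once the '+' line has been seen, the very next line is taken as the quality line, whatever it contains. The following is a minimal sketch, not a definitive fix. It assumes each record is a `#name` line, zero or more body lines, a line containing only `+`, and exactly one quality line (possibly empty). The name `organize_records` is made up for illustration.

```python
def organize_records(lines):
    """Yield (name, body, quality) tuples from an iterable of lines.

    Assumed layout (not confirmed by the question): '#name', body lines,
    a line that is exactly '+', then one quality line.
    """
    lines = iter(lines)
    for line in lines:
        line = line.strip()
        if not line.startswith('#'):
            continue  # skip blank lines between records
        name = line[1:]
        body_parts = []
        # Collect body lines until the '+' separator.
        for line in lines:
            line = line.strip()
            if line == '+':
                break
            body_parts.append(line)
        # The line right after '+' is always the quality line,
        # even if it is empty or starts with '#'.
        quality = next(lines, '').strip()
        yield name, ''.join(body_parts), quality
```

Because the function pulls lines from the iterator one at a time, it can be given an open file object directly, so a large file is never loaded into memory at once.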

Related

Python reading URLs from file until last line

I have a script that checks a domain from a text file and finds its email addresses. I want to add multiple domain names (one per line); the script should take each domain, run the function, and move on to the next line when it finishes. I tried to google a specific solution but I'm not sure how to find an appropriate answer.
f = open("demo.txt", "r")
url = f.readline()
extractUrl(url)

def extractUrl(url):
    try:
        print("Searching emails... please wait")
        count = 0
        listUrl = []
        req = urllib.request.Request(
            url,
            data=None,
            headers={
                'User-Agent': ua.random
            })
        try:
            conn = urllib.request.urlopen(req, timeout=10)
            status = conn.getcode()
            contentType = conn.info().get_content_type()
            html = conn.read().decode('utf-8')
            emails = re.findall(
                r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}', html)
            for email in emails:
                if (email not in listUrl):
                    count += 1
                    print(str(count) + " - " + email)
                    listUrl.append(email)
            print(str(count) + " emails were found")
Python files are iterable, so it's basically as simple as:
for line in f:
    extractUrl(line)
But you may want to do it right (ensure you close the file whatever happens, ignore possible empty lines etc):
# use `with open(...)` to ensure the file will be correctly closed
with open("demo.txt", "r") as f:
    # use `enumerate` to get line numbers too
    # - we might need them for information
    for lineno, line in enumerate(f, 1):
        # make sure the line is clean (no leading / trailing whitespaces)
        # and not empty:
        line = line.strip()
        # skip empty lines
        if not line:
            continue
        # ok, this one _should_ match - but something could go wrong
        try:
            extractUrl(line)
        except Exception as e:
            # mentioning the line number in error report might help debugging
            print("oops, failed to get urls for line {} ('{}') : {}".format(lineno, line, e))

Why isn't my delete function working properly in Python? All results are deleted

I'm making a program that stores data in a text file. I can search for data line by line, and I wrote a delete function (quoted below). It builds a variable 'a' from the lines that are not deleted. For each search result it asks for confirmation before deleting; if deletion is not confirmed, the line is also added to 'a'. Finally it rewrites the file with 'a', omitting the deleted lines.
THE PROBLEM IS:
All results are deleted, not only the confirmed one, despite this:
# deleting line
confirm = input('confirm to delete [y]/[n]>>')
if confirm != 'y':
    a += line
So why does this happen, and how do I fix it?
Next is the whole code of delete function:
searching = input('enter any information about query: ')
searching = searching.lower() # converting words in lower case
f = open(file, 'r')
lines = f.readlines()
f.close()
print('Word | Definition | Remarks')
a = '' # we will store our new edited text here
for line in lines:
line_lower_case = line.lower() # changing line in lower case temporary
# because contact != COntact and will not appear in searcch
if searching in line_lower_case:
print('Query found')
print()
print('>>',line, end = '') # printing words in the same case as been added
# end = '', to prevent printing new line avoiding extra empty line
#deleting line
confirm = input('confirm to delete [y]/[n]>>')
if confirm != 'y':
a += line
#elif confirm =='y':
# pass # it will just do nothing, and will not add line to 'a'
continue # to search for more queries with the same searching entry
print()
a += line #we add each line to the 'a' variable
f = open(file,'w')
f.write(a) #we save our new edited text to the file
f.close()
I changed the indentation of the program; that was the issue, and I agree with @TheLazyScripter on this. It should work now if I understood your problem correctly: I ran a bunch of tests and they passed. I also noticed that you didn't define what the input file is, which throws an error, so I added that definition on line 3.
searching = input('enter any information about query: ')
searching = searching.lower()  # converting words to lower case
file = "test.txt"  # your file
f = open(file, 'r')
lines = f.readlines()
f.close()
print('Word | Definition | Remarks')
a = ''  # we will store our new edited text here
for line in lines:
    line_lower_case = line.lower()  # changing line to lower case temporarily
    # because contact != COntact and would not appear in search
    if searching in line_lower_case:
        print('Query found')
        print()
        print('>>', line, end='')  # printing words in the same case as they were added
        # end='' prevents printing a new line, avoiding an extra empty line
        # deleting line
        confirm = input('confirm to delete [y]/[n]>>')
        if confirm != 'y':
            a += line
        # elif confirm == 'y':
        #     pass  # it will just do nothing, and will not add line to 'a'
        continue  # to search for more queries with the same searching entry
    print()
    a += line  # we add each line to the 'a' variable
f = open(file, 'w')
f.write(a)  # we save our new edited text to the file
f.close()
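The keep-or-delete decision is easier to check when it is separated from the file and keyboard I/O. The sketch below is one possible arrangement, not the original program: `filter_lines` and its `confirm` parameter are hypothetical names, and `confirm` stands in for the `input()` prompt so that the logic can be tested without typing answers.

```python
def filter_lines(lines, query, confirm):
    """Return the lines to keep.

    `confirm` is a callable (an assumption of this sketch) that receives a
    matching line and returns True when that line should be deleted.
    """
    query = query.lower()
    kept = []
    for line in lines:
        # `confirm` is only called for lines that match the query.
        if query in line.lower() and confirm(line):
            continue  # confirmed deletion: do not keep this line
        kept.append(line)
    return kept
```

In the interactive program, `confirm` could be something like `lambda line: input('confirm to delete [y]/[n]>>') == 'y'`, and the returned list could be written back with `f.writelines(kept)`.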

Parsing Generator Python 2

Input:
#example1
abcd
efg
hijklmnopq
#example2
123456789
Script:
def parser_function(f):
    name = ''
    body = ''
    for line in f:
        if len(line) >= 1:
            if line[0] == '#':
                name = line
                continue
            body = body + line
            yield name,''.join(body)

for line in parser_function(data_file):
    print line
Output
('#example1', 'abcd')
('#example1', 'abcdefg')
('#example1', 'abcdefghijklmnopq')
('#example2', 'abcdefghijklmnopq123456789')
Desired Output:
('#example1', 'abcdefghijklmnopq')
('#example2', '123456789')
My problem: my generator is yielding on every line, and I'm not sure where to reset the body. I'm having trouble getting the desired output and I've tried a few different ways. Any help would be greatly appreciated. I saw some other generators that had "if name:" but they were fairly complicated. I got it to work using that code, but I'm trying to make my code as small as possible.
You need to change where you yield, and reset body when a new record starts:
def parser_function(f):
    name = None
    body = ''
    for line in f:
        if line and line[0] == '#':
            if name:
                yield name, body
            name = line
            body = ''  # start a fresh body for the new record
        else:
            body += line
    if name:
        yield name, body
This yields once before every #... and once at the end.
P.S. I've renamed str to body to avoid shadowing a built-in.

Find author of python file from docstring

I'm trying to write a program which has a function that finds and prints the author of a file by looking for the Author string in the docstring. The code below already handles two cases: the Author string followed by the author's name, and the Author string not followed by a name. What I'm having problems with is printing Unknown when the Author string does not exist at all, i.e. no part of the docstring contains Author.
N.B. lines is just a list constructed by using readlines() on a file.
def author_name(lines):
    '''Finds the authors name within the docstring'''
    for line in lines:
        if line.startswith("Author"):
            line = line.strip('\n')
            line = line.strip('\'')
            author_line = line.split(': ')
            if len(author_line[1]) >=4:
                print("{0:21}{1}".format("Author", author_line[1]))
            else:
                print("{0:21}{1}".format("Author", "Unknown"))
If you are writing a function, then return a value. Do not use print (that is for debugging only). Once you use return, you can return early if you do find the author:
def author_name(lines):
    '''Finds the authors name within the docstring'''
    name = 'Unknown'  # set before the loop so it exists even if lines is empty
    for line in lines:
        if line.startswith("Author"):
            line = line.strip('\n')
            line = line.strip('\'')
            author_line = line.split(': ')
            if len(author_line[1]) >=4:
                name = author_line[1]
            return "{0:21}{1}".format("Author", name) # ends the function, we found an author
    return "{0:21}{1}".format("Author", name)

print(author_name(some_docstring.splitlines()))
The last return statement only executes if there were no lines starting with Author, because if there was, the function would have returned early.
Alternatively, because we default name to Unknown, you can use break as well to end the loop early and leave returning to that last line:
def author_name(lines):
    '''Finds the authors name within the docstring'''
    name = 'Unknown'  # set before the loop so it exists even if lines is empty
    for line in lines:
        if line.startswith("Author"):
            line = line.strip('\n')
            line = line.strip('\'')
            author_line = line.split(': ')
            if len(author_line[1]) >=4:
                name = author_line[1]
            break # ends the `for` loop, we found an author.
    return "{0:21}{1}".format("Author", name)
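Both versions still index `author_line[1]`, which raises an IndexError if an Author line contains no `': '` at all. A possible variant, sketched here under its own assumptions, uses `str.partition`, which always returns three parts. The name `find_author` is hypothetical, it returns only the bare name rather than the formatted string, and it treats any non-empty value as a name instead of requiring at least four characters.

```python
def find_author(lines):
    """Return the author's name, or 'Unknown' if none is given."""
    for line in lines:
        if line.startswith("Author"):
            # partition never raises: `value` is '' when there is no ':'
            _, _, value = line.partition(":")
            value = value.strip(" \n'")
            return value or "Unknown"
    # No line started with "Author" (this also covers an empty list).
    return "Unknown"
```

The caller can then format the result itself, for example with `"{0:21}{1}".format("Author", find_author(lines))`.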

Cannot add new items into python dictionary

Hi I'm new to python. I am trying to add different key value pairs to a dictionary depending on different if statements like the following:
def getContent(file):
    for line in file:
        content = {}
        if line.startswith(titlestart):
            line = line.replace(titlestart, "")
            line = line.replace("]]></title>", "")
            content["title"] = line
        elif line.startswith(linkstart):
            line = line.replace(linkstart, "")
            line = line.replace("]]>", "")
            content["link"] = line
        elif line.startswith(pubstart):
            line = line.replace(pubstart, "")
            line = line.replace("</pubdate>", "")
            content["pubdate"] = line
    return content

print getContent(list)
However, this always returns the empty dictionary {}.
I thought it was variable scope issue at first but that doesn't seem to be it. I feel like this is a very simple question but I'm not sure what to google to find the answer.
Any help would be appreciated.
You reinitialize content on every line, so whatever an earlier line stored is thrown away. Move the initialization outside of the loop:
def getContent(file):
    content = {}
    for line in file:
        # etc.
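For reference, a complete version of that change might look like the sketch below. The three marker strings are hypothetical, since the question does not show the real values of titlestart, linkstart and pubstart, and the function is renamed `get_content` here only to keep the sketch separate from the original.

```python
# Hypothetical marker strings; the question does not show the real values.
TITLE_START = "<title><![CDATA["
LINK_START = "<link><![CDATA["
PUB_START = "<pubdate>"


def get_content(lines):
    content = {}  # created once, before the loop, so entries accumulate
    for line in lines:
        line = line.strip()  # drop the trailing newline (an addition of this sketch)
        if line.startswith(TITLE_START):
            content["title"] = line.replace(TITLE_START, "").replace("]]></title>", "")
        elif line.startswith(LINK_START):
            content["link"] = line.replace(LINK_START, "").replace("]]>", "")
        elif line.startswith(PUB_START):
            content["pubdate"] = line.replace(PUB_START, "").replace("</pubdate>", "")
    return content
```

With the dictionary created before the loop, a trailing line that matches none of the markers no longer wipes out the values collected from earlier lines.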
