How can I view an HTML file that I have written in Python?

I am currently learning HTML in my Python class. My assignment is to ask the user for their name and a short bio of themselves and print it on a web-page.
I believe my code is correct, but after I've created everything, how do I see whether it actually worked? When I try to open the file location with Google Chrome, nothing happens.
def main():
    f = open('Userbio', 'w')
    f.write("<html>" + "\n")
    f.write("<heads>" + "\n")
    name = input("What is your name?")
    bio = input("Please write a sentence about yourself!")
    f.write("<naming>" + "\n")
    f.write(name)
    f.write("/naming>" + "\n")
    fout.write("</heads>" + "\n")
    fout.write("</body>" + "\n")
    fout.write("</html>" + "\n")
    fout.close()
    f.close()

main()
Basically, after this program runs and the user inputs their information, I'm trying to figure out how to open the resulting web page.
This works when I write the file by hand in a regular editor like Notepad: I save it and open it with Chrome and can see my web page. Why not when Python writes it?

Change f = open('Userbio', 'w') to f = open('Userbio.html', 'w').
You need to rename <heads> to <head> and add the <body> tag.
You will also need to move the <naming> tag out of the <head> and into the <body> tag in order to see any text inside of the naming tag.
Python itself cannot render HTML, so you view the .html file with a browser like Chrome; from Python you can launch the user's default browser with the standard-library webbrowser module.
If you want to be able to parse and manipulate the html before writing to file, there are several libraries, like BeautifulSoup, that can do that for you.
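Putting those fixes together, a minimal sketch of the corrected script might look like this (the helper name, the heading/paragraph tags, and the webbrowser call are my additions for illustration, not part of the original assignment):

```python
import os
import webbrowser

def write_bio_page(name, bio, path='Userbio.html'):
    """Write a minimal, well-formed bio page and return its path."""
    with open(path, 'w') as f:
        # The .html extension lets browsers recognise the file type.
        f.write("<html>\n<head>\n<title>User bio</title>\n</head>\n<body>\n")
        # Visible text belongs inside <body>, not <head>.
        f.write("<h1>" + name + "</h1>\n")
        f.write("<p>" + bio + "</p>\n")
        f.write("</body>\n</html>\n")
    return path

# To collect the input and view the result in the default browser:
# name = input("What is your name? ")
# bio = input("Please write a sentence about yourself! ")
# webbrowser.open('file://' + os.path.abspath(write_bio_page(name, bio)))
```

The webbrowser call opens the finished page directly, so there is no need to hunt for the file location by hand.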

Related

I'm trying to write/paste data I have scraped into a specific Excel spreadsheet cell, but I'm not sure how to go about doing it

I have written some code that allows me to scrape data from a website using Selenium and then display specific data in CMD. I have tried a few methods to write this data or append it to a spreadsheet, but none have worked; I keep receiving invalid syntax errors.
I have attempted three methods, this being my third:
with open("results.xlsx", "a") as f:
    for i in range(values):
        f.write(values "\n")
I really thought this block would work and don't know why invalid syntax is being displayed.
import csv
import os
from selenium import webdriver  # part of code

ticker = input("Enter your ticker: ")
url = "http://financials.morningstar.com/cash-flow/cf.html?t=" + ticker.upper()
print(url)

browser = webdriver.Firefox()
browser.get(url)

values_element = browser.find_elements_by_xpath("//*[@id='data_i97']")
values = [x.text for x in values_element]
print('Cash Flows:')
print(values[0])

with open("results.xlsx", "a") as f:
    for i in range(values):
        f.write(values "\n")

browser.close()
I was expecting this block of code
with open("results.xlsx", "a") as f:
    for i in range(values):
        f.write(values "\n")
to write to my excel file. I didn't care where it wrote to on the excel spreadsheet, I just want it to write anywhere as proof of concept, but it doesn't. Instead, when I run my py file in CMD I receive
line 21
    f.write(values "\n")
                   ^
SyntaxError: invalid syntax
I'm not sure where to go from here, I am very new to coding, this is my first attempt. Any well written detailed sources on writing scraped data to excel or an explanation on why my code isn't working would be greatly appreciated.
The syntax error is because there is no operator between values and "\n"; Python needs an explicit + to concatenate. Also, range() expects an integer, not a list. It should be:
for i in range(len(values)):
    f.write(values[i] + "\n")
Note that appending plain text to a file named results.xlsx will not produce a valid Excel workbook, even once the syntax is fixed.
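A sketch of a standard-library alternative: write the scraped values to a .csv file, which Excel opens directly (the file name and one-column layout here are illustrative, not from the original question):

```python
import csv

def append_values(values, path="results.csv"):
    """Append one scraped value per row to a CSV file that Excel can open."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for value in values:
            writer.writerow([value])

append_values(["100", "200", "300"])
```

For writing into a specific cell of a real .xlsx workbook, a library such as openpyxl would be the usual choice.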

Read lines from file and store as variable to use in another function before repeating

I'm fairly new to Python and I'm currently stuck trying to improve my script. The script performs a lot of operations using Selenium to automate a manual task: it opens two pages, searches for an email, fetches data from that page, and sends it to another tab. I need help feeding the script a text file containing a list of email addresses, one line at a time, and using each line to search the web page. What I need is the following:
Open file "test.txt"
Read first line in text file and store this value for use in another function.
perform function which uses line from text file as its input value.
Add "Completed" behind the first line in the text file before moving to the next
Move to and read the next line in the text file, store it as a variable, and repeat from step 3.
I'm not sure how I can do this.
Here is a snippet of my code at the time:
def fetchEmail():
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"  # to make sure a .txt extension is used
    line = f.readline()
    for line in f:
        print(line)  # <-- How can I store the value here for use later?
        break

def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    # <-- This is where I want to paste the current line each time the function is called.
    emailSearch.send_keys(fetchEmail, Keys.RETURN)
    return main
I would appreciate any help how I can solve this.
It's a bit tricky to diagnose your particular issue, since you don't actually provide real code. However, probably one of the following will help you:
Return the list of all lines from fetchEmail, then search for all of them in send_keys:
def fetchEmail():
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"
    with open(fileNameExt) as f:
        return f.read().splitlines()

def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    emailSearch.send_keys(fetchEmail(), Keys.RETURN)
    # ...
Yield them one at a time, and look for them individually:
def fetchEmail():
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"
    with open(fileNameExt) as f:
        for line in f:
            yield line.strip()

def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    for email in fetchEmail():
        emailSearch.send_keys(email, Keys.RETURN)
        # ...
I don't recommend using globals; there is usually a better way to share information between functions, such as putting both in a class instance, or having one function call the other as I show above. Still, here is an example of how you could save the value when the first function is called and retrieve the result in the second function at an arbitrary later time:
emails = []

def fetchEmail():
    global emails
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"
    with open(fileNameExt) as f:
        emails = f.read().splitlines()

def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    emailSearch.send_keys(emails, Keys.RETURN)
    # ...
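None of the variants above covers step 4 of the question, marking processed lines. A sketch of one way to do it, rewriting the whole file after each address is handled (the helper name is mine; the "Completed" suffix follows the question's wording):

```python
def mark_completed(path, email):
    """Append ' Completed' to the line matching `email` and rewrite the file."""
    with open(path) as f:
        lines = f.read().splitlines()
    with open(path, "w") as f:
        for line in lines:
            if line == email:
                line = line + " Completed"
            f.write(line + "\n")

# Usage: after performSearch() has handled an address, call
# mark_completed("test.txt", email)
```

Rewriting the whole file is fine for a short address list; lines cannot be edited in place in a text file, so this is the simplest reliable approach.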

My variable does not include all copied text using sublime.get_clipboard() sublime text plugin

I'm trying to create a sublime text 3 plugin that copies code from files like .html, .css, .js & .php and then pastes it in a txt document.
My problem is that sometimes when I copy code from a document with a lot of code and paste it into a text document, some of the code disappears.
import sublime
import sublime_plugin

def save_code(file):
    sublime.active_window().active_view().set_status("", "Your file is saved inside _templates " + file + "!")
    text_file = open("path_to_file/_templates" + file, "w")
    my_copy = str(sublime.get_clipboard())
    text_file.write("%s" % my_copy)
    text_file.close()

class MycopypasteCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        self.view.window().show_input_panel("Add filename:", "", save_code, None, None)
Finally I found a solution. The problem was that when the copied text contained characters like "Åäö", text_file.write("%s" % my_copy) could not write it to the file. The fix was simple: add encoding="utf-8" to the open() call, i.e. open("path_to_file/_templates" + file, "w", encoding="utf-8").
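The effect is easy to reproduce outside Sublime: with an explicit UTF-8 encoding, non-ASCII clipboard text round-trips correctly regardless of the platform's default encoding (the file name below is just for illustration):

```python
clip = "Åäö and some code"

# Explicit UTF-8 avoids relying on the platform's default encoding,
# which is what made the plugin drop text containing "Åäö".
with open("clip_test.txt", "w", encoding="utf-8") as f:
    f.write(clip)

with open("clip_test.txt", encoding="utf-8") as f:
    assert f.read() == clip
```

Without the encoding argument, open() falls back to the locale's default, which on some systems cannot represent these characters and raises UnicodeEncodeError.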

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory of PDF files (images). How can I efficiently extract the text from all the files in the directory? So far I tried:
import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working; it takes a lot of time (some of my documents have 600 pages). Additionally: a) I do not know how to handle the directory traversal efficiently. b) I would like to add a page separator, say <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
So, how can I apply the extract_txt function to every element of a directory that ends with .pdf, write the same files to another directory in .txt format, and add a page separator to the OCR text extraction?
I was also curious about using Google Docs for this task: is it possible to programmatically use Google Docs to solve the aforementioned text-extraction problem?
UPDATE
Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>), after reading Roland Smith's answer I tried:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =', i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =', i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part: instead of printing, it would be more useful to save all the output into a file. So I tried to redirect the output to a file:
sys.stdout = open("test.txt", "w")
print('\n<begin page pos =', i, '>\n')
sys.stdout.close()

text = textract.process(str(outfname), method='tesseract')
os.remove(outfname)  # clean up.

sys.stdout = open("test.txt", "w")
print(str(text, 'utf8'))
sys.stdout.close()

sys.stdout = open("test.txt", "w")
print('\n<end page pos =', i, '>\n')
sys.stdout.close()
Any idea of how to make the page extraction/separator trick and saving everything into a file?...
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
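Following the pdfinfo/pdfseparate route, a rough sketch (assuming poppler-utils is installed and on PATH; the helper names are mine, not part of poppler):

```python
import re
import subprocess

def parse_page_count(pdfinfo_output):
    """Pull the number out of the 'Pages: N' line that pdfinfo prints."""
    m = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if m is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(m.group(1))

def split_into_pages(pdf_path):
    """Count the pages with pdfinfo, then split the file with pdfseparate."""
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True, check=True)
    n = parse_page_count(info.stdout)
    # %d in the output pattern becomes the page number: page-1.pdf, page-2.pdf, ...
    subprocess.run(["pdfseparate", pdf_path, "page-%d.pdf"], check=True)
    return n
```

parse_page_count is a pure function, so it can be checked without poppler present; split_into_pages shells out and needs the tools installed.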
Edit 2:
If you need a file, write a file:
import os

from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    txtfname = pdf_file[:-4] + '.txt'  # Assuming PDF file name ends with ".pdf"
    with open(txtfname, 'w') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            outfname = 'page{:03d}.pdf'.format(i)
            with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes, so decode before concatenating strings.
            text = textract.process(outfname, method='tesseract').decode('utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(outfname)  # clean up.
            print(text)

Python not printing beautifulsoup data to .txt file (or I can't find it)

I'm trying to put all the anchor text on a page in a txt file
print(anchor.get('href'))
with open('file.txt', 'a') as fd:
    fd.write(anchor.get('href') + '\n')
and the script executes with no errors but I cannot find file.txt anywhere on my computer. Am I missing out on something really obvious?
With file open mode a (append) the file will be created if it doesn't already exist, otherwise writes will be appended to the end of the file. As written in the question, the file will be created in the current directory of the running process... look there.
from bs4 import BeautifulSoup
import requests

response = requests.get('http://httpbin.org/')
soup = BeautifulSoup(response.text, 'html.parser')

with open('file.txt', 'a') as fd:
    # Skip anchors without an href, which would otherwise make join() fail on None.
    links = (link.get('href') for link in soup.find_all('a') if link.get('href'))
    fd.write('\n'.join(links) + '\n')
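To see exactly where a relative path like 'file.txt' ends up, print its absolute form; os.path.abspath resolves it against the process's current working directory:

```python
import os

with open("file.txt", "a") as fd:
    fd.write("hello\n")

# Prints the full path of the file that was just written,
# i.e. the current working directory plus "file.txt".
print(os.path.abspath("file.txt"))
```

Running the script from a different directory will create the file in that directory, which is the usual reason it seems to have vanished.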
Try setting a full path for the file name, so you know exactly where to look.
Example:
print(anchor.get('href'))
with open('/home/wbk/file.txt', 'a') as fd:
    fd.write(anchor.get('href') + '\n')
And yes, 'a' will create a new file if one doesn't exist, although relying on that is not good practice; 'a' is meant to append data to the end of an existing file, as the docs describe.
The following code works for me:
with open("example.txt", "a") as a:
    a.write("hello")
Are you sure the file isn't on your computer? How have you checked? In my case I work with the Eclipse IDE, so I have to refresh the Eclipse file explorer to see it.
Try this:
with open("prova.txt", "a") as a:
    a.write("ciao")
try:
    open("prova.txt", "r")
except OSError:
    print("file doesn't exist")
