Extracting with Python [closed] - python

I am currently working on a Python program that extracts information from a stock website:
http://markets.usatoday.com/custom/usatoday-com/html-mktscreener.asp
I need to extract all of the columns from Symbol through Volume. Before this program I had to write a bash script that downloaded the page every minute for an hour to get 60 pages, which I have done. But I do not understand how to extract the information so that I can insert it into a MySQL database.
import libxml2
import sys
import os
import commands
import re
import MySQLdb
from xml.dom.minidom import parse, parseString
# for converting dict to xml
from cStringIO import StringIO
from xml.parsers import expat

def get_elms_for_atr_val(tag, atr, val):
    lst = []
    elms = dom.getElementsByTagName(tag)
    # ............
    return lst

# get all text recursively to the bottom
def get_text(e):
    lst = []
    # ............
    return lst

def extract_values(dm):
    lst = []
    l = get_elms_for_atr_val('table', 'class', 'most_actives')
    # ............
    #   get_text(e)
    # ............
    return lst
I'm very new to Python, and that skeleton is the best I have so far. There are 60 downloaded HTML pages, and I believe I only need to extract the information from one page. Or at least, if I can get started on one page, I can figure out a loop for the others and get that information ready to be used in MySQL.
Any help to get me started is appreciated!

Use a robust HTML parser instead of the xml module, as the latter will reject malformed documents, which the page at the URL you posted appears to be. Here's a quick solution:
from lxml.html import parse
import sys

def process(htmlpage):
    tree = parse(htmlpage).getroot()
    # Helper function
    xpath_to_column = lambda expr: [el.text for el in tree.xpath(expr)]
    symbol = xpath_to_column('//*[@id="idcquoteholder"]/table/tr/td[1]/a')
    price = xpath_to_column('//*[@id="idcquoteholder"]/table/tr/td[3]')
    volume = xpath_to_column('//*[@id="idcquoteholder"]/table/tr/td[6]')
    return zip(symbol, price, volume)

def main():
    for filename in sys.argv[1:]:
        with open(filename, 'r') as page:
            print process(page)

if __name__ == '__main__':
    main()
You will have to elaborate on this example a bit, as some elements (like the "Symbol") are nested inside span or a nodes, but the spirit is: use XPath to query and extract the column contents. Add columns as needed.
Hint: use the Chrome Inspector or Firebug to get the right XPath.
EDIT: pass all the filenames on the command line to this script. If you need to process each file separately, remove the for loop in main().
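Once process() returns the (symbol, price, volume) rows, getting them into MySQL is mostly a single executemany() call. Here is a minimal sketch using the MySQLdb module you already import; the database name, table name, and column names are assumptions, so adjust them to your own schema:

import MySQLdb

def save_rows(rows):
    # rows is the list of (symbol, price, volume) tuples returned by process()
    db = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="stocks")
    cur = db.cursor()
    cur.executemany(
        "INSERT INTO most_actives (symbol, price, volume) VALUES (%s, %s, %s)",
        rows)
    db.commit()
    db.close()

Call save_rows(process(page)) inside the loop in main() to load each of the 60 pages as it is parsed.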

Related

Implementing a class error and returning 0 text [closed]

I hope this finds you well. I am struggling a bit and I have two questions. First, when I try to use my class it prints something like <__main__.Transform object at 0x02C08790>. I have read other answers about this and do not quite understand them. My second question is that when I run the code below, it reports that there are no items in the PDF I saved earlier. I think I am passing the document incorrectly, but I am unsure. I have tested each piece of code separately and both work independently, but not together. Any help is greatly appreciated.
import os
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io
import PyPDF2
from PyPDF2 import PdfFileMerger, PdfFileReader
import pandas as pd

class Transform:
    # method for extracting data and merging it into one pdf
    def __init__(self):
        try:
            source_dir = os.getcwd()
            merger = PdfFileMerger()
            for item in os.listdir(source_dir):
                if item.endswith("pdf"):
                    merger.append(item)
        except Exception:
            print("unable to collect")
        finally:
            merger.write("test.pdf")
            merger.close()

    # running that method extract
    def extract(self):
        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO()
        converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
        page_interpreter = PDFPageInterpreter(resource_manager, converter)
        with open('test.pdf', 'rb') as fh:
            for page in PDFPage.get_pages(fh,
                                          caching=True,
                                          check_extractable=True):
                page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
        # close open handles
        converter.close()
        fake_file_handle.close()

print(Transform)
I expect your code should merge all the PDFs in the current directory into test.pdf and print the text of that merged PDF. It needs just two corrections. First, replace
print(Transform)
with
print(Transform().extract())
Transform by itself is a class; you need to create (instantiate) an object out of it using Transform(). Then you can call methods on it, like .extract(), which runs the method defined in that class. You may read about classes and objects here.
Second, add
return text
as the last line of the extract(self) method body. This return is necessary so that extract() hands back the text it has extracted from the PDF; in the original code it does the work but never returns the result.
You may run the full corrected code here.
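To make the difference concrete, here is a tiny standalone illustration (not the pdfminer code itself) of both corrections: the method has to return its result, and the class has to be instantiated before the method is called.

class Example:
    def extract(self):
        text = "extracted text"
        return text            # without this return, extract() would give back None

print(Example)                 # prints <class '__main__.Example'>, i.e. the class object itself
print(Example().extract())     # instantiate first, then call the method: prints "extracted text"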

Parsing the Skills Section in a Resume in Python

I am trying to parse the skills section of a resume in Python. I found a library by Omkar Pathak called pyresparser, and I was able to extract a PDF resume's contents into a resume.txt file.
However, I was wondering how I can go about extracting only the skills section from the resume into a list, and then possibly writing that list into a query.txt file.
I'm reading the contents of resume.txt into a list and then comparing that to a list called skills, which stores the contents extracted from a skills.csv file. Currently, the resulting skill list comes out empty, and I was wondering how I can go about getting the matched skills into that list. Is this the correct approach? Any help is greatly appreciated, thank you!
import string
import csv
import re
import sys
import importlib
import os
import spacy
from pyresparser import ResumeParser
import pandas as pd
import nltk
from spacy.matcher import matcher
import multiprocessing as mp

def main():
    data = ResumeParser("C:/Users/infinitel88p/Downloads/resume.pdf").get_extracted_data()
    print(data)

    # Added encoding utf-8 to prevent unicode error
    with open("C:/Users/infinitel88p/Downloads/resume.txt", "w", encoding='utf-8') as rf:
        rf.truncate()
        rf.write(str(data))
        print("Resume results are getting printed into resume.txt.")

    # Extracting skills
    resume_list = []
    skill_list = []
    data = pd.read_csv("skills.csv")
    skills = list(data.columns.values)
    resume_file = os.path.dirname(__file__) + "/resume.txt"
    with open(resume_file, 'r', encoding='utf-8') as f:
        for line in f:
            resume_list.append(line.strip())
    for token in resume_list:
        if token.lower() in skills:
            skill_list.append(token)
    print(skill_list)

if __name__ == "__main__":
    main()
An easy way (but not an efficient one) to do this:
Keep a set of all possible relevant skills in a text file. For the words in the skills section of the resume, or for all the words in the resume, take each word and check whether it matches any of the words from the text file. If a word matches, that skill is present in the resume. This way you can identify the set of skills present in the resume. A minimal sketch of this matching is shown below.
For better identification you can go further and use naive Bayes classification or unigram probabilities to extract more relevant skills.
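For example, here is a minimal sketch of the word-matching approach, assuming a plain-text skills file with one skill per line (the file name skills.txt is a placeholder). Comparing lowercased tokens keeps case differences from hiding a match:

def find_skills(resume_text, skills_path="skills.txt"):
    # Load the known skills into a set for fast membership tests.
    with open(skills_path, encoding="utf-8") as f:
        known_skills = {line.strip().lower() for line in f if line.strip()}
    # Keep every resume word that appears in the skills set.
    words = resume_text.lower().split()
    return sorted({word for word in words if word in known_skills})

with open("resume.txt", encoding="utf-8") as f:
    print(find_skills(f.read()))

Note that this only catches single-word skills; multi-word skills would need phrase matching on top of it.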

How to hover over the word in word cloud [closed]

I have created a word cloud from a text file in Python using the pytagcloud module, but I have created it as a .jpeg file. I want the words in the cloud to become active when I hover the cursor over them.
When I click a particular word in the word cloud, the corresponding sentence in the passage should be highlighted. How can I do that? Please help me; I have project work on this topic.
I created a GUI with options to import and read a text file. After reading the text file, I need to produce the word cloud from the passage in the .txt file.
def wordcloud(self):
    from pytagcloud import create_tag_image, create_html_data, make_tags, LAYOUT_HORIZONTAL, LAYOUTS, LAYOUT_MIX, LAYOUT_VERTICAL, LAYOUT_MOST_HORIZONTAL, LAYOUT_MOST_VERTICAL
    from pytagcloud.lang.counter import get_tag_counts
    from pytagcloud.colors import COLOR_SCHEMES
    import webbrowser
    #import Tkinter
    #from tkFileDialog import askopenfilename
    #filename=askopenfilename()
    #with open(filename,'r') as f:
    #    text=f.read()
    #def create_tag_cloud(text):
    words = nltk.word_tokenize(self._contents)
    doc = " ".join(d for d in words[:70])
    tags = make_tags(get_tag_counts(doc), maxsize=100)
    create_tag_image(tags, 'sid.jpeg', size=(1600, 1200), fontname='Philosopher', layout=LAYOUT_MIX, rectangular=True)
    webbrowser.open('sid.jpeg')
Without seeing your code there is nothing to correct.
However, the best approach here is to output your tag cloud as HTML and CSS, so you end up with something like their demo.
Once you have the HTML, one approach is to use JavaScript to react to a word being clicked and highlight every occurrence of that word in the body.
There are many other approaches that may be better suited, but without any context it's impossible to comment, I'm afraid. Regardless, don't render your tag cloud as a JPEG: it is static and cannot be interactive.
Edit1: Code provided
Have a look at the test_create_html_data() method in the PyTagCloud tests available on GitHub to get an idea of how to output HTML and CSS.
Just a quick note on your code: Python will re-run all those import statements on every call of your wordcloud() method. Pull them out to the top of the module, something like this (I started the adaptation for you):
from pytagcloud import (create_tag_image, create_html_data,
                        make_tags, LAYOUT_MIX)
from pytagcloud.lang.counter import get_tag_counts
from pytagcloud.colors import COLOR_SCHEMES
import webbrowser

# ...the rest of your code...

def wordcloud(self):
    words = nltk.word_tokenize(self._contents)
    doc = " ".join(d for d in words[:70])
    tags = make_tags(get_tag_counts(doc), maxsize=100)
    data = create_html_data(tags, (1600, 1200), layout=LAYOUT_MIX, fontname='Philosopher', rectangular=True)

Multithreaded screen scraping help needed

I'm relatively new to Python, and I'm working on a screen-scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple of minutes, and the other two take hours each. Those two look up information on particular stock symbols that I have in a CSV file; there are 4,000+ symbols that I'm using. I know enough to know that the vast majority of the time is spent in IO over the wire. It's essential that I get these down to half an hour each (or better; is that too ambitious?) for this to be of any practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below, with the conceptually non-essential sections abbreviated. I'm reading many threads on making multiple calls/threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction to pursue, based on the structure of what I have so far? It would be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data-download procedures in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
import misc modules

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.Optiondata]

def extract_options(TableRowsFromBeautifulSoup):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup)):
        optionsList.append(StockOption(data parsed from TableRows arg))
    return optionsList

def run_proc():
    symbolList = read in csv file of tickers
    for symb in symbolList:
        webStr = #write the connection string
        try:
            with urllib.request.urlopen(webStr) as url:
                page = url.read()
            soup = BeautifulSoup(page)
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                    expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass
            else:
                raise

    opts = [0] * (len(optionsRows))
    for option in range(0, len(optionsRows)):
        opts[option] = optionsRows[option].ForCsv()

    # Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()
There are some mistakes in the abbreviated code you have given, so it is a little hard to follow. If you could show and check more of the code, it would be easier to understand your problem.
From the code and the problem description, I have some advice to share with you:
In the run_proc() function, a webpage is read for every symbol. If the URLs are the same or some URLs are repeated, how about reading those webpages just once, writing them to memory or disk, and then analyzing the page contents for every symbol? That will save on repeated downloads.
BeautifulSoup is easy to write code with, but a little slow in performance. If lxml can do your work, it will save a lot of time on analyzing the webpage contents.
Hope this helps.
I was pointed in the right direction from the following post (thanks to the authors btw):
How to scrape more efficiently with Urllib2?
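The common thread in that post is to issue the per-symbol requests concurrently instead of one after another, since the time is dominated by waiting on the network. Here is a minimal sketch with the standard-library thread pool; the URL pattern and symbol list are placeholders, not the actual ones from the question:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(symbol):
    # Placeholder URL; substitute the real connection string built in run_proc().
    url = "http://example.com/options?s=" + symbol
    with urllib.request.urlopen(url) as resp:
        return symbol, resp.read()

symbols = ["AAPL", "MSFT", "GOOG"]  # in practice, read these from the tickers CSV
with ThreadPoolExecutor(max_workers=20) as pool:
    for symbol, page in pool.map(fetch, symbols):
        # Hand each page to BeautifulSoup / extract_options() exactly as before;
        # only the downloads are parallelised here, the parsing stays unchanged.
        pass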

Script to pull html and completely de-relativise it. (single file offline)

I am trying to learn Python and also create a web utility. One task I am trying to accomplish is creating a single HTML file which can be run locally but links to everything it needs to look like the original web page. (If you are going to ask why I want this, it's because it may act as part of a utility I am creating, or, if not, just for education.) So I have two questions, a theoretical one and a practical one:
1) Is this possible, for visual (as opposed to functional) purposes? Can an HTML page work offline while linking to everything it needs online? Or is there something fundamental about having the HTML file itself execute on the web server which does not allow this to be possible? How far can I go with it?
2) I have started a Python script which de-relativises (I made that one up) linked elements on an HTML page, but I am a noob, so most likely I missed some elements or attributes which would also link to outside resources. After trying a few pages, I have noticed that the one in the code below does not work properly: there appears to be a .js file which is not linking correctly (the first of many problems to come). Assuming the answer to my first question was at least a partial yes, can anyone help me fix the code for this website?
Thank you.
Update: I missed the script tag at first, but even after adding it the page still does not work correctly.
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin

site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"

def download(site):
    response = urllib2.urlopen("http://"+site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    for element in tags_to_derealitivise:
        for tag in active_html.xpath(str(element+"[@"+"src"+"]")):
            tag.attrib["src"] = urljoin("http://"+site, tag.attrib.get("src"))
        for tag in active_html.xpath(str(element+"[@"+"href"+"]")):
            tag.attrib["href"] = urljoin("http://"+site, tag.attrib.get("href"))
    return lxml.html.tostring(active_html)

active_html = ""
tags_to_derealitivise = ("//img", "//a", "//link", "//embed", "//audio", "//video", "//script")
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open(output_filename, "w")
output_file.write(active_html)
output_file.close()
Furthermore, I could make the code more thorough by checking all of the elements...
It would look kind of like this, but I do not know the exact way to iterate through all of the elements. That is a separate problem, and I will most likely figure it out by the time anyone responds:
def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    for element in active_html.xpath:
        for tag in active_html.xpath(str(element+"[@"+"src"+"]")):
            tag.attrib["src"] = urljoin("http://"+site, tag.attrib.get("src"))
        for tag in active_html.xpath(str(element+"[@"+"href"+"]")):
            tag.attrib["href"] = urljoin("http://"+site, tag.attrib.get("href"))
    return lxml.html.tostring(active_html)
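For what it's worth, lxml already ships a helper that does this whole job: make_links_absolute() on an lxml.html element rewrites every link-carrying attribute (href, src, and friends) against a base URL, so the per-tag loop is not strictly needed. A minimal sketch of the same derealitivise() function built on it:

import lxml.html

def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    # Rewrites every relative href/src (img, a, link, script, etc.) in one call.
    active_html.make_links_absolute("http://" + site)
    return lxml.html.tostring(active_html)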
Update
Thanks to Burhan Khalid's solution, which seemed too simple to be viable at first glance, I got it working. The code is so simple that most of you will likely not need it, but I will post it anyway in case it helps:
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin

site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"

def download(site):
    response = urllib2.urlopen("http://"+site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = html_input.replace('<head>', '<head><base href="http://'+site+'">')
    return active_html

active_html = ""
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open(output_filename, "w")
output_file.write(active_html)
output_file.close()
Despite all of this, and its great simplicity, the .js object running on the website I have listed in the script still will not load correctly. Does anyone know if this is possible to fix?
While I am trying to make only the HTML file offline, while using the linked resources over the web.
This is a two step process:
Copy the HTML file and save it to your local directory.
Add a BASE tag in the HEAD section, and point the href attribute of it to the absolute URL.
Since you want to learn how to do it yourself, I will leave it at that.
@Burhan has an easy answer using a <base href="..."> tag in the <head>, and it works, as you have found out. I ran the script you posted, and the page downloaded fine. As you noticed, some of the JavaScript now fails. This can be for multiple reasons.
If you are opening the HTML file as a local file:/// URL, the page may not work. Many browsers heavily sandbox local HTML files, not allowing them to perform network requests or examine local files. (A workaround for this part is sketched below.)
The page may perform XmlHTTPRequests or other network operations against the remote site, which will be denied for cross-domain scripting reasons. Looking in the JS console, I see the following error for the page you posted:
XMLHttpRequest cannot load http://www.script-tutorials.com/menus.php?give=menu. Origin http://localhost:8000 is not allowed by Access-Control-Allow-Origin.
Unfortunately, if you do not have control of www.script-tutorials.com, there is no easy way around this.
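For the first issue (the file:/// sandbox), one simple workaround is to serve the saved file from a tiny local web server rather than opening it directly from disk. Here is a minimal sketch using the Python 2 standard library, to match the rest of the post; note that it does not help with the cross-domain error above:

import SimpleHTTPServer
import SocketServer

# Run this from the directory containing output.html, then browse to
# http://localhost:8000/output.html instead of opening the file directly.
httpd = SocketServer.TCPServer(("", 8000), SimpleHTTPServer.SimpleHTTPRequestHandler)
print "Serving on http://localhost:8000/ ..."
httpd.serve_forever()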
