Word count statistics on a web page - python

I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python based solution.
While it is easy to parse a specific website using, say, BeautifulSoup and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
If I understand correctly, robots used by the likes of Google (GoogleBot?) are able to extract data from any website to determine keyword density. My scenario is similar: obtain the info related to the words that define what the website is about (i.e. after removing JS, links and filler).
My question is, are there any libraries or web APIs that would allow me to get statistics of meaningful words from any given page?

There is no ready-made API for this, but there are a few libraries you can use as tools.
You would count the meaningful words and record them over time.
You could also start from something like this:
string link = "http://www.website.com/news/Default.asp";
string itemToSearch = "Word";
// Fetch the page content first, then count occurrences of the word in it
string content = new System.Net.WebClient().DownloadString(link);
int count = new Regex(itemToSearch).Matches(content).Count;
MessageBox.Show(count.ToString());
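A rough Python sketch of the same idea that fetches the page and computes per-word counts, density, and counts inside links (assuming the requests and beautifulsoup4 packages; the URL is a placeholder):

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup

url = "http://www.example.com/"  # placeholder URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Drop script and style blocks so only visible text is counted
for tag in soup(["script", "style"]):
    tag.decompose()

words = re.findall(r"[a-z']+", soup.get_text().lower())
counts = Counter(words)
total = len(words)

# Words that also appear inside links
link_words = Counter(
    w
    for a in soup.find_all("a")
    for w in re.findall(r"[a-z']+", a.get_text().lower())
)

for word, n in counts.most_common(20):
    print(word, n, round(n / total, 4), link_words.get(word, 0))

This still leaves the "filler" (stop-word) removal to you; a fixed stop-word list or a library such as NLTK can handle that part.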

There are multiple libraries that deal with more advanced processing of web articles; this question should probably be marked as a duplicate of this one.

Related

How do I scrape content from a dynamically generated page using selenium and python?

I have made many attempts, and all fail to record the data I need in a reliable and complete manner. I understand the extreme basics of Python and Selenium for automating simple tasks, but in this case the content is dynamically generated and I am unable to find the correct way to access and subsequently record all the data I need.
The URL I am looking to scrape content from is structured similar to the following:
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
In particular I am trying to grab all the info using something like -
browser.find_elements_by_xpath('//*[@id="products-container"]')
Is this the right approach? How do I access specific sub-elements of this element (and all elements on the same path)?
I have read that I might need beautifulsoup4, but I am unsure the best way to approach this.
Would the best approach be to use xpaths? If so is there a way to iterate through all elements and record all the data within or do I have to specify each and every data point that I am after?
Any assistance to point me in the right direction would be extremely helpful as I am still learning and have hit a roadblock in my progress.
My end goal is a list of all product names, prices and any other data points that I deem relevant based on the specific exercise at hand. If I could find the correct way to access the data points I could then store them and compare/report on them as needed.
Thank you!
I think you are looking for something like
browser.find_elements_by_css_selector('[class*="product-information__Title"]')
This should find all elements whose class attribute contains that string.
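As a rough follow-up sketch, you could then iterate over the matches and pair each title with its price; the "product-information__Price" fragment below is an assumption about the page's class names, not a verified selector:

titles = browser.find_elements_by_css_selector('[class*="product-information__Title"]')
# Assumed class fragment for the price elements -- inspect the page to confirm
prices = browser.find_elements_by_css_selector('[class*="product-information__Price"]')

for title, price in zip(titles, prices):
    print(title.text, price.text)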

Is there a way to extract information from contracts using ML, with contract files and targeted strings as inputs and outputs?

I'm doing a work-related project in which I should study whether we could extract certain fields of information (e.g. contract parties, start and end dates) from contracts automatically.
I am quite new to working with text data and am wondering whether those pieces of information could be extracted with ML by having the whole contract as input and the information as output, without tagging or annotating the whole text.
I understand that the extraction should be run separately for each targeted field.
Thanks!
First question - how are the contracts stored? Are they PDFs or text-based?
If they're PDFs, there are a handful of packages that can extract text from a PDF (e.g. pdftotext).
Second question - is the data you're looking for in the same place in every document?
If so, you can extract the information you're looking for (like start and end dates) from a known location in the contract. If not, you'll have to do something more sophisticated. For example, you may need to do a text search for "start date", if the same terminology is used in every contract. If different terminology is used from contract to contract, you may need to work to extract meaning from the text, which can be done using some sophisticated natural language processing (NLP).
Without more knowledge of your problem or a concrete example, it's hard to say what your best option may be.
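As a rough illustration of the "extract the text, then search for a known phrase" route (assuming pdfminer.six for the PDF-to-text step; the file name and the "start date" wording are assumptions):

import re
from pdfminer.high_level import extract_text  # from the pdfminer.six package

text = extract_text("contract.pdf")  # hypothetical file name

# Look for something like "Start date: 01.01.2020" -- the exact wording is an assumption
match = re.search(r"start date[:\s]+(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})", text, re.IGNORECASE)
if match:
    print("Start date:", match.group(1))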

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of javascript on it, and I'm interested in parsing through the javascript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields a list of scripts that I can loop over to search for the text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data in JSON.
I would also check if the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve the page's SEO and make it integrate better with social network sharing.
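A minimal sketch of the first suggestion, pulling a JSON object out of a script tag (the variable name pageData and the URL are placeholders, and a simple regex like this will not handle nested braces in every case):

import json
import re

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.example.com/").text, "html.parser")

for script in soup("script", {"type": "text/javascript"}):
    text = script.string or ""
    # Look for something like: var pageData = {...};  (the variable name is a placeholder)
    match = re.search(r"var\s+pageData\s*=\s*(\{.*?\})\s*;", text, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        print(data)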

Python libraries that can tokenize wikipedia pages

I'd like to tokenize Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I want to be able to then import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
How many points each country awarded one another in the Eurovision Song contest of 2008:
http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
List of currencies and the countries in which they circulate (a many-to-many relationship):
http://en.wikipedia.org/wiki/List_of_circulating_currencies
Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written with wikipedia's brand of markup which is used to render them out. There are many wikipedia-specific tags and syntax used in the raw data form. The HTML might almost be the easier solution as I can just use BeautifulSoup.
Anyone know of a better way of tokenizing? I feel that I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, if I could find a way to output these pages as XML, the table data might not be tokenized enough and would require further processing.
Since Wikipedia is built on MediaWiki, there is an API you can exploit. There is also Special:Export that you can use.
Once you have the raw data, then you can run it through mwlib to parse it.
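For example, the API can return the raw wikitext of a page, which you could then hand to mwlib or another wikitext parser; a minimal sketch with requests:

import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "parse",
        "page": "List_of_circulating_currencies",
        "prop": "wikitext",
        "format": "json",
    },
)
wikitext = resp.json()["parse"]["wikitext"]["*"]
print(wikitext[:500])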
This goes more in the semantic web direction, but DBpedia allows querying parts of Wikipedia's data (a community conversion effort) with SPARQL. This makes it theoretically straightforward to extract the needed data, though dealing with RDF triples might be cumbersome.
Furthermore, I don't know if DBpedia yet contains any data that is of interest to you.
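If it does, a minimal SPARQL sketch against the public endpoint might look like this (the dbo:currency property is an assumption about how the currency/country relationship is modelled):

import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?currency WHERE { ?country dbo:currency ?currency } LIMIT 10
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["country"]["value"], row["currency"]["value"])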

What method should I employ to extract keywords from a URL?

I am working on keyword extraction. The system takes a URL as input and the output is supposed to be keywords describing the contents of the URL. We are considering only the textual parts for now. I would like to know what methods I can employ for extracting keywords from URLs and how they compare with each other. Suggestions and pointers to other resources are welcome.
I think you can use this method:
Read the site with urllib ( http://docs.python.org/library/urllib2.html?highlight=urllib2#module-urllib2 ), then remove the tags to create plain text of the site.
Then check which words are used most often and build a top-ten list (or keep the counts).
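A minimal sketch of that approach (Python 3's urllib.request stands in for the linked urllib2 docs; the URL is a placeholder and the crude regex tag-stripping is only a rough approximation):

import re
import urllib.request  # urllib2 in Python 2, as in the linked docs
from collections import Counter

html = urllib.request.urlopen("http://www.example.com/").read().decode("utf-8", "ignore")

# Crude tag removal to get plain text
text = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.DOTALL | re.IGNORECASE)
text = re.sub(r"<[^>]+>", " ", text)

words = [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3]
print(Counter(words).most_common(10))  # the "top ten"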
