Python: find all URLs which contain a string

I am trying to find all pages that contain a certain string in the name, on a certain domain. For example:
www.example.com/section/subsection/406751371-some-string
www.example.com/section/subsection/235824297-some-string
www.example.com/section/subsection/146783214-some-string
What would be the best way to do it?
The number before "-some-string" can be any 9-digit number. I could write a script that loops through all possible 9-digit numbers and tries to access the resulting URL, but I keep thinking there should be a more efficient way to do this, especially since I know that overall there are only about 1,000 pages that end with that string.

I understand your situation: the numeric value before "-some-string" is a kind of object id for that website (for example, this question has the id 39594926, and the URL is stackoverflow.com/questions/39594926/python-find-all-urls-which-contain-string).
I don't think there is a way to find all valid numbers unless the website has a listing (or parent) page that lists them. Take Stack Overflow as an example again: on the question list page you can see all these question ids.
If you can provide the website, I can have a look and try to find the 'pattern' of these numbers. For some simple websites, that number is just an incrementing counter used to identify objects (a user, a question, or anything else).

If these articles are all linked from one page, you could parse the HTML of that index page, since all the links will be contained in href attributes.
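For example, a minimal sketch using requests and BeautifulSoup (the index URL below is a placeholder):

import requests
from bs4 import BeautifulSoup

# Hypothetical index page that links to the articles
index_url = "http://www.example.com/section/subsection/"
response = requests.get(index_url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every link whose URL contains the target suffix
matching_urls = set()
for a in soup.find_all("a", href=True):
    if "-some-string" in a["href"]:
        matching_urls.add(a["href"])

for url in sorted(matching_urls):
    print(url)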

Related

How do I scrape content from a dynamically generated page using selenium and python?

I have made many attempts, and all fail to record the data I need in a reliable and complete manner. I understand the basics of Python and Selenium for automating simple tasks, but in this case the content is dynamically generated and I am unable to find the correct way to access, and subsequently record, all the data I need.
The URL I am looking to scrape content from is structured similar to the following:
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
In particular, I am trying to grab all the info using something like:
browser.find_elements_by_xpath('//*[@id="products-container"]')
Is this the right approach? How do I access specific sub-elements of this element (and all elements under the same path)?
I have read that I might need beautifulsoup4, but I am unsure of the best way to approach this.
Would the best approach be to use XPaths? If so, is there a way to iterate through all elements and record all the data within, or do I have to specify each and every data point that I am after?
Any assistance to point me in the right direction would be extremely helpful as I am still learning and have hit a roadblock in my progress.
My end goal is a list of all product names, prices and any other data points that I deem relevant based on the specific exercise at hand. If I could find the correct way to access the data points I could then store them and compare/report on them as needed.
Thank you!
I think you are looking for something like
browser.find_elements_by_css_selector('[class*="product-information__Title"]')
This should find all elements whose class attribute contains that string.
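A minimal sketch building on that selector (the price selector prefix is an assumption; inspect the page for the real class names):

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)  # give the dynamically generated menu time to render
browser.get("https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu")

# Class names are partly generated, so match on the stable prefix
titles = browser.find_elements_by_css_selector('[class*="product-information__Title"]')
prices = browser.find_elements_by_css_selector('[class*="product-information__Price"]')  # assumed prefix

for title, price in zip(titles, prices):
    print(title.text, price.text)

browser.quit()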

How to solve "different pages get the same result" when writing a crawler?

I am using a spider to crawl "https://nj.zu.ke.com/zufang". I found that the URL composition is roughly https://{0}.zu.ke.com/zufang/pg{1}.
Each listing has a unique listing number, which forms part of the link to its specific page.
So I set up two MySQL columns: 1. an auto-increment ID as the primary key, and 2. the listing number, which is unique and not null.
However, during the crawling process, when changing the page number in "pg{1}", the proportion of repeated listing numbers is extremely large: at 30 items per page and roughly 100 pages, the final result is only a bit more than 300 items. (At first I thought I hadn't written the code correctly. Later I checked the loop with a single thread to see if there was a problem with the responses, and found with print that many IDs returned on different pages were duplicates.)
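For illustration, a simplified version of that single-threaded check might look like this (the selector for listing links is an assumption):

import requests
from bs4 import BeautifulSoup

seen_ids = set()
total = 0
for page in range(1, 101):
    url = "https://nj.zu.ke.com/zufang/pg{}".format(page)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Assumed markup: each listing links to /zufang/<listing-number>.html
    for a in soup.select('a[href*="/zufang/"]'):
        listing_id = a["href"].split("/")[-1].replace(".html", "")
        total += 1
        seen_ids.add(listing_id)
    print(page, total, len(seen_ids))  # len(seen_ids) stays far below total, i.e. duplicates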
Later, I thought it might be a problem with the recommendation system, so I logged in and sent the cookie with the requests. The result was roughly the same.
Also, if you refresh directly after visiting a page, the results are different.
How can I solve this problem? Thanks.

How to scrape a string representation of a nested list?

I am trying to record the DataCamp courses I have done by using a web scraper. First, kudos to this guy, who has built something along the lines of what I need.
However, recently DataCamp has made changes to their website and now the comprehensive course data is not in JSON anymore, but seems to be stored as a string representation of a nested list.
If you take a look at the source of one of the chapter pages, the first element in the body is:
<body><script>window.PRELOADED_STATE = "["~#iM",["preFetchedData",["^0",["course",["^0",["status","SUCCESS","data",["^ ","id",58,"title","Introduction to R ...
So the original scraper was able to rely on JSON and extract the information via the dict keys. There is an id field, so I should probably be able to extract the data once I have the underlying data as a list of lists.
I tried parsing the string representation via ast.literal_eval, but that did not work. Any idea how I could make this list usable?
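A minimal sketch of one way to approach it, assuming the value assigned to window.PRELOADED_STATE is valid JSON and you have already extracted it into raw_state; find_after_key is a hypothetical helper:

import json

# raw_state: the text scraped from window.PRELOADED_STATE (assumed to be valid JSON);
# ast.literal_eval fails because the data is JSON (true/false/null, "^ " keys), not a Python literal
state = json.loads(raw_state)

def find_after_key(node, key, found=None):
    """Walk the nested list and collect the value that follows each occurrence of *key*."""
    if found is None:
        found = []
    if isinstance(node, list):
        for i, item in enumerate(node):
            if item == key and i + 1 < len(node):
                found.append(node[i + 1])
            find_after_key(item, key, found)
    return found

print(find_after_key(state, "id"))     # e.g. [58, ...]
print(find_after_key(state, "title"))  # e.g. ['Introduction to R ...', ...]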

How to iterate over everything in a python-docx document?

I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of for the inline items, but I expect you could get pretty far with paragraph.runs, since all inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you need. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get you what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
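The recipe from that issue looks roughly like this (lightly adapted, assuming a reasonably recent python-docx; the linked issue has the canonical version, and example.docx is a placeholder):

from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """Yield each Paragraph and Table child of *parent*, in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("expected a Document or _Cell")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

doc = Document("example.docx")
for block in iter_block_items(doc):
    print(type(block).__name__)  # Paragraph or Table, in document order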
Assuming doc is of type Document, what you want to do is run three separate iterations (see the sketch below):
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes in the document; those are stored on the Document object.
Here is the documentation for more info: python-docx
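A minimal sketch of those three iterations (the filename is a placeholder):

from docx import Document

doc = Document("example.docx")

for para in doc.paragraphs:      # 1. paragraphs
    print("PARA:", para.text)

for table in doc.tables:         # 2. tables
    for row in table.rows:
        print("ROW:", [cell.text for cell in row.cells])

for shape in doc.inline_shapes:  # 3. inline shapes (images)
    print("SHAPE:", shape.type, shape.width, shape.height)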

Word count statistics on a web page

I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python-based solution.
While it is easy to parse a specific website using, say, BeautifulSoup and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
If I understand correctly, robots used by the likes of Google (Googlebot?) are able to extract data from any website to determine the keyword density. My scenario is similar: obtain the info related to the words that define what the website is about (i.e., after removing JS, links and filler).
My question is, are there any libraries or web APIs that would allow me to get statistics on meaningful words from any given page?
There are no APIs, but there are a few libraries you can use as tools.
You should count the meaningful words and record them as you go.
You can also start from something like this:
string link = "http://www.website.com/news/Default.asp";
string itemToSearch = "Word";
// Download the page first, then count occurrences of the term in its HTML
string html = new WebClient().DownloadString(link);
int count = new Regex(itemToSearch).Matches(html).Count;
MessageBox.Show(count.ToString());
There are multiple libraries that deal with more advanced processing of web articles; this question should be a duplicate of this one.
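Since the question asks for a Python-based solution, here is a rough equivalent sketch using requests, BeautifulSoup and collections.Counter (which tags count as "filler" is an assumption; adjust for your pages):

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup

url = "http://www.example.com/"  # placeholder: the page you want statistics for
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Strip scripts, styles and assumed "filler" elements so only visible text is counted
for tag in soup(["script", "style", "nav", "header", "footer"]):
    tag.decompose()

words = re.findall(r"[a-z']+", soup.get_text().lower())
counts = Counter(words)
total = len(words)

# Words that appear inside links, for the "count in links" statistic
link_words = Counter(
    w for a in soup.find_all("a") for w in re.findall(r"[a-z']+", a.get_text().lower())
)

for word, count in counts.most_common(20):
    print(word, count, round(count / total, 4), link_words[word])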
