I am trying to extract two pieces of text:
"60 Days" from website A:
https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667
"Lifetime Access" from website B:
https://www.vitalsource.com/products/teaming-with-nutrients-jeff-lowenfels-v9781604695175
I tried absolute XPath expressions, but both return nothing:
for A
//div[2]/div[1]/label[1]
for B
//div[1]/span[1]/label[1]
nor this CSS path:
.u-weight--bold.type--magic9.u-inline
I believe the text I want to extract is not generated by JavaScript, so I don't know what else I can do to solve this problem.
Please assist!
Thank you in advance.
The information you need is rendered by JavaScript, but it is also available in JSON format inside the page. All you need to do is select the element that contains the data, parse it with the json library, and access the desired field:
import json
import pprint

data = response.xpath(
    '//div[@data-react-class="vs.CurrentRegionOnlyWarningModal"]'
    '/@data-react-props'
).extract_first()
json_data = json.loads(data)
pprint.pprint(json_data)
{'selectedVariant': None,
'variants': [{'asset_id': 88677112,
'created_at': '2016-10-07T14:17:10.000Z',
'deleted_at': None,
'distributable': True,
'downloadable_duration': 'perpetual',
'full_base_currency': 'USD',
'full_base_price': '107.5',
'full_currency': 'USD',
'full_price': '107.5',
'full_price_converted': False,
'id': 476831514,
'import_id': 'a3b99a3de0df7d0442253798cba8b8ea',
'in_store': True,
'item_type': 'Single',
....
'online_duration': '60 days',
So, you can access it normally:
for x in json_data['variants']:
    print(x['online_duration'])
It's important to note that this site has several variants for each product, and more than one field carries this same string. You have to understand how the site organizes its products to get the right data, but this approach should be enough to access all the information you need.
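For example, once the JSON is parsed you can filter the variants list for the duration you are after. A minimal sketch on a trimmed-down sample (the field names come from the pprint output above; the second variant is made up for illustration):

```python
# Sample shaped like the "variants" list shown above, trimmed to the
# two fields we need (the 'perpetual' entry is a hypothetical sibling).
json_data = {
    'variants': [
        {'id': 476831514, 'online_duration': '60 days'},
        {'id': 476831515, 'online_duration': 'perpetual'},
    ]
}

# Pick the first variant whose duration matches what you want to display.
wanted = next(
    (v for v in json_data['variants'] if v['online_duration'] == '60 days'),
    None,
)
print(wanted['id'])
# -> 476831514
```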
It is generated by JavaScript, unfortunately, so you would most likely need to use something like Selenium for this.
I fetch all the details from the desired website but am unable to get some specific information; please guide me.
Targeted domain: https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html
My code is:
response.xpath('//ul[@class="product-size"]//li/text()').extract()
I need to fetch this data. Thanks!
E-commerce websites often have data in JSON format in the page source and then have JavaScript unpack it on the user's end. In that case you can open the page source with JavaScript disabled and search for keywords (like a specific size). Here the data can be found with regular expressions:
import re
import json

data = re.findall(r'window\.assets\.sizesMap = (\{.+?\});', response.body_as_unicode())
json.loads(data[0])
Out:
{'16': {'uk': '0k', 'us': '0.5'},
'17': {'uk': '1k', 'us': '1'},
'18': {'uk': '2k', 'us': '2.5'},
...}
Edit: more precisely, you probably want a different part of the JSON, but nevertheless the answer is more or less the same:
data = re.findall(r'window\.assets\.sizes = (\{(?:.|\n)+?\});', response.body_as_unicode())
json.loads(data[0].replace("'", '"'))  # replace single quotes with double quotes
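To see the pattern work end to end, here is the same regex run against a small stand-in for the page source (the inline map below is an assumption about the page's shape, not a copy of it):

```python
import json
import re

# A stand-in for response.body_as_unicode(): a page source that inlines
# the sizes map the same way the answer above describes.
page = '''
<script>
window.assets.sizesMap = {"16": {"uk": "0k", "us": "0.5"},
                          "17": {"uk": "1k", "us": "1"}};
</script>
'''

data = re.findall(r'window\.assets\.sizesMap = (\{(?:.|\n)+?\});', page)
sizes = json.loads(data[0])
print(sizes['16']['uk'])
# -> 0k
```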
The data you want to fetch is loaded by JavaScript; the tag says so explicitly with class="js-size-value ".
If you want to get it, you will need to use a rendering service. I suggest Splash: it is simple to install and simple to use. You will need Docker to install Splash.
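Once Splash is running (its default port is 8050), you fetch pages through its HTTP API instead of directly; render.html returns the page after the JavaScript has run. A sketch of building such a request URL:

```python
from urllib.parse import urlencode

# Splash exposes an HTTP API; render.html returns the rendered page.
# localhost:8050 is Splash's default address when run via Docker.
target = 'https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html'
render_url = 'http://localhost:8050/render.html?' + urlencode(
    {'url': target, 'wait': 1.0}  # wait a second for scripts to finish
)
print(render_url)
```

You would then download render_url (e.g. with a Scrapy Request) and run the original size XPath against the rendered HTML.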
I'm trying to pull very specific elements from a dictionary of RSS data that was fetched using the feedparser library, then place that data into a new dictionary so it can be called on later using Flask. The reason I'm doing this is because the original dictionary contains tons of metadata I don't need.
I have broken the process down into simple steps but keep getting hung up on creating the new dictionary! As it is below, it does create a dictionary object, but it's not comprehensive: it only contains a single article's title, URL, and description; the rest are absent.
I've tried switching to other RSS feeds and had the same result, so it would appear the problem is either the way I'm trying to do it, or there's something wrong with the structure of the list generated by feedparser.
Here's my code:
from html.parser import HTMLParser
import feedparser

def get_feed():
    url = "http://thefreethoughtproject.com/feed/"
    front_page = feedparser.parse(url)
    return front_page

feed = get_feed()

# make a dictionary to update with the vital information
posts = {}
for i in range(0, len(feed['entries'])):
    posts.update({
        'title': feed['entries'][i].title,
        'description': feed['entries'][i].summary,
        'url': feed['entries'][i].link,
    })
print(posts)
Ultimately, I'd like to have a dictionary like the following, except that it keeps going with more articles:
[{'Title': 'Trump Does Another Ridiculous Thing',
'Description': 'Witnesses looked on in awe as the Donald did this thing',
'Link': 'SomeNewsWebsite.com/Story12345'},
{...},
{...}]
Something tells me it's a simple mistake: perhaps the syntax is off, or I'm forgetting a small but important detail.
The code example you provided updates the same dict over and over again, so you only get one dict at the end of the loop. What your example data shows is that you actually want a list of dictionaries:
# make a list to collect the vital information
posts = []
for entry in feed['entries']:
    posts.append({
        'title': entry.title,
        'description': entry.summary,
        'url': entry.link,
    })
print(posts)
It seems the problem is that you are using a dict instead of a list: you keep updating the same keys of the dict, so each iteration overrides the content added before.
I think that the following code will solve your problem:
from html.parser import HTMLParser
import feedparser

def get_feed():
    url = "http://thefreethoughtproject.com/feed/"
    front_page = feedparser.parse(url)
    return front_page

feed = get_feed()

# make a list to collect the vital information
posts = []  # It should be a list
for i in range(0, len(feed['entries'])):
    posts.append({
        'title': feed['entries'][i].title,
        'description': feed['entries'][i].summary,
        'url': feed['entries'][i].link,
    })
print(posts)
As you can see, the code above defines the posts variable as a list; then, in the loop, we add dicts to that list, which gives you the data structure you want.
I hope this helps.
I am using Django and I am trying to come up with a query that will let me do the following.
I have a column in the database called url whose values are very long: basically a domain name followed by a long list of query parameters.
Eg:
https://www.somesite.com/something-interesting-digital-cos-or-make-bleh/?utm_source=something&utm_medium=email&utm_campaign=biswanyam%20report%20-%20digital%20cos%20or%20analog%20prey&ut
http://www.anothersite.com/holly-moly/?utm_source=something&utm_medium=email&tm_campaign=biswanyam%20report%20-%20digital%20cos%20or%20analog%20prey&ut
https://www.onemoresite.com/trinkle-star/?utm_source=something&utm_medium=email&utm_campaign=biswanyam%20report%20-%20digital%20cos%20or%20analog%20prey&ut
https://www.somesite.com/nothing-interesting-bleh/?utm_source=something&utm_medium=email&utm_campaign=biswanyam%20report%20-%20digital%20cos%20or%20analog%20prey&ut
I want a Django query that gives me an annotated count of URLs with the same domain name, regardless of the query parameters in the URL.
So essentially this is what I am looking for,
[
    {'url': 'https://www.somesite.com/something-interesting-digital-cos-or-make-bleh', 'count': 127},
    {'url': 'http://www.anothersite.com/holly-moly', 'count': 87},
    {'url': 'https://www.onemoresite.com/trinkle-star', 'count': 94},
    {'url': 'https://www.somesite.com/nothing-interesting-bleh', 'count': 72},
]
I tried this query:
Somemodel.objects.filter(url__iregex=r'http.*\/\?').values('url').annotate(hcount=Count('url'))
This doesn't work as expected: it matches the entire URL, including the query parameters, instead of only the domain name. Can someone tell me how to accomplish this, or at least point me in the right direction? Thanks.
This might not be possible, because you cannot group by partial information from a field. If you really want to achieve this, consider changing your schema: store the URL and the parameters separately, as two model fields. You could then have a method (or, to make it look like an attribute, use the @property decorator) that combines them and returns the whole URL. It wouldn't be too hard to split them in a migration/script to fit the new schema.
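Until the schema is split, one workaround is to strip the query string in Python rather than in the ORM and count in memory; a sketch on sample rows (the URLs below are stand-ins for Somemodel.objects.values_list('url', flat=True)):

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

# Sample rows standing in for the url column.
urls = [
    'https://www.somesite.com/a/?utm_source=x&utm_medium=email',
    'https://www.somesite.com/a/?utm_source=y',
    'http://www.anothersite.com/b/?utm_campaign=z',
]

def strip_query(url):
    """Drop the query string and fragment, keeping scheme://host/path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

counts = Counter(strip_query(u) for u in urls)
print(counts)
```

Note this pulls every row into Python, so it only makes sense for modest table sizes; for large tables the schema change above is the better fix.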
I am a newbie at Python and I am working on a personal project of mine. The project involves grabbing data through LDAP, which returns JSON-like data.
Sample of data:
cn=abcd
[[('uid=abcd,OU=active,OU=employees,OU=people,O=xxxx.com',
   {'status': ['Active'],
    'co': ['India'],
    'cn': ['abcd'],
    'msDS-UserAccountDisabled': ['FALSE'],
    'departmentNumber': ['122839'],
    'objectClass': ['top', 'person', 'organizationalPerson', 'user', 'inetOrgPerson', 'ciscoperson'],
    'userPrincipalName': ['surahuja'],
    'publishpager': ['n'],
Let's say the data contains something like:
'directreportees' : ['2345','1234','6789']
Right now, the search filter is something like:
for item in directreportees:
    search_filter = "(employeenumber=" + item + ")"
I need to put the search filter in a form where I can specify that the number of direct reportees is greater than 0. Is that possible through search filters, or do I have no option but to grab the data and test it myself?
Secondly, I need to search the department too. For example, I need to check whether a value such as A.B.NOS. C contains a specific sequence such as NOS. Can I put this check in the search filter too?
Does anyone have a nifty way to get all the three-letter alphabetic currency codes (an example of the ones I mean is at http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm) into a list in Python 2.5? Note that I don't want a screen-scraping version, as the code has to work offline; the website is just an example of the codes.
It looks like there should be a way using the locale library, but it's not clear to me from reading the documentation, and there must be a better way than copy-pasting those into a file!
To clear the question up more: in C#, the following code solved the same problem very neatly using the built-in locale libraries:
CultureInfo.GetCultures(CultureTypes.SpecificCultures)
    .Select(c => new RegionInfo(c.LCID).CurrencySymbol)
    .Distinct()
I was hoping there might be an equivalent in python. And thanks to everyone who has provided an answer so far.
Not very elegant or nifty, but you can generate the list once and save it to use later:
import urllib
import re

url = "http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm"
print re.findall(r'\<td valign\="top"\>\s+([A-WYZ][A-Z]{2})\s+\</td\>', urllib.urlopen(url).read())
output:
['AFN', 'EUR', 'ALL', 'DZD', 'USD', 'EUR', 'AOA', 'ARS', 'AMD', 'AWG', 'AUD',
...
'UZS', 'VUV', 'EUR', 'VEF', 'VND', 'USD', 'USD', 'MAD', 'YER', 'ZMK', 'ZWL', 'SDR']
Note that you'll need to prune everything after X.. as they are apparently reserved names, which means you'll get one rogue entry (SDR, the last element) that you can simply delete yourself.
You can get currency codes (and other data) from GeoNames. Here's some code that downloads the data (save the file locally to achieve the same result offline) and populates a list:
import urllib2

data = urllib2.urlopen('http://download.geonames.org/export/dump/countryInfo.txt')
ccodes = []
for line in data.read().split('\n'):
    if not line.startswith('#'):
        line = line.split('\t')
        try:
            if line[10]:
                ccodes.append(line[10])
        except IndexError:
            pass
ccodes = list(set(ccodes))
ccodes.sort()
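Since the point is to work offline, the same parsing runs unchanged against a saved copy of the file. A self-contained sketch on a fabricated sample (the tab-separated layout mirrors what the loop above indexes; column 10 is the currency code):

```python
# A fabricated two-line stand-in for countryInfo.txt: a '#' comment
# header, then one tab-separated country row (column 10 = currency).
sample = '\n'.join([
    '# ISO\tISO3\t...',
    'AD\tAND\t020\tAN\tAndorra\tAndorra la Vella\t468\t77006\tEU\t.ad\tEUR',
])

ccodes = []
for line in sample.split('\n'):
    if not line.startswith('#'):
        fields = line.split('\t')
        if len(fields) > 10 and fields[10]:
            ccodes.append(fields[10])
print(ccodes)
# -> ['EUR']
```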