Get data from href but it is not usable - python

Recently I started working on a project in which I send data from an HTML page using an <a> tag. On the backend I am using the Python webapp2 framework. When I get the data, it displays perfectly, but when I compare it with a string for further use, the comparison fails.
I know the data arrives as Unicode. I converted it to UTF-8, but it still does not work.
Here is the HTML. Suppose I sent "item 2" as itemname:
Click me
The code I am using to fetch the data is:
def get(self, nam, des):
    nam = self.request.get('itemname')
    itemDesc = self.request.get('itemdescription')
    name = nam.encode('utf-8')
    if name == "item 2":
        self.response.write("Equal")
I also tried it without encoding, but it still does not work. It shows the value of itemname perfectly, but the comparison never succeeds. Please help me find where I am making a mistake.

Looks like two issues: there is a space before the expected value in the link, and the string is URL-quoted.
To fix the space:
Click me
And then to work with the encoding, add import urllib and change the line
name = nam.encode('utf-8')
to
name = urllib.unquote(nam.encode('utf-8'))
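Putting both fixes together, a minimal sketch of the handler (the class name and route are placeholders, and .strip() is an extra guard against the leading space):

import urllib
import webapp2

class ItemHandler(webapp2.RequestHandler):
    def get(self):
        nam = self.request.get('itemname')
        # unquote the percent-encoded value, then trim stray whitespace
        name = urllib.unquote(nam.encode('utf-8')).strip()
        if name == "item 2":
            self.response.write("Equal")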

Related

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page:
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both variationValues and asinToDimentionIndexMap so I can associate the IndexMap numbers with the variationValues ones.
Another person on the site (thanks for the help, btw) suggested doing it this way:
script = response.xpath('//script/text()').extract_first()
import re
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part: we get everything that's a 'script' as a string and then capture everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some material about it didn't help much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimentionIndexMap? (Maybe using some regular expressions in the middle to pull data out of a big string.) Or could someone explain a little what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming the challenge comes when you hit errors accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little so you can better understand the structure of your particular JSON file. Take a look at the link below: on the right-hand side you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON objects and combine them as you wish.
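For instance, a short sketch of that last step (assuming the keys inside variationValues are named size_name and color_name, as described above):

import json

variation = json.loads(variationValues)
index_map = json.loads(asinToDimensionIndexMap)

# e.g. "B01KWIUH5M": [0, 0] -> size '8M US', color 'Teal'
for asin, (size_idx, color_idx) in index_map.items():
    size = variation['size_name'][size_idx]
    color = variation['color_name'][color_idx]
    print(asin, size, color)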

How to find the correct website link depending on a composed string with python

I have a list of first and last names that is used to compose website links, but some users don't follow the naming rule, so their website name doesn't match the expected one.
Here is an example: let's say the first name is John and the last name is Paul. In that case the website URL should be johnpaul.com. But sometimes users put pauljohn.com or john-paul.com instead.
I would like to automate some processing on these websites. The vast majority of the URLs are correct, but some are not. When one is not correct, I just google the expected URL, and the real one is generally the first or second result I get on Google.
I was wondering if it is possible to make a Google request and check the first 2 or 3 links with Python to get the actual URL. Any idea how to do something like this?
My code now looks like this:
for value in arr:
    try:
        print requests.get(url).status_code, url
    except Exception as e:
        print url, " is not available"
I'd go with endswith():

string = "bla.com"
strfilter = ('.com', '.de')  # tuple of accepted suffixes
if not string.endswith(strfilter):
    raise ValueError("400 Bad Request")

This way you filter out the .com, .net, etc. errors.
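Building on that, a sketch for the original matching problem (the candidate patterns come straight from the question; the scheme and timeout are assumptions):

import requests

def find_site(first, last):
    candidates = ["%s%s.com" % (first, last),   # johnpaul.com
                  "%s%s.com" % (last, first),   # pauljohn.com
                  "%s-%s.com" % (first, last)]  # john-paul.com
    for domain in candidates:
        url = "http://" + domain
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return url
        except requests.RequestException:
            continue
    return None

print find_site("john", "paul")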

python lxml xpath AttributeError (NoneType) with correct xpath and usually working

I am trying to migrate a forum to phpbb3 with python/xpath. Although I am pretty new to python and xpath, it is going well. However, I need help with an error.
(The source file has been downloaded and processed with tagsoup.)
Firefox/Firebug show xpath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b
(in my script without tbody)
Here is an abbreviated version of my code:
forumfile="morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"
XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)
XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"
XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"
for p in allposts:
    unreg = 0
    username = None
    username = p.find(XUSER).text  # this is where it goes haywire
When the loop hits user "tompson" / position()=11 at the end of the file, I get
AttributeError: 'NoneType' object has no attribute 'text'
I've tried a lot of try except else finallys, but they weren't helpful.
I am getting much more information later in the script such as date of post, date of user registry, the url and attributes of the avatar, the content of the post...
The script works for hundreds of other files/sites of this forum.
This is no encode/decode problem, and it is not "limited" to the XUSER part. If I hardcode the username, the date of registry fails instead; if I skip those, the text of the post (code below) fails...
#text of getpost
text = etree.tostring(p.find(XTEXT),pretty_print=True)
Now, this whole error would make sense if my XPath were wrong. However, all the other files, and the first users in this file, work; it is only this one at position()=11.
Is position() incapable of going beyond 10? I don't think so.
Am I missing something?
Question answered!
I have found the answer...
I must have been very tired when I tried to fix it and came here to ask for help; I did not see something quite obvious, and the way I posted my problem, it was not visible either.
The HTML I downloaded and processed with tagsoup had an additional tag at position 11. It was not visible on the website and screwed with my XPath.
(It is probably crappy HTML generated by the forum, in combination with tagsoup's attempt to make it parseable.)
Out of >20000 files, fewer than 20 are afflicted; this one just happened to be the first.
Additionally, sometimes the information is in table[4] and other times in table[5]. I did account for this and wrote a function to determine the correct table. Although I tested the function a LOT and thought it was working correctly (hence did not include it above), it was not.
So I made a better xpath:
'/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
And, although this is not related, I ran into another problem with unexpected encoding in the HTML file (not UTF-8), which was fixed by adding:
parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)
I am now confident that, after adjusting for the strange additional and multiple tags, my code will work on all files...
Still, I will be looking into lxml.html. As I mentioned in the comment, I have never used it before, but if it is more robust and can handle the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to fix the few files screwing with my current script...
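For reference, a rough sketch of what the lxml.html route could look like (untested against this forum's markup; it reuses the improved XPath from above):

from lxml import html

# lxml.html tolerates broken markup, so tagsoup may become unnecessary
t = html.parse("morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html")
allposts = t.xpath('/html/body/table[tr/td[@width="20%"]]/tr[position()>1]')
for p in allposts:
    user = p.find('td[1]/a[3]/b')
    if user is not None:  # guard against rows broken by stray tags
        print user.text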

Strange urllib2.urlopen() error with variable vs string

I am having some strange behavior while using urllib2 to open a URL and download a video.
I am trying to open a video resource and here is an example link:
https://zencoder-temp-storage-us-east-1.s3.amazonaws.com/o/20130723/b3ed92cc582885e27cb5c8d8b51b9956/b740dc57c2a44ea2dc2d940d93d772e2.mp4?AWSAccessKeyId=AKIAI456JQ76GBU7FECA&Signature=S3lvi9n9kHbarCw%2FUKOknfpkkkY%3D&Expires=1374639361
I have the following code:
mp4_url = ''
# response_body is a json response that I get the mp4_url from
if response_body['outputs'][0]['label'] == 'mp4':
    mp4_url = response_body['outputs'][0]['url']
if mp4_url:
    logging.info('this is the mp4_url')
    logging.info(mp4_url)
    # if I add the line directly below this then it works just fine
    mp4_url = 'https://zencoder-temp-storage-us-east-1.s3.amazonaws.com/o/20130723/b3ed92cc582885e27cb5c8d8b51b9956/b740dc57c2a44ea2dc2d940d93d772e2.mp4?AWSAccessKeyId=AKIAI456JQ76GBU7FECA&Signature=S3lvi9n9kHbarCw%2FUKOknfpkkkY%3D&Expires=1374639361'
    mp4_video = urllib2.urlopen(mp4_url)
    logging.info('succesfully opened the url')
The code works when I add the designated line, but when I don't, it gives me an HTTP Error 403: Forbidden message, which makes me think it is mangling mp4_url somehow. The confusing part is that when I check the logging line for mp4_url, it is exactly what I hardcoded in there. What could the difference be? Are there some characters in there that may be disrupting it? I have tried converting it to a string by doing:
mp4_video = urllib2.urlopen(str(mp4_url))
But that didn't do anything. Any ideas?
UPDATE:
With the suggestion to use print repr(mp4_url) it is giving me:
u'https://zencoder-temp-storage-us-east-1.s3.amazonaws.com/o/20130723/b3ed92cc582885e27cb5c8d8b51b9956/b740dc57c2a44ea2dc2d940d93d772e2.mp4?AWSAccessKeyId=AKIAI456JQ76GBU7FECA&Signature=S3lvi9n9kHbarCw%2FUKOknfpkkkY%3D&Expires=1374639361'
And I suppose the difference (it is a unicode string, hence the u'' prefix) is what is causing the error, but what would be the best way to handle this?
UPDATE II:
It turned out that I did need to cast it to a string, but also that the source I was getting the link from (an encoded video) needed nearly a 60-second delay before it could serve that URL. That is why it kept working when I hardcoded it: by then the delay had already passed. Thanks for the help!
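A simple way to handle such a delay (a sketch; the retry count and sleep interval are arbitrary) is to poll the URL until it stops returning 403:

import time
import urllib2

def open_when_ready(url, retries=12, delay=5):
    # the encoder may need up to ~60s before the URL is servable
    for _ in range(retries):
        try:
            return urllib2.urlopen(str(url))
        except urllib2.HTTPError as e:
            if e.code != 403:
                raise
            time.sleep(delay)
    raise Exception("URL never became available: %s" % url)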
It would be better to simply dump the response obtained; that way you can check what response_body['outputs'][0]['label'] actually evaluates to. Note that you initialize mp4_url to '', and an empty string is falsy, so if mp4_url: is only entered once the label check has actually assigned a URL.
You may want to verify that the initial if statement, where you check response_body['outputs'][0]['label'], is matching what you expect.
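For example (a sketch; it assumes logging is already configured, as in the question):

import json
import logging

# dump the full payload so the actual label and url values are visible
logging.info(json.dumps(response_body, indent=2))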

exporting wikipedia with Python

I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version

link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"

def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732@gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text

print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should go in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So if you use catname correctly, it will return the form again, this time with the pages from the category filled in. You are then supposed to click "Submit" again, which returns the XML you want.
I think doing this in code would be too complicated. It would be easier to use the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
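For illustration, a minimal sketch of the API route with plain requests (continuation handling is omitted; the libraries above wrap all of this for you):

# -*- coding: utf-8 -*-
import requests

api = "https://tr.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Kategori:Matematik",  # Turkish namespace prefix for Category
    "cmlimit": 500,
    "format": "json",
}
r = requests.get(api, params=params)
for member in r.json()["query"]["categorymembers"]:
    print member["title"]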
Sorry, my original answer was horribly flawed; I misunderstood the original intent.
I did some more experimenting because I was curious. It seems the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the category's pages to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems Wikipedia requires the pages you wish to download to be specified explicitly. Granted, the documentation doesn't appear to be very thorough on that matter.
I would suggest parsing the page for the pages within the category and then explicitly downloading those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages of other pages.
As an aside, it would probably be faster to use the actual corpus of Wikipedia data, which is available for download.
Good luck!
