associative list python - python

I am parsing an HTML form with BeautifulSoup. Basically I've got around 60 input fields, mostly radio buttons and checkboxes. So far this works with the following code:
from BeautifulSoup import BeautifulSoup

x = open('myfile.html', 'r')
out = open('outfile.csv', 'w')
soup = BeautifulSoup(x.read())
values = soup.findAll('input', checked="checked")
# echoes some output like ('name', 1) and ('value', 4)
for cell in values:
    # the following line is my problem!
    statement = cell.attrs[0][1] + ';' + cell.attrs[1][1] + ';\r'
    out.write(statement)
out.close()
x.close()
As indicated in the code, my problem is where the attributes are selected, because the HTML template is ugly and mixes up the order of the attributes that belong to an input field. I am interested in name="somenumber" value="someothernumber". Unfortunately my attrs[1] approach does not work, since name and value do not occur in the same order throughout my HTML.
Is there any way to access the resulting BeautifulSoup list associatively?
Thx in advance for any suggestions!

My suggestion is to make values a dict. If soup.findAll returns a list of tuples as you seem to imply, then it's as simple as:
values = dict(soup.findAll('input',checked="checked"))
After that you can simply refer to the values by their attribute name, like what Peter said.
Of course, if soup.findAll doesn't return a list of tuples as you've implied, or if your problem is that the tuples themselves are being returned in some weird way (such that instead of ('name', 1) it would be (1, 'name')), then it could be a bit more complicated.
On the other hand, if soup.findAll returns one of a certain set of data types (dict or list of dicts, namedtuple or list of namedtuples), then you'll actually be better off because you won't have to do any conversion in the first place.
...Yeah, after checking the BeautifulSoup documentation, it seems that findAll returns an object that can be treated like a list of dicts, so you can just do as Peter says.
http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Oh yeah, if you want to enumerate through the attributes, just do something like this:
for cell in values:
    for attribute, value in cell.attrs:
        out.write(attribute + ';' + str(value) + ';\r')

I'm fairly sure you can use the attribute name like a key for a hash:
print cell['name']
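Putting both suggestions together, here is a minimal sketch of the associative approach (assuming BeautifulSoup 3, where a tag's attrs is a list of (name, value) pairs that can be turned into a dict):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open('myfile.html').read())
out = open('outfile.csv', 'w')
for cell in soup.findAll('input', checked="checked"):
    attrs = dict(cell.attrs)  # attribute order no longer matters
    out.write(attrs.get('name', '') + ';' + attrs.get('value', '') + ';\r')
out.close()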

Related

How do I get a normal list with strings instead of generator objects when I perform a googlesearch

Hi I am trying to get the first url of a google search based on queries in a list. For the sake of simplicity I am going to use the same code as a similar question 2 years prior.
from googlesearch import search

list_of_queries = ["Geeksforgeeks", "stackoverflow", "GitHub"]
results = []
for query in list_of_queries:
    results.append(search(query, tld="co.in", num=1, stop=1, pause=2))
print(results)
Now this returns a list of generator objects. A solution was found to print out the list of results by adding
for result in results:
    print(list(result))
However I want the results list to be in the form of a list of strings in order to web scrape the urls for data. One solution I found was to add
results_str = []
for result in results:
    results_str.append(list(result))
When I print results_str I get this as an output:
[['https://www.geeksforgeeks.org/'], ['https://stackoverflow.com/'], ['https://github.com/']]
As one can see I cannot even use results_str directly as a list of urls to webscrape because of the additional brackets around each url. I thought I could work around it by removing the brackets by following this answer and thus adding
results_str_new = [s.replace('[' and ']', '') for s in results_str]
But this simply results in an AttributeError
AttributeError: 'list' object has no attribute 'replace'
Either way, even if I did get it to work, it seems unnecessarily complicated to do all this just to convert a list of generator objects into a list of URL strings to scrape, so I was wondering if there are any alternatives. I know that one of my options is to use Selenium, but I don't really want the hassle of an instance of Chrome opening whenever I run my script.
Thanks in advance
You are getting back a list of lists of strings. To flatten that, you can use a list comprehension like this:
results_str = [url for result in results for url in result]
or you can switch from append to extend if you don't want to use a list comprehension. extend adds the items of each list to the result, whereas append inserts each list as a single element.
results_str = []
for result in results:
    results_str.extend(result)
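As a quick check with the nested sample output from the question, either approach flattens it into a plain list of URL strings:

results = [['https://www.geeksforgeeks.org/'], ['https://stackoverflow.com/'], ['https://github.com/']]
results_str = [url for result in results for url in result]
print(results_str)
# ['https://www.geeksforgeeks.org/', 'https://stackoverflow.com/', 'https://github.com/']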
Looks like you may be using a different version of googlesearch. I'm using googlesearch-python 1.1.0 so the call parameters are different. However, this should help:
from googlesearch import search

list_of_queries = ["Geeksforgeeks", "stackoverflow", "GitHub"]
results = []
for query in list_of_queries:
    results.extend([r for r in search(query, 1, 'en')])
print(results)
Output:
['https://www.youtube.com/c/GeeksforGeeksVideos/videos', 'https://stackoverflow.com/', 'https://stackoverflow.blog/', 'https://github.com/']
Which, as you can see, is a simple list of strings (URLs in this case)

Scraping data from a http & javaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page:
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimentionIndexMap, so I can associate the index-map numbers with the variationValues entries.
Another person on the site (thanks for the help btw) suggested doing it this way:
import re
import json

script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part: we get the text of every 'script' element as a string and then grab everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some material about it didn't help much.
Is there a way to get, from that data, two dictionaries or lists with variationValues and asinToDimentionIndexMap (maybe using some regular expressions in the middle to pull pieces out of a big string)? Or could you explain a little what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming that where you may be finding things a challenge is when errors occur while accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimentionIndexMap' object is nested within a smaller box inside the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
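For instance, a quick (hypothetical) way to explore the structure yourself, assuming data[0] holds the JSON string captured earlier:

import json

d = json.loads(data[0])
print(d.keys())               # the top-level "boxes"
for key in d.keys():
    print(key, type(d[key]))  # see which boxes are dicts, lists or plain strings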
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert each of them to JSON and combine them however you wish.
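For example, a minimal sketch of tying the two pieces together (assuming the regexes above captured valid JSON fragments; the 'color_name' key is a guess based on the question, so check your own page source):

import json

variation_values = json.loads(variationValues)
index_map = json.loads(asinToDimensionIndexMap)

# e.g. "B01KWIUH5M": [0, 0] -> size index 0, color index 0
for asin, (size_idx, color_idx) in index_map.items():
    size = variation_values['size_name'][size_idx]
    color = variation_values['color_name'][color_idx]  # hypothetical key name
    print(asin, size, color)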

Pull url from string

I have the string below and I'm trying to pull the URL out of it with Python/Django. Thoughts on how I can get to it? I've tried treating it like a list but didn't have any luck.
[(u'https://api.twilio.com/2010-04-01/Accounts/ACae738c5e6aaf12ffa887440a3143e55b/Messages/MM673cd77ab21b37ae435c1d1d5e767366/Media/ME33be4a0ae88358aaef2aa0ea25f31339', u'image/jpeg')]
It looks like your value is a list with one tuple with two items. So get the first of each using the 0th index:
lt = [(u'https://api.twilio.com/2010-04-01/Accounts/ACae738c5e6aaf12ffa887440a3143e55b/Messages/MM673cd77ab21b37ae435c1d1d5e767366/Media/ME33be4a0ae88358aaef2aa0ea25f31339', u'image/jpeg')]
url = lt[0][0]
print(url)
https://api.twilio.com/2010-04-01/Accounts/ACae738c5e6aaf12ffa887440a3143e55b/Messages/MM673cd77ab21b37ae435c1d1d5e767366/Media/ME33be4a0ae88358aaef2aa0ea25f31339
If your value is actually a string CONTAINING the list, you can get a list by using ast:
import ast
lt = ast.literal_eval(lt)
... then use the above code to access the inner contents of the list.
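Putting the two together, a minimal sketch for the case where the value arrives as a string:

import ast

raw = ("[(u'https://api.twilio.com/2010-04-01/Accounts/ACae738c5e6aaf12ffa887440a3143e55b"
       "/Messages/MM673cd77ab21b37ae435c1d1d5e767366/Media/ME33be4a0ae88358aaef2aa0ea25f31339', u'image/jpeg')]")
lt = ast.literal_eval(raw)  # now a real list containing one (url, mime type) tuple
url = lt[0][0]
print(url)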

How to retrieve key value pairs from URL using python

I am working on a Python 2.7 script such that, given a URL with a certain number of key-value pairs (not a fixed number of them), it retrieves those values into a JSON structure.
This is what I have done so far:
from furl import furl
url = "https://search/address?size=10&city=Madrid&offer_type=1&query=Gran%20v"
f = furl(url)
fields = ['size', 'city', 'offer_type', 'query']
l = []
l.append(f.args['size'])
l.append(f.args['city'])
l.append(f.args['offer_type'])
l.append(f.args['query'])
body = {
    fields[0]: f.args[fields[0]],
    fields[1]: f.args[fields[1]],
    fields[2]: f.args[fields[2]],
    fields[3]: f.args[fields[3]]
}
This code works, but only for the case in which I know that there will be 4 key-value pairs and what their names are. I do not know how to handle the case where, for example, my URL has fewer or more parameters.
Using length = len(f.args) I can obtain the number of pairs, but I have no idea how to extract the key names from the f.args object.
Thank you very much,
Álvaro
I'm slightly confused... f.args is already a dictionary-like object of the type you want. If you want to explicitly convert it to a dictionary, you can use:
body = dict(f.args)
But even this seems unnecessary. If you want a new copy of the object so that you can change it without affecting the original instance, you can call the .copy() method.
Is this what you're looking for?
from furl import furl
url = "https://search/address?size=10&city=Madrid&offer_type=1&query=Gran%20v"
f = furl(url)
print zip(f.args.keys(),f.args.values())
Output:
[('size', '10'), ('city', 'Madrid'), ('offer_type', '1'), ('query', 'Gran v')]
The furl library is not especially well documented, but digging through the source code shows that f.args is a property that redirects eventually to an orderedmultidict.omdict object. This supports all the standard dictionary methods, in addition to lots more interesting stuff (also not well documented).
You can therefore just use f.args wherever you need body. If you need a copy for some reason, do f.args.copy(), or possibly dict(f.args).
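If you ever need to do the same without furl, the standard library offers a similar route on Python 2.7 (a sketch, separate from the furl answer above; parse_qs returns a list of values per key, so we take the first of each):

from urlparse import urlparse, parse_qs

url = "https://search/address?size=10&city=Madrid&offer_type=1&query=Gran%20v"
pairs = parse_qs(urlparse(url).query)
body = {key: values[0] for key, values in pairs.items()}
print body  # the four key-value pairs as a plain dict (key order may vary)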

Returning XPATH response as a python dictionary

Scrapy noob here. I am extracting an href 'rel' attribute which looks like the following:
rel=""prodimage":"image_link","intermediatezoomimage":"image_link","fullimage":"image_link""
This can be seen as a dict like structure within the attribute.
My main goal is to obtain the image url against 'fullimage'. Hence, I want to store the response as a python dictionary.
However, XPath returns a unicode "list" (not just a string but a list!) with one item (the whole rel contents as one item):
res = response.xpath('//*[@id="detail_product"]/div[1]/div[2]/ul/li[1]/a/@rel').extract()
print res
[u'"prodimage":"image_link", "intermediatezoomimage":"image_link", "fullimage":"image_link"']
type(res)
<type 'list'>
How do I convert the content of 'res' into something like a python dictionary ( with separated out items as list items, not just one whole item) so that I can grab individual components from the structure within 'rel'.
I hope I am clear. Thank you!
SOLVED
The XPath response above is basically a list with ONE item in unicode.
Convert the respective items into strings (using x.encode('ascii')), and then form a string representation of a dict. In my case I had to prepend and append curly braces to the string (the rel contents). That's all!
Then convert that string representation of a dict into an actual dict using the method mentioned in the link below:
Convert a String representation of a Dictionary to a dictionary?
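In code, that solution looks roughly like this (a sketch based on the sample res above; it assumes the rel contents parse as JSON once wrapped in curly braces):

import json

res = [u'"prodimage":"image_link", "intermediatezoomimage":"image_link", "fullimage":"image_link"']
rel_dict = json.loads('{' + res[0] + '}')  # prepend and append the curly braces, then parse
print rel_dict['fullimage']                # image_link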
