I am trying to run a scraper I found online, but it raises ValueError: too many values to unpack on this line of code:
k, v = piece.split("=")
This line is part of this function:
def format_url(url):
    # make sure URLs aren't relative, and strip unnecessary query args
    u = urlparse(url)
    scheme = u.scheme or "https"
    host = u.netloc or "www.amazon.com"
    path = u.path
    if not u.query:
        query = ""
    else:
        query = "?"
        for piece in u.query.split("&"):
            k, v = piece.split("=")
            if k in settings.allowed_params:
                query += "{k}={v}&".format(**locals())
        query = query[:-1]
    return "{scheme}://{host}{path}{query}".format(**locals())
If you have any input it would be appreciated, thank you.
Instead of parsing the query string yourself, you can use the urlparse.parse_qs function:
>>> from urlparse import urlparse, parse_qs
>>> URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
>>> parsed_url = urlparse(URL)
>>> parse_qs(parsed_url.query)
{'i': ['main'], 'enc': [' Hello'], 'mode': ['front'], 'sid': ['12ab']}
This happens because one of the pieces contains two or more '=' characters, so the split returns a list of three or more elements, which cannot be unpacked into just two variables.
You can solve that problem by splitting at most once, passing an extra maxsplit argument to the .split(..) call:
k, v = piece.split("=",1)
But we still have no guarantee that there is an '=' in the piece string at all.
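A minimal sketch of both cases: maxsplit handles extra '=' signs, and str.partition handles a missing one without raising:

```python
# maxsplit=1 keeps any extra '=' characters inside the value:
k, v = "enc=a=b".split("=", 1)
assert (k, v) == ("enc", "a=b")

# str.partition never raises; a piece without '=' yields an empty value:
k, sep, v = "flag".partition("=")
assert (k, v) == ("flag", "")
```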
We can however use the urllib.parse module in python-3.x (urlparse in python-2.x):
from urllib.parse import urlparse, parse_qsl

purl = urlparse(url)
quer = parse_qsl(purl.query)
for k, v in quer:
    # ...
    pass
Now we have decoded the query string into a list of key-value tuples that we can process separately. I would advise building the URL back up with urllib as well.
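A sketch of the original format_url rebuilt this way; allowed_params here is a hypothetical stand-in for settings.allowed_params:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical whitelist standing in for settings.allowed_params:
allowed_params = {"node", "field-keywords"}

def format_url(url):
    u = urlparse(url)
    # parse_qsl tolerates '=' inside values; urlencode re-escapes them
    query = urlencode([(k, v) for k, v in parse_qsl(u.query)
                       if k in allowed_params])
    return urlunparse((u.scheme or "https", u.netloc or "www.amazon.com",
                       u.path, u.params, query, u.fragment))
```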
You haven't shown any basic debugging: what is piece at the problem point? If it has more than a single = in the string, the split operation will return more than 2 values -- hence your error message.
If you want to split on only the first =, then use index to get the location, and grab the slices you need:
pos = piece.index('=')
k = piece[:pos]
v = piece[pos+1:]
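For example, with a piece containing an extra '=', the slices keep everything after the first one:

```python
piece = "enc=+Hello=world"
pos = piece.index('=')   # position of the first '='
k = piece[:pos]          # everything before it
v = piece[pos + 1:]      # everything after it; later '=' signs stay intact
```

Note that index() raises ValueError when no '=' is present, so this assumes every piece has one.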
I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. The post was edited without my permission and lost an important part: the type of the line that looks like a list is bytes, not str.
P.S. №2. My initial code was:
import urllib.request, re

f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
    m = re.match(r'var\s+(\w+)\s*=\s*\[\s*(.+)\s*\]\;', line.decode('windows-1251'))
    if m is not None:
        varname = m.group(1)
        if varname == "aEmitentIds":
            aEmitentIds = line  # its type is 'bytes', not 'str'
I need to get a list from line. The line from the web page looks like:
[15765, 22832, 15289, 15016, 15017]
Assuming s is your string, you can just use split and then cast each number to an integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about split function:
Python3 split documentation
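Since the question notes the line is actually bytes, decoding it first (the page is windows-1251 encoded) avoids mixing bytes and str; a sketch with a sample line standing in for the scraped one:

```python
# Sample bytes standing in for the scraped line:
line = b"[15765,22832,15289,15016,15017]"
s = line.decode("windows-1251").strip()
numbers = [int(number) for number in s[1:-1].split(',')]
```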
What you have is a stringified list. You can use a JSON parser to parse it into the corresponding Python list:
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Another way to do this is to use ast.literal_eval:
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
Unlike plain eval(), which is bad practice on untrusted input because it can execute arbitrary code, ast.literal_eval only accepts Python literals.
You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall(r'\d+', lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]
I need to add ' four' to the parameter q and unparse the query string after making the modification. The problem I'm having is that parse_qsl gives tuples in a list, so I can't modify the tuples. I can't use parse_qs because I have multiple parameters with the same name. How do I modify the q parameter and unparse the query in this scenario?
from urllib import parse
url = 'https://www.test.com/search?q=one+two+three&array[]=apple&array[]=oranges'
parts = parse.urlparse(url)
querys = parse.parse_qsl(parts.query)
# >>> querys
# [('q', 'one two three'), ('array[]', 'apple'), ('array[]', 'oranges')]
I'm not sure I understood your question correctly; is this what you want?
from urllib import parse
url = 'https://www.test.com/search?q=one+two+three&array[]=apple&array[]=oranges'
parts = parse.urlparse(url)
querys = [list(q) for q in parse.parse_qsl(parts.query)]
for q in querys:
    if q[0] == 'q':
        q[1] = q[1] + ' four'
print([tuple(q) for q in querys])
#[('q', 'one two three four'), ('array[]', 'apple'), ('array[]', 'oranges')]
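To actually unparse the modified pairs back into a URL, as the question asks, urlencode plus the named tuple's _replace method should work; a sketch using the same example URL:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

url = 'https://www.test.com/search?q=one+two+three&array[]=apple&array[]=oranges'
parts = urlparse(url)
# Rebuild the pairs with the modified value, then re-encode them:
querys = [(k, v + ' four' if k == 'q' else v) for k, v in parse_qsl(parts.query)]
new_url = urlunparse(parts._replace(query=urlencode(querys)))
```

Note that urlencode percent-escapes the brackets in array[], which is still a valid spelling of the same query.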
I have a url that is being parsed out of an XML file.
product_url = urlparse(item.find('product_url').text)
When I use urllib to break the URL up I get this:
ParseResult(scheme='http', netloc='example.com', path='/dynamic', params='', query='t=MD5-YOUR-OAUTH-TOKEN&p=11111111', fragment='')
I need to update the MD5-YOUR-OAUTH-TOKEN part of the query with an MD5-hashed OAuth key, which I have in tokenHashed = encryptMd5Hash(token).
My goal is, after the URL is parsed and the hash has been inserted in place of MD5-YOUR-OAUTH-TOKEN, to have the whole URL in a string I can use somewhere else. Originally I was trying to use regex to do this, but then found urllib. I cannot find where the docs describe doing something like this.
Am I right to be using urllib for this? How do I update the URL with the hashed token and get the whole URL back as a string?
So the string should look like this,
newString = 'http://example.com/dynamic?t='+tokenHashed+'&p=11112311312'
You'll first want to use the parse_qs function to parse the query string into a dictionary:
>>> import urlparse
>>> import urllib
>>> url = 'http://example.com/dynamic?t=MD5-YOUR-OAUTH-TOKEN&p=11111111'
>>> parsed = urlparse.urlparse(url)
>>> parsed
ParseResult(scheme='http', netloc='example.com', path='/dynamic', params='', query='t=MD5-YOUR-OAUTH-TOKEN&p=11111111', fragment='')
>>> qs = urlparse.parse_qs(parsed.query)
>>> qs
{'p': ['11111111'], 't': ['MD5-YOUR-OAUTH-TOKEN']}
>>>
Now you can modify the dictionary as desired:
>>> qs['t'] = ['tokenHashed']
Note here that because parse_qs returned lists for each query parameter, we replace the values with lists too, because we'll be calling urlencode next with doseq=1 to handle those lists.
Next, rebuild the query string:
>>> newqs = urllib.urlencode(qs, doseq=1)
>>> newqs
'p=11111111&t=tokenHashed'
And then reassemble the URL:
>>> newurl = urlparse.urlunparse(
... [newqs if i == 4 else x for i,x in enumerate(parsed)])
>>> newurl
'http://example.com/dynamic?p=11111111&t=tokenHashed'
That list comprehension just reuses all the values from parsed except for item 4 (the query), which we replace with our new query string.
I have a section of a log file that looks like this:
"/log?action=End&env=123&id=8000&cat=baseball"
"/log?action=start&get=3210&rsa=456&key=golf"
I want to parse out each section so the results would look like this:
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
('/log?action=', 'start', 'get=3210', 'rsa=456', 'key=golf')
I've looked into regex and matching, but a lot of my logs have different sequences which leads me to believe that it is not possible. Any suggestions?
This is clearly a fragment of a URL, so the best way to parse it is to use URL parsing tools. The stdlib comes with urlparse, which does exactly what you want.
For example:
>>> import urlparse
>>> s = "/log?action=End&env=123&id=8000&cat=baseball"
>>> bits = urlparse.urlparse(s)
>>> variables = urlparse.parse_qs(bits.query)
>>> variables
{'action': ['End'], 'cat': ['baseball'], 'env': ['123'], 'id': ['8000']}
If you really want to get the format you asked for, you can use parse_qsl instead, and then join the key-value pairs back together. I'm not sure why you want the /log to be included in the first query variable, or the first query variable's value to be separate from its variable, but even that is doable if you insist:
>>> variables = urlparse.parse_qsl(s)
>>> result = (variables[0][0] + '=', variables[0][1]) + tuple(
'='.join(kv) for kv in variables[1:])
>>> result
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
If you're using Python 3.x, just change the urlparse to urllib.parse, and the rest is exactly the same.
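Spelled out for Python 3 (a sketch of the same two approaches):

```python
from urllib.parse import urlparse, parse_qs, parse_qsl

s = "/log?action=End&env=123&id=8000&cat=baseball"

# Clean dictionary of query variables:
variables = parse_qs(urlparse(s).query)
assert variables == {'action': ['End'], 'env': ['123'],
                     'id': ['8000'], 'cat': ['baseball']}

# The exact tuple format from the question:
pairs = parse_qsl(s)
result = (pairs[0][0] + '=', pairs[0][1]) + tuple('='.join(kv) for kv in pairs[1:])
```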
You can split a couple of times:
s = '/log?action=End&env=123&id=8000&cat=baseball'
L = s.split("&")
L[0:1]=L[0].split("=")
Output:
['/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball']
It's a bit hard to say without knowing what the domain of possible inputs is, but here's a guess at what will work for you:
log = "/log?action=End&env=123&id=8000&cat=baseball\n/log?action=start&get=3210&rsa=456&key=golf"
logLines = [line.split("&") for line in log.split('\n')]
logLines = [tuple(line[0].split("=")+line[1:]) for line in logLines]
print logLines
OUTPUT:
[('/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball'),
('/log?action', 'start', 'get=3210', 'rsa=456', 'key=golf')]
This assumes that you don't really need the "=" at the end of the first string.
I am trying to urlencode this string before I submit.
queryString = 'eventName=' + evt.fields["eventName"] + '&' + 'eventDescription=' + evt.fields["eventDescription"];
Python 2
What you're looking for is urllib.quote_plus:
safe_string = urllib.quote_plus('string_of_characters_like_these:$#@=?%^Q^$')
#Value: 'string_of_characters_like_these%3A%24%23%40%3D%3F%25%5EQ%5E%24'
Python 3
In Python 3, the urllib package has been broken into smaller components. You'll use urllib.parse.quote_plus (note the parse child module)
import urllib.parse
safe_string = urllib.parse.quote_plus(...)
You need to pass your parameters into urlencode() as either a mapping (dict), or a sequence of 2-tuples, like:
>>> import urllib
>>> f = { 'eventName' : 'myEvent', 'eventDescription' : 'cool event'}
>>> urllib.urlencode(f)
'eventName=myEvent&eventDescription=cool+event'
Python 3 or above
Use urllib.parse.urlencode:
>>> urllib.parse.urlencode(f)
'eventName=myEvent&eventDescription=cool+event'
Note that this does not do url encoding in the commonly used sense (look at the output). For that use urllib.parse.quote_plus.
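The difference the note describes, side by side (a small sketch):

```python
from urllib.parse import urlencode, quote_plus

f = {'eventName': 'myEvent', 'eventDescription': 'cool event'}
pairs = urlencode(f)                        # joins pairs with '&' and '='
escaped = quote_plus('cool event & more')   # escapes every reserved character
```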
Try requests instead of urllib and you don't need to bother with urlencode!
import requests
requests.get('http://youraddress.com', params=evt.fields)
EDIT:
If you need ordered name-value pairs or multiple values for a name then set params like so:
params=[('name1','value11'), ('name1','value12'), ('name2','value21'), ...]
instead of using a dictionary.
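The same ordered, repeated-name behaviour is also available in the stdlib, since urlencode accepts a sequence of 2-tuples and preserves their order (sketch):

```python
from urllib.parse import urlencode

# A sequence of 2-tuples keeps both the ordering and the repeated names:
params = [('name1', 'value11'), ('name1', 'value12'), ('name2', 'value21')]
encoded = urlencode(params)
```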
Context
Python (version 2.7.2 )
Problem
You want to generate a urlencoded query string.
You have a dictionary or object containing the name-value pairs.
You want to be able to control the output ordering of the name-value pairs.
Solution
urllib.urlencode
urllib.quote_plus
Pitfalls
dictionaries output name-value pairs in arbitrary order
(see also: Why is python ordering my dictionary like so?)
(see also: Why is the order in dictionaries and sets arbitrary?)
handling cases when you DO NOT care about the ordering of the name-value pairs
handling cases when you DO care about the ordering of the name-value pairs
handling cases where a single name needs to appear more than once in the set of all name-value pairs
Example
The following is a complete solution, including how to deal with some pitfalls.
### ********************
## init python (version 2.7.2 )
import urllib

### ********************
## first setup a dictionary of name-value pairs
dict_name_value_pairs = {
    "bravo"   : "True != False",
    "alpha"   : "http://www.example.com",
    "charlie" : "hello world",
    "delta"   : "1234567 !@#$%^&*",
    "echo"    : "user@example.com",
}

### ********************
## setup an exact ordering for the name-value pairs
ary_ordered_names = []
ary_ordered_names.append('alpha')
ary_ordered_names.append('bravo')
ary_ordered_names.append('charlie')
ary_ordered_names.append('delta')
ary_ordered_names.append('echo')

### ********************
## show the output results
if('NO we DO NOT care about the ordering of name-value pairs'):
    queryString = urllib.urlencode(dict_name_value_pairs)
    print queryString
    """
    echo=user%40example.com&bravo=True+%21%3D+False&delta=1234567+%21%40%23%24%25%5E%26%2A&charlie=hello+world&alpha=http%3A%2F%2Fwww.example.com
    """

if('YES we DO care about the ordering of name-value pairs'):
    queryString = "&".join( [ item+'='+urllib.quote_plus(dict_name_value_pairs[item]) for item in ary_ordered_names ] )
    print queryString
    """
    alpha=http%3A%2F%2Fwww.example.com&bravo=True+%21%3D+False&charlie=hello+world&delta=1234567+%21%40%23%24%25%5E%26%2A&echo=user%40example.com
    """
Python 3:
urllib.parse.quote_plus(string, safe='', encoding=None, errors=None)
Try this:
urllib.pathname2url(stringToURLEncode)
urlencode won't work because it only accepts mappings or sequences of pairs, not a bare string. quote_plus didn't produce the correct output.
Note that the urllib.urlencode does not always do the trick. The problem is that some services care about the order of arguments, which gets lost when you create the dictionary. For such cases, urllib.quote_plus is better, as Ricky suggested.
In Python 3, this worked for me:
import urllib.parse
urllib.parse.quote(query)
For future reference (e.g. for Python 3):
>>> import urllib.request as req
>>> query = 'eventName=theEvent&eventDescription=testDesc'
>>> req.pathname2url(query)
'eventName%3DtheEvent%26eventDescription%3DtestDesc'
If urllib.parse.urlencode() is giving you errors, then try the urllib3 module.
The syntax is as follows:
import urllib3
urllib3.request.urlencode({"user" : "john" })
For use in scripts/programs which need to support both python 2 and 3, the six module provides quote and urlencode functions:
>>> from six.moves.urllib.parse import urlencode, quote
>>> data = {'some': 'query', 'for': 'encoding'}
>>> urlencode(data)
'some=query&for=encoding'
>>> url = '/some/url/with spaces and %;!<>&'
>>> quote(url)
'/some/url/with%20spaces%20and%20%25%3B%21%3C%3E%26'
import urllib.parse

query = 'Hellö Wörld@Python'
urllib.parse.quote(query)  # returns 'Hell%C3%B6%20W%C3%B6rld%40Python'
Another thing that might not have been mentioned already is that urllib.urlencode() encodes None values in the dictionary as the string 'None' instead of omitting the parameter. I don't know if this is typically desired or not, but it did not fit my use case, hence I had to use quote_plus.
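A quick illustration of that behaviour, using the Python 3 equivalent (which encodes None the same way):

```python
from urllib.parse import urlencode

# None is stringified, not dropped:
encoded = urlencode({'a': None, 'b': 'x'})
```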
For Python 3, urllib3 works properly; you can use it as follows, as per its official docs:
import urllib3

http = urllib3.PoolManager()
response = http.request(
    'GET',
    'https://api.prylabs.net/eth/v1alpha1/beacon/attestations',
    fields={  # here fields are the query params
        'epoch': 1234,
        'pageSize': pageSize
    }
)
data = response.data.decode('utf-8')
If you don't want to use urllib:
https://github.com/wayne931121/Python_URL_Decode
# Percent-encoding of reserved characters
URL_RFC_3986 = {
    "!": "%21", "#": "%23", "$": "%24", "&": "%26", "'": "%27", "(": "%28", ")": "%29", "*": "%2A", "+": "%2B",
    ",": "%2C", "/": "%2F", ":": "%3A", ";": "%3B", "=": "%3D", "?": "%3F", "@": "%40", "[": "%5B", "]": "%5D",
}
def url_encoder(b):
    # https://zh.wikipedia.org/wiki/%E7%99%BE%E5%88%86%E5%8F%B7%E7%BC%96%E7%A0%81
    if type(b) == bytes:
        b = b.decode(encoding="utf-8")  # decode so we iterate over characters, not byte values
    result = bytearray()  # bytearray: rw, bytes: read-only
    for i in b:
        if i in URL_RFC_3986:
            for j in URL_RFC_3986[i]:
                result.append(ord(j))
            continue
        i = bytes(i, encoding="utf-8")
        if len(i) == 1:
            result.append(ord(i))  # single-byte (ASCII) character: keep as-is
        else:
            for c in i:  # multi-byte character: percent-encode each byte
                c = hex(c)[2:].upper()
                result.append(ord("%"))
                result.append(ord(c[0:1]))
                result.append(ord(c[1:2]))
    result = result.decode(encoding="ascii")
    return result
#print(url_encoder("我好棒==%%0.0:)")) ==> '%E6%88%91%E5%A5%BD%E6%A3%92%3D%3D%%0.0%3A%29'