I have some URLs:
http://go.mail.ru/search?fr=vbm9&fr2=query&q=%D0%BF%D1%80%D0%BE%D0%B3%D1%83%D0%BB%D0%BA%D0%B0+%D0%B0%D0%BA%D1%82%D0%B5%D1%80%D1%8B&us=10&usln=1
https://www.google.ru/search?q=NaoOmiKi&oq=NaoOmiKi&aqs=chrome..69i57j69i61&sourceid=chrome&es_sm=0&ie=UTF-8
https://yandex.ru/search/?text=%D0%BE%D1%82%D0%BA%D1%83%D0%B4%D0%B0%20%D0%B2%D0%B5%D0%B7%D1%83%D1%82%20%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D1%83%20%D0%B2%20%D1%81%D0%B5%D0%BA%D0%BE%D0%BD%D0%B4%20%D1%85%D0%B5%D0%BD%D0%B4&clid=2073067
When I open these URLs in a browser, I can see that they are searches for:
прогулка актеры
NaoOmiKi
откуда везут одежду в секонд хенд
I want to write code to extract these values. I tried:
get = urlparse(url)
print urllib.unquote(get[4])
But it doesn't work correctly for all URLs. What should I use?
urlparse parses a URL into 6 components: scheme, netloc, path, params, query, fragment. Index 4 gives you the query string, not the path.
The query string is a &-separated list of key=value pairs with the values URL-encoded. You try to unquote the entire string, while you are only interested in the value of the text or q key.
You can use urlparse.parse_qs to parse the query string and look for the q or text keys in the returned dict.
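A minimal sketch (Python 2, to match the print syntax in the question); since the search-term key varies by engine, it checks both q and text:

from urlparse import urlparse, parse_qs

def search_term(url):
    # parse_qs returns a dict mapping each key to a list of decoded values
    params = parse_qs(urlparse(url).query)
    for key in ('q', 'text'):
        if key in params:
            return params[key][0]
    return None

print search_term('https://www.google.ru/search?q=NaoOmiKi&oq=NaoOmiKi')  # NaoOmiKi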
I have a request payload where I need to use a variable to get details from different products:
payload = f"{\"filter\":\"1202-1795\",\"bpId\":\"\",\"hashedAgentId\":\"\",\"defaultCurrencyISO\":\"pt-PT\",\"regionId\":2001,\"tenantId\":1,\"homeRegionCurrencyUID\":48}"
I need to change \"filter\":\"1202-1795\" to \"filter\":\"{variable}\" to populate the requests and get the info, but I'm struggling with the backslashes in the f-string.
I tried switching between double and single quotes, both inside the string and for the opening and closing quotes, and tried doubling the {}, but nothing works.
This is the list of values I need to populate the request with:
variable = ['1214-2291','1202-1823','1202-1795','1202-1742','1202-1719','1214-2000','1202-1198','1202-1090']
Create a dict, loop over the items in the list, set the filter key, and make the request:
import requests

url = 'https://example.com/endpoint'  # put the real endpoint here

payload = {"bpId": "",
           "hashedAgentId": "",
           "defaultCurrencyISO": "pt-PT",
           "regionId": 2001,
           "tenantId": 1,
           "homeRegionCurrencyUID": 48}

items = ['1214-2291', '1202-1823', '1202-1795', '1202-1742',
         '1202-1719', '1214-2000', '1202-1198', '1202-1090']

for item in items:
    payload['filter'] = item
    response = requests.get(url, json=payload)
You can check the difference between the data and json parameters in the Python requests package.
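For reference, a quick illustration of the difference (the URL is a placeholder): json= serializes the dict to a JSON body and sets the Content-Type header to application/json, while data= sends it form-encoded:

import requests

url = 'https://example.com/endpoint'  # placeholder
payload = {'filter': '1202-1795'}

requests.get(url, json=payload)  # JSON body, Content-Type: application/json
requests.get(url, data=payload)  # form-encoded body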
I created a list of URLs based on a pattern using string formatting.
Each URL looks something like this:
https://www.myurl.com/somestr-0/#X
Where "X" goes from "A" to "Z" (code below).
Now I want to iterate through this list and fetch each URL with requests, except that the "0" in each URL should actually be any number of one or two digits.
I used the re module to replace the "0" in my pattern but I don't know how to use the output with requests.
import re
import string

alphabet = [x for x in string.ascii_uppercase]
urls = [f'https://www.myurl.com/somestr-x/#{letter}' for letter in alphabet]

for url in urls:
    url = re.sub('x', r'\\d{1,2}', url)
I want to be able to use every URL with "any number" instead of the "0", without having to specify what number that would be exactly.
ETA: the "any number" can only be 1 or 2 digits, and I want to avoid spamming the website with too many requests by "trying" every possible combination.
You can use randrange from random.
import random

for url in urls:
    # re.sub needs a string replacement; randrange(1, 100) yields a 1- or 2-digit number
    url = re.sub('x', str(random.randrange(1, 100)), url)
    response = requests.get(url)
...
You could use requests. Supposing you only need a GET, you could fetch a URL with something like:
import requests
response = requests.get(url)
You only need to loop through all the URLs you have and process the responses. More info at https://pypi.org/project/requests/
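For instance (urls being the list built above):

import requests

for url in urls:
    response = requests.get(url)
    print(response.status_code)  # or process response.text as needed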
The line
url = re.sub('x', r'\\d{1,2}', url)
is problematic: you need to replace with an actual string, not a regular expression pattern.
Try
import random
...the rest of your code
url = re.sub('x', str(random.randint(1, 99)), url)  # randint takes both bounds; 1-99 gives 1 or 2 digits
The code I am working on retrieves a list from an HTML page with 2 fields, URL and title...
The URL always starts with /URL.... and I need to prepend "http://website.com" to every value returned from re.findall.
The code so far is this:
soup = bs(html)
tag = soup.find('div', {'class': 'item'})
reg = re.compile('<a href="(.+?)" rel=".+?" title="(.+?)"')
links = re.findall(reg, str(tag))
*(append "http://website.com" to the href"(.+?)" field)*
return links
Try:
for link in tag.find_all('a'):
link['href'] = 'http://website.com' + link['href']
Then use one of these output methods:
return str(soup) gets you the document after the changes are applied.
return tag.find_all('a') gets you all the link elements.
return [str(i) for i in tag.find_all('a')] gets you all the link elements converted to strings.
Now, don't try to parse HTML with regex while you have an HTML parser already working.
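Putting it together, a minimal sketch (assuming bs4 and that html holds the page source; the site prefix is the one from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div', {'class': 'item'})

links = []
for link in tag.find_all('a'):
    # read the attributes directly instead of regexing the markup
    links.append(('http://website.com' + link['href'], link.get('title')))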
I would like to get the id from the URL. I have stored a set of URLs in a list, and I would like to get a certain part of each URL, i.e. the id part; URLs that don't have an id part should print as None. The code I have tried so far:
text=[u'/fhgj/b?ie=UTF8&node=2858778011',u'/gp/v/w/', u'/gp/v/l', u'/gp/fhghhgl?ie=UTF8&docId=1001423601']
text=text.rsplit(sep='&', maxsplit=-1)
print text
The output is:
[u'2858778011',u'/gp/v/w/', u'/gp/v/l', u'1001423601']
I expect to get something like this:
[u'2858778011',u'None', u'None', u'1001423601']
Use urlparse, or if you really want to use string methods, then:
prefix, sep, text = text.partition("&")
(or just text = text.partition("&")[2]).
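A sketch with urlparse (Python 2, matching the print syntax above); it treats either node or docId as the id and prints None when neither is present:

from urlparse import urlparse, parse_qs

text = [u'/fhgj/b?ie=UTF8&node=2858778011', u'/gp/v/w/', u'/gp/v/l',
        u'/gp/fhghhgl?ie=UTF8&docId=1001423601']

ids = []
for url in text:
    params = parse_qs(urlparse(url).query)
    value = params.get('node', params.get('docId'))  # whichever key exists
    ids.append(value[0] if value else None)

print ids  # [u'2858778011', None, None, u'1001423601']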
I have tried this before. I'm completely at a loss for ideas.
On this page there is a dialog box to get quotes:
http://www.schwab.com/public/schwab/non_navigable/marketing/email/get_quote.html?
I used SPY, XLV, IBM, MSFT
The output is the above page with a table.
If you have an account, the quotes are real time, via a cookie.
How do I get the table into Python using 2.6, with the data as a list or dictionary?
Use something like Beautiful Soup to parse the HTML response from the web site and load it into a dictionary. Use the symbol as the key and a tuple of whatever data you're interested in as the value. Iterate over all the symbols returned and add one entry per symbol.
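A rough sketch of that idea (Python 2 with BeautifulSoup 3; the row layout, with the symbol in the first cell and the quote fields in the rest, is an assumption about the page, not its real structure):

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 for Python 2.x

def quotes_to_dict(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    quotes = {}
    for row in soup.findAll('tr'):
        cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
        if cells:
            quotes[cells[0]] = tuple(cells[1:])  # symbol -> quote fields
    return quotes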
You can see examples of how to do this in Toby Segaran's "Programming Collective Intelligence". The samples are all in Python.
First problem: the data is actually in an iframe in a frame; you need to be looking at https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=APC (where you substitute the appropriate symbol on the end of the URL).
Second problem: extracting the data from the page. I personally like lxml and xpath, but there are many packages which will do the job. I would probably expect some code like
import urllib2
import lxml.html
import re
re_dollars = r'\$?\s*(\d+\.\d{2})'
def urlExtractData(url, defs):
"""
Get html from url, parse according to defs, return as dictionary
defs is a list of tuples ("name", "xpath", "regex", fn )
name becomes the key in the returned dictionary
xpath is used to extract a string from the page
regex further processes the string (skipped if None)
fn casts the string to the desired type (skipped if None)
"""
page = urllib2.urlopen(url) # can modify this to include your cookies
tree = lxml.html.parse(page)
res = {}
for name,path,reg,fn in defs:
txt = tree.xpath(path)[0]
if reg != None:
match = re.search(reg,txt)
txt = match.group(1)
if fn != None:
txt = fn(txt)
res[name] = txt
return res
def getStockData(code):
url = 'https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=' + code
defs = [
("stock_name", '//span[#class="header1"]/text()', None, str),
("stock_symbol", '//span[#class="header2"]/text()', None, str),
("last_price", '//span[#class="neu"]/text()', re_dollars, float)
# etc
]
return urlExtractData(url, defs)
When called as
print repr(getStockData('MSFT'))
it returns
{'stock_name': 'Microsoft Corp', 'last_price': 25.690000000000001, 'stock_symbol': 'MSFT:NASDAQ'}
Third problem: the markup on this page is presentational, not structural, which says to me that code based on it will likely be fragile, i.e. any change to the structure of the page (or variation between pages) will require reworking your xpaths.
Hope that helps!
Have you thought of using Yahoo's quotes API?
see: http://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22
You will be able to dynamically generate a request to the website such as:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
And just poll it with a standard HTTP GET request. The response is in XML format.
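A minimal sketch of that polling (Python 2; the YQL query is the one from the URL above, and the element names in the XML response, such as quote and LastTradePriceOnly, are assumptions):

import urllib
import urllib2
import xml.etree.ElementTree as ET

query = 'select * from yahoo.finance.quotes where symbol = "YHOO"'
url = 'http://query.yahooapis.com/v1/public/yql?' + urllib.urlencode({
    'q': query,
    'env': 'store://datatables.org/alltableswithkeys',
})

root = ET.fromstring(urllib2.urlopen(url).read())
for quote in root.findall('.//quote'):  # assumed element name
    print quote.findtext('LastTradePriceOnly')  # assumed field name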
matplotlib has a module that gets historical quotes from Yahoo:
>>> from matplotlib.finance import quotes_historical_yahoo
>>> from datetime import date
>>> from pprint import pprint
>>> pprint(quotes_historical_yahoo('IBM', date(2010, 11, 12), date(2010, 11, 18)))
[(734088.0,
144.59,
143.74000000000001,
145.77000000000001,
143.55000000000001,
4731500.0),
(734091.0,
143.88999999999999,
143.63999999999999,
144.75,
143.27000000000001,
3827700.0),
(734092.0,
142.93000000000001,
142.24000000000001,
143.38,
141.18000000000001,
6342100.0),
(734093.0,
142.49000000000001,
141.94999999999999,
142.49000000000001,
141.38999999999999,
4785900.0)]