Get the id from url - python

I would like to get the id from a URL. I have stored a set of URLs in a list and want to extract a certain part of each one, namely the id part. For URLs that have no id part, the result should be None. The code I have tried so far:
text=[u'/fhgj/b?ie=UTF8&node=2858778011',u'/gp/v/w/', u'/gp/v/l', u'/gp/fhghhgl?ie=UTF8&docId=1001423601']
text=text.rsplit(sep='&', maxsplit=-1)
print text
The output is:
[u'2858778011',u'/gp/v/w/', u'/gp/v/l', u'1001423601']
I expect to get something like this:
[u'2858778011',u'None', u'None', u'1001423601']

Use urlparse, or if you really want to use string libs then
prefix, sep, text = text.partition("&")
(or just text = text.partition("&")[2]).
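A minimal sketch of the urlparse route, written for Python 3 (where urlparse lives in urllib.parse; the question's own code is Python 2). The helper name extract_id is hypothetical, and the assumption that the id is carried by the node or docId query parameter comes only from the sample URLs in the question.

```python
from urllib.parse import urlparse, parse_qs


def extract_id(url, keys=("node", "docId")):
    """Return the first id-like query value found in url, or None."""
    # parse_qs maps each query key to a list of its decoded values.
    query = parse_qs(urlparse(url).query)
    for key in keys:
        if key in query:
            return query[key][0]
    return None


text = [
    "/fhgj/b?ie=UTF8&node=2858778011",
    "/gp/v/w/",
    "/gp/v/l",
    "/gp/fhghhgl?ie=UTF8&docId=1001423601",
]
ids = [extract_id(url) for url in text]
print(ids)
```

URLs without a query string fall through to None, which matches the behaviour the question asks for.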

Related

Python: generating and getting urls with a non specified variable

I created a list of urls based on a pattern using string format.
Each url looks something like this:
https://www.myurl.com/somestr-0/#X
Where "X" goes from "A" to "Z" (code below).
Now I want to iterate through this list and fetch each URL with requests, except that the "0" in each URL should actually be any number of one or two digits.
I used the re module to replace the "0" in my pattern but I don't know how to use the output with requests.
import re
import string
alphabet = [x for x in string.ascii_uppercase]
urls = [f'https://www.myurl.com/somestr-x/#{letter}' for letter in alphabet]
for url in urls:
    url = re.sub('x',r'\\d{1,2}',url)
I want to be able to use every url with "any number" instead of the "0" without having to specify what number that would be exactly.
ETA : the "any number" can only be 1 or 2 digits and I want to avoid spamming the website with too many requests by "trying" every possible combination.
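You can use randrange from random. The replacement passed to re.sub must be a string, so convert the number with str().
import random
for url in urls:
    # randrange(1, 100) yields a one- or two-digit number from 1 to 99
    url = re.sub('x', str(random.randrange(1, 100)), url)
    response = requests.get(url)
    ...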
You could use requests. Supposing you only need a GET, you could fetch a URL with something like:
import requests
response = requests.get(url)
You only need to loop through all the urls you have and process the responses. More info at https://pypi.org/project/requests/
The line
url = re.sub('x',r'\\d{1,2}',url)
is problematic: you need to replace with an actual string, not a regular expression.
Try
import random
...the rest of your code
url = re.sub('x', str(random.randint(0, 99)), url)
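As a rough sketch of the same idea without re, the number can be filled in when the URL is built. The function name build_urls and the template placeholders are hypothetical. The host and path come from the question, and the actual requests.get call is left out so the snippet runs without network access.

```python
import random
import string


def build_urls(template="https://www.myurl.com/somestr-{number}/#{letter}"):
    """Build one URL per uppercase letter, each with a random 0-99 number."""
    urls = []
    for letter in string.ascii_uppercase:
        number = random.randint(0, 99)  # one or two digits, both ends inclusive
        urls.append(template.format(number=number, letter=letter))
    return urls


urls = build_urls()
print(urls[0])
```

Each URL is generated once, so the site receives a single request per letter rather than one per possible number.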

Compare string result from path & requests

I am scraping the HTML from the URL defined below, focusing mainly on the <script> tag, to extract its contents. I then want to check whether the string "example" exists in the script and, if so, print something and set flag = 1.
I am not able to compare the results extracted with html.fromstring.
I can scrape the HTML content and view all of it successfully, but I cannot get any further (comparing strings).
import requests
from lxml import html
page = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
tree = html.fromstring(page.text) #was page.content
# To get all the content in <script> of the webpage
scripts = tree.xpath('//script/text()')
# To get line of script that contains the string "location" (text)
keyword = tree.xpath('//script/text()[contains(., "location")]')
# To get the element ID of the script that contains the string "location"
keywordElement = tree.xpath('//script[contains(., "location")]')
print('\n<SCRIPT> is :\n', scripts)
# To print the Element ID
print('\nKEYWORD script is discovered # ',keywordElement)
# To print the line of script that contain "location" in text form
print('Supporting lines... \n\n',keyword)
# ******************************************************
# code below is where the string comparison comes in
# to compare the "keyword" and display output to user
# ******************************************************
string = "location"
if string in keyword:
    print('\nDANGER: Keyword detected in URL entered')
    Flag = "Detected" # For DB usage
else:
    print('\nSAFE: Keyword does not exist in URL entered')
    Flag = "Safe" # For DB usage
# END OF PROGRAM
Actual result: able to retrieve all the necessary information including its element and content
Expected result: To print the DANGER / SAFE word to the user and define the variable "Flag", which will then be stored in the database.
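keyword is a list, so "location" in keyword checks whether any element equals "location" exactly, not whether it appears inside one of the strings.
You need to index the list to get a string, after which you can search it for the specific substring:
"location" in keyword[0] #gives True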

How to get text from url

I have some urls
http://go.mail.ru/search?fr=vbm9&fr2=query&q=%D0%BF%D1%80%D0%BE%D0%B3%D1%83%D0%BB%D0%BA%D0%B0+%D0%B0%D0%BA%D1%82%D0%B5%D1%80%D1%8B&us=10&usln=1
https://www.google.ru/search?q=NaoOmiKi&oq=NaoOmiKi&aqs=chrome..69i57j69i61&sourceid=chrome&es_sm=0&ie=UTF-8
https://yandex.ru/search/?text=%D0%BE%D1%82%D0%BA%D1%83%D0%B4%D0%B0%20%D0%B2%D0%B5%D0%B7%D1%83%D1%82%20%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D1%83%20%D0%B2%20%D1%81%D0%B5%D0%BA%D0%BE%D0%BD%D0%B4%20%D1%85%D0%B5%D0%BD%D0%B4&clid=2073067
When I open these URLs in a browser, I see that they are searches for:
прогулка актеры
NaoOmiKi
откуда везут одежду в секонд хенд
I want to write code to get these values. I tried
get = urlparse(url)
print urllib.unquote(get[4])
But it doesn't work correctly for all URLs. What should I use?
urlparse parses a URL into 6 components: scheme, netloc, path, params, query, fragment. You correctly use index 4 to get the query string.
The query string, however, is a &-separated string of key=value pairs with the values URL-encoded. You unquote the entire string, while you are only interested in the value of the text or q key.
You can use urlparse.parse_qs to parse the query string and look for the q or text keys in the returned dict.
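A minimal sketch of that approach in Python 3, where parse_qs lives in urllib.parse (the question uses the Python 2 urlparse module). The helper name search_text is hypothetical, and the choice of q and text as the keys to try is based only on the three URLs in the question.

```python
from urllib.parse import urlparse, parse_qs


def search_text(url, keys=("q", "text")):
    """Return the decoded search phrase from url, or None if no key matches."""
    # parse_qs percent-decodes the values and turns '+' into a space.
    params = parse_qs(urlparse(url).query)
    for key in keys:
        if key in params:
            return params[key][0]
    return None


urls = [
    "http://go.mail.ru/search?fr=vbm9&fr2=query&q=%D0%BF%D1%80%D0%BE%D0%B3%D1%83%D0%BB%D0%BA%D0%B0+%D0%B0%D0%BA%D1%82%D0%B5%D1%80%D1%8B&us=10&usln=1",
    "https://www.google.ru/search?q=NaoOmiKi&oq=NaoOmiKi&aqs=chrome..69i57j69i61&sourceid=chrome&es_sm=0&ie=UTF-8",
    "https://yandex.ru/search/?text=%D0%BE%D1%82%D0%BA%D1%83%D0%B4%D0%B0%20%D0%B2%D0%B5%D0%B7%D1%83%D1%82%20%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D1%83%20%D0%B2%20%D1%81%D0%B5%D0%BA%D0%BE%D0%BD%D0%B4%20%D1%85%D0%B5%D0%BD%D0%B4&clid=2073067",
]
phrases = [search_text(url) for url in urls]
for phrase in phrases:
    print(phrase)
```

Other search engines may use different parameter names, so the keys tuple would need extending for them.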

Python find tag in XML using wildcard

I have this line in my python script:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-GB']")
but sometimes the last two letters of the storeURL-GB key (the country code) change, so I am trying to use something like this, but it doesn't work:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-\.*']")
Any suggestions please?
XPath predicates do not support regular expressions here. With lxml, you should probably try .xpath() and starts-with():
urls = tree.xpath("//video/products/product/read_only_info/read_only_value[starts-with(@key, 'storeURL-')]")
if urls:
    url = urls[0]
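If only the standard library is available, note that xml.etree.ElementTree does not implement starts-with() in its XPath subset. A possible workaround, sketched here as an alternative to the lxml answer, is to select the elements by path and filter on the attribute in Python. The sample XML and its URL are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the path used in the question.
SAMPLE_XML = """
<video>
  <products>
    <product>
      <read_only_info>
        <read_only_value key="other">ignored</read_only_value>
        <read_only_value key="storeURL-US">https://example.com/us</read_only_value>
      </read_only_info>
    </product>
  </products>
</video>
""".strip()

root = ET.fromstring(SAMPLE_XML)  # root is the <video> element
matches = [
    el
    for el in root.iterfind("./products/product/read_only_info/read_only_value")
    if el.get("key", "").startswith("storeURL-")  # any country code suffix
]
url = matches[0].text if matches else None
print(url)
```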

Simple POST using urllib with Python 3.3

I'm participating in a hash-breaking contest and I'm trying to automate posting strings to an HTML form and getting the hash score back. So far I've managed to get SOMETHING posted to the URL, but it's not the exact string I'm expecting, and thus the value returned for the hash is way off from the one obtained by just typing in the string manually.
import urllib.parse, urllib.request
url = "http://almamater.xkcd.com/?edu=una.edu"
data = "test".encode("ascii")
header = {"Content-Type":"application/octet-stream"}
req = urllib.request.Request(url, data, header)
f = urllib.request.urlopen(req)
print(f.read())
#parse f to pull out hash
I obtain the following hash from the site:
0fff9563bb3279289227ac77d319b6fff8d7e9f09da1247b72a0a265cd6d2a62645ad547ed8193db48cff847c06494a03f55666d3b47eb4c20456c9373c86297d630d5578ebd34cb40991578f9f52b18003efa35d3da6553ff35db91b81ab890bec1b189b7f52cb2a783ebb7d823d725b0b4a71f6824e88f68f982eefc6d19c6
This differs considerably from what I expected, which is what you get if you type in "test" (no quotes) into the form:
e21091dbb0d61bc93db4d1f278a04fe1a51165fb7262c7da31f886ae09ff3e04c41483c500db2792c59742958d8f7f39fe4f4f2cdc7940b7b25e3289b89d344e06f76305b9de525933b5df5dae2a37388f82cf76374fe363587acfb49b9d2c8fc131ef4a32c762be083b07330989b298d60e312f56a6b8a4c0f53c9b59864fb7
Obviously the code isn't doing what I'm expecting it to do. Any tips?
When you submit the form, the data also includes the field name, so submitting "test" actually sends "hashable=test". Try changing your data like this:
data = "hashable=test".encode("ascii")
or alternatively (urlencode returns a str, which must be encoded to bytes in Python 3):
data = urllib.parse.urlencode({'hashable': 'test'}).encode("ascii")
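Putting the pieces together, a sketch of the request construction might look as follows. The field name hashable comes from the answer above. Setting Content-Type to application/x-www-form-urlencoded instead of the question's application/octet-stream is an assumption about what the form handler expects. The request is only built here, not sent, so the snippet runs offline.

```python
import urllib.parse
import urllib.request

url = "http://almamater.xkcd.com/?edu=una.edu"

# urlencode produces "hashable=test"; urlopen needs bytes, hence encode().
data = urllib.parse.urlencode({"hashable": "test"}).encode("ascii")

# Assumed header: the usual content type for HTML form submissions.
headers = {"Content-Type": "application/x-www-form-urlencoded"}
req = urllib.request.Request(url, data=data, headers=headers)

# Supplying data makes urllib issue a POST instead of a GET.
print(req.get_method(), data)

# To actually send it:
# with urllib.request.urlopen(req) as f:
#     print(f.read())
```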
