How to get domain name for any site? - python

I am getting output "com" instead of "google.com"
I have tried changing sites name but every site is showing "com"
from tld import get_tld
def get_domain_name(url):
domain_name= get_tld(url)
return domain_name
print(get_domain_name("https://www.google.com"))
I expect output to be "google.com" I have tried this code on Piaza workspace terminal

What you're asking for is the first level domain, so you'd use the function get_fld
from tld import get_fld
print(get_fld("https://www.google.com")
>>> google.com
See the documentation at readthedocs

Related

Get protocol and domain (WITHOUT subdomain) from a URL

This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.
So, for example,
Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu
Input: mail.google.com
Output: google.com
Input: google.co.uk
Output: google.co.uk
For more context, I accept one or more seed URLs from a user and then run a scrapy crawler on the links. I need the domain name (without the subdomain) to set the allowed_urls attribute.
I've also taken a look at Python urlparse -- extract domain name without subdomain but the answers there seem outdated.
My current code uses urlparse but this also gets the subdomain which I don't want...
from urllib.parse import urlparse
uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'
Is there a (hopefully stdlib) way of getting (only) the domain in python-3.x?
I am using tldextract When I doing the domain parse.
In your case you only need combine the domain + suffix
import tldextract
tldextract.extract('mail.google.com')
Out[756]: ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
Out[757]: ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
Out[758]: ExtractResult(subdomain='', domain='google', suffix='co.uk')

Authentication issue using requests on aspx site

I am trying to use requests (python) to grab some pages from a website that requires me to be logged in.
I did inspect the login page to check out the username and password headers. But I found the names for those fields are not the standard 'username', 'password' used by most sites as you can see from the below screenshots
password field
I used them that way in my python script but each time I get a 'wrong syntax' error. Even sublimetext displayed a part of the name in orange as you can see from the pix below
From this I know there must be some problem with the name. But try to escape the $ signs did not help.
Even the login.aspx header disappears before google chrome could register it on the network.
The site is www dot bncnetwork dot net
I'd be happy if someone could help me figure out what to do about this.
Here is the code`import requests
import requests
def get_project_page(seed_page):
username = "*******************"
password = "*******************"
bnc_login = dict(ctl00$MainContent$txtEmailID=username, ctl00$MainContent$txtPassword=password)
sess_req = requests.Session()
sess_req.get(seed_page)
sess_req.post(seed_page, data=bnc_login, headers={"Referer":"http://www.bncnetwork.net/MyBNC.aspx"})
page = sess_req.get(seed_page)
return page.text`
You need to use strings for the keys, the $ will cause a syntax error if you don't:
data = {"ctl00$MainContent$txtPassword":password, "ctl00$MainContent$txtEmailID":email}
There are evenvalidation fileds etc.. to be filled in also, follow the logic from this answer to fill them out, all the fields can be seen in chrome tools:

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeding.
The first problem I have encountered is, that a big part is loaded via JavaScript, but that shouldn't be much problem because I can see what I need in the source code.
So, for example, I want to extract for a given word, the first two meanings, so in this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grĂșa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, no success. I am not sure what is going on, the only way I have been able to crawl it is using curl, but that is sloopy, I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get("http://www.wordreference.com/es/translation.asp?tranword=crane")
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print(i)
Prints:
grulla
grĂșa
plataforma
...
grulla blanca
grulla trompetera

How to get current domain name with Python/GAE?

Current url is
http://myapp.appspot.com/something/<user-id>
or
http://127.0.0.1:8080/something/<user-id>
How in my python code I can get http://myapp.appspot.com/ or http://127.0.0.1:8080/?
This is need for dynamic links generation, for ex., to http://myapp.appspot.com/somethingelse.
self.request.path returns the whole path.
self.request.host_url
I think you want app_identity.get_default_version_hostname().
If an app is served from a custom domain, it may be necessary to
retrieve the entire hostname component. You can do this using the
app_identity.get_default_version_hostname() method.
This code:
logging.info(app_identity.get_default_version_hostname())
prints localhost:8080 on the development server.
If self.request.path returns the whole path, can't you just do:
import urlparse
def get_domain(url):
return urlparse.urlparse(url).netloc
>>> get_domain("http://myapp.appspot.com/something/")
'myapp.appspot.com'

mechanize can't login python

I'm making auto-login script by use mechanize python.
Before I was used mechanize with no problem, but www.gmarket.co.kr in this site I couldn't make it .
whenever i try to login always login page was returned even with correct gmarket id , pass, i can't login and I saw some suspicious message
"<script language=javascript>top.location.reload();</script>"
I think this related with my problem, but don't know exactly how to handle .
Here is sample id and pass for login test
id: tgi177 pass: tk1047
if anyone can help me much appreciate thanks in advance
CODE:
# -*- coding: cp949 -*-
from lxml.html import parse, fromstring
import sys,os
import mechanize, urllib
import cookielib
import re
from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag
try:
params = urllib.urlencode({'command':'login',
'url':'http%3A%2F%2Fwww.gmarket.co.kr%2F',
'member_type':'mem',
'member_yn':'Y',
'login_id':'tgi177',
'image1.x':'31',
'image1.y':'26',
'passwd':'tk1047',
'buyer_nm':'',
'buyer_tel_no1':'',
'buyer_tel_no2':'',
'buyer_tel_no3':''
})
rq = mechanize.Request("http://www.gmarket.co.kr/challenge/login.asp")
rs = mechanize.urlopen(rq)
data = rs.read()
logged_in = r'input_login_check_value' in data
if logged_in:
print ' login success !'
rq = mechanize.Request("http://www.gmarket.co.kr")
rs = mechanize.urlopen(rq)
data = rs.read()
print data
else:
print 'login failed!'
pass
quit()
except:
pass
mechanize doesn't have the ability to interact with JavaScript. Probably spidermonkey module will help you (I have no experience with it, but description is quite promising). Also you could handle such reload (e.g.Browser.reload() for this particular case) manually if it's the only site you have this problem.
Update:
Quick look through your page shows that you have submit to other URL (with https: scheme). Look through checkValid() JavaScript function. Posting to it gives other result. Note, that this looks like homework you should do yourself before asking.

Categories

Resources