I want to extract the chat-id from Telegram Web version Z. I have done it on Telegram Web version K, but the attribute is not present in the Z version. I looked everywhere but could not find any element containing the chat-id. I know I can get the chat-id from the URL after opening the chat, but I cannot open the chat in my case.
The following is the basic code used to open Telegram:
from seleniumwire import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
import sys

URL = 'https://web.telegram.org/'

firefox_options = webdriver.FirefoxOptions()
path_to_firefox_profile = "output_files\\firefox\\ghr2wgpa.default-release"
profile = webdriver.FirefoxProfile(path_to_firefox_profile)
profile.set_preference("dom.webdriver.enabled", False)
profile.set_preference('useAutomationExtension', False)
firefox_options.set_preference("general.useragent.override", 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0')
profile.set_preference("security.mixed_content.block_active_content", False)
profile.update_preferences()
firefox_options.add_argument("--width=1400")
firefox_options.add_argument("--height=1000")

driver_installation = GeckoDriverManager().install()
service = Service(driver_installation)
if sys.platform == 'win32':
    from subprocess import CREATE_NO_WINDOW
    service.creationflags = CREATE_NO_WINDOW

driver = webdriver.Firefox(options=firefox_options, firefox_profile=profile, service=service)
driver.get(URL)
driver.close()
driver.quit()
All I know is that the chat-id is being passed to some function in common.js when the chat is clicked to open it.
The chat-id is available as the following attribute:
data-peer-id="777000"
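For illustration, where that attribute is present it can be read with get_attribute, roughly like this (a sketch: the bare [data-peer-id] selector is an assumption, and driver is the Firefox instance created above, before it is closed):
from selenium.webdriver.common.by import By

# List every element carrying a data-peer-id attribute (the selector is an assumption).
for item in driver.find_elements(By.CSS_SELECTOR, "[data-peer-id]"):
    print(item.get_attribute("data-peer-id"))  # e.g. 777000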
Thanks in advance.
If the goal is to open a chat first, then I think you can locate it not by its id but by its name. In the DOM tree of this version I see:
<div class="ListItem Chat chat-item-clickable group selected no-selection has-ripple"....
<div class="ListItem-button"
<div class="info"
<div class="info-row"
<h3 class="fullName" > Unique name is here>
So, as an option, find the unique chat via its name, open it, and grab the id from the URL, since the chat will already be open at that point.
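A rough sketch of that idea (the XPath is based on the markup above; it assumes your driver is already on web.telegram.org and that the id appears after the '#' in the URL, e.g. .../#777000, so adjust the parsing if your URL format differs):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chat_name = "Unique name is here"  # the visible chat title you are looking for

# Click the chat list item whose h3.fullName matches the name (XPath based on the markup above).
chat = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable(
        (By.XPATH, f'//div[contains(@class, "ListItem")]//h3[contains(@class, "fullName") and normalize-space()="{chat_name}"]')
    )
)
chat.click()

# The chat is now open, so the id can be taken from the current URL.
WebDriverWait(driver, 10).until(lambda d: "#" in d.current_url)
chat_id = driver.current_url.split("#")[-1]
print(chat_id)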
Related
I was trying to scrape some image data from some stores. For example, I was looking at some images from Nordstrom (tested with 'https://www.nordstrom.com/browse/men/clothing/sweaters').
I had initially used requests.get() to get the code, but I noticed that I was getting back some JavaScript; upon further research I found that this occurred because the content is dynamically loaded into the HTML using JavaScript.
To remedy this issue, following this post (Python requests.get(url) returning javascript code instead of the page html), I tried to use Selenium to get the HTML code. However, I still ran into issues trying to access all the HTML: it was still returning a lot of JavaScript. Finally, I added a time delay, thinking the page might need some time to load all of the HTML, but this still failed. Is there a way to get all the HTML using Selenium? I have attached my current code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
def create_browser(webdriver_path):
    #create a selenium object that mimics the browser
    browser_options = Options()
    #headless tag created an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(webdriver_path, chrome_options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('path/to/chromedriver_win32/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
time.sleep(10)
html_source = browser.page_source
print(html_source)
Is there something that I am not doing properly to load in all of the html code?
browser.page_source returns the initial HTML source, not the current DOM state. Try:
time.sleep(10)
html_source = browser.find_element_by_tag_name('html').get_attribute('outerHTML')
I would recommend reading "Test-Driven Development with Python"; you'll get an answer to your question and many more. You can read it for free here: https://www.obeythetestinggoat.com/ (and then you can also buy it ;-) )
Regarding your question, you have to wait until the element you're looking for is actually loaded. You may use time.sleep, but you'll get inconsistent behavior depending on the speed of your internet connection and browser.
A better solution is explained here in depth: https://www.obeythetestinggoat.com/book/chapter_organising_test_files.html#self.wait-for
You can use the proposed solution:
def wait_for(self, fn):
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException) as e:
            if time.time() - start_time > MAX_WAIT:
                raise e
            time.sleep(0.5)
fn is just a function that finds the element on the page.
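For example, adapted as a plain function for the scraping script above (a sketch: the MAX_WAIT value and the element being looked up are assumptions, not something taken from the book):
import time

from selenium.common.exceptions import WebDriverException

MAX_WAIT = 10  # seconds to keep retrying; an assumed value, tune as needed

def wait_for(fn):
    # Retry fn() until it stops raising or MAX_WAIT seconds have passed.
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException):
            if time.time() - start_time > MAX_WAIT:
                raise
            time.sleep(0.5)

# Usage with the browser created above: wait until a product link is present.
first_link = wait_for(lambda: browser.find_element_by_tag_name('a'))
print(first_link.text)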
Just add a user agent. Chrome's headless user agent string says "Headless"; that is the problem.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def create_browser(webdriver_path):
    #create a selenium object that mimics the browser
    browser_options = Options()
    browser_options.add_argument('--headless')
    browser_options.add_argument('--user-agent="Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"')
    browser = webdriver.Chrome(webdriver_path, options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('C:/bin/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
divs = browser.find_elements_by_tag_name('a')
for div in divs:
    print(div.text)
Output:
Displays all the links on the page.
Patagonia Better Sweater® Quarter Zip Pullover
(56)
Nordstrom Men's Shop Regular Fit Cashmere Quarter Zip Pullover (Regular & Tall)
(73)
Nordstrom Cashmere Crewneck Sweater
(51)
Cutter & Buck Lakemont Half Zip Sweater
(22)
Nordstrom Washable Merino Quarter Zip Sweater
(2)
ALLSAINTS Mode Slim Fit Merino Wool Sweater
Process finished with exit code -1
I'm scraping app names from the Google Play Store, and for each URL as input I get only 60 apps (because the website renders only 60 apps unless the user scrolls down). How does this work, and how can I scrape all the apps from a page using BeautifulSoup and/or Selenium?
Thank you
Here is my code:
from requests import get
from bs4 import BeautifulSoup

urls = []
urls.extend(["https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid"])

for i in urls:
    response = get(i)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")

    file = open("./InputFiles/applications.txt", "w+")
    for i in range(0, len(app_container)):
        #print(app_container[i].div['data-docid'])
        file.write(app_container[i].div['data-docid'] + "\n")
    file.close()

num_lines = sum(1 for line in open('./InputFiles/applications.txt'))
print("Applications : " + str(num_lines))
In this case you need to use Selenium. I tried it for you and got all the apps. I will try to explain; I hope you will understand.
Using Selenium is more powerful than the other Python approaches. I used ChromeDriver, so if you haven't installed it yet, you can install it from
http://chromedriver.chromium.org/
from time import sleep
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'This part is your Driver path')
driver.get('https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  ## Scroll to the bottom of the page using the driver
sleep(5)  ## Give the page time to load after scrolling; otherwise the program grabs only the first 60 elements
x = driver.find_elements_by_css_selector("div[class='card-content id-track-click id-track-impression']")  ## Select the card elements by class
for a in x:
    print(a.text)
driver.close()
Output:
1. Pocket Casts
Podcast Media LLC
₺24,99
2. Broadcastify Police Scanner Pro
RadioReference.com LLC
₺18,99
3. Relay for reddit (Pro)
DBrady
₺8,00
4. Sync for reddit (Pro)
Red Apps LTD
₺15,00
5. reddit is fun golden platinum (unofficial)
TalkLittle
₺9,99
... **UP TO 75**
Note:
Don't mind the prices. They are in my country's currency, so they will differ in yours.
UPDATE ACCORDING TO YOUR COMMENT:
The same data-docid is also present in a span tag. You can get it using get_attribute. Just add the code below to your project.
y = driver.find_elements_by_css_selector("span[class=preview-overlay-container]")
for b in y:
    print(b.get_attribute('data-docid'))
Output:
au.com.shiftyjelly.pocketcasts
com.radioreference.broadcastifyPro
reddit.news
com.laurencedawson.reddit_sync.pro
com.andrewshu.android.redditdonation
com.finazzi.distquakenoads
com.twitpane.premium
org.fivefilters.kindleit
.... UP TO 75
Google Play has recently changed its user interface, the structure of its links, and the way information is displayed. I recently wrote a Scrape Google Play Search Apps in Python blog post where I described the whole process in detail with more data.
In order to access all apps, you need to scroll to the bottom of the page.
After that, you can start extracting the app names and writing them into a file. The extraction selectors have also changed.
Code and full example in online IDE:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
urls = []
urls.extend(["https://play.google.com/store/apps?device=phone&hl=en_GB&gl=US"])
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=service, options=options)
for url in urls:
    driver.get(url)

    # scrolling page
    while True:
        try:
            driver.execute_script("document.querySelector('.snByac').click();")
            time.sleep(2)
            break
        except:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

    soup = BeautifulSoup(driver.page_source, "lxml")

driver.quit()

with open("applications.txt", "w+") as file:
    for result in soup.select(".Epkrse"):
        file.write(result.text + "\n")

num_lines = sum(1 for line in open("applications.txt"))
print("Applications : " + str(num_lines))
Output:
Applications : 329
Also, you can use the Google Play Apps Store API from SerpApi. It will bypass blocks from search engines, and you don't have to create the parser from scratch and maintain it.
Code example:
from serpapi import GoogleSearch
import os
params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),  # your SerpApi API key
    'engine': 'google_play',          # SerpApi search engine
    'store': 'apps'                   # Google Play Apps store
}

data = []

while True:
    search = GoogleSearch(params)    # where data extraction happens on the SerpApi backend
    result_dict = search.get_dict()  # JSON -> Python dict

    if result_dict.get('organic_results'):
        for result in result_dict.get('organic_results'):
            for item in result['items']:
                data.append(item['title'])

        next_page_token = result_dict['serpapi_pagination']['next_page_token']
        params['next_page_token'] = next_page_token
    else:
        break

with open('applications.txt', 'w+') as file:
    for app in data:
        file.write(app + "\n")

num_lines = sum(1 for line in open('applications.txt'))
print('Applications : ' + str(num_lines))
The output will be the same.
I'm using the Selenium library in Python to scrape a website written in JS. My strategy is to move through the website using Selenium and, at the right moment, scrape with BeautifulSoup. This works just fine in simple tests, except when, as shown in the following picture,
I need to click on the "<" button.
The "class" of the button changes on hover, so I'm using ActionChains to move to the element and click on it (I'm also using sleep to give the browser enough time to load the page). Python is not throwing any exception, but the click is not working (i.e. the calendar is not moving backwards).
Below I provide the mentioned website and the code I wrote, with an example. Do you have any idea why this is happening and/or how I can overcome this issue? Thank you very much.
Website = https://burocomercial.profeco.gob.mx/index.jsp
Code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome(path_to_webdriver)
driver.get('https://burocomercial.profeco.gob.mx/index.jsp') #access website
# Search bar and search button
search_bar = driver.find_elements_by_xpath('//*[@id="txtbuscar"]')
search_button = driver.find_element_by_xpath('//*[@id="contenido"]/div[2]/div[2]/div[2]/div/div[2]/div/button')
# Perform search
search_bar[0].send_keys("inmobiliaria")
search_button.click()
# Select result
time.sleep(2)
xpath='//*[#id="resultados"]/div[4]/table/tbody/tr[1]/td[5]/button'
driver.find_elements_by_xpath(xpath)[0].click()
# Open calendar
time.sleep(5)
driver.find_element_by_xpath('//*[#id="calI"]').click() #opens calendar
time.sleep(2)
# Hover-and-click on "<" (Here's the problem!!!)
cal_button=driver.find_element_by_xpath('//div[@id="ui-datepicker-div"]/div/a')
time.sleep(4)
ActionChains(driver).move_to_element(cal_button).perform() #hover
prev_button = driver.find_element_by_class_name('ui-datepicker-prev') #catch element whose class was changed by the hover
ActionChains(driver).click(prev_button).perform() #click
time.sleep(1)
print('clicked on it a second ago. No exception was raised, but the click was not performed')
time.sleep(1)
This is a different approach using requests. I think that Selenium should be the last option to use when doing web scraping. Usually, it is possible to retrieve the data from a webpage by emulating the requests made by the web application.
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
## Starts session
s = requests.Session()
s.headers = headers
url_base = 'https://burocomercial.profeco.gob.mx/'
ind = 'index.jsp'
resp0 = s.get(url_base+ind) ## First request, to get the 'name' parameter that is dynamic
soup0 = BS(resp0.text, 'lxml')
param_name = soup0.select_one('input[id="txtbuscar"]')['name']
action = 'BusGeneral' ### The action when submit the form
keyword = 'inmobiliaria' # Word to search
data_buscar = {param_name:keyword,'yy':'2017'} ### Data submitted
resp1 = s.post(url_base+action,data=data_buscar) ## second request: make the search
resp2 = s.get(url_base+ind) # Third request: retrieve the results
print(resp2.text)
queja = 'Detalle_Queja.jsp' ## Action when Quejas selected
data_queja = {'Lookup':'2','Val':'1','Bus':'2','FI':'28-Nov-2016','FF':'28-Feb-2017','UA':'0'} # Data for queja form
## Lookup is the number of the row in the table, FI is the initial date and FF, the final date, UA is Unidad Administrativa
## You can change these parameters to obtain different queries.
resp3 = s.post(url_base+queja,data=data_queja) # retrieve Quejas results
print(resp3.text)
With this I got:
'\r\n\r\n\r\n\r\n\r\n\r\n1|<h2>ABITARE PROMOTORA E INMOBILIARIA, SA DE CV</h2>|0|0|0|0.00|0.00|0|0.00|0.00|0.00|0.00|0 % |0 % ||2'
Which contains the data that is used in the webpage.
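Continuing from resp3 above, the pipe-separated payload can be split with plain Python (a sketch; only the company name inside the <h2> tag is clearly identifiable, so the remaining numeric columns are left unnamed here):
# Split the pipe-separated payload returned by Detalle_Queja.jsp.
raw = resp3.text.strip()
fields = raw.split('|')
company = BS(fields[1], 'lxml').get_text()  # strips the <h2> wrapper
print(company)     # e.g. ABITARE PROMOTORA E INMOBILIARIA, SA DE CV
print(fields[2:])  # the remaining numeric columns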
Maybe this answer is not exactly what you are looking for, but I think it could be easier for you to use requests.
You don't need to hover the <, just click it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(path_to_webdriver)
driver.get('https://burocomercial.profeco.gob.mx/index.jsp') #access website

# set up wait
wait = WebDriverWait(driver, 10)

# Perform search
driver.find_element_by_id('txtbuscar').send_keys("inmobiliaria")
driver.find_element_by_css_selector('button[alt="buscar"]').click()

# Select result
xpath = '//*[@id="resultados"]/div[4]/table/tbody/tr[1]/td[5]/button'
wait.until(EC.element_to_be_clickable((By.XPATH, xpath))).click()

# Open calendar
wait.until(EC.element_to_be_clickable((By.ID, 'calI'))).click() #opens calendar
wait.until(EC.visibility_of_element_located((By.ID, 'ui-datepicker-div')))

# Click on "<"
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a[title="Ant"]'))).click()
A few things:
If your XPath consists only of an ID, just use .find_element_by_id(). It's faster and easier to read.
If you are only using the first element in a collection, e.g. search_bar, just use .find_element_* instead of .find_elements_* and search_bar[0].
Don't use sleeps. Sleeps are a bad practice and result in unreliable tests. Instead use expected conditions, e.g. to wait for an element to be clickable; a short sketch applying these points follows.
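Applied to the question's code, those points look roughly like this (a sketch; it assumes the driver created in the question and reuses its locators):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# An ID-only XPath becomes find_element_by_id, and a single element replaces indexing a collection.
driver.find_element_by_id('txtbuscar').send_keys('inmobiliaria')

# An explicit wait replaces time.sleep before clicking the calendar.
wait.until(EC.element_to_be_clickable((By.ID, 'calI'))).click()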
I have the following code to auto-login to the portal, but I found that I am able to print the content while the web portal does not pop up:
import pandas as pd
import html5lib
import time
import requests
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import webbrowser
with requests.Session() as c:
proxies = {"http://proxy-udd.com"}
url = 'https://ji.devtools.com/login'
USERNAME = 'shiji'
PASSWORD = 'Tan#9'
c.get(url,verify= False)
csrftoken = ''
login_data = dict(proxies,atl_token = csrftoken, os_username=USERNAME, os_password=PASSWORD, next='/')
c.post(url, data=login_data, headers={"referer" : "https://ji.devtools.com/login"})
page = c.get('https://ji.devtools.com/')
print (page.content)
It is expected that there is no pop-up. You are sending an HTTP request to the portal. The portal did return the right content to you as a string of text. However, your Python script is not a browser; it cannot render that text the way a browser does, so nothing pops up. If you want to see a real pop-up with Python, try Selenium. It simulates how a browser behaves, and you will see a page.
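A minimal sketch of what that could look like for your portal (the os_username / os_password field names come from the login_data in your script; the submit-button locator is an assumption to adjust after inspecting the login form):
from selenium import webdriver

USERNAME = 'shiji'
PASSWORD = 'Tan#9'

driver = webdriver.Firefox()  # a visible browser window opens here
driver.get('https://ji.devtools.com/login')

# Field names taken from the login_data in the question; adjust if the form differs.
driver.find_element_by_name('os_username').send_keys(USERNAME)
driver.find_element_by_name('os_password').send_keys(PASSWORD)

# The submit control below is an assumption; inspect the login form for the real locator.
driver.find_element_by_css_selector('form [type="submit"]').click()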
This is neither written by me nor affiliated with me.
Please use this link to follow the full details.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
browser.get("http://172.16.16.16/24online/servlet/E24onlineHTTPClient")
username = browser.find_element_by_name("username")
password = browser.find_element_by_name("password")
login=browser.find_eleme
file = open('userid.txt', 'w')
for i in range(100, 200):
    username.send_keys("160905" + str(i))
    password.send_keys("123456")
    if "You have succesfully logged in" in browser.page_source:
        file.write("160905" + str(i))
file.close()
browser.close()
How can I reach the following webpage using Python Requests?
https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306
The page keeps being forwarded (redirected) until I click the two "Accept" buttons.
This is what I do:
import requests
s = requests.Session()
r = s.post("https://www.fidelity.com.hk/investor/en/important-notice.page?submit=true&componentID=1298599783876")
r = s.get("https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?&FundId=10306")
How do I handle the first "Accept" button? I have checked that there is a cookie called "Accepted"; am I correct?
<a id="terms_use_accept" class="btn btn-default standard-btn smallBtn" title="Accept" href="javascript:void(0);">Accept</a>
First of all, requests is not a browser and has no JavaScript engine built in.
But you can mimic the underlying logic by inspecting what is going on in the browser when you click "Accept". This is where the browser developer tools come in handy.
If you click "Accept" in the first Accept/Decline "popup", an "accepted=true" cookie is set. As for the second "Accept", here is how the button link looks in the source code:
<a href="javascript:agree()">
<img src="/static/images/investor/en/buttons/accept_Btn.jpg" alt="Accept" title="Accept">
</a>
If you click the button, the agree() function is called. Here is what it does:
function agree() {
    $("form[name='agreeFrom']").submit();
}
In other words, the agreeFrom form is submitted. This form is hidden, but you can find it in the source code:
<form name="agreeFrom" action="/investor/en/important-notice.page?submit=true&componentID=1298599783876" method="post">
<input value="Agree" name="iwPreActions" type="hidden">
<input name="TargetPageName" type="hidden" value="en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends">
<input type="hidden" name="FundId" value="10306">
</form>
We could submit this form with requests, but there is an easier option. If you click "Accept" and inspect what cookies are set, you'll notice that besides "accepted" there are 4 new cookies set:
"irdFundId" with a "FundId" value from the "FundId" form input or a value from the requested URL (see "?FundId=10306")
"isAgreed=yes"
"isExpand=true"
"lastAgreedTime" with a timestamp
Let's use this information to build a solution using requests+BeautifulSoup (for HTML parsing part):
import time
from bs4 import BeautifulSoup
import requests
from requests.cookies import cookiejar_from_dict
fund_id = '10306'
last_agreed_time = str(int(time.time() * 1000))
url = 'https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}
    session.cookies = cookiejar_from_dict({
        'accepted': 'true',
        'irdFundId': fund_id,
        'isAgreed': 'yes',
        'isExpand': 'true',
        'lastAgreedTime': last_agreed_time
    })

    response = session.get(url, params={'FundId': fund_id})
    soup = BeautifulSoup(response.content)
    print(soup.title)
It prints:
Fidelity Funds - America Fund A-USD| Fidelity
which means we are seeing the desired page.
You can also approach it with a browser automation tool called selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox() # could also be headless: webdriver.PhantomJS()
driver.get('https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306')
# switch to the popup
frame = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe.cboxIframe")))
driver.switch_to.frame(frame)
# click accept
accept = driver.find_element_by_link_text('Accept')
accept.click()
# switch back to the main window
driver.switch_to.default_content()
# click accept
accept = driver.find_element_by_xpath('//a[img[#title="Accept"]]')
accept.click()
# wait for the page title to load
WebDriverWait(driver, 10).until(EC.title_is("Fidelity Funds - America Fund A-USD| Fidelity"))
# TODO: extract the data from the page
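As a follow-up to the TODO, the rendered page can be handed to BeautifulSoup just like in the requests-based version above (a sketch that only prints the title, since the exact selectors for the NAV table are not covered here):
from bs4 import BeautifulSoup

# Parse the page Selenium has rendered, as in the requests-based solution above.
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.title.get_text())  # "Fidelity Funds - America Fund A-USD| Fidelity"
# TODO: locate the historical NAV/dividend table in `soup` and extract its rows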
You can't handle JavaScript using requests or the urllib modules. But based on my knowledge (which is not much), I'll tell you how I would solve this problem.
This site uses a specific cookie to know whether you have already accepted its policy. If not, the server redirects you to the page shown in the image above. Look for that cookie using some add-on and set it manually so the website shows you the content you're looking for.
Another way is to use Qt's built-in web browser (which uses WebKit), which lets you execute JavaScript code. Simply use evaluateJavaScript("agree();") and there you go.
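Roughly, with the old PyQt4 QtWebKit bindings, that could look like the sketch below (an illustration only and untested; QtWebKit is deprecated in newer Qt releases):
import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

URL = ("https://www.fidelity.com.hk/investor/en/fund-prices-performance/"
       "fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306")

app = QApplication(sys.argv)
view = QWebView()

def on_load_finished(ok):
    # Call the same agree() function the page itself runs when "Accept" is clicked.
    view.page().mainFrame().evaluateJavaScript("agree();")

view.loadFinished.connect(on_load_finished)
view.load(QUrl(URL))
view.show()
sys.exit(app.exec_())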
Hope it helps.