I am making a program that fetches apps from APKmirror.
I fetch the page using urllib.request and scrape it using Beautiful Soup:
import re
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

def get_soup(url):
    # fetch with a browser-like User-Agent and parse with lxml
    return BeautifulSoup(urlopen(Request(url=url, headers={'User-Agent': 'Mozilla/5.0'})).read(), 'lxml')

# first "... release" link on the listing page -> version string such as "8-87-0"
twver = get_soup("https://www.apkmirror.com/apk/twitter-inc/").find("a", class_="fontBlack", text=re.compile("release")).string.split(' ')[1].replace(".", "-")
twurl = "https://www.apkmirror.com/apk/twitter-inc/twitter/twitter-" + twver + "-release/"
twpage1 = "https://apkmirror.com" + get_soup(twurl).find("span", text="APK").parent.find("a", class_="accent_color")['href']
twpage2 = "https://apkmirror.com" + get_soup(twpage1).find("a", class_=re.compile("downloadButton"))['href']
twdllink = "https://apkmirror.com" + get_soup(twpage2).find(rel="nofollow")['href']
How can I use a single connection to the APKmirror server and reuse it to fetch a different URL each time? As you can see, the URL changes on every request.
Or please suggest other ways to make this faster.
requests.Session() pools connections, so every request made through the same session reuses the open connection to the host instead of opening a new one:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
}

def get_soup(content):
    return BeautifulSoup(content, 'lxml')

def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        soup = get_soup(r.content)
        links = (urljoin(url, x['href'])
                 for x in soup.select('a.fontBlack[href*=release]'))
        for link in links:
            # from here you can continue with r = req.get(link),
            # since you are reusing the same session
            print(link)

if __name__ == "__main__":
    main('https://www.apkmirror.com/apk/twitter-inc/')
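To make the rest of your original chain concrete, here is a minimal sketch of how the remaining steps could continue inside that same session. The fetch_apk_link helper name is hypothetical, and the selectors are carried over from your question, so they may need adjusting if the site's markup changes:

def fetch_apk_link(req, start_url):
    # all three page fetches below go through the same pooled session `req`
    soup = get_soup(req.get(start_url).content)
    variant = soup.find("span", text="APK").parent.find("a", class_="accent_color")['href']
    soup = get_soup(req.get(urljoin(start_url, variant)).content)
    download = soup.find("a", class_=re.compile("downloadButton"))['href']
    soup = get_soup(req.get(urljoin(start_url, download)).content)
    return urljoin(start_url, soup.find(rel="nofollow")['href'])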
I'm building a Twitter bot using Tweepy and BeautifulSoup4. I'd like to save the results of a request in a list, but my script isn't working anymore (it was working a few days ago). I've been looking at it and I don't understand why. Here is my function:
import requests
import tweepy
from bs4 import BeautifulSoup
import urllib
import os
from tweepy import StreamListener
from TwitterEngine import TwitterEngine
from ConfigEngine import TwitterAPIConfig
import urllib.request
import emoji
import random
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
# Fetching the links
def parseLinks(url):
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                results.append(link)
        return results
The "url" parameter is 100% correct in the rest of the code. As an output, I get a "None". To be more precise, the execution stops right after line "results = []" (so it doesn't enter into the for).
Any idea?
Thank you so much in advance!
It seems that Google changed the HTML markup on the page. Try changing the search from class="r" to class="rc":
import requests
from bs4 import BeautifulSoup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

def parseLinks(url):
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='rc'):  # <-- change 'r' to 'rc'
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                results.append(link)
        return results

url = 'https://www.google.com/search?q=tree'
print(parseLinks(url))
Prints:
['https://en.wikipedia.org/wiki/Tree', 'https://simple.wikipedia.org/wiki/Tree', 'https://www.britannica.com/plant/tree', 'https://www.treepeople.org/tree-benefits', 'https://books.google.sk/books?id=yNGrqIaaYvgC&pg=PA20&lpg=PA20&dq=tree&source=bl&ots=_TP8PqSDlT&sig=ACfU3U16j9xRJgr31RraX0HlQZ0ryv9rcA&hl=sk&sa=X&ved=2ahUKEwjOq8fXyKjsAhXhAWMBHToMDw4Q6AEwG3oECAcQAg', 'https://teamtrees.org/', 'https://www.woodlandtrust.org.uk/trees-woods-and-wildlife/british-trees/a-z-of-british-trees/', 'https://artsandculture.google.com/entity/tree/m07j7r?categoryId=other']
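Note that the function also returns None implicitly whenever resp.status_code is not 200, which is another way to end up with None. A minimal defensive sketch (reusing the imports and USER_AGENT from the snippet above) fails loudly instead of silently:

def parseLinks(url):
    resp = requests.get(url, headers={"user-agent": USER_AGENT})
    resp.raise_for_status()  # raise instead of silently returning None on a bad status
    soup = BeautifulSoup(resp.content, "html.parser")
    results = [g.find('a')['href']
               for g in soup.find_all('div', class_='rc') if g.find('a')]
    if not results:
        print("No results parsed; Google may have changed its markup again")
    return results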
I'm trying to get the 'src' of the profile pictures of 500 players on Transfermarkt, meaning the picture on each player's profile page, not the small picture from the list. I've managed to store each player's URL in a list. But when I iterate through it, the code just runs and runs, then stops after about twenty minutes without any error or any output from my print command. As I said, I want the source (src) of each player's picture on their respective profile page.
I'm not really sure what's wrong with the code, since I don't get any error messages. I've built it with help from different posts here on Stack Overflow.
from bs4 import BeautifulSoup
import requests
import pandas as pd

playerID = []
playerImgSrc = []
result = []

for page in range(1, 21):
    r = requests.get("https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?land_id=0&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&jahrgang=0&kontinent_id=0&plus=1",
                     params={"page": page},
                     headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"})
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.select('a.spielprofil_tooltip')
    for i in range(len(links)):
        playerID.append(links[i].get('id'))

playerProfile = ["https://www.transfermarkt.com/josh-maja/profil/spieler/" + x for x in playerID]

for p in playerProfile:
    html = requests.get(p).text
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select('div.dataBild')
    for i in range(len(link)):
        playerImgSrc.append(link[i].get('src'))

print(playerImgSrc)
Basically, the site's navigation uses AJAX, which is quick; it feels about the same as browsing a folder on your local machine. The data displayed in the UI therefore actually comes from a background XHR request to a specific endpoint on the host, marktwertetop.
I was able to locate that XHR request, and then I called it directly with the required parameters while looping over the pages.
I also figured out that the only difference between the small and the large photo is a single segment of the URL, small versus header, so I replaced it within the URL itself.
Also, I put everything under my "antibiotic protection" (😋), meaning under requests.Session(), to maintain one session during the loop and while downloading the pics; that keeps the server from blocking, refusing, or dropping my packets/requests while scraping/downloading.
Imagine you have a browser open and you navigate between pages of the same website: a cookie session is created and stays established for as long as you are connected to the site, refreshing itself when idle.
But the way you were doing it is like opening a browser, closing it, opening it again, closing it, and so on, which the server side can count as flood behavior or a DoS attempt. Blocking that is very basic firewall behavior.
import requests
from bs4 import BeautifulSoup

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    with requests.Session() as req:
        allin = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            img = [item.get("src") for item in soup.findAll(
                "img", class_="bilderrahmen-fixed")]
            convert = [item.replace("small", "header") for item in img]
            allin.extend(convert)
        return allin

def download():
    urls = main(site)
    with requests.Session() as req:
        for url in urls:
            r = req.get(url, headers=headers)
            name = url[52:]
            name = name.split('?')[0]
            print(f"Saving {name}")
            with open(f"{name}", 'wb') as f:
                f.write(r.content)

download()
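As a side note, name = url[52:] assumes a fixed prefix length in the image URLs. Here is a small sketch of a less brittle way to derive the file name from the URL path (the filename_from helper name is my own):

import os
from urllib.parse import urlparse

def filename_from(url):
    # take the last path segment; urlparse().path already excludes the query string
    return os.path.basename(urlparse(url).path)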
UPDATE per user comment:
import requests
from bs4 import BeautifulSoup
import csv

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    with requests.Session() as req:
        allin = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            img = [item.get("src") for item in soup.findAll(
                "img", class_="bilderrahmen-fixed")]
            convert = [item.replace("small", "header") for item in img]
            name = [name.text for name in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            allin.extend(convert)
            names.extend(name)
        with open("data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "IMG"])
            data = zip(names, allin)
            writer.writerows(data)

main(site)
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.amazon.com/s?k=iphone+5s&ref=nb_sb_noss')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
all = soup.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})
print(all)
This is my simple Python script trying to scrape a page on Amazon, but not all of the HTML is returned in the soup variable, so I get nothing when trying to find a specific series of tags and extract them.
Try the code below; it should do the trick for you.
You forgot to add headers to your request:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = 'https://www.amazon.com/s?k=iphone+5s&ref=nb_sb_noss'
response = requests.get(url, headers=headers)
print(response.text)
soup = BeautifulSoup(response.content, features="lxml")
my_all = soup.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})
print(my_all)
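Even with headers, Amazon sometimes serves an interstitial bot-check page instead of the search results, in which case my_all comes back empty. A small sketch (reusing soup and my_all from the code above; the title check is just one heuristic) can make that visible:

if not my_all:
    # an empty result often means we got a bot-check page rather than the search results;
    # print the page title to see what actually came back
    title = soup.title.get_text(strip=True) if soup.title else "(no title)"
    print(f"No matching spans found; page title was: {title}")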
I am trying to get a value from a class. Sometimes find returns the value I need, but other times it doesn't work at all.
Code:
import requests
from bs4 import BeautifulSoup

url = 'https://beru.ru/catalog/molotyi-kofe/76321/list'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
item_count = soup.find('div', class_='_2StYqKhlBr').text.split()[4]
print(item_count)
The reason you get the value sometimes and sometimes not is that the website is protected by a CAPTCHA.
When the request is blocked by the CAPTCHA, you are redirected to a URL like the following:
https://beru.ru/showcaptcha?retpath=https://beru.ru/catalog/molotyi-kofe/76321/list?ncrnd=4561_aa1b86c2ca77ae2b0831c4d95b9d85a4&t=0/1575204790/b39289ef083d539e2a4630548592a778&s=7e77bfda14c97f6fad34a8a654d9cd16
You can verify this by parsing the response content:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://beru.ru/catalog/molotyi-kofe/76321/list')
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.findAll('div', attrs={'class': '_2StYqKhlBr _1wAXjGKtqe'}):
    print(item)

for item in soup.findAll('div', attrs={'class': 'captcha__image'}):
    for captcha in item.findAll('img'):
        print(captcha.get('src'))
And you will get the CAPTCHA image link:
https://beru.ru/captchaimg?aHR0cHM6Ly9leHQuY2FwdGNoYS55YW5kZXgubmV0L2ltYWdlP2tleT0wMEFMQldoTnlaVGh3T21WRmN4NWFJRUdYeWp2TVZrUCZzZXJ2aWNlPW1hcmtldGJsdWU,_0/1575206667/b49556a86deeece9765a88f635c7bef2_df12d7a36f0e2d36bd9c9d94d8d9e3d7
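Since the block is implemented as a redirect to /showcaptcha, a minimal sketch for detecting it programmatically is to check the final URL after requests has followed the redirects (headers copied from the question):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
r = requests.get('https://beru.ru/catalog/molotyi-kofe/76321/list', headers=headers)
if 'showcaptcha' in r.url:
    # requests followed the redirect, so the final URL reveals the block
    print('Blocked by CAPTCHA; retry later or solve it first')
else:
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.find('div', class_='_2StYqKhlBr').text.split()[4])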
I made something that gets the time from https://time.is/ and shows it. I used BeautifulSoup and urllib.request.
But I want to trim the output. I'm getting the following as output, and I want to remove the surrounding tag:
<div id="twd">07:29:26</div>
Program File:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://time.is/'
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
req = urllib.request.Request(url, headers=hdr)
res = urllib.request.urlopen(req)
soup = BeautifulSoup(res, 'html.parser')
string = soup.find(id='twd')
print(string)
How can I get just the text?
You can get the text from the DOM element with .text, like:
string.text
Test Code:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://time.is/'
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
req = urllib.request.Request(url, headers=hdr)
res = urllib.request.urlopen(req)
soup = BeautifulSoup(res, 'html.parser')
string = soup.find(id='twd')
print(string.text)
Results:
07:06:11PM
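If the element ever contains nested tags or stray whitespace, get_text() offers a bit more control than .text; for example, reusing string from the code above:

print(string.get_text(strip=True))  # strips leading/trailing whitespace from the text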