I can't scrape 'https://www.upwork.com/'.
I tried this code:
import requests
from bs4 import BeautifulSoup

url = "https://www.upwork.com/"
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
I also tried:
import requests
headers = {
    'authority': 'www.upwork.com',
    'accept':
    ...
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
}
response = requests.get('https://www.upwork.com/', headers=headers)
The response is always 403.
The Upwork platform is heavily protected against bots and scrapers, so it is no surprise that your basic scraper gets detected and blocked immediately. In general, when you want to scrape big websites (Google, Amazon, Upwork, Freelancer, etc.), it is recommended to either build a complex scraper or use a third-party service. Let me explain each case:
1. Build a complex scraper:
By a complex scraper, I mean a web scraper that can go undetected. As an engineer at WebScrapingAPI and a researcher working on exactly this matter (namely web fingerprinting techniques), I can tell you that this is a lot of work, even for an experienced programmer.
However, a good place to start would be to drop requests and use Selenium instead. Here is an example that successfully accesses the page:
from selenium import webdriver
# Initiate the webdriver
driver = webdriver.Chrome()
# Navigate to website
driver.get('https://www.upwork.com/')
# Get raw HTML and quit driver
html = driver.page_source
# Print the HTML
print(html)
## You can then use this HTML content with BeautifulSoup for example
## in order to extract the desired elements from your page
## Add your code below:
# Close the webdriver
driver.quit()
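For instance, here is a minimal sketch of the parsing step hinted at in the comments above; the selector is only a placeholder, since the real Upwork markup will differ:
from bs4 import BeautifulSoup

# Parse the HTML grabbed by Selenium
soup = BeautifulSoup(html, 'html.parser')

# Placeholder example: print every link on the page; swap in the selectors you actually need
for link in soup.find_all('a', href=True):
    print(link['href'])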
2. Use a third party provider:
There are quite a few web scraping providers. However, since I know the product we've built at WebScrapingAPI, I will recommend you use ours. I've even designed and tested a script that fetches data from Upwork here:
import json
import requests
from bs4 import BeautifulSoup

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.upwork.com/freelance-jobs/'
CATEGORY = 'python'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL + CATEGORY,
    "render_js": 1,
    "proxy_type": "residential",
    "extract_rules": '{"jobs":{"selector":"div.job-tile-wrapper","output":"html"}}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)
data = json.loads(response.text)

for job in data['jobs']:
    soup = BeautifulSoup(job, 'html.parser')
    try:
        job_title = soup.find('a', attrs={'data-qa': 'job-title'}).text.strip()
        job_about = soup.find('p', attrs={'data-qa': 'job-description'}).text.strip()
        job_price = soup.select_one('div.row>p.col-6.col-sm-3.col-md-3.mb-0.pb-15.pb-md-20>strong').text.strip()
        job_level = soup.select_one('div.row>p.col-6.col-sm-4.mb-0.pb-15.pb-md-20>strong').text.strip()
        print('Title: ' + job_title)
        print('Price: ' + job_price)
        print('Level: ' + job_level)
        print('About: ' + job_about)
        print('\n')
    except AttributeError:
        # Skip job tiles that are missing one of the expected elements
        pass
Result:
Title: Web Scraping Using Python
Price: $50
Level: Intermediate
About: Deliverables are 2 Python Scripts:
1. I am looking for a Python Script that will allow me to export a JSON response and put into csv.…
Title: Think or Swim Trading Automation (TD Ameritrade) Bot creation
Price: $550
Level: Entry
About: I have a back tested program I wish to automated into a trading bot. See attached. This is a very simple program in theory.
Title: Resy Reservation Bot / Snipe
Price: $150
Level: Intermediate
About: I would like to create a Bot that allows me to book a restaurant reservation the moment it is posted on the website.
https://resy.com/…
Title: Emails are not being received in my email account
Price: $5
Level: Entry
About: Emails are not being received in my email account. I am using the Hestial control panel
I'm trying to get the page source of an imgur page using requests, but the results I'm getting are different from the source. I understand that these pages are rendered using JS, but that is not the issue here.
It seems I'm getting redirected because they detect I'm using an automated browser, but I'd prefer not to use Selenium here. For example, the following code scrapes the page source of two imgur IDs (one valid and one invalid) that should have different page sources.
import requests
from bs4 import BeautifulSoup

url1 = "https://i.imgur.com/ssXK5"  # valid ID
url2 = "https://i.imgur.com/ssXK4"  # invalid ID

def get_source(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

page1 = get_source(url1)
page2 = get_source(url2)

print(page1 == page2)
# True
The scraped page sources are identical, so I presume it's an anti-scraping thing. I know there is an imgur API, but I'd like to know how to circumvent such a redirection, if possible. Is there any way to get the actual source code using the requests module?
Thanks.
I wrote some code for scraping the Google News page. It worked fine until today, when it stopped.
It does not give me any error, but it does not scrape anything.
For this code I followed a tutorial from 2018 on YouTube, and I used the same URL and the same divs.
When I inspect the page in the browser, it still has class="st" and class="slp".
So the code worked a year ago, and it worked yesterday, but it stopped working today.
Do you know what the problem could be?
This is the code that worked yesterday:
from textblob import TextBlob
from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta, datetime

term = 'coca cola'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, 'html.parser')

snippet_text = soup.find_all('div', class_='st')
print(len(snippet_text))
news_date = soup.find_all('div', class_='slp')
print(len(news_date))

for paragraph_text, post_date in zip(snippet_text, news_date):
    paragraph_text = TextBlob(paragraph_text.get_text())
    print(paragraph_text)
    todays_date = date.today()
    time_ago = TextBlob(post_date.get_text()).split('- ')[1]
    print(time_ago)
Does Google change its HTML code or the URL?
Please add a user-agent while scraping Google.
from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta, datetime
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'coca cola'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url,headers=headers)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
snippet_text = soup.find_all('div', class_='st')
print(len(snippet_text))
news_date = soup.find_all('div', class_='slp')
print(len(news_date))
If you get an SSL "max retries exceeded" error, add verify=False:
response = requests.get(url,headers=headers,verify=False)
As KunduK said, Google is blocking your request because the default user-agent from the requests library is python-requests. You can fake a real browser visit by adding headers to your request. Lists of user-agents are available on various websites.
Also, you can set a timeout on your request (info) to stop waiting for a response after a given number of seconds. Otherwise, the script can hang indefinitely.
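A minimal sketch of that (the 10-second value is arbitrary; requests raises requests.exceptions.Timeout if it is exceeded):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'}

# Stop waiting after 10 seconds instead of hanging indefinitely
response = requests.get('https://www.google.com/search', params={'q': 'coca cola', 'tbm': 'nws'}, headers=headers, timeout=10)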
You can apply the same logic to Yahoo, Bing, Baidu, Yandex, and other search engines.
Code and full example:
from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?hl=en-US&q=coca cola&tbm=nws', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

for headings in soup.findAll('div', class_='dbsr'):
    title = headings.find('div', class_='JheGif nDgy9d').text
    link = headings.a['href']
    print(f'Title: {title}')
    print(f'Link: {link}')
    print()
Part of output:
Title: Fact check: Georgia is not removing Coca-Cola products from state-owned
buildings
Link: https://www.usatoday.com/story/news/factcheck/2021/04/09/fact-check-georgia-not-removing-coke-products-state-buildings/7129548002/
Title: The 'race for talent' is pushing companies like Delta and Coca-Cola to
speak out against voting laws
Link: https://www.businessinsider.com/georgia-voting-law-merits-response-delta-coca-cola-workers-2021-4
Title: Why Coke's Earnings Could Contain Good News, One Analyst Says
Link: https://www.barrons.com/articles/cokes-stock-is-lagging-why-one-analyst-thinks-next-weeks-earnings-could-include-good-news-51618246989
Alternatively, you can use Google News Result API from SerpApi. Check out Playground to test.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
Part of output:
Title: Why Coke's Earnings Could Contain Good News, One Analyst Says
Link: https://www.barrons.com/articles/cokes-stock-is-lagging-why-one-analyst-thinks-next-weeks-earnings-could-include-good-news-51618246989
Title: The 'race for talent' is pushing companies like Delta and Coca-Cola to speak out against voting laws
Link: https://www.businessinsider.com/georgia-voting-law-merits-response-delta-coca-cola-workers-2021-4
Title: 2 Reasons You Shouldn't Buy Coca-Cola Now
Link: https://seekingalpha.com/article/4418712-2-reasons-you-shouldnt-buy-coca-cola-now
Title: Worrying Signs For Coca-Cola
Link: https://seekingalpha.com/article/4418630-worrying-signs-for-coca-cola
Disclaimer, I work for SerpApi.
I took the code below from the answer to How to use BeautifulSoup to parse google search results in Python.
It used to work on my Ubuntu 16.04 machine, which has both Python 2 and 3.
The code is below:
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text

response = requests.get(url)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')

for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
It executes but prints nothing, which really puzzles me. Any help would be appreciated.
The problem is that Google serves different HTML when you don't specify a User-Agent in the headers. To specify a custom header, pass a dict with User-Agent to the headers= parameter of requests:
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
Prints:
How to Write the Perfect Query Letter - Query Letter Examplehttps://www.writersdigest.com/.../how-to-write-the-perfect-qu...PuhverdatudTõlgi see leht21. märts 2016 - A literary agent shares a real-life novel pitch that ultimately led to a book deal—and shows you how to query your own work with success.
-----
Inimesed küsivad ka järgmistHow do you start a query letter?What should be included in a query letter?How do you end a query in an email?How long is a query letter?Tagasiside
-----
...and so on.
Learn more about the user-agent and request headers.
Basically, the user-agent identifies the browser, its version number, and its host operating system. It represents a person (a browser) in a web context, and lets servers and network peers tell whether the request comes from a real browser or from a bot.
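As a quick check, you can print the default user-agent that requests sends when you don't override it; that is the value Google keys on:
import requests

# Prints something like 'python-requests/2.x.y', which servers can easily flag as a bot
print(requests.utils.default_user_agent())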
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
To make it look better, you can pass URL params as a dict(), which is more readable, and requests does everything for you automatically (the same goes for adding the user-agent to the headers):
params = {
    "q": "My query goes here"
}

requests.get("YOUR_URL", params=params)
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    print(title)
-------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from a JSON string, rather than figuring out how to extract data, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
--------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Disclaimer, I work for SerpApi.
I am trying to scrape Google News search results using Python's requests to get links to different articles. I get the links using Beautiful Soup.
The problem is that although all the links look normal in the browser's source view, after the operation they are changed: they all start with "/url?q=", and after the "core" of the link there is a string of characters starting with "&". Also, some characters inside the link are changed; for example, the URL:
http://www.azonano.com/news.aspx?newsID=35576
changes to:
http://www.azonano.com/news.aspx%newsID%35576
I'm using standard "getting started" code:
import requests, bs4

url_list = list()
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'

res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for link in soup.select('h3 > a'):
    url_list.append(link.get('href'))

# First link on google news page is:
# https://www.theengineer.co.uk/graphene-sensor-could-speed-hepatitis-diagnosis/
print(url_list[0])  # this line will print the url modified by requests
I know it's possible to get around this problem by using Selenium, but I'd like to know where the root cause of the problem lies: with requests itself, or (more plausibly) with the way I'm using it.
Thanks for any help!
You're comparing what you see in a browser with what requests generates without a user-agent header. If you specify one before making the initial request, the response will reflect what you would see in a web browser; it looks like Google serves the requests differently:
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} # I just used a general Chrome 41 user agent header
res = requests.get(url, headers=headers)
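As a side note, if you still encounter wrapped links of the /url?q=... form, a small helper like the following (not part of the original answer; it only uses the standard library's urllib.parse, and the sample input is illustrative) can recover the target URL:
from urllib.parse import urlparse, parse_qs

def unwrap_google_link(href):
    # Google sometimes wraps result links as /url?q=<real-url>&sa=...
    if href.startswith('/url?'):
        query = parse_qs(urlparse(href).query)
        return query.get('q', [href])[0]
    return href

# parse_qs also decodes percent-encoded characters in the wrapped URL
print(unwrap_google_link('/url?q=http://www.azonano.com/news.aspx%3FnewsID%3D35576&sa=U'))
# http://www.azonano.com/news.aspx?newsID=35576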
I've looked around for a solution to this problem, but for the life of me I cannot figure it out!
This is my first attempt at writing anything in Python. I want my script to load a list of subjects from a text file, generate a Google search URL for each, and scrape those URLs one by one to output the number of results found according to Google, along with the links of the top 15 results.
My problem is that when I run my code, all that is printed are empty lists:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
**END_OBJECT**
..etc.
Here is my code:
from lxml import html
import requests

def iterate():
    with open("list.txt", "r") as infile:
        for line in infile:
            if not line.strip():
                break
            yield line

output = open("statistic_out.txt", "w")

for line in iterate():
    raw = line
    output.write(raw + " services")
    request = raw.replace(" ", "%20")
    page = requests.get('https://www.google.com.au/search?safe=off&tbs=ctr:countryAU&cr=countryAU&q=' + request + 'services%20-yellowpages%20-abs', verify=False)
    path = html.fromstring(page.text)

    # This will create a list of buyers:
    resultCount = path.xpath('//*[@id="resultStats"]/text()')
    # This will create a list of prices
    print(resultCount)
    print('\n')

    resultUrlList1 = path.xpath('//*[@id="rso"]/div[2]/li[1]/div/h3/a/text()')
    resultUrlList2 = path.xpath('//*[@id="rso"]/div[2]/li[2]/div/h3/a/text()')
    resultUrlList3 = path.xpath('//*[@id="rso"]/div[2]/li[3]/div/h3/a/text()')
    resultUrlList4 = path.xpath('//*[@id="rso"]/div[2]/li[4]/div/h3/a/text()')
    resultUrlList5 = path.xpath('//*[@id="rso"]/div[2]/li[5]/div/h3/a/text()')
    resultUrlList6 = path.xpath('//*[@id="rso"]/div[2]/li[6]/div/h3/a/text()')
    resultUrlList7 = path.xpath('//*[@id="rso"]/div[2]/li[7]/div/h3/a/text()')
    resultUrlList8 = path.xpath('//*[@id="rso"]/div[2]/li[8]/div/h3/a/text()')
    resultUrlList9 = path.xpath('//*[@id="rso"]/div[2]/li[9]/div/h3/a/text()')
    resultUrlList10 = path.xpath('//*[@id="rso"]/div[2]/li[10]/div/h3/a/text()')
    resultUrlList11 = path.xpath('//*[@id="rso"]/div[2]/li[11]/div/h3/a/text()')
    resultUrlList12 = path.xpath('//*[@id="rso"]/div[2]/li[12]/div/h3/a/text()')
    resultUrlList13 = path.xpath('//*[@id="rso"]/div[2]/li[13]/div/h3/a/text()')
    resultUrlList14 = path.xpath('//*[@id="rso"]/div[2]/li[14]/div/h3/a/text()')
    resultUrlList15 = path.xpath('//*[@id="rso"]/div[2]/li[15]/div/h3/a/text()')

    print(resultUrlList1)
    print('\n')
    print(resultUrlList2)
    print('\n')
    print(resultUrlList3)
    print('\n')
    print(resultUrlList4)
    print('\n')
    print(resultUrlList5)
    print('\n')
    print(resultUrlList6)
    print('\n')
    print(resultUrlList7)
    print('\n')
    print(resultUrlList8)
    print('\n')
    print(resultUrlList9)
    print('\n')
    print(resultUrlList10)
    print('\n')
    print(resultUrlList11)
    print('\n')
    print(resultUrlList12)
    print('\n')
    print(resultUrlList13)
    print('\n')
    print(resultUrlList14)
    print('\n')
    print(resultUrlList15)
    print('\n')
    print("**END_OBJECT** \n")
The actual HTML structure is that of any Google search results page.
Any help would be greatly appreciated - as I am completely lost as to why this is occurring.
EDIT:
It appears that my script is hitting Google's anti-bot protections: the page content shows messages along the lines of "This page checks to see if it's really you sending the requests, and not a robot."
I'm unsure if there are easy ways to bypass this, but I will update if I find any.
The problem is that you don't specify a user-agent. You need to send a user-agent so the request looks like a "real" user visit; a bot or browser can send a fake user-agent string to announce itself as a different client. Because the default requests user-agent is python-requests and Google recognizes it, it blocks the request, and you receive different HTML, an error page whose elements your code can't find.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
Also, instead of creating 15 resultUrlList variables, you can iterate over all results in a for loop:
# .tF2Cxc is a CSS selector for a container with the title, link and other data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'how to create minecraft server',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
----------
'''
How to Setup a Minecraft: Java Edition Server – Home
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
Minecraft Server Download
https://www.minecraft.net/en-us/download/server
Setting Up Your Own Minecraft Server - iD Tech
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out how to make things work: how to bypass blocks, how to extract data, or how to maintain the script over time if something in the HTML changes. Instead, you only need to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
----------
'''
How to Setup a Minecraft: Java Edition Server – Home
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
Minecraft Server Download
https://www.minecraft.net/en-us/download/server
Setting Up Your Own Minecraft Server - iD Tech
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Disclaimer, I work for SerpApi.