I'm trying to scrape Yahoo's search engine, using a keyword like "python".
I have written this little program:
query = "python"
url = {"https://fr.search.yahoo.com/search?p=&fr=yfp-search-sb",
"https://fr.search.yahoo.com/search?p=&fr=yfp-search-sb&b=11&pz=10&pstart=5"}
def checker():
for yahoo in url:
yahooo = yahoo.replace("&fr",query + "&fr")
r = requests.get(yahooo)
soup = bs(r.text, 'html.parser')
links = soup.find_all('a')
for link in soup.find_all('a'):
a = link.get('href')
unquote(a)
print("Urls : " + a)
with open("Yahoo.txt", mode="a",encoding="utf-8") as fullz:
fullz.write(a + "\n")
fullz.close()
lines_seen = set() # holds lines already seen
outfile = open("Yahoonodup.txt", "w", encoding="utf-8")
for line in open("Yahoo.txt", "r", encoding="utf-8"):
if line not in lines_seen: # not a duplicate
outfile.write(line)
lines_seen.add(line)
outfile.close()
checker()
My output file contains some URLs like this:
https://r.search.yahoo.com/cbclk2/dWU9MURCNjczQ0UwNThBNDk4MyZ1dD0xNjE2ODAzMTA5MDE4JnVvPTg0OTM3NTA2NTgyMzY5Jmx0PTImcz0xJmVzPVdHbFZxQzRHUFNfemNveGNLaUgxVkpoX3lXV2N2WFhiQkRfZklRLS0-/RV=2/RE=1616831909/RO=10/RU=https%3a%2f%2fwww.bing.com%2faclick%3fld%3de8BWTO-5A13W9y2D2Aw39AjjVUCUyb98EJf6bSa7R7dGxGXelKfNh7KW94OonXABpN7Bo9YkZqB22Evk3cfTIpJi3aGEXXKJMtDqnaNUDUVcsehzFOYyr09GoYqUE-iUywRWeOnV4aeACKf4_YX6dE2BVZAbqkvWj4HQMqeB_Fl1KlwT1v%26u%3daHR0cHMlM2ElMmYlMmZ2ZXJnbGVpY2guZm9jdXMuZGUlMmZ3YXNjaG1hc2NoaW5lJTJmJTNmY2hhbm5lbCUzZGJpbmclMjZkZXZpY2UlM2RjJTI2bmV0d29yayUzZG8lMjZjYW1wYWlnbiUzZDQwNzE4NzU1MCUyNmFkZ3JvdXAlM2QxMzU4OTk2OTA3NDAxNDE4JTI2dGFyZ2V0JTNka3dkLTg0OTM3NjAxMjIzNjUyJTNhbG9jLTcyJTI2YWQlM2Q4NDkzNzUwNjU4MjM2OSUyNmFkLWV4dGVuc2lvbiUzZA%26rlid%3d0fc40f09a4b6109e9c726f57d193ec0e/RK=2/RS=3w4U9AT_OQyaVSF.6KLwzWuo_LU-;_ylc=cnQDMQ--?IG=0ac9439bcf3f4ec087000000005bf464
And I want to turn it into the real link:
https://vergleich.focus.de/waschmaschine/?channel=bing&device=c&network=o&campaign=407187550&adgroup=1358996907401418&target=kwd-84937601223652:loc-72&ad=84937506582369&ad-extension=
Is it possible?
As seen here, the response will return the URL of the site that was responsible for returning the content. Meaning that for your example, you can do something like this:
url = 'https://r.search.yahoo.com/cbclk2/dWU9MURCNjczQ0UwNThBNDk4MyZ1dD0xNjE2ODAzMTA5MDE4JnVvPTg0OTM3NTA2NTgyMzY5Jmx0PTImcz0xJmVzPVdHbFZxQzRHUFNfemNveGNLaUgxVkpoX3lXV2N2WFhiQkRfZklRLS0-/RV=2/RE=1616831909/RO=10/RU=https%3a%2f%2fwww.bing.com%2faclick%3fld%3de8BWTO-5A13W9y2D2Aw39AjjVUCUyb98EJf6bSa7R7dGxGXelKfNh7KW94OonXABpN7Bo9YkZqB22Evk3cfTIpJi3aGEXXKJMtDqnaNUDUVcsehzFOYyr09GoYqUE-iUywRWeOnV4aeACKf4_YX6dE2BVZAbqkvWj4HQMqeB_Fl1KlwT1v%26u%3daHR0cHMlM2ElMmYlMmZ2ZXJnbGVpY2guZm9jdXMuZGUlMmZ3YXNjaG1hc2NoaW5lJTJmJTNmY2hhbm5lbCUzZGJpbmclMjZkZXZpY2UlM2RjJTI2bmV0d29yayUzZG8lMjZjYW1wYWlnbiUzZDQwNzE4NzU1MCUyNmFkZ3JvdXAlM2QxMzU4OTk2OTA3NDAxNDE4JTI2dGFyZ2V0JTNka3dkLTg0OTM3NjAxMjIzNjUyJTNhbG9jLTcyJTI2YWQlM2Q4NDkzNzUwNjU4MjM2OSUyNmFkLWV4dGVuc2lvbiUzZA%26rlid%3d0fc40f09a4b6109e9c726f57d193ec0e/RK=2/RS=3w4U9AT_OQyaVSF.6KLwzWuo_LU-;_ylc=cnQDMQ--?IG=0ac9439bcf3f4ec087000000005bf464'
response = requests.get(url)
print(response.url) ## this will give you 'https://vergleich.focus.de/waschmaschine/?channel=bing&device=c&network=o&campaign=407187550&adgroup=1358996907401418&target=kwd-84937601223652:loc-72&ad=84937506582369&ad-extension='
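If you want to apply that inside your scraping loop, here is a minimal sketch, assuming requests is allowed to follow each Yahoo redirect link and that you only resolve hrefs that start with http (other hrefs on the page may be fragments or relative links):

import requests
from bs4 import BeautifulSoup as bs

def resolve(link):
    """Follow redirects and return the final URL (falls back to the original link on error)."""
    try:
        return requests.get(link, timeout=10).url
    except requests.RequestException:
        return link

r = requests.get("https://fr.search.yahoo.com/search?p=python&fr=yfp-search-sb")
soup = bs(r.text, "html.parser")
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.startswith("http"):       # skip fragments and relative links
        print(resolve(href))          # final destination after the redirect chain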
So I have this code, but I am having issues when the data I am scraping contains commas. I want everything to stay in the first column, but when there's a comma the data spills into the second column. Is it possible to scrape and write it to the first column of the CSV only, without using pandas? Thanks
import requests
from bs4 import BeautifulSoup

# urls: the list of pages to scrape, defined earlier
i = 1
for url in urls:
    print(f'Scraping the URL no {i}')
    i += 1
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for text in soup.find('div', class_='entry-content').find_all('div', class_='streak'):
        link = text.a['href']
        text = text.a.text
        links.append(link)
        with open("/Users/Rex/Desktop/data.csv", "a") as file_object:
            file_object.write(text)
            file_object.write("\n")
CSV files have rules for escaping commas within a single column so that they are not mistakenly interpreted as a new column. This escaping is applied automatically if you use the csv module. You also only need to open the file once, so with a few more tweaks to your code:
import csv

with open("/Users/Rex/Desktop/data.csv", "a", newline='') as file_object:  # newline='' is what the csv module expects
    csv_object = csv.writer(file_object)
    i = 1
    for url in urls:
        print(f'Scraping the URL no {i}')
        i += 1
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = []
        for text in soup.find('div', class_='entry-content').find_all('div', class_='streak'):
            link = text.a['href']
            text = text.a.text.strip()
            # only record if we have text
            if text:
                links.append(link)
                csv_object.writerow([text])
NOTE: This code is skipping links that do not have text.
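To see the escaping in action, here is a quick standalone demonstration (the sample strings are made up):

import csv, io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Smith, John"])      # the comma stays inside one column
writer.writerow(["no comma here"])
print(buffer.getvalue())
# "Smith, John"
# no comma here

The field containing a comma is wrapped in quotes automatically, so any spreadsheet or csv reader keeps it in a single column.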
I want to do a search using keywords from a file, in a loop, using Selenium and BeautifulSoup:
read the 1st row, put its value (one keyword) into the search query area and search; when done, use the 2nd row from the file, and so on.
The read-file part prints all keywords, one per row, but I am not sure how to put them into the search query area, one at a time.
def SearchFuncs():
    driver.get('https://www.website.com/search/?q=pet%20care')  # put the value from one row on search/?q=
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.select('div.class_name a')
    for a in soup.select('div.class_name a'):
        #print(a['title'])
        return a

#SearchFuncs()
x = SearchFuncs()
print(x['title'])

# read file section:
with open("kw-to-search.txt", "r") as f:
    for line in f:
        print(line.strip())
Update: I also added saving the result to a file, but I tested the code without the save-to-file section.
This is the code I tried using one of the solutions provided (by broderick, thank you). I don't get any output, nor any error:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import time

def SearchFuncs(addr):
    driver.get(addr)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.select('div.class_name a')
    for a in soup.select('div.class_name a'):
        #return a
        #print(a ['title'])
        pass  # loop body is commented out here

with open("kw.txt", "r") as f:
    for line in f:
        addr_to_search = 'https://www.website.com/search/?q='
        # Build search query from lines
        pieces = line.split()
        query = ''
        for i in range(len(pieces) - 1):
            query += (pieces[i] + '%20')
        query += pieces[-1]
        # Debugging print
        print(query)
        addr_to_search += query
        SearchFuncs(addr_to_search)

textList = a['title']
outF = open("keyword_result.txt", 'a')
for line in textList:
    # write line to output file
    outF.write(line)
    #outF.write("\n")

outF.write(textList + '\n')
outF.close()
Updated with another variation
This is the variation Arthur Pereira provided (thank you, Arthur Pereira):
def SearchFuncs(url):
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.select('div.class_name a')
    for a in soup.select('div.class_name a'):
        return a

    #y = SearchFuncs(url)
    #print(y ['title'])
    #print(a['title'])

    textList = a['title']
    outF = open("Keyword_results-2.txt", 'a')
    for line in textList:
        # write line to output file
        outF.write(line)
        #outF.write("\n")

    outF.write(textList + '\n')
    outF.close()

with open("kw.txt", "r") as f:
    for line in f:
        query = line.strip().replace(" ", "%20")
        url = "https://www.website.com/search/?q=" + query
        SearchFuncs(url)
Error:
Traceback (most recent call last):
File "c:/Users/mycomp/Desktop/Python/test/Test-Search-on-Pin-fromList-1.py", line 45, in <module>
SearchFuncs(url)
File "c:/Users/mycomp/Desktop/Python/test/Test-Search-on-Pin-fromList-1.py", line 31, in SearchFuncs
textList = a['title']
TypeError: list indices must be integers or slices, not str
Iterate over each line in your text file and prepare it for the search. Then pass this URL to your search function as a parameter.
Also, I think you misunderstand the concept of return. Here your code just returns the first a element and leaves the function, so nothing after that runs:
for a in soup.select('div.Eqh.F6l.Jea.k1A.zI7.iyn.Hsu a'):
    return a
The error you are getting is because your select isn't finding anything, so you end up trying to index a list with a string:
textList = a['title']
So, assuming you want to get the text inside each anchor element, you have to find the correct div and jump into the a element. Then you can get the title and write it to a file:
def SearchFuncs(url):
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    collection = soup.select('div.Collection-Item a')
    for item in collection:
        title = item['title'].strip()
        with open("Keyword_results-2.txt", 'a', encoding="utf-8") as outF:
            outF.write(title + '\n')  # write line to output file

with open("kw.txt", "r") as f:
    for line in f:
        query = line.strip().replace(" ", "%20")
        url = "https://www.pinterest.com/search/pins/?q=" + query
        SearchFuncs(url)
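As a side note, replace(" ", "%20") only handles spaces; the standard library can percent-encode the whole keyword for you. A small sketch reusing the same kw.txt and search URL as above:

from urllib.parse import quote

with open("kw.txt", "r") as f:
    for line in f:
        keyword = line.strip()
        if not keyword:
            continue  # skip blank lines
        url = "https://www.pinterest.com/search/pins/?q=" + quote(keyword)
        print(url)    # then: SearchFuncs(url)

quote("pet care") produces "pet%20care", matching the format used above, and it also escapes other characters that are not safe in a URL.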
Try
def SearchFuncs(addr):
    driver.get(addr)
    ...
and
with open ("kw-to-search.txt", "r") as f:
for line in f:
addr_to_search = 'https://www.website.com/search/?q='
# Build search query from lines
pieces = line.split()
query = ''
for i in range(len(pieces) - 1):
query += (pieces[i] + '%20')
query += pieces[-1]
# Debugging print
print(query)
addr_to_search += query
SearchFuncs(addr_to_search)
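Note that the snippets above use a driver object that is never created in the code shown. A minimal setup could look like this sketch (the headless flag is optional, and it assumes a reasonably recent Selenium that can locate chromedriver on its own):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run without opening a browser window (optional)
driver = webdriver.Chrome(options=options)  # create the driver used by SearchFuncs
# ... run the scraping code above ...
driver.quit()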
I have a Beautiful Soup program where I find all the links on a webpage and put them in a queue.txt file. The program then takes each link from the file and finds all the links on those pages. They then get put into a crawled.txt file for all the crawled links.
I want to make sure I get no duplicates, so I want the program to go through queue.txt and crawled.txt, and if the links that have just been found are already in those files, the newly found links shouldn't be written.
I have tried printing the newly found links into a list and removing duplicates from there before writing the list to a .txt file, but that can't tell what is already in the queue file; it only removes duplicates among the links found on the one page.
This is the code:
from bs4 import BeautifulSoup
import requests
import re
from urllib.parse import urlparse

def get_links(base_url, file_name):
    page = requests.get(base_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    single_slash = re.compile(r'^/\w')
    double_slash = re.compile(r'^//\w')
    parsed_uri = urlparse(base_url)
    domain_name = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)
    with open(file_name, "a") as f:
        for tag in soup.find_all('a'):
            link = str(tag.get('href'))
            if str(link).startswith("http"):
                link = link
                print(link)
            if double_slash.match(link):
                link = 'https:' + link
                print(link)
            if single_slash.match(link):
                link = domain_name + link
                print(link)
            if str(link).startswith("#"):
                continue
            if str(link).startswith("j"):
                continue
            if str(link).startswith('q'):
                continue
            if str(link).startswith('u'):
                continue
            if str(link).startswith('N'):
                continue
            if str(link).startswith('m'):
                continue
            try:
                f.write(link + '\n')
            except:
                pass

get_links('https://stackabuse.com/reading-and-writing-lists-to-a-file-in-python/', "queue.txt")

with open('queue.txt') as f:
    lines = f.read().splitlines()
print(lines)

for link in lines:
    if lines[0] == "/":
        del lines[0]
    print(lines[0])
    with open('crawled.txt', 'a') as h:
        h.write('%s\n' % lines[0])
        h.close()
    del lines[0]
    if lines[0] == "/":
        del lines[0]
    with open('queue.txt', 'w') as filehandle:
        for listitem in lines:
            filehandle.write('%s\n' % listitem)
    page_url = lines[0]
    get_links(page_url, "queue.txt")
    print(lines)
    with open('queue.txt') as f:
        lines = f.read().splitlines()
In general for Python, when trying to remove duplicates, sets are usually a good bet. For example:
lines = open('queue.txt', 'r').readlines()
queue_set = set(lines)
result = open('queue.txt', 'w')
for line in queue_set:
    result.write(line)
result.close()  # make sure the deduplicated lines are flushed to disk
Note: This will not preserve the order of the links, but I don't see a reason for that in this case.
Also, this was adapted from this answer.
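If the goal is to skip links that already appear in either queue.txt or crawled.txt before writing anything new, the same set idea can be applied up front. A rough sketch along those lines (the file names come from the question; the helper functions are only illustrative):

def load_seen(*file_names):
    """Collect every link already recorded in the given files into one set."""
    seen = set()
    for name in file_names:
        try:
            with open(name, "r") as f:
                seen.update(line.strip() for line in f)
        except FileNotFoundError:
            pass  # a file that doesn't exist yet simply contributes nothing
    return seen

def append_new_links(new_links, queue_file="queue.txt", crawled_file="crawled.txt"):
    seen = load_seen(queue_file, crawled_file)
    with open(queue_file, "a") as f:
        for link in new_links:
            if link not in seen:
                f.write(link + "\n")
                seen.add(link)  # also dedupe within this batch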
I am a complete programming beginner, so please forgive me if I am not able to express my problem very well. I am trying to write a script that will look through a series of pages of news and record the article titles and their links. I have managed to get that done for the first page; the problem is getting the content of the subsequent pages. By searching on Stack Overflow, I think I managed to find a solution that makes the script access more than one URL, BUT it seems to be overwriting the content extracted from each page it accesses, so I always end up with the same number of recorded articles in the file. Something that might help: I know that the URLs follow this model: "/ultimas/?page=1", "/ultimas/?page=2", etc., and the site appears to use AJAX to request new articles.
Here is my code:
import csv
import requests
from bs4 import BeautifulSoup as Soup
import urllib

r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="

for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))

letters = soup.find_all("div", class_="titulo-noticia")
letters[0]

lobbying = {}
for element in letters:
    lobbying[element.a.get_text()] = {}

letters[0].a["href"]

prefix = "http://agenciabrasil.ebc.com.br"
for element in letters:
    lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]

for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

import os, csv
os.chdir("...")
with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link",])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

import json
with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)

print "Fim"
Any help on how I might go about adding the content of each page to the final file would be greatly appreciated. Thank you!
How about this one, if it serves the same purpose:
import csv, requests
from lxml import html

base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page={0}"

outfile = open('scraped_data.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Caption", "Link"])

for url in [program_url.format(page) for page in range(1, 4)]:
    response = requests.get(url)
    tree = html.fromstring(response.text)
    for title in tree.xpath("//div[@class='noticia']"):
        caption = title.xpath('.//span[@class="field-content"]/a/text()')[0]
        policy = title.xpath('.//span[@class="field-content"]/a/@href')[0]
        writer.writerow([caption, base_url + policy])

outfile.close()
It looks like the code in your for loop (for page in range(1, 4):) isn't being called because your file isn't correctly indented. If you tidy up your code, it works:
import csv, requests, os, json, urllib
from bs4 import BeautifulSoup as Soup

r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="

for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))
    letters = soup.find_all("div", class_="titulo-noticia")

    lobbying = {}
    for element in letters:
        lobbying[element.a.get_text()] = {}

    prefix = "http://agenciabrasil.ebc.com.br"
    for element in letters:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]

    for item in lobbying.keys():
        print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

    #os.chdir("...")
    with open("lobbying.csv", "w") as toWrite:
        writer = csv.writer(toWrite, delimiter=",")
        writer.writerow(["name", "link",])
        for a in lobbying.keys():
            writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

    with open("lobbying.json", "w") as writeJSON:
        json.dump(lobbying, writeJSON)

print "Fim"
Using Python 3.5, what I'm looking to do is to go to the results page of an eBay search by generating a link, save the source code as an XML document, and iterate through every individual listing, of which there could be 1000 or more. Next I want to create a dictionary with every word that appears in every listing's title (title only) and its corresponding frequency of appearance. So for example, if I search 'honda civic' and thirty of the results are 'honda civic ignition switch', I'd like my results to come out as
results = {'honda':70, 'civic':60, 'ignition':30, 'switch':30, 'jdm':15, 'interior':5}
etc., etc.
Here's a link I use:
http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1&_udlo=100&_udhi=700&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C+2003%2C+2004%2C+2005%29&_sop=16
The problem I'm having is that I only get the first 50 results, instead of the thousands of results I could potentially get with different search options. What might be a better method of going about this?
And my code:
import requests
from bs4 import BeautifulSoup
from collections import Counter

r = requests.get(url)
myfile = 'c:/users/' + myquery
fw = open(myfile + '.xml', 'w')
soup = BeautifulSoup(r.content, 'lxml')
for item in soup.find_all('ul', {'class': 'ListViewInner'}):
    fw.write(str(item))
fw.close()
print('...complete')

fr = open(myfile + '.xml', 'r')
wordfreq = Counter()
for i in fr:
    words = i.split()
    for i in words:
        wordfreq[str(i)] = wordfreq[str(i)] + 1

fw2 = open(myfile + '_2.xml', 'w')
fw2.write(str(wordfreq))
fw2.close()
You are getting the first 50 results because eBay displays 50 results per page. The solution is to parse one page at a time. For this search, you can use a different URL:
http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1&_udlo=100&_udhi=700&_sop=16&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C+2003%2C+2004%2C+2005%29&_pgn=1&_skc=50&rt=nc
Notice the _pgn=1 parameter in the URL? This is the number of the page currently displayed. If you provide a number that exceeds the number of pages for the search, an error message will appear in a div with class "sm-md".
So you can do something like:
page = 1
url_template = ("http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40"
                "&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1"
                "&_udlo=100&_udhi=700&_sop=16"
                "&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C"
                "+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C+2003%2C"
                "+2004%2C+2005%29&_pgn={}&_skc=50&rt=nc")
has_page = True

myfile = 'c:/users/' + myquery
fw = open(myfile + '.xml', 'w')
while has_page:
    r = requests.get(url_template.format(page))  # rebuild the URL for the current page
    soup = BeautifulSoup(r.content, "lxml")
    error_msg = soup.find_all('p', {'class': "sm-md"})
    if len(error_msg) > 0:
        has_page = False
        continue
    for item in soup.find_all('ul', {'class': 'ListViewInner'}):
        fw.write(str(item))
    page += 1
fw.close()
I only tested entering the pages and printing the ul, and it worked nicely.
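To get the word frequencies from the titles only, rather than from the raw HTML saved to the XML file, you could count as you go. A rough sketch building on the snippet above, assuming the listing titles sit in h3 elements with class "lvtitle" (that class name is a guess and may need adjusting to the actual eBay markup):

from collections import Counter

import requests
from bs4 import BeautifulSoup

wordfreq = Counter()
page = 1
while True:
    r = requests.get(url_template.format(page))        # url_template from the snippet above
    soup = BeautifulSoup(r.content, "lxml")
    if soup.find_all('p', {'class': "sm-md"}):          # past the last page
        break
    titles = soup.find_all('h3', {'class': 'lvtitle'})  # assumed title element
    if not titles:
        break                                           # stop if nothing matches, to avoid looping forever
    for title in titles:
        wordfreq.update(word.lower() for word in title.get_text().split())
    page += 1

print(wordfreq.most_common(20))  # e.g. [('honda', 70), ('civic', 60), ...]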