I am trying to get all emails from a specific page and separate them into individual variables, or better yet a dictionary. This is my code:
import requests
import re
import json
from bs4 import BeautifulSoup
page = "http://www.example.net"
info = requests.get(page)
if info.status_code == 200:
    print("Page accessed")
else:
    print("Error accessing page")
code = info.content
soup = BeautifulSoup(code, 'lxml')
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
print(allEmails)
sep = ","
allEmailsStr = str(allEmails)
print(type(allEmails))
print(type(allEmailsStr))
j = allEmailsStr.split(sep, 1)[0]
print(j)
Excuse the poor variable names; I put this together quickly so that it would run fine on its own. The output from the example website would be, for example, something like:
[k, kolyma, location, balkans]
So if I ran the program it would return only
[k
But if I wanted it to return every email on there individually how would I do that?
To get just the email strings, you can try:
emails = []
for email_link in allEmails:
    emails.append(email_link.get("href").replace('mailto:', ''))
print(emails)
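Since you mentioned wanting a dictionary, here is a minimal sketch of the same loop that builds one (assuming the visible link text makes a sensible key, which your question doesn't specify):

emails = {}
for email_link in allEmails:
    # key each entry by the visible link text; the value is the bare address
    address = email_link.get("href").replace("mailto:", "")
    emails[email_link.get_text(strip=True)] = address
print(emails)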
Based on your expected output, you can use the unwrap function of BeautifulSoup
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
for Email in allEmails:
    print(Email.unwrap())  # This will print the whole element along with the tag
    # k
Python code:
import string
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
initial = list(string.ascii_lowercase)
initial_url = [url + i for i in initial]
html_initial = [urllib.request.urlopen(i).read() for i in initial_url]
soup_initial = [BeautifulSoup(i, 'html.parser') for i in html_initial]
tags_initial = [i('a') for i in soup_initial]
print(tags_initial[0][50])
Results example:
<a href="...">Shareef Abdur-Rahim</a>
From the example above, I want to extract the player's name, which is 'Shareef Abdur-Rahim', but I want to do it for all of the tags_initial lists.
Does anyone have an idea?
Could you modify your post by adding your code so that we can help you better?
Maybe this could help you:
name = soup.findAll(YOUR_SELECTOR)[0].string
UPDATE
import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.basketball-reference.com/players/'
# Alphabet
initial = list(string.ascii_lowercase)
datas = []
# URLS
urls = [url + i for i in initial]
for url in urls:
    # Soup Object
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Players link
    url_links = soup.findAll("a", href=re.compile("players"))
    for link in url_links:
        # Player name
        datas.append(link.string)
print("datas : ", datas)
Then, "datas" contains all the names of the players, but I advise you to do a little processing afterwards to remove some erroneous information like "..." or perhaps duplicates
There are probably better ways but I'd do it like this:
html = "a href=\"/teams/LAL/2021.html\">Los Angeles Lakers</a"
index = html.find("a href")
index = html.find(">", index) + 1
index_end = html.find("<", index)
print(html[index:index_end])
If you're using a scraper library it probably has a similar function built-in.
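With BeautifulSoup, which this thread already uses, the same extraction is built in; a small sketch using the snippet above:

from bs4 import BeautifulSoup

html = '<a href="/teams/LAL/2021.html">Los Angeles Lakers</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.get_text())  # Los Angeles Lakers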
In the code below I am trying to extract IP addresses and ports from the table on http://free-proxy-list.net using BeautifulSoup.
But every time I get the whole row, which is useless because I can't separate the IP addresses from their ports.
How can I get IP and port separated?
Here is my code:
def get_proxy(self):
    response = requests.get(self.url)
    soup = bs(response.content, 'html.parser')
    data_list = [tr for tr in soup.select('tr') if tr.td]
    for i in data_list:
        print(i.text)
In your code, instead of i.text you could use i.getText(' ,') (or another separator of your choice). That will give you comma-separated IPs and ports.
Moreover, for convenience, you could load the proxy list into a DataFrame as well.
Make the following changes/additions to your code:
import pandas as pd  # needed for the DataFrame step

soup = bs(response.content, 'html.parser')
data_list = [tr for tr in soup.select('tr') if tr.td]
data_list2 = [tr.getText(' ,') for tr in soup.select('tr') if tr.td]
#for i in data_list:
#    print(i.text)
df = pd.DataFrame(data_list2, columns=['proxy_list'])
df_proxyList = df['proxy_list'].str.split(',', expand=True)[0:300]
df_proxyList would then look like a table with the IP in the first column and the port in the second (plus a few garbage columns).
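If you only want the first two columns, a small follow-up sketch (this assumes the site keeps the IP in the first cell and the port in the second; the column names 'IP' and 'Port' are just my choice):

df_proxyList = df_proxyList[[0, 1]]    # keep only the IP and port columns
df_proxyList.columns = ['IP', 'Port']  # readable (hypothetical) names
print(df_proxyList.head())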
Try this. I had to add the isnumeric() condition to make sure that the code doesn't include the data from another table which is present on the same website.
from bs4 import BeautifulSoup as bs
import requests
from collections import defaultdict

def get_proxy(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    mapping = defaultdict()
    for tr in soup.select('tr'):
        # only consider rows with 8 children (the proxy table's layout)
        if len(list(tr)) == 8:
            ip_val = str(list(tr)[0].text)
            port_val = str(list(tr)[1].text)
            if port_val.isnumeric():
                mapping[ip_val] = port_val
    for items in mapping.keys():
        print("IP:", items)
        print("PORT:", mapping[items])

if __name__ == '__main__':
    url = "http://free-proxy-list.net"
    get_proxy(url)
I've created a script in Python to parse the website address of different agencies from the landing page and the location address from each inner page. What I can't understand is how I can return a string and a list at the same time so that they can be reused in another function. To be clearer: I wish to return the website address and the list of links from the collect_links() function and reuse them in the get_info() function. My current approach throws an error - ValueError: not enough values to unpack (expected 2, got 1).
This is my attempt so far:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
def collect_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    website = [soup.select_one("p.company-profile-website > a").get("href")]
    items = [urljoin(url, item.get("href")) for item in soup.select("[id^='company-'] .search-companies-result-info h2 > a")]
    return website, items

def get_info(website, link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    address = soup.select_one("p.footer-right").get_text(strip=True)
    print(website, address)

if __name__ == '__main__':
    url = "https://www.cv-library.co.uk/companies/agencies/A"
    for item, link in collect_links(url):
        get_info(item, link)
How can I return a string and a list from one function to another?
PS: I would like to stick to the design I've already tried.
Your website is a list containing a single string, not a string, because you've enclosed it in a [] literal. You need to drop the [] to make it a string, as there is no point making that a list.
After doing that, you can get the return value, and iterate over the links like:
if __name__ == '__main__':
    url = "https://www.cv-library.co.uk/companies/agencies/A"
    website, links = collect_links(url)
    for link in links:
        get_info(website, link)
The main error in the code is in this line:
website = [soup.select_one("p.company-profile-website > a").get("href")]
This only returns one value:
http://www.autoskills-uk.com
Your function should be:
def collect_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    websites = [x.get("href") for x in soup.select("p.company-profile-website > a")]  #<============== Changed
    items = [urljoin(url, item.get("href")) for item in soup.select("[id^='company-'] .search-companies-result-info h2 > a")]
    return zip(websites, items)
Return a zip of websites and items.
Now you can unpack item and link in the for loop:
if __name__ == '__main__':
    url = "https://www.cv-library.co.uk/companies/agencies/A"
    for item, link in collect_links(url):
        get_info(item, link)
You are returning two lists as a tuple (one with a single element and another with many elements) and trying to iterate over this tuple, unpacking each list into the two variables item and link.
I don't see what you really want to do, but you should separate the for loop from the return values:
website, links = collect_links(url)
for link in links:
    get_info(website[0], link)
How would I modify the parameters of the findAll method to read both li's and id's? li's are elements and id's are attributes, correct?
#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites

import csv
import requests
from bs4 import BeautifulSoup

###################### SURFLINE URL STRINGS AND TAG ###########################
slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'
slTagText = 'observed-wave-range'
slTag = 'id'
#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl, slCardiffUrl]
###############################################################################

#################### MAGICSEAWEED URL STRINGS AND TAG #########################
msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'
msTagText = 'rating-text text-dark'
msTag = 'li'
#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]
###############################################################################

'''
This method iterates through a list of urls and extracts the surf report from
the webpage dependent upon its tag location

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each
         break
tag: the html tag where the actual report lives on the page

returns: a list of strings of each breaks surf report
'''
def extract_Reports(rootUrl, urlList, tag, tagText):
    #empty list to hold reports
    reports = []

    #loop thru URLs
    for url in urlList:
        try:
            #request page
            request = requests.get(rootUrl + url)

            #turn into soup
            soup = BeautifulSoup(request.content, 'lxml')

            #get the tag where report lives
            reportTag = soup.findAll(id = tagText)

            for report in reportTag:
                reports.append(report.string.strip())

        #notify if fail
        except:
            print('scrape failure')
            pass

    return reports
#END METHOD

slReports = extract_Reports(slRootUrl, slUrls, slTag, slTagText)
msReports = extract_Reports(msRootUrl, msUrls, msTag, msTagText)

print(slReports)
print(msReports)
As of right now, only slReports prints correctly because I have it explicitly set to id = tagText. I am also aware that my tag parameter is not used currently.
So the problem is that you want to search the parse tree for elements that have either a class name of rating-text (it turns out you do not need text-dark to identify the relevant elements in the case of Magicseaweed) or an ID of observed-wave-range, using a single findAll call.
You can use a filter function to achieve this:
def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
           or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')
Then change your extract_Reports function to read:
reportTag = soup.findAll(reportTagFilter)[0]
reports.append(reportTag.text.strip())
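Putting it together, the relevant part of extract_Reports would then look roughly like this (a sketch; the URL handling and error handling stay as in your original, and the now-unused tag/tagText parameters are kept so the existing calls still work):

def extract_Reports(rootUrl, urlList, tag, tagText):
    reports = []
    for url in urlList:
        try:
            request = requests.get(rootUrl + url)
            soup = BeautifulSoup(request.content, 'lxml')
            # reportTagFilter matches either the class-based or the id-based report element
            reportTag = soup.findAll(reportTagFilter)[0]
            reports.append(reportTag.text.strip())
        except Exception:
            print('scrape failure')
    return reports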
In order to practice and to help my sister get emails from doctors for her baby, I have designed this email harvester. It makes a search, cleans the URLs given, adds them to a dictionary, and parses them for emails in two different ways.
The code has been taken from different places, so if you correct me, please explain your improvement clearly, as I am already working at the limit of my knowledge.
The question is how to get emails better (and improve code, if possible). I'll post the code and the exact output below:
CODE of my program:
import requests, re, webbrowser, bs4
from selenium import webdriver
from bs4 import BeautifulSoup
import time, random, webbrowser
import urllib.request

def google_this():  #Googles and gets the first few links
    search_terms = ['Fiat', 'Lambrusco']
    added_terms = 'email contact? @'

    #This searches for certain keywords in Google and parses results with BS
    for el in search_terms:
        webpage = 'http://google.com/search?q=' + str(el) + str(added_terms)
        print('Searching for the terms...', el, added_terms)

        headers = {'User-agent': 'Mozilla/5.0'}
        res = requests.get(webpage, headers=headers)
        #res.raise_for_status()

        statusCode = res.status_code
        if statusCode == 200:
            soup = bs4.BeautifulSoup(res.text, 'lxml')
            serp_res_rawlink = soup.select('.r a')

            dicti = []  #This gets the href links
            for link in serp_res_rawlink:
                url = link.get('href')
                if 'pdf' not in url:
                    dicti.append(url)

            dicti_url = []  #This cleans the "url?q=" from link
            for el in dicti:
                if '/url?q=' in el:
                    result = (el.strip('/url?q='))
                    dicti_url.append(result)
            #print(dicti_url)

            dicti_pretty_links = []  #This cleans the gibberish at end of url
            for el in dicti_url[0:4]:
                pretty_url = el.partition('&')[0]
                dicti_pretty_links.append(pretty_url)
            print(dicti_pretty_links)

            for el in dicti_pretty_links:  #This converts page in BS soup
                # browser = webdriver.Firefox()
                # browser.get(el)
                # print('I have been in the element below and closed the window')
                # print(el)
                # time.sleep(1)
                # browser.close()

                webpage = (el)
                headers = {'User-agent': 'Mozilla/5.0'}
                res = requests.get(webpage, headers=headers)
                #res.raise_for_status()

                statusCode = res.status_code
                if statusCode == 200:
                    soup = bs4.BeautifulSoup(res.text, 'lxml')

                    #This is the first way to search for an email in soup
                    emailRegex = re.compile(r'([a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+])', re.VERBOSE)
                    mo = emailRegex.findall(res.text)
                    #mo = emailRegex.findall(soup.prettify())
                    print('THIS BELOW IS REGEX')
                    print(mo)

                    #This is the second way to search for an email in soup:
                    mailtos = soup.select('a[href^=mailto]')
                    for el in mailtos:
                        print('THIS BELOW IS MAILTOS')
                        print(el.text)

        time.sleep(random.uniform(0.5, 1))

google_this()
And here is the OUTPUT from running this very same code above. As you can see, some emails seem to be found, but they are cut just after the "@" symbol:
C:\Users\SK\AppData\Local\Programs\Python\Python35-32\python.exe C:/Users/SK/PycharmProjects/untitled/another_temperase.py
Searching for the terms... Fiat email contact? @
['http://www.fcagroup.com/en-US/footer/Pages/contacts.aspx', 'http://www.fiat.co.uk/header-contacts', 'http://www.fiatusa.com/webselfservice/fiat/', 'https://twitter.com/nic_fincher81/status/672505531689394176']
THIS BELOW IS REGEX
['investor.relations@f', 'investor.relations@f', 'sustainability@f', 'sustainability@f', 'mediarelations@f', 'mediarelations@f']
THIS BELOW IS MAILTOS
investor.relations@fcagroup.com
THIS BELOW IS MAILTOS
sustainability@fcagroup.com
THIS BELOW IS MAILTOS
mediarelations@fcagroup.com
THIS BELOW IS REGEX
[]
THIS BELOW IS REGEX
[]
THIS BELOW IS REGEX
['nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y']
Searching for the terms... Lambrusco email contact? @
['http://www.labattagliola.it/%3Flang%3Den']
Process finished with exit code 0
I would recommend a more restrictive version that still catches all of the emails:
([a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+)
The problem of not catching anything after the first letter after the @ is that the regex is missing a +:
([a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+]+)
Originally, this part [a-zA-Z0-9_.+] simply said to catch one of any of the following characters: a-z, A-Z, 0-9, ., _, +.
I would also be careful about @+, which says to catch 1 or more "@" symbols.
So a potentially valid email could look like this:
..................@@@@@@@@@@@@@@@@@@@@.................
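For a quick sanity check, here is a small sketch of the recommended pattern against a sample string (the addresses are made up for illustration):

import re

emailRegex = re.compile(r'([a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+)')
sample = 'Contact investor.relations@example.com or media@example.co.uk for details.'
print(emailRegex.findall(sample))
# ['investor.relations@example.com', 'media@example.co.uk']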