I am fairly new to this and I am facing issues all around. Any help/guidance is really appreciated!
I have a dataframe in the following structure:
data:
LINK
<link_one>
<link_two>
<link_three>
The dataframe is named data and it has one column called LINK, which contains a few weblinks.
I am trying to take each link from the column LINK, scrape the text body contents of that page, and attach the result to a column called CONTENT in the dataframe.
Here is the outcome I am hoping for:
data:
LINK         CONTENT
<link_one>   <text_body_one>
<link_two>   <text_body_two>
<link_three> <text_body_three>
This is what I have so far:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
data = pd.read_csv("~/Downloads/links.csv")
def body_content(val):
    url = val
    try:
        page = requests.get(url, verify=False).text
    except requests.ConnectionError:
        pass
    soup = BeautifulSoup(page, 'lxml')
    p_tags = soup.find_all('p')
    p_tags_text = [tag.get_text().strip() for tag in p_tags]
    sentence_list = [sentence for sentence in p_tags_text if not '\n' in sentence]
    sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
    article = ' '.join(sentence_list)
    return article

data['CONTENT'] = zip(*data['LINK'].map(body_content))
The function body_content itself works, but I cannot get the contents to attach properly to the dataframe. I am getting the following error:
UnboundLocalError: local variable 'page' referenced before assignment
Thank you for your time!
The problem is probably that in the try/except part, the code goes to the except branch and thus never creates the variable page. You can do the following:
    except requests.ConnectionError:
        return ''
So if it has a connection error, it will return an empty string.
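For completeness, a minimal sketch of the corrected function under that fix; note the final assignment should also drop the zip(*...), since map already returns one string per link:

def body_content(val):
    try:
        page = requests.get(val, verify=False).text
    except requests.ConnectionError:
        # Bail out early so 'page' is never referenced unassigned.
        return ''
    soup = BeautifulSoup(page, 'lxml')
    p_tags_text = [tag.get_text().strip() for tag in soup.find_all('p')]
    sentences = [s for s in p_tags_text if '\n' not in s and '.' in s]
    return ' '.join(sentences)

# map returns one piece of text per link, so assign it directly:
data['CONTENT'] = data['LINK'].map(body_content)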
I am trying to scrape some precise lines and create a table from the collected data (URL attached), but I cannot get more than the entire body text, so I am stuck.
To give an example: I would like to arrive at the table below by scraping the details from the body content. All the details are there; any help on how to retrieve them in the form given below would be much appreciated.
My code is:
import requests
from bs4 import BeautifulSoup
# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'
# creating request object
req = requests.get(url)
# creating soup object
data = BeautifulSoup(req.text, 'html.parser')
# finding all li tags in the body and printing the text within them
data1 = data.find('body')
for li in data1.find_all("li"):
    print(li.text, end=" ")
First find the ul, then find the li tags inside it. Scrape the needed data, save it in a variable, and build the table using pandas. With that done, if you want to save the table, write it to a CSV file; otherwise just print it.
Here's the code implementation of all the above:
from bs4 import BeautifulSoup
import requests
import pandas as pd
page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        dic["name"].append(li.find(text=True, recursive=False).strip())
        dic["value"].append(li.find("span").text.replace(" ", ""))
        print(li.find(text=True, recursive=False).strip(), li.find("span").text.replace(" ", ""))
    except:
        pass
df = pd.DataFrame(dic)
print(df)
# If you want to save this as a file then uncomment the following line:
# df.to_csv("<FILENAME>.csv")
Additionally, if you want to scrape all the "categories": I don't understand the language, so I don't know which parts are useful and which are not, but here's the code anyway; just change this part of the code above:
soup = BeautifulSoup(page.content, 'lxml')
dic = {"name": [], "value": []}
lis = soup.find_all("ul", class_="list-group row")
for li in lis:
    a = li.find_all("li")[1:-1]
    for b in a:
        try:
            print(b.find(text=True, recursive=False).strip(), "\t", b.find("span").text.replace(" ", "").replace(",", ""))
            dic["name"].append(b.find(text=True, recursive=False).strip())
            dic["value"].append(b.find("span").text.replace(" ", "").replace(",", ""))
        except Exception as e:
            pass
df = pd.DataFrame(dic)
Find the main ul tag by its specific class, and from it find all the li tags:
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
main_values = []
for i in main_data:
    values.append(i.find("span").get_text())
    names.append(i.find(text=True, recursive=False))
main_values.append(values)
For a table representation, use the pandas module:
import pandas as pd

df = pd.DataFrame(columns=names, data=main_values)
df
Output:
  Liczba mieszkańców (2011) Kod pocztowy Numer kierunkowy
0                     1 935       05-532         (+48) 22
I have a problem with my code (I use bs4):
elif 'temperature' in query:
    speak("where?")
    miejsce = takecommand().lower()
    search = f"Temperature in {miejsce}"
    url = f'https://www.google.com/search?q={search}'
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"In {search} there is {temp}")
and the error is:
temp = data.find("div", class_="BNeawe").text
AttributeError: 'NoneType' object has no attribute 'text'
Could you help me, please?
data.find("div", class_="BNeawe") didnt return anything, so i believe google changed how it displays weather since you last ran this code successfully.
If you search for yourself 'Weather in {place}' then right click the weather widget and choose Inspect Element (browser dependent), you can look for yourself at where the data is in the page, and see which class the data is under.
It appears it was previously under the BNeawe class.
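Whatever the current class turns out to be, it is worth guarding against find() returning None. A minimal sketch, reusing data, search, and speak from the question; the class name below is just a placeholder for whatever you find via Inspect Element:

# Guard against find() returning None when Google changes its markup.
div = data.find("div", class_="BNeawe")  # replace with the class you found
if div is not None:
    temp = div.text
    speak(f"In {search} there is {temp}")
else:
    speak("Sorry, I could not find the temperature on the page.")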
elif "temperature" in query or "temperatures" in query:
search = "Temperature in New York"
url = f"https://www.google.com/search?q={search}:"
r = requests.get(url)
data = BeautifulSoup(r.text, "html.parser")
temp = data.find("div", class_="BNeawe").text
speak(f"Currently, the temperature in your region is {temp}")
Try this one; you were experiencing your problem in line 5, which is '(r.text, "html.parser")'.
Try to avoid these comma and space mistakes in the code...
Best practice would be to use a Google/weather API directly. If you want to scrape, try to avoid selecting your elements by class, because those classes are often dynamic.
Instead, focus on an id if possible, or use the HTML structure:
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
Example
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/search?q=temperature"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.5'}, cookies={'CONSENT': 'YES+'})
soup = BeautifulSoup(response.text, 'html.parser')
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
I have a txt file filled with multiple URLs; each URL is an article with its text and corresponding SDG (example of one article 1).
The text parts of an article are in the tags 'div.text.-normal.content' and then in 'p',
and the SDGs are in the tags 'div.tax-section.text.-normal.small' and then in 'span'.
To extract them I use the following lines of code:
data = []
with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                text = soup.select_one('div.text-normal').get_text(strip=True)
                topic = soup.select_one('div.tax-section').get_text(strip=True)
                data.append(
                    {
                        'text': text,
                        'topic': topic,
                    }
                )
                pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)
            except AttributeError:
                print(" ")
        time.sleep(3)
But I get no result. I've previously used this code to extract the same type of information from another website with clearer class names. I've also tried to enter "div.text.-normal.content" and "div.tax-section.text.-normal.small", but got the same result.
I think the classes I'm calling in this example are wrong, and I would like to know what I've missed in these class names.
To select the text you can go with:
soup.select_one('div.text.-normal.content').get_text(strip=True)
I think there is something wrong with the names of the classes; just chain them with a . for every whitespace between them.
or:
soup.select_one('div.c-single-content').get_text(strip=True)
To get the topics as mentioned you can go with:
'^^'.join([topic.get_text(strip=True) for topic in soup.select_one('div.tax-section.text.-normal.small').select('a')])
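A minimal sketch putting both selectors back into the loop from the question (the file names come from the question; the '^^' join is just the separator suggested above):

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []
with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            soup = BeautifulSoup(response.text, "html.parser")
            text_tag = soup.select_one('div.text.-normal.content')
            topic_tags = soup.select('div.tax-section.text.-normal.small a')
            if text_tag:  # skip pages where the selector finds nothing
                data.append({
                    'text': text_tag.get_text(strip=True),
                    'topic': '^^'.join(t.get_text(strip=True) for t in topic_tags),
                })
        time.sleep(3)

# Write once at the end instead of on every iteration.
pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)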
I get this error:
InvalidSchema("No connection adapters were found for {!r}".format(url))
when I try to run this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

pd.set_option('display.max_colwidth', -1)
url_file = 'https://github.com/MarissaFosse/ryersoncapstone/raw/master/DailyNewsArticles.xlsx'
tstar_articles = pd.read_excel(url_file, "TorontoStar Articles", header=0)
url_to_sents = {}
for url in tstar_articles:
    url = tstar_articles['URL']
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(class_='c-article-body__content')
    results_text = [tag.get_text().strip() for tag in results]
    sentence_list = [sentence for sentence in results_text if not '\n' in sentence]
    sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
    article = ' '.join(sentence_list)
    url_to_sents[url] = article
I'm trying to use requests to read each URL from an Excel file I've created. I suspect the error is due to unseen characters, but I don't know how to check for them.
When you iterate over a dataframe, it yields only the column names. Thus, your original code first assigned Date to url, then Category, and so on; these strings are not URLs, hence the error.
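A quick demonstration of this behavior, using a hypothetical two-column frame:

import pandas as pd

# Hypothetical frame standing in for the spreadsheet.
df = pd.DataFrame({'Date': ['2020-01-01'], 'URL': ['https://example.com']})
for item in df:
    print(item)  # prints 'Date', then 'URL' -- the column names, not the rows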
By contrast, looking up a column in the dataframe returns a series you can iterate over. So when you want the URLs, iterate over tstar_articles['URL'] rather than tstar_articles. That is, instead of:
for url in tstar_articles:
    url = tstar_articles['URL']
    page = requests.get(url)
...use:
for url in tstar_articles['URL']:
    page = requests.get(url)
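If the spreadsheet cells might also contain stray whitespace (the "unseen characters" suspected in the question), an optional defensive variant is to strip each value first; this is a precaution, not something this particular error requires:

# Strip stray whitespace from each cell before requesting it.
for url in tstar_articles['URL'].astype(str).str.strip():
    page = requests.get(url)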
I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot work out how to obtain the string of text I need. I know I am missing some lines beyond the ones I have copied, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping here, especially when practically every dictionary has an API interface. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/, get the API credentials from your account, and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string, you need to format the HTML so you can see the structure. I used the HTML formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.
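As an alternative to the online formatter, here is a minimal sketch (using the same URL as above) that dumps an indented view of the page locally with BeautifulSoup's prettify(), so the structure and the 'ind' class are easy to spot:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://en.oxforddictionaries.com/definition/parachute'
soup = BeautifulSoup(urlopen(url), 'html.parser')
# Print only the start of the indented markup to skim the structure.
print(soup.prettify()[:2000])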