I created a DataFrame by scraping data with Beautiful Soup. However, I have two problems:
Why does the for loop run twice?
How can I remove the brackets from the values in the DataFrame?
import urllib.request as req
from bs4 import BeautifulSoup
import bs4
import requests
import pandas as pd
url = "https://finance.yahoo.com/quote/BF-B/profile?p=BF-B"
root = requests.get(url)
soup = BeautifulSoup(root.text, 'html.parser')
records = []
for result in soup:
    name = soup.find_all('h1', attrs={'D(ib) Fz(18px)'})
    website = soup.find_all('a')[44]
    sector = soup.find_all('span')[35]
    industry = soup.find_all('span')[37]
    records.append((name, website, sector, industry))
df = pd.DataFrame(records, columns=['name', 'website', 'sector', 'industry'])
df.head()
The result looks like this:
[image: DataFrame output]
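To get information about the company, you don't need to loop over the soup; just extract the necessary fields directly. To get rid of the [..] brackets, use the .text property of a single element (find_all returns a list of elements, which is where the brackets come from):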
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/quote/BF-B/profile?p=BF-B'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
all_data.append({
    'Name': soup.h1.text,
    # First external link inside the company profile block
    'Website': soup.select_one('.asset-profile-container a[href^="http"]')['href'],
    # Note: newer soupsieve releases prefer :-soup-contains() over the deprecated :contains()
    'Sector': soup.select_one('span:contains("Sector(s)") + span').text,
    'Industry': soup.select_one('span:contains("Industry") + span').text
})
df = pd.DataFrame(all_data)
print(df)
Prints:
                              Name                      Website              Sector                           Industry
0  Brown-Forman Corporation (BF-B)  http://www.brown-forman.com  Consumer Defensive  Beverages—Wineries & Distilleries
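Both questions can be reproduced offline. The sketch below uses a made-up HTML string (an assumption, not the real Yahoo page) to show that iterating over a BeautifulSoup object walks its top-level children, which is the likely reason the loop ran twice (a doctype plus the html element). It also shows that find_all returns a list, while .text on a single element returns a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Yahoo markup).
html = (
    "<!DOCTYPE html>"
    "<html><body><h1 class='title'>Brown-Forman Corporation (BF-B)</h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Iterating over the soup yields its top-level nodes: the doctype and <html>.
top_level = list(soup)
print(len(top_level))

# find_all returns a list-like ResultSet, so it prints with [ ] brackets.
matches = soup.find_all("h1")
print(matches)

# A single element's .text is a plain string without brackets or tags.
name = soup.h1.text
print(name)
```

If more than one element is needed, calling .text on each item of the find_all result gives the same bracket-free strings.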
I have been working on a web scraping project with the link below. The code runs without errors, but it does not show any records. Could you please check what the reason might be?
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://live-cosmos.finq.com/trading-platform/#trading/Shares/Global/USA/All/FACEBOOK"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Instrument", "Sell", "Buy", "Change"])
for row in soup.find_all('tr'):
    col = row.find_all("td")
    Instrument = col[0].text
    Sell = col[1].text
    Buy = col[2].text
    Change = col[3].text
    df = df.append({"Instrument":Instrument,"Sell":Sell,"Buy":Buy,"Change":Change}, ignore_index=True)
print(df)
Thanks
I am trying to get the names of all organizations from https://www.devex.com/organizations/search using BeautifulSoup. However, I am getting an error. Can someone please help?
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
headers = {"Accept-Language": "en-US,en;q=0.5"}
titles = []
pages = np.arange(1, 2, 1)
for page in pages:
    page = requests.get("https://www.devex.com/organizations/search?page%5Bnumber%5D=" + str(page) + "", headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    movie_div = soup.find_all('div', class_='info-container')
    sleep(randint(2,10))
    for container in movie_div:
        name = container.a.find('h3', class_= 'ng-binding').text
        titles.append(name)
movies = pd.DataFrame({
    'movie': titles,
})
# to see your dataframe
print(movies)
# to see the datatypes of your columns
print(movies.dtypes)
# to see where you're missing data and how much data is missing
print(movies.isnull().sum())
# to move all your scraped data to a CSV file
movies.to_csv('movies.csv')
You may try something like:
name = soup.find("h3", {"class": "ng-binding"})
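One possible cause of the error is that find returns None when nothing matches, so calling .text on the result raises an AttributeError. The sketch below is an illustration on made-up markup (the HTML string is an assumption, not the real devex.com page) showing the lookup with a None check. The class name ng-binding also suggests the page is rendered by Angular in the browser, in which case the HTML returned by requests may not contain these elements at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second container has no <h3> heading.
html = """
<div class="info-container"><a href="/org/1"><h3 class="ng-binding">Org One</h3></a></div>
<div class="info-container"><a href="/org/2"></a></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for container in soup.find_all("div", class_="info-container"):
    heading = container.find("h3", {"class": "ng-binding"})
    # find returns None when there is no match, so check before using it.
    if heading is not None:
        titles.append(heading.get_text(strip=True))

print(titles)
```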
There is a website that does not take query parameters (they are hidden). It has an input field with an HTML id; once you enter a value and click submit, you get a single-row table.
Is it possible to enter input values in a loop and get the table data by web scraping with Python and BeautifulSoup or Flask (not Selenium)?
Link: https://www.pesuacademy.com/Academy
Click on "Know your class & section".
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set the URL you want to webscrape from
url = 'https://www.pesuacademy.com/Academy'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
#results = soup.find(id = "knowClsSectionModalLoginId")
#R = soup.find(id = 'knowClsSectionModalTableDate')
try:
    a = soup.find('input', {'id':'knowClsSectionModalLoginId'}).get('value')
    for i in a:
        inputv = i.get('value')
        print(i, "\n")
except:
    pass
I assume you are referring to "Know your Class & Section". This is a form that submits an AJAX POST request with the loginId.
You can put all the IDs in the list loginids. The script loops through them, fetches the data for each one, and saves it to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd
loginids = ["PES1201900004"]
payload = {
    "loginId": ""
}
headers = {
    "content-type": "application/x-www-form-urlencoded"
}
url = "https://pesuacademy.com/Academy/getStudentClassInfo"
columns = ['PRN', 'SRN', 'Name', 'Class', 'Section', 'Cycle', 'Department', 'Branch', 'Institute Name']
data = []
for logins in loginids:
    payload["loginId"] = logins
    res = requests.post(url, data=payload,headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    data.append([i.get_text(strip=True) for i in soup.find("table").find("tbody").find_all("td")])
df = pd.DataFrame(data, columns=columns)
df.to_csv("data.csv", index=False)
print(df)
Output:
PRN SRN Name Class Section Cycle Department Branch Institute Name
0 PES1201900004 NA AKSHAYA RAMESH NA B ARCH
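The parsing step of the script can be tried without calling the server. The following is a minimal sketch, assuming the response contains a single-row table; the sample HTML strings and the helper name parse_row are hypothetical and only for illustration. It returns None when no table is present, a case in which the chained soup.find("table").find("tbody") call would fail with an AttributeError.

```python
from bs4 import BeautifulSoup

def parse_row(html):
    """Return the cell texts of the first table in html, or None if there is no table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return None
    return [td.get_text(strip=True) for td in table.find_all("td")]

# Hypothetical responses: one with a single-row table, one without any table.
found = "<table><tbody><tr><td>PES1201900004</td><td> NA </td><td>B ARCH</td></tr></tbody></table>"
missing = "<p>No record found</p>"

row = parse_row(found)
no_row = parse_row(missing)
print(row)
print(no_row)
```

In the loop over loginids, rows that come back as None can be skipped before building the DataFrame.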
My code works perfectly with one page, but when I try to parse all 28 pages, it only parses the first one and skips the other 27.
The main idea is to parse the data from the URL below, which has 28 pages in total. I wrote a for loop so that BeautifulSoup would parse every page, but only the first page gets parsed.
I would appreciate your recommendations on how to make it work.
Code:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')
    titles = [i.text for i in soup.select('.results-i-title')]
    #print(titles)
    companies = [i.text for i in soup.select('.results-i-company')]
    #print(companies)
    summaries = [i.text for i in soup.select('.results-i-summary')]
df = pd.DataFrame(list(zip(titles, companies, summaries)), columns = ['Title', 'Company', 'Summary'])
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig',index = False )
You are overwriting titles, companies and summaries on every iteration of the loop. Initialize the three lists before the loop and change titles = ... to titles += ... (and likewise for the other two):
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
titles = []
companies = []
summaries = []
for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')
    titles += [i.text for i in soup.select('.results-i-title')]
    companies += [i.text for i in soup.select('.results-i-company')]
    summaries += [i.text for i in soup.select('.results-i-summary')]
df = pd.DataFrame(list(zip(titles, companies, summaries)), columns = ['Title', 'Company', 'Summary'])
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig',index = False )
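A variation on the same idea is to collect one record per listing instead of three parallel lists, so the fields stay aligned even if a listing lacks one of them (zip over separate lists would silently shift values in that case). This is only a sketch on made-up HTML: the page strings, the .results-i container class, and the helper text_of are assumptions for illustration, not taken from the real site.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-ins for two downloaded result pages.
pages = [
    "<div class='results-i'><a class='results-i-title'>Developer</a>"
    "<a class='results-i-company'>Acme</a>"
    "<div class='results-i-summary'>Python, SQL</div></div>",
    "<div class='results-i'><a class='results-i-title'>Analyst</a>"
    "<a class='results-i-company'>Globex</a></div>",  # no summary here
]

def text_of(parent, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else ""

rows = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    # Accumulate across pages: rows is created once, outside the loop.
    for item in soup.select(".results-i"):
        rows.append({
            "Title": text_of(item, ".results-i-title"),
            "Company": text_of(item, ".results-i-company"),
            "Summary": text_of(item, ".results-i-summary"),
        })

df = pd.DataFrame(rows, columns=["Title", "Company", "Summary"])
print(df)
```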
I have to write code to scrape data from a website and then analyse it for university.
My problem is that I wrote this code to get some data for all products, but when I run it, it only shows a single value for each variable.
Can you help me resolve this error?
from bs4 import BeautifulSoup as soup
import urllib
from urllib.request import urlopen as uReq
import requests
myurl='https://boutique.orange.fr/mobile/choisir-un-mobile'
Uclient=uReq(myurl)
page=Uclient.read()
Uclient.close()
pagesoup=soup(page,'html.parser')
containers=pagesoup.findAll('div',{'class':'box-prod pointer'})
container=containers[0]
produit=container.img['alt']
price=container.findAll('span',{'class':'price'})
price2=container.findAll('div',{'class':'prix-seul'})
avis=container.footer.div.a.img['alt']
file="orange.csv"
f=open(file,'w')
headers='produit,prix avec abonnement, prix seul, avis\n'
f.write(headers)
for container in containers:
    produit=container.img['alt']
    price=container.findAll('span',{'class':'price'})
    price2=container.findAll('div',{'class':'prix-seul'})
    avis=container.footer.div.a.img['alt']
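You could use different selectors. Each product has two prices, so separate them by index (even positions for the "à partir de" price, odd positions for the standalone price). Extract the price-specific text using a regex findall and join.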
from bs4 import BeautifulSoup
import re
import requests
import pandas as pd
url = 'https://boutique.orange.fr/mobile/choisir-un-mobile'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
#print(len(soup.select('#resultat .box-prod.pointer')))
p = re.compile('[0-9,€]+')
altText= [item.get('alt').strip() for item in soup.select('#resultat .box-prod.pointer .lazy')]
titles = [item.text.strip().replace('\n', ' ') for item in soup.select('#resultat .box-prod.pointer .titre-produit')]
allPrices = [''.join(p.findall(item.text)) for item in soup.select('#resultat span.price')]
aPartirPrice = allPrices[0::2]
prixSeul = allPrices[1::2]
items = list(zip(titles, altText, aPartirPrice, prixSeul))
df = pd.DataFrame(items,columns=['title', 'altText', 'aPartirPrice', 'prixSeul'])
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8',index = False )
To transpose the DataFrame, use:
df = df.T
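The regex and slicing steps can be checked in isolation. The sketch below applies the same pattern, '[0-9,€]+', to a few made-up price strings (an assumption about what the span.price text might look like, not data from the site) and then splits the alternating prices by index.

```python
import re

# Same pattern as in the answer: keep only digits, commas and the euro sign.
p = re.compile('[0-9,€]+')

# Hypothetical raw texts, two per product: subscription price, then standalone price.
raw_texts = [
    "à partir de 1,00 €",
    "prix seul 809,00 €",
    "à partir de 49,00 €",
    "prix seul 459,00 €",
]

# findall returns the matching pieces; join glues them back into one string.
all_prices = [''.join(p.findall(text)) for text in raw_texts]

# Even positions hold the first price of each product, odd positions the second.
a_partir_price = all_prices[0::2]
prix_seul = all_prices[1::2]

print(a_partir_price)
print(prix_seul)
```

This split relies on every product having exactly two prices; if some products have only one, the even and odd positions no longer line up with products.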