How to put web-scraped data into a list in Python

This is the code I used to get the data from a website that lists all the possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)

You do not need BeautifulSoup; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or, if you would like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
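
For the Wordle clone mentioned in the question, the split text can be tidied up before use. The following is only a minimal sketch: `SAMPLE_TEXT` stands in for the downloaded response text, and `parse_word_list` is a hypothetical helper name, not part of the original answer.

```python
import random

# Stand-in for requests.get(url).text; the real file has one word per line.
SAMPLE_TEXT = "women\nnikau\nswack\nfeens\n"


def parse_word_list(text):
    """Split on whitespace and keep only five-letter alphabetic words."""
    return [w.lower() for w in text.split() if len(w) == 5 and w.isalpha()]


word_list = parse_word_list(SAMPLE_TEXT)

# Pick the secret word for one round of the game.
secret = random.choice(word_list)
```

With the real download, `parse_word_list(requests.get(url).text)` would be the equivalent call.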

Related

Scraping data from site

I've tried to scrape some data from a site using BeautifulSoup. Some of the data scraped successfully, but for other fields (phone, website) I get errors.
https://yellowpages.com.eg/en/search/spas/3231
This is the link to the site I'm trying to scrape.
from bs4 import BeautifulSoup
import requests
url = 'https://yellowpages.com.eg/en/search/spas/3231'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
info = soup.find_all('div', class_='col-xs-12 padding_0')
for item in info:
    phone = item.find('span', class_='phone-spans')
    print(phone)
Every time I run this code, the result is None.
I'm not sure where the selectors in that code come from, as I couldn't see anything similar on the page. However, this code works:
from bs4 import BeautifulSoup
import requests
url = 'https://yellowpages.com.eg/en/search/spas/3231'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
for item in soup.find_all('div', class_='searchResultsDiv'):
    # Each result block holds the company name and a call link.
    name = item.find('a', class_='companyName').text.strip()
    phone = item.find('a', class_='search-call-mob')['href']
    print(name, phone)
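
`find` returns None when nothing matches, which is why the question's loop printed None, and why chaining `.text` or `['href']` onto a missing element raises an error. A hedged sketch of a None-safe version follows; `SAMPLE_HTML` is invented markup that only reuses the class names from the answer, and `extract_listings` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; the live page may be structured differently.
SAMPLE_HTML = """
<div class="searchResultsDiv">
  <a class="companyName"> Example Spa </a>
  <a class="search-call-mob" href="tel:0000000000">Call</a>
</div>
<div class="searchResultsDiv">
  <a class="companyName">No Phone Spa</a>
</div>
"""


def extract_listings(html):
    """Return (name, phone) pairs, using None for any missing element."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="searchResultsDiv"):
        name_tag = item.find("a", class_="companyName")
        phone_tag = item.find("a", class_="search-call-mob")
        name = name_tag.get_text(strip=True) if name_tag else None
        phone = phone_tag.get("href") if phone_tag else None
        rows.append((name, phone))
    return rows


listings = extract_listings(SAMPLE_HTML)
```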

How to scrape main headings of a website using python in colab?

Hi, I am a beginner and would like to get the list of all datasets from the website 'https://www.kaggle.com/datasets' based on the filters 'csv' and 'only datasets with tasks'.
I applied the filters and inspected the element. My attempt returns an empty list. This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/datasets?sort=usability&fileType=csv&tasks=true'
html = urlopen(url)
# urlopen returns a response object with read(), not a .text attribute
soup = BeautifulSoup(html.read(), 'lxml')
titles = soup.find_all('li')
print(titles)
Can anyone help?

Parsing an HTML table gets empty soup with BeautifulSoup and requests

I'm trying to load the whole table at this URL into a DataFrame (821 rows in total; I need the entire table): "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats". The code I'm using is this:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup) # It doesn't print anything
My idea is to get the contents of soup, look for the tag <script> jQuery.extend(Drupal.settings, {"basePath": ..., and extract from it the following JSON link, which holds all the data in the table: https://www.timeshighereducation.com/sites/default/files/the_data_rankings/life_sciences_rankings_2020_0__a2e62a5137c61efeef38fac9fb83a262.json. I already have a function to read this JSON link, but I first need to find the info in soup and then get the JSON link. It has to be done this way because I have to read many tables, and finding each JSON link by manual inspection is not an option for me.
You want the following regex pattern, which finds the desired string after "url":
from bs4 import BeautifulSoup as bs
import requests
import re
with requests.Session() as s:
    # A browser-like User-Agent is sent with every request in the session.
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
    # Capture the value after "url" and unescape the JSON-style "\/" slashes.
    url = re.search(r'"url":"(.*?)"', r.text).group(1).replace('\\/', '/')
    data = s.get(url).json()
    print(data)
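
The extraction step can be tried offline on a small string. This is a sketch under assumptions: `SAMPLE_PAGE` is an invented fragment shaped like the script tag described in the question, the example.com link is a placeholder, and `find_json_url` is a hypothetical helper that also handles the case where the pattern is missing.

```python
import re

# Invented fragment; inside the page source the slashes are escaped as "\/".
SAMPLE_PAGE = (
    '<script>jQuery.extend(Drupal.settings, {"basePath":"\\/",'
    '"url":"https:\\/\\/example.com\\/data\\/rankings.json"});</script>'
)


def find_json_url(page_text):
    """Return the unescaped value of the first "url" key, or None if absent."""
    match = re.search(r'"url":"(.*?)"', page_text)
    if match is None:
        return None
    return match.group(1).replace("\\/", "/")


json_url = find_json_url(SAMPLE_PAGE)
```

Returning None instead of raising makes it easier to loop over many ranking pages and skip the ones where the pattern is not found.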

BS4 returns [] instead of the wanted HTML tag

I want to parse the given website and scrape the table. To me the code looks right. I'm new to Python and web parsing.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'lxml-xml')
cases = doc.find_all('div', {"class": "cell"})
print(cases)
Doing this returns:
[]
Change your parser from 'lxml-xml' to 'html.parser', adjust the class, and there you have it.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://delhifightscorona.in/').text, 'html.parser').find('div', {"class": "grid-x grid-padding-x small-up-2"})
print(soup.find("h3").getText())
Output:
423,831
You can choose to print only the cases or the total stats with the date.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'html.parser')
stats = doc.find('div', {"class": "cell medium-5"})
print(stats.text)  # Print the whole block with dates and the figures
cases = stats.find('h3')
print(cases.text)  # Print the cases only
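
The scraped figure is text such as 423,831, so it needs converting before any arithmetic. A minimal sketch, assuming the count uses commas as thousands separators; `parse_count` is a hypothetical helper name.

```python
def parse_count(text):
    """Convert a string like '423,831' to an integer."""
    return int(text.strip().replace(",", ""))


total_cases = parse_count("423,831")
```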

How to get text following a table/span with BeautifulSoup and Python?

I need to get the text 2,585 shown in the screenshot below. I'm very new to coding, but this is what I have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
[Screenshot: website HTML]
Three things: you're using requests, not urllib2. You're selecting XML with namespaces, so you need to use xml as the parser. The element you want is not span; it is ix:nonFraction. Here is a working example using another web page (you just need to point it at your page and use the commented line).
# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print (ix.text)
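
To see why the tag name matters, the same kind of lookup can be tried on a small well-formed sample. This sketch swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup; the sample document, the `Fact-EXAMPLE` id, and the namespace URI bound to `ix` are all assumptions for illustration, and a real filing may not parse as strict XML.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "ix" prefix; check the real document's xmlns:ix.
NS = {"ix": "http://www.xbrl.org/2013/inlineXBRL"}

# Invented sample; the value 2,585 is the figure the question wants to read.
SAMPLE_XML = """<root xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <ix:nonFraction id="Fact-EXAMPLE">2,585</ix:nonFraction>
  <span id="Fact-OTHER">not this one</span>
</root>"""

root = ET.fromstring(SAMPLE_XML)

# Searching for a plain span with this id finds nothing ...
as_span = root.find(".//span[@id='Fact-EXAMPLE']")

# ... while the namespaced element with the same id is found.
fact = root.find(".//ix:nonFraction[@id='Fact-EXAMPLE']", NS)
fact_text = fact.text if fact is not None else None
```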
