How to scrape main headings of a website using python in colab? - python

Hi, I am a beginner and would like to get the list of all datasets from the website 'https://www.kaggle.com/datasets', filtered by 'csv' and 'only datasets with tasks'.
I applied the filters and inspected the element, but my attempt returns an empty list. This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/datasets?sort=usability&fileType=csv&tasks=true'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
titles = soup.find_all('li')
print(titles)
Can anyone help?
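One possible cause, not verified for Kaggle specifically, is that the listing is built by JavaScript after the page loads, so the HTML that urlopen downloads contains no matching elements. Below is a minimal sketch of a check for that situation. It uses the standard library's html.parser rather than BeautifulSoup, and the count_tags helper and both sample pages are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical page whose list is already present in the downloaded HTML.
STATIC_PAGE = "<html><body><ul><li>Dataset A</li><li>Dataset B</li></ul></body></html>"
# Hypothetical page whose list would be filled in later by a script.
SCRIPT_PAGE = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"

print(count_tags(STATIC_PAGE)["li"])  # 2
print(count_tags(SCRIPT_PAGE)["li"])  # 0
```

If the downloaded HTML of the real page shows zero list items but plenty of script tags, the data probably has to come from somewhere other than the raw HTML, for example an official API or a browser automation tool.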

Related

how to put web scraped data into a list

This is the code I used to get the data from a website listing all possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
You do not need BeautifulSoup for this; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or, if you would like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
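The odd output in the question probably comes from list(soup), which iterates over the parsed document's top-level nodes rather than over individual words. As a small follow-up sketch, the split text can also be cleaned before use in a game. The words_from_text helper and the sample text are hypothetical and stand in for the downloaded response text.

```python
def words_from_text(text, length=5):
    """Return lowercase alphabetic words of the given length, in original order."""
    words = []
    for token in text.split():
        word = token.lower()
        if len(word) == length and word.isalpha():
            words.append(word)
    return words


# Hypothetical sample standing in for requests.get(url).text
SAMPLE_TEXT = "women\nnikau\nswack\n\nFEENS\ntoolong\n"

print(words_from_text(SAMPLE_TEXT))
# ['women', 'nikau', 'swack', 'feens']
```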

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here is what I found. As far as I understand, the price on the page you linked is filled in by scripts: the HTML that is originally downloaded does not contain that span, only script tags, so I used split to pull the value out of the script text. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few pages from AliExpress.
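The chained split calls above depend on the exact position and shape of the script text. A slightly more defensive sketch of the same idea uses a regular expression and returns None when nothing matches. The layout of SAMPLE_SCRIPT is an assumption for illustration, not the real page content; only the totalValue key and the example price come from the thread.

```python
import re

# Matches: totalValue: "<anything except a double quote>"
TOTAL_VALUE_RE = re.compile(r'totalValue:\s*"([^"]*)"')


def extract_total_value(script_text):
    """Return the quoted value after totalValue:, or None if it is absent."""
    match = TOTAL_VALUE_RE.search(script_text)
    return match.group(1) if match else None


# Hypothetical script content; the real page may be structured differently.
SAMPLE_SCRIPT = 'window.runParams = {priceModule: {totalValue: "US $14.43"}};'

print(extract_total_value(SAMPLE_SCRIPT))  # US $14.43
print(extract_total_value("var x = 1;"))   # None
```

In practice the function would be applied to the text of each script tag in turn, instead of relying on a fixed index such as res[-3].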

unable to Webscrape dropdown item [Python][beautifulsoup]

I am new to web scraping, and I am scraping this website: https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
From this page, I want to scrape this text: Regular Plan
I found the element using inspect element and wrote this code:
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
I get None when printing, and I don't know why. The code shown in inspect element and in view page source is also different: in view page source, this span and class are not there.
Why am I getting None? Can anyone please tell me how I can get that text, and why the inspect element code and the view page source code are different?
You need to change the selector because the HTML source that requests downloads is different from what inspect element shows.
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan
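Inspect element shows the page after scripts have modified it, while requests only receives the original HTML, which is why the two differ. The sketch below tries the same lookup on a small inline document so it can be tested without a network call. The markup in SAMPLE_HTML is hypothetical, loosely modelled on the select-plan id used in the answer, and the CSS selector form is an alternative to the nested find calls, not the answerer's method.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may differ.
SAMPLE_HTML = """
<select id="select-plan">
  <option value="direct">Direct plan</option>
  <option value="regular" selected="selected">Regular plan</option>
</select>
"""


def selected_option_text(html, select_id):
    """Return the text of the selected option, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    option = soup.select_one(f"select#{select_id} option[selected]")
    return option.get_text(strip=True) if option is not None else None


print(selected_option_text(SAMPLE_HTML, "select-plan"))  # Regular plan
```

Returning None instead of calling .text on a missing element avoids the AttributeError that a failed find would otherwise cause.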

How do I find the hyperlinks of a webpage when using BeautifulSoup and Python 3?

I am writing a script to only extract the hyperlinks from a webpage. This is what I have so far:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('http://www.soc.napier.ac.uk/~40009856/CW/').read()
soup = bs.BeautifulSoup(source, 'lxml')
#for paragraph in soup.find_all('p'):
# print(paragraph.string)
for url in soup.find_all('a'):
    print(url.get('href'))
I want only hyperlinks to other webpages, not the links to PDFs and email addresses that also appear in the output.
How do I specify that only hyperlinks to webpages should be returned?
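One way to approach this is to filter the href values after collecting them. The following is a minimal sketch under the assumption that "webpage" means an http or https link whose path does not end in .pdf. The helper names and the sample href list are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def is_webpage_link(href, base_url):
    """True for http(s) links that do not point at a PDF file."""
    if not href:
        return False
    parts = urlparse(urljoin(base_url, href))
    if parts.scheme not in ("http", "https"):
        return False  # drops mailto:, tel:, javascript: and similar
    return not parts.path.lower().endswith(".pdf")


def webpage_links(hrefs, base_url):
    """Return absolute URLs for the hrefs that look like webpages."""
    return [urljoin(base_url, h) for h in hrefs if is_webpage_link(h, base_url)]


BASE = "http://www.soc.napier.ac.uk/~40009856/CW/"
# Hypothetical values of the kind url.get('href') might return.
SAMPLE_HREFS = ["page2.html", "mailto:someone@example.com",
                "docs/notes.pdf", "https://example.com/", None]

for link in webpage_links(SAMPLE_HREFS, BASE):
    print(link)
```

The None entry is included because url.get('href') returns None for anchor tags that have no href attribute.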

BeautifulSoup returns empty list

I am new to scraping with Python. I am using BeautifulSoup to extract quotes from a website, and here's my code:
#!/usr/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")
links = bsObj.find_all("div", {"class:" "quote"})
print(links)
It returns:
[]
But when I try this:
for link in links:
    print(link)
It returns nothing.
(Note: this happens to me with every website.)
Edit: the purpose of the code above is just to return a Tag, not the text (the quote).
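A likely culprit is the attrs argument in the find_all call. In "class:" "quote" the colon sits inside the first string, and Python joins adjacent string literals, so the braces build a set containing one string rather than a dictionary. The short sketch below shows only that Python behaviour; the variable names are made up.

```python
# What the question's code passes: adjacent literals are concatenated,
# so this is a set containing the single string 'class:quote'.
broken = {"class:" "quote"}

# What was probably intended: a dict mapping attribute name to value.
fixed = {"class": "quote"}

print(type(broken).__name__, broken)  # set {'class:quote'}
print(type(fixed).__name__, fixed)    # dict {'class': 'quote'}
```

With the dictionary form, the call would read bsObj.find_all("div", {"class": "quote"}).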
