How to make a dataset from web scraped variables? - python

I was trying to scrape a real estate website. The problem is that I can't insert my scraped variables into one dataset. Can anyone help me, please? Thank you!
Here is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

html_text1 = requests.get('https://www.propertyfinder.ae/en/search?c=1&ob=mr&page=1').content
soup1 = BeautifulSoup(html_text1, 'lxml')
listings = soup1.find_all('a', class_='card card--clickable')
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    price = price.split()[0]
    title = listing.find('h2', class_='card__title card__title-link').text
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    location = listing.find('p', class_='card__location').text
    dataset = pd.DataFrame({property_type, price, title, bedrooms, bathrooms, location})
    print(dataset)
My output is not a proper table, however. I want it to look like a DataFrame:
Apartment | 162500 | ...
Townhouse | 162500 | ...
Villa | 7500000 | ...
Villa | 15000000 | ...

The problem with your code is that you are creating the DataFrame inside the for loop, so it is rebuilt on every iteration, and the curly braces pass a set rather than labeled columns. What you should do instead is collect the values in separate lists and then create the df from those lists.
Here's what the code will look like:
price_lst = []
title_lst = []
propertyType_lst = []
bedrooms_lst = []
bathrooms_lst = []
location_lst = []
listings = soup1.find_all('a', class_='card card--clickable')
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    price_lst.append(price)
    title = listing.find('h2', class_='card__title card__title-link').text
    title_lst.append(title)
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    propertyType_lst.append(property_type)
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bedrooms_lst.append(bedrooms)
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    bathrooms_lst.append(bathrooms)
    location = listing.find('p', class_='card__location').text
    location_lst.append(location)
dataset = pd.DataFrame(list(zip(propertyType_lst, price_lst, title_lst, bedrooms_lst, bathrooms_lst, location_lst)),
                       columns=['Property Type', 'Price', 'Title', 'Bedrooms', 'Bathrooms', 'Location'])
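If you also want the Price column to be numeric like in your expected output, here is a small follow-up sketch (assuming the scraped prices arrive as comma-separated strings such as '162,500'):
# Assumption: prices were scraped as strings like '162,500'.
# Strip the thousands separators and convert to numbers.
dataset['Price'] = pd.to_numeric(dataset['Price'].str.replace(',', '', regex=False))
print(dataset.head())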

I would recommend working with a bit more structure: use a dict or a list of dicts to store the data from each iteration and create the DataFrame at the end:
data = []
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    title = listing.find('h2').text
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    location = listing.find('p', class_='card__location').text
    data.append({
        'price': price,
        'title': title,
        'property_type': property_type,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'location': location
    })
Note: Also check your selections to avoid AttributeErrors, e.g. with a conditional expression:
title = t.text if (t := listing.find('h2')) else None
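A slightly more general sketch of that defensive pattern (the safe_text helper is illustrative, not part of the original code):
def safe_text(parent, name, class_=None):
    # Return the stripped text of the first matching tag, or None if it is absent.
    tag = parent.find(name, class_=class_) if class_ else parent.find(name)
    return tag.get_text(strip=True) if tag else None

title = safe_text(listing, 'h2')
price = safe_text(listing, 'p', class_='card__price')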
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

html_text1 = requests.get('https://www.propertyfinder.ae/en/search?c=1&ob=mr&page=1').content
soup1 = BeautifulSoup(html_text1, 'lxml')
listings = soup1.find_all('a', class_='card card--clickable')

data = []
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    title = listing.find('h2').text
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    location = listing.find('p', class_='card__location').text
    data.append({
        'price': price,
        'title': title,
        'property_type': property_type,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'location': location
    })
dataset = pd.DataFrame(data)
Output
  | price | title | property_type | bedrooms | bathrooms | location
0 | 35,000,000 | Fully Upgraded / Private Pool / Prime Location | Villa | 6 |  | District One Villas, District One, Mohammed Bin Rashid City, Dubai
1 | 2,600,000 | Vacant / Brand New and Ready / Community View | Villa | 3 |  | La Quinta, Villanova, Dubai Land, Dubai
2 | 8,950,000 | Exclusive / Newly Renovated / Prime Location | Villa | 4 |  | Jumeirah 3 Villas, Jumeirah 3, Jumeirah, Dubai
3 | 3,500,000 | Brand New / Single Row / Vastu Compliant | Villa | 3 |  | Azalea, Arabian Ranches 2, Dubai
4 | 1,455,000 | Limited Units / 3 Yrs Payment Plan / La Violeta TH | Townhouse | 3 |  | La Violeta 1, Villanova, Dubai Land, Dubai

Related

IMDb web scraping for the top 250 movies using BeautifulSoup

I know that there are many similar questions here already, but none of them gives me a satisfying answer for my problem. So here it is:
We need to create a dataframe from the top 250 movies from IMDb for an assignment. So we need to scrape the data first using BeautifulSoup.
These are the attributes that we need to scrape:
IMDb id (0111161)
Movie name (The Shawshank Redemption)
Year (1994)
Director (Frank Darabont)
Stars (Tim Robbins, Morgan Freeman, Bob Gunton)
Rating (9.3)
Number of reviews (2.6M)
Genres (Drama)
Country (USA)
Language (English)
Budget ($25,000,000)
Gross box Office Revenue ($28,884,504)
So far, I have managed to get only a few of them. I received all the separate URLs for all the movies, and now I loop over them. This is how the loop looks so far:
for x in np.arange(0, len(top_250_links)):
    url = top_250_links[x]
    req = requests.get(url)
    page = req.text
    soup = bs(page, 'html.parser')
    # ID
    # Movie Name
    Movie_name = soup.find("div", {'class': "sc-dae4a1bc-0 gwBsXc"}).get_text(strip=True).split(': ')[1]
    # Year
    year = soup.find("a", {'class': "ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"}).get_text()
    # Length
    # Director
    director = soup.find("a", {'class': "ipc-metadata-list-item__list-content-item"}).get_text()
    # Stars
    stars = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
    # Rating
    rating = soup.find("span", {'class': "sc-7ab21ed2-1 jGRxWM"}).get_text()
    rating = float(rating)
    # Number of Reviews
    reviews = soup.find("span", {'class': "score"}).get_text()
    reviews = reviews.split('K')[0]
    reviews = float(reviews) * 1000
    reviews = int(reviews)
    # Genres
    genres = soup.find("span", {'class': "ipc-chip__text"}).get_text()
    # Language
    # Country
    # Budget
    meta = soup.find("div", {'class': "ipc-metadata-list-item__label ipc-metadata-list-item__label--link"})
    # Gross box Office Revenue
    gross = soup.find("span", {'class': "ipc-metadata-list-item__list-content-item"}).get_text()
    # Combine
    movie_dict = {
        'Rank': x + 1,
        'ID': 0,
        'Movie Name': Movie_name,
        'Year': year,
        'Length': 0,
        'Director': director,
        'Stars': stars,
        'Rating': rating,
        'Number of Reviewes': reviews,
        'Genres': genres,
        'Language': 0,
        'Country': 0,
        'Budget': 0,
        'Gross box Office Revenue': 0}
    df = df.append(pd.DataFrame.from_records([movie_dict], columns=movie_dict.keys()))
I can't find a way to obtain the missing information. If anybody here has experience with this kind of task and can share their thoughts, it would help a lot of people. I think the task is not new and has been solved hundreds of times, but IMDb changed the classes and the structure in their HTML.
Thanks in advance.
BeautifulSoup has many functions for searching elements; it is good to read the full documentation.
You can build more complex selections by chaining many .find() calls with .parent, etc.:
soup.find(text='Language').parent.parent.find('a').text
For some elements you can also use the data-testid="..." attribute:
soup.find('li', {'data-testid': 'title-details-languages'}).find('a').text
Minimal working code (for The Shawshank Redemption):
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=A453PT2BTBPG41Y0HKM8&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'

response = requests.get(url)
soup = BS(response.text, 'html.parser')

print('Language:', soup.find(text='Language').parent.parent.find('a').get_text(strip=True))
print('Country of origin:', soup.find(text='Country of origin').parent.parent.find('a').get_text(strip=True))

for name in ('Language', 'Country of origin'):
    value = soup.find(text=name).parent.parent.find('a').get_text(strip=True)
    print(name, ':', value)

print('Language:', soup.find('li', {'data-testid': 'title-details-languages'}).find('a').get_text(strip=True))
print('Country of origin:', soup.find('li', {'data-testid': 'title-details-origin'}).find('a').get_text(strip=True))

for name, testid in (('Language', 'title-details-languages'), ('Country of origin', 'title-details-origin')):
    value = soup.find('li', {'data-testid': testid}).find('a').get_text(strip=True)
    print(name, ':', value)
Result:
Language: English
Country of origin: United States
Language : English
Country of origin : United States
Language: English
Country of origin: United States
Language : English
Country of origin : United States
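The same data-testid idea may also cover the budget and gross figures the question asks for; a hedged sketch (the testid values below are assumptions read from the page source at the time of writing and may change):
# Assumed testids for the box-office block; verify them in the page source.
# get_text() returns label and value together, e.g. 'Budget$25,000,000 (estimated)'.
for label, testid in (('Budget', 'title-boxoffice-budget'),
                      ('Gross worldwide', 'title-boxoffice-cumulativeworldwidegross')):
    li = soup.find('li', {'data-testid': testid})
    print(label, ':', li.get_text(strip=True) if li else None)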

Python BeautifulSoup Not Getting Correct Value

I am trying to scrape movie data from https://www.imdb.com/search/title/?title_type=feature&genres=comedy&explore=genres, but when I try to scrape the movie runtime text I get an error saying get_text is not callable. That is because some of the movies I am scraping have no runtime. How can I make my code skip the movies with no runtime?
source = requests.get('https://www.imdb.com/search/title/?title_type=feature&genres=comedy&explore=genres')
source.raise_for_status()
soup = BeautifulSoup(source.text, 'html.parser')
comedy_movies = soup.find_all('div', class_='lister-item mode-advanced')
for movies in comedy_movies:
    # movie title
    movie_title = movies.find('div', class_='lister-item-content').a.text
    # Parental Advisory
    advisory = movies.find('span', class_='certificate')  # figure out how to single out advisory
    # Movie runtime
    runtime = movies.find('span', class_='runtime')  # figure out how to single out runtime
    # Movie Genre
    genre = movies.find('span', class_='genre').get_text()
    # Movie Rating
    rating = movies.find('span', class_='global-sprite rating-star imdb-rating')  # figure out how to single out ratings
    # MetaScore
    metascore = movies.find('div', class_='inline-block ratings-metascore')  # .span.text same here missing values
    # Movie Description
    description = movies.find('div', class_='lister-item-content').p.text
    print(runtime)
Also, when I try to scrape the descriptions, I am not getting the descriptions but another text with the same tag and class. How can I fix these? I will appreciate it a lot if someone can help. When my code runs, runtime prints as None for some movies.
To avoid the error you can simply check first whether find returned anything that is not None, like:
runtime = movies.find('span', class_='runtime')
if runtime is not None:
    runtime = runtime.text
As for ratings, you want the contents of the <strong> tag next to the span you were finding:
rating = movies.find(
    'span', class_='global-sprite rating-star imdb-rating'
).find_next('strong').text
and for description, you would need to look for the p tag with class="text-muted" after the div with class="ratings-bar":
description = movies.find(
    'div', class_='ratings-bar'
).find_next('p', class_='text-muted').text
although this will find None [and then raise an error] when rating is missing...
You might have noticed by now that some data (description, rating, metascore and title) would need more than one if...is not None check to avoid errors when something returns None, so it might be preferable [especially with nested elements] to use select_one instead. (If you are unfamiliar with CSS selectors, check this for reference.)
Then, you would be able to get metascore as simply as:
metascore = movies.select_one('div.inline-block.ratings-metascore span')
if metascore is not None:
    metascore = metascore.get_text()
In fact, you could define a dictionary with a selector for each piece of information you need and restructure your for-loop to something like
selectorDict = {
    'movie_title': 'div.lister-item-content a',
    'advisory': 'span.certificate',
    'runtime': 'span.runtime',
    'genre': 'span.genre',
    'rating': 'span.global-sprite.rating-star.imdb-rating~strong',
    'metascore': 'div.inline-block.ratings-metascore span',
    'description': 'div.lister-item-content p~p'
    # 'description': 'div.ratings-bar~p.text-muted'
    # ^--misses description when rating is missing
}

movieData = []
for movie in comedy_movies:
    mData = {}
    for k in selectorDict:
        dTag = movie.select_one(selectorDict[k])
        if dTag is not None:
            mData[k] = dTag.get_text(strip=True)
        else:
            mData[k] = None  # OPTIONAL
    movieData.append(mData)
With this, you could easily explore the collected data at once; for example, as a pandas DataFrame with:
# import pandas
pandas.DataFrame(movieData)
[As you might notice in the output below, some cells are blank (because the value is None), but no errors are raised while the for-loop runs.]
index | movie_title | advisory | runtime | genre | rating | metascore | description
0 | Amsterdam | R | 134 min | Comedy, Drama, History | 6.2 | 48 | In the 1930s, three friends witness a murder, are framed for it, and uncover one of the most outrageous plots in American history.
1 | Hocus Pocus 2 | PG | 103 min | Comedy, Family, Fantasy | 6.1 | 55 | Two young women accidentally bring back the Sanderson Sisters to modern day Salem and must figure out how to stop the child-hungry witches from wreaking havoc on the world.
2 | Hocus Pocus | PG | 96 min | Comedy, Family, Fantasy | 6.9 | 43 | A teenage boy named Max and his little sister move to Salem, where he struggles to fit in before awakening a trio of diabolical witches that were executed in the 17th century.
3 | The Super Mario Bros. Movie |  |  | Animation, Adventure, Comedy |  |  | A plumber named Mario travels through an underground labyrinth with his brother, Luigi, trying to save a captured princess. Feature film adaptation of the popular video game.
4 | Bullet Train | R | 127 min | Action, Comedy, Thriller | 7.4 | 49 | Five assassins aboard a swiftly-moving bullet train find out that their missions have something in common.
5 | Spirited | PG-13 | 127 min | Comedy, Family, Musical |  |  | A musical version of Charles Dickens's story of a miserly misanthrope who is taken on a magical journey.
... | ... | ... | ... | ... | ... | ... | ...
47 | Scooby-Doo | PG | 86 min | Adventure, Comedy, Family | 5.2 | 35 | After an acrimonious break up, the Mystery Inc. gang are individually brought to an island resort to investigate strange goings on.
48 | Casper | PG | 100 min | Comedy, Family, Fantasy | 6.1 | 49 | An afterlife therapist and his daughter meet a friendly young ghost when they move into a crumbling mansion in order to rid the premises of wicked spirits.
49 | Ghostbusters | PG | 105 min | Action, Comedy, Fantasy | 7.8 | 71 | Three parapsychologists forced out of their university funding set up shop as a unique ghost removal service in New York City, attracting frightened yet skeptical customers.
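Since everything is collected as text, a possible post-processing step once the frame exists (a sketch; column names as in the frame above):
import pandas as pd

df = pd.DataFrame(movieData)
# Pull the leading number out of strings like '134 min'; missing cells become NaN.
df['runtime_min'] = df['runtime'].str.extract(r'(\d+)', expand=False).astype(float)
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')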

Python University Names and Abbreviations and Weblink

I want to prepare a dataframe of universities, their abbreviations and website links.
My code:
import requests
import pandas as pd

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Present answer:
ValueError: No tables found
Expected answer:
df =
| | university_full_name | uni_abb | uni_url|
---------------------------------------------------------------------
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine|
That's one funky page you have there...
First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.
So you need to bring in the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)

rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those w/ no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple organizations
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))
# and now, at last, to the dataframe
cols = ['abb', 'url', 'full name']
df = pd.DataFrame(rows, columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of columns in the dataframe, if you are so inclined.
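For example, to match the column order from the question's expected output:
# Reorder the columns; the underlying data is unchanged.
df = df[['full name', 'abb', 'url']]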
Select and iterate only the expected <li> elements and extract their information, but be aware there is a university without an <a> (SUI – State University of Iowa), so this should be handled with an if-statement, as in the example:
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('–')[0],
        'full_name': e.text.split('–')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text)

data = []
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('–')[0],
        'full_name': e.text.split('–')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
pd.DataFrame(data)
Output:
  | abb | full_name | url
0 | AECOM | Albert Einstein College of Medicine | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine
1 | AFA | United States Air Force Academy | https://en.wikipedia.org/wiki/United_States_Air_Force_Academy
2 | Annapolis | U.S. Naval Academy | https://en.wikipedia.org/wiki/United_States_Naval_Academy
3 | A&M | Texas A&M University, but also others; see A&M | https://en.wikipedia.org/wiki/Texas_A%26M_University
4 | A&M-CC or A&M-Corpus Christi | Corpus Christi | https://en.wikipedia.org/wiki/Texas_A%26M_University%E2%80%93Corpus_Christi
... | ... | ... | ...
There are no tables on this page, only lists. So the goal is to go through the <ul> and then the <li> tags, skipping the lists you are not interested in (the first and those after the 26th).
You can extract the abbreviation (uni_abb) of the university this way, normalizing the different dash characters first:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
while to get the url and full name you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org/{a['href']}"
Accumulate the extracted information into a list, and finally create the dataframe by assigning appropriate column names.
Here is the complete code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html)

l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org/{a['href']}"))
df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0 Albert Einstein College of Medicine AECOM https://en.wikipedia.org//wiki/Albert_Einstein...
1 United States Air Force Academy AFA https://en.wikipedia.org//wiki/United_States_A...

Create a data frame with headings as column names and <li> tag content as rows, then print this data frame into a text file

I'm trying to get the main body data from this website
I want to get a data frame (or any other object which makes life easier) as output with subheadings as column names and body under the subheading as lines under that column.
My code is below:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_="entry-content")

headings = []
lines = []
my_df = pd.DataFrame(index=range(100))
for strong in article.findAll('strong'):
    if strong.parent.name == 'p':
        if strong.find(text=re.compile("News")):
            headings.append(strong.text)
# headings
k = 0
for ul in article.findAll('ul'):
    for li in ul.findAll('li'):
        lines.append(li.text)
    lines = lines + [""]
    my_df[k] = pd.Series(lines)
    k = k + 1
my_df
I want to use the "headings" list to get the data frame column names.
Clearly I'm not writing the correct logic. I explored nextSibling, descendants and other attributes too, but I can't figure out the correct logic. Can someone please help?
Once you get a headline, use .find_next() to get that headline's news article list. Then add the items to a list stored under the headline key in a dictionary. Finally, use pd.concat() with ignore_index=False:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_="entry-content")

headlines = {}
news_headlines = article.find_all('p', text=re.compile("News"))
for news_headline in news_headlines:
    end_of_news = False
    sub_title = news_headline.find_next('p')
    headlines[news_headline.text] = []
    # print(news_headline.text)
    while end_of_news == False:
        headlines[news_headline.text].append(sub_title.text)
        articles = sub_title.find_next('ul')
        for li in articles.findAll('li'):
            headlines[news_headline.text].append(li.text)
            # print(li.text)
        sub_title = articles.find_next('p')
        if 'News' in sub_title.text or sub_title.text == '':
            end_of_news = True

df_list = []
for headings, lines in headlines.items():
    temp = pd.DataFrame({headings: lines})
    df_list.append(temp)
my_df = pd.concat(df_list, ignore_index=False, axis=1)
Output:
print(my_df)
National News ... Obituaries News
0 1. Cabinet approves 100% FDI under automatic r... ... 11. Eminent Kashmiri Writer Aziz Hajini passes...
1 The Union Cabinet, chaired by Prime Minister N... ... Noted writer and former secretary of Jammu and...
2 A total of 9 structural and 5 process reforms ... ... He has over twenty books in Kashmiri to his cr...
3 Change in the definition of AGR: The definitio... ... 12. Former India player and Mohun Bagan great ...
4 Rationalised Spectrum Usage Charges: The month... ... Former India footballer and Mohun Bagan captai...
5 Four-year Moratorium on dues: Moratorium has b... ... Bhabani Roy helped Mohun Bagan win the Rovers ...
6 Foreign Direct Investment (FDI): The governmen... ... 13. 2 times Olympic Gold Medalist Yuriy Sedykh...
7 Auction calendar fixed: Spectrum auctions will... ... Double Olympic hammer throw gold medallist Yur...
8 Important takeaways for all competitive exams: ... He set the world record for the hammer throw w...
9 Minister of Communications: Ashwini Vaishnaw. ... He won his first gold medal at the 1976 Olympi...
[10 rows x 8 columns]
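The question also asks to print the frame into a text file; pandas can do that directly (the filename is illustrative):
# Write a tab-separated text file; any delimiter works with to_csv.
my_df.to_csv('gk_update.txt', sep='\t', index=False)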

How to use Beautiful Soup find all to scrape only a list that is a part of the body

I am having trouble scraping this Wikipedia list with the neighborhoods of Los Angeles using Beautiful Soup. I am getting all the content of the body and not just the neighborhood list as I would like. I saw a lot about how to scrape a table, but I got stuck on how to apply the table logic in this case.
This is the code I have been using:
import requests
from bs4 import BeautifulSoup
import pandas as pd

address = 'Los Angeles, United States'
url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
neighborhoodList = []
# -- append the data into the list
for row in soup.find_all("div", class_="mw-body")[0].findAll("li"):
    neighborhoodList.append(row.text.replace(', LA', ''))
df_neighborhood = pd.DataFrame({"Neighborhood": neighborhoodList})
If you review the page source, the neighborhood entries are within divs that have a class of "div-col", and each anchor of interest carries a "title" attribute.
Also, the replace on the text during the append doesn't appear to be needed.
The following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

address = 'Los Angeles, United States'
url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
neighborhoodList = []
# -- append the data into the list
links = []
for row in soup.find_all("div", class_="div-col"):
    for item in row.select("a"):
        if item.has_attr('title'):
            neighborhoodList.append(item.text)
df_neighborhood = pd.DataFrame({"Neighborhood": neighborhoodList})

print(f'First 10 Rows:')
print(df_neighborhood.head(n=10))
print(f'\nLast 10 Rows:')
print(df_neighborhood.tail(n=10))
Results:
First 10 Rows:
Neighborhood
0 Angelino Heights
1 Arleta
2 Arlington Heights
3 Arts District
4 Atwater Village
5 Baldwin Hills
6 Baldwin Hills/Crenshaw
7 Baldwin Village
8 Baldwin Vista
9 Beachwood Canyon
Last 10 Rows:
Neighborhood
186 Westwood Village
187 Whitley Heights
188 Wholesale District
189 Wilmington
190 Wilshire Center
191 Wilshire Park
192 Windsor Square
193 Winnetka
194 Woodland Hills
195 Yucca Corridor
