Real Estate Market Scraping using Python and BeautifulSoup
I need a concept for parsing real estate listings with Python. I have read about parsing websites and even did this in VBA, but I would like to do it in Python.
This is the site to be parsed (a single offer for now, but the program should eventually handle the full range of real estate offers across multiple pages of kontrakt.szczecin.pl):
http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149
First of all, the program will use 3 pieces of information:
1/ The table with the main parameters:
Numer oferty 351149, Liczba pokoi 3, Cena 335 000 PLN, Cena za m2 3 350 PLN (offer number, number of rooms, price, price per square meter, etc.). However, the number of parameters depends on the offer: sometimes there are 14, sometimes 12, sometimes 16, and so on.
2/ The description of the property in paragraphs (this is another part of the program and can be skipped for now): Sometimes the table (1/) says there is a garage or a balcony, but the description says the garage costs extra (which for me means the property has no garage) or that the balcony is a French balcony (which for me is no balcony).
My idea is that the program should find the relevant word in the description (such as garage) and copy it together with some surrounding text on the left and right (for instance 20 characters on each side, but what if the word is at the very beginning?).
3/ Additional parameters:
Not every offer has them, but in offers like this one (http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165) there is information about the number of balconies in the property. Sometimes there is information about a basement too. The code should be similar to the one for 1/.
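The "keyword plus surrounding text" idea from point 2/ can be sketched as below. This is only an illustration, not the final program: the function name keyword_contexts, the width parameter and the sample sentence are assumptions made up for the example. The point it shows is that clamping the start index with max(0, ...) answers the "what if the word is at the very beginning?" question, because a negative slice index would otherwise count from the end of the string.

```python
import re


def keyword_contexts(text, keyword, width=20):
    """Return every occurrence of `keyword` with up to `width` characters
    of context on each side (hypothetical helper, lower-cases the text)."""
    lowered = text.lower()
    snippets = []
    for match in re.finditer(re.escape(keyword.lower()), lowered):
        # Clamp to the string bounds so a match near the start or the end
        # does not produce a negative or oversized slice index.
        start = max(0, match.start() - width)
        end = min(len(lowered), match.end() + width)
        snippets.append(lowered[start:end])
    return snippets


# Made-up sample description, only for demonstration.
sample = "Garaż w cenie dodatkowej. Mieszkanie posiada balkon francuski."
print(keyword_contexts(sample, "garaż", width=10))
print(keyword_contexts(sample, "balkon", width=10))
```

Using re.escape matters for keywords such as 'ul.', where the dot would otherwise match any character.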
So I tried something like this, using some internet sources (it is still incomplete):
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165"
# Open a connection and download the page content (urllib)
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing (BeautifulSoup)
page_soup = soup(page_html, "html.parser")  # html.parser -> parse as HTML rather than, e.g., XML
# Grab the table with the offer number, kitchen type and other data about the property
containers = page_soup.findAll("section", {"class": "clearfix", "id": "quick-summary"})
# print(len(containers)) shows how many such objects exist on the page
# There is only one such table on the page, but for the sake of learning I treat it as if there were many
container = containers[0]
find_dt = container.findAll("dt")
find_dd = container.findAll("dd")
print(find_dt[0].text + " " + find_dd[0].text)
It works, but it is still incomplete. I have not continued because there is a major flaw: as you can see, the last print relies on indexes, but not every offer lists its parameters in the same order (as I mentioned, sometimes there are 10 pieces of information, sometimes more, sometimes fewer). That would make a huge mess in the CSV.
My VBA program worked this way:
Copy the table to Excel (Sheet 1).
Sheet 2 held the parameters the program was looking for (such as Price).
The mechanism in short: copy a parameter name from Sheet 2 (Price), go to Sheet 1 (where the parsed information is), find the string "Price", move one line down, copy the price value, go back to Sheet 2, find Price, move one line down and paste the price value. And so on.
I am looking for help with both the concept and the code.
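The Sheet 1 / Sheet 2 lookup described above maps naturally onto a dictionary keyed by parameter name, so the order of parameters on the page stops mattering. The following is a minimal sketch under stated assumptions: the (name, value) pairs stand in for the scraped dt/dd texts, HEADERS is a shortened hypothetical column list, and the "Balkon" value is invented sample data. It uses csv.DictWriter with restval so missing parameters become '-'.

```python
import csv
import io

# Hypothetical, shortened column list; the real one would mirror the site.
HEADERS = ["Numer oferty", "Liczba pokoi", "Cena", "Balkon"]


def offer_to_row(pairs, headers=HEADERS):
    """Turn (name, value) pairs into a dict keyed by parameter name."""
    row = {}
    for name, value in pairs:
        name = name.strip()
        if name in headers:  # ignore parameters we have no column for
            row[name] = value.strip()
    return row


def write_offers(offers, out, headers=HEADERS):
    """Write one CSV row per offer; absent parameters are filled with '-'."""
    writer = csv.DictWriter(out, fieldnames=headers, restval="-")
    writer.writeheader()
    for pairs in offers:
        writer.writerow(offer_to_row(pairs, headers))


# Two offers whose parameters come in a different order and number.
offers = [
    [("Cena", "335 000 PLN"), ("Numer oferty", "351149"), ("Liczba pokoi", "3")],
    [("Numer oferty", "351165"), ("Balkon", "tak")],
]
buffer = io.StringIO()
write_offers(offers, buffer)
print(buffer.getvalue())
```

In a real run, `out` would be a file opened with newline='' instead of the in-memory buffer used here.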
EDIT:
PART 1 and PART 2 are ready, but I have big issues with PART 3. Here is the code:
from urllib import request as uReq
import requests
# Importing exit lets the program stop immediately without running the rest of the code; after the import, exit(0) is enough
from sys import exit
from urllib.request import urlopen as uReq2
from bs4 import BeautifulSoup as soup
import csv
import re
import itertools
filename = 'test.txt'
# counter needed to count the offer numbers in the .txt file
num_lines = 0
# Create the data list and the URL lists. Results will be appended to them, so they must exist up front (as empty lists)
list_of_lines = ['351238', '351237', '111111', '351353']
list_of_lines2 = []
list_of_URLs = []
list_of_redictered_URLs = []
KONTRAKT = 'http://www.kontrakt.szczecin.pl'
with open(filename, 'r') as file:
    for line in file:
        # append the line (offer number) to the list
        list_of_lines.append(line.strip())
        # num_lines counts how many rows the list holds; it matters for the loops that build the URLs
        num_lines += 1
# build the URLs from the offer numbers contained in filename
for i in range(num_lines):
    nr_oferty = list_of_lines[i]
    my_url = "http://www.kontrakt.szczecin.pl/lista-ofert/?f_listingId=" + nr_oferty + "&f=&submit=Szukaj"
    list_of_URLs.append(my_url)
print(list_of_URLs)
# Part two: converting the list of links into a list of redirected links
# The program visits a page that should redirect, but because the redirect uses JavaScript
# the task is harder. The program therefore simulates a browser, downloads the page
# content and then extracts the redirect link from it
i = 0
for i in range(num_lines):
    url_redirect = list_of_URLs[i]
    my_url = url_redirect
    BROWSER = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(my_url, headers=BROWSER)
    script1 = '<script>'
    script2 = '</script>'
    content_URL = str(response.content)
    find_script1 = (content_URL.find(script1))
    find_script2 = (content_URL.find(script2))
    url_ready = content_URL[find_script1:find_script2]
    print(i+1,'z', num_lines, '-', 'oferta nr:', str(my_url[57:57+6]))
    list_of_redictered_URLs.append(url_ready)
# strip unneeded tags and characters to get a clean redirected link
list_of_redictered_URLs = [w.replace('<script>window.location=\\\'','') for w in list_of_redictered_URLs]
list_of_redictered_URLs = [w.replace('\\\';','') for w in list_of_redictered_URLs]
#print(list_of_redictered_URLs)
# remove empty rows from the list (offers that are no longer current enter the list as empty strings). Note: item is just a variable name and can be renamed freely.
filtered_list = list(filter(lambda item: item.strip(), list_of_redictered_URLs))
filtered_list = [KONTRAKT + item for item in filtered_list]
# convert to a tuple, since mutability (adding further links) will not be needed
filtered_list = tuple(filtered_list)
#print(str(filtered_list))
print('Lista linków:\n',filtered_list)
# The next part of the program downloads the relevant information (basic parameters)
# from kontrakt.szczecin.pl and then saves it to a csv file.
# Headers in the csv and parameter names on the site (they must be identical to those on the site
# so that they can be matched correctly in the .csv)
HEADERS = ['Numer oferty',
'Liczba pokoi',
'Cena',
'Cena za m2',
'Powierzchnia',
'Piętro',
'Liczba pięter',
'Typ kuchni',
'Balkon',
'Czynsz administracyjny',
'Rodzaj ogrzewania',
'Umeblowanie',
'Wyposażona kuchnia',
'Gorąca woda',
'Rodzaj budynku',
'Materiał',
'Rok budowy',
'Stan nieruchomości',
'Rynek',
'Dach:',
'Liczba balkonów:',
'Liczba tarasów:',
'Piwnica:',
'Ogród:',
'Ochrona:',
'Garaż:',
'Winda:',
'Kształt działki:',
'Szerokość działki (mb.):',
'Długość działki (mb.):',
'Droga dojazdowa:',
'Gaz:',
'Prąd:',
'Siła:','piwnica',
'komórk',
'strych',
'gospodarcze',
'postojow',
'parking',
'przynależn',
'garaż',
'ogród',
'ogrod',
'działka',
'ocieplony',
'moderniz',
'restaur',
'odnow',
'ociepl',
'remon',
'elew',
'dozór',
'dozor',
'monitoring',
'monit',
'ochron',
'alarm',
'strzeż',
'portier',
'wspólnot',
'spółdziel',
'kuchni',
'aneks',
'widna',
'ciemna',
'prześwit',
'oficyn',
'linia',
'zabudow',
'opłat',
'bezczynsz',
'poziom',
'wind',
'francuski',
'ul.',
'w cenie',
'dodatkową']
LINKI = ["Link"]
#HEADERS2 = ['Liczba balkonów:',
# 'Liczba tarasów:',
# 'Piwnica:',
# 'Ogród:',
# 'Ochrona:',
# 'Garaż:',
# 'Winda:']
HEADERS3 = ['piwnica',
'komórk',
'strych',
'gospodarcze',
'postojow',
'parking',
'przynależn',
'garaż',
'ogród',
'ogrod',
'działka',
'ocieplony',
'moderniz',
'restaur',
'odnow',
'ociepl',
'remon',
'elew',
'dozór',
'dozor',
'monitoring',
'monit',
'ochron',
'alarm',
'strzeż',
'portier',
'wspólnot',
'spółdziel',
'kuchni',
'aneks',
'widna',
'ciemna',
'prześwit',
'oficyn',
'linia',
'zabudow',
'opłat',
'bezczynsz',
'poziom',
'wind',
'francuski',
'ul.',
'w cenie',
'dodatkową',]
csv_name = 'data.csv'
print('Dane zostaną zapisane do pliku:',csv_name + '.csv')
print('\n>>>>Program rozpoczyna pobieranie danych')
# Fetching the links
i = 0
# Creates the csv file
# writerow takes only one argument, so it gets the concatenation of the individual lists.
# LINKI is a one-item list because only values of the same type can be concatenated; a list cannot be concatenated with a string.
with open(csv_name + '.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    HEADERS_ALL = HEADERS+HEADERS3+LINKI
    csvwriter.writerow(HEADERS_ALL)
    for i in range(len(filtered_list)):
        my_url = filtered_list[i]
        with uReq2(my_url) as uClient:
            page_soup = soup(uClient.read(), 'lxml')
        print('\t\t-----------',i+1,'-----------\n',my_url)
        # <dt> - parameter name, e.g. Kuchnia
        # <dd> - parameter value, e.g. widna
        row = ['-'] * len(HEADERS) + ['-'] * len(HEADERS3) + ['-'] * len(LINKI)
        # Basic parameters (kontrakt.szczecin.pl)
        for dt, dd in zip(page_soup.select('section#quick-summary dt'), page_soup.select('section#quick-summary dd')):
            if dt.text.strip() not in HEADERS:
                print("\n 1(dt,dd):UWAGA!, kolumna [{}] nie istnieje w nagłówkach! (stała: HEADERS)\n".format(dt.text.strip()))
                continue
            row[HEADERS.index(dt.text.strip())] = dd.text.strip()
        # Additional parameters
        for span, li in zip(page_soup.select('section#property-features span'), page_soup.select('section#property-features li')):
            if span.text.strip() not in HEADERS:
                print("\n 2:UWAGA(span,li), kolumna [{}] nie istnieje w nagłówkach (stała HEADERS)!\n".format(span.text.strip()))
                continue
            row[HEADERS.index(span.text.strip())] = li.text.strip()
        #csvwriter.writerow(row)
        print(row)
        # Here the fun begins...
        # the variable j is the index into HEADERS3; it is j rather than i because i is still in use
        # in the loop above
        for p in page_soup.select('section#description'):
            p = str(p)
            p = p.lower()
            for j in range(len(HEADERS3)):
                #print('j:',j)
                # find_p finds all keywords from HEADERS3 in the description paragraph on the offer page.
                find_p = re.findall(HEADERS3[j],p)
                # the lists holding the start positions of the matches must start with '-' or 0?,
                # because when a word is not found the lists are empty in the first loop iteration,
                # which leads to an out-of-range error
                m_start = []
                m_end = []
                lista_j = []
                for m in re.finditer(HEADERS3[j], p):
                    #print((m.start(),m.end()), m.group())
                    m_start.append(m.start())
                    m_end.append(m.end())
                #print(h)
                for k in range(len(m_start)):
                    # actually I no longer know what this is for..
                    try:
                        x = m_start[k]
                        y = m_end[k]
                    except IndexError:
                        x = m_start[0]
                        y = m_end[0]
                    #print('xy:',x,y)
                    #print(find_p)
                    #print(HEADERS3[j])
                    z = (HEADERS3[j]+':',p[-60+x:y+60]+' ++-NNN-++')
                    lista_j.append(z)
                print (lista_j)
                print(str(lista_j))
                row[HEADERS.index(span.text.strip())] = str(lista_j)
        csvwriter.writerow(row)
        #print(row)
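Part two of the script above cuts the redirect target out of the page by string positions and then removes the leftovers with replace. A regular expression can do both steps at once. This is a sketch, assuming the search page body contains a fragment of the form <script>window.location='/relative/path';</script> (inferred from the replace calls above, not verified against the live site); extract_redirect and REDIRECT_RE are hypothetical names.

```python
import re

# Captures the quoted target of a JavaScript redirect such as
#   <script>window.location='/some/path';</script>
REDIRECT_RE = re.compile(r"window\.location\s*=\s*['\"]([^'\"]+)['\"]")


def extract_redirect(html, base="http://www.kontrakt.szczecin.pl"):
    """Return the absolute redirect URL found in `html`, or None when the
    page holds no redirect (for example an offer that is no longer listed)."""
    match = REDIRECT_RE.search(html)
    if match is None:
        return None
    return base + match.group(1)


# Made-up page bodies for demonstration only.
pages = [
    "<html><script>window.location='/mieszkanie-sprzedaz,351238';</script></html>",
    "<html><body>Brak ofert</body></html>",
]
links = [url for url in (extract_redirect(page) for page in pages) if url]
print(links)
```

Passing response.text rather than str(response.content) avoids the escaped backslashes that the original replace calls had to account for.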
This code snippet parses the quick summary table of the property URL and saves it to a CSV file:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
# my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165'
my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149'
with uReq(my_url) as uClient:
    page_soup = soup(uClient.read(), 'lxml')
with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    for dt, dd in zip(page_soup.select('section#quick-summary dt'), page_soup.select('section#quick-summary dd')):
        csvwriter.writerow([dt.text.strip(), dd.text.strip()])
The result is in data.csv, screenshot from my LibreOffice:
To have the table transposed, you can use this code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
# my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165'
my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149'
with uReq(my_url) as uClient:
    page_soup = soup(uClient.read(), 'lxml')
headers = ['Numer oferty',
'Liczba pokoi',
'Cena',
'Cena za m2',
'Powierzchnia',
'Piętro',
'Liczba pięter',
'Typ kuchni',
'Balkon',
'Czynsz administracyjny',
'Rodzaj ogrzewania',
'Gorąca woda',
'Rodzaj budynku',
'Materiał',
'Rok budowy',
'Stan nieruchomości',
'Rynek',
'Dach:',
'Liczba balkonów:',
'Piwnica:',
'Kształt działki:',
'Szerokość działki (mb.):',
'Długość działki (mb.):',
'Droga dojazdowa:',
'Gaz:',
'Prąd:',
'Siła:']
with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    csvwriter.writerow(headers)
    row = ['-'] * len(headers)
    for dt, dd in zip(page_soup.select('section#quick-summary dt'), page_soup.select('section#quick-summary dd')):
        if dt.text.strip() not in headers:
            print("Warning, column [{}] doesn't exist in headers!".format(dt.text.strip()))
            continue
        row[headers.index(dt.text.strip())] = dd.text.strip()
    csvwriter.writerow(row)
The result in the CSV file will look like this (values that are not present are replaced with '-'):
Related
Extract content from <script> when scraping with BS4
I'm trying to extract information from a "script" tag. The code is as follows:
response = requests.get("https://www.zalando.es/jordan-air-jordan-mid-zapatillas-altas-blackdark-beetrootwhitehyper-royal-joc11a024-g11.html?hl=1610800800024", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
marca = soup.find("h3", {"class":"OEhtt9 ka2E9k uMhVZi uc9Eq5 pVrzNP _5Yd-hZ"}).text
nombre = soup.find("h1", {"class":"OEhtt9 ka2E9k uMhVZi z-oVg8 pVrzNP w5w9i_ _1PY7tW _9YcI4f"}).text
color = soup.find("span", {"class":"u-6V88 ka2E9k uMhVZi dgII7d z-oVg8 pVrzNP"}).text
precio = soup.find("span", {"class":"uqkIZw ka2E9k uMhVZi FxZV-M z-oVg8 pVrzNP"}).text
talla = soup.find("span", {"class":"u-6V88 ka2E9k uMhVZi FxZV-M z-oVg8 pVrzNP"}).text
imagen = soup.find("img", {"class": "_6uf91T z-oVg8 u-6V88 ka2E9k uMhVZi FxZV-M _2Pvyxl JT3_zV EKabf7 mo6ZnF _1RurXL mo6ZnF PZ5eVw"})['src']
sku355 = api + str(soup.find_all('script')[15]).split('sku":"')[3][:-137]
sku36 = api + str(soup.find_all('script')[15]).split('sku":"')[4][:-139]
sku365 = api + str(soup.find_all('script')[15]).split('sku":"')[5][:-139]
sku375 = api + str(soup.find_all('script')[15]).split('sku":"')[6][:-137]
sku38 = api + str(soup.find_all('script')[15]).split('sku":"')[7][:-139]
sku385 = api + str(soup.find_all('script')[15]).split('sku":"')[8][:-137]
sku39 = api + str(soup.find_all('script')[15]).split('sku":"')[9][:-137]
sku40 = api + str(soup.find_all('script')[15]).split('sku":"')[10][:-139]
sku405 = api + str(soup.find_all('script')[15]).split('sku":"')[11][:-137]
sku41 = api + str(soup.find_all('script')[15]).split('sku":"')[12][:-137]
sku42 = api + str(soup.find_all('script')[15]).split('sku":"')[13][:-139]
sku425 = api + str(soup.find_all('script')[15]).split('sku":"')[14][:-137]
sku43 = api + str(soup.find_all('script')[15]).split('sku":"')[15][:-125]
print (sku355)
print (sku36)
print (sku365)
print (sku375)
print (sku38)
print (sku385)
print (sku39)
print (sku40)
print (sku405)
print (sku41)
print (sku42)
print (sku425)
print (sku43)
Everything works perfectly with these shoes, but when I switch to, for example, the link below, it gives me something else. What I would like to extract is the SKU of each size, regardless of the link:
https://www.zalando.es/nike-sportswear-air-force-1-gtx-unisex-zapatillas-anthraciteblackbarely-grey-ni115o01u-q11.html
I could not reproduce your example; it would help to improve your question. In case you just want to grab the sizes, try the following:
import requests, json
from bs4 import BeautifulSoup
headers = {"user-agent": "Mozilla/5.0"}
response = requests.get("https://www.zalando.es/jordan-air-jordan-mid-zapatillas-altas-blackdark-beetrootwhitehyper-royal-joc11a024-g11.html?hl=1610800800024", headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
json_object = json.loads(soup.select_one('script#z-vegas-pdp-props').contents[0].split('CDATA')[1].split(']>')[0])
for item in json_object[0]['model']['articleInfo']['units']:
    print('sku:{0} - size:{1}'.format(item['id'],item['size']['local']))
Output
sku:JOC11A024-G110005000 - size:35.5
sku:JOC11A024-G110055000 - size:36
sku:JOC11A024-G110006000 - size:36.5
sku:JOC11A024-G110065000 - size:37.5
sku:JOC11A024-G110007000 - size:38
sku:JOC11A024-G110075000 - size:38.5
sku:JOC11A024-G110008000 - size:39
sku:JOC11A024-G110085000 - size:40
sku:JOC11A024-G110009000 - size:40.5
sku:JOC11A024-G110095000 - size:41
sku:JOC11A024-G110010000 - size:42
sku:JOC11A024-G110105000 - size:42.5
sku:JOC11A024-G110011000 - size:43
How can I clean up the response for this script to make it more readable?
How can I turn the output of this script into a neater format like CSV? When I save the response to a text file it is formatted badly. I tried using writer.writerow, but I could not get this method to account for variables.
import requests
from bs4 import BeautifulSoup
url = "https://www.rockauto.com/en/catalog/ford,2015,f-150,3.5l+v6+turbocharged,3308773,brake+&+wheel+hub,brake+pad,1684"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
meta_tag = soup.find('meta', attrs={'name': 'keywords'})
category = meta_tag['content']
linecodes = []
partnos = []
descriptions = []
infos = []
for tbody in soup.select('tbody[id^="listingcontainer"]'):
    tmp = tbody.find('span', class_='listing-final-manufacturer')
    linecodes.append(tmp.text if tmp else '-')
    tmp = tbody.find('span', class_='listing-final-partnumber as-link-if-js buyers-guide-color')
    partnos.append(tmp.text if tmp else '-')
    tmp = tbody.find('span', class_='span-link-underline-remover')
    descriptions.append(tmp.text if tmp else '-')
    tmp = tbody.find('div', class_='listing-text-row')
    infos.append(tmp.text if tmp else '-')
for row in zip(linecodes,partnos,infos,descriptions):
    result = category + ' | {:<20} | {:<20} | {:<80} | {:<80}'.format(*row)
    with open('complete.txt', 'a+') as f:
        f.write(result + '/n')
    print(result)
You could put it into a pandas DataFrame. Remove the last for-loop from the original code.
# imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
# set pandas display options to display more rows and columns
pd.set_option('display.max_columns', 700)
pd.set_option('display.max_rows', 400)
pd.set_option('display.min_rows', 10)
# your code
url = "https://www.rockauto.com/en/catalog/ford,2015,f-150,3.5l+v6+turbocharged,3308773,brake+&+wheel+hub,brake+pad,1684"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
meta_tag = soup.find('meta', attrs={'name': 'keywords'})
category = meta_tag['content']
linecodes = []
partnos = []
descriptions = []
infos = []
for tbody in soup.select('tbody[id^="listingcontainer"]'):
    tmp = tbody.find('span', class_='listing-final-manufacturer')
    linecodes.append(tmp.text if tmp else '-')
    tmp = tbody.find('span', class_='listing-final-partnumber as-link-if-js buyers-guide-color')
    partnos.append(tmp.text if tmp else '-')
    tmp = tbody.find('span', class_='span-link-underline-remover')
    descriptions.append(tmp.text if tmp else '-')
    tmp = tbody.find('div', class_='listing-text-row')
    infos.append(tmp.text if tmp else '-')
Added code for the DataFrame:
# create dataframe
df = pd.DataFrame(zip(linecodes,partnos,infos,descriptions), columns=['codes', 'parts', 'info', 'desc'])
# add the category column
df['category'] = category
# break the category column into multiple columns if desired
# skip the last 2 columns, because they are empty
df[['cat_desc', 'brand', 'model', 'engine', 'cat_part']] = df.category.str.split(',', expand=True).iloc[:, :-2]
# drop the unneeded category column
df.drop(columns='category', inplace=True)
# save to csv
df.to_csv('complete.txt', index=False)
# display(df)
   codes             parts       info                           desc                                 cat_desc                   brand  model  engine                cat_part
0  CENTRIC           30016020    Rear; w/ Manual parking brake  Semi-Metallic; w/Shims and Hardware  2015 FORD F-150 Brake Pad  FORD   F-150  3.5L V6 Turbocharged  Brake Pad
1  CENTRIC           30116020    Rear; w/ Manual parking brake  Ceramic; w/Shims and Hardware        2015 FORD F-150 Brake Pad  FORD   F-150  3.5L V6 Turbocharged  Brake Pad
2  DYNAMIC FRICTION  1551160200  Rear; Manual Parking Brake     5000 Advanced; Ceramic               2015 FORD F-150 Brake Pad  FORD   F-150  3.5L V6 Turbocharged  Brake Pad
re-iterate over and over rather than once in soup
I keep re-iterating over this code. I'm keen to scrape all past results data from this site, yet I keep looping over the results one by one. For example, the race_number printed goes 1, then 1,2, then 1,2,3, and so on. The end goal is to fill all the lists with data and load them into pandas to look at results and trends.
import requests
import csv
import os
import numpy
import pandas
from bs4 import BeautifulSoup as bs
with requests.Session() as s:
    webpage_response = s.get('http://www.harness.org.au/racing/fields/race-fields/?mc=SW010420')
    soup = bs(webpage_response.content, "html.parser")
    #soup1 = soup.select('.content')
    results = soup.find_all('div', {'class':'forPrint'})
race_number = []
race_name = []
race_title = []
race_distance = []
place = []
horse_name = []
Prizemoney = []
Row = []
horse_number = []
Trainer = []
Driver = []
Margin = []
Starting_odds = []
Stewards_comments = []
Scratching = []
Track_Rating = []
Gross_Time = []
Mile_Rate = []
Lead_Time = []
First_Quarter = []
Second_Quarter = []
Third_Quarter = []
Fourth_Quarter = []
for race in results:
    race_number1 = race.find(class_='raceNumber').get_text()
    race_number.append(race_number1)
    race_name1 = race.find(class_='raceTitle').get_text()
    race_name.append(race_name1)
    race_title1 = race.find(class_='raceInformation').get_text(strip=True)
    race_title.append(race_title1)
    race_distance1 = race.find(class_='distance').get_text()
    race_distance.append(race_distance1)
I need help fixing the repeated iteration. Also, what is the best next move to get at the table data rather than the headers above? Cheers
Is this the output you are expecting?
import requests
import csv
import os
import numpy
import pandas as pd
import html
from bs4 import BeautifulSoup as bs
with requests.Session() as s:
    webpage_response = s.get('http://www.harness.org.au/racing/fields/race-fields/?mc=SW010420')
    soup = bs(webpage_response.content, "html.parser")
    #soup1 = soup.select('.content')
    data = {}
    data["raceNumber"] = [ i['rowspan'] for i in soup.find_all("td", {"class": "raceNumber", "rowspan": True})]
    data["raceTitle"] = [ i.get_text(strip=True) for i in soup.find_all("td", {"class": "raceTitle"})]
    data["raceInformation"] = [ i.get_text(strip=True) for i in soup.find_all("td", {"class": "raceInformation"})]
    data["distance"] = [ i.get_text(strip=True) for i in soup.find_all("td", {"class": "distance"})]
    print(data)
    data_frame = pd.DataFrame(data)
    print(data_frame)
## Output
##   raceNumber  raceTitle  raceInformation  distance
##0  3  PREMIX KING PACE  $4,500\n\t\t\t\t\t4YO and older.\n\t\t\t\t\tNR...  1785M
##1  3  GATEWAY SECURITY PACE  $7,000\n\t\t\t\t\t4YO and older.\n\t\t\t\t\tNR...  2180M
##2  3  PERRY'S FOOTWEAR TROT  $7,000\n\t\t\t\t\t\n\t\t\t\t\tNR 46 to 55.\n\t...  2180M
##3  3  DELAHUNTY PLUMBING 3YO TROT  $7,000\n\t\t\t\t\t3YO.\n\t\t\t\t\tNR 46 to 52....  2180M
##4  3  RAYNER'S FRUIT & VEGETABLES 3YO PACE  $7,000\n\t\t\t\t\t3YO.\n\t\t\t\t\tNR 48 to 56....  2180M
##5  3  KAYE MATTHEWS TRIBUTE  $9,000\n\t\t\t\t\t4YO and older.\n\t\t\t\t\tNR...  2180M
##6  3  TALQUIST TREES PACE  $7,000\n\t\t\t\t\t\n\t\t\t\t\tNR 62 to 73.\n\t...  2180M
##7  3  WEEKLY ADVERTISER 3WM PACE  $7,000\n\t\t\t\t\t\n\t\t\t\t\tNR 56 to 61.\n\t...  1785M
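The raceInformation column in the output above still carries the newlines and tabs from the page markup. One possible cleanup, sketched here as an assumption rather than as part of the answer, is to collapse all whitespace runs before building the DataFrame; normalize_whitespace is a hypothetical helper name and the sample string is a shortened piece of the output shown above.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no argument splits on any whitespace and drops
    # empty pieces, so joining with one space removes the layout debris.
    return " ".join(text.split())


raw = "$4,500\n\t\t\t\t\t4YO and older."
print(normalize_whitespace(raw))
```

The helper could be applied inside the list comprehension, for example normalize_whitespace(i.get_text()) for each matching cell.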
Webscraping table on different pages
How can I scrape, with Python, the same table when it extends over several pages? I'm able to do it, but it stops at the first page. Here's an example: https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1
This is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
webpage = ureq(my_link).read()
htmlpage = soup(webpage , 'html.parser')
containers = htmlpage.findAll("td", {"class":"u-hidden -xs"})
filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)
for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()
    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")
f.close()
(There's no way to show all the data on one page.)
EDIT: I also solved it by doing this:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
my_link2 = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=2"
webpage = ureq(my_link).read()
webpage2 = ureq(my_link2).read()
htmlpage = soup(webpage , 'html.parser')
htmlpage2 = soup(webpage2, 'html.parser')
containers = htmlpage.findAll("td", {"class":"u-hidden -xs"}) + htmlpage2.findAll("td", {"class":"u-hidden -xs"})
filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)
for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()
    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")
f.close()
But if the table were 20 pages long I can't imagine doing it this way, which is why I'm looking for something 'smarter'.
One possibility is to find the link to the next page, a[title="Next"] in this case. If the link doesn't exist, you are on the last page:
import requests
from bs4 import BeautifulSoup
url = 'https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
from textwrap import shorten
page = 1
while True:
    print()
    print('Page no. {}'.format(page))
    print('-' * 80)
    for tr in soup.select('tr'):
        for td in tr.select('td')[1:]:
            txt = td.get_text(strip=True, separator=' ')
            print('{: >25}'.format(shorten(txt, 25)), end='')
        print()
    m = soup.select_one('a[title="Next"][href]')
    if m:
        url = 'https://www.borsaitaliana.it' + m['href']
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        page += 1
    else:
        break
Prints:
Page no. 1
--------------------------------------------------------------------------------
A2a 1.5675 +1.33 17:35:32 1.555 Close
Amplifon 22.30 +1.27 17:35:39 22.00 Close
Atlantia 22.92 +0.26 17:41:55 22.94 Close
Azimut Holding 15.595 +1.93 17:35:48 15.285 Close
Banco Bpm 1.685 +4.04 17:35:58 1.63 Close
Bper Banca 3.078 +2.19 17:35:03 3.022 Close
Buzzi Unicem 18.41 +0.60 17:35:13 18.445 Close
Campari 7.84 +0.71 17:35:03 7.85 Close
Cnh Industrial 7.956 +1.69 17:35:29 7.80 Close
Diasorin 106.00 +1.83 17:35:53 104.10 Close
Enel 6.285 +4.59 17:35:58 6.064 Close
Eni 13.04 -0.47 17:39:49 12.972 Close
Exor 57.16 -1.21 17:35:00 58.02 Close
Ferrari 140.05 -0.11 17:37:09 141.20 Close
Fiat Chrysler Automobiles 11.054 -2.71 17:37:07 11.232 Close
Finecobank 8.656 +1.67 17:35:49 8.67 Close
Generali 15.98 +0.38 17:40:02 15.93 Close
Hera 3.466 +2.79 17:35:06 3.396 Close
Intesa Sanpaolo 1.882 +1.97 17:41:21 1.856 Close
Italgas 5.674 +0.32 17:35:41 5.70 Close
Page no. 2
--------------------------------------------------------------------------------
Juventus Football Club 1.46 +2.21 17:35:42 1.43 Close
Leonardo 10.095 +2.91 17:35:59 9.81 Close
Mediobanca 8.508 +2.14 17:35:33 8.332 Close
Moncler 33.86 -0.85 17:35:25 33.86 Close
Nexi 9.80 +0.00 17:35:04 9.79 Close
Pirelli & C 4.516 -1.07 17:35:24 4.50 Close
Poste Italiane 9.234 +0.98 17:35:24 9.18 Close
Prysmian 17.725 +0.25 17:35:59 17.70 Close
Recordati 38.80 +1.57 17:35:02 38.74 Close
Saipem 4.022 +2.55 17:35:04 3.932 Close
Salvatore Ferragamo 17.145 -1.89 17:35:19 17.425 Close
Snam 4.487 +2.30 17:35:53 4.391 Close
Stmicroelectronics 15.805 +2.10 17:35:48 15.62 Close
Telecom Italia 0.4451 +0.75 17:35:31 0.4438 Close
Tenaris 9.484 +0.51 17:35:49 9.40 Close
Terna - Rete [...] 5.432 +2.22 17:35:55 5.362 Close
Ubi Banca 2.217 +5.62 17:38:45 2.105 Close
Unicredit 9.531 +3.71 17:39:39 9.27 Close
Unipol 4.313 +1.67 17:35:41 4.277 Close
Unipolsai 2.208 -0.32 17:35:03 2.221 Close
After you have gone over each <tr> tag on the page, you need to go to the next page using the href. It looks like it's "/borsa/azioni/ftse-mib/lista.html?lang=en&page=2", in which case you can just iterate over the page= value to move to the next page. If you post some code, we can help you a bit more :)
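Iterating over the page= value can be sketched with the standard library alone. The helper name page_url and the page count of 2 are assumptions for illustration; the real number of pages would have to come from the site, for instance by stopping when a page yields no rows.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return `base_url` with its `page` query parameter set to `page`,
    adding the parameter if it is missing (hypothetical helper)."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
# Assumed page count, for demonstration only.
for number in range(1, 3):
    print(page_url(base, number))
```

Each generated URL can then be fetched and parsed exactly like the single page in the question, with the rows appended to one shared list.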
What can I do to scrape 10000 pages without captchas appearing?
Hi there, I've been trying to collect all the information from the 10,000 pages of this site for a school project. I thought everything was fine until I got an error on page 4. I checked the page manually and found that it now asks me for a captcha. What can I do to avoid it? Maybe set a timer between the requests? Here is my code.
import bs4, requests, csv
g_page = requests.get("http://www.usbizs.com/NY/New_York.html")
m_page = bs4.BeautifulSoup(g_page.text, "lxml")
get_Pnum = m_page.select('div[class="pageNav"]')
MAX_PAGE = int(get_Pnum[0].text[9:16])
print("Recolectando información de la página 1 de {}.".format(MAX_PAGE))
contador = 0
information_list = []
for k in range(1, MAX_PAGE):
    c_items = m_page.select('div[itemtype="http://schema.org/Corporation"] a')
    c_links = []
    i = 0
    for link in c_items:
        c_links.append(link.get("href"))
        i+=1
    for j in range(len(c_links)):
        temp = []
        s_page = requests.get(c_links[j])
        i_page = bs4.BeautifulSoup(s_page.text, "lxml")
        print("Ingresando a: {}".format(c_links[j]))
        info_t = i_page.select('div[class="infolist"]')
        info_1 = info_t[0].text
        info_2 = info_t[1].text
        temp = [info_1,info_2]
        information_list.append(temp)
        contador+=1
    with open ("list_information.cv", "w") as file:
        writer=csv.writer(file)
        for row in information_list:
            writer.writerow(row)
    print("Información de {} clientes recolectada y guardada correctamente.".format(j+1))
    g_page = requests.get("http://www.usbizs.com/NY/New_York-{}.html".format(k+1))
    m_page = bs4.BeautifulSoup(g_page.text, "lxml")
    print("Recolectando información de la página {} de {}.".format(k+1,MAX_PAGE))
print("Programa finalizado. Información recolectada de {} clientes.".format(contador))
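The "timer between the requests" idea from the question can be sketched as a small wrapper that pauses for a random interval before every request except the first. This is an assumption-laden illustration: fetch_politely, the delay bounds and the injected fetch and sleep callables are hypothetical, and slowing down does not guarantee that the site will stop showing captchas. It only reduces the request rate.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call `fetch(url)` for every URL, sleeping a random time between
    `min_delay` and `max_delay` seconds before each request but the first."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomised pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demonstration with a fake fetch function and no real waiting.
demo_urls = [
    "http://www.usbizs.com/NY/New_York.html",
    "http://www.usbizs.com/NY/New_York-2.html",
]
pages = fetch_politely(demo_urls, fetch=lambda url: "<html>" + url + "</html>",
                       sleep=lambda seconds: None)
print(len(pages))
```

In the script above, fetch could be something like lambda url: requests.get(url).text, with the default time.sleep left in place; checking the site's terms of use and robots.txt before collecting 10,000 pages is also worth doing.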