Webscraping table on different pages

How can I scrape, with Python, a table that extends across several pages? I can scrape it, but my code stops at the first page.
Here's an example: https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1
This is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
webpage = ureq(my_link).read()
htmlpage = soup(webpage , 'html.parser')
containers = htmlpage.findAll("td", {"class":"u-hidden -xs"})
filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)
for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()
    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")
f.close()
(There's no way to show all the data on one page.)
EDIT:
I also got it working like this:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
my_link2 = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=2"
webpage = ureq(my_link).read()
webpage2 = ureq(my_link2).read()
htmlpage = soup(webpage , 'html.parser')
htmlpage2 = soup(webpage2, 'html.parser')
containers = htmlpage.findAll("td", {"class":"u-hidden -xs"}) + htmlpage2.findAll("td", {"class":"u-hidden -xs"})
filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)
for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()
    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")
f.close()
But if the table were 20 pages long, I can't imagine doing it this way, which is why I'm looking for something smarter.

One possibility is to find the link to the next page, a[title="Next"] in this case. If the link doesn't exist, you are on the last page:
import requests
from bs4 import BeautifulSoup
from textwrap import shorten
url = 'https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
page = 1
while True:
    print()
    print('Page no. {}'.format(page))
    print('-' * 80)
    for tr in soup.select('tr'):
        for td in tr.select('td')[1:]:
            txt = td.get_text(strip=True, separator=' ')
            print('{: >25}'.format(shorten(txt, 25)), end='')
        print()
    m = soup.select_one('a[title="Next"][href]')
    if m:
        url = 'https://www.borsaitaliana.it' + m['href']
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        page += 1
    else:
        break
Prints:
Page no. 1
--------------------------------------------------------------------------------
A2a 1.5675 +1.33 17:35:32 1.555 Close
Amplifon 22.30 +1.27 17:35:39 22.00 Close
Atlantia 22.92 +0.26 17:41:55 22.94 Close
Azimut Holding 15.595 +1.93 17:35:48 15.285 Close
Banco Bpm 1.685 +4.04 17:35:58 1.63 Close
Bper Banca 3.078 +2.19 17:35:03 3.022 Close
Buzzi Unicem 18.41 +0.60 17:35:13 18.445 Close
Campari 7.84 +0.71 17:35:03 7.85 Close
Cnh Industrial 7.956 +1.69 17:35:29 7.80 Close
Diasorin 106.00 +1.83 17:35:53 104.10 Close
Enel 6.285 +4.59 17:35:58 6.064 Close
Eni 13.04 -0.47 17:39:49 12.972 Close
Exor 57.16 -1.21 17:35:00 58.02 Close
Ferrari 140.05 -0.11 17:37:09 141.20 Close
Fiat Chrysler Automobiles 11.054 -2.71 17:37:07 11.232 Close
Finecobank 8.656 +1.67 17:35:49 8.67 Close
Generali 15.98 +0.38 17:40:02 15.93 Close
Hera 3.466 +2.79 17:35:06 3.396 Close
Intesa Sanpaolo 1.882 +1.97 17:41:21 1.856 Close
Italgas 5.674 +0.32 17:35:41 5.70 Close
Page no. 2
--------------------------------------------------------------------------------
Juventus Football Club 1.46 +2.21 17:35:42 1.43 Close
Leonardo 10.095 +2.91 17:35:59 9.81 Close
Mediobanca 8.508 +2.14 17:35:33 8.332 Close
Moncler 33.86 -0.85 17:35:25 33.86 Close
Nexi 9.80 +0.00 17:35:04 9.79 Close
Pirelli & C 4.516 -1.07 17:35:24 4.50 Close
Poste Italiane 9.234 +0.98 17:35:24 9.18 Close
Prysmian 17.725 +0.25 17:35:59 17.70 Close
Recordati 38.80 +1.57 17:35:02 38.74 Close
Saipem 4.022 +2.55 17:35:04 3.932 Close
Salvatore Ferragamo 17.145 -1.89 17:35:19 17.425 Close
Snam 4.487 +2.30 17:35:53 4.391 Close
Stmicroelectronics 15.805 +2.10 17:35:48 15.62 Close
Telecom Italia 0.4451 +0.75 17:35:31 0.4438 Close
Tenaris 9.484 +0.51 17:35:49 9.40 Close
Terna - Rete [...] 5.432 +2.22 17:35:55 5.362 Close
Ubi Banca 2.217 +5.62 17:38:45 2.105 Close
Unicredit 9.531 +3.71 17:39:39 9.27 Close
Unipol 4.313 +1.67 17:35:41 4.277 Close
Unipolsai 2.208 -0.32 17:35:03 2.221 Close

After you have gone over each <tr> tag on the page, you need to go to the next page using its href. It looks like the href is "/borsa/azioni/ftse-mib/lista.html?lang=en&page=2", in which case you can simply increment the page= parameter to move to the next page.
If you post some code, we can help you a bit more :)
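Iterating over the page= parameter could be sketched roughly as follows. This is only a minimal sketch: `iter_pages` and `fetch_rows` are hypothetical names, and the stopping rule (an empty page means the end of the table) is an assumption that should be checked against the real site. The fetcher is passed in as a function so the loop can be tried without network access.

```python
def iter_pages(fetch_rows, start=1, max_pages=50):
    """Yield (page_number, rows) until a page comes back empty.

    fetch_rows is any callable that takes a page number and returns a list
    of rows; max_pages is a safety cap so a misbehaving site cannot make
    the loop run forever.
    """
    for page in range(start, start + max_pages):
        rows = fetch_rows(page)
        if not rows:  # assumption: an empty page means we ran past the last one
            break
        yield page, rows


# Stand-in for a real fetcher: two pages of data, then nothing.
fake_site = {1: [["A2a", "1.5675"]], 2: [["Unipol", "4.313"]]}
collected = []
for page, rows in iter_pages(lambda n: fake_site.get(n, [])):
    collected.extend(rows)
print(collected)
```

With requests and BeautifulSoup, `fetch_rows` could call `requests.get(url, params={'lang': 'en', 'page': n})` and return the cell texts of each `tr` on that page.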

Related

How to scrape a table but 'not a table' from a page, using python?

Greetings, and welcome to anyone willing to spend time here. I am a very green student of data science and Python, and I am hoping to gain insight from people with a deeper understanding of Python.
As we can see, the value for each row can be found easily by inspecting the page. But all the values seem to use the same class name. So far, I haven't even been able to find the right keyword to search Google for a working method.
This is the code that I've tried. It doesn't work and it's embarrassing, but I have to show it anyway. I've tried fiddling with it by adding .content, .text, find, and find_all, but I understand that my mistake lies at a more fundamental level.
from bs4 import BeautifulSoup
import requests
from csv import writer
import pandas as pd
url= 'https://m4.mobilelegends.com/stats'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
lists = soup.find('div', class_="m4-team-stats-scroll")
with open('m4stats_team.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Team', 'Win Rate', 'Average KDA', 'Average Kills', 'average Deaths', 'Average Assists', 'Average Game Time', 'Average Lord Kills', 'Average Tortoise Kills', 'Average Towers Destroy', 'First Blood Rate', 'Hero Pool']
    thewriter.writerow(header)
    for list in lists:
        team = list.find_all('p', class_="h3 pl-5 whitespace-nowrap hidden xl:block")
        awr = list.find_all('p', class_="h4")
        akda = list.find('p', class_="h4").text
        akill = list.find('p', class_="h4").text
        adeath = list.find('p', class_="h4").text
        aassist = list.find('p', class_="h4").text
        atime = list.find('p', class_="h4").text
        aalord = list.find('p', class_="h4").text
        atortoise = list.find('p', class_="h4").text
        atower = list.find('p', class_="h4").text
        firstblood = list.find('p', class_="h4").text
        hrpool = list.find('p', class_="h4").text
        info = [team, awr, akda, akill, adeath, aassist, atime, aalord, atortoise, atower, firstblood, hrpool]
        thewriter.writerow(info)
pd.read_csv('m4stats_team.csv').head()
What I am expecting:
Any kind of insight. Whether it's a clue, a keyword, or a code snippet, I would appreciate and be thankful for any kind of guidance. I am not asking for the complete scraped CSV, as I could have built it manually. At this point I want to be able to do basic web scraping myself.

You can iterate over rows in the table and its items.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://m4.mobilelegends.com/stats')
page.raise_for_status()
# name the parser explicitly so BeautifulSoup does not have to guess one
page = BeautifulSoup(page.content, "html.parser")
table = page.find("div", class_="m4-team-stats-scroll")
with open("table.csv", "w") as file:
    for row in table.find_all("div", class_="m4-team-stats"):
        items = row.find_all("div", class_="col-span-1")
        # write into file in csv format, use map to extract text from items
        file.write(",".join(map(lambda item: item.text, items)) + "\n")
Display output:
import pandas as pd
df = pd.read_csv("table.csv")
print(df)
# Outputs:
"""
Team ↓Win Rate ... ↓First Blood Rate ↓Hero pool
0 echo 72.0% ... 48.0% 37
1 rrq 60.9% ... 60.9% 37
2 tv 60.0% ... 60.0% 29
3 fcon 55.0% ... 85.0% 32
4 inc 53.3% ... 26.7% 31
5 onic 52.9% ... 47.1% 39
6 blck 52.2% ... 47.8% 31
7 rrq-br 46.2% ... 30.8% 32
8 thq 45.5% ... 63.6% 27
9 s11 42.9% ... 28.6% 26
10 tdk 37.5% ... 62.5% 24
11 ot 28.6% ... 28.6% 21
12 mvg 20.0% ... 20.0% 15
13 rsg-sg 20.0% ... 60.0% 17
14 burn 0.0% ... 20.0% 21
15 mdh 0.0% ... 40.0% 18
[16 rows x 12 columns]
"""

how to use regular expression in python to get memory details in product name

I am trying to use a regular expression to parse the memory size out of a product name.
Product names:
1) Lg K42 Blu Tim Smartphone 64 gb
2) Xiaomi Smartphone 0.128 gb ram 4 gb. tim quadband - Redmi Note 9 128gb Grigio Tim.
How can I get the 64 gb using a regex in Python? "gb" may be lowercase or uppercase, and the memory value may contain 2 or 3 digits.
import xlwt
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import re
import time
from bs4 import BeautifulSoup
from datetime import date
class cometmobiles:
    def __init__(self):
        self.url='https://www.comet.it/smartphone-e-telefonia/smartphone-e-cellulari/smartphone'
        self.country='IT'
        self.currency='euro'
        self.VAT='Included'
    def comet(self):
        #try:
        wb = xlwt.Workbook()
        ws = wb.add_sheet('Sheet1',cell_overwrite_ok=True)
        ws.write(0,0,"Product_Url")
        ws.write(0,1,"Product_Manufacturer")
        ws.write(0,2,"Product_Model")
        ws.write(0,3,"Product_memory")
        ws.write(0,4,"Product_Price")
        ws.write(0,5,"Currency")
        ws.write(0,6,"VAT")
        ws.write(0,7,"Shipping")
        ws.write(0,8,"Country")
        ws.write(0,9,"Date")
        wb.save(r"C:\Users\Karthick R\Desktop\VS code\comet.xls")
        driver=webdriver.Chrome()
        today = date.today()
        driver.maximize_window()
        driver.implicitly_wait(30)
        driver.get(self.url)
        wait = WebDriverWait(driver, 20)
        cookies = driver.find_element_by_xpath('//i[@class="btn-close-popup fas fa-times"]')
        cookies.click()
        time.sleep(5)
        print("clicked")
        x = 0
        titles = []
        models = []
        memorys = []
        prices = []
        links =[]
        product_links = []
        while True:
            containers = driver.find_elements_by_css_selector('div[class="col-12 col-sm-6 col-md-4"]')
            i = 1
            for container in containers:
                url = container.find_element_by_css_selector('div[class="sotto-cat__products__item"]')
                urls = url.find_element_by_tag_name('a').get_attribute('href')
                i = i + 1
                product_links.append(urls)
            print(product_links)
            x+=1
            time.sleep(5)
            driver.get(self.url+"?p="+str(x))
            print("next page")
            if driver.current_url == self.url:
                break
        for links in product_links:
            driver.get(links)
            time.sleep(10)
            #product links
            print(driver.current_url)
            source = driver.page_source
            soup = BeautifulSoup(source,'html.parser')
            #title
            title = soup.find('h1',{'class':'scheda-prodotto__info__col-sx__title'}).text
            y = re.search('([^\s]+)',title)
            print(title)
            titles.append(y.group(1))
            #models
            model = re.sub(y.group(1),"",title).strip()
            print(model)
            models.append(model)
            #memory
            memory = re.search('^[0-9]{2,3}+[A-Za-z]',model).strip()
            print(memory)
            memorys.append(memory)
            #price
            price = soup.find('span',{'class':'caption__price'}).text
            print(price)
        i=0
        while i<len(titles):
            ws.write(i+1,0,str(links[i]))
            ws.write(i+1,1,str(titles[i]))
            ws.write(i+1,2,str(models[i]))
            ws.write(i+1,3,str(memorys[i]))
            ws.write(i+1,4,str(prices[i]))
            ws.write(i+1,5,str(self.currency))
            ws.write(i+1,6,str(self.VAT))
            ws.write(i+1,7,str(self.shipping))
            ws.write(i+1,8,str(self.country))
            ws.write(i+1,9,str(date.today()))
            i=i+1
        wb.save(r"C:\Users\Karthick R\Desktop\VS code\comet.xls")
        #except:
        #pass
comets=cometmobiles()
comets.comet()

You can try this out.
import re
s = 'Lg K42 Blu Tim Smartphone 64 gb Xiaomi Smartphone 0.128 gb ram 4 gb. tim quadband - Redmi Note 9 128Gb Grigio Tim.'
f = re.findall(r'\b\d{2,3}\s*gb\b',s,re.I)
print(f)
Output:
['64 gb', '128 gb', '128Gb']
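Note that the pattern above also picks up the "128 gb" that is part of "0.128 gb". If that fractional value should not count as a memory size (an assumption about the intent of the question), a negative lookbehind can reject digits that continue a longer number. The names `MEMORY_RE` and `extract_memory_gb` below are made up for this sketch.

```python
import re

# (?<![\d.,]) rejects digits that continue a longer number such as "0.128";
# \d{2,3} keeps the "2 or 3 digits" rule from the question.
MEMORY_RE = re.compile(r"(?<![\d.,])(\d{2,3})\s*gb\b", re.IGNORECASE)


def extract_memory_gb(name):
    """Return every 2-3 digit memory size (in GB) found in a product name."""
    return [int(value) for value in MEMORY_RE.findall(name)]


first = extract_memory_gb("Lg K42 Blu Tim Smartphone 64 gb")
second = extract_memory_gb(
    "Xiaomi Smartphone 0.128 gb ram 4 gb. tim quadband - Redmi Note 9 128gb Grigio Tim."
)
print(first, second)
```

The single-digit "4 gb" is skipped on purpose, because the question asks only for values with 2 or 3 digits.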

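import re
input_str = "Lg K42 Blu Tim Smartphone 64 gb "
# digits followed by optional whitespace and "gb" in any case; the lookahead keeps "gb" out of the match
output_value = re.search(r'[0-9]+(?=\s*gb)', input_str, re.I)
print(output_value[0])
Output: 64
For input_str = "Redmi Note 9 128gb Grigio Tim" the output is 128.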

Web scraping through multiple pages doesn't save each result - beautifulsoup

My problem is that the code loops through the pages but doesn't write anything into my list.
At the end I print len(title) and it is still 0.
from bs4 import BeautifulSoup
import requests
for page in range(20, 200, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    web_req = requests.get(current_page).text
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    title = []
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)
print(len(title))

Move the title list out of the loop, so that it is no longer reset on every page, and there you have it.
import requests
from bs4 import BeautifulSoup
title = []
for page in range(20, 40, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)
print(title)
Output:
['ELEKTRONY SKODA OCTAVIA SCOUT DISKY “PROTEUS” R17', 'Fiat Sedici 1.6, 4x4, r.v 04/2009, 79 kw, slovenské ŠPZ', 'Bmw e46 328ci', '255/50 R19', 'Honda Jazz 1.3', 'Predám 4 ks kolesá', 'Audi A5 3.2 FSI quattro tiptronic S LINE R20 TOP STAV', 'Peugeot 407 combi 1,6 hdi', 'Škoda Superb 2.0TDI 4x4 od 260€ mesačne, bez akontácia', 'Predam elektrony Audi 5x112 R17 a letne pneu', 'ROZPREDÁM MAZDA 3 2.0i 110kW NA NÁHRADNÉ DIELY', 'Predám Astra j Turbo Noblesse bronz', 'ŠKODA KAROQ 1.6 TDI - full výbava', 'VW CHICAGO 5x112 + letné pneu 215/40 R18', 'Fiat 500 SPORT 1.3 multijet 70kw', 'Volvo FL280 - TROJSTRANNÝ SKLÁPAČ + HYDRAULICKÁ RUKA', 'ŠKODA SUPERB COMBI 2.0 TDI 190K 4X4 L&K DSG', 'FORD FOCUS 2.0 TDCI TITANIUM', 'FORD EDGE 2.0 TDCi - 154 kW VIGNALE : 27.000 km', 'R18 5x112 originalne Vw Seat Audi Skoda']
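The difference between the two versions can be reproduced without any network access. In this small sketch the page contents are made up, and the empty last page is an assumption about one way the count could end at 0.

```python
# Made-up page contents; the last page has no results.
pages = {20: ["title a", "title b"], 40: ["title c"], 60: []}

# Buggy pattern: the list is recreated on every iteration,
# so only the results of the last page survive.
for page in pages:
    titles_reset = []
    titles_reset.extend(pages[page])

# Fixed pattern: create the list once, before the loop.
titles_kept = []
for page in pages:
    titles_kept.extend(pages[page])

print(len(titles_reset), len(titles_kept))
```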

Beautiful Soup Craigslist Scraping Pricing is the same

I am trying to scrape Craigslist using BeautifulSoup4. All data shows properly EXCEPT price. I can't seem to find the right tagging to loop through pricing instead of showing the same price for each post.
import requests
from bs4 import BeautifulSoup
source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')
for summary in soup.find_all('p', class_='result-info'):
    pricing = soup.find('span', class_='result-price')
    price = pricing
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
Left: HTML code from Craigslist, commented out is irrelevant (in my opinion) code. I want pricing to not loop the same number. Right: Sublime SS of code.
Snippet of code running through terminal. Pricing is the same for each post.
Thank you

Your script is almost correct. You need to search for the price inside summary instead of in the whole soup object; soup.find always returns the first price on the page:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')
for summary in soup.find_all('p', class_='result-info'):
    price = summary.find('span', class_='result-price')
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
Output:
Boat Water Tender - 10 Tri-Hull with Electric Trolling Motor
$629
https://washingtondc.craigslist.org/nva/boa/d/haymarket-boat-water-tender-10-tri-hull/7160572264.html
1987 Boston Whaler Montauk 17
$25450
https://washingtondc.craigslist.org/nva/boa/d/alexandria-1987-boston-whaler-montauk-17/7163033134.html
1971 Westerly Warwick Sailboat
$3900
https://washingtondc.craigslist.org/mld/boa/d/upper-marlboro-1971-westerly-warwick/7170495800.html
Buy or Rent. DC Party Pontoon for Dock Parties or Cruises
$15000
https://washingtondc.craigslist.org/doc/boa/d/washington-buy-or-rent-dc-party-pontoon/7157810378.html
West Marine Zodiac Inflatable Boat SB285 With 5HP Gamefisher (Merc)
$850
https://annapolis.craigslist.org/boa/d/annapolis-west-marine-zodiac-inflatable/7166031908.html
2012 AB aluminum/hypalon inflatable dinghy/2012 Yamaha 6hp four stroke
$3400
https://annapolis.craigslist.org/bpo/d/annapolis-2012-ab-aluminum-hypalon/7157768911.html
RHODES-18’ CENTERBOARD DAYSAILER
$6500
https://annapolis.craigslist.org/boa/d/ocean-view-rhodes-18-centerboard/7148322078.html
Mercury Outboard 7.5 HP
$250
https://baltimore.craigslist.org/bpo/d/middle-river-mercury-outboard-75-hp/7167399866.html
8 hp yamaha 2 stroke
$0
https://baltimore.craigslist.org/bpo/d/8-hp-yamaha-2-stroke/7154103281.html
TRADE 38' BENETEAU IDYLLE 1150
$35000
https://baltimore.craigslist.org/boa/d/middle-river-trade-38-beneteau-idylle/7163761741.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102434.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102744.html
Wanted ur unwanted outboards
$0
https://baltimore.craigslist.org/bpo/d/randallstown-wanted-ur-unwanted/7141349142.html
Grumman Sport Boat
$2250
https://baltimore.craigslist.org/boa/d/baldwin-grumman-sport-boat/7157186381.html
1996 Carver 355 Aft Cabin Motor Yacht
$47000
https://baltimore.craigslist.org/boa/d/middle-river-1996-carver-355-aft-cabin/7156830617.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566763.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565771.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566035.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565301.html
Cape Dory 25 Sailboat for sale or trade
$6500
https://baltimore.craigslist.org/boa/d/reedville-cape-dory-25-sailboat-for/7149227778.html
West Marine HP-V 350
$1200
https://baltimore.craigslist.org/boa/d/pasadena-west-marine-hp-350/7147285666.html

Real Estate Market Scraping using Python and BeautifulSoup

I need some ideas on how to parse a real estate market using Python. I've looked up some information about parsing websites, and I even did this in VBA, but I would like to do it in Python.
This is the site to be parsed (only one offer for now, but the program will eventually work on the full range of real estate offers, across multiple pages from kontrakt.szczecin.pl):
http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149
First of all, the program will use 3 pieces of information:
1/ The table containing the main parameters:
Numer oferty 351149, Liczba pokoi 3, Cena 335 000 PLN, Cena za m2 3 350 PLN (offer number, number of rooms, price, price per square meter, etc.). However, the number of fields depends on the offer: sometimes there are 14, sometimes 12, sometimes 16, etc.
2/ The description of the property in paragraphs (this is another part of the program; it can be skipped for now): Sometimes the table (1/) says that there is a garage or a balcony. But the paragraph contains a sentence saying that the garage costs extra (which for me means the property doesn't have a garage) or that the balcony is a French balcony (which for me is no balcony).
My idea is that the program should find the relevant word in the paragraph (such as garage) and copy the text around it, with some additional text on the left and right side (for instance, 20 characters on both sides, but what if the word is at the very beginning?)
3/ Additional parameters:
Not every offer has them, but in one like this (http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165) there is information about the number of balconies in the property. Sometimes there is information about a basement too. The code should be similar to the one for 1/.
So I tried something like this, using some internet sources (it is still incomplete):
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165"
# Open a connection to the chosen page and download its content (urllib)
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# HTML parsing (BeautifulSoup); "html.parser" parses the page as HTML rather than, e.g., XML
page_soup = soup(page_html, "html.parser")
# Grab the table with the offer number, kitchen and other data about the property
# (class and id go into one attrs dict; a third positional argument would be read as `recursive`)
containers = page_soup.findAll("section", {"class": "clearfix", "id": "quick-summary"})
# print(len(containers)) shows how many such objects exist on the page
# There is only one such table on the page, but for the sake of learning I treat it as if there were many
container = containers[0]
find_dt = container.findAll("dt")
find_dd = container.findAll("dd")
# .text is needed on both sides; a str cannot be concatenated with a Tag
print(find_dt[0].text + " " + find_dd[0].text)
It works, but it is still incomplete. I haven't continued with it because there is a major flaw. As you can see, the last print uses indexes, but not every property lists its fields in the same order (because, as I mentioned, sometimes there are 10 pieces of information, sometimes more, sometimes fewer). That would make a huge mess in the CSV.
My VBA program worked this way:
Copy the table to Excel (Sheet 1)
Sheet 2 held the parameters the program was looking for (such as prices)
The mechanism in short: copy a parameter from Sheet 2 (Price), go to Sheet 1 (where the parsed information is), find the Price string (paste the text copied from Sheet 2: "Price"), go one line down, copy the price value, go to Sheet 2, find Price, go one line down, paste the price value. And so on.
I am looking for help with both the concept and the code.
EDIT:
PART 1 and PART 2 are ready. But I have big issues with PART 3. Here is the code:
from urllib import request as uReq
import requests
# importing exit lets the program stop immediately without running the rest of the code; after the import, exit(0) is enough
from sys import exit
from urllib.request import urlopen as uReq2
from bs4 import BeautifulSoup as soup
import csv
import re
import itertools
filename = 'test.txt'
# counter used to count the offer numbers in the .txt file
num_lines = 0
# create the data list and the URL lists. Results are appended to these lists, so they have to exist first (empty)
list_of_lines = ['351238', '351237', '111111', '351353']
list_of_lines2 = []
list_of_URLs = []
list_of_redictered_URLs = []
KONTRAKT = 'http://www.kontrakt.szczecin.pl'
with open(filename, 'r') as file:
    for line in file:
        # add the line (offer) to the list
        list_of_lines.append(line.strip())
        # num_lines is a counter showing how many rows the list holds; it matters for the loop that builds the URLs
        num_lines += 1
# build URLs from the offer numbers contained in filename
for i in range(num_lines):
    nr_oferty = list_of_lines[i]
    my_url = "http://www.kontrakt.szczecin.pl/lista-ofert/?f_listingId=" + nr_oferty + "&f=&submit=Szukaj"
    list_of_URLs.append(my_url)
print(list_of_URLs)
# Part two: converting the list of links into a list of redirected links
# The program opens a page that should redirect, but because the site uses JavaScript
# the task is harder. So the goal of the program is to simulate a browser, download
# the page content, and then extract the right redirect link
i = 0
for i in range(num_lines):
    url_redirect = list_of_URLs[i]
    my_url = url_redirect
    BROWSER = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(my_url, headers=BROWSER)
    script1 = '<script>'
    script2 = '</script>'
    content_URL = str(response.content)
    find_script1 = (content_URL.find(script1))
    find_script2 = (content_URL.find(script2))
    url_ready = content_URL[find_script1:find_script2]
    print(i+1,'z', num_lines, '-', 'oferta nr:', str(my_url[57:57+6]))
    list_of_redictered_URLs.append(url_ready)
# remove unneeded tags and characters to get a clean redirected link
list_of_redictered_URLs = [w.replace('<script>window.location=\\\'','') for w in list_of_redictered_URLs]
list_of_redictered_URLs = [w.replace('\\\';','') for w in list_of_redictered_URLs]
#print(list_of_redictered_URLs)
# remove empty rows from the list (offers that are no longer current enter the list as empty). Note: item is just a variable name and could be renamed to anything.
filtered_list = list(filter(lambda item: item.strip(), list_of_redictered_URLs))
filtered_list = [KONTRAKT + item for item in filtered_list]
# convert to a tuple, since mutability (adding further links) will not be needed
filtered_list = tuple(filtered_list)
#print(str(filtered_list))
print('Lista linków:\n',filtered_list)
# The next part of the program downloads the key information (basic parameters)
# from kontrakt.szczecin.pl and then saves it to a csv file.
# Headers in the csv and parameter names on the page (they must be identical to those on the page so that
# they can be matched correctly in the .csv)
HEADERS = ['Numer oferty',
'Liczba pokoi',
'Cena',
'Cena za m2',
'Powierzchnia',
'Piętro',
'Liczba pięter',
'Typ kuchni',
'Balkon',
'Czynsz administracyjny',
'Rodzaj ogrzewania',
'Umeblowanie',
'Wyposażona kuchnia',
'Gorąca woda',
'Rodzaj budynku',
'Materiał',
'Rok budowy',
'Stan nieruchomości',
'Rynek',
'Dach:',
'Liczba balkonów:',
'Liczba tarasów:',
'Piwnica:',
'Ogród:',
'Ochrona:',
'Garaż:',
'Winda:',
'Kształt działki:',
'Szerokość działki (mb.):',
'Długość działki (mb.):',
'Droga dojazdowa:',
'Gaz:',
'Prąd:',
'Siła:','piwnica',
'komórk',
'strych',
'gospodarcze',
'postojow',
'parking',
'przynależn',
'garaż',
'ogród',
'ogrod',
'działka',
'ocieplony',
'moderniz',
'restaur',
'odnow',
'ociepl',
'remon',
'elew',
'dozór',
'dozor',
'monitoring',
'monit',
'ochron',
'alarm',
'strzeż',
'portier',
'wspólnot',
'spółdziel',
'kuchni',
'aneks',
'widna',
'ciemna',
'prześwit',
'oficyn',
'linia',
'zabudow',
'opłat',
'bezczynsz',
'poziom',
'wind',
'francuski',
'ul.',
'w cenie',
'dodatkową']
LINKI = ["Link"]
#HEADERS2 = ['Liczba balkonów:',
# 'Liczba tarasów:',
# 'Piwnica:',
# 'Ogród:',
# 'Ochrona:',
# 'Garaż:',
# 'Winda:']
HEADERS3 = ['piwnica',
'komórk',
'strych',
'gospodarcze',
'postojow',
'parking',
'przynależn',
'garaż',
'ogród',
'ogrod',
'działka',
'ocieplony',
'moderniz',
'restaur',
'odnow',
'ociepl',
'remon',
'elew',
'dozór',
'dozor',
'monitoring',
'monit',
'ochron',
'alarm',
'strzeż',
'portier',
'wspólnot',
'spółdziel',
'kuchni',
'aneks',
'widna',
'ciemna',
'prześwit',
'oficyn',
'linia',
'zabudow',
'opłat',
'bezczynsz',
'poziom',
'wind',
'francuski',
'ul.',
'w cenie',
'dodatkową',]
csv_name = 'data.csv'
print('Dane zostaną zapisane do pliku:',csv_name + '.csv')
print('\n>>>>Program rozpoczyna pobieranie danych')
# Fetching the links
i = 0
# Creates the csv file
# writerow takes only one argument, so it gets the sum of the individual lists. The
# LINKI list has a single item, because only data of one type can be added together. A list cannot be added to strings.
with open(csv_name + '.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    HEADERS_ALL = HEADERS+HEADERS3+LINKI
    csvwriter.writerow(HEADERS_ALL)
    for i in range(len(filtered_list)):
        my_url = filtered_list[i]
        with uReq2(my_url) as uClient:
            page_soup = soup(uClient.read(), 'lxml')
        print('\t\t-----------',i+1,'-----------\n',my_url)
        # <dt> - parameter name, e.g. Kuchnia
        # <dd> - parameter value, e.g. widna
        row = ['-'] * len(HEADERS) + ['-'] * len(HEADERS3) + ['-'] * len(LINKI)
        # Basic parameters (kontrakt.szczecin.pl)
        for dt, dd in zip(page_soup.select('section#quick-summary dt'), page_soup.select('section#quick-summary dd')):
            if dt.text.strip() not in HEADERS:
                print("\n 1(dt,dd):UWAGA!, kolumna [{}] nie istnieje w nagłówkach! (stała: HEADERS)\n".format(dt.text.strip()))
                continue
            row[HEADERS.index(dt.text.strip())] = dd.text.strip()
        # Additional parameters
        for span, li in zip(page_soup.select('section#property-features span'), page_soup.select('section#property-features li')):
            if span.text.strip() not in HEADERS:
                print("\n 2:UWAGA(span,li), kolumna [{}] nie istnieje w nagłówkach (stała HEADERS)!\n".format(span.text.strip()))
                continue
            row[HEADERS.index(span.text.strip())] = li.text.strip()
        #csvwriter.writerow(row)
        print(row)
        # Here the fun begins...
        # the variable j refers to the HEADERS3 index; it is j and not i because i is still in use
        # in the loop above
        for p in page_soup.select('section#description'):
            p = str(p)
            p = p.lower()
            for j in range(len(HEADERS3)):
                #print('j:',j)
                # find_p finds all HEADERS3 keywords in the paragraph on the kontrakt page.
                find_p = re.findall(HEADERS3[j],p)
                # the lists holding the start positions of the words must start with '-' or 0?,
                # because when a word is not found the lists are empty in the first loop iteration,
                # which leads to an out-of-range error
                m_start = []
                m_end = []
                lista_j = []
                for m in re.finditer(HEADERS3[j], p):
                    #print((m.start(),m.end()), m.group())
                    m_start.append(m.start())
                    m_end.append(m.end())
                #print(h)
                for k in range(len(m_start)):
                    # actually, right now I don't know what this is for..
                    try:
                        x = m_start[k]
                        y = m_end[k]
                    except IndexError:
                        x = m_start[0]
                        y = m_end[0]
                    #print('xy:',x,y)
                    #print(find_p)
                    #print(HEADERS3[j])
                    z = (HEADERS3[j]+':',p[-60+x:y+60]+' ++-NNN-++')
                    lista_j.append(z)
                print (lista_j)
                print(str(lista_j))
                row[HEADERS.index(span.text.strip())] = str(lista_j)
        csvwriter.writerow(row)
        #print(row)
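For the keyword search in point 2/, the "what if the word is in the first place?" case can be handled by clamping the left edge of the slice. In the code above, p[-60+x:y+60] produces a negative start index whenever the match begins fewer than 60 characters into the text, and a negative index counts from the end of the string. A minimal sketch of the clamped version follows; `keyword_context` is a made-up name and the description text is invented for illustration.

```python
import re


def keyword_context(text, keyword, width=20):
    """Return each occurrence of keyword with up to `width` characters on both sides."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text):
        # clamp at 0 so a match near the start does not turn into a negative index
        start = max(0, match.start() - width)
        # slicing past the end of a string is safe in Python, so no clamp is needed here
        end = match.end() + width
        snippets.append(text[start:end])
    return snippets


# Invented description text, in English for readability.
description = "garage for an extra fee, french balcony on the garden side"
at_start = keyword_context(description, "garage", width=10)
in_middle = keyword_context(description, "balcony", width=10)
print(at_start, in_middle)
```

`re.escape` is used so that a keyword such as 'ul.' is matched literally rather than as a pattern in which the dot matches any character.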

This code snippet parses the quick summary table of the property URL and saves it to a CSV file:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
# my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165'
my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149'
with uReq(my_url) as uClient:
    page_soup = soup(uClient.read(), 'lxml')
with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    for dt, dd in zip(page_soup.select('section#quick-summary dt'), page_soup.select('section#quick-summary dd')):
        csvwriter.writerow([dt.text.strip(), dd.text.strip()])
The result is in data.csv, screenshot from my LibreOffice:
To have the table transposed (one column per parameter), you can use this code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
# my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-6664m2-339600pln-potulicka-nowe-miasto-szczecin-zachodniopomorskie,351165'
my_url = 'http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149'
with uReq(my_url) as uClient:
    page_soup = soup(uClient.read(), 'lxml')
headers = ['Numer oferty',
'Liczba pokoi',
'Cena',
'Cena za m2',
'Powierzchnia',
'Piętro',
'Liczba pięter',
'Typ kuchni',
'Balkon',
'Czynsz administracyjny',
'Rodzaj ogrzewania',
'Gorąca woda',
'Rodzaj budynku',
'Materiał',
'Rok budowy',
'Stan nieruchomości',
'Rynek',
'Dach:',
'Liczba balkonów:',
'Piwnica:',
'Kształt działki:',
'Szerokość działki (mb.):',
'Długość działki (mb.):',
'Droga dojazdowa:',
'Gaz:',
'Prąd:',
'Siła:']
with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    csvwriter.writerow(headers)
    row = ['-'] * len(headers)
    for dt, dd in zip(page_soup.select('section#quick-summary dt'), page_soup.select('section#quick-summary dd')):
        if dt.text.strip() not in headers:
            print("Warning, column [{}] doesn't exist in headers!".format(dt.text.strip()))
            continue
        row[headers.index(dt.text.strip())] = dd.text.strip()
    csvwriter.writerow(row)
The result will be in csv file like this (the values not present will be substituted with '-'):
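The same "fixed columns, '-' for missing values" idea can also be sketched with csv.DictWriter from the standard library, which does the column lookup itself. This is an alternative sketch, not the code from the answer above: `offers_to_csv` and `FIELDS` are made-up names, the header list is shortened, and the sample values are illustrative. Each dict stands for the dt/dd pairs of one offer page.

```python
import csv
import io

# Shortened header list for illustration; the real list would hold every expected label.
FIELDS = ["Numer oferty", "Liczba pokoi", "Cena", "Balkon"]


def offers_to_csv(offers, fields=FIELDS):
    """Write one row per offer; missing fields become '-', unknown fields are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        restval="-",            # value written when an offer lacks a field
        extrasaction="ignore",  # skip labels that are not in the header list
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(offers)
    return buffer.getvalue()


# Illustrative offers with different fields in a different order.
offers = [
    {"Numer oferty": "351149", "Liczba pokoi": "3", "Cena": "335 000 PLN"},
    {"Cena": "339 600 PLN", "Numer oferty": "351165", "Balkon": "tak", "Winda": "nie"},
]
csv_text = offers_to_csv(offers)
print(csv_text)
```

Because rows are keyed by label rather than by position, the order in which a page lists its parameters no longer matters.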
