I got this piece of code to print the unique "area number" in each URL. However, the loop doesn't work: it prints the same number every time, as shown below:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

url = open('MS Type 1 URL.txt', encoding='utf-8-sig')
links = []
for link in url:
    y = link.strip()
    links.append(y)
url.close()

print('Amount of Links: ', len(links))

for x in links:
    j = (x.find("=") + 1)
    g = (x.find('&housing'))
    print(link[j:g])
Results are:
http://millersamuel.com/aggy-data/home/query_report?area=38&housing_type=3&measure=4&query_type=quarterly&region=1&year_end=2020&year_start=1980
23
http://millersamuel.com/aggy-data/home/query_report?area=23&housing_type=1&measure=4&query_type=annual&region=1&year_end=2020&year_start=1980
23
As you can see, it prints the area number '23', which appears in only one of these URLs, but never the '38' of the other URL.
There's a typo in your code. You iterate over the links list and bind its elements to the variable x, but you print a slice of the variable link, which still holds the last line read from the file, so the same string is printed on every iteration. You could simply change print(link[j:g]) to print(x[j:g]), but it's better to give your variables more descriptive names, so here's a fixed version of your loop:
for link in links:
    j = link.find('=') + 1
    g = link.find('&housing')
    print(link[j:g])
And I also want to show you a proper way to extract the area value from URLs:
from urllib.parse import urlparse, parse_qs
url = 'http://millersamuel.com/aggy-data/home/query_report?area=38&housing_type=3&measure=4&query_type=quarterly&region=1&year_end=2020&year_start=1980'
area = parse_qs(urlparse(url).query)['area'][0]
So instead of using the str.find method, you can write this:

for url in urls:
    parsed_qs = parse_qs(urlparse(url).query)
    if 'area' in parsed_qs:
        area = parsed_qs['area'][0]
        print(area)
Used functions:
urllib.parse.urlparse
urllib.parse.parse_qs
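Note that parse_qs maps every parameter to a list of values, since a query string may repeat a key, which is why the [0] index is needed. A quick illustration:

from urllib.parse import parse_qs

print(parse_qs('area=38&housing_type=3'))
# {'area': ['38'], 'housing_type': ['3']}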
You need to change:
print(link[j:g]) to print(x[j:g])
I want to write code to scrape multiple webpages. However, the problem is that two numbers vary in the URLs at the same time:
000/BBSDD0002/93976?page=1&
000/BBSDD0002/93975?page=1&
000/BBSDD0002/93970?page=1&
000/BBSDD0002/93964?page=1&
000/BBSDD0002/93950?page=1&
000/BBSDD0002/93946?page=1&
000/BBSDD0002/93945?page=1&
000/BBSDD0002/93930?page=2&
000/BBSDD0002/93925?page=2&
...
000/BBSDD0002/39045?page=536&
As we see here, both the page number and the document number vary at the same time.
import requests
import re
from bs4 import BeautifulSoup
from itertools import product

page = range(1, 6)
document = range(39045, 93976)

for i, j in product(page, document):
    print("Page Number:", i)
    url = "https://000.com/BBSDD0002/{}?page={}&".format(i, j)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "lxml")
    list1 = soup.find_all("td", attrs={"class": "sbj"})
    for li in list1:
        print(li.get_text())
This is what I have written so far, but it only loops over the page numbers, so it does not give me anything. Is there any way to create a loop over both the page numbers and the document numbers?
Not sure what your goal is with this, but you could do it this way:
page = range(1, 6)
entry_id = 39045

for p in page:
    for i in range(0, 10):
        print(f'https://000.com/BBSDD0002/{entry_id}?page={p}')
        entry_id = entry_id + 1
Which leads to:
https://000.com/BBSDD0002/39045?page=1
https://000.com/BBSDD0002/39046?page=1
https://000.com/BBSDD0002/39047?page=1
https://000.com/BBSDD0002/39048?page=1
https://000.com/BBSDD0002/39049?page=1
https://000.com/BBSDD0002/39050?page=1
https://000.com/BBSDD0002/39051?page=1
https://000.com/BBSDD0002/39052?page=1
https://000.com/BBSDD0002/39053?page=1
https://000.com/BBSDD0002/39054?page=1
https://000.com/BBSDD0002/39055?page=2
https://000.com/BBSDD0002/39056?page=2
https://000.com/BBSDD0002/39057?page=2
https://000.com/BBSDD0002/39058?page=2
https://000.com/BBSDD0002/39059?page=2
https://000.com/BBSDD0002/39060?page=2
https://000.com/BBSDD0002/39061?page=2
https://000.com/BBSDD0002/39062?page=2
https://000.com/BBSDD0002/39063?page=2
https://000.com/BBSDD0002/39064?page=2
https://000.com/BBSDD0002/39065?page=3
https://000.com/BBSDD0002/39066?page=3
https://000.com/BBSDD0002/39067?page=3
...
If you are trying to scrape the comments, why not iterate over the pages and collect the document URLs from each one? That would also keep you from constructing invalid URLs for comments that have been removed, for example.
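A minimal sketch of that idea, assuming each listing page links to its documents with anchors whose href contains "BBSDD0002/" (the listing URL and the selector are hypothetical; adjust them to the real markup):

import requests
from bs4 import BeautifulSoup

document_urls = []
for p in range(1, 6):
    res = requests.get(f'https://000.com/BBSDD0002?page={p}')  # hypothetical listing URL
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'lxml')
    # keep only document links that actually exist on this page
    for a in soup.select('a[href*="BBSDD0002/"]'):
        document_urls.append(a['href'])
print(len(document_urls), 'document urls collected')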
The project that I am doing requires us to input a URL, follow the link at a particular position a number of times, and then return the last page visited. I have found a solution with a while loop, but now I am trying to do it with recursion.
Example: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
My code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

cnt = input("Enter count:")
post = input("Enter position:")
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
count = int(cnt)
position = int(post)

def GetLinks(initalPage, position, count):
    html = urlopen(initalPage).read()
    soup = BeautifulSoup(html, "html.parser")
    temp = ""
    links = list()
    tags = soup('a')
    for tag in tags:
        x = tag.get('href', None)
        links.append(x)
    print(links[position - 1])
    if count > 1:
        GetLinks(links[position - 1], position, count - 1)
    return links[position - 1]

y = GetLinks(url, position, count)
print("****", y)
I see two problems with my code. First, I am creating a list that expands with every recursion, which makes it very hard to locate the proper value. Second, I am obviously returning the wrong item. I don't know exactly how to fix this.
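One way to fix both problems (a sketch, not a definitive solution): the list is local to each call, so it no longer appears to expand, and the function returns the result of the recursive call instead of discarding it, so the innermost link propagates back out:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_links(page, position, count):
    soup = BeautifulSoup(urlopen(page).read(), "html.parser")
    # hrefs of all anchor tags on the current page
    links = [tag.get('href') for tag in soup('a')]
    next_link = links[position - 1]
    print(next_link)
    if count > 1:
        # return the recursive result so the last page visited reaches the caller
        return get_links(next_link, position, count - 1)
    return next_link

Called as get_links(url, 3, 4), this prints each link followed and returns the last one visited.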
I am trying to remove the quotes from my re.findall output using Python 3. I tried suggestions from various forums, but they didn't work as expected, so I finally thought of asking here myself.
My code:
import requests
from bs4 import BeautifulSoup
import re
import time

price = []
while True:
    url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = soup.prettify()
    for p in data:
        match = re.findall('\d*\.?\d+', data)
        print("ETH/USDT", match)
        price.append(match)
    break
Output of match gives: ['143.19000000']. I would like it to be like [143.19000000], but I cannot figure out how to do this.

Another problem I am encountering is that the price list appends every match as a single-element list, so the output of price looks like, for example, [[a], [b], [c]]. I would like it to be [a, b, c]. I am having a bit of trouble solving these two problems.
Thanks :)
Parse the response from requests.get() as JSON, rather than using BeautifulSoup:
import requests
url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
response = requests.get(url)
response.raise_for_status()
data = response.json()
print(data["price"])
To get floats instead of strings:
float_match = [float(el) for el in match]
To get a list instead of a list of lists:
for el in float_match:
    price.append(el)
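Putting both fixes together, a minimal sketch of the polling loop with no BeautifulSoup or regular expressions at all (the iteration count and sleep interval are arbitrary choices for the demonstration):

import requests
import time

url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
price = []
for _ in range(3):
    response = requests.get(url)
    response.raise_for_status()
    # float(...) drops the quotes; appending the scalar keeps the list flat
    price.append(float(response.json()["price"]))
    time.sleep(1)
print(price)  # e.g. [143.19, 143.2, 143.21]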
I looked at the site's HTML source and found what I need for namePlayer: it is in the 4th column, inside an 'a' tag. So I tried to fill answers.append with 'namePlayer': cols[3].a.text.

But when I run it, I get an IndexError. I then tried changing the index to 2, 3, 4, and 5, but nothing worked.

Issue: why do I get IndexError: list index out of range when everything looks OK (I think :D)?
source:
#!/usr/bin/env python3
import re
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def get_html(url):
    opener = AppURLopener()
    response = opener.open(url)
    return response.read()

def parse(html):
    soup = BeautifulSoup(html)
    table = soup.find(id='answers')
    answers = []
    for row in table.find_all('div')[16:]:
        cols = row.find_all('div')
        answers.append({
            'namePlayer': cols[3].a.text
        })
    for answer in answers:
        print(answers)

def main():
    parse(get_html('http://jaze.ru/forum/topic?id=50&page=1'))

if __name__ == '__main__':
    main()
You are overwriting cols during your loop, and the last cols found has length zero, hence your error.
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
    print(len(cols))
Run the above and you will see that cols ends up with length 0. This might also occur elsewhere in the loop, so you should test the length and decide whether your logic needs updating. You also need to account for whether there is a child a tag.
So, you might, for example, do the following (bs4 4.7.1+ required; note that the :has() selector works with select, not find_all):
answers = []
for row in table.find_all('div')[16:]:
    cols = row.select('div:has(> a)')
    if len(cols) >= 4:
        answers.append({
            'namePlayer': cols[3].a.text
        })
Note that the append to answers has been properly indented so you are working with each cols value. This may not fit your exact use case, as I am unsure what your desired result is. If you state the desired output, I will update accordingly.
EDIT: To get the player names:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = {i.text.strip() for i in soup.select('[id^=answer_] .left-side a')}
You can preserve order while removing duplicates with OrderedDict (this approach is by @Michael; other solutions are in that Q&A):
from bs4 import BeautifulSoup as bs
import requests
from collections import OrderedDict
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = OrderedDict.fromkeys(names).keys()
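As a side note, on Python 3.7+ a plain dict also preserves insertion order, so the same de-duplication can be written without the import:

unique_names = list(dict.fromkeys(names))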
It does sound like you are asking for an index at which no list element exists. Remember that indexing starts at 0 (0, 1, 2, 3, ...), so asking a four-element list for element 10 raises an IndexError.
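For example:

players = ['a', 'b', 'c', 'd']
players[10]  # IndexError: list index out of range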
Why use a for loop to find all the div tags?

for row in table.find_all('div')[16:]:
    cols = row.find_all('div')

You can get all the tags you want with a single call:

cols = table.find_all('div')[16:]

So just replace your loop with this line and you will get your answer.
I have been developing a Python web crawler to collect used-car stock data from this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=20).

First of all, I would like to collect only the "BMW" entries from the list, so I used the search function from the re module, as in the code below. But it keeps returning None. Is there anything wrong with my code? Please give me some advice.

Thanks.
from bs4 import BeautifulSoup
import urllib.request
import re

CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="

def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        print("Page#", i)
        # 50 lists per each page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        r = re.compile("[BMW]")
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            print('#', count, lst_title, r.search("lst_title"))
        return lst_link

fetch_post_list()
r.search("lst_title")

This searches inside the string literal "lst_title", not the variable named lst_title, which is why it never matches.
r = re.compile("[BMW]")

The square brackets define a character class, meaning any one of those characters, so, for example, any string containing an M will match. You just want the literal text "BMW". In fact, you don't even need regular expressions; you can simply test:

"BMW" in lst_title