How to turn a string back into a command? - Python

I actually need more than one item from the page, and they are all under the same headers, so I really don't want to repeat the same soup_wash.find("td", headers="tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{}").text.strip() line every time. Instead I am trying to store the expression in a string to save time.
import requests
from bs4 import BeautifulSoup

def html(url):
    return BeautifulSoup(requests.get(url).text, "lxml")

soup_wash = html("https://www.washtenaw.org/3108/Cases")
text = 'soup_wash.find("td", headers="tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{}").text.strip()'
item1 = text.format("2")
item2 = text.format("6")
print(item1, item2)  # Supposed to print 1561 and 107, but it actually prints the formatted strings themselves.
I need bs4 to evaluate the strings in item1 and item2, but I'm not sure how to do so.
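One way to avoid evaluating strings as code entirely is to wrap the repeated lookup in a function and pass the varying part as an argument. A minimal sketch, reusing soup_wash and the headers id from the question (the id is assumed stable for the session):

def find_cell(soup, suffix):
    # The long headers id is copied from the question; only the trailing
    # digit changes between cells.
    return soup.find(
        "td", headers="tf89c8e5b-5207-48e7-a536-1f50ee7f5088c" + suffix
    ).text.strip()

item1 = find_cell(soup_wash, "2")
item2 = find_cell(soup_wash, "6")
print(item1, item2)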

I personally wouldn't use the value tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{} to get the Total Cases and Total Deaths values, because it looks like it could change at any time.
Instead, grab the first table and use standard Python indexing to get the columns. For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.washtenaw.org/3108/Cases'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print('{:<15}{}'.format('Total Cases', 'Total Deaths'))
for tr in soup.select('table')[0].select('tr:has(td)'):
    tds = [td.get_text() for td in tr.select('td')]
    print('{:<15}{}'.format(tds[1], tds[5]))
Prints:
Total Cases    Total Deaths
1561           107
338            3
1899           110
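If pandas is available, the same table can also be pulled in a single call. A sketch, assuming the page's first table parses cleanly for read_html (which needs lxml or html5lib installed):

import pandas as pd

# read_html returns one DataFrame per <table> found on the page.
df = pd.read_html('https://www.washtenaw.org/3108/Cases')[0]
print(df)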

Related

Python scraping with BeautifulSoup cannot properly scrape some lines of data

I am exploring web scraping in Python. I have the following snippet, but the problem is that some of the extracted lines of data are not correct. What could be the problem with this snippet?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://bscscan.com/txsinternal?ps=100&zero=false&valid=all'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()
soup = BeautifulSoup(webpage, 'html.parser')

rows = soup.findAll('table')[0].findAll('tr')
for row in rows[1:]:
    ttype = row.find_all('td')[3].text
    amt = row.find_all('td')[7].text
    transamt = str(amt)
    print()
    print("this is bnbval: ", transamt)
    print("transactiontype: ", ttype)
Sample output:
trans amt: Binance: WBNB Token #- wrong data being extracted
transtype: 0x2de500a9a2d01c1d0a0b84341340f92ac0e2e33b9079ef04d2a5be88a4a633d4 #- wrong data being extracted
trans amt: 1 BNB
transtype: call
trans amt: 1 BNB
transtype: call
this is bnbval: Binance: WBNB Token #- wrong data being extracted
transactiontype: 0x1cc224ba17182f8a4a1309cb2aa8fe4d19de51c650c6718e4febe07a51387dce #- wrong data being extracted
trans amt: 1 BNB
transtype: call
There is nothing wrong with your code, but there is a problem with the data on the page.
Some rows are 7-column rows (the layout you're expecting), and some rows are 9-column rows. The 9-column rows give you the wrong data.
You can just go to the page and inspect the elements to see the issue.
I can suggest using the last element ([-1]) instead of [7], but you also need some kind of if check on the row shape before indexing into it.
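A minimal sketch of that suggestion; the exact column count of the ordinary rows is an assumption to verify against the page:

expected_cols = 8  # assumption: ordinary transaction rows have 8 tds
for row in rows[1:]:
    tds = row.find_all('td')
    if len(tds) != expected_cols:
        continue  # skip the differently shaped rows that yield wrong data
    ttype = tds[3].text
    amt = tds[-1].text  # last column, as suggested, instead of a fixed [7]
    print("trans amt:", amt)
    print("transtype:", ttype)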

How can I find the next sibling of an 'a' tag which is inside a table th tag?

I am scraping company data from Wikipedia infobox tables, where I need to scrape some values that are inside td tags - like Type, Traded as, Services, etc.
My code is:
import requests
from bs4 import BeautifulSoup

response = requests.get(url, headers=headers)  # url and headers are defined elsewhere
html_soup = BeautifulSoup(response.text, 'lxml')
table_container = html_soup.find('table', class_='infobox')
hq_name = table_container.find("th", text=['Headquarters']).find_next_sibling("td")
This gives the headquarters value and works perfectly.
But when I try to fetch 'Traded as' or any th element that contains a hyperlink, the above code does not work; it returns None.
So how do I get the next sibling of 'Traded as' or 'Type'?
From your comment:
https://en.wikipedia.org/wiki/IBM is the URL, and the expected output will be: Traded as - NYSE: IBM, DJIA Component, S&P 100 Component, S&P 500 Component
Use the a tags to separate the items, and select the required row from the table by nth-of-type. You can then join the first two items in the output list if required.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/IBM')
soup = bs(r.content, 'lxml')
items = [item.text.replace('\xa0',' ') for item in soup.select('.vcard tr:nth-of-type(4) a')]
print(items)
To get the output as shown (if the first and second items are indeed joined):
final = items[2:]
final.insert(0, '-'.join([items[0] , items[1]]))
final
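As to why the original find returned None: text= only matches when the string is the tag's entire content, and the 'Traded as' header cell contains nested markup (the links), so its text is not a single bare string. A sketch of an alternative that matches on the th's rendered text instead, assuming the infobox class and row label are as on the IBM page:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/IBM')
soup = BeautifulSoup(r.content, 'lxml')
infobox = soup.find('table', class_='infobox')

# Match the header cell by its visible text rather than by text=,
# which fails when the th contains nested tags such as links.
th = infobox.find(lambda tag: tag.name == 'th'
                  and tag.get_text(strip=True) == 'Traded as')
print(th.find_next_sibling('td').get_text(' ', strip=True))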

Python - Combine two, single column lists into one dual column list and print

I'm just beginning to dabble with Python, and as many have done I am starting with a web-scraping example to try the language.
I have seen many examples of using zip and map to combine lists, but I am having issues getting that combined list to print.
Again, I am new so please be gentle.
The code gathers everything from 2 certain tag types (the date and title of a post) and returns them as 2 lists. For this I am using BeautifulSoup and requests.
The site I am practicing on for this test is the blog for a small game called Staxel.
I can get my code to print a full list for one tag using soup.find and print in a for loop, but when I attempt to add a second list to print, the script simply terminates with no error.
Any tips on how to correctly print the 2 lists?
I am looking for output like
Entry 2019-01-06 New Years
Entry 2018-11-30 Staxel Changelog for 1.3.52
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)

# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# find the title and date elements and get their values
title_box = soup.find_all('h1', attrs={'class': 'entry-title'})
date_box = soup.find_all('span', attrs={'class': 'entry-date published'})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]

date_list = zip(dates, titles)
for heading in date_list:
    print("Entry {}")
The problem is that your query for dates returns an empty list, so the zipped result will also be empty. To extract the date from that page, you want to look for tags of type time, not span, with class entry-date published, like this:
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
So with the following code:
import requests
from bs4 import BeautifulSoup

quote_page = "https://blog.playstaxel.com"
page = requests.get(quote_page)
soup = BeautifulSoup(page.content, "lxml")

title_box = soup.find_all("h1", attrs={"class": "entry-title"})
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]

for date, title in zip(dates, titles):
    print(f"{date}: {title}")
The result becomes:
2019-01-10: Magic update – feature preview
2019-01-06: New Years
2018-11-30: Staxel Changelog for 1.3.52
2018-11-13: Staxel Changelog for 1.3.49
2018-10-21: Staxel Changelog for 1.3.48
2018-10-12: Halloween Update & GOG
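A quick check of the diagnosis: zip stops at its shortest input, so zipping anything with an empty list silently produces nothing, which is why the original loop body never ran:

# With an empty dates list, zip yields no pairs at all.
titles = ["New Years", "Staxel Changelog for 1.3.52"]
print(list(zip([], titles)))  # -> []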

Python scraping from a website: selecting tr elements based on multiple class attrs

I am scraping from the following page: https://kenpom.com/index.php?y=2018
The page shows a list of every Division 1 college basketball team, one row per team. I want to assign every team row to a variable called "teams". The problem is that after every 40 teams there are two header rows that I do not want to include. Those rows are unique in that they have a class of "thead1" or "thead2". The rows that I want to grab have a class of None or "bold-bottom". So essentially I need to iterate through every tr element in that table and grab any that has a class of None or "bold-bottom". My attempt below does not work: it returns a count of 35 when it should be 353.
import requests
from bs4 import BeautifulSoup

url = 'https://kenpom.com/index.php?y=2018'
r = requests.get(url).text
soup = BeautifulSoup(r, 'lxml')
table = soup.find('table', {'id': 'ratings-table'}).tbody
teams = table.findAll('tr', attrs={'class': (None or 'bold-bottom')})
print(len(teams))
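The expression (None or 'bold-bottom') evaluates to just 'bold-bottom' before BeautifulSoup ever sees it, which is why only the 35 bold-bottom rows come back. A sketch of one fix: pass a function as the class filter so that rows with no class attribute are accepted as well:

import requests
from bs4 import BeautifulSoup

url = 'https://kenpom.com/index.php?y=2018'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.find('table', {'id': 'ratings-table'}).tbody

# The function is called with each tr's class value; None means the
# row carries no class attribute at all.
teams = table.find_all('tr', class_=lambda c: c is None or c == 'bold-bottom')
print(len(teams))  # expected 353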

BeautifulSoup web scraping all 'li' text to dataframe

I am trying to use BeautifulSoup to scrape a list of properties from a real estate web site and pass them into a data table. I am using Python 3.
The following code works to print the required data, but I need a way to output the data into a table. Between each li tag are 3 items: a property number (1 - 50), a tenant name, and a square footage. Ideally the output would be structured as a data frame with the column headers number, tenant, and square footage.
from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get("http://properties.kimcorealty.com/properties/0014/")
soup = BeautifulSoup(page.content, 'html.parser')

start = soup.find('div', {'id': 'units_box_1'})
for litag in start.find_all('li'):
    print(litag.text)

start = soup.find('div', {'id': 'units_box_2'})
for litag in start.find_all('li'):
    print(litag.text)

start = soup.find('div', {'id': 'units_box_3'})
for litag in start.find_all('li'):
    print(litag.text)
You can do it like this: get all three divs in one go, then find the enclosing a tags, each of which wraps a group of 3 li tags containing one set of data.
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/properties/0014/")
soup = BeautifulSoup(page.content, 'html.parser')

table = []
# Find all the divs we need in one go.
divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
for div in divs:
    # Find all the enclosing a tags.
    anchors = div.find_all('a')
    for anchor in anchors:
        # Now we have groups of 3 list item (li) tags.
        lis = anchor.find_all('li')
        # Clean up the text from the group of 3 li tags and append
        # it as one row of our table list.
        table.append([unicodedata.normalize("NFKD", lis[0].text).strip(),
                      lis[1].text,
                      lis[2].text.strip()])

# We have all the data, so load it into a DataFrame.
headers = ['Number', 'Tenant', 'Square Footage']
df = DataFrame(table, columns=headers)
print(df)
Outputs:
  Number                 Tenant Square Footage
0      1         Nordstrom Rack         34,032
1      2      Total Wine & More         29,981
2      3  Thomasville Furniture         10,628
...
47    49         Jo-Ann Fabrics         45,940
48    50              Available         32,572
