I'm trying to learn web scraping and I found an article about building an API by scraping information from a site and serving it with FastAPI. It's a simple project, but the way the blog post is written makes it really confusing.
I think they left steps out, but I can't tell which. Here's the link to the article: https://www.scien.cx/2022/04/26/creating-a-skyrim-api-with-python-and-webscraping/
Here is the code that I'm trying to run. I'm working in a directory I named skypi, in a file called sky-scrape.py:
from bs4 import BeautifulSoup
import requests
import json

def getLinkData(link):
    return requests.get(link).content

factions = getLinkData("https://elderscrolls.fandom.com/wiki/Factions_(Skyrim)")
data = []
soup = BeautifulSoup(factions, 'html.parser')
table = soup.find_all('table', attrs={'class': 'wikitable'})
for wikiTable in table:
    table_body = wikiTable.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        # Get rid of empty values
        data.append([ele for ele in cols if ele])
cleanData = list(filter(lambda x: x != [], data))
skyrim_data[html] = cleanData
This doesn't work; it throws an error saying skyrim_data is not defined. If I just write it as
skyrim_data = cleanData
then here's the biggest issue: I have another file, run.py, and I want to import the data from sky-scrape.py. Here is that file:
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from sky-scrape import skyrim_data

app = FastAPI()

@app.get("/", response_class=HTMLResponse)
def home():
    return("""
    <html>
    <head>
        <title>Skyrim API</title>
    </head>
    <body>
        <h1>SKYRIM API</h1>
        <h2>Available routes:</h2>
        <ul>
            <li>/factions</li>
        </ul>
    </body>
    </html>
    """)

@app.get("/factions")
def factions():
    return skyrim_data["Factions"]
The from sky-scrape import skyrim_data doesn't work so I'm not sure what to do at this point. How do I get this script to work correctly?
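For what it's worth, two separate issues can be demonstrated in isolation (the names and data below are stand-ins, not the article's): a hyphen makes sky-scrape an invalid module name, and skyrim_data[html] = cleanData fails because skyrim_data (and html) are never defined first.

```python
# A module name must be a valid Python identifier; a hyphen is parsed
# as a minus sign, so "from sky-scrape import ..." can never work.
print("sky_scrape".isidentifier())  # True  -> importable after renaming
print("sky-scrape".isidentifier())  # False -> not importable

# Indexing into a dict that was never created raises NameError;
# initialize it first, then assign under an explicit key.
cleanData = [["Companions", "Whiterun"]]  # stand-in for the scraped rows
skyrim_data = {}
skyrim_data["Factions"] = cleanData
print(skyrim_data["Factions"])  # [['Companions', 'Whiterun']]
```

Renaming the file to something like sky_scrape.py (underscore) would make `from sky_scrape import skyrim_data` legal, with the caveat that the scrape re-runs every time the module is imported.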
I'm a Java and C# developer learning Python (web scraping, specifically) at the moment. When I try to start my script by double-clicking it, it won't run: a terminal opens for a few milliseconds and then closes. What mistake did I make?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

product_all_pages = []
for i in range(1, 15):
    response = requests.get("https://www.bol.com/nl/s/?page={i}&searchtext=hand+sanitizer&view=list")
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    body = parser.body
    producten = body.find_all(class_="product-item--row js_item_root")
    product_all_pages.extend(producten)

len(product_all_pages)

price = float(product_all_pages[1].meta.get('content'))
productname = product_all_pages[1].find(class_="product-title--inline").a.getText()
print(price)
print(productname)

productlijst = []
for item in product_all_pages:
    if item.find(class_="product-prices").getText() == '\nNiet leverbaar\n':
        price = None
    else:
        price = float(item.meta['content'])
    product = item.find(class_="product-title--inline").a.getText()
    productlijst.append([product, price])

print(productlijst[:3])

df = pd.DataFrame(productlijst, columns=["Product", "price"])
print(df.shape)
df["price"].describe()
Try running your code from the command line; then you can see the traceback. Your code throws an AttributeError because the response contains none of the elements you're looking for. The problem is that the URL is never formatted, because you didn't make it an f-string. This should work:
response = requests.get(f"https://www.bol.com/nl/s/?page={i}&searchtext=hand+sanitizer&view=list")
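To see why the f prefix matters, compare the two side by side (the example.com URL here is just an illustration):

```python
i = 3
plain = "https://example.com/?page={i}"       # no f prefix: braces kept literally
formatted = f"https://example.com/?page={i}"  # f-string: i is substituted

print(plain)      # https://example.com/?page={i}
print(formatted)  # https://example.com/?page=3
```

The request itself succeeds even with the literal braces in the URL; the failure only surfaces later, when the returned page contains none of the expected elements.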
I wrote some code to scrape some data from a site, and I got the results I want (the results are just numbers).
The question is: how can I show the results as output on my website? I am using Python and HTML in VS Code. Here is the scraping code:
import requests
from bs4 import BeautifulSoup

getpage = requests.get('https://www.worldometers.info/coronavirus/country/austria/')
getpage_soup = BeautifulSoup(getpage.text, 'html.parser')

Get_Total_Deaths_Recoverd_Cases = getpage_soup.findAll('div', {'class': 'maincounter-number'})
for para in Get_Total_Deaths_Recoverd_Cases:
    print(para.text)

Get_Active_And_Closed = BeautifulSoup(getpage.text, 'html.parser')
All_Numbers2 = Get_Active_And_Closed.findAll('div', {'class': 'number-table-main'})
for para2 in All_Numbers2:
    print(para2.text)
and these are the results that I want to show on the website:
577,007
9,687
535,798
31,522
545,485
I don't know how to describe this solution but it is a solution nonetheless:
import os

lst = [
    577.007, 9.687, 535.798, 31.522, 545.485
]

p = ''
for num in lst:
    p += f'<p>{num}</p>\n'

html = f"""
<!doctype>
<html>
<head>
    <title>Test</title>
</head>
<body>
    {p}
</body>
</html>
"""

with open('web.html', 'w') as file:
    file.write(html)

os.startfile('web.html')
You don't necessarily have to use os.startfile, but I hope you get the idea: you now have a web.html file that you can display on your website. You can also take a normal HTML document, set it as the html variable, and put {p} wherever you need it. Or you could go further and put a placeholder such as {%%} in your HTML (which is roughly what Django does), then read the whole file, replace the placeholder with your value, and write it back.
In the end this is a solution, but depending on what you use, e.g. a framework such as Django or Flask, there are easier ways to do this.
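The read-replace-write idea from the last paragraph can be sketched like this (the {%%} marker and the inline template string are illustrative conventions, not a real Django feature):

```python
# Pretend this string was read from an existing HTML file on disk.
template = "<body>\n{%%}\n</body>"

numbers = [577.007, 9.687, 535.798]
paragraphs = "\n".join(f"<p>{n}</p>" for n in numbers)

# Replace the placeholder and write the result back out.
rendered = template.replace("{%%}", paragraphs)
print(rendered)
```

In a real setup you would read the template with open(...).read(), do the replace, and write the result to web.html as above.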
What's a good way for me to output (AMP-compliant) HTML using a template and JSON for the data? I have a nice little Python script:
import requests
import json
from bs4 import BeautifulSoup

url = requests.get('https://www.perfectimprints.com/custom-promos/20492/Beach-Balls.html')
source = BeautifulSoup(url.text, 'html.parser')

products = source.find_all('div', class_="product_wrapper")
infos = source.find_all('div', class_="categories_wrapper")

def get_category_information(category):
    category_name = category.find('h1', class_="category_head_name").text
    return {
        "category_name": category_name.strip()
    }

category_information = [get_category_information(info) for info in infos]
with open("category_info.json", "w") as write_file:
    json.dump(category_information, write_file)

def get_product_details(product):
    product_name = product.find('div', class_="product_name").a.text
    sku = product.find('div', class_="product_sku").text
    product_link = product.find('div', class_="product_image_wrapper").find("a")["href"]
    src = product.find('div', class_="product_image_wrapper").find('a').find("img")["src"]
    return {
        "title": product_name,
        "link": product_link.strip(),
        "sku": sku.strip(),
        "src": src.strip()
    }

all_products = [get_product_details(product) for product in products]
with open("products.json", "w") as write_file:
    json.dump({'items': all_products}, write_file)

print("Success")
Which generates the JSON files I need. However, I now need to take those JSON files and insert the data into my template (gist) everywhere it says {{ JSON DATA HERE }}.
I'm not even sure where to start. I'm most comfortable with JavaScript so I'd like to use that if possible. I figure something involving Node.js.
Here's how you can render HTML with a template engine by itself and use it to return pure HTML:
from jinja2 import Template

me = Template('<h1>Hello {{x}}</h1>')
html = me.render({'x': 'world'})  # return this from your handler
html is your rendered HTML string. Alternatively, you can render from a file:
from jinja2 import Template

with open('your_template.html') as file_:
    template = Template(file_.read())
html = template.render(your_dict)  # return this from your handler
Since you're going to generate HTML one way or another, a template engine will save you a lot of time. You can also use loops like {% for item in list %}, which will greatly simplify your task.
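For example, a {% for %} template can render the items list from products.json directly (the data below is a stand-in for the scraped output; the key names match the script above):

```python
from jinja2 import Template

template = Template(
    "<ul>"
    "{% for item in items %}<li>{{ item.title }} ({{ item.sku }})</li>{% endfor %}"
    "</ul>"
)

# Stand-in data with the same keys the scraper writes to products.json.
items = [
    {"title": "Beach Ball", "sku": "BB-1"},
    {"title": "Mini Ball", "sku": "BB-2"},
]
html = template.render(items=items)
print(html)  # <ul><li>Beach Ball (BB-1)</li><li>Mini Ball (BB-2)</li></ul>
```

In practice you would json.load the file and pass the loaded dict to render instead of the inline list.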
I'm new to Python and am trying to do some web scraping. I'm trying to get things like Deck Name, Username, Elixir Cost, and Card from a website about the game Clash Royale. I am taking the data and then sending it into a folder called "Data" in my project directory. The files are being created fine but I am getting empty brackets [] in each .json file. I don't know what I am doing wrong. Any help would be greatly appreciated. Thanks! Code is below:
from bs4 import BeautifulSoup
import requests
import uuid
import json
import os.path
from multiprocessing.dummy import Pool as Threadpool

def getdata(url):
    save_path = r'/Users/crazy4byu/PycharmProjects/Final/Data'
    clashlist = []
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html5lib')
    clash = soup.find_all('div', {'class': 'row result'})
    for clashr in clash:
        clashlist.append(
            {
                'Deck Name': clashr.find('a').text,
                'User': clashr.find('td', {'class': 'user center'}).text,
                'Elixir Cost': clashr.find('span', {'class': 'elixir_cost'}).text,
                'Card': clashr.find('span', {'class': None}).text
            }
        )
    decks = soup.find_all('div', {'class': ' row result'})
    for deck in decks:
        clashlist.append(
            {
                'Deck Name': clashr.find('a').text,
                'User': clashr.find('td', {'class': 'user center'}).text,
                'Elixir Cost': clashr.find('span', {'class': 'elixir_cost'}).text,
                'Card': clashr.find('span', {'class': None}).text
            }
        )
    with open(os.path.join(save_path, 'data_' + str(uuid.uuid1()) + '.json'), 'w') as outfile:
        json.dump(clashlist, outfile)

if '__main__' == __name__:
    urls = []
    urls.append(r'http://clashroyaledeckbuilder.com/clashroyale/deckViewer/highestRated')
    for i in range(20, 990, 10):
        urls.append(r'http://clashroyaledeckbuilder.com/clashroyale/deckViewer/highestRated' + str(i))
    pool = Threadpool(25)
    pool.map(getdata, urls)
    pool.close()
    pool.join()
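One thing worth checking in isolation is what those class strings actually match: find_all returns an empty list whenever the string doesn't correspond to anything in the page. The snippet below uses made-up HTML, not the real site, to show that a stray leading space in the class string is enough to match nothing:

```python
from bs4 import BeautifulSoup

html = '<div class="row result"><a>Hog Cycle</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# The exact class string matches the element...
print(len(soup.find_all('div', {'class': 'row result'})))   # 1
# ...but the variant with a leading space matches nothing.
print(len(soup.find_all('div', {'class': ' row result'})))  # 0
```

Printing len(clash) and len(decks) right after the find_all calls would show immediately whether either selector matches anything on the live page.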
I have found a Python HTML parser, EHP (https://github.com/iogf/ehp), that builds a DOM-like structure for HTML sources. It seems easy to use and very fast. I'm trying to write a scraper for codepad.org that retrieves the last ten posts from
http://codepad.org/recent
I have this code below, which is working:
import requests
from ehp import Html

def catch_refs(data):
    html = Html()
    dom = html.feed(data)
    return [ind.attr['href']
            for ind in dom.find('a')
            if 'view' in ind.text()]

def retrieve_source(refs, dir):
    """
    Get the source code of the posts then save in a dir.
    """
    pass

if __name__ == '__main__':
    req = requests.get('http://codepad.org/recent')
    refs = catch_refs(req.text)
    retrieve_source(refs, '/tmp/')
    print refs
It outputs:
[u'http://codepad.org/aQGNiQ6t',
u'http://codepad.org/HMrG1q7t',
u'http://codepad.org/zGBMaKoZ', ...]
as expected, but I can't figure out how to download the source code of the files.
Your retrieve_source(refs, dir) doesn't actually do anything, so you are not getting any result.
Update, according to your comment:
import os
import requests
from ehp import Html

def get_code_snippet(page):
    dom = Html().feed(page)
    # collect all <div class="highlight"> elements
    elements = [el for el in dom.find('div')
                if el.attr['class'] == 'highlight']
    return elements[1].text()

def retrieve_source(refs, dir):
    for i, ref in enumerate(refs):
        with open(os.path.join(dir, str(i) + '.html'), 'w') as r:
            r.write(get_code_snippet(requests.get(ref).content))
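Since the files are named by their enumerate index, the output paths come out as 0.html, 1.html, and so on; the naming logic can be checked without any network access:

```python
import os

refs = ['http://codepad.org/aQGNiQ6t', 'http://codepad.org/HMrG1q7t']
paths = [os.path.join('/tmp/', str(i) + '.html') for i, ref in enumerate(refs)]
print(paths)  # ['/tmp/0.html', '/tmp/1.html']
```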