Is there a way to change my code so that it still translates the website from Malay to English, but using Bing Translator instead of Google Translate?
import pandas
import urllib.request as ur
from bs4 import BeautifulSoup
from googletrans import Translator
translator = Translator()
url = "http://https://www.bharian.com.my/"
page = ur.urlopen(url)
df = pandas.DataFrame(columns=["Title", "Date", "Url", "Content"])
soup = BeautifulSoup(page, "html.parser")
headlines = soup.find_all("div", {"class": "ms-vb itx"})
intro = soup.find_all("div", {"class": "ms-rtestate-field"})
dates = soup.find_all("td", {"class": "ms-vb2"})
count = len(headlines)
for i in range(0, len(headlines)):
    s = str(headlines[i].a.string)
    url1 = headlines[i].a.get("href")
    page1 = ur.urlopen(url1)
    soup1 = BeautifulSoup(page1, "html.parser")
    cont = soup1.find_all("div", {"style": "text-align:justify;"})
    content = intro[2 * i].p.text
    for data in cont:
        content += data.text
    content = translator.translate(content, src="ms", dest="en").text
    s = translator.translate(s, src="ms").text
    df = df.append(
        {
            "Title": s,
            "Date": dates[i].string,
            "Url": url1,
            "Content": content,
        },
        ignore_index=True,
    )
df.to_csv("News.csv")
# f.write(str(len(result))+'\n')
# for res in result:
# f.write(str(res.pre.string))
# f.close()
# while(driver.current_url == url):
# continue
Yes, there is, but you might not like it very much. Right now it looks like you're using the googletrans library from PyPI (https://pypi.org/project/googletrans/). A similar-looking package for Bing Translate exists, called bing_translator, but it appears to be out of date. Microsoft themselves, however, have published code samples on GitHub:
import os, requests, uuid, json

key_var_name = 'TRANSLATOR_TEXT_SUBSCRIPTION_KEY'
if not key_var_name in os.environ:
    raise Exception('Please set/export the environment variable: {}'.format(key_var_name))
subscription_key = os.environ[key_var_name]

endpoint_var_name = 'TRANSLATOR_TEXT_ENDPOINT'
if not endpoint_var_name in os.environ:
    raise Exception('Please set/export the environment variable: {}'.format(endpoint_var_name))
endpoint = os.environ[endpoint_var_name]

# If you encounter any issues with the base_url or path, make sure
# that you are using the latest endpoint: https://learn.microsoft.com/azure/cognitive-services/translator/reference/v3-0-translate
path = '/translate?api-version=3.0'
params = '&to=de&to=it'
constructed_url = endpoint + path + params

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# You can pass more than one object in body.
body = [{
    'text' : 'Hello World!'
}]

request = requests.post(constructed_url, headers=headers, json=body)
response = request.json()
print(json.dumps(response, sort_keys=True, indent=4, separators=(',', ': ')))
As you've probably noticed, this is a fair bit clunkier than your nice googletrans package. You might want to make your own abstraction layer to make this easier (and maybe publish it on PyPI!).
TRANSLATOR_TEXT_SUBSCRIPTION_KEY and TRANSLATOR_TEXT_ENDPOINT should be set to your Translator service subscription key and endpoint. Whilst Google seem happy enough for you to freely use their API, Microsoft would like you to create an account. Whilst it looks like you can get hold of a free key, depending on what you're using it for Microsoft might expect payment. The links on the GitHub page should take you to the relevant articles for that.
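For your specific case (Malay to English), a thin wrapper around that sample can keep the call sites in your scraper almost unchanged. The sketch below is only an illustration, assuming the same TRANSLATOR_TEXT_SUBSCRIPTION_KEY / TRANSLATOR_TEXT_ENDPOINT environment variables as above; the bing_translate name is something I made up, not part of any library:
import os, uuid, requests

def bing_translate(text, src="ms", dest="en"):
    # Hypothetical helper around the Microsoft Translator v3 /translate endpoint.
    endpoint = os.environ["TRANSLATOR_TEXT_ENDPOINT"]
    url = endpoint + "/translate?api-version=3.0&from=" + src + "&to=" + dest
    headers = {
        "Ocp-Apim-Subscription-Key": os.environ["TRANSLATOR_TEXT_SUBSCRIPTION_KEY"],
        "Content-type": "application/json",
        "X-ClientTraceId": str(uuid.uuid4()),
        # Depending on your Azure resource you may also need an
        # 'Ocp-Apim-Subscription-Region' header here.
    }
    # The API accepts a list of objects, one per piece of text.
    response = requests.post(url, headers=headers, json=[{"text": text}])
    response.raise_for_status()
    # First result, first translation -- mirrors translator.translate(...).text
    return response.json()[0]["translations"][0]["text"]

# In your loop you could then replace the googletrans calls with e.g.:
# content = bing_translate(content, src="ms", dest="en")
# s = bing_translate(s, src="ms", dest="en")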
Related
G'day guys, I'm working on a Python project that pulls weather data from BOM (https://bom.gov.au).
The script works correctly, however I would like it to be able to use part of the URL within the POST request, i.e. the user navigates to https://example.com/taf/ymml, the script runs and uses YMML within the POST.
The script I am using is below. I would like to swap out 'YSSY' in myobj for something that pulls it from the URL that the user navigates to.
import requests
import re
url = 'http://www.bom.gov.au/aviation/php/process.php'
myobj = {'keyword': 'YSSY', 'type': 'search', 'page': 'TAF'}
headers = {'User-Agent': 'Chrome/102.0.0.0'}
x = requests.post(url, data = myobj, headers=headers)
content = x.text
stripped = re.sub('<[^<]+?>', ' ', content)
split_string = stripped.split("METAR", 1)
substring = split_string[0]
print(substring)
Any ideas?
OK, so I've managed to get this working using FastAPI. When a user navigates to example.com/taf/ymml, the site will return the TAF for YMML in plain text; ymml can be substituted for any Australian aerodrome. One thing I haven't figured out is how to remove the square brackets around the TAF, but that is a problem for another time.
from fastapi import FastAPI
import requests
from bs4 import BeautifulSoup

app = FastAPI()

@app.get("/taf/{icao}")
async def read_icao(icao):
    url = 'http://www.bom.gov.au/aviation/php/process.php'
    myobj = {'keyword': icao, 'type': 'search', 'page': 'TAF'}
    headers = {'User-Agent': 'Chrome/102.0.0.0'}
    x = requests.post(url, data=myobj, headers=headers)
    content = x.text
    split_string = content.split("METAR", 1)
    substring = split_string[0]
    soup = BeautifulSoup(substring, 'html.parser')
    for br in soup('br'):
        br.replace_with(' ')
    # Create TAFs array.
    tafs = []
    for taf in soup.find_all('p', class_="product"):
        full_taf = taf.get_text()
        tafs.append(full_taf.rstrip())
    return {tuple(tafs)}
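To run and test this locally you would typically serve the app with uvicorn. The snippet below is just a sketch assuming the code above is saved in a file called main.py (the filename is my assumption, not from the original post):
# Run the API locally; alternatively, from a shell: uvicorn main:app --reload
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)

# A request to http://127.0.0.1:8000/taf/ymml should then return the TAF for YMML.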
I'm using requests and regex to scrape data from an entire website and then save it to a JSON file, hosted on GitHub so I and anyone else can access the data from other devices.
The first thing I tried was just to open every single page on the website and get all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one is the one that gets called and scrapes the given URL. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
"Console":"/neo-geo-aes",
"Call ID":"62815",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
"Console":"/neo-geo-cd",
"Call ID":"62817",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
"Console":"/neo-geo-pocket-color",
"Call ID":"62578",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
"Console":"/playstation",
"Call ID":"62580",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution; here's the code in question:
import re
import requests
import json
##The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text
##Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)
##For each Item(number of consoles) find games
for i in range(len(dataUrl)):
    ##make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ##Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ##For every item in list(items per console)
    out_list = []
    for i in range(len(urlOne)):
        ##Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ##Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
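For reference, one common way to end up with a single valid JSON document is to collect every record into one list across all consoles and write it out once at the end, rather than appending repeated json.dump calls to the same file. A minimal sketch of that idea (variable names are mine, not from the original script):
import json

all_items = []  # accumulate every record from every console here

# ... inside your loops, instead of dumping per console:
# all_items.append(json_file_content)

# after all loops have finished, write the whole list once:
with open('docs/result.json', 'w') as data_json_file:
    json.dump(all_items, data_json_file, indent=4)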
When I search for books with a single-word name (e.g. bluets) my code works fine, but when I search for books that have two words or spaces in the name (e.g. white whale) I get an error (jinja2 syntax). How do I solve this error?
@app.route("/book", methods=["GET", "POST"])
def get_books():
    api_key = os.environ.get("API_KEY")
    if request.method == "POST":
        book = request.form.get("book")
        url = f"https://www.googleapis.com/books/v1/volumes?q={book}:keyes&key={api_key}"
        response = urllib.request.urlopen(url)
        data = response.read()
        jsondata = json.loads(data)
        return render_template("book.html", books=jsondata["items"])
I tried to search for similar cases and only found one solution, but I didn't understand it.
Here is my error message:
http.client.InvalidURL
http.client.InvalidURL: URL can't contain control characters. '/books/v1/volumes?q=white whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8' (found at least ' ')
Some chars in a URL need to be encoded - in your situation you have to use + or %20 instead of a space.
This URL has %20 instead of the space and it works for me. It also works if I use + instead.
import urllib.request
import json
url = 'https://www.googleapis.com/books/v1/volumes?q=white%20whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8'
#url = 'https://www.googleapis.com/books/v1/volumes?q=white+whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8'
response = urllib.request.urlopen(url)
text = response.read()
data = json.loads(text)
print(data)
With requests you don't even have to do it manually because it does it automatically
import requests
url = 'https://www.googleapis.com/books/v1/volumes?q=white whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8'
r = requests.get(url)
data = r.json()
print(data)
You may use urllib.parse.urlencode() to make sure all chars are correctly encoded.
import urllib.request
import urllib.parse
import json
payload = {
'q': 'white whale:keyes',
'key': 'AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8',
}
query = urllib.parse.urlencode(payload)
url = 'https://www.googleapis.com/books/v1/volumes?' + query
response = urllib.request.urlopen(url)
text = response.read()
data = json.loads(text)
print(data)
And the same with requests - again, it doesn't need manual encoding:
import requests
payload = {
'q': 'white whale:keyes',
'key': 'AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8',
}
url = 'https://www.googleapis.com/books/v1/volumes'
r = requests.get(url, params=payload)
data = r.json()
print(data)
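If you only want to encode the search term itself rather than building the whole query string, urllib.parse.quote() / quote_plus() can also be used. A small sketch with the same query as above (YOUR_API_KEY is a placeholder):
import urllib.parse

book = 'white whale'
# quote_plus() turns the space into '+'; quote() would turn it into '%20'
encoded = urllib.parse.quote_plus(book)
url = f"https://www.googleapis.com/books/v1/volumes?q={encoded}:keyes&key=YOUR_API_KEY"
print(url)
# https://www.googleapis.com/books/v1/volumes?q=white+whale:keyes&key=YOUR_API_KEY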
I have written a script that should purchase an asset from the catalog.
import re
from requests import post, get
cookie = "blablabla"
ID = 1562150
# getting x-csrf-token
token = post("https://auth.roblox.com/v2/logout", cookies={".ROBLOSECURITY": cookie}).headers['X-CSRF-TOKEN']
print(token)
# getting item details
detail_res = get(f"https://www.roblox.com/library/{ID}")
text = detail_res.text
productId = int(get(f"https://api.roblox.com/marketplace/productinfo?assetId={ID}").json()["ProductId"])
expectedPrice = int(re.search("data-expected-price=\"(\d+)\"", text).group(1))
expectedSellerId = int(re.search("data-expected-seller-id=\"(\d+)\"", text).group(1))
headers = {
    "x-csrf-token": token,
    "content-type": "application/json; charset=UTF-8"
}
data = {
    "expectedCurrency": 1,
    "expectedPrice": expectedPrice,
    "expectedSellerId": expectedSellerId
}
buyres = post(f"https://economy.roblox.com/v1/purchases/products/{productId}", headers=headers,
              data=data,
              cookies={".ROBLOSECURITY": cookie})
if buyres.status_code == 200:
    print("Successfully bought item")
The problem is that it somehow doesn't purchase any item and fails with error 500 (InternalServerError).
Someone told me that if I add json.dumps() to the script it might work.
How do I add json.dumps() here (I don't understand it even though I read the docs), and how do I fix this so the script purchases the item?
Big thanks to anyone who can help me.
Import the json package.
json.dumps() converts a Python dictionary to a JSON string.
I'm guessing this is what you want:
buyres = post(f"https://economy.roblox.com/v1/purchases/products/{productId}",
              headers=json.dumps(headers),
              data=json.dumps(data),
              cookies={".ROBLOSECURITY": cookie})
I finally found the answer; I had to do it like this:
dataLoad = json.dumps(data)
buyres = post(f"https://economy.roblox.com/v1/purchases/products/{productId}",
              headers=headers,
              data=dataLoad,
              cookies={".ROBLOSECURITY": cookie})
Recently I've been trying to learn how to web scrape in order to download all the images from my school directory. However, the site does not store the images in img tags; instead they are ALL set as a CSS background-image, like this: background-image: url("/common/pages/GalleryPhoto.aspx?photoId=323070&width=180&height=180");
Any way to work around this?
Here is my current code, which will download images from a targeted website:
import os, requests, bs4, webbrowser, random

url = 'https://jhs.lsc.k12.in.us/staff_directory'
res = requests.get(url)
try:
    res.raise_for_status()
except Exception as exc:
    print('Sorry an error occurred:', exc)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
element = soup.select('background-image')
for i in range(len(element)):
    url = element[i].get('img')
    name = random.randrange(1, 25)
    file = open(str(name) + '.jpg', 'wb')
    res = requests.get(url)
    for chunk in res.iter_content(10000):
        file.write(chunk)
    file.close()
print('done')
You can use the internal API this site is using to get the data, including the image URL. It first gets the list of staff groups using the /Settings endpoint, then calls the /Search API with all the group IDs.
The flow is the following:
get the portletInstanceId value from a div tag with the attribute data-portlet-instance-id
call the settings API and get the group IDs:
POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings
call the search API with pagination parameters; you can choose how many people you want to request and the number per page:
POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search
The following script gets the first 20 people and puts the results in a pandas DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://jhs.lsc.k12.in.us/staff_directory")
soup = BeautifulSoup(r.content, "lxml")
portletInstanceId = soup.select('div[data-portlet-instance-id].staffDirectoryComponent')[0]["data-portlet-instance-id"]
r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings",
                  json = { "portletInstanceId": portletInstanceId })
groupIds = [t["groupID"] for t in r.json()["d"]["groups"]]
print(groupIds)

payload = {
    "firstRecord": 0,
    "groupIds": groupIds,
    "lastRecord": 20,
    "portletInstanceId": portletInstanceId,
    "searchByJobTitle": True,
    "searchTerm": "",
    "sortOrder": "LastName,FirstName ASC"
}

r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search",
                  json = payload)
results = r.json()["d"]["results"]

# add image url based on userID
for t in results:
    t["imageURL"] = f'https://jhs.lsc.k12.in.us/{t["imageURL"]}' if t["imageURL"] else ''

df = pd.DataFrame(results)

# whole data
print(df)

# only image url
with pd.option_context('display.max_colwidth', 400):
    print(df["imageURL"])
Try this on repl.it.
You need to update the firstRecord and lastRecord fields accordingly.
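To actually download the photos (the original goal), you could then loop over the imageURL column and save each response body to a file. A minimal sketch, assuming the imageURL values built above point directly at the image resource:
import os
import requests

# Hypothetical follow-up: save each staff photo to a local folder.
os.makedirs("photos", exist_ok=True)
for idx, image_url in enumerate(df["imageURL"]):
    if not image_url:
        continue  # some staff members may have no photo
    resp = requests.get(image_url)
    if resp.ok:
        # The .jpg extension is an assumption; adjust if the endpoint serves another format.
        with open(os.path.join("photos", f"{idx}.jpg"), "wb") as f:
            f.write(resp.content)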