I've recently been scraping soccer results for a friend, and I'm stuck. Below is the link:
https://www.mlssoccer.com/mlsnext/schedule/2021-2022/u16_mls-next-schedule
I'm trying to switch from Selenium to requests-html. The only reason I used Selenium in the first place is that it was the only way I knew to click the calendar.
The default date is the current date, but I need all the match history. Is it possible to change the default date using requests-html, and if so, how?
Thanks in advance for your time and effort; any useful suggestion is appreciated.
---------------------------- EDIT------------------------------
After searching for a while, I found the following possible way to do this using requests-html:
from requests_html import HTMLSession

url = "https://www.modular11.com/public_schedule/league/get_matches"
session = HTMLSession()
# try to send the desired start date directly in the POST body
response = session.post(url, data={"start_date": "2021-10-30 00:00:00"})
print(response.url)
# print(response.text)
response.html.render(timeout=1200)
print(response.html.text)
Please select the gender, league & age of the matches you are looking for.
[... followed by the page's own jQuery script that toggles the '.main_row' match rows, omitted here ...]
but it seems the data part has not been sent to the server correctly, so I only get the default placeholder above instead of the actual match list (the page normally shown at https://www.modular11.com/schedule?year=14).
The calendar makes an XHR request (which you can monitor in the Network tab of your browser's dev tools) with an easy-to-customize query dict. The request returns HTML that you'll have to parse with Beautiful Soup.
import requests

query_dict = {
    'open_page': ['0'], 'academy': ['0'], 'league': ['12'], 'gender': ['1'],
    'age': ['["14"]'], 'brackets': ['null'], 'groups': ['null'], 'group': ['null'],
    'match_number': ['0'], 'status': ['scheduled'], 'match_type': ['2'], 'schedule': ['0'],
    'start_date': ['2021-11-03 00:00:00'], 'end_date': ['2021-11-10 23:59:00'],
}

r = requests.post('https://www.modular11.com/public_schedule/league/get_matches', data=query_dict)
print(r.text)
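The response is plain HTML, so the parsing step could look roughly like this (the '.main_row' class is taken from the page's own toggle script and is an assumption to verify against the HTML you actually get back):

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')  # r is the response from the POST above
# each '.main_row' element should correspond to one match; adjust the selector as needed
for row in soup.select('.main_row'):
    print(row.get_text(' ', strip=True))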
While trying to bulk index on AWS OpenSearch Service (ElasticSearch V 10.1) using opensearch-py, I am getting the error below:
RequestError: RequestError(400, 'illegal_argument_exception', 'explicit index in bulk is not allowed')
from opensearchpy.helpers import bulk
bulk(client, format_embeddings_for_es_indexing(embd_data, titles_, _INDEX_))
The format_embeddings_for_es_indexing() function yields documents like:
{
    '_index': 'test_v1',
    '_id': '208387',
    '_source': {
        'article_id': '208387',
        'title': 'Battery and Performance',
        'title_vector': [ 1.77665558e-02, 1.95874255e-02,.....],
        ......
    }
}
I am able to index documents one by one using client.index():
failed = {}
for document in format_embeddings_for_es_indexing(embd_data, titles_, _INDEX_):
    res = client.index(
        **document,
        refresh=True
    )
    if res['_shards']['failed'] > 0:
        failed[document["body"]["article_id"]] = res['_shards']
# document body for open search index
{
    'index': 'test_v1',
    'id': '208387',
    'body': {
        'article_id': '208387',
        'title': 'Battery and Performance',
        'title_vector': [ 1.77665558e-02, 1.95874255e-02,.....],
        ......
    }
}
Please help.
This may have something to do with what is documented here: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-advanced
Please make sure that the value of rest.action.multi.allow_explicit_index in the advanced cluster settings is true.
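If changing that setting is not an option, a possible workaround is to stop putting the index in every action and name it once in the bulk call instead. This is a rough, untested sketch based on the generator from the question; it assumes the bulk helper forwards the index keyword to the underlying bulk request, so the index ends up in the URL rather than in each action line:

from opensearchpy.helpers import bulk

def actions_without_explicit_index():
    # same documents as before, just without the per-action '_index' key
    for doc in format_embeddings_for_es_indexing(embd_data, titles_, _INDEX_):
        doc.pop('_index', None)
        yield doc

# the target index is given once here instead of inside every action
bulk(client, actions_without_explicit_index(), index=_INDEX_)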
I'm trying to scrape some data from a site called laced.co.uk, and I'm a tad confused about what's going wrong. I'm new to this, so please explain it simply if possible. Here is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.laced.co.uk/products/nike-dunk-low-retro-black-white?size=7"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text=" £195 ")
print(prices)
Thank you! (The price at the time of posting was £195; it showed as the size 7 "buy now" price on the page.)
The price is loaded within a <script> tag on the page:
<script>
typeof(dataLayer) != "undefined" && dataLayer.push({
    'event': 'eec.productDetailImpression',
    'page': {
        'ecomm_prodid': 'DD1391-100'
    },
    'ecommerce': {
        'detail': {
            'actionField': {'list': 'Product Page'},
            'products': [{
                'name': 'Nike Dunk Low Retro Black White',
                'id': 'DD1391-100',
                'price': '195.0',
                'brand': 'Nike',
                'category': 'Dunk, Dunk Low, Mens Nike Dunks',
                'variant': 'White',
                'list': 'Product Page',
                'dimension1': '195.0',
                'dimension2': '7',
                'dimension3': '190',
                'dimension4': '332'
            }]
        }
    }
});
</script>
You can use a regular expression pattern to search for the price. Note, there's no need for BeautifulSoup:
import re
import requests
url = "https://www.laced.co.uk/products/nike-dunk-low-retro-black-white?size=7"
result = requests.get(url)
price = re.search(r"'price': '(.*?)',", result.text).group(1)
print(f"£ {price}")
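If you want more than the price, the same pattern can be reused for the other keys in that dataLayer block. A small sketch building on the response above (the key names are taken from the script shown earlier):

import re

product = {}
for key in ("name", "id", "price", "brand"):
    match = re.search(rf"'{key}': '(.*?)'", result.text)  # result is the response from above
    if match:
        product[key] = match.group(1)
print(product)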
I have a collection of about 1.4 million tweets in a MongoDB collection. I want to find all that are NOT retweets, and am using Python. The structure of a document is as follows:
{
'_id': ObjectId('59388c046b0c1901172555b9'),
'coordinates': None,
'created_at': datetime.datetime(2016, 8, 18, 17, 17, 12),
'geo': None,
'is_quote': False,
'lang': 'en',
'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s',
'tw_id': 766323071976247296,
'user_id': 2231233110,
'user_lang': 'en',
'user_loc': 'main; #Kan1shk3',
'user_name': 'sheezy0',
'user_timezone': 'Chennai'
}
I can write a query that works to find the particular tweet from above:
twitter_mongo_collection.find_one({
'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s'
})
But when I try to find retweets, my code doesn't work. For example, I try to find any tweets that start like this:
'text': b'RT some tweet'
Using this query:
find_one( {'text': {'$regex': "/^RT/" } } )
It doesn't return an error, but it doesn't find anything. I suspect it has something to do with that 'b' at the beginning before the text starts. I know I also need to put '$not:' in there somewhere but am not sure where.
Thanks!
It looks like your regex search is trying to match the string
b'RT'
but you want to match strings like
b'RT some text afterwards'
try using this regex instead (with PyMongo you pass the bare pattern, without the /.../ delimiters):
find_one( {'text': {'$regex': '^RT.*'}} )
I had to decode the 'text' field that was encoded as binary. Then I was able to use
import re
twitter_mongo_collection.find( {'text': {'$not': re.compile("^RT.*")}} )
to find all the documents that did not start with "RT".
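For completeness, a rough sketch of what "decoding the 'text' field" can look like as a one-off update, assuming the stored bytes are UTF-8 (as in the sample document above):

# decode the binary 'text' field in place for every document that still stores bytes
for doc in twitter_mongo_collection.find({'text': {'$type': 'binData'}}):
    twitter_mongo_collection.update_one(
        {'_id': doc['_id']},
        {'$set': {'text': doc['text'].decode('utf-8')}}
    )

After that, the $not / re.compile query above matches every document whose text does not start with "RT".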
I've been trying to make a scraper to get my grades from my school's website. Unfortunately, I cannot log in. When I run the program, the returned page validates the user/password fields, and since they are blank, it won't let me proceed.
Also, I am not really sure if I am even coding this correctly.
from twill.commands import *
import requests

payload = {
    'ctl00$cphMainContent$lgn$UserName': 'user',
    'ctl00$cphMainContent$lgn$Password': 'pass',
}
cookie = {
    'En_oneTime_ga_tracking_v2': 'true',
    'ASP.NET_SessionId': ''
}
with requests.Session() as s:
    p = s.post('schoolUrl', data=payload, cookies=cookie)
    print p.text
Updated payload:
payload = {
    'ctl00$cphMainContent$lgnEaglesNest$UserName': 'user',
    'ctl00$cphMainContent$lgnEaglesNest$Password': 'pass',
    '__LASTFOCUS': '',
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': 'LONG NUMBER',
    '__VIEWSTATEGENERATOR': 'C2EE9ABB',
    '__EVENTVALIDATION': 'LONG NUMBER',
    'ctl00$cphMainContent$lgnEaglesNest$RememberMe': 'on',
    'ctl00$cphMainContent$lgnEaglesNest$LoginButton': 'Log+In'
}
How do I know if my POST was successful?
The returned page still says that the username/password cannot be blank.
Complete source:
from twill.commands import *
import requests

payload = {
    'ctl00$cphMainContent$lgnEaglesNest$UserName': 'user',
    'ctl00$cphMainContent$lgnEaglesNest$Password': 'pass',
    '__LASTFOCUS': '',
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': 'LONG NUMBER',
    '__VIEWSTATEGENERATOR': 'C2EE9ABB',
    '__EVENTVALIDATION': 'LONG NUMBER',
    'ctl00$cphMainContent$lgnEaglesNest$RememberMe': 'on',
    'ctl00$cphMainContent$lgnEaglesNest$LoginButton': 'Log In'
}
cookie = {
    'En_oneTime_ga_tracking_v2': 'true',
    'ASP.NET_SessionId': ''
}
with requests.Session() as s:
    loginUrl = 'http://eaglesnest.pcci.edu/Login.aspx?ReturnUrl=%2f'
    gradeUrl = 'http://eaglesnest.pcci.edu/StudentServices/ClassGrades/Default.aspx'
    p = s.post(loginUrl, data=payload)
    print p.text
Your payload uses the wrong keys, try
ctl00$cphMainContent$lgnEaglesNest$UserName
ctl00$cphMainContent$lgnEaglesNest$Password
You can check the names by watching the network traffic in your browser (e.g. in Firefox: inspect element --> network --> post --> params)
In addition you need to specify which command you want to perform, i.e. which button was pressed.
payload['ctl00$cphMainContent$lgnEaglesNest$LoginButton'] = 'Log In'
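Also, __VIEWSTATE and __EVENTVALIDATION change on every page load, so hard-coding them usually won't work. A rough, untested sketch of the usual approach (field names taken from your payload) is to fetch the login page first and copy its hidden inputs into the payload:

import requests
from bs4 import BeautifulSoup

loginUrl = 'http://eaglesnest.pcci.edu/Login.aspx?ReturnUrl=%2f'

with requests.Session() as s:
    # read the current hidden ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION, ...) from the form
    soup = BeautifulSoup(s.get(loginUrl).text, 'html.parser')
    payload = {tag.get('name'): tag.get('value', '') for tag in soup.find_all('input', type='hidden')}
    payload.update({
        'ctl00$cphMainContent$lgnEaglesNest$UserName': 'user',
        'ctl00$cphMainContent$lgnEaglesNest$Password': 'pass',
        'ctl00$cphMainContent$lgnEaglesNest$LoginButton': 'Log In',
    })
    p = s.post(loginUrl, data=payload)
    # crude success check: the "cannot be blank" validation message should be gone
    print('cannot be blank' not in p.text)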
I have a problem coding a bot in Python that works with the new inline mode.
The bot gets the query, but when it tries to answer, it receives a 400 error.
Here is a sample of the data sent by the bot:
{
    'inline_query_id': '287878416582808857',
    'results': [
        {
            'type': 'article',
            'title': 'Convertion',
            'parse_mode': 'Markdown',
            'id': '287878416582808857/0',
            'message_text': 'blah blah'
        }
    ]
}
I use the requests library to make requests, and here is the line that does it in the code:
requests.post(url = "https://api.telegram.org/bot%s%s" % (telegram_bot_token, "/answerInlineQuery"), data = myData)
With myData holding the data described in the sample.
Can you help me solve this, please?
I suspect it is because you haven't JSON-serialized the results parameter.
import json
results = [{'type': 'article',
'title': 'Convertion',
'parse_mode': 'Markdown',
'id': '287878416582808857/0',
'message_text': 'blah blah'}]
my_data = {
'inline_query_id': '287878416582808857',
'results': json.dumps(results),
}
requests.post(url="https://api.telegram.org/bot%s%s" % (telegram_bot_token, "/answerInlineQuery"),
params=my_data)
Note that I use params to supply the data.
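If it still fails, it also helps to print the response body; the Bot API returns a JSON object with an ok flag and, on errors such as 400, a description explaining what was wrong:

response = requests.post("https://api.telegram.org/bot%s/answerInlineQuery" % telegram_bot_token,
                         params=my_data)
print(response.status_code)
print(response.json())  # e.g. {'ok': False, 'error_code': 400, 'description': '...'}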
I am getting the correct response after doing a small proof of concept. I am using the com.github.pengrad Java library.
Below is the code:
GetUpdatesResponse updatesResponse = bot.execute(new GetUpdates());
List<Update> updates = updatesResponse.updates();
for (Update update : updates) {
    InlineQuery inlineQuery = update.inlineQuery();
    System.out.println(update);
    System.out.println(inlineQuery);
    System.out.println("----------------");
    if (inlineQuery != null) {
        InlineQueryResult r1 = new InlineQueryResultPhoto(
                "AgADBQADrqcxG5q8tQ0EKSz5JaZjzDWgvzIABL0Neit4ar9MsXYBAAEC",
                "https://api.telegram.org/file/bot230014106:AAGtWr8xUCqUy8HjSgSFrY3aCs4IZs00Omg/photo/file_1.jpg",
                "https://api.telegram.org/file/bot230014106:AAGtWr8xUCqUy8HjSgSFrY3aCs4IZs00Omg/photo/file_1.jpg");
        BaseResponse baseResponse = bot.execute(new AnswerInlineQuery(inlineQuery.id(), r1)
                .cacheTime(6000)
                .isPersonal(true)
                .nextOffset("offset")
                .switchPmParameter("pmParam")
                .switchPmText("pmText"));
        System.out.println(baseResponse.isOk());
        System.out.println(baseResponse.toString());
        System.out.println(baseResponse.description());
    }
}
Below is the console output:
Update{update_id=465103212, message=null, edited_message=null, inline_query=InlineQuery{id='995145139265927135', from=User{id=231700283, first_name='Manabendra', last_name='Maji', username='null'}, location=null, query='hi', offset=''}, chosen_inline_result=null, callback_query=null}
InlineQuery{id='995145139265927135', from=User{id=231700283, first_name='Manabendra', last_name='Maji', username='null'}, location=null, query='hi', offset=''}
true
BaseResponse{ok=true, error_code=0, description='null'}
null
And I am getting the proper response in my mobile Telegram app as well.