I am trying to scrape runner names and number of tips from this page: https://www.horseracing.net/racecards/newmarket/13-05-21
It is only returning the last runner name in the final race. I've been over and over it but can't see what I have done wrong.
Can anyone see the issue?
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.horseracing.net/racecards/newmarket/13-05-21"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
date = []
course = []
time = []
runner = []
tips = []
runner_div = soup.find_all('div', class_='row-cell-right')
for container in runner_div:
    runner_name = container.h5.a.text
    runner.append(runner_name)
    tips_no = container.find('span', class_='tip-text number-tip').text if container.find('span', class_='tip-text number-tip') else ''
    tips.append(tips_no)
print(runner_name, tips_no)
Your loop is collecting every runner correctly; the final print statement sits outside the loop, so it only ever shows the values from the last iteration. Try print(runner, tips) instead of print(runner_name, tips_no):
Output:
print(runner, tips)
# ['Babindi', 'Turandot', 'Sharla', "Serena's Queen", 'Bellazada', 'Baby Alya', 'Adelita', 'Florence Street', 'Allerby', 'Puy Mary', 'Roman Mist', 'Lunar Shadow', 'Breakfastatiffanys', 'General Panic', 'Gidwa', 'Point Lynas', 'Three Dons', 'Wrought Iron', 'Desert Dreamer', 'Adatorio', 'Showmedemoney', 'The Charmer',
# 'Bascinet', 'Dashing Rat', 'Appellation', 'Cambridgeshire', 'Danni California', 'Drifting Sands', 'Lunar Gold', 'Malathaat', 'Miss Calacatta', 'Sunrise Valley', 'Sweet Expectation', 'White Lady', 'Riknnah', 'Aaddeey', 'High Commissioner', 'Kaloor', 'Rodrigo Diaz', 'Mukha Magic', 'Gauntlet', 'Hawridge Flyer', 'Clovis Point', 'Franco Grasso', 'Kemari', 'Magical Land', 'Mobarhin', 'Movin Time', 'Night Of Dreams', 'Punta Arenas', 'Constanta', 'Cosmic George', 'Taravara', 'Basilicata', 'Top Brass', 'Without Revenge', 'Grand Scheme', 'Easy Equation', 'Mr Excellency', 'Colonel Faulkner', 'Urban War', 'Freak Out', 'Alabama Boy', 'Anghaam', 'Arqoob', 'Fiordland', 'Dickens', "Shuv H'Penny King"]
# ['5', '3', '1', '3', '1', '', '1', '', '', '', '1', '', '', '', '', '1', '', '', '12', '1', '', '', '', '', '', '', '5', '', '1', '', '', '7', '', '', '1', '11', '1', '', '', '', '', '2', '', '', '1', '3', '2', '9', '', '', '', '', '5', '1', '4', '', '5', '', '1', '4', '2', '1', '3', '2', '1', '', '', '']
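Since pandas is already imported, a minimal follow-up sketch (assuming the runner and tips lists stay the same length) could pair them up in a DataFrame:
runners_df = pd.DataFrame({'runner': runner, 'tips': tips})  # one row per runner, with its tip count
print(runners_df.head())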
I'm not a coder by trade; rather, I'm an infrastructure engineer learning to code for my role. I'm getting some output and struggling to work out how to process it.
I've asked some of my colleagues, but the data comes out in a weird format and I'm unsure how to get the outcome I want. I have tried splitting the lines, but it doesn't work perfectly.
The current code is simple. It just pulls the command output from the switch, and I then split it into lines:
output = net_connect.send_command("show switch")
switchlines = output.splitlines()
print(output)
print(switchlines[5])
It will then output the following in this case:
Switch/Stack Mac Address : 188b.45ea.a000 - Local Mac Address
Mac persistency wait time: Indefinite
H/W Current
Switch# Role Mac Address Priority Version State
------------------------------------------------------------
*1 Active 188b.45ea.a000 15 V01 Ready
2 Standby 00ca.e5fc.1780 14 V06 Ready
3 Member 00ca.e5fc.5e80 13 V06 Ready
4 Member 00ca.e588.f480 12 V06 Ready
5 Member 00ca.e588.ee80 11 V06 Ready
*1 Active 188b.45ea.a000 15 V01 Ready
That table comes out as a single string, and essentially I need to find a way to split it into usable chunks (i.e. a 2D array) so I can use each field individually.
You already have the lines separated into a list (switchlines), so all that's left is to iterate over that list and split each one on whitespace. str.split() with no arguments already handles the runs of spaces between fields, and stripping each element keeps the values clean. So you could do something like:
res = []
for line in switchlines[5:]:
    elements = [x.strip() for x in line.split()]
    res.append(elements)
On your example text, this gives:
[['*1', 'Active', '188b.45ea.a000', '15', 'V01', 'Ready'],
['2', 'Standby', '00ca.e5fc.1780', '14', 'V06', 'Ready'],
['3', 'Member', '00ca.e5fc.5e80', '13', 'V06', 'Ready'],
['4', 'Member', '00ca.e588.f480', '12', 'V06', 'Ready'],
['5', 'Member', '00ca.e588.ee80', '11', 'V06', 'Ready']]
Another option, which can make it easier to work with the data later, is to collect each row into a dictionary instead of a list:
res = []
for line in switchlines[5:]:
    switch, role, mac, prio, ver, state, *extras = [x.strip() for x in line.split()]
    res.append({'switch': switch, 'role': role, 'mac': mac,
                'prio': prio, 'ver': ver, 'state': state, 'extras': extras})
On your example text, this gives:
[{'switch': '*1', 'role': 'Active', 'mac': '188b.45ea.a000', 'prio': '15', 'ver': 'V01', 'state': 'Ready', 'extras': []},
{'switch': '2', 'role': 'Standby', 'mac': '00ca.e5fc.1780', 'prio': '14', 'ver': 'V06', 'state': 'Ready', 'extras': []},
{'switch': '3', 'role': 'Member', 'mac': '00ca.e5fc.5e80', 'prio': '13', 'ver': 'V06', 'state': 'Ready', 'extras': []},
{'switch': '4', 'role': 'Member', 'mac': '00ca.e588.f480', 'prio': '12', 'ver': 'V06', 'state': 'Ready', 'extras': []},
{'switch': '5', 'role': 'Member', 'mac': '00ca.e588.ee80', 'prio': '11', 'ver': 'V06', 'state': 'Ready', 'extras': []}]
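Once each row is a dictionary, picking out a particular switch becomes straightforward; a small sketch, assuming res holds the dictionaries above:
active = next(row for row in res if row['role'] == 'Active')  # first row whose role is Active
print(active['mac'], active['prio'], active['state'])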
I'm unable to parse JSON. This is the snippet returned from my requests.post response:
{'result': {'parent': '', 'reason': '', 'made_sla': 'true', 'backout_plan': '', 'watch_list': '', 'upon_reject': 'cancel', 'sys_updated_on': '2018-08-22 11:16:09', 'type': 'Comprehensive', 'conflict_status': 'Not Run', 'approval_history': '', 'number': 'CHG0030006', 'test_plan': '', 'cab_delegate': '', 'sys_updated_by': 'admin', 'opened_by': {'link': 'https://dev65345.service-now.com/api/now/table/sys_user/6816f79cc0a8016401c5a33be04be441', 'value': '6816f79cc0a8016401c5a33be04be441'}, 'user_input': '', 'requested_by_date': '', 'sys_created_on': '2018-08-22 11:16:09', 'sys_domain': {'link': 'https://dev65345.service-now.com/api/now/table/sys_user_group/global', 'value': 'global'}, 'state': '-5', 'sys_created_by': 'admin', 'knowledge': 'false', 'order': '', 'phase': 'requested', 'closed_at': '', 'cmdb_ci': '', 'delivery_plan': '', 'impact': '3', 'active': 'true', 'review_comments': '', 'work_notes_list': '', 'business_service': '', 'priority': '4', 'sys_domain_path': '/', 'time_worked': '', 'cab_recommendation': '', 'expected_start': '', 'production_system': 'false', 'opened_at': '2018-08-22 11:16:09', 'review_date': '', 'business_duration': '', 'group_list': '', 'requested_by': {'link': 'https://dev6345.service-now.com/api/now/table/sys_user/user1', 'value': 'user1'}, 'work_end': '', 'change_plan': '', 'phase_state': 'open', 'approval_set': '', 'cab_date': '', 'work_notes': '', 'implementation_plan': '', 'end_date': '', 'short_description': '', 'close_code': '', 'correlation_display': '', 'delivery_task': '', 'work_start': '', 'assignment_group': {'link': 'https://dev65345.service-now.com/api/now/table/sys_user_group/testgroup', 'value': 'testgroup'}, 'additional_assignee_list': '', 'outside_maintenance_schedule': 'false', 'description': '', 'on_hold_reason': '', 'calendar_duration': '', 'std_change_producer_version': '', 'close_notes': '', 'sys_class_name': 'change_request', 'closed_by': '', 'follow_up': '', 'sys_id': '436eda82db4023008e357a61399619ee', 'contact_type': '', 'cab_required': 'false', 'urgency': '3', 'scope': '3', 'company': '', 'justification': '', 'reassignment_count': '0', 'review_status': '', 'activity_due': '', 'assigned_to': '', 'start_date': '', 'comments': '', 'approval': 'requested', 'sla_due': '', 'comments_and_work_notes': '', 'due_date': '', 'sys_mod_count': '0', 'on_hold': 'false', 'sys_tags': '', 'conflict_last_run': '', 'escalation': '0', 'upon_approval': 'proceed', 'correlation_id': '', 'location': '', 'risk': '3', 'category': 'Other', 'risk_impact_analysis': ''}}
I searched on the net; it seems it's not parsing because the quotes are single quotes.
So I tried to convert the single quotes into double quotes:
with open('output.json', 'r') as handle:
    handle = open('output.json')
    str = "123"
    str = handle.stringify()  # also with .str()
    str = str.replace("\'", "\"")
    jsonobj = json.load(json.dumps(handle))
But it shows me that there is no attribute stringify or str, as it is a JSON object and these are string-object functions. So can you please help me with the correct way of parsing a JSON object with single quotes in a file?
The code:-
import requests
import json
from pprint import pprint
print("hello world")
url="********"
user="****"
password="*****"
headers={"Content-Type":"application/xml","Accept":"application/json"}
#response=requests.get(url,auth=(user,password),headers=headers)
response = requests.post(url, auth=(user, password), headers=headers ,data="******in xml****")
print(response.status_code)
print(response.json())
jsonobj=json.load(json.dumps(response.json()))
pprint(jsonobj)
What response.json() gives you is not JSON, it's already a dictionary.
One that can be encoded back into JSON, via json.dumps(result).
JSON is a text format to represent objects (the "ON" stands for "object notation"). You can convert a dictionary (or list or scalar) into a JSON-encoded string, or the other way around.
The requests library already does that parsing for you when you call response.json() (it uses json.loads under the hood), so you don't have to think about JSON at all.
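To make the round trip concrete, a tiny sketch (reusing a couple of keys from the data above):
import json
payload = {'number': 'CHG0030006', 'state': '-5'}  # a plain Python dict
text = json.dumps(payload)                         # dict -> JSON-encoded string
assert json.loads(text) == payload                 # JSON string -> dict again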
You haven't shown the code where you get the data from the post. However, you are almost certainly doing something like this:
response = requests.post('...')
data = response.json()
Here data is already parsed from JSON to a Python dict; that is what the requests json method does. There is no need to parse it again.
If you need raw JSON rather than Python data, then don't call the json method. Get the data direct from the response:
data = response.content
Now data will be the raw bytes of the JSON body; use response.text if you want it as a string.
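Applied to the code in the question, the failing line can simply be replaced; a minimal sketch using the question's own variable names:
response = requests.post(url, auth=(user, password), headers=headers, data="******in xml****")
jsonobj = response.json()  # already a Python dict, parsed from the JSON body
pprint(jsonobj)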
I'm trying to scrape the following website:
http://mlb.mlb.com/stats/sortable_batter_vs_pitcher.jsp#season=2018&batting_team=119&batter=571771&pitching_team=133&pitcher=641941
(this is an example URL with a certain pitcher/batter matchup)
I'm able to enter the player codes and team codes easily with this function:
def matchupURL(season, batter, batterTeam, pitcher, pitcherTeam):
    return ("http://mlb.mlb.com/stats/sortable_batter_vs_pitcher.jsp#season=" + str(season)
            + "&batting_team=" + str(teamNumDict[batterTeam]) + "&batter=" + str(batter)
            + "&pitching_team=" + str(teamNumDict[pitcherTeam]) + "&pitcher=" + str(pitcher))
which works nicely, and the returned string works when pasted into my browser.
But when I make a request like this:
newURL = matchupURL(2018,i.id,x.home_team,j.id,x.away_team)
print(i+ " vs " + j)
newSes = requests.get(newURL);
html = BeautifulSoup(newSes.text, "lxml")
mydivs = html.findAll("td",{"class":"dg-ops"})
#do something with this div
I'm unable to find the div. In fact, the entire format of the returned HTML changes. Further, adding headers didn't help, nor did using urllib instead of requests.
This page is dynamic, i.e. the content is generated by JavaScript and rendered in the browser. That is why you can't find the tag you're looking for.
But in this case you can scrape it more easily. With your browser's inspect tool you can see that the data comes from a GET request to a separate URL. For your example, you only have to provide the player IDs:
import requests
url = 'http://lookup-service-prod.mlb.com/json/named.stats_batter_vs_pitcher_composed.bam'
params = {"sport_code":"'mlb'","game_type":"'R'","player_id":"571771","pitcher_id":"641941"}
resp = requests.get(url, params=params).json()
print(resp)
That prints:
{'stats_batter_vs_pitcher_composed': {'stats_batter_vs_pitcher_total': {'queryResults': {'created': '2018-04-12T22:21:47', 'totalSize': '1', 'row': {'hr': '1', 'gidp': '0', 'pitcher_first_last_html': 'Emilio Pagán', 'player': 'Hernandez, Enrique', 'np': '4', 'sac': '0', 'pitcher': 'Pagan, Emilio', 'rbi': '1', 'player_first_last_html': 'Enrique Hernández', 'tb': '4', 'bats': 'R', 'xbh': '1', 'bb': '0', 'slg': '4.000', 'avg': '1.000', 'pitcher_id': '641941', 'ops': '5.000', 'hbp': '0', 'pitcher_html': 'Pagán, Emilio', 'g': '', 'd': '0', 'so': '0', 'throws': 'R', 'sf': '0', 'tpa': '1', 'h': '1', 'cs': '0', 'obp': '1.000', 't': '0', 'ao': '0', 'r': '1', 'go_ao': '-.--', 'sb': '0', 'player_html': 'Hernández, Enrique', 'sbpct': '.---', 'player_id': '571771', 'ibb': '0', 'ab': '1', 'go': '0'}}}, 'copyRight': ' Copyright 2018 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt ', 'stats_batter_vs_pitcher': {'queryResults': {'created': '2018-04-12T22:21:47', 'totalSize': '1', 'row': {'hr': '1', 'gidp': '0', 'pitcher_first_last_html': 'Emilio Pagán', 'player': 'Hernandez, Enrique', 'np': '4', 'sac': '0', 'pitcher': 'Pagan, Emilio', 'rbi': '1', 'opponent': 'Oakland Athletics', 'player_first_last_html': 'Enrique Hernández', 'tb': '4', 'xbh': '1', 'bats': 'R', 'bb': '0', 'avg': '1.000', 'slg': '4.000', 'pitcher_id': '641941', 'ops': '5.000', 'hbp': '0', 'pitcher_html': 'Pagán, Emilio', 'g': '', 'd': '0', 'so': '0', 'throws': 'R', 'sport': 'MLB', 'sf': '0', 'team': 'Los Angeles Dodgers', 'tpa': '1', 'league': 'NL', 'h': '1', 'cs': '0', 'obp': '1.000', 't': '0', 'ao': '0', 'season': '2018', 'r': '1', 'go_ao': '-.--', 'sb': '0', 'opponent_league': 'AL', 'player_html': 'Hernández, Enrique', 'sbpct': '.---', 'player_id': '571771', 'ibb': '0', 'ab': '1', 'opponent_id': '133', 'team_id': '119', 'go': '0', 'opponent_sport': 'MLB'}}}}}
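If you then want individual fields, the nested result can be navigated like any other dictionary; a small sketch pulling a few stats out of the response above (key names taken from the printed output):
row = resp['stats_batter_vs_pitcher_composed']['stats_batter_vs_pitcher_total']['queryResults']['row']
print(row['player'], 'vs', row['pitcher'], '| AVG:', row['avg'], '| OPS:', row['ops'])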
In my Python code, I get strings from the text file like:
a = "[{'index': '1', 'selected': 'true', 'length': '0', 'completedLength': '0', 'path': '', 'uris': [{'status': 'used', 'uri': 'http://www.single.com'}]}]"
b ="[{'index': '1', 'selected': 'true', 'length': '0', 'completedLength': '0', 'path': '', 'uris': [{'status': 'used', 'uri': 'http://www.mirrors.com'}, {'status': 'used', 'uri': 'http://www.mirrors2.com'}]}]"
c ="[{'index': '1', 'selected': 'true', 'length': '103674793', 'completedLength': '0', 'path': '/home/dr/Maher_Al-Muaiqly_(MP3_Quran)/002.mp3', 'uris': []}, {'index': '2', 'selected': 'true', 'length': '62043128', 'completedLength': '0', 'path': '/home/dr/Maher_Al-Muaiqly_(MP3_Quran)/004.mp3', 'uris': []}, {'index': '3', 'selected': 'true', 'length': '57914945', 'completedLength': '0', 'path': '/home/dr/Maher_Al-Muaiqly_(MP3_Quran)/003.mp3', 'uris': []}]"
I want to get the text of the uris value; the output should look like:
a = [{'status': 'used', 'uri': 'http://www.single.com'}]
b = [{'status': 'used', 'uri': 'http://www.mirrors.com'}, {'status': 'used', 'uri': 'http://www.mirrors2.com'}]
c = [[],[],[]]
I have spent many hours in failed attempts to get this result using string functions, for example:
uris = str.split('}, {')
for uri in uris:
    uri = uri.split(',')
    # and so on ...
but it works badly, especially in the second case. I hope that someone can do it with a regex or any other way.
They are all Python literals. You can use ast.literal_eval; there is no need for a regular expression.
>>> a = "[{'index': '1', 'selected': 'true', 'length': '0', 'completedLength': '0', 'path': '', 'uris': [{'status': 'used', 'uri': 'http://www.single.com'}]}]"
>>> b = "[{'index': '1', 'selected': 'true', 'length': '0', 'completedLength': '0', 'path': '', 'uris': [{'status': 'used', 'uri': 'http://www.mirrors.com'}, {'status': 'used', 'uri': 'http://www.mirrors2.com'}]}]"
>>> c = "[{'index': '1', 'selected': 'true', 'length': '103674793', 'completedLength': '0', 'path': '/home/dr/Maher_Al-Muaiqly_(MP3_Quran)/002.mp3', 'uris': []}, {'index': '2', 'selected': 'true', 'length': '62043128', 'completedLength': '0', 'path': '/home/dr/Maher_Al-Muaiqly_(MP3_Quran)/004.mp3', 'uris': []}, {'index': '3', 'selected': 'true', 'length': '57914945', 'completedLength': '0', 'path': '/home/dr/Maher_Al-Muaiqly_(MP3_Quran)/003.mp3', 'uris': []}]"
>>> import ast
>>> [x['uris'] for x in ast.literal_eval(a)]
[[{'status': 'used', 'uri': 'http://www.single.com'}]]
>>> [x['uris'] for x in ast.literal_eval(b)]
[[{'status': 'used', 'uri': 'http://www.mirrors.com'}, {'status': 'used', 'uri': 'http://www.mirrors2.com'}]]
>>> [x['uris'] for x in ast.literal_eval(c)]
[[], [], []]
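Since the strings come from a text file, a small sketch (the file name downloads.txt is just a placeholder) reading and parsing one literal per line:
import ast

with open('downloads.txt') as fh:                    # hypothetical file, one literal per line
    for line in fh:
        entries = ast.literal_eval(line.strip())     # safely evaluate the Python literal
        print([entry['uris'] for entry in entries])  # the 'uris' value of each entry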
In JavaScript you can do this:
a = a.replace(/^.*uris[^[]*(\[[^\]]*\]).*$/, '$1');
In PHP it would be done this way:
$a = preg_replace('/^.*uris[^[]*(\[[^\]]*\]).*$/', '\1', $a);
edit: well I see, it wouldn't do your complete task for 'c' -.-
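For reference, the same idea translated to Python's re module (with the same limitation for 'c', and a hedge that ast.literal_eval above remains the cleaner approach):
import re
uris_text = re.sub(r"^.*uris[^[]*(\[[^\]]*\]).*$", r"\1", a)  # keeps only the bracketed uris value
print(uris_text)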