I'm trying to scrape links on a website. When I follow the link it can be either a motor advert or an ordinary advert.
The keys that I need to scrape for both types of adverts are the same:
For the Motor adverts - data = dict_keys(['header', 'description', 'currency', 'price', 'wanted', 'id', 'photos', 'section', 'age', 'spotlight', 'year', 'state', 'friendlyUrl', 'keyInfo', 'seller', 'displayAttributes', 'countyTown', 'breadcrumbs'])
For the Ordinary adverts - data = dict_keys(['header', 'description', 'currency', 'price', 'wanted', 'id', 'photos', 'section', 'age', 'spotlight', 'year', 'state', 'friendlyUrl', 'keyInfo', 'seller', 'displayAttributes', 'countyTown', 'breadcrumbs'])
In the Motor adverts data the 'breadcrumbs' key gives me
[{'name': 'motor',
'displayName': 'Cars & Motor',
'id': 1003,
'title': 'Cars Motorbikes Trucks Caravans and More',
'subdomain': 'www',
'containsSubsections': True,
'xtn2': 101},
{'name': 'cars',
'displayName': 'Cars',
'id': 11,
'title': 'Cars',
'subdomain': 'cars',
'containsSubsections': False,
'xtn2': 142}]
while in the Ordinary adverts 'breadcrumbs' gives me
[{'name': 'all',
'displayName': 'All Sections',
'id': 2066,
'title': 'See Everything For Sale',
'subdomain': 'www',
'containsSubsections': True,
'xtn2': 100},
{'name': 'household',
'displayName': 'House & DIY',
'id': 1001,
'title': 'House & DIY',
'subdomain': 'www',
'containsSubsections': True,
'xtn2': 105},
{'name': 'furniture',
'displayName': 'Furniture & Interiors',
'id': 3,
'title': 'Furniture',
'subdomain': 'www',
'containsSubsections': True,
'xtn2': 105},
{'name': 'kitchenappliances',
'displayName': 'Kitchen Appliances',
'id': 1089,
'title': 'Kitchen Appliances',
'subdomain': 'www',
'containsSubsections': False,
'xtn2': 105}]
I have tried to get the Motor data by calling the 'xtn2' key and value with data['breadcrumbs'][0]['xtn2'] == 101: and giving it a name 'motordata'
if data['breadcrumbs'][0]['xtn2'] == 101:
    motordata = data
    if motordata:
        motors = motordata['breadcrumbs'][0]['name']
        views = motordata['views']
        title = motordata['header']
        Adcounty = motordata['county']
        itemId = motordata['id']
        sellerId = motordata['seller']['id']
        sellerName = motordata['seller']['name']
        adCount = motordata['seller']['adCount']
        lifetimeAds = motordata['seller']['adCountStats']['lifetimeAdView']['value']
        currency = motordata['currency']
        price = motordata['price']
        adUrl = motordata['friendlyUrl']
        adAge = motordata['age']
        spotlight = motordata['spotlight']
and the Ordinary data with elif data['breadcrumbs'][0]['xtn2'] == 100: with a name 'Allotherads'
elif data['breadcrumbs'][0]['xtn2'] == 100:
    Allotherads = alldata
    if Allotherads:
        views = Allotherads['views']
        title = Allotherads['header']
        itemId = Allotherads['id']
        Adcounty = Allotherads['county']
        # Adtown = alldata['countyTown']
        sellerId = Allotherads['seller']['id']
        sellerName = Allotherads['seller']['name']
        adCount = Allotherads['seller']['adCount']
        lifetimeAds = Allotherads['seller']['adCountStats']['lifetimeAdView']['value']
        currency = Allotherads['currency']
        price = Allotherads['price']
        adUrl = Allotherads['friendlyUrl']
        adAge = Allotherads['age']
        spotlight = Allotherads['spotlight']
        topSectionName = Allotherads['xitiAdData']['topSectionName']
        xtn2 = Allotherads['breadcrumbs'][2]['xtn2']
        subSection = Allotherads['breadcrumbs'][2]['displayName']
but it doesn't work. It just scrapes the Ordinary adverts but not the Motor adverts.
Where am I going wrong?
Can't you just do (if multiple motor dicts are possible):
motordata = [x for x in data.get('breadcrumbs') if x.get('name') == "motor"]
or (if only one motordata is possible):
motordata = next(iter([x for x in data.get('breadcrumbs') if x.get('name') == "motor"]))
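next(iter(...)) returns the same element as [0] at the end. Because the list comprehension builds the whole list first, any speed difference is negligible, and both forms raise an exception (StopIteration or IndexError) if no breadcrumb is named "motor".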
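A minimal sketch of the same idea wrapped in a helper, in case it is useful: instead of relying on the position of a breadcrumb, it scans every breadcrumb for the section name and returns None when nothing matches. The helper name find_breadcrumb and the trimmed sample adverts are illustrative assumptions, not part of the site's data or API.

```python
def find_breadcrumb(data, name):
    """Return the first breadcrumb whose 'name' matches, or None."""
    # A generator expression stops at the first match; the None default
    # avoids StopIteration when the advert has no such breadcrumb.
    return next(
        (crumb for crumb in data.get('breadcrumbs', []) if crumb.get('name') == name),
        None,
    )


# Trimmed-down sample adverts based on the breadcrumbs shown in the question
motor_ad = {'breadcrumbs': [{'name': 'motor', 'xtn2': 101}, {'name': 'cars', 'xtn2': 142}]}
ordinary_ad = {'breadcrumbs': [{'name': 'all', 'xtn2': 100}, {'name': 'household', 'xtn2': 105}]}

for ad in (motor_ad, ordinary_ad):
    kind = 'motor' if find_breadcrumb(ad, 'motor') else 'ordinary'
    print(kind)
```

Since both advert types share the same keys, one extraction routine could then serve both, with the motor check only deciding which extra fields to read.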
Related
I have a relatively simple nested dictionary as below:
emp_details = {
'Employee': {
'jim': {'ID':'001', 'Sales':'75000', 'Title': 'Lead'},
'eva': {'ID':'002', 'Sales': '50000', 'Title': 'Associate'},
'tony': {'ID':'003', 'Sales': '150000', 'Title': 'Manager'}
}
}
I can get the sales info of 'eva' easily by:
print(emp_details['Employee']['eva']['Sales'])
but I'm having difficulty writing a statement to extract information on all employees whose sales are over 50000.
A comprehension with a filtering if could do this in one statement, but an explicit for loop is easier to follow:
result = {} # dict result
result_list = [] # list of (key, value) tuples
for key, value in emp_details['Employee'].items(): # iterate over all items in the dictionary
    if int(value['Sales']) > 50000: # Sales is a string, so convert it before comparing
        result[key] = value # add to dict
        result_list.append((key, value)) # add to list
print(result)
print(result_list)
# should say:
'''
{'jim': {'ID':'001', 'Sales':'75000', 'Title': 'Lead'}, 'tony': {'ID':'003', 'Sales': '150000', 'Title': 'Manager'}}
[('jim', {'ID':'001', 'Sales':'75000', 'Title': 'Lead'}), ('tony', {'ID':'003', 'Sales': '150000', 'Title': 'Manager'})]
'''
Your Sales values are strings, so convert them to int before comparing. Either of the following gets the information for employees whose sales are over 50000.
Method 1: if you just want to print the information:
emp_details = {'Employee': {'jim': {'ID': '001', 'Sales': '75000', 'Title': 'Lead'},
                            'eva': {'ID': '002', 'Sales': '50000', 'Title': 'Associate'},
                            'tony': {'ID': '003', 'Sales': '150000', 'Title': 'Manager'}}}
for emp in emp_details['Employee']:
    if int(emp_details['Employee'][emp]['Sales']) > 50000:
        print(emp_details['Employee'][emp])
This prints:
{'ID': '001', 'Sales': '75000', 'Title': 'Lead'}
{'ID': '003', 'Sales': '150000', 'Title': 'Manager'}
Method 2: use a dict or list comprehension to keep the employee names as well:
emp_details = {'Employee': {'jim': {'ID': '001', 'Sales': '75000', 'Title': 'Lead'},
                            'eva': {'ID': '002', 'Sales': '50000', 'Title': 'Associate'},
                            'tony': {'ID': '003', 'Sales': '150000', 'Title': 'Manager'}}}
emp_details_dictComp = {k: v for k, v in emp_details['Employee'].items() if int(v['Sales']) > 50000}
print(emp_details_dictComp)
emp_details_listComp = [(k, v) for k, v in emp_details['Employee'].items() if int(v['Sales']) > 50000]
print(emp_details_listComp)
Result:
{'jim': {'ID': '001', 'Sales': '75000', 'Title': 'Lead'}, 'tony': {'ID': '003', 'Sales': '150000', 'Title': 'Manager'}}
[('jim', {'ID': '001', 'Sales': '75000', 'Title': 'Lead'}), ('tony', {'ID': '003', 'Sales': '150000', 'Title': 'Manager'})]
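If the threshold is likely to change, the comprehension can be wrapped in a small function. This is only a sketch: the function name employees_with_sales_over and its parameters are made up for illustration.

```python
def employees_with_sales_over(employees, threshold):
    """Return {name: record} for employees whose 'Sales' exceeds threshold."""
    # 'Sales' is stored as a string, so convert it before comparing
    return {
        name: record
        for name, record in employees.items()
        if int(record['Sales']) > threshold
    }


emp_details = {
    'Employee': {
        'jim': {'ID': '001', 'Sales': '75000', 'Title': 'Lead'},
        'eva': {'ID': '002', 'Sales': '50000', 'Title': 'Associate'},
        'tony': {'ID': '003', 'Sales': '150000', 'Title': 'Manager'},
    }
}

high_sellers = employees_with_sales_over(emp_details['Employee'], 50000)
print(list(high_sellers))
```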
I have the dict below stored in data:
{'id': 255719,
'display_name': 'Off Broadway Apartments',
'access_right': {'status': 'OWNED', 'api_enabled': True},
'creation_time': '2021-04-26T15:53:29+00:00',
'status': {'value': 'OFFLINE', 'last_change': '2021-07-10T17:26:50+00:00'},
'licence': {'id': 74213,
'code': '23AF-15A8-0514-2E4B-04DE-5C19-A574-B20B',
'bound_mac_address': '00:11:32:C2:FE:6A',
'activation_time': '2021-04-26T15:53:29+00:00',
'type': 'SUBSCRIPTION'},
'timezone': 'America/Chicago',
'version': {'agent': '3.7.0-b001', 'package': '2.5.1-0022'},
'location': {'latitude': '41.4126', 'longitude': '-99.6345'}}
I would like to convert it into a DataFrame. I tried the code below:
df = pd.DataFrame(data)
but the result does not come out properly because of the nested dicts. Can anyone advise?
from pandas import json_normalize  # available at top level since pandas 1.0; the pandas.io.json import path is deprecated
# load json
json = {'id': 255719,
'display_name': 'Off Broadway Apartments',
'access_right': {'status': 'OWNED', 'api_enabled': True},
'creation_time': '2021-04-26T15:53:29+00:00',
'status': {'value': 'OFFLINE', 'last_change': '2021-07-10T17:26:50+00:00'},
'licence': {'id': 74213,
'code': '23AF-15A8-0514-2E4B-04DE-5C19-A574-B20B',
'bound_mac_address': '00:11:32:C2:FE:6A',
'activation_time': '2021-04-26T15:53:29+00:00',
'type': 'SUBSCRIPTION'},
'timezone': 'America/Chicago',
'version': {'agent': '3.7.0-b001', 'package': '2.5.1-0022'},
'location': {'latitude': '41.4126', 'longitude': '-99.6345'}}
Create a function to flatten nested JSON:
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
You can now use that function on your original JSON data:
flat = flatten_json(json)
df = json_normalize(flat)
Results:
id display_name ... location_latitude location_longitude
0 255719 Off Broadway Apartments ... 41.4126 -99.6345
I am writing a parser for ads and ran into a problem: the data is not displayed in full, only the first part of it.
import requests
import json
from datetime import datetime
url = 'https://api.yo.com/api/v1/products?app_id=web2'
response = requests.get(url).json()
items = response['data']
iter1 = []
for item in items:
    iter1.append({
        'name': item.get('name', 'NA'),
        'owner': item.get('owner', 'NA'),
        'date_published': item.get('date_published', 'NA'),
        'short_url': item.get('short_url', 'NA')
    })
result = {}
for keyvalue in iter1:
    result["name"] = iter1[0]["name"]
    result["user"] = iter1[1]["owner"]["name"]
    result["date_published"] = datetime.utcfromtimestamp(iter1[0]["date_published"]).strftime('%Y-%m-%d %H:%M:%S')
    result["short_url"] = iter1[0]["short_url"]
    print(keyvalue)
Output example print(keyvalue):
{'name': 'text ads', 'user': 'John. Doe', 'date_published': '2021-08-02 20:37:13', 'short_url': 'https://yo.com/p610009e'}
What is actually contained in iter1:
[{'name': 'IPhone 7 32 gb', 'owner': {'id': '5a552f202d23c1214'}, 'date_published': 1627937371, 'short_url': 'https://yo/p60ff'},
{'name': 'Матрас, подушка', 'owner': {'id': '5dc2590dabc73388f2', 'name': 'Olga', 'type': 'b2b_professional', 'linked_id': '5dc2590dad53388f2', 'is_shop': False, 'is_verified': False, 'image': {'id': '5dc2756ba162b6342', 'num': 1, 'url': 'https://cach6342.jpg'}, 'date_registered': 1573017904, 'settings': {'display_phone_to_anon': True, 'display_phone': True, 'display_chat': True, 'location': {'description': 'Москва'}}, 'display_phone_num': None, 'isOnline': False, 'onlineText': 'Не в сети', 'online_text_detailed': 'Сегодня в 23:30', 'answerTimeText': '', 'is_blocked': False, 'store': None, 'rating_mark': 4.4}, 'date_published': 1627542263, 'short_url': 'https://yo/p60dac06897'},
{'name': 'Букеты', 'owner': {'id': '59d60973a3f3386f3f', 'name': 'Екатерина', 'type': 'person', 'linked_id': '59d609739e9486f3f', 'is_shop': False, 'is_verified': False, 'image': {'id': '59db3e1457556c13', 'num': 1, 'url': 'https://cac/i/oi/d6/59d609b.jpg'}, 'date_registered': 1507199347, 'settings': {'display_phone_to_anon': False, 'display_phone': False, 'display_chat': True}, 'display_phone_num': None, 'isOnline': False, 'onlineText': 'Не в сети', 'online_text_detailed': 'Сегодня в 21:44', 'answerTimeText': '', 'is_blocked': False, 'store': None, 'rating_mark': 0}, 'date_published': 1627472412, 'short_url': 'https://you/p60a0db5de263'}]
How can I display all the lines 'print(keyvalue)' that are processed in result = {}?
I think you've got yourself confused about how iteration works in Python, although your first loop is fine: for item in items:
You probably meant this for your second loop:
output = []
for result in iter1:
    item = {}
    item["name"] = result["name"]
    item["user"] = result["owner"].get("name")
    date = datetime.utcfromtimestamp(result["date_published"]).strftime('%Y-%m-%d %H:%M:%S')
    item["date_published"] = date
    item["short_url"] = result["short_url"]
    output.append(item)
print(output)
You could probably collapse the two loops into one, but best to start with two.
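For reference, a rough sketch of what the collapsed single loop might look like. The function name summarise_items is an assumption, the sample items are cut down from the question's data, and the 'NA' fallbacks follow the question's own convention. It uses a timezone-aware conversion instead of utcfromtimestamp, which should give the same UTC string.

```python
from datetime import datetime, timezone


def summarise_items(items):
    """Build one flat summary dict per raw item in a single pass."""
    output = []
    for item in items:
        owner = item.get('owner') or {}
        published = item.get('date_published')
        output.append({
            'name': item.get('name', 'NA'),
            # Some owners have no 'name' key, so fall back to 'NA'
            'user': owner.get('name', 'NA'),
            # Convert the Unix timestamp to a UTC string when present
            'date_published': (
                datetime.fromtimestamp(published, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
                if published is not None else 'NA'
            ),
            'short_url': item.get('short_url', 'NA'),
        })
    return output


# Cut-down sample items taken from the question
sample = [
    {'name': 'IPhone 7 32 gb', 'owner': {'id': '5a552f202d23c1214'},
     'date_published': 1627937371, 'short_url': 'https://yo/p60ff'},
    {'name': 'Букеты', 'owner': {'id': '59d60973a3f3386f3f', 'name': 'Екатерина'},
     'date_published': 1627472412, 'short_url': 'https://you/p60a0db5de263'},
]

for row in summarise_items(sample):
    print(row)
```

In the real script the list passed in would be response['data'] from the request.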
I have the following strings as values for a dictionary key:
["low", "middle", "high", "very high"]
These are the options for the dictionary key 'priority'; a sample dict element is:
{'name': 'service', 'priority': value}
My task is to collect a list of such dictionaries that differ only in the value of the 'priority' key.
my_list = [{'name': 'service', 'priority': 'low'}, {'name': 'service', 'priority': 'high'}]
In the end a final dictionary item should exist, that has the highest priority value. It should work like the maximum principle. In this case {'name': 'service', 'priority': 'high'} would be the result.
The problem is that the value is a string, not an integer.
Thanks for any ideas on how to get this to work.
Here is an approach that uses the itertools module:
# Step 0: prepare data
score = ["low", "middle", "high", "very high"]
my_list = [{'name': 'service', 'priority': 'low', 'label1':'text'}, {'name': 'service', 'priority': 'middle', 'label2':'text'}, {'name': 'service_b', 'priority': 'middle'}, {'name': 'service_b', 'priority': 'very high'}]
my_list # to just show source data in list
Out[1]:
[{'name': 'service', 'priority': 'low', 'label1': 'text'},
{'name': 'service', 'priority': 'middle', 'label2': 'text'},
{'name': 'service_b', 'priority': 'middle'},
{'name': 'service_b', 'priority': 'very high'}]
# Step 0.5: convert bytes-string (if it is) to string
# my_list = [{k:(lambda x: (x.decode() if type(x) == bytes else x))(v) for k,v in i.items()} for i in my_list ]
# Step 1: turn the "score" list into a dict that maps each priority to its rank
score_dic = {name: rank for rank, name in enumerate(score)}
score_dic
Out[2]:
{'low': 0, 'middle': 1, 'high': 2, 'very high': 3}
# Step 2: get result (itertools.groupby only groups consecutive items, so my_list must already be ordered by 'name')
import itertools
[max(g, key=lambda b: score_dic[b['priority']]) for k, g in itertools.groupby(my_list, lambda x: x['name'])]
Out[3]:
[{'name': 'service', 'priority': 'middle', 'label2': 'text'},
{'name': 'service_b', 'priority': 'very high'}]
Is this what you want?
priorities = ["low", "middle", "high", "very high"]
items = [{'name': 'service', 'priority': 'high'}, {'name': 'service2', 'priority': 'high'}, {'name': 'service', 'priority': 'very high'}, {'name': 'service2', 'priority': 'very high'}]
max_priority = max(items, key=lambda item: priorities.index(item['priority']))['priority']
max_items = [item for item in items if item['priority'] == max_priority]
print(max_items)
Output:
[{'name': 'service', 'priority': 'very high'}, {'name': 'service2', 'priority': 'very high'}]
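If the input is not ordered by name and one result per name is wanted, a plain dict can do the grouping. This is a sketch only; PRIORITY_ORDER, RANK and highest_priority_per_name are illustrative names, and it assumes every 'priority' value appears in the list of options from the question.

```python
PRIORITY_ORDER = ["low", "middle", "high", "very high"]
# Map each priority string to its position so strings can be compared by rank
RANK = {name: rank for rank, name in enumerate(PRIORITY_ORDER)}


def highest_priority_per_name(items):
    """Keep, for each 'name', the dict with the highest-ranked 'priority'."""
    best = {}
    for item in items:
        current = best.get(item['name'])
        if current is None or RANK[item['priority']] > RANK[current['priority']]:
            best[item['name']] = item
    return list(best.values())


# Deliberately unsorted input: no prior ordering by 'name' is needed
my_list = [
    {'name': 'service_b', 'priority': 'middle'},
    {'name': 'service', 'priority': 'low'},
    {'name': 'service_b', 'priority': 'very high'},
    {'name': 'service', 'priority': 'high'},
]

print(highest_priority_per_name(my_list))
```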
Here is the script:
from bs4 import BeautifulSoup as bs4
import requests
import json
from lxml import html
from pprint import pprint
import re
def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    # res = soup.findAll('script') # find all scripts..
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    return script.text[23:]
test1 = get_data()
data = json.loads(test1)
for item1 in data['items']:
    data1 = item1['itemList']['items']
    for item2 in data1:
        pitch_a = item2['opponentAName']
        pitch_b = item2['opponentBName']
##        group = item2['displayGroups']
##        for item3 in group:
##            new_il = item3['itemList']
##            for item4 in new_il:
##                market = item4['description']
##                oc = item4['outcomes']
        print(pitch_a,pitch_b)
##for items in data['items']:
##    pos = items['itemList']['items']
##    for item in pos:
##        work = item['competitors']
##        pitcher_a = item['opponentAName']
##        pitcher_b = item['opponentBName']
##        group = item['displayGroups']
##        for item, item2 in zip(work,group):
##            team = item['abbreviation']
##            place = item['type']
##            il2 = item2['itemList']
##            for item in il2:
##                ml = item['description']
##                print(team,place,pitcher_a,pitcher_b,ml)
I have been trying to scrape
team abbrev = ['items']['itemList']['items']['competitors']['abbreviation']
home_away = ['items']['itemList']['items']['competitors']['type']
team pitcher home = ['items']['itemList']['items']['opponentAName']
team pitcher away = ['items']['itemList']['items']['opponentBName']
moneyline american odds = ['items']['itemList']['items']['displayGroups']['itemList']['outcomes']['price']['american']
Total runs = ['items']['itemList']['items']['displayGroups']['itemList']['outcomes']['price']['handicap']
Part of the Json pprinted:
[{'baseLink': '/baseball/mlb/game-lines-market-group',
'defaultType': True,
'description': 'Game Lines',
'id': '136',
'itemList': {'items': [{'LIVE': True,
'atmosphereLink': '/api/atmosphere/eventNotification/events/A/3149961',
'awayTeamFirst': True,
'baseLink': '/baseball/mlb/minnesota-twins-los-angeles-angels-201805112207',
'competitionId': '24736',
'competitors': [{'abbreviation': 'LAA',
'description': 'Los Angeles Angels',
'id': '3149961-1642',
'rotationNumber': '978',
'shortName': 'Angels',
'type': 'HOME'},
{'abbreviation': 'MIN',
'description': 'Minnesota Twins',
'id': '3149961-9990',
'rotationNumber': '977',
'shortName': 'Twins',
'type': 'AWAY'}],
'denySameGame': 'NO',
'description': 'Minnesota Twins # Los Angeles Angels',
'displayGroups': [{'baseLink': '/baseball/mlb/game-lines-market-group',
'defaultType': True,
'description': 'Game Lines',
'id': '136',
'itemList': [{'belongsToDefault': True,
'columns': 'H2Columns',
'description': 'Moneyline',
'displayGroups': '136,A-136',
'id': '46892277',
'isInRunning': True,
'mainMarketType': 'MONEYLINE',
'mainPeriod': True,
'marketTypeGroup': 'MONEY_LINE',
'notes': '',
'outcomes': [{'competitorId': '3149961-9990',
'description': 'Minnesota '
'Twins',
'id': '211933276',
'price': {'american': '-475',
'decimal': '1.210526',
'fractional': '4/19',
'id': '1033002124',
'outcomeId': '211933276'},
'status': 'OPEN',
'type': 'A'},
{'competitorId': '3149961-1642',
'description': 'Los '
'Angeles '
'Angels',
'id': '211933277',
'price': {'american': '+310',
'decimal': '4.100',
'fractional': '31/10',
'id': '1033005679',
'outcomeId': '211933277'},
'status': 'OPEN',
'type': 'H'}],
'periodType': 'Live '
'Match',
'sequence': '14',
'sportCode': 'BASE',
'status': 'OPEN',
'type': 'WW'},
{'belongsToDefault': True,
'columns': 'H2Columns',
'description': 'Runline',
'displayGroups': '136,A-136',
'id': '46892287',
'isInRunning': True,
'mainMarketType': 'SPREAD',
'mainPeriod': True,
'marketTypeGroup': 'SPREAD',
'notes': '',
'outcomes': [{'competitorId': '3149961-9990',
'description': 'Minnesota '
'Twins',
'id': '211933278',
'price': {'american': '+800',
'decimal': '9.00',
'fractional': '8/1',
'handicap': '-1.5',
'id': '1033005677',
'outcomeId': '211933278'},
'status': 'OPEN',
'type': 'A'},
{'competitorId': '3149961-1642',
'description': 'Los '
'Angeles '
'Angels',
'id': '211933279',
'price': {'american': '-2000',
'decimal': '1.050',
'fractional': '1/20',
'handicap': '1.5',
'id': '1033005678',
'outcomeId': '211933279'},
'status': 'OPEN',
'type': 'H'}],
'periodType': 'Live '
'Match',
'sequence': '14',
'sportCode': 'BASE',
'status': 'OPEN',
'type': 'SPR'}],
'link': '/baseball/mlb/game-lines-market-group'}],
'feedCode': '13625145',
'id': '3149961',
'link': '/baseball/mlb/minnesota-twins-los-angeles-angels-201805112207',
'notes': '',
'numMarkets': 2,
'opponentAId': '214704',
'opponentAName': 'Tyler Skaggs (L)',
'opponentBId': '215550',
'opponentBName': 'Lance Lynn (R)',
'sport': 'BASE',
'startTime': 1526090820000,
'status': 'O',
'type': 'MLB'},
There are a few different loops I had started in the script above, but neither of them is working out the way I would like.
away team | away moneyline | away pitcher | Total Runs | and repeat for Home Team is what I would like it to be eventually. I can write to csv once it is parsed the proper way.
Thank you for the fresh set of eyes. I've been working on this for the better part of a day trying to figure out the best way to access the content I would like. If JSON is not the best way and bs4 works better, I would love to hear your opinion.
There's no simple answer to your problem. Scraping data requires you to carefully assess the data you are dealing with, work out where the parts you want to extract are located and figure out how to effectively store the data you extract.
Try printing the data in your loops to visualise what is happening in your code (or try debugging). From there it's easy to figure out whether you're iterating over what you expect. Look for patterns throughout the input data to help organise the data you extract.
To help yourself, you should give your variables descriptive names, separate your code into logical chunks and add comments when it starts to get complicated.
Here's some working code, but I encourage you to try what I told you above, then if you're still stuck look below for guidance.
output = {}
root = data['items'][0]
for game_line in root['itemList']['items']:
    # Create a temporary dict to store the data for this gameline
    team_data = {}
    # Get competitors
    competitors = game_line['competitors']
    for team in competitors:
        team_type = team['type'] # either HOME or AWAY
        # Create a new dict to store data for each team
        team_data[team_type] = {}
        team_data[team_type]['abbreviation'] = team['abbreviation']
        team_data[team_type]['name'] = team['description']
    # Get MoneyLine and Total Runs
    for item in game_line['displayGroups'][0]['itemList']:
        for outcome in item['outcomes']:
            team_type = outcome['type'] # either A or H
            team_type = 'HOME' if team_type == 'H' else 'AWAY'
            if item['mainMarketType'] == 'MONEYLINE':
                team_data[team_type]['moneyline'] = outcome['price']['american']
            elif item['mainMarketType'] == 'SPREAD':
                team_data[team_type]['total runs'] = outcome['price']['handicap']
    # Get the pitchers
    team_data['HOME']['pitcher'] = game_line['opponentAName']
    team_data['AWAY']['pitcher'] = game_line['opponentBName']
    # For each gameline, add the teamdata we gathered to the output dict
    output[game_line['description']] = team_data
This produces output like:
{
    'Atlanta Braves # Miami Marlins': {
        'AWAY': {
            'abbreviation': 'ATL',
            'moneyline': '-130',
            'name': 'Atlanta Braves',
            'pitcher': 'Mike Soroka (R)',
            'total runs': '-1.5'
        },
        'HOME': {
            'abbreviation': 'MIA',
            'moneyline': '+110',
            'name': 'Miami Marlins',
            'pitcher': 'Jarlin Garcia (L)',
            'total runs': '1.5'
        }
    },
    'Boston Red Sox # Toronto Blue Jays': {
        'AWAY': {
            'abbreviation': 'BOS',
            'moneyline': '-133',
            'name': 'Boston Red Sox',
            'pitcher': 'David Price (L)',
            'total runs': '-1.5'
        },
        'HOME': {
            'abbreviation': 'TOR',
            'moneyline': '+113',
            'name': 'Toronto Blue Jays',
            'pitcher': 'Marco Estrada (R)',
            'total runs': '1.5'
        }
    },
}
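Since the goal was to write the result to csv, here is one possible way to flatten a dict of that shape into one row per team. It is a sketch under assumptions: the function name to_rows and the column names are made up, the sample dict is copied from the output above, and the csv is written to an in-memory buffer rather than a file.

```python
import csv
import io

FIELDS = ['game', 'side', 'team', 'moneyline', 'pitcher', 'total runs']


def to_rows(output):
    """Flatten {game: {'AWAY': {...}, 'HOME': {...}}} into one row per team."""
    rows = []
    for game, teams in output.items():
        for side in ('AWAY', 'HOME'):
            team = teams.get(side, {})
            rows.append({
                'game': game,
                'side': side,
                'team': team.get('abbreviation', ''),
                'moneyline': team.get('moneyline', ''),
                'pitcher': team.get('pitcher', ''),
                'total runs': team.get('total runs', ''),
            })
    return rows


# One game copied from the sample output above
sample_output = {
    'Atlanta Braves # Miami Marlins': {
        'AWAY': {'abbreviation': 'ATL', 'moneyline': '-130', 'name': 'Atlanta Braves',
                 'pitcher': 'Mike Soroka (R)', 'total runs': '-1.5'},
        'HOME': {'abbreviation': 'MIA', 'moneyline': '+110', 'name': 'Miami Marlins',
                 'pitcher': 'Jarlin Garcia (L)', 'total runs': '1.5'},
    },
}

# Write to an in-memory buffer; swap in open('lines.csv', 'w', newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(to_rows(sample_output))
print(buffer.getvalue())
```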