I'm making a call to an API which returns a JSON response, and I am then trying to retrieve certain data from within the response.
{'data': {'9674': {'category': 'token',
'contract_address': [{'contract_address': '0x2a3bff78b79a009976eea096a51a948a3dc00e34',
'platform': {'coin': {'id': '1027',
'name': 'Ethereum',
'slug': 'ethereum',
'symbol': 'ETH'},
'name': 'Ethereum'}}],
'date_added': '2021-05-10T00:00:00.000Z',
'date_launched': '2021-05-10T00:00:00.000Z',
'description': 'Wilder World (WILD) is a cryptocurrency '
'launched in 2021and operates on the '
'Ethereum platform. Wilder World has a '
'current supply of 500,000,000 with '
'83,683,300.17 in circulation. The last '
'known price of Wilder World is 2.28165159 '
'USD and is down -6.79 over the last 24 '
'hours. It is currently trading on 21 active '
'market(s) with $2,851,332.76 traded over '
'the last 24 hours. More information can be '
'found at https://www.wilderworld.com/.',
'id': 9674,
'is_hidden': 0,
'logo': 'https://s2.coinmarketcap.com/static/img/coins/64x64/9674.png',
'name': 'Wilder World',
'notice': '',
'platform': {'id': 1027,
'name': 'Ethereum',
'slug': 'ethereum',
'symbol': 'ETH',
'token_address': '0x2a3bff78b79a009976eea096a51a948a3dc00e34'},
'self_reported_circulating_supply': 19000000,
'self_reported_tags': None,
'slug': 'wilder-world',
'subreddit': '',
'symbol': 'WILD',
'tag-groups': ['INDUSTRY',
'CATEGORY',
'INDUSTRY',
'CATEGORY',
'CATEGORY',
'CATEGORY',
'CATEGORY'],
'tag-names': ['VR/AR',
'Collectibles & NFTs',
'Gaming',
'Metaverse',
'Polkastarter',
'Animoca Brands Portfolio',
'SkyVision Capital Portfolio'],
'tags': ['vr-ar',
'collectibles-nfts',
'gaming',
'metaverse',
'polkastarter',
'animoca-brands-portfolio',
'skyvision-capital-portfolio'],
'twitter_username': 'WilderWorld',
'urls': {'announcement': [],
'chat': [],
'explorer': ['https://etherscan.io/token/0x2a3bff78b79a009976eea096a51a948a3dc00e34'],
'facebook': [],
'message_board': ['https://medium.com/#WilderWorld'],
'reddit': [],
'source_code': [],
'technical_doc': [],
'twitter': ['https://twitter.com/WilderWorld'],
'website': ['https://www.wilderworld.com/']}}},
'status': {'credit_count': 1,
'elapsed': 7,
'error_code': 0,
'error_message': None,
'notice': None,
'timestamp': '2022-01-20T21:33:04.832Z'}}
The data I am trying to get is 'logo': 'https://s2.coinmarketcap.com/static/img/coins/64x64/9674.png', but this sits within [data][9674][logo].
But as this script is running in the background for other objects, I won't know what the number [9674] is for other requests.
So is there a way to get that number automatically?
[data] will always be consistent.
I'm using this to get the data back:
session = Session()
session.headers.update(headers)
response = session.get(url, params=parameters)
pprint.pprint(json.loads(response.text)['data']['9674']['logo'])
You can try this:
from requests import Session  # assuming requests' Session, as in the question
import json
import pprint

session = Session()
session.headers.update(headers)
response = session.get(url, params=parameters)
resp = json.loads(response.text)
pprint.pprint(resp['data'][next(iter(resp['data']))]['logo'])
where next(iter(resp['data'])) returns the first key in the resp['data'] dict. In your example that is '9674'.
With .keys() you get a view of all keys in a dictionary.
So you can use keys = json.loads(response.text)['data'].keys() to get the keys in the data dict.
If you know there is always only one entry in 'data', you could use json.loads(response.text)['data'][list(keys)[0]]['logo'] (in Python 3, a keys view cannot be indexed directly, so convert it to a list first). Otherwise you would need to iterate over all keys and check which one you need.
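For completeness, a minimal sketch combining both suggestions, assuming response is the same requests response as in the question; it covers both a single entry and several entries under 'data':

import json

resp = json.loads(response.text)

# Single entry: take whichever key 'data' happens to contain
first_key = next(iter(resp['data']))
print(resp['data'][first_key]['logo'])

# Several entries: collect every logo, keyed by the coin id
logos = {coin_id: info['logo'] for coin_id, info in resp['data'].items()}
print(logos)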
I found myself not understanding how I can select only some elements of my Twitch API request response.
Here is the code that makes a successful request to the Twitch API, with the results included. It is not reproducible because client_id is personal information.
import requests

# All online streamers
client_id = "...confidential"
limit = "2"

def request_dataNewAPI(limit):
    headers = {"Client-ID": client_id, "Accept": "application/vnd.twitchtv.v5+json"}
    url = "https://api.twitch.tv/helix/streams?first=" + limit
    r = requests.get(url, headers=headers).json()
    return r

# If a bad user login name is given or the user is offline, the response will be:
# {'data': [], 'pagination': {}}
table1 = request_dataNewAPI(limit)
The output is:
New API
{'data': [{'id': '34472839600', 'user_id': '12826', 'user_name': 'Twitch', 'game_id': '509663', 'community_ids': ['f261cf73-cbcc-4b08-af72-c6d2020f9ed4'], 'type': 'live', 'title': 'The 1st Ever 3rd or 4th Pre Pre Show! Part 6', 'viewer_count': 19555, 'started_at': '2019-06-10T02:01:20Z', 'language': 'en', 'thumbnail_url': 'https://static-cdn.jtvnw.net/previews-ttv/live_user_twitch-{width}x{height}.jpg', 'tag_ids': ['d27da25e-1ee2-4207-bb11-dd8d54fa29ec', '6ea6bca4-4712-4ab9-a906-e3336a9d8039']}, {'id': '34474693232', 'user_id': '39298218', 'user_name': 'dakotaz', 'game_id': '33214', 'community_ids': [], 'type': 'live', 'title': '𝙖𝙙𝙫𝙚𝙣𝙩𝙪𝙧𝙚 𝙩𝙞𝙢𝙚 | code: dakotaz in itemshop & GFUEL', 'viewer_count': 15300, 'started_at': '2019-06-10T06:37:02Z', 'language': 'en', 'thumbnail_url': 'https://static-cdn.jtvnw.net/previews-ttv/live_user_dakotaz-{width}x{height}.jpg', 'tag_ids': ['6ea6bca4-4712-4ab9-a906-e3336a9d8039']}], 'pagination': {'cursor': 'eyJiIjpudWxsLCJhIjp7Ik9mZnNldCI6Mn19'}}
The problem is that I want to select only the list of 'user_name' of active streamers. I tried the following:
print(table1['data']['user_name'])
gives "TypeError: list indices must be integers or slices, not str".
print(table1['data'])
gives the whole array of data:
[{'id': '34472839600', 'user_id': '12826', 'user_name': 'Twitch', 'game_id': '509663', 'community_ids': ['f261cf73-cbcc-4b08-af72-c6d2020f9ed4'], 'type': 'live', 'title': 'The 1st Ever 3rd or 4th Pre Pre Show! Part 6', 'viewer_count': 19555, 'started_at': '2019-06-10T02:01:20Z', 'language': 'en', 'thumbnail_url': 'https://static-cdn.jtvnw.net/previews-ttv/live_user_twitch-{width}x{height}.jpg', 'tag_ids': ['d27da25e-1ee2-4207-bb11-dd8d54fa29ec', '6ea6bca4-4712-4ab9-a906-e3336a9d8039']}, {'id': '34474693232', 'user_id': '39298218', 'user_name': 'dakotaz', 'game_id': '33214', 'community_ids': [], 'type': 'live', 'title': '𝙖𝙙𝙫𝙚𝙣𝙩𝙪𝙧𝙚 𝙩𝙞𝙢𝙚 | code: dakotaz in itemshop & GFUEL', 'viewer_count': 15300, 'started_at': '2019-06-10T06:37:02Z', 'language': 'en', 'thumbnail_url': 'https://static-cdn.jtvnw.net/previews-ttv/live_user_dakotaz-{width}x{height}.jpg', 'tag_ids': ['6ea6bca4-4712-4ab9-a906-e3336a9d8039']}]
As a final result, I would like to have something like:
'user_name': {name1, name2}
The problem is that I want to select only the list of 'user_name' of active streamers
... print(table1['data']['user_name'])
... gives "TypeError: list indices must be integers or slices, not str".
You receive a TypeError because table1['data'] is a list, not a dict, so you must access its members with an int index, not a str (although dict keys can be ints as well).
Use list comprehension:
user_names = [x['user_name'] for x in table1['data']]
This will give you the list of strings representing user names.
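If you want something closer to the 'user_name': {name1, name2} shape mentioned in the question, you can wrap that list in a dict; a small sketch reusing table1 from above (a list is used here to preserve order):

user_names = [x['user_name'] for x in table1['data']]

# e.g. {'user_name': ['Twitch', 'dakotaz']}
result = {'user_name': user_names}
print(result)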
I have a ton of dicts that I have converted from Twitter JSON data. Now I want to turn them into one .csv file. I searched the site, but the solutions seem to fit dicts with very few values or dicts that already exist. In my case the number of keys is a little higher, and I also have to go through an iterative process to turn each JSON file into a dict. In other words, I want to write each of my JSON files to my .csv file as soon as I turn it into a dict, in an iterative process.
Here's my code so far:
json_path = "C://Users//msalj//OneDrive//Desktop//pypr//Tweets"

for filename in os.listdir(json_path):
    with open(filename, 'r') as infh:
        for data in json_parse(infh):
and here is a sample of my converted JSON files:
{'actor': {'displayName': 'RIMarkable',
'favoritesCount': 0,
'followersCount': 0,
'friendsCount': 0,
'id': 'id:twitter.com:3847371',
'image': 'Picture_13.png',
'languages': ['en'],
'link': 'ht........ble',
'links': [{'href': 'htt.....m', 'rel': 'me'}],
'listedCount': 0,
'objectType': 'person',
'postedTime': '2007-01-09T02:53:35.000Z',
'preferredUsername': 'RIMarkable',
'statusesCount': 0,
'summary': 'The Official, Unofficial BlackBerry Weblog',
'twitterTimeZone': 'Eastern Time (US & Canada)',
'utcOffset': '0',
'verified': False},
'body': 'Jim Balsillie To Present At JP Morgan Technology Conference: Research in Motion co-CEO, Jim Balsillie,.. ht...qo',
'generator': {'displayName': 'twitterfeed', 'link': 'htt......om'},
'gnip': {'matching_rules': [{'tag': None, 'value': '"JP Morgan"'}]},
'id': 'tag:search.twitter.com,2005:66178882',
'link': 'ht...82',
'object': {'id': 'object:search.twitter.com,2005:66178882',
'link': 'ht.....82',
'objectType': 'note',
'postedTime': '2007-05-16T19:00:24.000Z',
'summary': 'Jim Balsillie To Present At JP Morgan Technology Conference: Research in Motion co-CEO, Jim Balsillie,.. ht......qo'},
'objectType': 'activity',
'postedTime': '2007-05-16T19:00:24.000Z',
'provider': {'displayName': 'Twitter',
'link': 'ht......m',
'objectType': 'service'},
'retweetCount': 0,
'twitter_entities': {'hashtags': [],
'urls': [{'expanded_url': None,
'indices': [105, 130],
'url': 'htt.......5qo'}],
'user_mentions': []},
'verb': 'post'}
Can anybody help me with the code? Thanks a lot!
With varying depths, if you want to keep everything, this problem gets a little more complicated.
What I've done for this issue is flatten the dictionary.
def flatten_dict(input_dict):
    flat_dict = {}
    for k, v in input_dict.items():
        if isinstance(v, dict):
            # Recurse into nested dicts and merge their flattened keys
            for k2, v2 in flatten_dict(v).items():
                flat_dict[k2] = v2
        elif any([isinstance(v, c_type) for c_type in [list, tuple]]):
            for index, i in enumerate(v):
                flat_dict["{}-{}".format(k, index)] = i
        elif any([isinstance(v, c_type) for c_type in [str, int, float]]):
            flat_dict[k] = v
        else:
            print("unknown type, add handling for: {}".format(type(v)))
    return flat_dict
then I'll use the first json instance to create a header row:
header_row = [k for k in flatten_dict(row1)]
and print the header row to the csv
",".join(header_row)
and print the data in the same order for each json row afterwards:
for row in rows:
    flat_row = flatten_dict(row)
    print_row = ",".join([str(flat_row[header]) if header in flat_row else "" for header in header_row])
Here is the script:
from bs4 import BeautifulSoup as bs4
import requests
import json
from lxml import html
from pprint import pprint
import re
def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    # res = soup.findAll('script') # find all scripts..
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    return script.text[23:]

test1 = get_data()
data = json.loads(test1)

for item1 in data['items']:
    data1 = item1['itemList']['items']
    for item2 in data1:
        pitch_a = item2['opponentAName']
        pitch_b = item2['opponentBName']
##        group = item2['displayGroups']
##        for item3 in group:
##            new_il = item3['itemList']
##            for item4 in new_il:
##                market = item4['description']
##                oc = item4['outcomes']
        print(pitch_a, pitch_b)
##for items in data['items']:
## pos = items['itemList']['items']
## for item in pos:
## work = item['competitors']
## pitcher_a = item['opponentAName']
## pitcher_b = item['opponentBName']
## group = item['displayGroups']
## for item, item2 in zip(work,group):
## team = item['abbreviation']
## place = item['type']
## il2 = item2['itemList']
## for item in il2:
## ml = item['description']
## print(team,place,pitcher_a,pitcher_b,ml)
I have been trying to scrape:
team abbrev = ['items']['itemList']['items']['competitors']['abbreviation']
home_away = ['items']['itemList']['items']['competitors']['type']
team pitcher home = ['items']['itemList']['items']['opponentAName']
team pitcher away = ['items']['itemList']['items']['opponentBName']
moneyline american odds = ['items']['itemList']['items']['displayGroups']['itemList']['outcomes']['price']['american']
Total runs = ['items']['itemList']['items']['displayGroups']['itemList']['outcomes']['price']['handicap']
Part of the JSON, pprinted:
[{'baseLink': '/baseball/mlb/game-lines-market-group',
'defaultType': True,
'description': 'Game Lines',
'id': '136',
'itemList': {'items': [{'LIVE': True,
'atmosphereLink': '/api/atmosphere/eventNotification/events/A/3149961',
'awayTeamFirst': True,
'baseLink': '/baseball/mlb/minnesota-twins-los-angeles-angels-201805112207',
'competitionId': '24736',
'competitors': [{'abbreviation': 'LAA',
'description': 'Los Angeles Angels',
'id': '3149961-1642',
'rotationNumber': '978',
'shortName': 'Angels',
'type': 'HOME'},
{'abbreviation': 'MIN',
'description': 'Minnesota Twins',
'id': '3149961-9990',
'rotationNumber': '977',
'shortName': 'Twins',
'type': 'AWAY'}],
'denySameGame': 'NO',
'description': 'Minnesota Twins # Los Angeles Angels',
'displayGroups': [{'baseLink': '/baseball/mlb/game-lines-market-group',
'defaultType': True,
'description': 'Game Lines',
'id': '136',
'itemList': [{'belongsToDefault': True,
'columns': 'H2Columns',
'description': 'Moneyline',
'displayGroups': '136,A-136',
'id': '46892277',
'isInRunning': True,
'mainMarketType': 'MONEYLINE',
'mainPeriod': True,
'marketTypeGroup': 'MONEY_LINE',
'notes': '',
'outcomes': [{'competitorId': '3149961-9990',
'description': 'Minnesota '
'Twins',
'id': '211933276',
'price': {'american': '-475',
'decimal': '1.210526',
'fractional': '4/19',
'id': '1033002124',
'outcomeId': '211933276'},
'status': 'OPEN',
'type': 'A'},
{'competitorId': '3149961-1642',
'description': 'Los '
'Angeles '
'Angels',
'id': '211933277',
'price': {'american': '+310',
'decimal': '4.100',
'fractional': '31/10',
'id': '1033005679',
'outcomeId': '211933277'},
'status': 'OPEN',
'type': 'H'}],
'periodType': 'Live '
'Match',
'sequence': '14',
'sportCode': 'BASE',
'status': 'OPEN',
'type': 'WW'},
{'belongsToDefault': True,
'columns': 'H2Columns',
'description': 'Runline',
'displayGroups': '136,A-136',
'id': '46892287',
'isInRunning': True,
'mainMarketType': 'SPREAD',
'mainPeriod': True,
'marketTypeGroup': 'SPREAD',
'notes': '',
'outcomes': [{'competitorId': '3149961-9990',
'description': 'Minnesota '
'Twins',
'id': '211933278',
'price': {'american': '+800',
'decimal': '9.00',
'fractional': '8/1',
'handicap': '-1.5',
'id': '1033005677',
'outcomeId': '211933278'},
'status': 'OPEN',
'type': 'A'},
{'competitorId': '3149961-1642',
'description': 'Los '
'Angeles '
'Angels',
'id': '211933279',
'price': {'american': '-2000',
'decimal': '1.050',
'fractional': '1/20',
'handicap': '1.5',
'id': '1033005678',
'outcomeId': '211933279'},
'status': 'OPEN',
'type': 'H'}],
'periodType': 'Live '
'Match',
'sequence': '14',
'sportCode': 'BASE',
'status': 'OPEN',
'type': 'SPR'}],
'link': '/baseball/mlb/game-lines-market-group'}],
'feedCode': '13625145',
'id': '3149961',
'link': '/baseball/mlb/minnesota-twins-los-angeles-angels-201805112207',
'notes': '',
'numMarkets': 2,
'opponentAId': '214704',
'opponentAName': 'Tyler Skaggs (L)',
'opponentBId': '215550',
'opponentBName': 'Lance Lynn (R)',
'sport': 'BASE',
'startTime': 1526090820000,
'status': 'O',
'type': 'MLB'},
There are a few different loops I had started in the script above, but neither of them is working out the way I would like.
Eventually I would like the output to be: away team | away moneyline | away pitcher | Total Runs | and the same repeated for the Home Team. I can write to CSV once it is parsed the proper way.
Thank you for the fresh set of eyes; I've been working on this for the better part of a day trying to figure out the best way to access the content I would like. If JSON is not the best route and bs4 works better, I would love to hear your opinion.
There's no simple answer to your problem. Scraping data requires you to carefully assess the data you are dealing with, work out where the parts you want to extract are located and figure out how to effectively store the data you extract.
Try printing the data in your loops to visualise what is happening in your code (or try debugging). From there it's easy to figure out if you're iterating over what you expect. Look for patterns throughout the input data to help organise the data you extract.
To help yourself, you should give your variables descriptive names, separate your code into logical chunks and add comments when it starts to get complicated.
Here's some working code, but I encourage you to try what I told you above, then if you're still stuck look below for guidance.
output = {}
root = data['items'][0]
for game_line in root['itemList']['items']:
    # Create a temporary dict to store the data for this gameline
    team_data = {}

    # Get competitors
    competitors = game_line['competitors']
    for team in competitors:
        team_type = team['type']  # either HOME or AWAY
        # Create a new dict to store data for each team
        team_data[team_type] = {}
        team_data[team_type]['abbreviation'] = team['abbreviation']
        team_data[team_type]['name'] = team['description']

    # Get MoneyLine and Total Runs
    for item in game_line['displayGroups'][0]['itemList']:
        for outcome in item['outcomes']:
            team_type = outcome['type']  # either A or H
            team_type = 'HOME' if team_type == 'H' else 'AWAY'
            if item['mainMarketType'] == 'MONEYLINE':
                team_data[team_type]['moneyline'] = outcome['price']['american']
            elif item['mainMarketType'] == 'SPREAD':
                team_data[team_type]['total runs'] = outcome['price']['handicap']

    # Get the pitchers
    team_data['HOME']['pitcher'] = game_line['opponentAName']
    team_data['AWAY']['pitcher'] = game_line['opponentBName']

    # For each gameline, add the teamdata we gathered to the output dict
    output[game_line['description']] = team_data
This produces output like:
{
'Atlanta Braves # Miami Marlins': {
'AWAY': {
'abbreviation': 'ATL',
'moneyline': '-130',
'name': 'Atlanta Braves',
'pitcher': 'Mike Soroka (R)',
'total runs': '-1.5'
},
'HOME': {
'abbreviation': 'MIA',
'moneyline': '+110',
'name': 'Miami Marlins',
'pitcher': 'Jarlin Garcia (L)',
'total runs': '1.5'
}
},
'Boston Red Sox # Toronto Blue Jays': {
'AWAY': {
'abbreviation': 'BOS',
'moneyline': '-133',
'name': 'Boston Red Sox',
'pitcher': 'David Price (L)',
'total runs': '-1.5'
},
'HOME': {
'abbreviation': 'TOR',
'moneyline': '+113',
'name': 'Toronto Blue Jays',
'pitcher': 'Marco Estrada (R)',
'total runs': '1.5'
}
},
}
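If you later want to write that nested output dict to CSV, one possible sketch (the file name and column order here are my own choice, not anything from the API):

import csv

with open('gamelines.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['game', 'side', 'team', 'abbreviation', 'moneyline', 'total runs', 'pitcher'])
    for game, sides in output.items():
        for side in ('AWAY', 'HOME'):
            info = sides.get(side, {})
            writer.writerow([game, side, info.get('name'), info.get('abbreviation'),
                             info.get('moneyline'), info.get('total runs'), info.get('pitcher')])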
How do I merge a specific value from one array of dicts into another array of dicts if a single specific value matches between them?
I have an array of dicts that represent books
books = [{'writer_id': '123-456-789', 'index': None, 'title': 'Yellow Snow'}, {'writer_id': '888-888-777', 'index': None, 'title': 'Python for Dummies'}, {'writer_id': '999-121-223', 'index': 'Foo', 'title': 'Something Else'}]
and I have an array of dicts that represents authors
authors = [{'roles': ['author'], 'profile_picture': None, 'author_id': '123-456-789', 'name': 'Pat'}, {'roles': ['author'], 'profile_picture': None, 'author_id': '999-121-223', 'name': 'May'}]
I want to take the name from authors and add it to the dict in books where the book's writer_id matches the author's author_id.
My end result would ideally change the books array of dicts to be (notice the first dict now has the value 'name': 'Pat' and the third book has 'name': 'May'):
books = [{'writer_id': '123-456-789', 'index': None, 'title': 'Yellow Snow', 'name': 'Pat'}, {'writer_id': '888-888-777', 'index': None, 'title': 'Python for Dummies'}, {'writer_id': '999-121-223', 'index': 'Foo', 'title': 'Something Else', 'name': 'May'}]
My current solution is:
for book in books:
    for author in authors:
        if book['writer_id'] == author['author_id']:
            book['author_name'] = author['name']
And this works. However, the nested statements bother me and feel unwieldy. I also have a number of other such structures so I end up with a function that has a bunch of code resembling this in it:
for book in books:
    for author in authors:
        if book['writer_id'] == author['author_id']:
            book['author_name'] = author['name']

books_with_foo = []
for book in books:
    for thing in things:
        if something:
            # do something

for blah in books_with_foo:
    for book_foo in otherthing:
        if blah['bar'] == stuff['baz']:
            # etc, etc.
Alternatively, how would you aggregate data from multiple database tables into one thing... some of the data comes back as dicts, some as arrays of dicts?
Pandas is almost definitely going to help you here. Convert your dicts to DataFrames for easier manipulation, then merge them:
import pandas as pd
authors = [{'roles': ['author'], 'profile_picture': None, 'author_id': '123-456-789', 'name': 'Pat'}, {'roles': ['author'], 'profile_picture': None, 'author_id': '999-121-223', 'name': 'May'}]
books = [{'writer_id': '123-456-789', 'index': None, 'title': 'Yellow Snow'}, {'writer_id': '888-888-777', 'index': None, 'title': 'Python for Dummies'}, {'writer_id': '999-121-223', 'index': 'Foo', 'title': 'Something Else'}]
df1 = pd.DataFrame.from_dict(books)
df2 = pd.DataFrame.from_dict(authors)
df1['author_id'] = df1.writer_id
df1 = df1.set_index('author_id')
df2 = df2.set_index('author_id')
result = pd.concat([df1, df2], axis=1)
You may find the pandas documentation on merging helpful for different ways of combining (merging, concatenating, etc.) separate DataFrames.
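As one of those alternatives, a small sketch using pd.merge with a left join, which keeps books that have no matching author (their name simply comes back as NaN); it builds fresh DataFrames from the books and authors lists defined above:

import pandas as pd

books_df = pd.DataFrame(books)
authors_df = pd.DataFrame(authors)[['author_id', 'name']]

# Left join: every book is kept, and matched authors contribute their 'name'
merged = books_df.merge(authors_df, left_on='writer_id', right_on='author_id', how='left')
books_with_names = merged.to_dict('records')
print(books_with_names)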