I have a Selenium-based web crawler application that monitors over 100 different medical publications, with more being added regularly. Each of these publications has a different site structure, so I've tried to make the web crawler as general and re-usable as possible (especially because this is intended for use by other colleagues). For each crawler, the user specifies a list of regex URL patterns that the crawler is allowed to crawl. From there, the crawler will grab any links found as well as specified sections of the HTML. This has been useful in downloading large amounts of content in a fraction of the time it would take to do manually.
I'm now trying to figure out a way to generate custom reports based on the HTML of a certain page. For example, when crawling X site, export a JSON file that shows the number of issues on the page, the name of each issue, the number of articles under each issue, then the title and author names of each of those articles. The page I'll use as an example and test case is https://www.paediatrieschweiz.ch/zeitschriften/
I've created a list comprehension that produces my desired output.
url = "https://www.paediatrieschweiz.ch/zeitschriften/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
report = [{
'Issue': (unit.find('p', {'class': 'section__spitzmarke'}).text).strip(),
'Articles': [{
'Title': ((article.find('h3', {'class': 'teaser__title'}).text).strip()),
'Author': ((article.find('p', {'class': 'teaser__authors'}).text).strip())
} for article in unit.find_all('article')]}
for unit in soup.find_all('div', {'class': 'section__inner'})]
Sample JSON output:
[
{
"Issue": "Paediatrica Vol. 33-3/2022",
"Articles": [
{
"Title": "Editorial",
"Author": "Daniela Kaiser, Florian Schaub"
},
{
"Title": "Interview mit Dr. med. Germann Clenin",
"Author": "Florian Schaub, Daniela Kaiser"
},
{
"Title": "Ern\u00e4hrung f\u00fcr sportliche Kinder und Jugendliche",
"Author": "Simone Reber"
},
{
"Title": "Diabetes und Sport",
"Author": "Paolo Tonella"
},
{
"Title": "Asthma und Belastung",
"Author": "Isabelle Rochat"
},
{
"Title": "Sport bei Kindern und Jugendlichen mit rheumatischer Erkrankung",
"Author": "Daniela Kaiser"
},
{
"Title": "H\u00e4mophilie und Sport",
"Author": "Denise Etzweiler, Manuela Albisetti"
},
{
"Title": "Apophysen - die Achillesferse junger Sportler",
"Author": "Florian Schaub"
},
{
"Title": "R\u00fcckenschmerzen bei Athleten im Wachstumsalter",
"Author": "Markus Renggli"
},
{
"Title": "COVID-19 bei jugendlichen AthletenInnen: Diagnose und Return to Sports",
"Author": "Susi Kriemler, Jannos Siaplaouras, Holger F\u00f6rster, Christine Joisten"
},
{
"Title": "Schutz von Kindern und Jugendlichen im Sport",
"Author": "Daniela Kaiser"
}
]
},
{
"Issue": "Paediatrica Vol. 33-2/2022",
"Articles": [
{
"Title": "Editorial",
"Author": "Caroline Schnider, Jean-Cristoph Caubet"
},
{
"Title": "Der prim\u00e4re Immundefekt \u2013 Ein praktischer Leitfaden f\u00fcr den Kinderarzt",
"Author": "Tiphaine Arlabosse, Katerina Theodoropoulou, Fabio Candotti"
},
{
"Title": "Arzneimittelallergien bei Kindern: was sollten Kinder\u00e4rzte wissen?",
"Author": "Felicitas Bellutti Enders, Mich\u00e8le Roth, Samuel Roethlisberger"
},
{
"Title": "Orale Immuntherapie bei Nahrungsmittelallergien im Kindesalter",
"Author": "Yannick Perrin, Caroline Roduit"
},
{
"Title": "Autoimmunit\u00e4t in der P\u00e4diatrie: \u00dcberlegungen der p\u00e4diatrischen Rheumatologie",
"Author": "Florence A. Aeschlimann, Raffaella Carlomagno"
However, if possible I'd like to avoid using a custom Python statement or function for each individual crawler, as each would require different code. Each crawler has its own JSON config file which specifies the start URL, allowed URL patterns, desired content to download, etc.
My initial idea was to create a JSON config to specify the Beautiful Soup elements to scrape and store in a dictionary - something that a colleague who does not write code would be able to set up. My idea was something like this...
{
"name": "Paedriactia",
"unit": {
"selector": {
"name": "div",
"attrs": {"class": "section__inner"},
"find_all": true
},
"items": {
"issue": {
"name": "p", "attrs": {"class": "section__spitzmarke"}, "get_text": true
}
},
"subunits": {
"articles": {
"selector": {
"name": "article",
"find_all": true
},
"items": {
"Title": {
"name": "h3",
"attrs": {"class": "teaser__title"},
"get_text": true
}
}
}
}
}
}
...along with a Python function that would be able to parse the HTML according to the config and produce a JSON output.
However, I'm at a total loss as to how to do this, especially when it comes to handling nested elements. Each time I've tried to approach this on my own I've totally confused myself and have ended up back at the start.
If any of this makes sense, would anyone have any advice or ideas on how to approach this sort of Beautiful Soup config?
I'm also fairly proficient in writing code with Beautiful Soup, so I'm open to the idea of writing custom Beautiful Soup functions and statements for each crawler (and perhaps even training others to do the same). If I take this route, where would be the best place to store all of that custom code and import it? Would I be able to have some sort of "parse.py" file in each crawler folder and import its function only as needed (i.e., when that specific crawler is run)?
If helpful, an example of the current structure of the web crawler projects is below:
With BeautifulSoup, I strongly prefer using select and select_one to using find_all and find when scraping nested elements. (If you're not used to working with CSS selectors, I find the w3schools reference page to be a good cheatsheet for them.)
If you defined a function like
def getSoupData(mSoup, dataStruct, maxDepth=None, curDepth=0):
    if type(dataStruct) != dict:
        # so selector/targetAttr can also be sent as a single string
        if str(dataStruct).startswith('"ta":'):
            dKey = 'targetAttr'
        else:
            dKey = 'cssSelector'
        dataStruct = str(dataStruct).replace('"ta":', '', 1)
        dataStruct = {dKey: dataStruct}

    # default values: isList=False, items={}
    isList = dataStruct['isList'] if 'isList' in dataStruct else False
    if 'items' in dataStruct and type(dataStruct['items']) == dict:
        items = dataStruct['items']
    else: items = {}

    # no selector -> just use the input directly
    if 'cssSelector' not in dataStruct:
        soup = mSoup if type(mSoup) == list else [mSoup]
    else:
        soup = mSoup.select(dataStruct['cssSelector'])
        # so that unneeded parts are not processed:
        if not isList: soup = soup[:1]

    # return empty nothing was selected
    if not soup: return [] if isList else None

    # return text or attribute values - no more recursion
    if items == {}:
        if 'targetAttr' in dataStruct:
            targetAttr = dataStruct['targetAttr']
        else:
            targetAttr = '"text"' # default
        if targetAttr == '"text"':
            sData = [s.get_text(strip=True) for s in soup]
        # can put in more options with elif
        else:
            sData = [s.get(targetAttr) for s in soup]
        return sData if isList else sData[0]

    # return error - recursion limited
    if maxDepth is not None and curDepth > maxDepth:
        return {
            'errorMsg': f'Maximum [{maxDepth}] exceeded at depth={curDepth}'
        }

    # recursively get items
    sData = [dict([(i, getSoupData(
        s, items[i], maxDepth, curDepth + 1
    )) for i in items]) for s in soup]
    return sData if isList else sData[0]
    # return list only if isList is set
then, because the function is recursive, you can make your data structure as nested as your HTML structure (if you want that for some reason). You can also set maxDepth to limit how deep the nesting can go; if you don't want to set any limit, you can remove both maxDepth and curDepth along with every part of the function that involves them.
Then, you can make your config file something like
{
"name": "Paedriactia",
"data_structure": {
"cssSelector": "div.section__inner",
"items": {
"Issue": "p.section__spitzmarke",
"Articles": {
"cssSelector": "article",
"items": {
"Title": "h3.teaser__title",
"Author": "p.teaser__authors"
},
"isList": true
}
},
"isList": true
},
"url": "https://www.paediatrieschweiz.ch/zeitschriften/"
}
["isList": true here is equivalent to your "find_all": true; and your "subunits" are also defined as "items" here - the function can differentiate based on the structure/dataType.]
Now the same data that you showed [at the beginning of your question] can be extracted with
import json
import requests
from bs4 import BeautifulSoup

with open('crawlerConfig_paedriactia.json', 'r') as f:
    configC = json.load(f)
url = configC['url']
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
dStruct = configC['data_structure']
getSoupData(soup, dStruct)
For this example, you could add the article links by adding {"cssSelector": "a.teaser__inner", "targetAttr": "href"} as ...Articles.items.Link.
Also, note that [because of the defaults set at the beginning in the function], "Title": "h3.teaser__title" is the same as
"Title": { "cssSelector": "h3.teaser__title" }
and
"Link": "\"ta\":href"
would be the same as
"Link": {"targetAttr": "href"}
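To make the shorthand rule concrete, here is a minimal sketch of just the normalization step at the top of getSoupData, pulled out into a stand-alone helper. The name expand_shorthand is hypothetical (it is not part of the function above), and the sketch only mirrors the string handling; it does not touch any HTML.

```python
def expand_shorthand(spec):
    """Return spec as a dict, expanding the string shorthand if needed.

    Hypothetical helper mirroring the first lines of getSoupData:
    a plain string is treated as a CSS selector, unless it starts
    with the marker "ta": (quotes included), in which case the rest
    of the string is treated as the attribute to read.
    """
    if isinstance(spec, dict):
        return spec  # already in the long form, nothing to expand
    text = str(spec)
    prefix = '"ta":'
    if text.startswith(prefix):
        return {"targetAttr": text[len(prefix):]}
    return {"cssSelector": text}


print(expand_shorthand("h3.teaser__title"))
print(expand_shorthand('"ta":href'))
print(expand_shorthand({"cssSelector": "article", "isList": True}))
```

Keeping the marker inside the JSON string (hence the escaped quotes in the config) is what lets a single string field carry either a selector or an attribute name.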
I am trying to loop over a list containing Twitter data in JSON format. The list is made up of several dictionaries, each containing data on one politician. The code works if the input json_response only holds data on one politician. However, when json_response is a list of dictionaries, I get an error.
In short, I believe the issue can be isolated to three for-loops in the code for tweet in json_response['data']:, for dics in json_response['includes']['users']:, and for element in json_response['includes']['media']:.
# Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = search_query
start_time = "2016-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 3000
json_response = [] # empty list that will hold tweet objects
for i in keyword: # loop through list of politicians in keyword i.e. search query and extract tweets
    url = create_url(i, start_time, end_time, max_results)
    json_response.append(connect_to_endpoint(url[0], headers, url[1]))
I have only pasted the json_response object for 2 out of 30 politicians due to a cap on characters. However, the structure is the same for the remaining 28 politicians.
print(json.dumps(json_response, indent=4, sort_keys=True)) # look at json_response object.
[
{
"data": [
{
"author_id": "2877379617",
"created_at": "2021-03-25T12:11:14.000Z",
"id": "1375057688355336195",
"text": "#prettynobodyco She blocked me in 2015 - for pointing out that Tim Kaine enables sexual assault in the military and the evidence was his killing of the MJIA and publicly stated that Military commanders should remain in charge of military rape cases. She's Tanden level awful. Congrats!"
},
{
"author_id": "1265018154444562440",
"created_at": "2021-03-22T19:48:59.000Z",
"id": "1374085719472361474",
"text": "#MehcatCat #AlasscanIsBack #PattyArquette #timkaine Funny, they blocked me. \ud83e\udd23\ud83e\udd23"
},
{
"author_id": "2378324935",
"created_at": "2021-03-07T21:32:13.000Z",
"id": "1368675879312887810",
"text": "#DrWinarick #KatieOGrady4 I apologize for any drama. Katie O Grady blocked me because we had a disagreement about Tim Kaine on one of your older posts. I guess I can't please everyone haha. :/"
},
{
"author_id": "821870502943817729",
"created_at": "2021-02-12T23:53:59.000Z",
"id": "1360376637385244673",
"text": "She blocked me a long ass time ago when I asked her why we shoulf care about Tim Kaine's personal view on abortion if it didn't impact legislation"
},
{
"attachments": {
"media_keys": [
"16_1341045032732770306"
]
},
"author_id": "17232340",
"created_at": "2020-12-21T15:37:07.000Z",
"id": "1341045038420275205",
"text": "#DSingh4Biden #moomintroll8 #timkaine #GovernorVA That's why I replied to you. She blocked me previously, for what silliness I can't remember. Tough being a troll AND a snowflake!"
}
],
"includes": {
"media": [
{
"media_key": "16_1341045032732770306",
"type": "animated_gif"
}
],
"users": [
{
"created_at": "2014-11-15T02:23:57.000Z",
"description": "",
"id": "2877379617",
"name": "Laura Saylor",
"username": "lauraleesaylor"
},
{
"created_at": "2020-05-25T20:33:36.000Z",
"description": "Weird Writer & Lunatic Linguist\nWicked Witch of the East\nshe/her",
"id": "1265018154444562440",
"name": "Zauberkind",
"username": "Zauberkind2"
},
{
"created_at": "2014-03-08T07:22:31.000Z",
"description": "#Resist, #BLM, #Vaxxed, liberal, autistic, kidney transplant survivor, political nerd, mental health advocate, fighter for equality, truth, justice, etc.",
"id": "2378324935",
"name": "Trevor \"Trev\" McKee Achilles",
"username": "MrTAchilles"
},
{
"created_at": "2017-01-19T00:02:52.000Z",
"description": "statist / Progressive Gun Nut/ Single and hating it\n\n / \n\nstraight????? /\n\npronouns / brain worm survivor\n\n \n",
"id": "821870502943817729",
"name": "Squirrel Dad",
"username": "nihilisticpillo"
},
{
"created_at": "2008-11-07T15:09:46.000Z",
"description": "Liberal-Veteran-Dog Lover | Taste for irony, but in moderation | Humor is reason gone mad. ~Groucho Marx | I follow & unfollow back #VeteransResist #Resist",
"id": "17232340",
"name": "anti-Fascist Jim",
"username": "JimnBL"
}
]
},
"meta": {
"newest_id": "1375057688355336195",
"next_token": "b26v89c19zqg8o3foseug43lzoqdft4ghg78o9sn9ds3h",
"oldest_id": "1341045038420275205",
"result_count": 5
}
},
{
"data": [
{
"author_id": "1248251899884814336",
"created_at": "2021-03-27T13:36:45.000Z",
"id": "1375803982409576450",
"text": "#gavinjeffries0 #steven86026859 #MSNBC #SenBooker Uh Oh our friend Steve blocked me, I guess not being able to answer your simple question and being asked to was too much for him."
},
{
"author_id": "293104735",
"created_at": "2021-02-07T21:45:47.000Z",
"id": "1358532435122683904",
"text": "#slwilliams1101 #annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC #SenBooker #AliVelshi I stopped watching #TiffanyDCross as well and only watch #CapehartJ now (even though he blocked me in 2016 because I had a \"strong\" response to something mean he said about Hillary Clinton)."
},
{
"author_id": "380970864",
"created_at": "2021-02-07T20:58:01.000Z",
"id": "1358520416273326081",
"text": "#annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC After I criticized #TiffanyDCross she blocked me. #JoyAnnReid called herself petty during and interview with #SenBooker. Why be petty? Be mature and thoughtful so people can learn. Hosts need to learn too. I only watch #AliVelshi #CapehartJ now."
},
{
"attachments": {
"media_keys": [
"3_1358448920632909825"
]
},
"author_id": "793175035322171397",
"created_at": "2021-02-07T16:17:44.000Z",
"id": "1358449876565164034",
"text": "#FinstaManhattan #SenSchumer #SenBooker #RonWyden Lmao he blocked me over that. His bio said he likes to 'debate & that sometimes he's wrong but he can admit that'.\n\nGuess not.\n\nI wasn't rude or mean at all. This is too funny \ud83e\udd23"
},
{
"author_id": "752266160352010241",
"created_at": "2021-02-06T20:34:06.000Z",
"id": "1358152008948195328",
"text": "#fattypinner #tkbone32221 #SenSchumer #SenBooker #RonWyden He blocked me \ud83e\udd23\ud83d\ude2d\ud83e\udd23\ud83e\udd23\ud83e\udd23\ud83d\ude2d"
}
],
"includes": {
"media": [
{
"media_key": "3_1358448920632909825",
"type": "photo",
"url": ""
}
],
"users": [
{
"created_at": "2020-04-09T14:11:04.000Z",
"description": "",
"id": "1248251899884814336",
"name": "Firstcomm",
"username": "Firstcomm1"
},
{
"created_at": "2011-05-04T19:26:22.000Z",
"description": "Cinephile, balletomane, book lover, tennis fan, K-Drama fanatic, Jang Na-ra fangirl, USC School of Cinematic Arts alumna, Hillary Clinton and Nancy Pelosi Dem.",
"id": "293104735",
"name": "Joyce Tyler",
"username": "joyce_tyler"
},
{
"created_at": "2011-09-27T14:50:37.000Z",
"description": "Spelman College, BA, George Washington University MA, University of South Florida Ph.D. in Political Science, proud Ted Kennedy, Obama, Biden/Harris Democrat!",
"id": "380970864",
"name": "Stephanie L. Williams, Ph.D.",
"username": "slwilliams1101"
},
{
"created_at": "2016-10-31T19:37:19.000Z",
"description": "Loves: life, fam, cats, cars, tattoos, reality TV; collector of t-shirts & Volkswagen\u2019s. Hates: Oxford commas. #CombatVet #Medic #BidenHarris2020 #Resist",
"id": "793175035322171397",
"name": "Que Sarah Sarah \ud83d\udda4",
"username": "sarahalli13"
},
{
"created_at": "2016-07-10T22:20:03.000Z",
"description": "3x Hollywood Video Street Fighter 2 Champion",
"id": "752266160352010241",
"name": "Sugarcoder",
"username": "TheSugarCoder"
}
]
},
"meta": {
"newest_id": "1375803982409576450",
"next_token": "b26v89c19zqg8o3fosktkdplqiw2q9kzx2ibm4r4y27wd",
"oldest_id": "1358152008948195328",
"result_count": 5
}
}
...28 other politicians
# Create file
csvFile = open("tweet_sample.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Create headers for the data I want to save. I only want to save these columns in my dataset
csvWriter.writerow(
['author id', 'created_at', 'id', 'tweet', 'bio', 'image_url'])
csvFile.close()
def append_to_csv(json_response, fileName):
    # A counter variable
    global created_at, tweet_id, bio, text, author_id
    counter = 0
    # Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    # Loop through each tweet
    for tweet in json_response[0]['data']: # NOTE adding a 0 gives access to the data for the first politician while adding 1 gives access to data for the second politician and so on...
        # 1. Author ID
        author_id = tweet['author_id']
        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])
        # 3. Tweet ID
        tweet_id = tweet['id']
        # 4. Tweet text
        text = tweet['text']
    for dics in json_response[0]['includes']['users']: # NOTE 0 added
        # 5. description. Contained in includes data object
        if ('description' in dics):
            bio = dics['description']
        else:
            bio = " "
    for element in json_response[0]['includes']['media']: # NOTE 0 added
        # 6. image url. Contained in includes data object
        if ('url' in element):
            image_url = element['url']
        else:
            image_url = " "
    # Assemble all data in a list
    res = [author_id, created_at, tweet_id, text, bio, image_url]
    # Append the result to the CSV file
    csvWriter.writerow(res)
    counter += 1
    # When done, close the CSV file
    csvFile.close()
    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)

append_to_csv(json_response, "tweet_sample.csv") # Save tweet data in a csv file
Error message:
TypeError: list indices must be integers or slices, not str
By adding the [0] in the loop I avoid the TypeError above. However the output from the function append_to_csv is not ideal as it only includes the last tweet for the first politician. I guess my loop overwrites data.
The desired output would be a data frame with the columns author_id, created_at, id, tweet, bio and image_url. Not all users have a bio on their profile or an image URL in their tweet, hence the if-else statements in the function above and the no_bio and no_image_url placeholders in the desired data frame.
pol_df = pd.read_csv("path_to_tweet_sample.csv" )
pol_df.head()
   author_id           created_at                id                   tweet       bio     image_url
0  737885223858384896  2021-03-26T21:56:02.000Z  1375567243082338314  tweet_text  no_bio  no_image_url
1  847612931487416323  2021-03-26T21:55:24.000Z  1375567083791073283  tweet_text  no_bio  no_image_url
2  18634205            2021-03-08T12:29:00.000Z  1368901564363051010  tweet_text  bio     image_url
3  27327319            2021-03-02T11:53:16.000Z  1366718245521211393  tweet_text  bio     no_image_url
4  917634626247647232  2021-02-28T18:16:45.000Z  1366089974907432961  tweet_text  bio     image_url
I think you are confusing lists with dicts. When you try to access a list like a dict (e.g. data["author_id"]) the TypeError you're getting will be raised. You have to iterate over a list and then try to access each dict in that list like [x['author_id'] for x in data], for example. If you want to extract values from the dicts and write it to a csv file you might want to do something like this:
import pandas as pd

# resp is the list of API responses (json_response in the question)
author_data = []
for data in resp:
    for author in data['data']:
        author_id = author['author_id']
        created_at = author['created_at']
        another_id = author['id']
        tweet_text = author['text']
        author_data.append([author_id, created_at, another_id, tweet_text])
author_df = pd.DataFrame(author_data, columns=['author_id', 'created_at', 'id', 'text'])

media_data = []
for data in resp:
    for media in data['includes']['media']:
        url = media.get('url', 'no_url')
        media_data.append([url])  # store the URL itself, not the whole media dict
media_df = pd.DataFrame(media_data, columns=['url'])

bio_data = []
for data in resp:
    for user in data['includes']['users']:
        bio = user['description']
        author_id = user['id']
        bio_data.append([bio, author_id])
bio_df = pd.DataFrame(bio_data, columns=['bio', 'author_id'])

final_df = author_df.merge(bio_df, on="author_id")
print(final_df)
You have to save different parts of the data in different dataframes and then merge them. The thing is that media does not contain the author_id or another key that is shared between the ['includes']['media'] part and ['data'] part so you cannot merge that.
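As a possible alternative to separate data frames, here is a minimal sketch that builds one row per tweet with plain dictionaries as lookup tables. The media entries carry no author_id, but in the sample response in the question, tweets with attachments list media_keys that match media_key in ['includes']['media'], so the sketch assumes that link can be used. The function name flatten_responses, the sample data and the no_bio / no_image_url placeholders (also used for empty strings) are assumptions for illustration.

```python
import csv
import io


def flatten_responses(responses):
    """Build one row per tweet from a list of API response dicts."""
    rows = []
    for resp in responses:
        includes = resp.get("includes", {})
        # Lookup tables: user id -> bio, media_key -> image url.
        bios = {u["id"]: u.get("description") or "no_bio"
                for u in includes.get("users", [])}
        urls = {m["media_key"]: m.get("url") or "no_image_url"
                for m in includes.get("media", [])}
        for tweet in resp.get("data", []):
            keys = tweet.get("attachments", {}).get("media_keys", [])
            # Use the first attached media item that has a known key.
            image_url = next((urls[k] for k in keys if k in urls),
                             "no_image_url")
            rows.append([
                tweet["author_id"], tweet["created_at"], tweet["id"],
                tweet["text"], bios.get(tweet["author_id"], "no_bio"),
                image_url,
            ])
    return rows


# Made-up sample with the same shape as the response in the question.
sample = [{
    "data": [
        {"author_id": "1", "created_at": "2021-03-25T12:11:14.000Z",
         "id": "10", "text": "first"},
        {"author_id": "2", "created_at": "2021-03-22T19:48:59.000Z",
         "id": "11", "text": "second",
         "attachments": {"media_keys": ["3_1"]}},
    ],
    "includes": {
        "users": [{"id": "1", "description": ""},
                  {"id": "2", "description": "some bio"}],
        "media": [{"media_key": "3_1", "type": "photo",
                   "url": "https://example.com/a.jpg"}],
    },
}]

rows = flatten_responses(sample)
buffer = io.StringIO()  # stands in for the CSV file
writer = csv.writer(buffer)
writer.writerow(["author_id", "created_at", "id", "tweet", "bio", "image_url"])
writer.writerows(rows)
print(buffer.getvalue())
```

Because each row is assembled and appended inside the tweet loop, nothing is overwritten between iterations, and the outer loop covers every politician instead of only index 0.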
I have a set of ad creatives that I retrieve through the Facebook Business Python SDK. I need these specifically to retrieve the outbound URL that is opened when someone clicks on the ad: AdCreative['object_story_spec']['video_data']['call_to_action']['value']['link'].
I use the following call:
adcreatives = set.get_ad_creatives(fields=[
AdCreative.Field.id,
AdCreative.Field.name,
AdCreative.Field.object_story_spec,
AdCreative.Field.effective_object_story_id ,
])
Where set is an ad set.
For some cases, the result looks like this (with actual data removed), which is expected:
<AdCreative> {
"body": "[<BODY>]",
"effective_object_story_id": "[<EFFECTIVE_OBJECT_STORY_ID>]",
"id": "[<ID>]",
"name": "[<NAME>]",
"object_story_spec": {
"instagram_actor_id": "[<INSTAGRAM_ACTOR_ID>]",
"page_id": "[<PAGE_ID>]",
"video_data": {
"call_to_action": {
"type": "[<TYPE>]",
"value": {
"link": "[<LINK>]", <== This is what I need
"link_format": "[<LINK_FORMAT>]"
}
},
"image_hash": "[<IMAGE_HASH>]",
"image_url": "[<IMAGE_URL>]",
"message": "[<MESSAGE>]",
"video_id": "[<VIDEO_ID>]"
}
}
}
While sometimes results look like this:
<AdCreative> {
"effective_object_story_id": "[<EFFECTIVE_OBJECT_STORY_ID>]",
"id": "[<ID>]",
"name": "[<NAME>]",
"object_story_spec": {
"instagram_actor_id": "[<INSTAGRAM_ACTOR_ID>]",
"page_id": "[<PAGE_ID>]"
}
}
According to this earlier question (Can't get AdCreative ObjectStorySpec), this is because object_story_spec is not populated if it is linked to a creative instead of being created along with the creative.
However, the video_data (and with it the link) should be saved somewhere. Is there a way to retrieve it? Maybe through effective_object_story_id?
The documentation page for object_story_spec (https://developers.facebook.com/docs/marketing-api/reference/ad-creative-object-story-spec/v12.0) does not have the information I am looking for.
I am using StackAPI to get the most voted questions and the most voted answers to those questions:
from stackapi import StackAPI

SITE = StackAPI('stackoverflow')
SITE.max_pages=1
SITE.page_size=10
questions = SITE.fetch('questions', min=20, tagged='python', sort='votes')
for quest in questions['items']:
    if 'title' not in quest or quest['is_answered'] == False:
        continue
    title = quest['title']
    print('Question :- {0}'.format(title))
    question_id = quest['question_id']
    print('Question ID :- {0}'.format(question_id))
    top_answer = SITE.fetch('questions/' + str(question_id) + '/answers', order = 'desc', sort='votes')
    print('Most Voted Answer ID :- {0}'.format(top_answer['items'][0]['answer_id']))
Now using this answer_id I would like to get the body of that answer.
I can get the rest of the details by using this API link.
Refer to these posts on Stack Apps:
Get questions with body and answers
How to get Question/Answer body in the API response using filters?
My filter is not returning any results. How to create a minimal filter?
You need to use a custom filter to get question/answer/post bodies.
The good news is that you can also use the custom filter to get the answer data at the same time as you get the questions -- eliminating the need for later API calls.
For example, if you call the /questions route with the filter:
!*SU8CGYZitCB.D*(BDVIficKj7nFMLLDij64nVID)N9aK3GmR9kT4IzT*5iO_1y3iZ)6W.G*
You get results like:
"items": [ {
"tags": ["python", "iterator", "generator", "yield", "coroutine"],
"answers": [ {
"owner": {"user_id": 8458, "display_name": "Douglas Mayle"},
"is_accepted": false,
"score": 248,
"creation_date": 1224800643,
"answer_id": 231778,
"body": "<p><code>yield</code> is just like <code>return</code> - it returns what..."
}, {
"owner": {"user_id": 22656, "display_name": "Jon Skeet"},
"is_accepted": false,
"score": 139,
"creation_date": 1224800766,
"answer_id": 231788,
"body": "<p>It's returning a generator. I'm not particularly familiar with Python, ..."
}, {
...
} ],
"owner": {"user_id": 18300, "display_name": "Alex. S."},
"is_answered": true,
"accepted_answer_id": 231855,
"answer_count": 40,
"score": 8742,
"creation_date": 1224800471,
"question_id": 231767,
"title": "What does the \"yield\" keyword do?"
},
...
So, change this:
questions = SITE.fetch('questions', min=20, tagged='python', sort='votes')
To something like this:
questions = SITE.fetch('questions', min=20, tagged='python', sort='votes', filter='!*SU8CGYZitCB.D*(BDVIficKj7nFMLLDij64nVID)N9aK3GmR9kT4IzT*5iO_1y3iZ)6W.G*')
then adjust your for loop accordingly.
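As a rough sketch of what the adjusted loop could look like, the helper below picks the highest-scored answer for each question from a response shaped like the example above. It assumes the filter returns an answers list with score, answer_id and body for each question; the name top_answers and the trimmed sample data are only for illustration, and no API call is made here.

```python
def top_answers(questions):
    """Return the highest-scored answer (with its body) for each question."""
    results = []
    for quest in questions.get("items", []):
        answers = quest.get("answers", [])
        if not answers:
            continue  # unanswered question, or answers not included by the filter
        best = max(answers, key=lambda answer: answer["score"])
        results.append({
            "question_id": quest["question_id"],
            "title": quest.get("title"),
            "answer_id": best["answer_id"],
            "body": best.get("body"),
        })
    return results


# Trimmed-down sample in the shape of the response shown above.
sample_response = {
    "items": [{
        "question_id": 231767,
        "title": "What does the \"yield\" keyword do?",
        "answers": [
            {"answer_id": 231778, "score": 248, "body": "<p>first body</p>"},
            {"answer_id": 231788, "score": 139, "body": "<p>second body</p>"},
        ],
    }]
}

for entry in top_answers(sample_response):
    print(entry["question_id"], entry["answer_id"], entry["body"])
```

With the real client, the dict returned by SITE.fetch (called with the custom filter) would be passed in place of sample_response.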
I'm quite new to marshmallow, and my question is about handling dict-like objects. There are no workable examples in the marshmallow documentation. I came across a simple example here on Stack Overflow (Original question); below is the original code from the answer, which I suppose should be quite simple:
from marshmallow import Schema, fields, post_load, pprint
class UserSchema(Schema):
name = fields.String()
email = fields.Email()
friends = fields.List(fields.String())
class AddressBookSchema(Schema):
contacts =fields.Dict(keys=fields.String(),values=fields.Nested(UserSchema))
@post_load
def trans_friends(self, item):
for name in item['contacts']:
item['contacts'][name]['friends'] = [item['contacts'][n] for n in item['contacts'][name]['friends']]
data = """
{"contacts": {
"Steve": {
"name": "Steve",
"email": "steve@example.com",
"friends": ["Mike"]
},
"Mike": {
"name": "Mike",
"email": "mike@example.com",
"friends": []
}
}
}
"""
deserialized_data = AddressBookSchema().loads(data)
pprint(deserialized_data)
However, when I run the code I get the following NoneType value
`None`
The input hasn't been marshalled.
I'm using the latest beta version of marshmallow, 3.0.0b20. I can't find a way to make this work even though it looks so simple. The documentation seems to indicate that nested dictionaries are supported by the framework.
Currently I'm working on a cataloging application for Flask where I'm receiving JSON messages whose schema I can't really specify beforehand. My specific problem is the following:
data = """
{"book": {
"title": {
"english": "Don Quixiote",
"spanish": "Don Quijote"
},
"author": {
"first_name": "Miguel",
"last_name": "Cervantes de Saavedra"
}
},
"book": {
"title": {
"english": "20000 Leagues Under The Sea",
"french": "20000 Lieues Sous Le Mer",
"japanese": "海の下で20000リーグ",
"spanish": "20000 Leguas Bajo El Mar",
"german": "20000 Meilen unter dem Meeresspiegel",
"russian": "20000 лиг под водой"
},
"author": {
"first_name": "Jules",
"last_name": "Verne"
}
}
}
"""
This is just toy data but exemplifies that the keys in the dictionaries are not fixed, they change in number and text.
So the questions are: why am I getting the validation error in a simple, already worked example, and is it possible to use the marshmallow framework to validate my data?
Thanks
There are two issues in your code.
The first is the indentation of the post_load decorator. You introduced it when copying the code here, but you don't have it in the code you're running, otherwise you wouldn't get None.
The second is due to a documented change in marshmallow 3. pre/post_load/dump functions are expected to return the value rather than mutate it.
Here's a working version. I also reworked the decorator:
from marshmallow import Schema, fields, post_load, pprint

class UserSchema(Schema):
    name = fields.String()
    email = fields.Email()
    friends = fields.List(fields.String())

class AddressBookSchema(Schema):
    contacts = fields.Dict(keys=fields.String(),values=fields.Nested(UserSchema))

    @post_load
    def trans_friends(self, item, **kwargs):
        # **kwargs is accepted so the hook also works on marshmallow 3
        # releases that pass extra keyword arguments to processors
        for contact in item['contacts'].values():
            contact['friends'] = [item['contacts'][n] for n in contact['friends']]
        # marshmallow 3 expects the hook to return the value
        return item

data = """
{
    "contacts": {
        "Steve": {
            "name": "Steve",
            "email": "steve@example.com",
            "friends": ["Mike"]
        },
        "Mike": {
            "name": "Mike",
            "email": "mike@example.com",
            "friends": []
        }
    }
}
"""

deserialized_data = AddressBookSchema().loads(data)
pprint(deserialized_data)
And finally, the Dict field in marshmallow 2 doesn't have a key/value validation feature, so it will just silently ignore the keys and values arguments and perform no validation.