I am learning how to get data out of arrays, and I am slightly stuck on an easy way of locating the data I want to pull from the array. It feels like there should be an easier way than counting fields on the screen.
Here is what I have:
import requests
import numpy as np

r2 = requests.get(
    f'https://www.thesportsdb.com/api/v1/json/{apiKey}/lookupevent.php?id={id}')
arr_events = np.array([r2.json()])
# print(arr_events)
event_id = arr_events[0]['events'][0]['idEvent']
locate = arr_events.index('strHomeTeam')
print(locate)
The problem is, on the console this prints out a massive array that looks like (I'll give one line, you probably get the idea):
[{'events': [{'idEvent': '1032723', 'idSoccerXML': None, 'idAPIfootball': '592172', 'strEvent': 'Aston Villa vs Liverpool', 'strEventAlternate': 'Liverpool # Aston Villa', 'strFilename': 'English Premier League 2020-10-04 Aston Villa vs Liverpool'...}]}]
It's a sizeable array, enough to cause a minor slowdown if I need to pull some info.
So, idEvent was easy to pull using the method above. And if I wanted some of these others in the top line, it's probably not hard to count to 5 or 6. But I know there must be an easier way for Python to just locate the ones I want. For instance, I want the home and away team:
'strHomeTeam': 'Aston Villa', 'strAwayTeam': 'Liverpool',
So is there an easier way to just pull the 'strHomeTeam' rather than counting all the way to the point in the array?
I realise this is a basic question. I have searched and searched, but every example seems to use a single, really small array, and none of them explains how to get the data out of big arrays easily.
The JSON file is here: https://www.thesportsdb.com/api/v1/json/1/lookupevent.php?id=1032723
Thank you for your help on this - I appreciate it.
Try the below
data = {"events": [
{"idEvent": "1032723", "idSoccerXML": "", "idAPIfootball": "592172", "strEvent": "Aston Villa vs Liverpool",
"strEventAlternate": "Liverpool # Aston Villa",
"strFilename": "English Premier League 2020-10-04 Aston Villa vs Liverpool", "strSport": "Soccer",
"idLeague": "4328", "strLeague": "English Premier League", "strSeason": "2020-2021",
"strDescriptionEN": "Aston Villa and Liverpool square off at Villa Park, where last season, these teams produced one of the most exciting finishes of the campaign, as Liverpool scored twice late on to overturn an early Trezeguet goal.",
"strHomeTeam": "Aston Villa", "strAwayTeam": "Liverpool", "intHomeScore": "7", "intRound": "4",
"intAwayScore": "2", "intSpectators": "", "strOfficial": "", "strHomeGoalDetails": "", "strHomeRedCards": "",
"strHomeYellowCards": "", "strHomeLineupGoalkeeper": "", "strHomeLineupDefense": "",
"strHomeLineupMidfield": "", "strHomeLineupForward": "", "strHomeLineupSubstitutes": "",
"strHomeFormation": "", "strAwayRedCards": "", "strAwayYellowCards": "", "strAwayGoalDetails": "",
"strAwayLineupGoalkeeper": "", "strAwayLineupDefense": "", "strAwayLineupMidfield": "",
"strAwayLineupForward": "", "strAwayLineupSubstitutes": "", "strAwayFormation": "", "intHomeShots": "",
"intAwayShots": "", "strTimestamp": "2020-10-04T18:15:00+00:00", "dateEvent": "2020-10-04",
"dateEventLocal": "2020-10-04", "strDate": "", "strTime": "18:15:00", "strTimeLocal": "19:15:00",
"strTVStation": "", "idHomeTeam": "133601", "idAwayTeam": "133602", "strResult": "", "strVenue": "Villa Park",
"strCountry": "England", "strCity": "", "strPoster": "", "strFanart": "",
"strThumb": "https:\/\/www.thesportsdb.com\/images\/media\/event\/thumb\/r00vzl1601721606.jpg", "strBanner": "",
"strMap": "", "strTweet1": "https:\/\/twitter.com\/brfootball\/status\/1312843172385521665",
"strTweet2": "https:\/\/twitter.com\/TomJordan21\/status\/1312854281444306946",
"strTweet3": "https:\/\/twitter.com\/FutbolBible\/status\/1312847622592442370",
"strVideo": "https:\/\/www.youtube.com\/watch?v=0Nbw3jSafGM", "strStatus": "Match Finished", "strPostponed": "no",
"strLocked": "unlocked"}]}
filtered_data = [{'home': entry['strHomeTeam'], 'away': entry['strAwayTeam']} for entry in data['events']]
print(filtered_data)
Output:
[{'home': 'Aston Villa', 'away': 'Liverpool'}]
Ug... I tried something different and it worked - sigh... I am sorry.
event_id = arr_events[0]['events'][0]['idEvent']
home_team = arr_events[0]['events'][0]['strHomeTeam']
away_team = arr_events[0]['events'][0]['strAwayTeam']
home_score = arr_events[0]['events'][0]['intHomeScore']
away_score = arr_events[0]['events'][0]['intAwayScore']
I assume this is the right way to do it.
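As a side note, the NumPy wrapper is unnecessary here: r2.json() already returns nested dicts and lists that you can index by key. A minimal sketch, assuming the same r2 response as above:

data = r2.json()
event = data['events'][0]           # the first (and only) event dict
home_team = event['strHomeTeam']    # look fields up by key, no counting
away_team = event['strAwayTeam']
home_score = event['intHomeScore']
away_score = event['intAwayScore']

Binding the inner dict to a name once also avoids repeating the [0]['events'][0] prefix on every lookup.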
You should look into JSON Pointer:
https://python-json-pointer.readthedocs.io/en/latest/tutorial.html
Inspect the JSON, work out the path to the value you want, then resolve it with https://github.com/stefankoegl/python-json-pointer.
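For example, a minimal sketch, assuming the jsonpointer package is installed and data is the parsed API response from the question:

from jsonpointer import resolve_pointer

data = r2.json()
# a JSON pointer is a /-separated path into the document; 0 is a list index
home_team = resolve_pointer(data, '/events/0/strHomeTeam')
away_team = resolve_pointer(data, '/events/0/strAwayTeam')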
Related
I have a string in R that I would like to pass to python in order to compute something and return the result back into R.
I have the following, which "works", but not as I would like.
The code below passes a string from R to a Python file, uses OpenAI to generate the text data, and then loads it back into R.
library(reticulate)
computePythonFunction <- "
def print_openai_response():
    import openai
    openai.api_key = 'ej-powurjf___OpenAI_API_KEY___HGAJjswe'  # you will need an API key
    prompt = 'write me a poem about the sea'
    response = openai.Completion.create(engine='text-davinci-003', prompt=prompt, max_tokens=1000)
    # response['choices'][0]['text']
    print(response)
"
py_run_string(computePythonFunction)
py$print_openai_response()
library("rjson")
fromJSON(as.character(py$print_openai_response()))
I would like to store the results in R objects. Here is one output from the Python script:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": "\n\nThe sea glitters like stars in the night \nWith hues, vibrant and bright\nThe waves flow gentle, serene, and divine \nLike the sun's most gentle shine\n\nAs the sea reaches, so wide, so vast \nAn adventure awaits, and a pleasure, not passed\nWhite sands, with seaweed green \nForms a kingdom of the sea\n\nConnecting different land and tide \nThe sea churns, dancing with the sun's pride\nAs a tempest forms, raging and wild \nThe sea turns, its colors so mild\n\nA raging storm, so wild and deep \nProtecting the creatures that no one can see \nThe sea is a living breathing soul \nA true and untouchable goal \n\nThe sea is a beauty that no one can describe \nAnd it's power, no one can deny \nAn ever-lasting bond, timeless and free \nThe love of the sea, is a love, to keep"
}
],
"created": 1670525403,
"id": "cmpl-6LGG3hDNzeTZ5VFbkyjwfkHH7rDkE",
"model": "text-davinci-003",
"object": "text_completion",
"usage": {
"completion_tokens": 210,
"prompt_tokens": 7,
"total_tokens": 217
}
}
I am interested in the text generated, but I am also interested in the completion_tokens, prompt_tokens and total_tokens.
I thought about saving the Python code as a script, then passing an argument to it, such as:
myPython.py arg1.
How can I return the JSON output from the model to an R object? The only input that varies in the Python code is the prompt variable.
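One way to do this (a minimal sketch, not the only approach): have the Python function return the response instead of printing it. reticulate converts a returned Python dict into an R named list automatically, so no JSON round-trip through rjson is needed. This assumes the legacy openai package, whose response object exposes a to_dict() method:

def get_openai_response(prompt):
    import openai
    openai.api_key = 'YOUR_API_KEY'  # assumption: replace with your actual key
    response = openai.Completion.create(engine='text-davinci-003', prompt=prompt, max_tokens=1000)
    return response.to_dict()  # a plain dict converts cleanly to an R named list

After py_run_string() defines this, calling res <- py$get_openai_response('write me a poem about the sea') in R should give a list where res$choices[[1]]$text is the poem and res$usage$completion_tokens, res$usage$prompt_tokens and res$usage$total_tokens are the token counts.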
I am writing a script that scrapes information from a large travel agency. My code closely follows the tutorial at https://python.gotrained.com/selenium-scraping-booking-com/.
However, I would like to be able to navigate to the next page as I'm now limited to n_results = 25. Where do I add this in the code? I know that I can target the pagination button with driver.find_element_by_class_name('paging-next').click(), but I don't know where to incorporate it.
I have tried to put it in the for loop within the scrape_results function, which I have copied below. However, it doesn't seem to work.
def scrape_results(driver, n_results):
    '''Returns the data from n_results amount of results.'''
    accommodations_urls = list()
    accommodations_data = list()

    for accomodation_title in driver.find_elements_by_class_name('sr-hotel__title'):
        accommodations_urls.append(accomodation_title.find_element_by_class_name(
            'hotel_name_link').get_attribute('href'))

    for url in range(0, n_results):
        if url == n_results:
            break
        url_data = scrape_accommodation_data(driver, accommodations_urls[url])
        accommodations_data.append(url_data)

    return accommodations_data
EDIT
I have added some more code to clarify my input and output. Again, I mostly just used code from the GoTrained tutorial and added some code of my own. How I understand it: the scraper first collects all URLs and then scrapes the info of the individual pages one by one. I need to add the pagination loop in that first part – I think.
if __name__ == '__main__':
    try:
        driver = prepare_driver(domain)
        fill_form(driver, 'Waterberg, South Africa')  # my search argument
        accommodations_data = scrape_results(driver, 25)  # 25 is the maximum of results; higher makes the scraper crash due to the pagination problem
        accommodations_data = json.dumps(accommodations_data, indent=4)
        with open('booking_data.json', 'w') as f:
            f.write(accommodations_data)
    finally:
        driver.quit()
Below is the JSON output for one search result.
[
{
"name": "Lodge Entabeni Safari Conservancy",
"score": "8.4",
"no_reviews": "41",
"location": "Vosdal Plaas, R520 Marken Road, 0510 Golders Green, South Africa",
"room_types": [
"Tented Chalet - Wildside Safari Camp with 1 game drive",
"Double or Twin Room - Hanglip Mountain Lodge with 1 game drive",
"Tented Family Room - Wildside Safari Camp with 1 game drive"
],
"room_prices": [
"\u20ac 480",
"\u20ac 214",
"\u20ac 650",
"\u20ac 290",
"\u20ac 693"
],
"popular_facilities": [
"1 swimming pool",
"Bar",
"Very Good Breakfast"
]
},
...
]
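Since the URLs are all collected before the individual pages are scraped, the pagination loop does belong in that first part. A minimal sketch, assuming the tutorial's setup and that the next-page button keeps the 'paging-next' class (untested against the live site):

import time

def collect_urls(driver, max_pages):
    '''Collects accommodation URLs across up to max_pages result pages.'''
    urls = []
    for _ in range(max_pages):
        for title in driver.find_elements_by_class_name('sr-hotel__title'):
            urls.append(title.find_element_by_class_name(
                'hotel_name_link').get_attribute('href'))
        next_buttons = driver.find_elements_by_class_name('paging-next')
        if not next_buttons:
            break  # no next-page button, so this is the last page
        next_buttons[0].click()
        time.sleep(2)  # crude wait for the next page; a WebDriverWait would be cleaner

    return urls

scrape_results() could then iterate over the returned list instead of a single page's worth of titles.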
I have code that should write information to Excel using Selenium. I have one list with some information. I need to write all of it to Excel, and I have a solution, but when I tried to use it I got 'DataFrame' object is not callable. How can I solve it?
All this code is inside an iteration:
for schools in List:  # in the List i have data from excel file with Name of schools
    data = pd.DataFrame()
    data({
        "School Name": School_list_result[0::17],
        "Principal": School_list_result[1::17],
        "Principal's E-mail": School_list_result[2::17],
        "Type": School_list_result[8::17],
        "Grade Span": School_list_result[3::17],
        "Address": School_list_result[4::17],
        "Phone": School_list_result[14::17],
        "Website": School_list_result[13::17],
        "Associations/Communities": School_list_result[5::17],
        "GreatSchools Summary Rating": School_list_result[6::17],
        "U.S.News Rankings": School_list_result[12::17],
        "Total # Students": School_list_result[15::17],
        "Full-Time Teachers": School_list_result[16::17],
        "Student/Teacher Ratio": School_list_result[17::17],
        "Charter": School_list_result[9::17],
        "Enrollment by Race/Ethnicity": School_list_result[7::17],
        "Enrollment by Gender": School_list_result[10::17],
        "Enrollment by Grade": School_list_result[11::17],
    })
    data.to_excel("D:\Schools.xlsx")
In School_list_result I have this data:
'Cape Elizabeth High School',
'Mr. Jeffrey Shedd',
'No data.',
'9-12',
'345 Ocean House Road, Cape Elizabeth, ME 04107',
'Cape Elizabeth Public Schools',
'8/10',
'White\n91%\nAsian\n3%\nTwo or more races\n3%\nHispanic\n3%\nBlack\n1%',
'Regular school',
'No',
' Male Female\n Students 281 252',
' 9 10 11 12\n Students 139 135 117 142',
'#5,667 in National Rankings',
'https://cehs.cape.k12.me.us/',
'Tel: (207)799-3309',
'516 students',
'47 teachers',
'11:1',
Please follow the syntax for creating a DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
So your code should be modified as below. (Note, too, that your sample has 18 values per school, so a slice step of 17 will drift out of alignment after the first school; a step of 18 is probably what you want.)
for schools in List:  # in the List i have data from excel file with Name of schools
    data = pd.DataFrame(data={
        "School Name": School_list_result[0::17],
        "Principal": School_list_result[1::17],
        "Principal's E-mail": School_list_result[2::17],
        "Type": School_list_result[8::17],
        "Grade Span": School_list_result[3::17],
        "Address": School_list_result[4::17],
        "Phone": School_list_result[14::17],
        "Website": School_list_result[13::17],
        "Associations/Communities": School_list_result[5::17],
        "GreatSchools Summary Rating": School_list_result[6::17],
        "U.S.News Rankings": School_list_result[12::17],
        "Total # Students": School_list_result[15::17],
        "Full-Time Teachers": School_list_result[16::17],
        "Student/Teacher Ratio": School_list_result[17::17],
        "Charter": School_list_result[9::17],
        "Enrollment by Race/Ethnicity": School_list_result[7::17],
        "Enrollment by Gender": School_list_result[10::17],
        "Enrollment by Grade": School_list_result[11::17],
    })
Do you want to add to an existing xlsx file?
First, create the dictionary and then call the DataFrame constructor, like this:
r = {"column1":["data"], "column2":["data"]}
data = pd.DataFrame(r)
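And a minimal sketch of then writing the frame to Excel (the raw-string path is an assumption, used to sidestep backslash escapes on Windows):

import pandas as pd

r = {"column1": ["data"], "column2": ["data"]}
data = pd.DataFrame(r)
data.to_excel(r"D:\Schools.xlsx", index=False)  # raw string stops \S being read as an escape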
I am trying to slice a json file, the file looks like this:
{"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"}
{"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77}
{"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}
You can see the fields are not arranged consistently across the lines, and I only need some of the fields.
so I wrote this code:
import json, csv

infile = open("sample_output.strict", "r")
outfile = open("output.csv", "w")
writer = csv.writer(outfile)

fileds = ["asin","price"]
for product in json.loads(infile.read()):
    line = []
    for f in fields:
        if product.has_key(f):
            line.append(product[f])
        else:
            line.append("")
    writer.write(line)
I got below error msg:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-3e335b184eea> in <module>()
6
7 fileds = ["asin","price"]
----> 8 for product in json.loads(infile.read()):
9 line = []
10 for f in fields:
C:\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
316 parse_int is None and parse_float is None and
317 parse_constant is None and object_pairs_hook is None and not kw):
--> 318 return _default_decoder.decode(s)
319 if cls is None:
320 cls = JSONDecoder
C:\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
344 end = _w(s, end).end()
345 if end != len(s):
--> 346 raise ValueError(errmsg("Extra data", s, end, len(s)))
347 return obj
348
ValueError: Extra data: line 2 column 1 - line 3 column 617 (char 339 - 1581)
What you have is lines of JSON, not a single JSON document. Change your program to read each line and parse it as JSON, then pull the fields from each document that way. This format is actually pretty common; I receive data in it all the time.
Doing it line by line will also save you a lot of memory if you end up dealing with large files.
import json, csv

with open("sample_output.strict", "r") as infile:
    with open("output.csv", "w") as outfile:
        writer = csv.writer(outfile)
        fields = ["asin", "price"]
        for json_line in infile:
            product = json.loads(json_line)
            line = []
            for f in fields:
                if f in product:  # dict.has_key() is Python 2 only; 'in' works in both
                    line.append(product[f])
                else:
                    line.append("")
            writer.writerow(line)
Your input JSON file is ill-formed; that is the reason you are seeing this error. In short, you cannot have multiple top-level JSON objects in a single file, and in your case there are three. One solution is to wrap them in a top-level list like this:
[
{"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"},
{"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77},
{"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}
]
Then you can use the following piece of code to slice:
import json, csv

infile = open("sample_output.strict", "r")
jsondata = json.loads(infile.read())
outfile = open("output.csv", "w")
writer = csv.writer(outfile)

fields = ["asin", "price"]
for product in jsondata:
    line = []
    for f in fields:
        if f in product:
            line.append(product[f])
        else:
            line.append("")
    writer.writerow(line)  # csv.writer has writerow(), not write()
I don't understand what you're ultimately trying to do with the CSV output, so I kept that part essentially as in your question, just to demonstrate the fix.
I am stuck on an InvalidSemanticsException; following is my code:
import json
from py2neo import neo4j, Node, Relationship, Graph

graph = Graph()
graph.schema.create_uniqueness_constraint("Authors", "auth_name")
graph.schema.create_uniqueness_constraint("Mainstream_News", "id")

with open("example.json") as f:
    for line in f:
        while True:
            try:
                file = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # Now creating the node and relationships
        news = graph.merge_one("Mainstream_News", {"id": unicode(file["_id"]["$oid"]), "entry_url": unicode(file["entry_url"]), "title": unicode(file["title"])})
        authors = graph.merge_one("Authors", {"auth_name": unicode(file["auth_name"]), "auth_url": unicode(file["auth_url"]), "auth_eml": unicode(file["auth_eml"])})
        graph.create_unique(Relationship(news, "hasAuthor", authors))
I am trying to connect the news node with the authors node. My JSON file looks like this:
{
"_id": {
"$oid": "54933912bf4620870115a2e3"
},
"auth_eml": "",
"auth_url": "",
"cat": [],
"auth_name": "Max Bond",
"out_link": [],
"entry_url": [
"http://www.usda.gov/wps/portal/usda/!ut/p/c5/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_1wkA5kFaGuQBXeASbmnu4uBgbe5hB5AxzA0UDfzyM_N1W_IDs7zdFRUREAZXAypA!!/dl3/d3/L2dJQSEvUUt3QS9ZQnZ3LzZfUDhNVlZMVDMxMEJUMTBJQ01IMURERDFDUDA!/?navtype=SU&navid=AGRICULTURE"
],
"out_link_norm": [],
"title": "United States Department of Agriculture - Agriculture",
"entry_url_norm": [
"usda.gov/wps/portal/usda/!ut/p/c5/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_1wkA5kFaGuQBXeASbmnu4uBgbe5hB5AxzA0UDfzyM_N1W_IDs7zdFRUREAZXAypA!!/dl3/d3/L2dJQSEvUUt3QS9ZQnZ3LzZfUDhNVlZMVDMxMEJUMTBJQ01IMURERDFDUDA!/"
],
"ts": 1290945374000,
"source_url": "",
"content": "\n<a\nhref=\"/wps/portal/usda/!ut/p/c4/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_2CbEdFAEUOjoE!/?navid=AVIAN_INFLUENZA\">\n<b>Avian Influenza, Bird Flu</b></a> <br />\nThe official U.S. government web site for information on pandemic flu and avian influenza\n\n<strong>Pest Management</strong> <br />\nPest management policy, pesticide screening tool, evaluate pesticide risk, conservation\nbuffers, training modules.\n\n<strong>Weather and Climate</strong> <br />\nU.S. agricultural weather highlights, weekly weather and crop bulletin, major world crop areas\nand climatic profiles.\n"
}
The full exception error is like this:
File "/home/mohan/workspace/test.py", line 20, in <module>
news = graph.merge_one("Mainstream_News", {"id": unicode(file["_id"]["$oid"]), "entry_url": unicode(file["entry_url"]),"title":unicode(file["title"])})
File "/usr/local/lib/python2.7/dist-packages/py2neo/core.py", line 958, in merge_one
for node in self.merge(label, property_key, property_value, limit=1):
File "/usr/local/lib/python2.7/dist-packages/py2neo/core.py", line 946, in merge
response = self.cypher.post(statement, parameters)
File "/usr/local/lib/python2.7/dist-packages/py2neo/cypher/core.py", line 86, in post
return self.resource.post(payload)
File "/usr/local/lib/python2.7/dist-packages/py2neo/core.py", line 331, in post
raise_from(self.error_class(message, **content), error)
File "/usr/local/lib/python2.7/dist-packages/py2neo/util.py", line 235, in raise_from
raise exception
py2neo.error.InvalidSemanticsException: Cannot merge node using null property value for {'title': u'United States Department of Agriculture - Agriculture', 'id': u'54933912bf4620870115a2e3', 'entry_url': u"[u'http://www.usda.gov/wps/portal/usda/!ut/p/c5/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_1wkA5kFaGuQBXeASbmnu4uBgbe5hB5AxzA0UDfzyM_N1W_IDs7zdFRUREAZXAypA!!/dl3/d3/L2dJQSEvUUt3QS9ZQnZ3LzZfUDhNVlZMVDMxMEJUMTBJQ01IMURERDFDUDA!/?navtype=SU&navid=AGRICULTURE']"}
Any suggestions to fix this?
Yeah, I see what's going on here. If you look at the py2neo API and look for the merge_one function, it's defined this way:
merge_one(label, property_key=None, property_value=None)
Match or create a node by label and optional property and
return a single matching node. This method is intended to be
used with a unique constraint and does not fail if more than
one matching node is found.
The way that you're calling it is with a string first (label) and then a dictionary:
news = graph.merge_one("Mainstream_News", {"id": unicode(file["_id"]["$oid"]), "entry_url": unicode(file["entry_url"]),"title":unicode(file["title"])})
Your error message says that py2neo is treating the entire dictionary like a property name, and you haven't provided a property value.
So you're calling this function incorrectly. What you should probably be doing is merge_one only on the basis of the id property, then later adding the extra properties you need to the node that comes back.
You need to convert those merge_one calls into something like this:
news = graph.merge_one("Mainstream News", "id", unicode(file["_id]["$oid]))
Note this doesn't give you the extra properties; those you'd add later.
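A minimal sketch of adding the remaining properties afterwards (assuming py2neo 2.x, where merge_one() returns a Node you can update and push()):

news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]["$oid"]))
news.properties["title"] = unicode(file["title"])
news.properties["entry_url"] = unicode(file["entry_url"][0])  # entry_url is a list in your JSON
news.push()  # write the updated properties back to the database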