How to build an extracter spacy pipeline

How to build an extracter spacy pipeline - python

I am currently trying to extract some texts from sentences with spacy, did some courses about it but it is still a bit blur to me.
I have the following sentence:
ZZZ LLC is a limited liability company formed in the UK.
XYZ LLC is a limited liability company formed in the UK. XYZ LLC owns a commercial property located in Germany known as ‘rentview’.
Mr X owns 21% of XYZ LLC, the remaining 79% are own by ZZZ LLC which is the sole director of XYZ LLC.
What I want to extract are the following:
{"name": "XYZ LLC", "type": "ORG", "country": "UK"},
{"name": "ZZZ LLC", "type": "ORG", "country": "UK"},
{"name": "XYZ LLC", "type": "ORG",
"owns": [{"name": "rentview", "type": "commercial property", "country": "Germany"}],
"owned_by": [
{"name": "X", "type": "PERSON", "percent": 21},
{"name":"ZZZ LLC", "type": "ORG", "percent": 79}
]
}
My approach is first to assign an incorporation country to a company.
Then detects owners of companies.
Thereafter detects ownerBy of companies.
And finally, generate the JSON object.
But I am already blocking about assigning country to company
my code to do this is the following, but I am pretty sure I'm not using the right approach.
Span.set_extension("incorporation_country", default=False)
#Language.component("assign_org_country")
def assign_org_country(doc):
org_entities = [ent for ent in doc.ents if ent.label_ == "ORG"]
for ent in org_entities:
head = ent.root.head
if head.lemma_ in ['be']:
for child in head.children:
if child.dep_ == "attr" and child.text == "company" and child.right_edge.ent_type_ == "GPE":
ent._.incorporation_country = child
print(f"country of {ent.text} is {ent._.incorporation_country}")
return doc
Any ideas or tips of how to achieve this?

Related

Write a list of dictionaries (with varying keys) to one .csv file?

Given this dictionary:
{
"last_id": "9095247150673486907",
"stories": [
{
"description": "The $68.7 billion deal would be Microsoft\u2019s biggest takeover ever and the biggest deal in video game history. The acquisition would make Microsoft the world\u2019s third-largest gaming company by revenue,\u2026 The post Following the takeover of Activision by Microsoft, Sony is already being shaken up appeared first on The Latest News.",
"favicon_url": "https://static.tickertick.com/website_icons/gettotext.com.ico",
"id": "5310290716350155140",
"site": "gettotext.com",
"tags": [
"msft"
],
"time": 1642641278000,
"title": "Following the takeover of Activision by Microsoft, Sony is already being shaken up",
"url": "https://gettotext.com/following-the-takeover-of-activision-by-microsoft-sony-is-already-being-shaken-up/"
},
{
"description": "Also Read | Acquisition of Activision Blizzard by Microsoft: an opportunity born out of chaos An announcement of such a nature could only inspire a good number of analysts, whose\u2026 The post Microsoft\u2019s takeover of Activision Blizzard ignites analysts appeared first on The Latest News.",
"favicon_url": "https://static.tickertick.com/website_icons/gettotext.com.ico",
"id": "-14419799692027457",
"site": "gettotext.com",
"tags": [
"msft"
],
"time": 1642641042000,
"title": "Microsoft\u2019s takeover of Activision Blizzard ignites analysts",
"url": "https://gettotext.com/microsofts-takeover-of-activision-blizzard-ignites-analysts/"
},
{
"description": "Practical in-ears, mini speakers with long battery life or powerful boom boxes \u2013 the manufacturer Anker offers a suitable product for almost every situation. On Ebay and Amazon you can\u2026 The post Anker on Ebay and Amazon on offer: Inexpensive Soundcore 3, Motion Boom & Co appeared first on The Latest News.",
"favicon_url": "https://static.tickertick.com/website_icons/gettotext.com.ico",
"id": "5221754710166764872",
"site": "gettotext.com",
"tags": [
"amzn"
],
"time": 1642640469000,
"title": "Anker on Ebay and Amazon on offer: Inexpensive Soundcore 3, Motion Boom & Co",
"url": "https://gettotext.com/anker-on-ebay-and-amazon-on-offer-inexpensive-soundcore-3-motion-boom-co/"
},
{
"favicon_url": "https://static.tickertick.com/website_icons/trib.al.ico",
"id": "-3472956334378244458",
"site": "trib.al",
"tags": [
"goog"
],
"time": 1642640285000,
"title": "Google is forming a group dedicated to blockchain and related technologies under a newly appointed executive",
"url": "https://trib.al/nZz3omw"
},
{
"description": "Texas' attorney general on Wednesday sued Google, alleging the company asked local radio DJs to record personal endorsements for smartphones that they hadn't used or been provided.",
"favicon_url": "https://static.tickertick.com/website_icons/yahoo.com.ico",
"id": "9095247150673486907",
"site": "yahoo.com",
"tags": [
"goog"
],
"time": 1642639680000,
"title": "Texas sues Google over local radio ads for its smartphones",
"url": "https://finance.yahoo.com/m/b44151c6-7276-30d9-bc62-bfe18c6297be/texas-sues-google-over-local.html?.tsrc=rss"
}
]
}
...how can I write the 'stories' list of dictionaries to one csv file, such that the keys are the header row, and the values are all the rest of the rows. Note, that some keys don't appear in ALL of the records (example, some story dictionaries don't have a 'description' key, and some do).
Psuedo might include:
Get all keys in the 'stories' list and assign those as the df's header
Iterate through each story in the 'stories' list and append the appropriate rows, leaving a nan if there isn't a matching key for every column
Looking for a pythonic way of doing this relatively quickly.
UPDATE
Trying this:
# Save to excel file
with open("newsheadlines.csv", "wt") as fp:
writer = csv.writer(fp, delimiter=",")
# writer.writerow(["your", "header", "foo"]) # write header
writer.writerows(response['stories'])
...gives this output
Does that help?

Simplest "pythonic" way to do so is by the pandas package.
import pandas as pd
pd.DataFrame(d["stories"]).to_csv('tmp.csv')
# To retrieve it
stories = pd.read_csv('tmp.csv', index_col=0)

loop through a list of dictionaries, perform function, and append result to csv

I trying to loop over a list containing Twitter data in a json format. The list is made of several dictionaries each containing data on a politician. The code works if the input json_response only holds data on one politician. However, when json_response is list of dictionaries i get an error.
In short, I believe the issue can be isolated to three for-loops in the code for tweet in json_response['data']:, for dics in json_response['includes']['users']:, and for element in json_response['includes']['media']:.
# Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = search_query
start_time = "2016-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 3000
json_response = [] # empty list that will hold tweet objects
for i in keyword: # loop through list of politicians in keyword i.e. search query and extract tweets
url = create_url(i, start_time, end_time, max_results)
json_response.append(connect_to_endpoint(url[0], headers, url[1]))
pass
I have only pasted the json_response object for 2 out of 30 politicians due cap on characters. However, the structure is the same for the remaining 28 politicians.
print(json.dumps(json_response, indent=4, sort_keys=True)) # look at json_response object.
[
{
"data": [
{
"author_id": "2877379617",
"created_at": "2021-03-25T12:11:14.000Z",
"id": "1375057688355336195",
"text": "#prettynobodyco She blocked me in 2015 - for pointing out that Tim Kaine enables sexual assault in the military and the evidence was his killing of the MJIA and publicly stated that Military commanders should remain in charge of military rape cases. She's Tanden level awful. Congrats!"
},
{
"author_id": "1265018154444562440",
"created_at": "2021-03-22T19:48:59.000Z",
"id": "1374085719472361474",
"text": "#MehcatCat #AlasscanIsBack #PattyArquette #timkaine Funny, they blocked me. \ud83e\udd23\ud83e\udd23"
},
{
"author_id": "2378324935",
"created_at": "2021-03-07T21:32:13.000Z",
"id": "1368675879312887810",
"text": "#DrWinarick #KatieOGrady4 I apologize for any drama. Katie O Grady blocked me because we had a disagreement about Tim Kaine on one of your older posts. I guess I can't please everyone haha. :/"
},
{
"author_id": "821870502943817729",
"created_at": "2021-02-12T23:53:59.000Z",
"id": "1360376637385244673",
"text": "She blocked me a long ass time ago when I asked her why we shoulf care about Tim Kaine's personal view on abortion if it didn't impact legislation"
},
{
"attachments": {
"media_keys": [
"16_1341045032732770306"
]
},
"author_id": "17232340",
"created_at": "2020-12-21T15:37:07.000Z",
"id": "1341045038420275205",
"text": "#DSingh4Biden #moomintroll8 #timkaine #GovernorVA That's why I replied to you. She blocked me previously, for what silliness I can't remember. Tough being a troll AND a snowflake!"
}
],
"includes": {
"media": [
{
"media_key": "16_1341045032732770306",
"type": "animated_gif"
}
],
"users": [
{
"created_at": "2014-11-15T02:23:57.000Z",
"description": "",
"id": "2877379617",
"name": "Laura Saylor",
"username": "lauraleesaylor"
},
{
"created_at": "2020-05-25T20:33:36.000Z",
"description": "Weird Writer & Lunatic Linguist\nWicked Witch of the East\nshe/her",
"id": "1265018154444562440",
"name": "Zauberkind",
"username": "Zauberkind2"
},
{
"created_at": "2014-03-08T07:22:31.000Z",
"description": "#Resist, #BLM, #Vaxxed, liberal, autistic, kidney transplant survivor, political nerd, mental health advocate, fighter for equality, truth, justice, etc.",
"id": "2378324935",
"name": "Trevor \"Trev\" McKee Achilles",
"username": "MrTAchilles"
},
{
"created_at": "2017-01-19T00:02:52.000Z",
"description": "statist / Progressive Gun Nut/ Single and hating it\n\n / \n\nstraight????? /\n\npronouns / brain worm survivor\n\n \n",
"id": "821870502943817729",
"name": "Squirrel Dad",
"username": "nihilisticpillo"
},
{
"created_at": "2008-11-07T15:09:46.000Z",
"description": "Liberal-Veteran-Dog Lover | Taste for irony, but in moderation | Humor is reason gone mad. ~Groucho Marx | I follow & unfollow back #VeteransResist #Resist",
"id": "17232340",
"name": "anti-Fascist Jim",
"username": "JimnBL"
}
]
},
"meta": {
"newest_id": "1375057688355336195",
"next_token": "b26v89c19zqg8o3foseug43lzoqdft4ghg78o9sn9ds3h",
"oldest_id": "1341045038420275205",
"result_count": 5
}
},
{
"data": [
{
"author_id": "1248251899884814336",
"created_at": "2021-03-27T13:36:45.000Z",
"id": "1375803982409576450",
"text": "#gavinjeffries0 #steven86026859 #MSNBC #SenBooker Uh Oh our friend Steve blocked me, I guess not being able to answer your simple question and being asked to was too much for him."
},
{
"author_id": "293104735",
"created_at": "2021-02-07T21:45:47.000Z",
"id": "1358532435122683904",
"text": "#slwilliams1101 #annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC #SenBooker #AliVelshi I stopped watching #TiffanyDCross as well and only watch #CapehartJ now (even though he blocked me in 2016 because I had a \"strong\" response to something mean he said about Hillary Clinton)."
},
{
"author_id": "380970864",
"created_at": "2021-02-07T20:58:01.000Z",
"id": "1358520416273326081",
"text": "#annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC After I criticized #TiffanyDCross she blocked me. #JoyAnnReid called herself petty during and interview with #SenBooker. Why be petty? Be mature and thoughtful so people can learn. Hosts need to learn too. I only watch #AliVelshi #CapehartJ now."
},
{
"attachments": {
"media_keys": [
"3_1358448920632909825"
]
},
"author_id": "793175035322171397",
"created_at": "2021-02-07T16:17:44.000Z",
"id": "1358449876565164034",
"text": "#FinstaManhattan #SenSchumer #SenBooker #RonWyden Lmao he blocked me over that. His bio said he likes to 'debate & that sometimes he's wrong but he can admit that'.\n\nGuess not.\n\nI wasn't rude or mean at all. This is too funny \ud83e\udd23"
},
{
"author_id": "752266160352010241",
"created_at": "2021-02-06T20:34:06.000Z",
"id": "1358152008948195328",
"text": "#fattypinner #tkbone32221 #SenSchumer #SenBooker #RonWyden He blocked me \ud83e\udd23\ud83d\ude2d\ud83e\udd23\ud83e\udd23\ud83e\udd23\ud83d\ude2d"
}
],
"includes": {
"media": [
{
"media_key": "3_1358448920632909825",
"type": "photo",
"url": ""
}
],
"users": [
{
"created_at": "2020-04-09T14:11:04.000Z",
"description": "",
"id": "1248251899884814336",
"name": "Firstcomm",
"username": "Firstcomm1"
},
{
"created_at": "2011-05-04T19:26:22.000Z",
"description": "Cinephile, balletomane, book lover, tennis fan, K-Drama fanatic, Jang Na-ra fangirl, USC School of Cinematic Arts alumna, Hillary Clinton and Nancy Pelosi Dem.",
"id": "293104735",
"name": "Joyce Tyler",
"username": "joyce_tyler"
},
{
"created_at": "2011-09-27T14:50:37.000Z",
"description": "Spelman College, BA, George Washington University MA, University of South Florida Ph.D. in Political Science, proud Ted Kennedy, Obama, Biden/Harris Democrat!",
"id": "380970864",
"name": "Stephanie L. Williams, Ph.D.",
"username": "slwilliams1101"
},
{
"created_at": "2016-10-31T19:37:19.000Z",
"description": "Loves: life, fam, cats, cars, tattoos, reality TV; collector of t-shirts & Volkswagen\u2019s. Hates: Oxford commas. #CombatVet #Medic #BidenHarris2020 #Resist",
"id": "793175035322171397",
"name": "Que Sarah Sarah \ud83d\udda4",
"username": "sarahalli13"
},
{
"created_at": "2016-07-10T22:20:03.000Z",
"description": "3x Hollywood Video Street Fighter 2 Champion",
"id": "752266160352010241",
"name": "Sugarcoder",
"username": "TheSugarCoder"
}
]
},
"meta": {
"newest_id": "1375803982409576450",
"next_token": "b26v89c19zqg8o3fosktkdplqiw2q9kzx2ibm4r4y27wd",
"oldest_id": "1358152008948195328",
"result_count": 5
}
}
...28 other politicians
# Create file
csvFile = open("tweet_sample.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Create headers for the data I want to save. I only want to save these columns in my dataset
csvWriter.writerow(
['author id', 'created_at', 'id', 'tweet', 'bio', 'image_url'])
csvFile.close()
def append_to_csv(json_response, fileName):
# A counter variable
global created_at, tweet_id, bio, text, author_id
counter = 0
# Open OR create the target CSV file
csvFile = open(fileName, "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Loop through each tweet
for tweet in json_response[0]['data']: # NOTE adding a 0 gives access to the data for the first politician while adding 1 gives access to data for the second politician and so on...
# 1. Author ID
author_id = tweet['author_id']
# 2. Time created
created_at = dateutil.parser.parse(tweet['created_at'])
# 3. Tweet ID
tweet_id = tweet['id']
# 4. Tweet text
text = tweet['text']
for dics in json_response[0]['includes']['users']: # NOTE 0 added
# 5. description. Contained in includes data object
if ('description' in dics):
bio = dics['description']
else:
bio = " "
for element in json_response[0]['includes']['media']: # NOTE 0 added
# 6. image url. Contained in includes data object
if ('url' in element):
image_url = element['url']
else:
image_url = " "
# Assemble all data in a list
res = [author_id, created_at, tweet_id, text, bio, image_url]
# Append the result to the CSV file
csvWriter.writerow(res)
counter += 1
# When done, close the CSV file
csvFile.close()
# Print the number of tweets for this iteration
print("# of Tweets added from this response: ", counter)
append_to_csv(json_response, "tweet_sample.csv") # Save tweet data in a csv file
Error message:
TypeError: list indices must be integers or slices, not str
By adding the [0] in the loop I avoid the TypeError above. However the output from the function append_to_csv is not ideal as it only includes the last tweet for the first politician. I guess my loop overwrites data.
Desired output would be a data frame with columns author_id, created_at, id, tweet, bio, image_url. Not all users have a bio on their profile or an image_url in their tweet hence the if-else statement in the function above and the bio, no_bio and bio, image_url, no_image_url in the desired data frame.
pol_df = pd.read_csv("path_to_tweet_sample.csv" )
pol_df.head()
author_id created_at id tweet bio image_url
0 737885223858384896 2021-03-26T21:56:02.000Z 1375567243082338314 tweet_text no_bio no_image_url
1 847612931487416323 2021-03-26T21:55:24.000Z 1375567083791073283 tweet_text no_bio no_image_url
2 18634205 2021-03-08T12:29:00.000Z 1368901564363051010 tweet_text bio image_url
3 27327319 2021-03-02T11:53:16.000Z 1366718245521211393 tweet_text bio no_image_url
4 917634626247647232 2021-02-28T18:16:45.000Z 1366089974907432961 tweet_text bio image_url

I think you are confusing lists with dicts. When you try to access a list like a dict (e.g. data["author_id"]) the TypeError you're getting will be raised. You have to iterate over a list and then try to access each dict in that list like [x['author_id'] for x in data], for example. If you want to extract values from the dicts and write it to a csv file you might want to do something like this:
import pandas as pd
author_data = []
for data in resp:
for author in data['data']:
author_id = author['author_id']
created_at = author['created_at']
another_id = author['id']
tweet_text = author['text']
author_data.append([author_id, created_at, another_id, tweet_text])
author_df = pd.DataFrame(author_data, columns=['author_id', 'created_at', 'id', 'text'])
media_data = []
for data in resp:
for media in data['includes']['media']:
url = media.get('url', 'no_url')
media_data.append(media)
media_df = pd.DataFrame(media_data, columns=['url'])
bio_data = []
for data in resp:
for user in data['includes']['users']:
bio = user['description']
author_id = user['id']
bio_data.append([bio, author_id])
bio_df = pd.DataFrame(bio_data, columns=['bio', 'author_id'])
final_df = author_df.merge(bio_df, on="author_id")
print(final_df)
You have to save different parts of the data in different dataframes and then merge them. The thing is that media does not contain the author_id or another key that is shared between the ['includes']['media'] part and ['data'] part so you cannot merge that.

Reformatting json file

I'm trying to reformat a JSON file so I can convert it into a Python dictionary. The file contains line-separated JSON objects with different product info (looks like this):
{"asin": "7301113188", "category": ["Appliances", "Refrigerators, Freezers & Ice Makers"], "description": [], "fit": "", "title": "Tupperware Freezer Square Round Container Set of 6", "also_buy": [], "image": [], "tech2": "", "brand": "Tupperware", "feature": ["Each 3-pc. set includes two 7/8-cup/200 mL and one 1-3/4-cup/400 mL.", "Use them to keep sandwich fillings, salads or leftovers fresh in the refrigerator.", "Gently twist the container to \"pop\" out frozen foods for reheating.", "Dishwasher Safe.", "Set weights less than 13 oz!"], "rank": [">#39,745 in Appliances (See top 100)"], "also_view": [], "details": {}, "main_cat": "Appliances", "similar_item": "", "date": "November 19, 2008", "price": ""}
{"asin": "7861850250", "category": ["Appliances", "Refrigerators, Freezers & Ice Makers"], "tech2": "", "brand": "Tupperware", "feature": ["2 X Tupperware Pure & Fresh Unique Covered Cool Cubes Ice Tray in Purple With Opening Lid Contain 14 Cubes - HerbalStore_24*7", "Package Contain :- 2 Tray", "Each ice tray has a specially designed seal that allows you to fill from the faucet with no spills on the way to the freezer. While freezing, this seal helps keep flavor in and freezer odors out, ensuring you have pure ice every time. For something special, try freezing lemonade, tea or fruit juices in these Ice Tray to give your beverages an extra-flavorful kick. Or add a piece of fruit to each cube for a stylish touch of elegance.", "Sold By:- HerbalStore_24*7", "Free Shipping"], "rank": [">#6,118 in Appliances (See top 100)"], "also_view": ["B004RUGHJW"], "details": {}, "main_cat": "Appliances", "similar_item": "", "date": "June 5, 2016", "price": "$3.62"}
I want the dict to contain key-pair values where each "asin" is a key and the rest of the product info is a value. What's the most optimal way to do this?

You can parse JSON dictionaries using json.loads.
import json
final = {}
for line in lines:
d = json.loads(line)
final[d['asin']] = d
del d['asin']
It's also possible to parse some JSON-like text using ast.literal_eval. The JSON-like text has to not have any boolean or null values, as is the case with your example. Below is the changes needed:
from ast import literal_eval
...
d = literal_eval(line)
...

How to handle composite type of Entities using RASA NLU?

Let's say I have this utterance: "My name is John James Doe"
{
"rasa_nlu_data": {
"common_examples":
[
{
"text": "My name is John James Doe",
"intent": "Introduction",
"entities": [
{
"start": 11,
"end": 25,
"value": "John James Doe",
"entity": "Name"
}
]
}
],
"regex_features" : [],
"entity_synonyms": []
}
}
Here the substring John James Doe is a composite entity of type Name having 3 simple entities (First Name, Middle Name, Last Name) as follows:
John - First Name(Simple Entity)
James - Middle Name(Simple Entity)
Doe - Last Name(Simple Entity)
So, is there any in RASA for me to make a training format which will handle these kinds of composite type of entities.
Any help is appreciated, Thank you.

I believe you'll have an easier time if you continue to train with an entity type of Name that pulls out a section of text for all the names and then try to process the individual composite parts from the returned entity text. The reason being if you try to train on the component parts, you'll quickly have to provide a whole raft of combinations in your training data and that will become ineffective.
Also bear in mind that this isn't a trivial problem as you go deeper. If you use position alone to determine first / middle / last, you may have problems in Japan (https://www.sljfaq.org/afaq/names-for-people.html ) and if you tried to train to recognise names based on content (ie to pick out Doe as a last name) it will be prone to problems: it's not unknown for Americans to have first names that elsewhere are thought of as last names (Jackson, Hunter etc), middle names vary a lot too (https://en.m.wikipedia.org/wiki/Middle_name )

I've written a custom component for this as I needed composite entities as well. Here is a summary of how it works.
Let's say your training data rather looked like this:
"rasa_nlu_data": {
"common_examples":
[
{
"text": "My name is John James Doe",
"intent": "Introduction",
"entities": [
{
"start": 11,
"end": 15,
"value": "John",
"entity": "first_name"
},
{
"start": 16,
"end": 21,
"value": "James",
"entity": "middle_name"
},
{
"start": 22,
"end": 25,
"value": "Doe",
"entity": "last_name"
},
]
}
],
"regex_features" : [],
"entity_synonyms": []
}
}
So instead of training the full name as an entity that will be split, you train name parts that will be grouped to full names.
The basic idea now is that you define composite patterns with entity placeholders. For your example, you could define this pattern:
full_name = "#first_name #middle_name #last_name"
For your example sentence, Rasa NLU will recognize the three entities in it like this:
My name is John James Doe
^ ^ ^
first_name middle_name last_name
You take the input sentence and replace every recognized entity with its entity type:
My name is #first_name #middle_name #last_name
You can now perform a simple check whether your defined pattern is included in this string.
My name is #first_name #middle_name #last_name
^
| Pattern matches
|
"#first_name #middle_name #last_name"
If it is included, you take all entities values that are part of the inclusion and group them together to a full_name.
My name is John James Doe
^
| Pattern matches
|
#first_name #middle_name #last_name
-> full_name = ["John", "James", "Doe"]
If you use regular expressions instead of simple string matching, you can make this system a lot more flexible. For example, you could make the middle name optional by changing your pattern to
full_name = "#first_name (#middle_name )?#last_name"

Storing dictionary variables in list after test

I have a json structured like this:
{ "status":"OK", "copyright":"Copyright (c) 2017 Pro Publica Inc. All Rights Reserved.","results":[
{
"member_id": "B001288",
"total_votes": "100",
"offset": "0",
"votes": [
{
"member_id": "B001288",
"chamber": "Senate",
"congress": "115",
"session": "1",
"roll_call": "84",
"bill": {
"number": "H.J.Res.57",
"bill_uri": "https://api.propublica.org/congress/v1/115/bills/hjres57.json",
"title": "Providing for congressional disapproval under chapter 8 of title 5, United States Code, of the rule submitted by the Department of Education relating to accountability and State plans under the Elementary and Secondary Education Act of 1965.",
"latest_action": "Message on Senate action sent to the House."
},
"description": "A joint resolution providing for congressional disapproval under chapter 8 of title 5, United States Code, of the rule submitted by the Department of Education relating to accountability and State ...",
"question": "On the Joint Resolution",
"date": "2017-03-09",
"time": "12:02:00",
"position": "No"
},
Sometimes the "bill" parameter is there, sometimes it is blank, like:
{
"member_id": "B001288",
"chamber": "Senate",
"congress": "115",
"session": "1",
"roll_call": "79",
"bill": {
},
"description": "James Richard Perry, of Texas, to be Secretary of Energy",
"question": "On the Nomination",
"date": "2017-03-02",
"time": "13:46:00",
"position": "No"
},
I want to access and store the "bill_uri" in a list, so I can access it later on. I've already performed .json() through the requests package to process it into python. print votes_json["results"][0]["votes"][0]["bill"]["bill_uri"] etc. works just fine, but when I do:
bill_urls_2 = []
for n in range(0, len(votes_json["results"][0]["votes"])):
if votes_json["results"][0]["votes"][n]["bill"]["bill_uri"] in votes_json["results"][0]["votes"][n]:
bill_urls_2.append(votes_json["results"][0]["votes"][n])["bill"]["bill_uri"]
print bill_urls_2
I get the error KeyError: 'bill_uri'. I think I have a problem with the structure of the if statement, specifically what key I'm looking for in the dictionary. Could someone provide an explanation/link to explanation about how to use in to find keys? Or pinpoint the error in how I'm using it?
Update: Aha! I got this to work:
bill_urls_2 = []
for n in range(0, len(votes_json["results"][0]["votes"])):
if "bill" in votes_json["results"][0]["votes"][n]:
if "bill_uri" in votes_json["results"][0]["votes"][n]["bill"]:
bill_urls_2.append(votes_json["results"][0]["votes"][n]["bill"]["bill_uri"])
print bill_urls_2
Thank you to everyone who gave me advice.

The error here is cause by the fact that you are looking for a key in the dictionary by called that key itself. Here's a small example:
my_dict = {'A': 1, 'B':2, 'C':3}
Now C may or may not exist in the dict every time. This is how I can check if C exists in the dict:
if 'C' in my_dict:
print(True)
What you are doing is:
if my_dict['C'] in my_dict:
print(True)
If C doesn't exist to begin with my_dict['C'] isn't found and gives you an error.
What you need to do is:
bill_urls_2 = []
for n in range(0, len(votes_json["results"][0]["votes"])):
if "bill_uri" in votes_json["results"][0]["votes"][n]:
bill_urls_2.append(votes_json["results"][0]["votes"][n]["bill"]["bill_uri"])
print bill_urls_2

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to build an extracter spacy pipeline - python

Related

Write a list of dictionaries (with varying keys) to one .csv file?

loop through a list of dictionaries, perform function, and append result to csv

Reformatting json file

How to handle composite type of Entities using RASA NLU?

Storing dictionary variables in list after test

Categories

Resources