How can I get this text format from a JSON file? - python

I have a JSON file that contains several images and annotations. Each image has an id, and each annotation references a caption and the image_id of the image. There are thousands of images and multiple annotations refer to the same image. Here's a sample for only one image and its annotations (link to full data):
{
  "images": [
    {
      "license": 5,
      "url": "http://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg",
      "file_name": "COCO_train2014_000000057870.jpg",
      "id": 57870,
      "width": 640,
      "date_captured": "2013-11-14 16:28:13",
      "height": 480
    }
  ],
  "annotations": [
    {
      "image_id": 57870,
      "id": 787980,
      "caption": "A restaurant has modern wooden tables and chairs."
    },
    {
      "image_id": 57870,
      "id": 789366,
      "caption": "A long restaurant table with rattan rounded back chairs."
    },
    {
      "image_id": 57870,
      "id": 789888,
      "caption": "a long table with a plant on top of it surrounded with wooden chairs "
    },
    {
      "image_id": 57870,
      "id": 791316,
      "caption": "A long table with a flower arrangement in the middle for meetings"
    },
    {
      "image_id": 57870,
      "id": 794853,
      "caption": "A table is adorned with wooden chairs with blue accents."
    }
  ]
}
I need to reconstruct the format of the text in this file to be like this:
COCO_train2014_000000057870.jpg#0 A restaurant has modern wooden tables and chairs.
COCO_train2014_000000057870.jpg#1 A long restaurant table with rattan rounded back chairs.
COCO_train2014_000000057870.jpg#2 a long table with a plant on top of it surrounded with wooden chairs
COCO_train2014_000000057870.jpg#3 A long table with a flower arrangement in the middle for meetings
COCO_train2014_000000057870.jpg#4 A table is adorned with wooden chairs with blue accents.
I understand the idea but couldn't express it well in Python. I first need to check whether the image_id values match; for each group of annotations with the same image_id, I need to collect their captions and number them from 0 to 4.

After reading in the data, reorganizing it into a dictionary indexed by ID makes it easy to access the correct image while iterating over the annotations. The code below does this, and also collects each caption into a per-image list:
import json

with open('captions_train2014.json') as f:
    data = json.load(f)

# Collect all images into a dictionary indexed by ID
images = {p['id']: p for p in data['images']}

# To each image, add a list of captions
for image in images.values():
    image['captions'] = []

# For each annotation, add its caption to its
# corresponding image's caption list.
for annotation in data['annotations']:
    images[annotation['image_id']]['captions'].append(annotation['caption'])

# Iterate over images and print captions in the format requested.
for image in images.values():
    for i, caption in enumerate(image['captions']):
        print(f"{image['file_name']}#{i} {caption}")
Output:
COCO_train2014_000000057870.jpg#0 A restaurant has modern wooden tables and chairs.
COCO_train2014_000000057870.jpg#1 A long restaurant table with rattan rounded back chairs.
COCO_train2014_000000057870.jpg#2 a long table with a plant on top of it surrounded with wooden chairs
COCO_train2014_000000057870.jpg#3 A long table with a flower arrangement in the middle for meetings
COCO_train2014_000000057870.jpg#4 A table is adorned with wooden chairs with blue accents.
COCO_train2014_000000384029.jpg#0 A man preparing desserts in a kitchen covered in frosting.
COCO_train2014_000000384029.jpg#1 A chef is preparing and decorating many small pastries.
COCO_train2014_000000384029.jpg#2 A baker prepares various types of baked goods.
COCO_train2014_000000384029.jpg#3 a close up of a person grabbing a pastry in a container
COCO_train2014_000000384029.jpg#4 Close up of a hand touching various pastries.
COCO_train2014_000000222016.jpg#0 a big red telephone booth that a man is standing in
COCO_train2014_000000222016.jpg#1 a person standing inside of a phone booth
COCO_train2014_000000222016.jpg#2 this is an image of a man in a phone booth.
COCO_train2014_000000222016.jpg#3 A man is standing in a red phone booth.
COCO_train2014_000000222016.jpg#4 A man using a phone in a phone booth.
...
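A variation on the same idea: instead of attaching a caption list to each image record, group the captions by image ID with `collections.defaultdict` and keep a separate ID-to-filename map. A minimal sketch, with the sample dict trimmed from the question's data so the function can be checked in isolation:

```python
from collections import defaultdict

def format_captions(data):
    """Return 'file_name#index caption' lines for a COCO-style dict."""
    # Map image ID -> file name for quick lookup
    file_names = {img['id']: img['file_name'] for img in data['images']}
    # Group captions by image ID without modifying the original records
    captions = defaultdict(list)
    for ann in data['annotations']:
        captions[ann['image_id']].append(ann['caption'])
    return [f"{file_names[iid]}#{i} {cap}"
            for iid, caps in captions.items()
            for i, cap in enumerate(caps)]

sample = {
    "images": [{"id": 57870, "file_name": "COCO_train2014_000000057870.jpg"}],
    "annotations": [
        {"image_id": 57870, "id": 787980,
         "caption": "A restaurant has modern wooden tables and chairs."},
    ],
}
print(format_captions(sample)[0])
# -> COCO_train2014_000000057870.jpg#0 A restaurant has modern wooden tables and chairs.
```

Writing the lines to a file instead of printing them is then just a matter of joining the returned list with newlines.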

Related

How can I print the original title of the top 50 movies from themoviedatabase.org with python?

Hello all, I'm learning Python and I need to analyze the themoviedb.org website. I want to extract all the movies in the database and print the original titles of the first 50. This is a piece of the JSON that I receive as a response to my network request:
{"page":1,"results":[{"adult":false,"backdrop_path":"/5gPQKfFJnl8d1edbkOzKONo4mnr.jpg","genre_ids":[878,12,28],"id":76600,"original_language":"en","original_title":"Avatar: The Way of Water","overview":"Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.","popularity":5332.225,"poster_path":"/t6HIqrRAclMCA60NsSmeqe9RmNV.jpg","release_date":"2022-12-14","title":"Avatar: The Way of Water","video":false,"vote_average":7.7,"vote_count":3497},{......}],"total_pages":36589,"total_results":731777}
And this is my code:
import requests
response = requests.get("https://api.themoviedb.org/3/discover/movie?api_key=my_key&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&page=1&with_watch_monetization_types=flatrate")
jsonresponse = response.json()
page = jsonresponse["page"]
results = jsonresponse["results"]
for i in range(50):
    for result in jsonresponse["original_title"][i]:
        print(result)
My code doesn't work; I get the error "KeyError: 'original_title'". How can I print the original titles of the top 50 movies?
When formatting the json you posted properly:
{
  "page": 1,
  "results": [
    {
      "adult": false,
      "backdrop_path": "/5gPQKfFJnl8d1edbkOzKONo4mnr.jpg",
      "genre_ids": [
        878,
        12,
        28
      ],
      "id": 76600,
      "original_language": "en",
      "original_title": "Avatar: The Way of Water",
      "overview": "Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.",
      "popularity": 5332.225,
      "poster_path": "/t6HIqrRAclMCA60NsSmeqe9RmNV.jpg",
      "release_date": "2022-12-14",
      "title": "Avatar: The Way of Water",
      "video": false,
      "vote_average": 7.7,
      "vote_count": 3497
    },
    {
      ....
    }
  ],
  "total_pages": 36589,
  "total_results": 731777
}
one can easily see that original_title is part of each dictionary / map in results. So using
for result in results:
    print(result["original_title"])
should work.
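Note that each Discover response contains only one page of results (20 here), so collecting 50 titles means fetching several pages. A sketch with the parsing separated from the network call, so it can be checked without an API key (the paging loop in the usage comment is an assumption based on the `page` parameter in the question's URL):

```python
def titles_from_pages(pages, limit=50):
    """Collect up to `limit` original_title values from parsed responses."""
    titles = []
    for page in pages:
        for result in page["results"]:
            titles.append(result["original_title"])
            if len(titles) == limit:
                return titles
    return titles

# Usage (my_key is a placeholder for a real API key):
# import requests
# url = ("https://api.themoviedb.org/3/discover/movie"
#        "?api_key=my_key&sort_by=popularity.desc&page={}")
# pages = [requests.get(url.format(p)).json() for p in (1, 2, 3)]
# for title in titles_from_pages(pages):
#     print(title)
```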

Write a list of dictionaries (with varying keys) to one .csv file?

Given this dictionary:
{
  "last_id": "9095247150673486907",
  "stories": [
    {
      "description": "The $68.7 billion deal would be Microsoft\u2019s biggest takeover ever and the biggest deal in video game history. The acquisition would make Microsoft the world\u2019s third-largest gaming company by revenue,\u2026 The post Following the takeover of Activision by Microsoft, Sony is already being shaken up appeared first on The Latest News.",
      "favicon_url": "https://static.tickertick.com/website_icons/gettotext.com.ico",
      "id": "5310290716350155140",
      "site": "gettotext.com",
      "tags": [
        "msft"
      ],
      "time": 1642641278000,
      "title": "Following the takeover of Activision by Microsoft, Sony is already being shaken up",
      "url": "https://gettotext.com/following-the-takeover-of-activision-by-microsoft-sony-is-already-being-shaken-up/"
    },
    {
      "description": "Also Read | Acquisition of Activision Blizzard by Microsoft: an opportunity born out of chaos An announcement of such a nature could only inspire a good number of analysts, whose\u2026 The post Microsoft\u2019s takeover of Activision Blizzard ignites analysts appeared first on The Latest News.",
      "favicon_url": "https://static.tickertick.com/website_icons/gettotext.com.ico",
      "id": "-14419799692027457",
      "site": "gettotext.com",
      "tags": [
        "msft"
      ],
      "time": 1642641042000,
      "title": "Microsoft\u2019s takeover of Activision Blizzard ignites analysts",
      "url": "https://gettotext.com/microsofts-takeover-of-activision-blizzard-ignites-analysts/"
    },
    {
      "description": "Practical in-ears, mini speakers with long battery life or powerful boom boxes \u2013 the manufacturer Anker offers a suitable product for almost every situation. On Ebay and Amazon you can\u2026 The post Anker on Ebay and Amazon on offer: Inexpensive Soundcore 3, Motion Boom & Co appeared first on The Latest News.",
      "favicon_url": "https://static.tickertick.com/website_icons/gettotext.com.ico",
      "id": "5221754710166764872",
      "site": "gettotext.com",
      "tags": [
        "amzn"
      ],
      "time": 1642640469000,
      "title": "Anker on Ebay and Amazon on offer: Inexpensive Soundcore 3, Motion Boom & Co",
      "url": "https://gettotext.com/anker-on-ebay-and-amazon-on-offer-inexpensive-soundcore-3-motion-boom-co/"
    },
    {
      "favicon_url": "https://static.tickertick.com/website_icons/trib.al.ico",
      "id": "-3472956334378244458",
      "site": "trib.al",
      "tags": [
        "goog"
      ],
      "time": 1642640285000,
      "title": "Google is forming a group dedicated to blockchain and related technologies under a newly appointed executive",
      "url": "https://trib.al/nZz3omw"
    },
    {
      "description": "Texas' attorney general on Wednesday sued Google, alleging the company asked local radio DJs to record personal endorsements for smartphones that they hadn't used or been provided.",
      "favicon_url": "https://static.tickertick.com/website_icons/yahoo.com.ico",
      "id": "9095247150673486907",
      "site": "yahoo.com",
      "tags": [
        "goog"
      ],
      "time": 1642639680000,
      "title": "Texas sues Google over local radio ads for its smartphones",
      "url": "https://finance.yahoo.com/m/b44151c6-7276-30d9-bc62-bfe18c6297be/texas-sues-google-over-local.html?.tsrc=rss"
    }
  ]
}
...how can I write the 'stories' list of dictionaries to one csv file, such that the keys are the header row and the values are all the rest of the rows? Note that some keys don't appear in ALL of the records (for example, some story dictionaries have a 'description' key and some don't).
Pseudocode might include:
Get all keys in the 'stories' list and assign those as the df's header
Iterate through each story in the 'stories' list and append the appropriate rows, leaving a nan if there isn't a matching key for every column
Looking for a pythonic way of doing this relatively quickly.
UPDATE
Trying this:
# Save to csv file
with open("newsheadlines.csv", "wt") as fp:
    writer = csv.writer(fp, delimiter=",")
    # writer.writerow(["your", "header", "foo"]) # write header
    writer.writerows(response['stories'])
...gives this output (output screenshot omitted). Does that help?
The simplest "pythonic" way to do this is with the pandas package.
import pandas as pd
pd.DataFrame(d["stories"]).to_csv('tmp.csv')
# To retrieve it
stories = pd.read_csv('tmp.csv', index_col=0)
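If you'd rather stay in the standard library, `csv.DictWriter` also handles dicts with varying keys: collect the union of keys for the header and let `restval` fill the missing cells. A sketch (the filename and the two tiny example stories are just illustrations):

```python
import csv

def stories_to_csv(stories, path):
    """Write a list of dicts (possibly with differing keys) to one CSV."""
    # Header = union of all keys, in first-seen order
    fieldnames = list(dict.fromkeys(key for story in stories for key in story))
    with open(path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(stories)

stories_to_csv(
    [{"id": "1", "title": "a", "description": "x"},
     {"id": "2", "title": "b"}],  # no 'description' key -> empty cell
    "newsheadlines.csv",
)
```

`restval=""` is what makes the ragged records safe; without it, `DictWriter` still works but you'd get `None` rendered per the writer's defaults only for missing keys listed in `fieldnames`.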

Reformatting json file

I'm trying to reformat a JSON file so I can convert it into a Python dictionary. The file contains line-separated JSON objects with different product info (looks like this):
{"asin": "7301113188", "category": ["Appliances", "Refrigerators, Freezers & Ice Makers"], "description": [], "fit": "", "title": "Tupperware Freezer Square Round Container Set of 6", "also_buy": [], "image": [], "tech2": "", "brand": "Tupperware", "feature": ["Each 3-pc. set includes two 7/8-cup/200 mL and one 1-3/4-cup/400 mL.", "Use them to keep sandwich fillings, salads or leftovers fresh in the refrigerator.", "Gently twist the container to \"pop\" out frozen foods for reheating.", "Dishwasher Safe.", "Set weights less than 13 oz!"], "rank": [">#39,745 in Appliances (See top 100)"], "also_view": [], "details": {}, "main_cat": "Appliances", "similar_item": "", "date": "November 19, 2008", "price": ""}
{"asin": "7861850250", "category": ["Appliances", "Refrigerators, Freezers & Ice Makers"], "tech2": "", "brand": "Tupperware", "feature": ["2 X Tupperware Pure & Fresh Unique Covered Cool Cubes Ice Tray in Purple With Opening Lid Contain 14 Cubes - HerbalStore_24*7", "Package Contain :- 2 Tray", "Each ice tray has a specially designed seal that allows you to fill from the faucet with no spills on the way to the freezer. While freezing, this seal helps keep flavor in and freezer odors out, ensuring you have pure ice every time. For something special, try freezing lemonade, tea or fruit juices in these Ice Tray to give your beverages an extra-flavorful kick. Or add a piece of fruit to each cube for a stylish touch of elegance.", "Sold By:- HerbalStore_24*7", "Free Shipping"], "rank": [">#6,118 in Appliances (See top 100)"], "also_view": ["B004RUGHJW"], "details": {}, "main_cat": "Appliances", "similar_item": "", "date": "June 5, 2016", "price": "$3.62"}
I want the dict to contain key-value pairs where each "asin" is a key and the rest of the product info is the value. What's the best way to do this?
You can parse JSON dictionaries using json.loads.
import json

final = {}
# 'lines' is the open file (or any iterable of JSON lines)
for line in lines:
    d = json.loads(line)
    final[d['asin']] = d
    del d['asin']
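The same loop can also be written as a comprehension that builds a new inner dict instead of mutating the parsed one; `lines` can be an open file object, since iterating a file yields lines (the sample string below is trimmed from the question's data):

```python
import json

def index_by_asin(lines):
    """Map each product's asin to a dict of its remaining fields."""
    return {
        d["asin"]: {k: v for k, v in d.items() if k != "asin"}
        for d in map(json.loads, lines)
    }

sample = '{"asin": "7301113188", "brand": "Tupperware"}'
print(index_by_asin([sample]))
# -> {'7301113188': {'brand': 'Tupperware'}}
```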
It's also possible to parse JSON-like text using ast.literal_eval. The text must not contain any boolean or null values, which is the case with your example. Below are the changes needed:
from ast import literal_eval
...
d = literal_eval(line)
...

how to split large json file according to the value of a key?

I have a large json file that I would like to split according to the key "metadata". One example of record is
{"text": "The primary outcome of the study was hospital mortality; secondary outcomes included ICU mortality and lengths of stay for hospital and ICU. ICU mortality was defined as survival of a patient at ultimate discharge from the ICU and hospital mortality was defined as survival at discharge or transfer from our hospital.", "label": "conclusion", "metadata": "18982114"}
There are many records in the JSON file where the key "metadata" is "18982114". How can I extract all of these records and store them in a separate JSON file? Ideally, I'm looking for a solution that avoids loading and looping over the whole file every time I query it, which would be very cumbersome. I think it may be doable with a shell command, but unfortunately I'm not an expert in shell commands, so I would highly appreciate a fast, non-looping query solution. Thanks!
==========================================================================
here are some samples of the file (contains 5 records):
{"text": "Finally, after an emergency laparotomy, patients who received i.v. vasoactive drugs within the first 24 h on ICU were 3.9 times more likely to die (OR 3.85; 95% CI, 1.64 -9.02; P\u00bc0.002). No significant prognostic factors were determined by the model on day 2.", "label": "conclusion", "metadata": "18982114"}
{"text": "Kinetics ofA TP Binding to Normal and Myopathic", "label": "conclusion", "metadata": "10700033"}
{"text": "Observed rate constants, k0b,, were obtained by fitting the equation I(t)=oe-kobs+C by the method of moments, where I is the observed fluorescence intensity, and I0 is the amplitude of fluorescence change. 38 ", "label": "conclusion", "metadata": "235564322"}
{"text": "The capabilities of modern angiographic platforms have recently improved substantially.", "label": "conclusion", "metadata": "2877272"}
{"text": "Few studies have concentrated specifically on the outcomes after surgery.", "label": "conclusion", "metadata": "18989842"}
The job is to quickly retrieve the text of the records with metadata "18982114".
Use the json package to convert the JSON object into a dictionary, then use the data stored in the metadata key. Here is a working example:
# importing the module
import json

# Opening JSON file
with open('data.json') as json_file:
    data = json.load(json_file)

# Print the type of data variable
print("Type:", type(data))

# Print the data of dictionary
print("metadata: ", data['metadata'])
You can try this approach:
import json

with open('data.json') as data_json:
    data = json.load(data_json)

MATCH_META_DATA = '18982114'
match_records = []
for part_data in data:
    if part_data.get('metadata') == MATCH_META_DATA:
        match_records.append(part_data)
Let us imagine we have the following JSON content in example.json:
{
  "1": {"text": "Some text 1.", "label": "xxx", "metadata": "18982114"},
  "2": {"text": "Some text 2.", "label": "yyy", "metadata": "18982114"},
  "3": {"text": "Some text 3.", "label": "zzz", "metadata": "something else"}
}
You can do the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

# 1. read json content from file
my_json = None
with open("example.json", "r") as file:
    my_json = json.load(file)

# 2. filter content
# you can use a list instead of a new dictionary if you don't want to create a new json file
new_json_data = {}
for record_id in my_json:
    if my_json[record_id]["metadata"] == str(18982114):
        new_json_data[record_id] = my_json[record_id]

# 3. write a new json with filtered data
with open("result.json", "w") as file:
    json.dump(new_json_data, file)
This will output the following result.json file:
{"1": {"text": "Some text 1.", "label": "xxx", "metadata": "18982114"}, "2": {"text": "Some text 2.", "label": "yyy", "metadata": "18982114"}}
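Since the question's file is actually line-separated JSON objects rather than one big document, you can also stream it: parse one line at a time and copy only the matching records, so the whole file is never held in memory. A sketch (the filenames are placeholders):

```python
import json

def filter_records(in_path, out_path, metadata_value):
    """Write records whose 'metadata' matches to a new line-separated file."""
    matched = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():  # skip blank lines
                continue
            if json.loads(line).get("metadata") == metadata_value:
                dst.write(line)
                matched += 1
    return matched

# filter_records("data.json", "18982114.json", "18982114")
```

This is still a linear scan, but it runs once per query without loading the file; for truly repeated queries, indexing the records by metadata into separate files (or a small database) up front would avoid even that.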

How to handle composite type of Entities using RASA NLU?

Let's say I have this utterance: "My name is John James Doe"
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "My name is John James Doe",
        "intent": "Introduction",
        "entities": [
          {
            "start": 11,
            "end": 25,
            "value": "John James Doe",
            "entity": "Name"
          }
        ]
      }
    ],
    "regex_features": [],
    "entity_synonyms": []
  }
}
Here the substring John James Doe is a composite entity of type Name having 3 simple entities (First Name, Middle Name, Last Name) as follows:
John - First Name(Simple Entity)
James - Middle Name(Simple Entity)
Doe - Last Name(Simple Entity)
So, is there any way in RASA to create a training format that handles these kinds of composite entities?
Any help is appreciated, Thank you.
I believe you'll have an easier time if you continue to train with an entity type of Name that pulls out a section of text for all the names and then try to process the individual composite parts from the returned entity text. The reason being if you try to train on the component parts, you'll quickly have to provide a whole raft of combinations in your training data and that will become ineffective.
Also bear in mind that this isn't a trivial problem as you go deeper. If you use position alone to determine first/middle/last, you may have problems in Japan (https://www.sljfaq.org/afaq/names-for-people.html), and if you tried to train the model to recognise names based on content (i.e. to pick out Doe as a last name) it will be prone to problems: it's not unknown for Americans to have first names that elsewhere are thought of as last names (Jackson, Hunter, etc.), and middle names vary a lot too (https://en.m.wikipedia.org/wiki/Middle_name).
I've written a custom component for this as I needed composite entities as well. Here is a summary of how it works.
Let's say your training data rather looked like this:
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "My name is John James Doe",
        "intent": "Introduction",
        "entities": [
          {
            "start": 11,
            "end": 15,
            "value": "John",
            "entity": "first_name"
          },
          {
            "start": 16,
            "end": 21,
            "value": "James",
            "entity": "middle_name"
          },
          {
            "start": 22,
            "end": 25,
            "value": "Doe",
            "entity": "last_name"
          }
        ]
      }
    ],
    "regex_features": [],
    "entity_synonyms": []
  }
}
So instead of training the full name as an entity that will be split, you train name parts that will be grouped to full names.
The basic idea now is that you define composite patterns with entity placeholders. For your example, you could define this pattern:
full_name = "#first_name #middle_name #last_name"
For your example sentence, Rasa NLU will recognize the three entities in it like this:
My name is John James Doe
           ^    ^     ^
           |    |     +-- last_name
           |    +-- middle_name
           +-- first_name
You take the input sentence and replace every recognized entity with its entity type:
My name is #first_name #middle_name #last_name
You can now perform a simple check whether your defined pattern is included in this string.
My name is #first_name #middle_name #last_name
           ^
           | Pattern matches
           |
           "#first_name #middle_name #last_name"
If it is included, you take all entity values that are part of the match and group them together into a full_name.
My name is John James Doe
           ^
           | Pattern matches
           |
           #first_name #middle_name #last_name

-> full_name = ["John", "James", "Doe"]
If you use regular expressions instead of simple string matching, you can make this system a lot more flexible. For example, you could make the middle name optional by changing your pattern to
full_name = "#first_name (#middle_name )?#last_name"
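A minimal sketch of that placeholder-and-pattern idea (this is not the actual component's code; the entity dicts follow the start/end/value/entity shape from the training example above):

```python
import re

def group_composite(text, entities, pattern):
    """Replace recognized entities with '#type' placeholders, then
    collect the values of the entities covered by the composite pattern."""
    entities = sorted(entities, key=lambda e: e["start"])
    # Substitute right-to-left so earlier character offsets stay valid
    placeholder_text = text
    for ent in reversed(entities):
        placeholder_text = (placeholder_text[:ent["start"]]
                            + "#" + ent["entity"]
                            + placeholder_text[ent["end"]:])
    if re.search(pattern, placeholder_text):
        # Entity types named in the pattern are the ones to group
        wanted = set(re.findall(r"#(\w+)", pattern))
        return [e["value"] for e in entities if e["entity"] in wanted]
    return None

entities = [
    {"start": 11, "end": 15, "value": "John", "entity": "first_name"},
    {"start": 16, "end": 21, "value": "James", "entity": "middle_name"},
    {"start": 22, "end": 25, "value": "Doe", "entity": "last_name"},
]
print(group_composite("My name is John James Doe", entities,
                      r"#first_name #middle_name #last_name"))
# -> ['John', 'James', 'Doe']
```

Because the pattern is a regular expression, the optional-middle-name variant `#first_name (#middle_name )?#last_name` works unchanged: a sentence without a middle name still matches, and the missing entity simply contributes no value.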
