Modify existing JSON to create a new custom one in Python

I'm trying to trim unused data from a JSON response to create a new one with only two fields: title and description. The title works, but I can't figure out how to get the description field. The JSON is public; you can get it here or at the end of the post.
My code that extracts the title field:
import requests
import json

def trim_json(d):
    newd = {}
    for name in ['title']:
        newd[name] = d[name]
    return newd

def clean():
    books = requests.get('https://openlibrary.org/authors/OL23919A/works.json')
    books_parsed = books.json()
    book_data = books_parsed['entries']
    book_data = [trim_json(d) for d in book_data]
    print(book_data)
    return book_data
Update:
The clean function returns a list of dicts in this format:
[{'title': 'Harry Potter House Gryffindor Edition Series 1-5 Books Collection Set By J.K. Rowling'}]
What I want to get is:
[{'title': 'Harry Potter House Gryffindor Edition Series 1-5 Books Collection Set By J.K. Rowling', 'description': 'lorem ipsum'}]
and if there is no description:
[{'title': 'Harry Potter House Gryffindor Edition Series 1-5 Books Collection Set By J.K. Rowling', 'description': 'undefined'}]
How can I get JSON that contains both title and description?
{
"type": {
"key": "/type/work"
},
"title": "Journey to Hogwarts",
"authors": [
{
"type": {
"key": "/type/author_role"
},
"author": {
"key": "/authors/OL23919A"
}
}
],
"covers": [
2520429
],
"key": "/works/OL28602152W",
"latest_revision": 1,
"revision": 1,
"created": {
"type": "/type/datetime",
"value": "2022-08-05T00:16:59.602176"
},
"last_modified": {
"type": "/type/datetime",
"value": "2022-08-05T00:16:59.602176"
}
},
{
"description": "Harry Potter #2\r\n\r\nThroughout the summer holidays after his first year at Hogwarts School of Witchcraft and Wizardry, Harry Potter has been receiving sinister warnings from a house-elf called Dobby.\r\n\r\nNow, back at school to start his second year, Harry hears unintelligible whispers echoing through the corridors.\r\n\r\nBefore long the attacks begin: students are found as if turned to stone.\r\n\r\nDobby’s predictions seem to be coming true.\r\n\r\n[Source][1]\r\n\r\n\r\n [1]: https://www.jkrowling.com/book/harry-potter-chamber-secrets/",
"links": [
{
"title": "Author's book page",
"url": "https://www.jkrowling.com/book/harry-potter-chamber-secrets/",
"type": {
"key": "/type/link"
}
},
{
"url": "https://en.wikipedia.org/wiki/Harry_Potter_and_the_Chamber_of_Secrets",
"title": "Wikipedia entry",
"type": {
"key": "/type/link"
}
},
{
"title": "Harry Potter and the Chamber of Secrets by J.K. Rowling - review | Children's books | The Guardian",
"url": "https://www.theguardian.com/childrens-books-site/2015/mar/02/review-j-k-rowling-harry-potter-chamber-secrets",
"type": {
"key": "/type/link"
}
},
{
"url": "https://www.theguardian.com/childrens-books-site/2016/may/26/harry-potter-and-the-chamber-of-secrets-jk-rowling-review",
"title": "Harry Potter and the Chamber of Secrets by J.K. Rowling - review 2 | Children's books | The Guardian",
"type": {
"key": "/type/link"
}
}
],
"title": "Harry Potter and the Chamber of Secrets",
"covers": [
8234423,
8237628,
8237644,
8392798,
8995302,
8762432,
8081272,
8353396,
10301720,
8938317,
10471286,
10413455,
10487260,
-1,
10535729,
10722535,
10722534,
11522289,
12347254,
12581306,
12606939,
10536577,
11540339,
12023623
],
"subject_places": [
"England",
"London",
"Hogwarts School of Witchcraft and Wizardry",
"Inglaterra",
"Privet Drive"
],
"subjects": [
"Fantasy fiction",
"school stories",
"Fiction",
"Fantasy",
"Nestlé Smarties Book Prize winner",
"Juvenile fiction",
"Wizards",
"Magic",
"Schools",
"Spanish language materials",
"Magia",
"Escuelas",
"Ficción juvenil",
"Novela fantástica",
"Hogwarts School of Witchcraft and Wizardry (Imaginary place)",
"Harry Potter (Fictitious character)",
"Wizards -- Juvenile fiction",
"Witches",
"Hogwarts School of Witchcraft and Wizardry (Imaginary organization)",
"Magos",
"Translations from English",
"Chinese fiction",
"Orphans",
"Aunts",
"Uncles",
"Cousins",
"Determination (Personality trait) in children",
"Friendship",
"Potter, Harry (Fictitious character)",
"Witches Fiction",
"Wizards Fiction",
"Schools Fiction",
"England Fiction",
"Magic -- Juvenile fiction",
"Hogwarts School of Witchcraft and Wizardry (Imaginary place) -- Juvenile fiction",
"Schools -- Juvenile fiction",
"Wizards -- Fiction",
"Magic -- Fiction",
"Schools -- Fiction",
"England -- Juvenile fiction",
"England -- Fiction",
"Fantasy & Magic",
"Action & Adventure",
"Witchcraft",
"Harry Potter (Fictional character)",
"Engels",
"Social Themes",
"Reading Level-Grade 11",
"Reading Level-Grade 12",
"Schools, fiction",
"England, fiction",
"Potter, harry (fictitious character), fiction",
"Hogwarts school of witchcraft and wizardry (imaginary organization), fiction",
"Wizards, fiction",
"Magic, fiction",
"Children's fiction",
"Adventure and adventurers, fiction",
"English literature",
"Fiction, fantasy, general",
"Large type books",
"Hermione Granger (Fictitious character)",
"Ron Weasley (Fictitious character)",
"Latin language materials",
"Children's stories",
"Magiciens",
"Romans, nouvelles, etc. pour la jeunesse",
"Nécromancie",
"Écoles",
"Potter, Harry (Personnage fictif)",
"Romans, nouvelles",
"Magie",
"Family",
"Orphans & Foster Homes",
"Magía",
"Novela juvenil",
"Juvenile",
"Children's stories, English",
"Sieg",
"Basilisk",
"Das Böse",
"Das Gute",
"Internat",
"Lebensgefahr",
"Lebensrettung",
"List",
"Magier",
"Jugendbuch",
"Kampf",
"Schule",
"Basilisk (Fabeltier)",
"Junge",
"Phönix",
"Deutschland Grenzschutzkommando Mitte Schule",
"Deutschland",
"Friendship, fiction",
"Hogwartes School of Witchcraft and Wizardry (Imaginary place)",
"General",
"Social Issues",
"Witches, fiction"
],
"subject_people": [
"Harry Potter",
"Hermione Granger",
"Ron Weasley",
"Albus Dumbledore",
"Hagrid",
"The Dursleys",
"Gilderoy Lockhart",
"Dobby",
"Moaning Myrtle",
"Ginny Weasley",
"Draco Malfoy",
"Hermine Granger",
"Ron Weasly",
"Harry Potter (Fictitious character)"
],
"key": "/works/OL82537W",
"authors": [
{
"author": {
"key": "/authors/OL23919A"
},
"type": {
"key": "/type/author_role"
}
}
],
"excerpts": [
{
"excerpt": "Not for the first time, an argument had broken out over breakfast at number four, Privet Drive.",
"comment": "first sentence",
"author": {
"key": "/people/seabelis"
}
}
],
"type": {
"key": "/type/work"
},
"latest_revision": 80,
"revision": 80,
"created": {
"type": "/type/datetime",
"value": "2009-10-17T07:07:29.461716"
},
"last_modified": {
"type": "/type/datetime",
"value": "2022-06-22T07:57:49.863271"
}
},

Not all entries have both the title and description fields, so you need try...except clauses to prevent a KeyError:
def trim_json(d):
    newd = {}
    try:
        newd["title"] = d["title"]
    except KeyError:
        pass
    try:
        newd["description"] = d["description"]
    except KeyError:
        pass
    return newd
Or, more elegantly, you could use a key filter in a dictionary comprehension:
key_filter = ['title', 'description']
cleaned_data = [{k:d[k] for k in key_filter if k in d} for d in book_data]
And since the first element in the entries list is not book data (it has neither a title nor a description key), you should start the list comprehension after the first element:
def clean():
    books = requests.get('https://openlibrary.org/authors/OL23919A/works.json')
    books_parsed = books.json()
    book_data = books_parsed['entries']
    # Skip the first entry, then keep only the wanted keys
    cleaned_data = [trim_json(d) for d in book_data[1:]]
    # Return the trimmed list, not the raw entries
    return cleaned_data
This avoids producing an empty dictionary for an entry that corresponds to no book.
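If a placeholder is wanted whenever the description is missing, as the question asks, dict.get with a default is one option. The following is only a minimal sketch: the helper name trim_entry, the 'undefined' placeholder and the sample entries are assumptions made for illustration. It also assumes that a description may arrive either as a plain string or wrapped in a dict with a "value" key, which Open Library appears to do for some records; drop that branch if your data never contains it.

```python
def trim_entry(entry, placeholder="undefined"):
    """Keep only title and description, filling gaps with a placeholder."""
    description = entry.get("description", placeholder)
    # Assumption: some records wrap the text as {"type": ..., "value": ...}.
    if isinstance(description, dict):
        description = description.get("value", placeholder)
    return {
        "title": entry.get("title", placeholder),
        "description": description,
    }


# Hypothetical entries shaped like the ones in the question.
sample_entries = [
    {"title": "Journey to Hogwarts", "revision": 1},
    {"title": "Harry Potter and the Chamber of Secrets",
     "description": "Harry Potter #2"},
    {"title": "Wrapped example",
     "description": {"type": "/type/text", "value": "wrapped text"}},
]

trimmed = [trim_entry(e) for e in sample_entries]
print(trimmed)
```

Unlike the try...except version, every resulting dict has both keys, which can make later processing (for example writing a CSV) simpler.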

Use the json library, which is part of the Python standard library.
If your JSON string is stored in a variable called json_str, you can run:
import json
info = json.loads(json_str)
title = info['title']

Related

How to loop through sibling tags while scraping data

I am trying to scrape editor data from this page using the Python Scrapy framework.
The problem I am facing is that every tag is a sibling: the editor roles are inside h3 tags and the names are inside div tags, all within a div tag with the id "editors-section". I can loop through each div tag like this:
response.css("#editors-section>div.row.align-items-center")
and collect the editor name and organization, but how do I collect their respective roles? How do I loop through all the tags? Thanks.
You can use relative XPath expressions with the following-sibling axis and, by checking each selector's root.tag attribute to detect the next role header, determine each person's role accurately.
For example:
for header in response.xpath("//h2"):
    role = header.xpath("./text()").get()
    for sibling in header.xpath("./following-sibling::*"):
        # Stop at the next role header
        if sibling.root.tag == "h2":
            break
        name = sibling.xpath(".//h3/*/text()").get()
        location = sibling.xpath(".//p[@class='mb-2']/text()").get()
        if name and location:
            yield {
                "role": role.strip(),
                "name": name.strip(),
                "location": location.strip()
            }
OUTPUT
[
{
"role": "Editors-in-Chief",
"name": "Hua Wang",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Editors-in-Chief",
"name": "Gabriele Morra",
"location": "University of Louisiana at Lafayette, USA"
},
{
"role": "Board Members",
"name": "Luca Caricchi",
"location": "University of Geneva, Switzerland"
},
{
"role": "Board Members",
"name": "Michael Fehler",
"location": "Massachusetts Institute of Technology, USA"
},
{
"role": "Board Members",
"name": "Peter Gerstoft",
"location": "Scripps Institution of Oceanography, USA"
},
{
"role": "Board Members",
"name": "Forrest M. Hoffman",
"location": "Oak Ridge National Laboratory, United States of America"
},
{
"role": "Board Members",
"name": "Xiangyun Hu",
"location": "China University of Geosciences, China"
},
{
"role": "Board Members",
"name": "Guangmin Hu",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Board Members",
"name": "Qingkai Kong",
"location": "UC Berkeley, USA"
},
{
"role": "Board Members",
"name": "Yuemin Li",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Board Members",
"name": "Hongjun Lin",
"location": "Zhejiang Normal University, China"
},
{
"role": "Board Members",
"name": "Aldo Lipani",
"location": "University College London, United Kingdom"
},
{
"role": "Board Members",
"name": "Zhigang Peng",
"location": "Georgia Institute of Technology, USA"
},
{
"role": "Board Members",
"name": "Piero Poli",
"location": "Grenoble Alpes University, France"
},
{
"role": "Board Members",
"name": "Kunfeng Qiu",
"location": "China University of Geoscience, China"
},
{
"role": "Board Members",
"name": "Calogero Schillaci",
"location": "JRC European Commission, Italy"
},
{
"role": "Board Members",
"name": "Hosein Shahnas",
"location": "University of Toronto, Canada"
},
{
"role": "Board Members",
"name": "Byung-Dal So",
"location": "Kangwon National University, South Korea"
},
{
"role": "Board Members",
"name": "Rui Wang",
"location": "China University of Geoscience, China"
},
{
"role": "Board Members",
"name": "Yong Wang",
"location": "East Carolina University, USA"
},
{
"role": "Board Members",
"name": "Zhiguo Wang",
"location": "Xi'an Jiaotong University, China"
},
{
"role": "Board Members",
"name": "Jun Xia",
"location": "Wuhan University, China"
},
{
"role": "Board Members",
"name": "Lizhi Xiao",
"location": "China University of Petroleum(Beijing), China"
},
{
"role": "Board Members",
"name": "Chicheng Xu",
"location": "Aramco Services Company, USA"
},
{
"role": "Board Members",
"name": "Zhibing Yang",
"location": "Wuhan University, China"
},
{
"role": "Board Members",
"name": "Nana Yoshimitsu",
"location": "Kyoto University, Japan"
},
{
"role": "Board Members",
"name": "Hongyan Zhang",
"location": "Wuhan University, China"
}
]
The same result using a slightly different approach (and a single for loop): I find each h3 element (the name) and get the role (the first h2 element above it) using the preceding axis in XPath:
def parse(self, response):
    for h3_node in response.xpath('//div[@class="container"]//h3'):
        # The nearest h2 before this h3 holds the role
        role = h3_node.xpath('normalize-space(./preceding::h2[1])').get()
        name = h3_node.xpath('normalize-space(.)').get()
        location = h3_node.xpath("normalize-space(./following-sibling::p[1])").get()
        if name and location:
            yield {
                "role": role,
                "name": name,
                "location": location,
            }
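Both answers rely on the same idea: walk the elements in document order and attach each editor to the most recent role heading. The sketch below illustrates that idea outside Scrapy, using only the standard library's xml.etree.ElementTree instead of Scrapy selectors, so it is a swapped-in technique rather than the answers' code. The markup, the placeholder names and the function editors_with_roles are all hypothetical; the real page is likely more deeply nested and not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: role headings (h2) and editor blocks (div)
# are siblings inside one container, as described in the question.
SAMPLE = """
<div id="editors-section">
  <h2>Editors-in-Chief</h2>
  <div class="row"><h3><a>Editor A</a></h3><p class="mb-2">University A</p></div>
  <h2>Board Members</h2>
  <div class="row"><h3><a>Editor B</a></h3><p class="mb-2">University B</p></div>
  <div class="row"><h3><a>Editor C</a></h3><p class="mb-2">University C</p></div>
</div>
"""


def editors_with_roles(markup):
    """Walk the siblings in document order, remembering the latest h2."""
    root = ET.fromstring(markup)
    current_role = None
    rows = []
    for child in root:
        if child.tag == "h2":
            # A new heading: later editor blocks belong to this role.
            current_role = (child.text or "").strip()
            continue
        name = child.findtext(".//h3/a")
        location = child.findtext(".//p[@class='mb-2']")
        if name and location:
            rows.append({"role": current_role,
                         "name": name.strip(),
                         "location": location.strip()})
    return rows


editors = editors_with_roles(SAMPLE)
for row in editors:
    print(row)
```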

How to access Twitter included data object?

I have extracted the following Twitter data using Tweepy. However, I am not able to fetch data from the includes object. I am specifically trying to fetch the URL and description data. I can see from json_response that both the URL and the description are present.
My data has the following structure:
{
"data": [
{
"attachments": {
"media_keys": [
"3_1376989039262195713"
]
},
"author_id": "964661980551266304",
"created_at": "2021-03-30T20:05:45.000Z",
"id": "1376989044039544836",
"text": "#RichardGrenell I also want to speak out against this FB group who blocked me (after asking me to invite all my friends) for making the point that this recall not be made a MAGA one. \n\nI didn\u2019t stump on the ground for Trump, I did it for my children."
},
{
"attachments": {
"media_keys": [
"3_1376986160963145736",
"3_1376986160988368898",
"3_1376986160963198980",
"3_1376986160954757129"
]
},
"author_id": "1000347213145563136",
"created_at": "2021-03-30T19:54:20.000Z",
"id": "1376986169704071171",
"text": "#Bobbrock8013 #irishson19161 #RandPaul It's ok to question the election of Trump, but if you question Biden's win you are a \"domestic terrorist.\" Does the Biden Admin welcome a discussion of opposing views on policies regarding lockdowns, masks and vaccines? Why is Big Tech censoring conservatives? Fascists censor."
},
{
"attachments": {
"media_keys": [
"3_1376961169450221571"
]
},
"author_id": "328673472",
"created_at": "2021-03-30T18:15:00.000Z",
"id": "1376961171841036291",
"text": "#ByronYork Newsworthy, but Democrats via their minions will likely censor Trump's statement from Twitter, Facebook, CNN, MSNBC, Washington Post, NY Times etc You know our free speech rules now are based on the Democrats' version of what they will ALLOW us Deplorables to say let alone think."
},
{
"author_id": "18774517",
"created_at": "2021-03-30T10:31:58.000Z",
"id": "1376844643837566986",
"text": "RT #BrexitBuster: #EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are yo\u2026"
},
{
"author_id": "52405628",
"created_at": "2021-03-30T10:30:33.000Z",
"id": "1376844286646480899",
"text": "RT #BrexitBuster: #EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are yo\u2026"
},
{
"author_id": "848911132496723969",
"created_at": "2021-03-30T10:30:11.000Z",
"id": "1376844194921250818",
"text": "RT #BrexitBuster: #EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are yo\u2026"
},
{
"attachments": {
"media_keys": [
"3_1376836461601898499",
"3_1376836461614542853"
]
},
"author_id": "848911132496723969",
"created_at": "2021-03-30T09:59:37.000Z",
"id": "1376836504308305921",
"text": "#EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are you?\n\nThen he praised Joe Biden\u2019s speech... until he was offered the chance to pen a vicious hatchet piece for the Daily Mail! Pointing this out earned me a block.\n#shapeshiftingcreep"
},
{
"attachments": {
"media_keys": [
"3_1376821889004363777"
]
},
"author_id": "31308988",
"created_at": "2021-03-30T09:01:34.000Z",
"id": "1376821895811715073",
"text": "A lady sent this to my messenger right before she blocked me because she was mad I typed the names of Trump's sex assault victims"
},
{
"attachments": {
"media_keys": [
"3_1376704749379145731"
]
},
"author_id": "198202008",
"created_at": "2021-03-30T01:16:05.000Z",
"id": "1376704753145643014",
"text": "#moondancer34 #MrCrispyMAGA #lonelymilkshake #EFMoriarty #CBSNews Who is this person who blocked me? A MAGA lover? Guess that\u2019s why. But how ironic that he\u2019s a Trump supporter yet a WA fan when Woody is about as liberal as they get. In fact he donated to Hillary\u2019s campaign so she\u2019d win against Trump. Whatever! \ud83d\ude02\ud83e\udd37\u200d\u2640\ufe0f"
}
],
"includes": {
"media": [
{
"media_key": "3_1376989039262195713",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg"
},
{
"media_key": "3_1376986160963145736",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnijWUAgfPlb.jpg"
},
{
"media_key": "3_1376986160988368898",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnipXMAIHmJp.jpg"
},
{
"media_key": "3_1376986160963198980",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnijXIAQ4F_x.jpg"
},
{
"media_key": "3_1376986160954757129",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnihWUAkr8bi.jpg"
},
{
"media_key": "3_1376961169450221571",
"type": "photo",
"url": "https://pbs.twimg.com/media/Exvy416WQAMRlO0.jpg"
},
{
"media_key": "3_1376836461601898499",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg"
},
{
"media_key": "3_1376836461614542853",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg"
},
{
"media_key": "3_1376821889004363777",
"type": "photo",
"url": "https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg"
},
{
"media_key": "3_1376704749379145731",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg"
}
],
"users": [
{
"created_at": "2018-02-17T00:45:13.000Z",
"description": "Congressional Candidate for CA-28 Proud Angeleno/Catholic/Californio by marriage Localist\u2022Centrist\u2022Pragmatist\u2022Realist",
"id": "964661980551266304",
"name": "Beatrice Cardenas",
"username": "RealBetyCardens"
},
{
"created_at": "2018-05-26T12:05:35.000Z",
"description": "Following President Trump .... KAG 2020 \ud83c\uddfa\ud83c\uddf8",
"id": "1000347213145563136",
"name": "Joseph Fong",
"username": "JosephEugeneFo1"
},
{
"created_at": "2011-07-03T20:29:43.000Z",
"description": "Husband, Dad, Granddad, Christian,Army MP Sgt vet, I.U. grad, former banker & retired City Finance Director, Reagan guy. Cancer survivor. \u271d\ufe0f\ud83c\uddfa\ud83c\uddf8",
"id": "328673472",
"name": "Steve B",
"username": "Stevebfrs"
},
{
"created_at": "2009-01-08T19:06:29.000Z",
"description": "a younger Victor Meldrew but interesting - I hope - nice sometimes !",
"id": "18774517",
"name": "NORBET",
"username": "NORBET"
},
{
"created_at": "2009-06-30T14:17:41.000Z",
"description": "Tanglewood and Gretsch",
"id": "52405628",
"name": "FSociety Tom \ud83c\uddea\ud83c\uddfa #FBPE ANTIFA #RESIST #FBPPR #BLM",
"username": "thebdaman"
},
{
"created_at": "2017-04-03T14:52:40.000Z",
"description": "We are the Remain Resistance... popping Brexit bubbles one at a time. Mostly sarcasm, occasionally deadly serious. Love the UK & the EU. Detest racism & Nazis.",
"id": "848911132496723969",
"name": "Brexit Buster",
"username": "BrexitBuster"
},
{
"created_at": "2009-04-15T02:18:58.000Z",
"description": "No DMs !!! \ud83c\udf0a \ud83c\udf0a\nBLM ,Trans lives matter, LGBT \ud83c\udf08\nAlly of all marginalized",
"id": "31308988",
"name": "Stephy Pachuco (Her, She) \ud83c\udf0a\ud83c\udf0a",
"username": "Stephaniespc"
},
{
"created_at": "2010-10-03T16:56:45.000Z",
"description": "How'd you know I was looking at you if you weren't looking at me? \ud83d\udde3Mike Patton \u2615\ufe0fCoffee \ud83d\ude0eWeekends \ud83c\udf0aPolitics \ud83d\ude0dNYC \ud83e\udd96Museum Employee",
"id": "198202008",
"name": "Patti\ud83d\uddfd",
"username": "PattiFromNYC"
}
]
},
"meta": {
"newest_id": "1376989044039544836",
"next_token": "b26v89c19zqg8o3fosqtjm19orv2gber5hh7b0fu7uem5",
"oldest_id": "1376704753145643014",
"result_count": 9
}
}
I can successfully fetch 'id', 'text', 'created_at', and 'author_id' from the data object using the following code. However, the code does not retrieve the 'url' and 'description' values from the includes object, which leaves me with two empty columns.
# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Create headers for the data
csvWriter.writerow(
    ['author id', 'created_at', 'id', 'tweet', 'bio', 'image_url'])
csvFile.close()

def append_to_csv(json_response, fileName):
    # A counter variable
    counter = 0
    # Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    # Loop through each tweet
    for tweet in json_response['data']:
        # We will create a variable for each since some of the keys might not exist for some tweets
        # So we will account for that
        # 1. Author ID
        author_id = tweet['author_id']
        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])
        # 3. Tweet ID
        tweet_id = tweet['id']
        # 4. Tweet text
        text = tweet['text']
        # 5. description
        if('description' in tweet):
            bio = tweet['users']['description']
        else:
            bio = " "
        # 6. image url
        if ('url' in tweet):
            image_url = tweet['media']['url']
        else:
            image_url = " "
        # Assemble all data in a list
        res = [author_id, created_at, tweet_id, text, bio, image_url]
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1
    # When done, close the CSV file
    csvFile.close()
    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)
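In the structure shown above, 'description' and 'url' are not keys of each tweet; they sit under json_response['includes'], linked to tweets through 'author_id' and 'media_keys'. One possible approach, sketched here under the assumption that the response always has that shape, is to build lookup dictionaries once and join them per tweet. The function name tweet_rows, the trimmed sample and its placeholder texts are hypothetical, and joining several image URLs with a space is just one design choice.

```python
def tweet_rows(json_response):
    """Join each tweet with its author's bio and attached media URLs."""
    includes = json_response.get("includes", {})
    # Index the included objects by the keys that tweets refer to.
    users_by_id = {u["id"]: u for u in includes.get("users", [])}
    media_by_key = {m["media_key"]: m for m in includes.get("media", [])}

    rows = []
    for tweet in json_response.get("data", []):
        author = users_by_id.get(tweet["author_id"], {})
        bio = author.get("description", "")
        # Tweets without media have no "attachments" key at all.
        media_keys = tweet.get("attachments", {}).get("media_keys", [])
        urls = [media_by_key[k]["url"] for k in media_keys
                if "url" in media_by_key.get(k, {})]
        rows.append([tweet["author_id"], tweet["created_at"], tweet["id"],
                     tweet["text"], bio, " ".join(urls)])
    return rows


# Trimmed, hypothetical sample following the structure in the question.
sample = {
    "data": [
        {"attachments": {"media_keys": ["3_1376989039262195713"]},
         "author_id": "964661980551266304",
         "created_at": "2021-03-30T20:05:45.000Z",
         "id": "1376989044039544836", "text": "first example"},
        {"author_id": "18774517",
         "created_at": "2021-03-30T10:31:58.000Z",
         "id": "1376844643837566986", "text": "second example"},
    ],
    "includes": {
        "media": [{"media_key": "3_1376989039262195713", "type": "photo",
                   "url": "https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg"}],
        "users": [{"id": "964661980551266304", "description": "bio one"},
                  {"id": "18774517", "description": "bio two"}],
    },
}

rows = tweet_rows(sample)
for row in rows:
    print(row)
```

Each returned row could then be passed to csvWriter.writerow in place of the res list.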

How to execute multiple json objects stored in file object [duplicate]

I have a very large json file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields.
Each object is basically someone's user profile on a job searching website, but it comes with many unwanted key-value pairs that are not relevant to my analysis. There are about 3 million of these profiles.
I'd like to write each new profile/object to a json file, cleaned.json. Essentially this should be a copy of the original json file, except any of the key-value pairs not mentioned in fields have been removed from all 3 million profiles.
To do this, I wrote the following code:
# fields to keep
fields = ["skills", "industry", "summary", "education", "experience"]
with open('cleaned.json', 'w', encoding='UTF8') as f:
    for profile in open(path_to_file, encoding = 'UTF8'):
        profile = json.loads(profile)
        # remove unwanted fields from profile
        for key in list(profile.keys()):
            if key not in fields:
                del(profile[key])
        # write profile to new json file
        json.dump(profile, f)
To test whether it worked, I tried reading the json file in again, like so:
for foo in open('cleaned.json', encoding='UTF8'):
    foo = json.loads(foo)
    print(json.dumps(foo, indent=4))
But I'm getting this error: JSONDecodeError: Extra data on the foo = json.loads(foo) line.
I've tested this by only modifying 1 profile from the original json and writing this modified profile to cleaned.json, and cleaned.json looks like this (except it's all on one line, I've just pretty printed it for this post):
{
"skills": [
"Key Account Development",
"Strategic Planning",
"Market Planning",
"Team Leadership",
"Negotiation",
"Forecasting",
"Key Account Management",
"Sales Management",
"New Business Development",
"Business Planning",
"Cross-functional Team Leadership",
"Budgeting",
"Strategy Development",
"Business Strategy",
"Consultative Selling",
"Medical Devices",
"Customer Relations",
"Contract Negotiation",
"Mentoring",
"Coaching",
"Healthcare",
"Territory",
"Sales Process",
"Direct Sales",
"Sales Operations",
"Pharmaceutical Sales"
],
"industry": "Medical Devices",
"summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals."
}{
"education": [
{
"start": "2008",
"major": "Economics",
"end": "2008",
"name": "Columbia University - Columbia Business School",
"desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"
},
{
"start": "2007",
"end": "2007",
"name": "Columbia University - Columbia Business School"
},
{
"major": "Cancer genomics",
"end": "2001",
"name": "G\u00f6teborgs universitet",
"degree": "Ph.D.",
"start": "1996",
"desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""
},
{
"start": "1994",
"major": "Biology, Medicine;German Language",
"end": "1995",
"name": "Universit\u00e4t Regensburg",
"degree": "Cancer Research, Coursework"
},
{
"major": "Biology",
"end": "1994",
"name": "G\u00f6teborgs universitet",
"degree": "Master",
"start": "1989",
"desc": ""
},
{
"start": "1992",
"major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc",
"end": "1993",
"name": "The University of Georgia",
"desc": "Scholarship for one full year of Graduate Studies."
}
],
"skills": [
"Molecular Biology",
"Biomarkers"
],
"industry": "Pharmaceuticals",
"experience": [
{
"org": "Johnson and Johnson",
"title": "Senior Scientist, Oncology Biomarkers",
"end": "Present",
"start": "November 2009",
"desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."
},
{
"org": "Albert Einstein Medical Center",
"title": "Associate at Dept of Molecular Genetics",
"start": "September 2008",
"desc": "Single Cell Gene expression."
},
{
"org": "Columbia University",
"title": "Associate Research Scientist",
"start": "August 2006",
"desc": "Work on peptide to restore wt p53 function in cancer."
},
{
"org": "Memorial Sloan Kettering Cancer Center",
"title": "Post Doctoral Research Fellow",
"start": "January 2003",
"desc": "Molecular profiling of colorectal cancer."
},
{
"org": "Sahlgrenska University Hospital",
"title": "Research Scientist",
"start": "November 2001",
"desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."
}
],
"summary": "Ph.D. scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine."
}
So when I read this in, I'm getting the error. What am I doing wrong? I guess there is something wrong with the way I'm writing the profile to cleaned.json?
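The "Extra data" error is consistent with json.dump writing every object back to back on a single line (note the `}{` in the output above), so a line-by-line reader hands json.loads several objects at once. One possible fix, sketched below, is to write one object per line by appending a newline after each record. The function name write_cleaned and the tiny in-memory sample are hypothetical; with real files you would pass the opened input and output file objects instead of io.StringIO.

```python
import io
import json

FIELDS = ["skills", "industry", "summary", "education", "experience"]


def write_cleaned(lines, out, fields=FIELDS):
    """Write one filtered JSON object per line to the open file `out`."""
    count = 0
    for line in lines:
        profile = json.loads(line)
        kept = {k: v for k, v in profile.items() if k in fields}
        out.write(json.dumps(kept) + "\n")  # the newline separates records
        count += 1
    return count


# In-memory stand-ins for the input and output files (hypothetical data).
source = io.StringIO(
    '{"_id": "a", "industry": "Medical Devices", "locality": "X"}\n'
    '{"_id": "b", "skills": ["Biomarkers"], "interval": 20}\n'
)
target = io.StringIO()
written = write_cleaned(source, target)

# Reading back line by line now yields exactly one object per line.
target.seek(0)
cleaned = [json.loads(line) for line in target]
print(cleaned)
```

Because only one profile is held in memory at a time, the same loop should also suit a very large input file.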
Sample input for testing
Sample input has 3 profiles.
{"_id": "in-00000001", "name": {"family_name": "Mazalu MBA", "given_name": "Dr Catalin"}, "locality": "United States", "skills": ["Key Account Development", "Strategic Planning", "Market Planning", "Team Leadership", "Negotiation", "Forecasting", "Key Account Management", "Sales Management", "New Business Development", "Business Planning", "Cross-functional Team Leadership", "Budgeting", "Strategy Development", "Business Strategy", "Consultative Selling", "Medical Devices", "Customer Relations", "Contract Negotiation", "Mentoring", "Coaching", "Healthcare", "Territory", "Sales Process", "Direct Sales", "Sales Operations", "Pharmaceutical Sales"], "industry": "Medical Devices", "summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. 
Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals.", "url": "http://www.linkedin.com/in/00000001", "also_view": [{"url": "http://www.linkedin.com/pub/krisa-drost/45/909/513", "id": "pub-krisa-drost-45-909-513"}, {"url": "http://ro.linkedin.com/pub/florin-ut/18/b33/77b", "id": "pub-florin-ut-18-b33-77b"}, {"url": "http://ro.linkedin.com/pub/cristian-radu/21/225/149", "id": "pub-cristian-radu-21-225-149"}, {"url": "http://ro.linkedin.com/pub/traian-rusu/16/652/279", "id": "pub-traian-rusu-16-652-279"}, {"url": "http://ro.linkedin.com/pub/dumitrescu-catalin/3/283/92", "id": "pub-dumitrescu-catalin-3-283-92"}, {"url": "http://www.linkedin.com/pub/jody-brelsford/9/21a/354", "id": "pub-jody-brelsford-9-21a-354"}, {"url": "http://www.linkedin.com/pub/mary-anne-dilloway/2/55a/18", "id": "pub-mary-anne-dilloway-2-55a-18"}, {"url": "http://ro.linkedin.com/pub/carmen-baleanu/2b/252/203", "id": "pub-carmen-baleanu-2b-252-203"}, {"url": "http://il.linkedin.com/in/shimonlobel", "id": "in-shimonlobel"}, {"url": "http://ro.linkedin.com/pub/monica-danilescu/19/36a/121", "id": "pub-monica-danilescu-19-36a-121"}]}
{"_id": "in-00001", "education": [{"start": "2008", "major": "Economics", "end": "2008", "name": "Columbia University - Columbia Business School", "desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"}, {"start": "2007", "end": "2007", "name": "Columbia University - Columbia Business School"}, {"major": "Cancer genomics", "end": "2001", "name": "G\u00f6teborgs universitet", "degree": "Ph.D.", "start": "1996", "desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""}, {"start": "1994", "major": "Biology, Medicine;German Language", "end": "1995", "name": "Universit\u00e4t Regensburg", "degree": "Cancer Research, Coursework"}, {"major": "Biology", "end": "1994", "name": "G\u00f6teborgs universitet", "degree": "Master", "start": "1989", "desc": ""}, {"start": "1992", "major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc", "end": "1993", "name": "The University of Georgia", "desc": "Scholarship for one full year of Graduate Studies."}], "group": {"affilition": ["ASMALLWORLD.net", "Biomarker Research & Executive Network", "Biomarker Society", "Biomarkers", "Biomarkers in Discovery, Development and the Clinic Network", "Biotechnology/Pharmaceuticals", "Circulating Tumor Cell (CTC) and Cancer Stem Cell Group", "Clinical Development Job Opportunities - Europe", "Epigenetics", "Molecular Diagnostics Professional Network", "Molecular Diagnostics for Cancer Drug Development Forum", "NYC Women in Biotech", "Oncology Drug Development (Premier Group For Cancer Drug Development)", "Oncology Pharma\u2122", "Personalized Medicine", "Personalized Oncology Medicine - Global Group", "Professionals in the Pharmaceutical and Biotech Industry", "Svenskar i New York", "Translational Medicine Alliance"]}, "name": {"family_name": "Forslund", "given_name": "Ann"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" 
style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nSenior Scientist, Oncology Biomarkers\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/johnson-&-johnson?trk=ppro_cprof\"><span class=\"org summary\">Johnson and Johnson</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nAssociate at Dept of Molecular Genetics\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/einstein-medical-center-philadelphia?trk=ppro_cprof\"><span class=\"org summary\">Albert Einstein Medical Center</span></a>\n</li>\n<li>\nAssociate Research Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/columbia-university?trk=ppro_cprof\"><span class=\"org summary\">Columbia University</span></a>\n</li>\n<li>\nPost Doctoral Research Fellow\n<span class=\"at\">at </span>\nMemorial Sloan Kettering Cancer Center\n</li>\n</ul><div class=\"showhide-block\" id=\"morepast\">\n<ul class=\"past\"><li>\nResearch Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/sahlgrenska-university-hospital?trk=ppro_cprof\"><span class=\"org summary\">Sahlgrenska University Hospital</span></a>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nColumbia University - Columbia Business School\n</li>\n<li>\nColumbia University - Columbia Business School\n</li>\n<li>\nG\u00f6teborgs universitet\n</li>\n</ul><div class=\"showhide-block\" id=\"moreedu\">\n<ul><li>\n<div 
name=\"education\">\nUniversit\u00e4t Regensburg\n</div>\n</li>\n<li>\n<div name=\"education\">\nG\u00f6teborgs universitet\n</div>\n</li>\n<li>\n<div name=\"education\">\nThe University of Georgia\n</div>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>244</strong> connections\n</p>\n</dd>\n</dl>", "locality": "Antwerp Area, Belgium", "skills": ["Molecular Biology", "Biomarkers"], "industry": "Pharmaceuticals", "interval": 20, "experience": [{"org": "Johnson and Johnson", "title": "Senior Scientist, Oncology Biomarkers", "end": "Present", "start": "November 2009", "desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."}, {"org": "Albert Einstein Medical Center", "title": "Associate at Dept of Molecular Genetics", "start": "September 2008", "desc": "Single Cell Gene expression."}, {"org": "Columbia University", "title": "Associate Research Scientist", "start": "August 2006", "desc": "Work on peptide to restore wt p53 function in cancer."}, {"org": "Memorial Sloan Kettering Cancer Center", "title": "Post Doctoral Research Fellow", "start": "January 2003", "desc": "Molecular profiling of colorectal cancer."}, {"org": "Sahlgrenska University Hospital", "title": "Research Scientist", "start": "November 2001", "desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."}], "summary": "Ph.D. 
scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine.", "url": "http://be.linkedin.com/in/00001", "also_view": [{"url": "http://www.linkedin.com/pub/peter-king/4/993/a16", "id": "pub-peter-king-4-993-a16"}, {"url": "http://www.linkedin.com/pub/hans-winkler/1/1ab/78a", "id": "pub-hans-winkler-1-1ab-78a"}, {"url": "http://de.linkedin.com/pub/michael-koslowski/26/964/99b", "id": "pub-michael-koslowski-26-964-99b"}, {"url": "http://de.linkedin.com/pub/werner-seiz/b/14/436", "id": "pub-werner-seiz-b-14-436"}, {"url": "http://de.linkedin.com/pub/miro-venturi/7/725/217", "id": "pub-miro-venturi-7-725-217"}, {"url": "http://ch.linkedin.com/pub/lisa-d-amato/3/808/267", "id": "pub-lisa-d-amato-3-808-267"}, {"url": "http://www.linkedin.com/pub/june-kaplow-ph-d/2/382/924", "id": "pub-june-kaplow-ph-d-2-382-924"}, {"url": "http://fr.linkedin.com/pub/fabien-schmidlin/b/b73/4b2", "id": "pub-fabien-schmidlin-b-b73-4b2"}, {"url": "http://be.linkedin.com/pub/tine-casneuf/2/563/884", "id": "pub-tine-casneuf-2-563-884"}, {"url": "http://be.linkedin.com/pub/jeroen-aerssens/0/b9a/6ba", "id": "pub-jeroen-aerssens-0-b9a-6ba"}], "specilities": "Biomarkers in Oncology, Cancer Genomics, Molecular Profiling of Cancer, Translational Cancer Research, Early Development Drug Discovery", "events": [{"from": "Sahlgrenska University Hospital", "to": "Memorial Sloan Kettering Cancer Center", "title1": "Research Scientist", "start": 24022, "title2": "Post Doctoral Research Fellow", "end": 24036}, {"from": "Memorial Sloan Kettering Cancer Center", "to": "Columbia University", "title1": "Post Doctoral Research Fellow", "start": 24036, "title2": "Associate Research Scientist", "end": 24079}, {"from": "Columbia University", "to": "Albert Einstein Medical Center", "title1": "Associate Research Scientist", "start": 24079, "title2": "Associate at Dept of Molecular Genetics", "end": 24104}, {"from": "Albert 
Einstein Medical Center", "to": "Johnson and Johnson", "title1": "Associate at Dept of Molecular Genetics", "start": 24104, "title2": "Senior Scientist, Oncology Biomarkers", "end": 24118}]}
{"_id": "in-00006", "interests": "personal genomics, nanotechnology", "education": [{"major": "Biophysics", "end": "2009", "name": "Harvard University", "degree": "Ph.D", "start": "2004", "desc": ""}, {"major": "Computer Science", "end": "2003", "name": "Yale University", "degree": "B.S.", "start": "1999", "desc": ""}], "name": {"family_name": "Douglas", "given_name": "Shawn"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nAssistant Professor\n<span class=\"at\">at </span>\nUCSF\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nTechnology Development Fellow\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/wyss-institute-for-biologically-inspired-engineering?trk=ppro_cprof\"><span class=\"org summary\">Wyss Institute for Biologically Inspired Engineering</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nHarvard University\n</li>\n<li>\nYale University\n</li>\n</ul></dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>164</strong> connections\n</p>\n</dd>\n<dt class=\"websites\">Websites</dt>\n<dd class=\"websites\">\n<ul><li>\n\nCompany Website\n\n</li>\n<li>\n\nPersonal Website\n\n</li>\n<li>\n\nBIOMOD\n\n</li>\n</ul></dd>\n</dl>", "locality": "San Francisco, California", "skills": ["DNA", "Nanotechnology", "Molecular Biology", "Software Development"], "industry": "Research", "interval": 0, "experience": [{"org": "UCSF", "title": "Assistant Professor", "end": "Present", "start": "September 2012"}, {"org": "Wyss 
Institute for Biologically Inspired Engineering", "title": "Technology Development Fellow", "start": "May 2009"}], "summary": "I am interested in inventing new methods to construct and manipulate biological molecules at the nanometer scale, toward developing new scientific tools and therapeutic devices.", "url": "http://www.linkedin.com/in/00006", "also_view": [{"url": "http://www.linkedin.com/pub/george-church/1/630/2b8", "id": "pub-george-church-1-630-2b8"}, {"url": "http://www.linkedin.com/pub/andrew-hessel/4/4b0/290", "id": "pub-andrew-hessel-4-4b0-290"}, {"url": "http://www.linkedin.com/pub/ayis-antoniou/0/216/630", "id": "pub-ayis-antoniou-0-216-630"}, {"url": "http://uk.linkedin.com/pub/matthew-bellis/35/973/888", "id": "pub-matthew-bellis-35-973-888"}, {"url": "http://www.linkedin.com/pub/john-mulligan-ph-d/7/5a3/5aa", "id": "pub-john-mulligan-ph-d-7-5a3-5aa"}, {"url": "http://www.linkedin.com/pub/yang-mao/38/621/a83", "id": "pub-yang-mao-38-621-a83"}, {"url": "http://www.linkedin.com/pub/sidney-wang/25/3b8/b84", "id": "pub-sidney-wang-25-3b8-b84"}, {"url": "http://www.linkedin.com/pub/yang-mao/9/815/369", "id": "pub-yang-mao-9-815-369"}, {"url": "http://www.linkedin.com/pub/j-markson/32/572/10", "id": "pub-j-markson-32-572-10"}], "homepage": {"BIOMOD": ["http://biomod.net/"], "Company Website": ["http://bionano.ucsf.edu/"], "Personal Website": ["http://www.shawndouglas.com/"]}, "events": [{"from": "Wyss Institute for Biologically Inspired Engineering", "to": "UCSF", "title1": "Technology Development Fellow", "start": 24112, "title2": "Assistant Professor", "end": 24152}]}
Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in a format called JSON Lines rather than standard JSON.
Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON, as I thought at one point), here's how to do that:
import json
path_to_file = "sample_input.json"
cleaned_file = "cleaned.json"
# Fields to keep.
fields = ["skills", "industry", "summary", "education", "experience"]
# Clean profiles in JSON Lines format file.
with open(path_to_file, encoding='UTF8') as inf, \
     open(cleaned_file, 'w', encoding='UTF8') as outf:
    for line in inf:
        profile = json.loads(line)  # Read a profile object.
        for key in list(profile.keys()):  # Remove unwanted fields from it.
            if key not in fields:
                del profile[key]
        outf.write(json.dumps(profile) + '\n')  # Write cleaned profile to new file.
# Test whether it worked.
with open(cleaned_file, encoding='UTF8') as cleaned:
    for line in cleaned:
        profile = json.loads(line)
        print(json.dumps(profile, indent=4))
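The same cleaning step can also be wrapped in a function that works on any pair of file-like objects, which makes it easy to try on a few records before running it on the full file. This is only a sketch: the function name clean_jsonl and the two in-memory records are made up for illustration, and a dict comprehension is used in place of the del loop.

```python
import io
import json

# Fields to keep (same list as above, stored as a set for fast lookups).
FIELDS = {"skills", "industry", "summary", "education", "experience"}


def clean_jsonl(inf, outf, fields=FIELDS):
    """Copy JSON Lines from inf to outf, keeping only the keys in fields.

    Returns the number of records written. (Hypothetical helper name.)
    """
    count = 0
    for line in inf:
        if not line.strip():  # Skip blank lines instead of failing on them.
            continue
        profile = json.loads(line)
        cleaned = {k: v for k, v in profile.items() if k in fields}
        outf.write(json.dumps(cleaned) + "\n")  # One object per line.
        count += 1
    return count


# Tiny in-memory demonstration with made-up records.
sample = io.StringIO(
    '{"_id": "a", "industry": "Research", "url": "x"}\n'
    '{"_id": "b", "skills": ["DNA"]}\n'
)
result = io.StringIO()
n = clean_jsonl(sample, result)
lines = result.getvalue().splitlines()
```

With real files, pass the two objects returned by open() instead of the StringIO buffers; only one line is held in memory at a time.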
You are dumping a new JSON object into the file every time you call json.dump(profile, f). That does not produce valid JSON, because the objects are simply concatenated instead of being embedded in an enclosing array.
E.g. {}{} instead of [{},{}]
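A small check makes the difference concrete. The strings below are made-up stand-ins for the file contents: the first is what repeated json.dump calls produce, the other two are the array form and the one-object-per-line form.

```python
import json

# What repeated json.dump() calls write: objects back to back.
concatenated = '{"a": 1}{"b": 2}'
try:
    json.loads(concatenated)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg  # The parser stops after the first object.

# Two forms that do parse: a single enclosing array ...
as_array = json.loads('[{"a": 1}, {"b": 2}]')

# ... or one object per line (JSON Lines), parsed line by line.
as_lines = [json.loads(line) for line in '{"a": 1}\n{"b": 2}\n'.splitlines()]
```

Here error_message ends up as "Extra data", the same message as in the question, while both alternatives yield the same two dictionaries.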
As for a solution: with a file this large, reading or writing while holding everything in memory is a bad idea.
I would probably try the library https://pypi.org/project/jsonstreams/ or something like it.
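If a file of back-to-back objects has already been written and is small enough to read into memory, the objects can still be recovered with the standard library alone. The sketch below uses json.JSONDecoder.raw_decode, which parses one document and reports where it ended; the helper name iter_concatenated is an assumption made for this example, not part of any library.

```python
import json


def iter_concatenated(text):
    """Yield each JSON document from a string of concatenated documents.

    (Hypothetical helper; assumes the whole text fits in memory.)
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)  # pos moves past the object.
        yield obj


recovered = list(iter_concatenated('{"a": 1}{"b": [2, 3]} {"c": null}'))
```

For a multi-gigabyte file this is not a substitute for writing one object per line in the first place, since the text has to be loaded in full.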

Writing to JSON file, then reading this same file and getting "JSONDecodeError: Extra data"

I have a very large json file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields.
Each object is basically someone's user profile on a job searching website, but it comes with many unwanted key-value pairs that are not relevant to my analysis. There are about 3 million of these profiles.
I'd like to write each new profile/object to a json file, cleaned.json. Essentially this should be a copy of the original json file, except any of the key-value pairs not mentioned in fields have been removed from all 3 million profiles.
To do this, I wrote the following code:
# fields to keep
fields = ["skills", "industry", "summary", "education", "experience"]
with open('cleaned.json', 'w', encoding='UTF8') as f:
    for profile in open(path_to_file, encoding = 'UTF8'):
        profile = json.loads(profile)
        # remove unwanted fields from profile
        for key in list(profile.keys()):
            if key not in fields:
                del(profile[key])
        # write profile to new json file
        json.dump(profile, f)
To test whether it worked, I tried reading the json file in again, like so:
for foo in open('cleaned.json', encoding='UTF8'):
    foo = json.loads(foo)
    print(json.dumps(foo, indent=4))
But I'm getting this error: JSONDecodeError: Extra data on the foo = json.loads(foo) line.
I've tested this by only modifying 1 profile from the original json and writing this modified profile to cleaned.json, and cleaned.json looks like this (except it's all on one line, I've just pretty printed it for this post):
{
"skills": [
"Key Account Development",
"Strategic Planning",
"Market Planning",
"Team Leadership",
"Negotiation",
"Forecasting",
"Key Account Management",
"Sales Management",
"New Business Development",
"Business Planning",
"Cross-functional Team Leadership",
"Budgeting",
"Strategy Development",
"Business Strategy",
"Consultative Selling",
"Medical Devices",
"Customer Relations",
"Contract Negotiation",
"Mentoring",
"Coaching",
"Healthcare",
"Territory",
"Sales Process",
"Direct Sales",
"Sales Operations",
"Pharmaceutical Sales"
],
"industry": "Medical Devices",
"summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals."
}{
"education": [
{
"start": "2008",
"major": "Economics",
"end": "2008",
"name": "Columbia University - Columbia Business School",
"desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"
},
{
"start": "2007",
"end": "2007",
"name": "Columbia University - Columbia Business School"
},
{
"major": "Cancer genomics",
"end": "2001",
"name": "G\u00f6teborgs universitet",
"degree": "Ph.D.",
"start": "1996",
"desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""
},
{
"start": "1994",
"major": "Biology, Medicine;German Language",
"end": "1995",
"name": "Universit\u00e4t Regensburg",
"degree": "Cancer Research, Coursework"
},
{
"major": "Biology",
"end": "1994",
"name": "G\u00f6teborgs universitet",
"degree": "Master",
"start": "1989",
"desc": ""
},
{
"start": "1992",
"major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc",
"end": "1993",
"name": "The University of Georgia",
"desc": "Scholarship for one full year of Graduate Studies."
}
],
"skills": [
"Molecular Biology",
"Biomarkers"
],
"industry": "Pharmaceuticals",
"experience": [
{
"org": "Johnson and Johnson",
"title": "Senior Scientist, Oncology Biomarkers",
"end": "Present",
"start": "November 2009",
"desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."
},
{
"org": "Albert Einstein Medical Center",
"title": "Associate at Dept of Molecular Genetics",
"start": "September 2008",
"desc": "Single Cell Gene expression."
},
{
"org": "Columbia University",
"title": "Associate Research Scientist",
"start": "August 2006",
"desc": "Work on peptide to restore wt p53 function in cancer."
},
{
"org": "Memorial Sloan Kettering Cancer Center",
"title": "Post Doctoral Research Fellow",
"start": "January 2003",
"desc": "Molecular profiling of colorectal cancer."
},
{
"org": "Sahlgrenska University Hospital",
"title": "Research Scientist",
"start": "November 2001",
"desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."
}
],
"summary": "Ph.D. scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine."
}
So when I read this in, I'm getting the error. What am I doing wrong? I guess there is something wrong with the way I'm writing the profile to cleaned.json?
Sample input for testing
Sample input has 3 profiles.
{"_id": "in-00000001", "name": {"family_name": "Mazalu MBA", "given_name": "Dr Catalin"}, "locality": "United States", "skills": ["Key Account Development", "Strategic Planning", "Market Planning", "Team Leadership", "Negotiation", "Forecasting", "Key Account Management", "Sales Management", "New Business Development", "Business Planning", "Cross-functional Team Leadership", "Budgeting", "Strategy Development", "Business Strategy", "Consultative Selling", "Medical Devices", "Customer Relations", "Contract Negotiation", "Mentoring", "Coaching", "Healthcare", "Territory", "Sales Process", "Direct Sales", "Sales Operations", "Pharmaceutical Sales"], "industry": "Medical Devices", "summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. 
Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals.", "url": "http://www.linkedin.com/in/00000001", "also_view": [{"url": "http://www.linkedin.com/pub/krisa-drost/45/909/513", "id": "pub-krisa-drost-45-909-513"}, {"url": "http://ro.linkedin.com/pub/florin-ut/18/b33/77b", "id": "pub-florin-ut-18-b33-77b"}, {"url": "http://ro.linkedin.com/pub/cristian-radu/21/225/149", "id": "pub-cristian-radu-21-225-149"}, {"url": "http://ro.linkedin.com/pub/traian-rusu/16/652/279", "id": "pub-traian-rusu-16-652-279"}, {"url": "http://ro.linkedin.com/pub/dumitrescu-catalin/3/283/92", "id": "pub-dumitrescu-catalin-3-283-92"}, {"url": "http://www.linkedin.com/pub/jody-brelsford/9/21a/354", "id": "pub-jody-brelsford-9-21a-354"}, {"url": "http://www.linkedin.com/pub/mary-anne-dilloway/2/55a/18", "id": "pub-mary-anne-dilloway-2-55a-18"}, {"url": "http://ro.linkedin.com/pub/carmen-baleanu/2b/252/203", "id": "pub-carmen-baleanu-2b-252-203"}, {"url": "http://il.linkedin.com/in/shimonlobel", "id": "in-shimonlobel"}, {"url": "http://ro.linkedin.com/pub/monica-danilescu/19/36a/121", "id": "pub-monica-danilescu-19-36a-121"}]}
{"_id": "in-00001", "education": [{"start": "2008", "major": "Economics", "end": "2008", "name": "Columbia University - Columbia Business School", "desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"}, {"start": "2007", "end": "2007", "name": "Columbia University - Columbia Business School"}, {"major": "Cancer genomics", "end": "2001", "name": "G\u00f6teborgs universitet", "degree": "Ph.D.", "start": "1996", "desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""}, {"start": "1994", "major": "Biology, Medicine;German Language", "end": "1995", "name": "Universit\u00e4t Regensburg", "degree": "Cancer Research, Coursework"}, {"major": "Biology", "end": "1994", "name": "G\u00f6teborgs universitet", "degree": "Master", "start": "1989", "desc": ""}, {"start": "1992", "major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc", "end": "1993", "name": "The University of Georgia", "desc": "Scholarship for one full year of Graduate Studies."}], "group": {"affilition": ["ASMALLWORLD.net", "Biomarker Research & Executive Network", "Biomarker Society", "Biomarkers", "Biomarkers in Discovery, Development and the Clinic Network", "Biotechnology/Pharmaceuticals", "Circulating Tumor Cell (CTC) and Cancer Stem Cell Group", "Clinical Development Job Opportunities - Europe", "Epigenetics", "Molecular Diagnostics Professional Network", "Molecular Diagnostics for Cancer Drug Development Forum", "NYC Women in Biotech", "Oncology Drug Development (Premier Group For Cancer Drug Development)", "Oncology Pharma\u2122", "Personalized Medicine", "Personalized Oncology Medicine - Global Group", "Professionals in the Pharmaceutical and Biotech Industry", "Svenskar i New York", "Translational Medicine Alliance"]}, "name": {"family_name": "Forslund", "given_name": "Ann"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" 
style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nSenior Scientist, Oncology Biomarkers\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/johnson-&-johnson?trk=ppro_cprof\"><span class=\"org summary\">Johnson and Johnson</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nAssociate at Dept of Molecular Genetics\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/einstein-medical-center-philadelphia?trk=ppro_cprof\"><span class=\"org summary\">Albert Einstein Medical Center</span></a>\n</li>\n<li>\nAssociate Research Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/columbia-university?trk=ppro_cprof\"><span class=\"org summary\">Columbia University</span></a>\n</li>\n<li>\nPost Doctoral Research Fellow\n<span class=\"at\">at </span>\nMemorial Sloan Kettering Cancer Center\n</li>\n</ul><div class=\"showhide-block\" id=\"morepast\">\n<ul class=\"past\"><li>\nResearch Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/sahlgrenska-university-hospital?trk=ppro_cprof\"><span class=\"org summary\">Sahlgrenska University Hospital</span></a>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nColumbia University - Columbia Business School\n</li>\n<li>\nColumbia University - Columbia Business School\n</li>\n<li>\nG\u00f6teborgs universitet\n</li>\n</ul><div class=\"showhide-block\" id=\"moreedu\">\n<ul><li>\n<div 
name=\"education\">\nUniversit\u00e4t Regensburg\n</div>\n</li>\n<li>\n<div name=\"education\">\nG\u00f6teborgs universitet\n</div>\n</li>\n<li>\n<div name=\"education\">\nThe University of Georgia\n</div>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>244</strong> connections\n</p>\n</dd>\n</dl>", "locality": "Antwerp Area, Belgium", "skills": ["Molecular Biology", "Biomarkers"], "industry": "Pharmaceuticals", "interval": 20, "experience": [{"org": "Johnson and Johnson", "title": "Senior Scientist, Oncology Biomarkers", "end": "Present", "start": "November 2009", "desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."}, {"org": "Albert Einstein Medical Center", "title": "Associate at Dept of Molecular Genetics", "start": "September 2008", "desc": "Single Cell Gene expression."}, {"org": "Columbia University", "title": "Associate Research Scientist", "start": "August 2006", "desc": "Work on peptide to restore wt p53 function in cancer."}, {"org": "Memorial Sloan Kettering Cancer Center", "title": "Post Doctoral Research Fellow", "start": "January 2003", "desc": "Molecular profiling of colorectal cancer."}, {"org": "Sahlgrenska University Hospital", "title": "Research Scientist", "start": "November 2001", "desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."}], "summary": "Ph.D. 
scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine.", "url": "http://be.linkedin.com/in/00001", "also_view": [{"url": "http://www.linkedin.com/pub/peter-king/4/993/a16", "id": "pub-peter-king-4-993-a16"}, {"url": "http://www.linkedin.com/pub/hans-winkler/1/1ab/78a", "id": "pub-hans-winkler-1-1ab-78a"}, {"url": "http://de.linkedin.com/pub/michael-koslowski/26/964/99b", "id": "pub-michael-koslowski-26-964-99b"}, {"url": "http://de.linkedin.com/pub/werner-seiz/b/14/436", "id": "pub-werner-seiz-b-14-436"}, {"url": "http://de.linkedin.com/pub/miro-venturi/7/725/217", "id": "pub-miro-venturi-7-725-217"}, {"url": "http://ch.linkedin.com/pub/lisa-d-amato/3/808/267", "id": "pub-lisa-d-amato-3-808-267"}, {"url": "http://www.linkedin.com/pub/june-kaplow-ph-d/2/382/924", "id": "pub-june-kaplow-ph-d-2-382-924"}, {"url": "http://fr.linkedin.com/pub/fabien-schmidlin/b/b73/4b2", "id": "pub-fabien-schmidlin-b-b73-4b2"}, {"url": "http://be.linkedin.com/pub/tine-casneuf/2/563/884", "id": "pub-tine-casneuf-2-563-884"}, {"url": "http://be.linkedin.com/pub/jeroen-aerssens/0/b9a/6ba", "id": "pub-jeroen-aerssens-0-b9a-6ba"}], "specilities": "Biomarkers in Oncology, Cancer Genomics, Molecular Profiling of Cancer, Translational Cancer Research, Early Development Drug Discovery", "events": [{"from": "Sahlgrenska University Hospital", "to": "Memorial Sloan Kettering Cancer Center", "title1": "Research Scientist", "start": 24022, "title2": "Post Doctoral Research Fellow", "end": 24036}, {"from": "Memorial Sloan Kettering Cancer Center", "to": "Columbia University", "title1": "Post Doctoral Research Fellow", "start": 24036, "title2": "Associate Research Scientist", "end": 24079}, {"from": "Columbia University", "to": "Albert Einstein Medical Center", "title1": "Associate Research Scientist", "start": 24079, "title2": "Associate at Dept of Molecular Genetics", "end": 24104}, {"from": "Albert 
Einstein Medical Center", "to": "Johnson and Johnson", "title1": "Associate at Dept of Molecular Genetics", "start": 24104, "title2": "Senior Scientist, Oncology Biomarkers", "end": 24118}]}
{"_id": "in-00006", "interests": "personal genomics, nanotechnology", "education": [{"major": "Biophysics", "end": "2009", "name": "Harvard University", "degree": "Ph.D", "start": "2004", "desc": ""}, {"major": "Computer Science", "end": "2003", "name": "Yale University", "degree": "B.S.", "start": "1999", "desc": ""}], "name": {"family_name": "Douglas", "given_name": "Shawn"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nAssistant Professor\n<span class=\"at\">at </span>\nUCSF\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nTechnology Development Fellow\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/wyss-institute-for-biologically-inspired-engineering?trk=ppro_cprof\"><span class=\"org summary\">Wyss Institute for Biologically Inspired Engineering</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nHarvard University\n</li>\n<li>\nYale University\n</li>\n</ul></dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>164</strong> connections\n</p>\n</dd>\n<dt class=\"websites\">Websites</dt>\n<dd class=\"websites\">\n<ul><li>\n\nCompany Website\n\n</li>\n<li>\n\nPersonal Website\n\n</li>\n<li>\n\nBIOMOD\n\n</li>\n</ul></dd>\n</dl>", "locality": "San Francisco, California", "skills": ["DNA", "Nanotechnology", "Molecular Biology", "Software Development"], "industry": "Research", "interval": 0, "experience": [{"org": "UCSF", "title": "Assistant Professor", "end": "Present", "start": "September 2012"}, {"org": "Wyss 
Institute for Biologically Inspired Engineering", "title": "Technology Development Fellow", "start": "May 2009"}], "summary": "I am interested in inventing new methods to construct and manipulate biological molecules at the nanometer scale, toward developing new scientific tools and therapeutic devices.", "url": "http://www.linkedin.com/in/00006", "also_view": [{"url": "http://www.linkedin.com/pub/george-church/1/630/2b8", "id": "pub-george-church-1-630-2b8"}, {"url": "http://www.linkedin.com/pub/andrew-hessel/4/4b0/290", "id": "pub-andrew-hessel-4-4b0-290"}, {"url": "http://www.linkedin.com/pub/ayis-antoniou/0/216/630", "id": "pub-ayis-antoniou-0-216-630"}, {"url": "http://uk.linkedin.com/pub/matthew-bellis/35/973/888", "id": "pub-matthew-bellis-35-973-888"}, {"url": "http://www.linkedin.com/pub/john-mulligan-ph-d/7/5a3/5aa", "id": "pub-john-mulligan-ph-d-7-5a3-5aa"}, {"url": "http://www.linkedin.com/pub/yang-mao/38/621/a83", "id": "pub-yang-mao-38-621-a83"}, {"url": "http://www.linkedin.com/pub/sidney-wang/25/3b8/b84", "id": "pub-sidney-wang-25-3b8-b84"}, {"url": "http://www.linkedin.com/pub/yang-mao/9/815/369", "id": "pub-yang-mao-9-815-369"}, {"url": "http://www.linkedin.com/pub/j-markson/32/572/10", "id": "pub-j-markson-32-572-10"}], "homepage": {"BIOMOD": ["http://biomod.net/"], "Company Website": ["http://bionano.ucsf.edu/"], "Personal Website": ["http://www.shawndouglas.com/"]}, "events": [{"from": "Wyss Institute for Biologically Inspired Engineering", "to": "UCSF", "title1": "Technology Development Fellow", "start": 24112, "title2": "Assistant Professor", "end": 24152}]}
Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in a format called JSON Lines rather than standard JSON.
Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON, as I thought at one point), here's how to do that:
import json
path_to_file = "sample_input.json"
cleaned_file = "cleaned.json"
# Fields to keep.
fields = ["skills", "industry", "summary", "education", "experience"]
# Clean profiles in JSON Lines format file.
with open(path_to_file, encoding='UTF8') as inf, \
     open(cleaned_file, 'w', encoding='UTF8') as outf:
    for line in inf:
        profile = json.loads(line)  # Read a profile object.
        for key in list(profile.keys()):  # Remove unwanted fields from it.
            if key not in fields:
                del profile[key]
        outf.write(json.dumps(profile) + '\n')  # Write cleaned profile to new file.
# Test whether it worked.
with open(cleaned_file, encoding='UTF8') as cleaned:
    for line in cleaned:
        profile = json.loads(line)
        print(json.dumps(profile, indent=4))
You are dumping a new JSON object into the file every time you call json.dump(profile, f). That does not produce valid JSON, because the objects are simply concatenated instead of being embedded in an enclosing array.
E.g. {}{} instead of [{},{}]
As for a solution: with a file this large, reading or writing while holding everything in memory is a bad idea.
I would probably try the library https://pypi.org/project/jsonstreams/ or something like it.
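If standard JSON output (a single array) is really needed, it can also be written one element at a time with only the standard library. This is a hedged sketch of that idea and not the jsonstreams API; the function name write_json_array is made up for illustration.

```python
import io
import json


def write_json_array(items, outf):
    """Write items to outf as one JSON array, one element at a time.

    items can be any iterable (for example a generator), so the full
    list never has to exist in memory. (Hypothetical helper name.)
    """
    outf.write("[")
    for i, item in enumerate(items):
        if i:  # Separator before every element except the first.
            outf.write(",\n")
        json.dump(item, outf)
    outf.write("]\n")


# In-memory demonstration with generated records.
buf = io.StringIO()
write_json_array(({"n": i} for i in range(3)), buf)
parsed = json.loads(buf.getvalue())
```

Note that reading such an array back with json.loads still loads the whole file, so for very large data the one-object-per-line layout remains easier to process.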

How can I declare a list of map defined types in Cassandra

I want to declare a list of objects in Cassandra, and I have already created the object type:
CREATE TYPE profiles.educations (
    major text,
    end text,
    name text,
    degree text,
    start text,
    desce text
);
How can I declare a list of this educations type (a list of maps)?
I ask because I have a JSON file in this format:
{
    ...
    "educations": [
        {
            "start": "2009",
            "major": "Business Administration and Management, General",
            "end": "2010",
            "name": "Gordon Institute of Business Science - University of Pretoria",
            "degree": "PDBA"
        },
        {
            "start": "2002",
            "major": "Marketing Management",
            "end": "2006",
            "name": "University of Pretoria/Universiteit van Pretoria",
            "degree": "B. com with specialization in Marketing Management"
        },
        {
            "major": "Finanzas",
            "end": "2013",
            "name": "Universidad de Los Andes",
            "degree": "Maestr\u00eda en Finanzas",
            "start": "2011",
            "desce": ""
        }]
    ...
}
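One possible direction, offered as a sketch rather than a confirmed answer: in CQL a column can be declared as a list of a user-defined type, and as far as I know a UDT used inside a collection has to be wrapped in frozen<>. The table name profiles.profile and the id column below are assumptions made for illustration; only the educations type comes from the question. The Python part just normalizes the JSON entries so that every field of the type is present, and it does not talk to a database.

```python
import json

# CQL held in a Python string. Table and key column names are hypothetical;
# the element type is the educations UDT defined in the question.
CREATE_TABLE = """
CREATE TABLE profiles.profile (
    id text PRIMARY KEY,
    educations list<frozen<educations>>
);
"""

# Field names of the educations type, as declared in the question.
UDT_FIELDS = ("major", "end", "name", "degree", "start", "desce")


def to_udt_values(educations):
    """Give every education entry all UDT fields; missing ones become None."""
    return [{field: edu.get(field) for field in UDT_FIELDS} for edu in educations]


# Made-up record in the same shape as the JSON above.
doc = json.loads(
    '{"educations": [{"start": "2009", "end": "2010", '
    '"name": "Some University", "degree": "PDBA", "major": "Management"}]}'
)
values = to_udt_values(doc["educations"])
```

Each element of values then has the same six keys as the type, with desce set to None when the JSON entry omits it.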
