How to execute multiple json objects stored in file object [duplicate] - python

I have a very large json file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields.
Each object is basically someone's user profile on a job searching website, but it comes with many unwanted key-value pairs that are not relevant to my analysis. There are about 3 million of these profiles.
I'd like to write each new profile/object to a json file, cleaned.json. Essentially this should be a copy of the original json file, except any of the key-value pairs not mentioned in fields have been removed from all 3 million profiles.
To do this, I wrote the following code:
# fields to keep
fields = ["skills", "industry", "summary", "education", "experience"]
with open('cleaned.json', 'w', encoding='UTF8') as f:
    for profile in open(path_to_file, encoding='UTF8'):
        profile = json.loads(profile)
        # remove unwanted fields from profile
        for key in list(profile.keys()):
            if key not in fields:
                del(profile[key])
        # write profile to new json file
        json.dump(profile, f)
To test whether it worked, I tried reading the json file in again, like so:
for foo in open('cleaned.json', encoding='UTF8'):
    foo = json.loads(foo)
    print(json.dumps(foo, indent=4))
But I'm getting this error: JSONDecodeError: Extra data on the foo = json.loads(foo) line.
I've tested this by only modifying 1 profile from the original json and writing this modified profile to cleaned.json, and cleaned.json looks like this (except it's all on one line, I've just pretty printed it for this post):
{
"skills": [
"Key Account Development",
"Strategic Planning",
"Market Planning",
"Team Leadership",
"Negotiation",
"Forecasting",
"Key Account Management",
"Sales Management",
"New Business Development",
"Business Planning",
"Cross-functional Team Leadership",
"Budgeting",
"Strategy Development",
"Business Strategy",
"Consultative Selling",
"Medical Devices",
"Customer Relations",
"Contract Negotiation",
"Mentoring",
"Coaching",
"Healthcare",
"Territory",
"Sales Process",
"Direct Sales",
"Sales Operations",
"Pharmaceutical Sales"
],
"industry": "Medical Devices",
"summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals."
}{
"education": [
{
"start": "2008",
"major": "Economics",
"end": "2008",
"name": "Columbia University - Columbia Business School",
"desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"
},
{
"start": "2007",
"end": "2007",
"name": "Columbia University - Columbia Business School"
},
{
"major": "Cancer genomics",
"end": "2001",
"name": "G\u00f6teborgs universitet",
"degree": "Ph.D.",
"start": "1996",
"desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""
},
{
"start": "1994",
"major": "Biology, Medicine;German Language",
"end": "1995",
"name": "Universit\u00e4t Regensburg",
"degree": "Cancer Research, Coursework"
},
{
"major": "Biology",
"end": "1994",
"name": "G\u00f6teborgs universitet",
"degree": "Master",
"start": "1989",
"desc": ""
},
{
"start": "1992",
"major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc",
"end": "1993",
"name": "The University of Georgia",
"desc": "Scholarship for one full year of Graduate Studies."
}
],
"skills": [
"Molecular Biology",
"Biomarkers"
],
"industry": "Pharmaceuticals",
"experience": [
{
"org": "Johnson and Johnson",
"title": "Senior Scientist, Oncology Biomarkers",
"end": "Present",
"start": "November 2009",
"desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."
},
{
"org": "Albert Einstein Medical Center",
"title": "Associate at Dept of Molecular Genetics",
"start": "September 2008",
"desc": "Single Cell Gene expression."
},
{
"org": "Columbia University",
"title": "Associate Research Scientist",
"start": "August 2006",
"desc": "Work on peptide to restore wt p53 function in cancer."
},
{
"org": "Memorial Sloan Kettering Cancer Center",
"title": "Post Doctoral Research Fellow",
"start": "January 2003",
"desc": "Molecular profiling of colorectal cancer."
},
{
"org": "Sahlgrenska University Hospital",
"title": "Research Scientist",
"start": "November 2001",
"desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."
}
],
"summary": "Ph.D. scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine."
}
So when I read this in, I'm getting the error. What am I doing wrong? I guess there is something wrong with the way I'm writing the profile to cleaned.json?
Sample input for testing
Sample input has 3 profiles.
{"_id": "in-00000001", "name": {"family_name": "Mazalu MBA", "given_name": "Dr Catalin"}, "locality": "United States", "skills": ["Key Account Development", "Strategic Planning", "Market Planning", "Team Leadership", "Negotiation", "Forecasting", "Key Account Management", "Sales Management", "New Business Development", "Business Planning", "Cross-functional Team Leadership", "Budgeting", "Strategy Development", "Business Strategy", "Consultative Selling", "Medical Devices", "Customer Relations", "Contract Negotiation", "Mentoring", "Coaching", "Healthcare", "Territory", "Sales Process", "Direct Sales", "Sales Operations", "Pharmaceutical Sales"], "industry": "Medical Devices", "summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. 
Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals.", "url": "http://www.linkedin.com/in/00000001", "also_view": [{"url": "http://www.linkedin.com/pub/krisa-drost/45/909/513", "id": "pub-krisa-drost-45-909-513"}, {"url": "http://ro.linkedin.com/pub/florin-ut/18/b33/77b", "id": "pub-florin-ut-18-b33-77b"}, {"url": "http://ro.linkedin.com/pub/cristian-radu/21/225/149", "id": "pub-cristian-radu-21-225-149"}, {"url": "http://ro.linkedin.com/pub/traian-rusu/16/652/279", "id": "pub-traian-rusu-16-652-279"}, {"url": "http://ro.linkedin.com/pub/dumitrescu-catalin/3/283/92", "id": "pub-dumitrescu-catalin-3-283-92"}, {"url": "http://www.linkedin.com/pub/jody-brelsford/9/21a/354", "id": "pub-jody-brelsford-9-21a-354"}, {"url": "http://www.linkedin.com/pub/mary-anne-dilloway/2/55a/18", "id": "pub-mary-anne-dilloway-2-55a-18"}, {"url": "http://ro.linkedin.com/pub/carmen-baleanu/2b/252/203", "id": "pub-carmen-baleanu-2b-252-203"}, {"url": "http://il.linkedin.com/in/shimonlobel", "id": "in-shimonlobel"}, {"url": "http://ro.linkedin.com/pub/monica-danilescu/19/36a/121", "id": "pub-monica-danilescu-19-36a-121"}]}
{"_id": "in-00001", "education": [{"start": "2008", "major": "Economics", "end": "2008", "name": "Columbia University - Columbia Business School", "desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"}, {"start": "2007", "end": "2007", "name": "Columbia University - Columbia Business School"}, {"major": "Cancer genomics", "end": "2001", "name": "G\u00f6teborgs universitet", "degree": "Ph.D.", "start": "1996", "desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""}, {"start": "1994", "major": "Biology, Medicine;German Language", "end": "1995", "name": "Universit\u00e4t Regensburg", "degree": "Cancer Research, Coursework"}, {"major": "Biology", "end": "1994", "name": "G\u00f6teborgs universitet", "degree": "Master", "start": "1989", "desc": ""}, {"start": "1992", "major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc", "end": "1993", "name": "The University of Georgia", "desc": "Scholarship for one full year of Graduate Studies."}], "group": {"affilition": ["ASMALLWORLD.net", "Biomarker Research & Executive Network", "Biomarker Society", "Biomarkers", "Biomarkers in Discovery, Development and the Clinic Network", "Biotechnology/Pharmaceuticals", "Circulating Tumor Cell (CTC) and Cancer Stem Cell Group", "Clinical Development Job Opportunities - Europe", "Epigenetics", "Molecular Diagnostics Professional Network", "Molecular Diagnostics for Cancer Drug Development Forum", "NYC Women in Biotech", "Oncology Drug Development (Premier Group For Cancer Drug Development)", "Oncology Pharma\u2122", "Personalized Medicine", "Personalized Oncology Medicine - Global Group", "Professionals in the Pharmaceutical and Biotech Industry", "Svenskar i New York", "Translational Medicine Alliance"]}, "name": {"family_name": "Forslund", "given_name": "Ann"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" 
style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nSenior Scientist, Oncology Biomarkers\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/johnson-&-johnson?trk=ppro_cprof\"><span class=\"org summary\">Johnson and Johnson</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nAssociate at Dept of Molecular Genetics\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/einstein-medical-center-philadelphia?trk=ppro_cprof\"><span class=\"org summary\">Albert Einstein Medical Center</span></a>\n</li>\n<li>\nAssociate Research Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/columbia-university?trk=ppro_cprof\"><span class=\"org summary\">Columbia University</span></a>\n</li>\n<li>\nPost Doctoral Research Fellow\n<span class=\"at\">at </span>\nMemorial Sloan Kettering Cancer Center\n</li>\n</ul><div class=\"showhide-block\" id=\"morepast\">\n<ul class=\"past\"><li>\nResearch Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/sahlgrenska-university-hospital?trk=ppro_cprof\"><span class=\"org summary\">Sahlgrenska University Hospital</span></a>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nColumbia University - Columbia Business School\n</li>\n<li>\nColumbia University - Columbia Business School\n</li>\n<li>\nG\u00f6teborgs universitet\n</li>\n</ul><div class=\"showhide-block\" id=\"moreedu\">\n<ul><li>\n<div 
name=\"education\">\nUniversit\u00e4t Regensburg\n</div>\n</li>\n<li>\n<div name=\"education\">\nG\u00f6teborgs universitet\n</div>\n</li>\n<li>\n<div name=\"education\">\nThe University of Georgia\n</div>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>244</strong> connections\n</p>\n</dd>\n</dl>", "locality": "Antwerp Area, Belgium", "skills": ["Molecular Biology", "Biomarkers"], "industry": "Pharmaceuticals", "interval": 20, "experience": [{"org": "Johnson and Johnson", "title": "Senior Scientist, Oncology Biomarkers", "end": "Present", "start": "November 2009", "desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."}, {"org": "Albert Einstein Medical Center", "title": "Associate at Dept of Molecular Genetics", "start": "September 2008", "desc": "Single Cell Gene expression."}, {"org": "Columbia University", "title": "Associate Research Scientist", "start": "August 2006", "desc": "Work on peptide to restore wt p53 function in cancer."}, {"org": "Memorial Sloan Kettering Cancer Center", "title": "Post Doctoral Research Fellow", "start": "January 2003", "desc": "Molecular profiling of colorectal cancer."}, {"org": "Sahlgrenska University Hospital", "title": "Research Scientist", "start": "November 2001", "desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."}], "summary": "Ph.D. 
scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine.", "url": "http://be.linkedin.com/in/00001", "also_view": [{"url": "http://www.linkedin.com/pub/peter-king/4/993/a16", "id": "pub-peter-king-4-993-a16"}, {"url": "http://www.linkedin.com/pub/hans-winkler/1/1ab/78a", "id": "pub-hans-winkler-1-1ab-78a"}, {"url": "http://de.linkedin.com/pub/michael-koslowski/26/964/99b", "id": "pub-michael-koslowski-26-964-99b"}, {"url": "http://de.linkedin.com/pub/werner-seiz/b/14/436", "id": "pub-werner-seiz-b-14-436"}, {"url": "http://de.linkedin.com/pub/miro-venturi/7/725/217", "id": "pub-miro-venturi-7-725-217"}, {"url": "http://ch.linkedin.com/pub/lisa-d-amato/3/808/267", "id": "pub-lisa-d-amato-3-808-267"}, {"url": "http://www.linkedin.com/pub/june-kaplow-ph-d/2/382/924", "id": "pub-june-kaplow-ph-d-2-382-924"}, {"url": "http://fr.linkedin.com/pub/fabien-schmidlin/b/b73/4b2", "id": "pub-fabien-schmidlin-b-b73-4b2"}, {"url": "http://be.linkedin.com/pub/tine-casneuf/2/563/884", "id": "pub-tine-casneuf-2-563-884"}, {"url": "http://be.linkedin.com/pub/jeroen-aerssens/0/b9a/6ba", "id": "pub-jeroen-aerssens-0-b9a-6ba"}], "specilities": "Biomarkers in Oncology, Cancer Genomics, Molecular Profiling of Cancer, Translational Cancer Research, Early Development Drug Discovery", "events": [{"from": "Sahlgrenska University Hospital", "to": "Memorial Sloan Kettering Cancer Center", "title1": "Research Scientist", "start": 24022, "title2": "Post Doctoral Research Fellow", "end": 24036}, {"from": "Memorial Sloan Kettering Cancer Center", "to": "Columbia University", "title1": "Post Doctoral Research Fellow", "start": 24036, "title2": "Associate Research Scientist", "end": 24079}, {"from": "Columbia University", "to": "Albert Einstein Medical Center", "title1": "Associate Research Scientist", "start": 24079, "title2": "Associate at Dept of Molecular Genetics", "end": 24104}, {"from": "Albert 
Einstein Medical Center", "to": "Johnson and Johnson", "title1": "Associate at Dept of Molecular Genetics", "start": 24104, "title2": "Senior Scientist, Oncology Biomarkers", "end": 24118}]}
{"_id": "in-00006", "interests": "personal genomics, nanotechnology", "education": [{"major": "Biophysics", "end": "2009", "name": "Harvard University", "degree": "Ph.D", "start": "2004", "desc": ""}, {"major": "Computer Science", "end": "2003", "name": "Yale University", "degree": "B.S.", "start": "1999", "desc": ""}], "name": {"family_name": "Douglas", "given_name": "Shawn"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nAssistant Professor\n<span class=\"at\">at </span>\nUCSF\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nTechnology Development Fellow\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/wyss-institute-for-biologically-inspired-engineering?trk=ppro_cprof\"><span class=\"org summary\">Wyss Institute for Biologically Inspired Engineering</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nHarvard University\n</li>\n<li>\nYale University\n</li>\n</ul></dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>164</strong> connections\n</p>\n</dd>\n<dt class=\"websites\">Websites</dt>\n<dd class=\"websites\">\n<ul><li>\n\nCompany Website\n\n</li>\n<li>\n\nPersonal Website\n\n</li>\n<li>\n\nBIOMOD\n\n</li>\n</ul></dd>\n</dl>", "locality": "San Francisco, California", "skills": ["DNA", "Nanotechnology", "Molecular Biology", "Software Development"], "industry": "Research", "interval": 0, "experience": [{"org": "UCSF", "title": "Assistant Professor", "end": "Present", "start": "September 2012"}, {"org": "Wyss 
Institute for Biologically Inspired Engineering", "title": "Technology Development Fellow", "start": "May 2009"}], "summary": "I am interested in inventing new methods to construct and manipulate biological molecules at the nanometer scale, toward developing new scientific tools and therapeutic devices.", "url": "http://www.linkedin.com/in/00006", "also_view": [{"url": "http://www.linkedin.com/pub/george-church/1/630/2b8", "id": "pub-george-church-1-630-2b8"}, {"url": "http://www.linkedin.com/pub/andrew-hessel/4/4b0/290", "id": "pub-andrew-hessel-4-4b0-290"}, {"url": "http://www.linkedin.com/pub/ayis-antoniou/0/216/630", "id": "pub-ayis-antoniou-0-216-630"}, {"url": "http://uk.linkedin.com/pub/matthew-bellis/35/973/888", "id": "pub-matthew-bellis-35-973-888"}, {"url": "http://www.linkedin.com/pub/john-mulligan-ph-d/7/5a3/5aa", "id": "pub-john-mulligan-ph-d-7-5a3-5aa"}, {"url": "http://www.linkedin.com/pub/yang-mao/38/621/a83", "id": "pub-yang-mao-38-621-a83"}, {"url": "http://www.linkedin.com/pub/sidney-wang/25/3b8/b84", "id": "pub-sidney-wang-25-3b8-b84"}, {"url": "http://www.linkedin.com/pub/yang-mao/9/815/369", "id": "pub-yang-mao-9-815-369"}, {"url": "http://www.linkedin.com/pub/j-markson/32/572/10", "id": "pub-j-markson-32-572-10"}], "homepage": {"BIOMOD": ["http://biomod.net/"], "Company Website": ["http://bionano.ucsf.edu/"], "Personal Website": ["http://www.shawndouglas.com/"]}, "events": [{"from": "Wyss Institute for Biologically Inspired Engineering", "to": "UCSF", "title1": "Technology Development Fellow", "start": 24112, "title2": "Assistant Professor", "end": 24152}]}

Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in JSON Lines format (one JSON object per line) rather than standard JSON format.
Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON format, as I thought at one point), here's how to do that:
import json
path_to_file = "sample_input.json"
cleaned_file = "cleaned.json"
# Fields to keep.
fields = ["skills", "industry", "summary", "education", "experience"]
# Clean profiles in JSON Lines format file.
with open(path_to_file, encoding='UTF8') as inf, \
        open(cleaned_file, 'w', encoding='UTF8') as outf:
    for line in inf:
        profile = json.loads(line)  # Read a profile object.
        for key in list(profile.keys()):  # Remove unwanted fields from it.
            if key not in fields:
                del profile[key]
        outf.write(json.dumps(profile) + '\n')  # Write cleaned profile to new file.
# Test whether it worked.
with open(cleaned_file, encoding='UTF8') as cleaned:
    for line in cleaned:
        profile = json.loads(line)
        print(json.dumps(profile, indent=4))
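An alternative to deleting keys in place is to build a new dictionary that contains only the wanted keys. The sketch below is a variation on the loop above, not part of the original answer: the helper name clean_line is hypothetical, and storing the field names in a set is an optional tweak that keeps membership tests cheap across millions of profiles.

```python
import json

# Fields to keep, stored as a set so each membership test is O(1).
FIELDS = {"skills", "industry", "summary", "education", "experience"}

def clean_line(line, fields=FIELDS):
    """Return one JSON Lines record holding only the wanted keys (hypothetical helper)."""
    profile = json.loads(line)
    kept = {key: value for key, value in profile.items() if key in fields}
    # The trailing newline is what keeps the output in JSON Lines format.
    return json.dumps(kept) + "\n"

sample = '{"_id": "in-00006", "industry": "Research", "skills": ["DNA"]}'
cleaned = clean_line(sample)
print(cleaned, end="")
```

Inside the main loop this would replace the inner for loop and the write call with outf.write(clean_line(line)).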

You are appending a new JSON object to the file every time you call json.dump(profile, f). That does not generate valid JSON, since the objects are not embedded in an enclosing structure and nothing separates them.
E.g. {}{} instead of [{},{}]
As for a solution - with a file of this size, holding everything in memory while reading or writing is a bad idea.
I would probably try the library https://pypi.org/project/jsonstreams/ or something like this.
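To see the failure concretely, and to salvage text that was already written as {}{}, the standard library alone may be enough. The sketch below reproduces the error on a tiny string and then splits the concatenated objects with json.JSONDecoder.raw_decode. The helper name iter_concatenated is made up for illustration, and the approach assumes the text fits in memory, so for a 9GB file it would only suit one line or one chunk at a time.

```python
import json

def iter_concatenated(text):
    """Yield each top-level JSON value from back-to-back JSON text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between values; raw_decode does not skip it itself.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

broken = '{"industry": "Medical Devices"}{"industry": "Pharmaceuticals"}'

try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    error_message = exc.msg  # the same "Extra data" error as in the question

profiles = list(iter_concatenated(broken))
print(error_message, len(profiles))
```

Writing one object per line in the first place avoids the need for this kind of recovery.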

Related

How to loop through sibling tags while scraping data

I am trying to scrape editor data from this page using the Python Scrapy framework.
The problem I am facing is every tag is a sibling tag and the editor role is inside h3 tags and names are inside div tags. All these are inside a div tag with id "editors-section". I can loop through each div tag like
response.css("#editors-section>div.row.align-items-center")
and collect editor name and organization,
but how do I collect their respective roles? How do I loop through all the tags? Thanks.
You can use relative XPath expressions with the following-sibling axis. By checking each sibling's root.tag attribute to detect the next role header, you can accurately determine each person's role.
For example:
for header in response.xpath("//h2"):
    role = header.xpath("./text()").get()
    for sibling in header.xpath("./following-sibling::*"):
        # Stop at the next role header.
        if sibling.root.tag == "h2":
            break
        name = sibling.xpath(".//h3/*/text()").get()
        location = sibling.xpath(".//p[@class='mb-2']/text()").get()
        if name and location:
            yield {
                "role": role.strip(),
                "name": name.strip(),
                "location": location.strip()
            }
OUTPUT
[
{
"role": "Editors-in-Chief",
"name": "Hua Wang",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Editors-in-Chief",
"name": "Gabriele Morra",
"location": "University of Louisiana at Lafayette, USA"
},
{
"role": "Board Members",
"name": "Luca Caricchi",
"location": "University of Geneva, Switzerland"
},
{
"role": "Board Members",
"name": "Michael Fehler",
"location": "Massachusetts Institute of Technology, USA"
},
{
"role": "Board Members",
"name": "Peter Gerstoft",
"location": "Scripps Institution of Oceanography, USA"
},
{
"role": "Board Members",
"name": "Forrest M. Hoffman",
"location": "Oak Ridge National Laboratory, United States of America"
},
{
"role": "Board Members",
"name": "Xiangyun Hu",
"location": "China University of Geosciences, China"
},
{
"role": "Board Members",
"name": "Guangmin Hu",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Board Members",
"name": "Qingkai Kong",
"location": "UC Berkeley, USA"
},
{
"role": "Board Members",
"name": "Yuemin Li",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Board Members",
"name": "Hongjun Lin",
"location": "Zhejiang Normal University, China"
},
{
"role": "Board Members",
"name": "Aldo Lipani",
"location": "University College London, United Kingdom"
},
{
"role": "Board Members",
"name": "Zhigang Peng",
"location": "Georgia Institute of Technology, USA"
},
{
"role": "Board Members",
"name": "Piero Poli",
"location": "Grenoble Alpes University, France"
},
{
"role": "Board Members",
"name": "Kunfeng Qiu",
"location": "China University of Geoscience, China"
},
{
"role": "Board Members",
"name": "Calogero Schillaci",
"location": "JRC European Commission, Italy"
},
{
"role": "Board Members",
"name": "Hosein Shahnas",
"location": "University of Toronto, Canada"
},
{
"role": "Board Members",
"name": "Byung-Dal So",
"location": "Kangwon National University, South Korea"
},
{
"role": "Board Members",
"name": "Rui Wang",
"location": "China University of Geoscience, China"
},
{
"role": "Board Members",
"name": "Yong Wang",
"location": "East Carolina University, USA"
},
{
"role": "Board Members",
"name": "Zhiguo Wang",
"location": "Xi'an Jiaotong University, China"
},
{
"role": "Board Members",
"name": "Jun Xia",
"location": "Wuhan University, China"
},
{
"role": "Board Members",
"name": "Lizhi Xiao",
"location": "China University of Petroleum(Beijing), China"
},
{
"role": "Board Members",
"name": "Chicheng Xu",
"location": "Aramco Services Company, USA"
},
{
"role": "Board Members",
"name": "Zhibing Yang",
"location": "Wuhan University, China"
},
{
"role": "Board Members",
"name": "Nana Yoshimitsu",
"location": "Kyoto University, Japan"
},
{
"role": "Board Members",
"name": "Hongyan Zhang",
"location": "Wuhan University, China"
}
]
This gives the same result with a slightly different approach (and a single for loop). I find each h3 element (the name) and get its role (the nearest h2 element above it) using the preceding XPath axis:
def parse(self, response):
    for h3_node in response.xpath('//div[@class="container"]//h3'):
        role = h3_node.xpath('normalize-space(./preceding::h2[1])').get()
        name = h3_node.xpath('normalize-space(.)').get()
        location = h3_node.xpath("normalize-space(./following-sibling::p[1])").get()
        if name and location:
            yield {
                "role": role,
                "name": name,
                "location": location,
            }
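Both answers rely on the same idea: each entry belongs to the nearest h2 header above it. The sketch below illustrates that grouping logic without Scrapy, using only the standard library's xml.etree.ElementTree on a small, well-formed fragment. The markup is hypothetical (the real page structure may differ, for instance in which element wraps the name), and the function name group_by_header is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: headers and rows are siblings inside one container.
HTML = """
<div id="editors-section">
  <h2>Editors-in-Chief</h2>
  <div class="row"><h3><span>Hua Wang</span></h3><p class="mb-2">University of Electronic Science and Technology of China, China</p></div>
  <h2>Board Members</h2>
  <div class="row"><h3><span>Luca Caricchi</span></h3><p class="mb-2">University of Geneva, Switzerland</p></div>
  <div class="row"><h3><span>Michael Fehler</span></h3><p class="mb-2">Massachusetts Institute of Technology, USA</p></div>
</div>
"""

def group_by_header(section):
    """Walk the siblings in document order, remembering the latest h2 as the role."""
    role = None
    rows = []
    for child in section:
        if child.tag == "h2":
            role = (child.text or "").strip()
            continue
        name = child.findtext(".//h3/span")
        location = child.findtext("./p[@class='mb-2']")
        if role and name and location:
            rows.append({"role": role, "name": name.strip(), "location": location.strip()})
    return rows

editors = group_by_header(ET.fromstring(HTML.strip()))
for editor in editors:
    print(editor)
```

ElementTree only accepts well-formed XML, so this is a way to reason about the logic rather than a replacement for Scrapy's selectors on real pages.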

Modify existing json to create new custom one python

I'm trying to trim unused data in a JSON document to create a new one with only two fields: title and description. The title works great, but I can't figure out how to get the description field. The JSON is public and you can get it here or at the end of the post.
My code that extracts title field:
import requests
import json
def trim_json(d):
    newd = {}
    for name in ['title']:
        newd[name] = d[name]
    return newd
def clean():
    books = requests.get('https://openlibrary.org/authors/OL23919A/works.json')
    books_parsed = books.json()
    book_data = books_parsed['entries']
    book_data = [trim_json(d) for d in book_data]
    print(book_data)
    return book_data
update
clean function returns list of dicts in this format:
[{'title': 'Harry Potter House Gryffindor Edition Series 1-5 Books Collection Set By J.K. Rowling'}]
What I want to get is:
[{'title': 'Harry Potter House Gryffindor Edition Series 1-5 Books Collection Set By J.K. Rowling', 'description': 'lorem ipsum'}]
and if there is no description:
[{'title': 'Harry Potter House Gryffindor Edition Series 1-5 Books Collection Set By J.K. Rowling', 'description': 'undefind'}]
How can I get json that returns title & description?
{
"type": {
"key": "/type/work"
},
"title": "Journey to Hogwarts",
"authors": [
{
"type": {
"key": "/type/author_role"
},
"author": {
"key": "/authors/OL23919A"
}
}
],
"covers": [
2520429
],
"key": "/works/OL28602152W",
"latest_revision": 1,
"revision": 1,
"created": {
"type": "/type/datetime",
"value": "2022-08-05T00:16:59.602176"
},
"last_modified": {
"type": "/type/datetime",
"value": "2022-08-05T00:16:59.602176"
}
},
{
"description": "Harry Potter #2\r\n\r\nThroughout the summer holidays after his first year at Hogwarts School of Witchcraft and Wizardry, Harry Potter has been receiving sinister warnings from a house-elf called Dobby.\r\n\r\nNow, back at school to start his second year, Harry hears unintelligible whispers echoing through the corridors.\r\n\r\nBefore long the attacks begin: students are found as if turned to stone.\r\n\r\nDobby’s predictions seem to be coming true.\r\n\r\n[Source][1]\r\n\r\n\r\n [1]: https://www.jkrowling.com/book/harry-potter-chamber-secrets/",
"links": [
{
"title": "Author's book page",
"url": "https://www.jkrowling.com/book/harry-potter-chamber-secrets/",
"type": {
"key": "/type/link"
}
},
{
"url": "https://en.wikipedia.org/wiki/Harry_Potter_and_the_Chamber_of_Secrets",
"title": "Wikipedia entry",
"type": {
"key": "/type/link"
}
},
{
"title": "Harry Potter and the Chamber of Secrets by J.K. Rowling - review | Children's books | The Guardian",
"url": "https://www.theguardian.com/childrens-books-site/2015/mar/02/review-j-k-rowling-harry-potter-chamber-secrets",
"type": {
"key": "/type/link"
}
},
{
"url": "https://www.theguardian.com/childrens-books-site/2016/may/26/harry-potter-and-the-chamber-of-secrets-jk-rowling-review",
"title": "Harry Potter and the Chamber of Secrets by J.K. Rowling - review 2 | Children's books | The Guardian",
"type": {
"key": "/type/link"
}
}
],
"title": "Harry Potter and the Chamber of Secrets",
"covers": [
8234423,
8237628,
8237644,
8392798,
8995302,
8762432,
8081272,
8353396,
10301720,
8938317,
10471286,
10413455,
10487260,
-1,
10535729,
10722535,
10722534,
11522289,
12347254,
12581306,
12606939,
10536577,
11540339,
12023623
],
"subject_places": [
"England",
"London",
"Hogwarts School of Witchcraft and Wizardry",
"Inglaterra",
"Privet Drive"
],
"subjects": [
"Fantasy fiction",
"school stories",
"Fiction",
"Fantasy",
"Nestlé Smarties Book Prize winner",
"Juvenile fiction",
"Wizards",
"Magic",
"Schools",
"Spanish language materials",
"Magia",
"Escuelas",
"Ficción juvenil",
"Novela fantástica",
"Hogwarts School of Witchcraft and Wizardry (Imaginary place)",
"Harry Potter (Fictitious character)",
"Wizards -- Juvenile fiction",
"Witches",
"Hogwarts School of Witchcraft and Wizardry (Imaginary organization)",
"Magos",
"Translations from English",
"Chinese fiction",
"Orphans",
"Aunts",
"Uncles",
"Cousins",
"Determination (Personality trait) in children",
"Friendship",
"Potter, Harry (Fictitious character)",
"Witches Fiction",
"Wizards Fiction",
"Schools Fiction",
"England Fiction",
"Magic -- Juvenile fiction",
"Hogwarts School of Witchcraft and Wizardry (Imaginary place) -- Juvenile fiction",
"Schools -- Juvenile fiction",
"Wizards -- Fiction",
"Magic -- Fiction",
"Schools -- Fiction",
"England -- Juvenile fiction",
"England -- Fiction",
"Fantasy & Magic",
"Action & Adventure",
"Witchcraft",
"Harry Potter (Fictional character)",
"Engels",
"Social Themes",
"Reading Level-Grade 11",
"Reading Level-Grade 12",
"Schools, fiction",
"England, fiction",
"Potter, harry (fictitious character), fiction",
"Hogwarts school of witchcraft and wizardry (imaginary organization), fiction",
"Wizards, fiction",
"Magic, fiction",
"Children's fiction",
"Adventure and adventurers, fiction",
"English literature",
"Fiction, fantasy, general",
"Large type books",
"Hermione Granger (Fictitious character)",
"Ron Weasley (Fictitious character)",
"Latin language materials",
"Children's stories",
"Magiciens",
"Romans, nouvelles, etc. pour la jeunesse",
"Nécromancie",
"Écoles",
"Potter, Harry (Personnage fictif)",
"Romans, nouvelles",
"Magie",
"Family",
"Orphans & Foster Homes",
"Magía",
"Novela juvenil",
"Juvenile",
"Children's stories, English",
"Sieg",
"Basilisk",
"Das Böse",
"Das Gute",
"Internat",
"Lebensgefahr",
"Lebensrettung",
"List",
"Magier",
"Jugendbuch",
"Kampf",
"Schule",
"Basilisk (Fabeltier)",
"Junge",
"Phönix",
"Deutschland Grenzschutzkommando Mitte Schule",
"Deutschland",
"Friendship, fiction",
"Hogwartes School of Witchcraft and Wizardry (Imaginary place)",
"General",
"Social Issues",
"Witches, fiction"
],
"subject_people": [
"Harry Potter",
"Hermione Granger",
"Ron Weasley",
"Albus Dumbledore",
"Hagrid",
"The Dursleys",
"Gilderoy Lockhart",
"Dobby",
"Moaning Myrtle",
"Ginny Weasley",
"Draco Malfoy",
"Hermine Granger",
"Ron Weasly",
"Harry Potter (Fictitious character)"
],
"key": "/works/OL82537W",
"authors": [
{
"author": {
"key": "/authors/OL23919A"
},
"type": {
"key": "/type/author_role"
}
}
],
"excerpts": [
{
"excerpt": "Not for the first time, an argument had broken out over breakfast at number four, Privet Drive.",
"comment": "first sentence",
"author": {
"key": "/people/seabelis"
}
}
],
"type": {
"key": "/type/work"
},
"latest_revision": 80,
"revision": 80,
"created": {
"type": "/type/datetime",
"value": "2009-10-17T07:07:29.461716"
},
"last_modified": {
"type": "/type/datetime",
"value": "2022-06-22T07:57:49.863271"
}
},
Not all entries have a title and a description field, so you need try...except clauses to avoid KeyErrors.
def trim_json(d):
    newd = {}
    try:
        newd["title"] = d["title"]
    except KeyError:
        pass
    try:
        newd["description"] = d["description"]
    except KeyError:
        pass
    return newd
Or, more elegantly, you could filter with a dictionary comprehension:
key_filter = ['title', 'description']
cleaned_data = [{k:d[k] for k in key_filter if k in d} for d in book_data]
And since the first element in the entries list is not book data (it has neither a title nor a description key), you should start the list comprehension after the first element:
import requests

def clean():
    books = requests.get('https://openlibrary.org/authors/OL23919A/works.json')
    books_parsed = books.json()
    book_data = books_parsed['entries']
    # skip the first entry, which is not a book
    cleaned_data = [trim_json(d) for d in book_data[1:]]
    # return the trimmed records, not the raw entries
    return cleaned_data
This avoids producing an empty dictionary that corresponds to no book.
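The comprehension approach can be tried offline with a small sketch like the one below. The `entries` list is made-up sample data standing in for `books_parsed['entries']`, and `keep_fields` is a hypothetical helper name, not part of any library.

```python
# Hypothetical sample records standing in for books_parsed['entries'].
entries = [
    {"key": "/works/OL1W", "title": "Book A", "description": "First book"},
    {"key": "/works/OL2W", "title": "Book B"},   # no description
    {"key": "/works/OL3W"},                      # neither field
]

key_filter = ["title", "description"]


def keep_fields(record, wanted):
    """Return a new dict with only the wanted keys that exist in record."""
    return {k: record[k] for k in wanted if k in record}


cleaned = [keep_fields(d, key_filter) for d in entries]

# Optionally drop records that ended up empty.
non_empty = [d for d in cleaned if d]
```

Because the `if k in record` test runs before the lookup, no KeyError can occur, and records missing both fields simply become empty dictionaries that can be filtered out afterwards.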
Use the json library. It ships with Python by default.
Say your JSON string is stored in a variable called json_str; then you can run:
import json
info = json.loads(json_str)
title = info['title']
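Since some records may lack a key, a variant using `dict.get` may be safer than indexing. This is only a sketch; the contents of `json_str` here are invented for illustration.

```python
import json

# Hypothetical JSON string standing in for json_str.
json_str = '{"title": "Example Book", "revision": 80}'

info = json.loads(json_str)

title = info.get("title")              # the value, because the key exists
description = info.get("description")  # None, because the key is absent
```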

How to reformat a corrupt json file with escaped ' and "?

Problem
I have a large JSON file (~700.000 lines, 1.2GB filesize) containing Twitter data that I need to preprocess for data and network analysis.
During data collection an error happened: ' was used as the string delimiter instead of ". As this does not conform to the JSON standard, the file cannot be processed by R or Python.
Information about the dataset:
Roughly every 500 lines there is a line with meta info plus meta information for the users, etc.; after that come the tweets in JSON (order of fields not stable), each starting with a space, one tweet per line.
This is what I tried so far:
A simple data.replace('\'', '\"') is not possible, as the "text" fields contain tweets which may contain ' or " themselves.
Using regex, I was able to catch some of the instances, but it does not catch everything:
re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
Using literal_eval(data) from the ast package also throws an error.
As the order of the fields and the length of each field are not stable, I am stuck on how to reformat the file so that it conforms to JSON.
Normal sample line of the data (for this line, options one and two would work, but note that the tweets are also in non-English languages, which use " or ' in their tweets):
{'author_id': '1236888827605725186', 'entities': {'mentions': [{'start': 108, 'end': 124, 'username': 'realDonaldTrump'}], 'hashtags': [{'start': 49, 'end': 55, 'tag': 'QAnon'}, {'start': 56, 'end': 66, 'tag': 'ProudBoys'}]}, 'context_annotations': [{'domain': {'id': '10', 'name': 'Person', 'description': 'Named people in the world like Nelson Mandela'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}, {'domain': {'id': '35', 'name': 'Politician', 'description': 'Politicians in the world, like Joe Biden'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}], 'text': 'RT #NinjaHodon: Here’s an example of the average #QAnon #ProudBoys crackass trash that’s going to vote for #realDonaldTrump. \n\n https://t.…', 'referenced_tweets': [{'type': 'retweeted', 'id': '1315363137240010753'}], 'conversation_id': '1315441338427506689', 'id': '1315441338427506689', 'lang': 'en', 'public_metrics': {'retweet_count': 20, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}, 'created_at': '20201011T23:57:09.000Z', 'source': 'Twitter for Android', 'possibly_sensitive': False}
Reformatted sample line which causes an issue:
{"users": [{"id": "437781219", "username": "HakesJon", "location": `"Wisconsin", "description": "#IndieFictionWriter. Husband. Father. Bearded.\n#BlackLivesMatter #DemilitarizeThePolice #DismantlePolicing", "name": "Jon Hakes", "created_at": "20111215T20:42:41.000Z"}, {"id": "1171947445841997824", "username": "FactNc", "location": "Under Carolina blue skies ", "description": "Defender of truth, justice and the American way. "I never give them hell. I just tell the truth and they think it\'s hell." Harry S. Truman", "name": "NCFactFinder", "created_at": "20190912T00:44:21.000Z"}, {"id": "315041625", "username": "o0rimbuk0o", "description": "Your desire to put pronouns here is not my issue. Get help.\n\n#resist #notmypresident\n#FBiden", "name": "Sick of it", "created_at": "20110611T06:16:11.000Z"}, {"id": "3141427487", "username": "theGeekSheek", "description": "I don't believe in your God. Don't tell me he hates me.", "name": "Chic Geek", "created_at": "20150406T18:34:45.000Z"}, {"id": "1084112678", "username": "KarinBorjeesson", "description": "Love to help people & animals in need. Love music. Fucking hate racists. 
#Anon #OpExposeCPS #BLM #FreePalestine #Yemen #OpSerenaShim #Animalrights #NoDAPL", "name": "AnonyMISSKarin", "created_at": "20130112T20:57:28.000Z"}, {"id": "1003712866011308033", "username": "persian_pesar", "description": "\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200fبه ستواری و سختی رشک پولاد/\nبه راه عشق سرها داده بر باد/\nقرین بیستون هم\u200cسنگ فرهاد/\nز کرمانشاهیان یاد اینچنین باد\n\u200e#Civil_Environment_Engineer", "name": "persianpesar🏳\u200d🌈", "created_at": "20180604T18:59:30.000Z"}, {"id": "814795859644809217", "username": "Aazadist", "description": "\u200f\u200e#Equality🌐\n\u200e#Humanity🌐\nخواهی نشوی همرنگ ، رسوای جماعت شو", "name": "Aazad 🏳️\u200d🌈 آزاد", "created_at": "20161230T11:30:45.000Z"}, {"id": "790375699638915072", "username": "Isaihstewart", "location": "Los Angeles, CA", "description": "Part time assistant manager at “Sheets and Things”", "name": "Dey got the henessey 🗣", "created_at": "20161024T02:13:46.000Z"}, {"id": "4846243708", "username": "williamvercetti", "location": "Virginia Beach, VA", "description": "vma. art. modelo papi. tpain to the dms.", "name": "William Vercetti", "created_at": "20160125T17:21:50.000Z"}, {"id": "1160723882", "username": "k_cawsey", "location": "Halifax, Nova Scotia", "description": "Chaucer, Malory, Arthur Tolkien. #Dal_English", "name": "Dr. Kathy Cawsey", "created_at": "20130208T17:15:30.000Z"}, {"id": "3789298943", "username": "solomonesther17", "location": "Lagos, Nigeria", "description": "FairBib Legal Practitioners", "name": "Esther Solomon", "created_at": "20150927T04:52:29.000Z"}, {"id": "14860380", "username": "Dejify", "location": "San Francisco", "description": "The Nigerian State is a festering boil that the world can't afford to ignore. 
Because, when it pops, its rancid ooze won't be pleasant nor easy to contain.", "name": "Buhari: Uber Ment (Dèjì Akọ́mọláfẹ́)", "created_at": "20080521T18:57:27.000Z"}, {"id": "1120883223070773248", "username": "Donna780780", "description": "", "name": "Donna Swidley", "created_at": "20190424T02:52:40.000Z"}, {"id": "1253742908487929858", "username": "Neros_sis", "location": "Florida", "description": "", "name": "#Nero's Fiddle GOP has a terrorism problem", "created_at": "20200424T17:50:00.000Z"}, {"id": "585090491", "username": "vickierae562", "location": "The LBC", "description": "That’s Right, I’m a Lefty 🤣 and I don’t feed trolls! #resist #DumpTrump #DitchMitch #LooseLindsey", "name": "Vickie Rae", "created_at": "20120519T21:00:28.000Z"}, {"id": "1262122532607574022", "username": "EmilySi49944255", "description": "", "name": "Skylar Aubrey", "created_at": "20200517T20:47:34.000Z"}, {"id": "1401663176", "username": "mdeHummelchen", "location": "Tief im Westen", "description": "Pflegewissenschaftlerin,Pflegeberaterin,Dozentin,Lächeln und winken...Pro Pflegekammer", "name": "Madame Hummelchen 💙", "created_at": "20130504T07:44:32.000Z"}, {"id": "2381808114", "username": "mommy97giraffe", "location": "Antifa HQs/Mom Division Office", "description": "Follower of Jesus, Mennonite mom&wife, lover of books, world, peo, poetry&art. 
6 autoimmunes&fibro🥄ie Proud Mama Bear of 1gayD & 1pan&autistic son, in 20s🌈💖", "name": "Mennonite Mom(she/her)", "created_at": "20140310T08:51:02.000Z"}, {"id": "2362182011", "username": "rd2glry", "location": "Washington, DC", "description": "", "name": "ateachr", "created_at": "20140224T04:07:21.000Z"}, {"id": "974917494870700032", "username": "GiraffeOld", "location": "Arizona, USA", "description": "", "name": "old man giraffe", "created_at": "20180317T07:56:58.000Z"}, {"id": "830939480", "username": "redz041", "description": "", "name": "Jan Mouzone", "created_at": "20120918T12:18:36.000Z"}, {"id": "3346032292", "username": "kumccaig44", "description": "", "name": "Katrine McCaig", "created_at": "20150625T21:25:21.000Z"}, {"id": "80630279", "username": "LuluTheCalm", "location": "Green Grass & Puddles, Canada", "description": "Mischief in My Eyes & Adventure in My Soul. \nLet's Have a Laugh &, you know, Make the World a Better Place.😎 \nAus/Brit/Cdn🇨🇦", "name": "Lulu 🇨🇦#BeKindBeCalmBeSafe💞 😷 🎏", "created_at": "20091007T17:26:56.000Z"}, {"id": "3252437864", "username": "engelhardterin", "location": "Houston, TX || Lubbock, TX", "description": "24 || Texas Tech || ♀️ || she/her", "name": "Erin Engelhardt", "created_at": "20150622T07:26:28.000Z"}, {"id": "93797267", "username": "mcbeaz", "location": "he/him", "description": "black lives matter.", "name": "mike", "created_at": "20091201T05:28:58.000Z"}, {"id": "2585773107", "username": "michiganington", "location": "Washington, D.C. ", "description": "", "name": "Allyoop", "created_at": "20140606T02:12:33.000Z"}, {"id": "27857135", "username": "JackRayher", "location": "Northport, NY", "description": "Senior Marketing Executive\nLifelong Democrat\n#BidenHarris", "name": "Jack Rayher", "created_at": "20090331T12:12:03.000Z"}, {"id": "1078457644736827392", "username": "RobertCooper58", "description": "Bilingual community advocate. Father of five wonderful kids. 
Lifelong progressive and proud member of #TheDemCoalition. Early supporter of President #JoeBiden.", "name": "Robert Cooper 🌊", "created_at": "20181228T01:08:34.000Z"}, {"id": "206860139", "username": "MariaArtze", "location": "Münster, Deutschland", "description": "Nas trincheiras da ESO\nEmigrante a medio retornar. Womansplainer.\n(Sie vostede)\n\nTrans rights are human rights.", "name": "A Malvada Profe mediovacinada", "created_at": "20101023T22:27:26.000Z"}, {"id": "2903906123", "username": "lm1067", "location": "London, England", "description": "B A FINE ARTIST GRADUATED", "name": "Luis Pais", "created_at": "20141203T15:53:10.000Z"}, {"id": "64119853", "username": "IAM_SHAKESPEARE", "location": "Tweeting from the Grave", "description": "This bot has tweeted the complete works of Shakespeare (in order) 5 times over the last 12years. On hiatus for a bit. Created by #strebel", "name": "Willy Shakes", "created_at": "20090809T05:41:08.000Z"}, {"id": "3176623941", "username": "acastellich", "location": "Chicago, Il.", "description": "Abogado,Restaurantero,Immigrant , UVM. AD1 IPADE MBA. Restaurant Hospitality Industry, Chicago IL.", "name": "Alejandro Castelli", "created_at": "20150417T13:23:17.000Z"}, {"id": "782765390925533185", "username": "Diane_L_Espo", "location": "Florida, USA", "description": "", "name": "DianeEspo 🇺🇲🗽", "created_at": "20161003T02:13:07.000Z"}, {"id": "67471020", "username": "thedcma", "location": "Fort Lauderdale, FL", "description": "🖤💎 Style is the only substance I abuse.💎🖤 I’m just a 🌈 Gay 🐔Hillbilly 🔮Warlock 🛵 Riding a 👨🏻\u200d🎤Vaporwave Fever Dream #blacklivesmatter", "name": "Grace Kelly on Steiroids", "created_at": "20090821T00:32:37.000Z"}, {"id": "78797635", "username": "graciosodiablo", "description": "Too much of a good thing can be bad. So too little of a bad thing must be good. 
160 characters or less of me should be perfect.", "name": "gracioso diabloint", "created_at": "20091001T03:59:16.000Z"}, {"id": "268314713", "username": "philppedurand", "location": "Auxerre", "description": "Je suis une personne gentille je milite pour la PMA. je suis militant communiste je suis aussi à l’association des Rosoirs je suis conseillé quartier", "name": "Philippe durand", "created_at": "20110318T14:37:36.000Z"}, {"id": "37996028", "username": "nicrawhide", "location": "Pinconning Michigan ", "description": "Just your average small town gay with big town sensibility!!", "name": "Nicholas Bean", "created_at": "20090505T19:20:37.000Z"}, {"id": "1236656342674407427", "username": "LadyJayPersists", "location": "Valhalla", "description": "USN Veteran | Shieldmaiden | Mom | Not here for a man, I have one | PTSD Warrior | My mind is a beautiful servant to a dangerous master", "name": "Jax", "created_at": "20200308T14:13:48.000Z"}, {"id": "171183306", "username": "dawndawnB", "location": "United States", "description": "Mrs. B, mother of 2 amazing kids, Substance Abuse Counselor, Volunteer, Music Lover. Born in DC but a VA Lo❤️er!", "name": "nwad", "created_at": "20100726T19:21:24.000Z"}, {"id": "817247846751555587", "username": "me2020_2021", "location": "Brisbane, Queensland", "description": "Proud Aussie, living a wonderful life with my wife, Australian Cricket 🏏👏,😷 🍺🥃 \U0001f9ae Alex", "name": "👀🏳️\u200d🌈 "A girl has no Name”', 'created_at': '20170106T05:54:05.000Z'}, {'id': '879459933988585472', 'username': 'Davecl3069', 'location': 'San Francisco Bay Area', 'description': 'proud of my views, life long learner,& hopefully, that guy!\n#LowerTheFlagForCovidVictims #VoteBlue #BLM #SupportThePlayers #LGBTQ #WeNeedToDoBetter #ResistStill', 'name': 'David', 'created_at': '20170626T22:02:42.000Z'}`
Code used
Regex:
def convert_to_json(file):
    with open(file, "r", encoding="utf-8") as f:
        x = f.read()
        x = x.replace("-", "")
        rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
        decoded = rx.sub('"', x)
literal_eval:
def open_json():
    with open("data.json", "r", encoding="utf-8") as f:
        f.read()
        data = literal_eval(f)
        data = json.loads(str(data))
What I would like to achieve
Reformat the data to conform to JSON (this question) in order to be able to
Build a dataframe with the relevant tweettext, user information and metadata (secondary goal) to be used in further analyses.
Thanks in advance for any suggestions! :)
If the ' characters causing the problem occur only in the tweets and descriptions, you could try this:
pre_tweet = "'text': '"
post_tweet = "', 'referenced_tweets':"
with open(file, encoding="utf-8") as f:
    data = f.readlines()
output = []
errors = []
for line in data:
    if pre_tweet in line and post_tweet in line:
        first_part, rest = line.split(pre_tweet, 1)
        tweet, last_part = rest.split(post_tweet, 1)
        # build new strings; overwriting pre_tweet/post_tweet would break the next iteration
        head = first_part.replace('\'', '\"') + pre_tweet.replace('\'', '\"')
        tail = post_tweet.replace('\'', '\"') + last_part.replace('\'', '\"')
        output.append(head + tweet + tail)
    else:
        errors.append(line)
If errors is not empty, either the line contains no tweet (you can change the code slightly to add such lines to your output), or what follows the tweet is not 'referenced_tweets'. In the second case, work out which keys can follow the tweet and modify the code above to try multiple post_tweet markers.
You can then do the same for the description by replacing pre_tweet and post_tweet with whatever usually comes before and after the description.
The number of possible keys after the tweet or description is finite, so it may take some time to find all the possibilities, but in the end you should succeed.
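Trying several candidate markers could look roughly like the sketch below. The marker list, the function name `requote_line`, and the sample line are all assumptions made for illustration. The function only swaps quotes outside the tweet text, so double quotes or backslash escapes inside the tweet, and Python literals such as False, would still need separate handling.

```python
import json

PRE_MARKER = "'text': '"
# Hypothetical list of keys that may follow the tweet text; extend as needed.
POST_MARKERS = ["', 'referenced_tweets':", "', 'conversation_id':", "', 'lang':"]


def swap_quotes(s):
    """Replace every single quote with a double quote."""
    return s.replace("'", '"')


def requote_line(line, pre=PRE_MARKER, posts=POST_MARKERS):
    """Swap ' for " everywhere except inside the tweet text.

    Returns None when no (pre, post) marker pair is found in the line.
    """
    if pre not in line:
        return None
    head, rest = line.split(pre, 1)
    # Among the markers present, use the one that occurs earliest after the tweet starts.
    hits = [(rest.find(p), p) for p in posts if p in rest]
    if not hits:
        return None
    pos, post = min(hits)
    tweet, tail = rest[:pos], rest[pos + len(post):]
    return swap_quotes(head) + swap_quotes(pre) + tweet + swap_quotes(post) + swap_quotes(tail)


# Made-up line in the same style as the corrupt data.
sample = "{'id': '1', 'text': 'it's fine', 'lang': 'en'}"
fixed = requote_line(sample)
parsed = json.loads(fixed)
```

Lines for which `requote_line` returns None can be collected in the errors list and inspected to find further markers.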
So I figured out a way to process the corrupt data.
The solution can be found here.
Using ast.literal_eval(input_string) lets me read in the corrupt JSON lines as dictionaries. The only thing to watch is that the input string contains no leading or trailing whitespace, commas, etc.
Example code for reading in data with ast.literal_eval():
from ast import literal_eval
dictlist = []
with open("inputdata.json", "r", encoding="utf-8") as f:
    for line in f:
        # use the line from the loop; calling f.readline() here would consume an extra line each iteration
        x = line.strip()
        if not x:
            continue
        data = literal_eval(x)
        dictlist.append(data)
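To get from the parsed dictionaries back to a file that conforms to JSON (the stated goal), each dictionary can be serialized with json.dumps, one object per line. The following is a sketch under that assumption; the function name and the sample lines are invented, and a line that is not a valid Python literal would still raise an error in literal_eval.

```python
import json
from ast import literal_eval


def python_literal_lines_to_json(lines):
    """Yield one JSON string per non-empty line holding a Python dict literal."""
    for raw in lines:
        # Remove surrounding whitespace and a trailing comma, if any.
        text = raw.strip().rstrip(",")
        if not text:
            continue
        # ensure_ascii=False keeps non-ASCII tweet text readable in the output.
        yield json.dumps(literal_eval(text), ensure_ascii=False)


# Made-up lines in the style of the corrupt file.
sample = [
    " {'id': '1', 'text': \"it's fine\", 'possibly_sensitive': False}\n",
    "\n",
]
converted = list(python_literal_lines_to_json(sample))
```

Writing each yielded string followed by a newline to an output file gives a line-delimited JSON file that can be read back line by line with json.loads.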

Writing to JSON file, then reading this same file and getting "JSONDecodeError: Extra data"

I have a very large json file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields.
Each object is basically someone's user profile on a job searching website, but it comes with many unwanted key-value pairs that are not relevant to my analysis. There are about 3 million of these profiles.
I'd like to write each new profile/object to a json file, cleaned.json. Essentially this should be a copy of the original json file, except any of the key-value pairs not mentioned in fields have been removed from all 3 million profiles.
To do this, I wrote the following code:
# fields to keep
fields = ["skills", "industry", "summary", "education", "experience"]
with open('cleaned.json', 'w', encoding='UTF8') as f:
    for profile in open(path_to_file, encoding = 'UTF8'):
        profile = json.loads(profile)
        # remove unwanted fields from profile
        for key in list(profile.keys()):
            if key not in fields:
                del(profile[key])
        # write profile to new json file
        json.dump(profile, f)
To test whether it worked, I tried reading the json file in again, like so:
for foo in open('cleaned.json', encoding='UTF8'):
    foo = json.loads(foo)
    print(json.dumps(foo, indent=4))
But I'm getting this error: JSONDecodeError: Extra data on the foo = json.loads(foo) line.
I've tested this by only modifying 1 profile from the original json and writing this modified profile to cleaned.json, and cleaned.json looks like this (except it's all on one line, I've just pretty printed it for this post):
{
"skills": [
"Key Account Development",
"Strategic Planning",
"Market Planning",
"Team Leadership",
"Negotiation",
"Forecasting",
"Key Account Management",
"Sales Management",
"New Business Development",
"Business Planning",
"Cross-functional Team Leadership",
"Budgeting",
"Strategy Development",
"Business Strategy",
"Consultative Selling",
"Medical Devices",
"Customer Relations",
"Contract Negotiation",
"Mentoring",
"Coaching",
"Healthcare",
"Territory",
"Sales Process",
"Direct Sales",
"Sales Operations",
"Pharmaceutical Sales"
],
"industry": "Medical Devices",
"summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals."
}{
"education": [
{
"start": "2008",
"major": "Economics",
"end": "2008",
"name": "Columbia University - Columbia Business School",
"desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"
},
{
"start": "2007",
"end": "2007",
"name": "Columbia University - Columbia Business School"
},
{
"major": "Cancer genomics",
"end": "2001",
"name": "G\u00f6teborgs universitet",
"degree": "Ph.D.",
"start": "1996",
"desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""
},
{
"start": "1994",
"major": "Biology, Medicine;German Language",
"end": "1995",
"name": "Universit\u00e4t Regensburg",
"degree": "Cancer Research, Coursework"
},
{
"major": "Biology",
"end": "1994",
"name": "G\u00f6teborgs universitet",
"degree": "Master",
"start": "1989",
"desc": ""
},
{
"start": "1992",
"major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc",
"end": "1993",
"name": "The University of Georgia",
"desc": "Scholarship for one full year of Graduate Studies."
}
],
"skills": [
"Molecular Biology",
"Biomarkers"
],
"industry": "Pharmaceuticals",
"experience": [
{
"org": "Johnson and Johnson",
"title": "Senior Scientist, Oncology Biomarkers",
"end": "Present",
"start": "November 2009",
"desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."
},
{
"org": "Albert Einstein Medical Center",
"title": "Associate at Dept of Molecular Genetics",
"start": "September 2008",
"desc": "Single Cell Gene expression."
},
{
"org": "Columbia University",
"title": "Associate Research Scientist",
"start": "August 2006",
"desc": "Work on peptide to restore wt p53 function in cancer."
},
{
"org": "Memorial Sloan Kettering Cancer Center",
"title": "Post Doctoral Research Fellow",
"start": "January 2003",
"desc": "Molecular profiling of colorectal cancer."
},
{
"org": "Sahlgrenska University Hospital",
"title": "Research Scientist",
"start": "November 2001",
"desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."
}
],
"summary": "Ph.D. scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine."
}
So when I read this in, I'm getting the error. What am I doing wrong? I guess there is something wrong with the way I'm writing the profile to cleaned.json?
Sample input for testing
Sample input has 3 profiles.
{"_id": "in-00000001", "name": {"family_name": "Mazalu MBA", "given_name": "Dr Catalin"}, "locality": "United States", "skills": ["Key Account Development", "Strategic Planning", "Market Planning", "Team Leadership", "Negotiation", "Forecasting", "Key Account Management", "Sales Management", "New Business Development", "Business Planning", "Cross-functional Team Leadership", "Budgeting", "Strategy Development", "Business Strategy", "Consultative Selling", "Medical Devices", "Customer Relations", "Contract Negotiation", "Mentoring", "Coaching", "Healthcare", "Territory", "Sales Process", "Direct Sales", "Sales Operations", "Pharmaceutical Sales"], "industry": "Medical Devices", "summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. 
Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals.", "url": "http://www.linkedin.com/in/00000001", "also_view": [{"url": "http://www.linkedin.com/pub/krisa-drost/45/909/513", "id": "pub-krisa-drost-45-909-513"}, {"url": "http://ro.linkedin.com/pub/florin-ut/18/b33/77b", "id": "pub-florin-ut-18-b33-77b"}, {"url": "http://ro.linkedin.com/pub/cristian-radu/21/225/149", "id": "pub-cristian-radu-21-225-149"}, {"url": "http://ro.linkedin.com/pub/traian-rusu/16/652/279", "id": "pub-traian-rusu-16-652-279"}, {"url": "http://ro.linkedin.com/pub/dumitrescu-catalin/3/283/92", "id": "pub-dumitrescu-catalin-3-283-92"}, {"url": "http://www.linkedin.com/pub/jody-brelsford/9/21a/354", "id": "pub-jody-brelsford-9-21a-354"}, {"url": "http://www.linkedin.com/pub/mary-anne-dilloway/2/55a/18", "id": "pub-mary-anne-dilloway-2-55a-18"}, {"url": "http://ro.linkedin.com/pub/carmen-baleanu/2b/252/203", "id": "pub-carmen-baleanu-2b-252-203"}, {"url": "http://il.linkedin.com/in/shimonlobel", "id": "in-shimonlobel"}, {"url": "http://ro.linkedin.com/pub/monica-danilescu/19/36a/121", "id": "pub-monica-danilescu-19-36a-121"}]}
{"_id": "in-00001", "education": [{"start": "2008", "major": "Economics", "end": "2008", "name": "Columbia University - Columbia Business School", "desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"}, {"start": "2007", "end": "2007", "name": "Columbia University - Columbia Business School"}, {"major": "Cancer genomics", "end": "2001", "name": "G\u00f6teborgs universitet", "degree": "Ph.D.", "start": "1996", "desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""}, {"start": "1994", "major": "Biology, Medicine;German Language", "end": "1995", "name": "Universit\u00e4t Regensburg", "degree": "Cancer Research, Coursework"}, {"major": "Biology", "end": "1994", "name": "G\u00f6teborgs universitet", "degree": "Master", "start": "1989", "desc": ""}, {"start": "1992", "major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc", "end": "1993", "name": "The University of Georgia", "desc": "Scholarship for one full year of Graduate Studies."}], "group": {"affilition": ["ASMALLWORLD.net", "Biomarker Research & Executive Network", "Biomarker Society", "Biomarkers", "Biomarkers in Discovery, Development and the Clinic Network", "Biotechnology/Pharmaceuticals", "Circulating Tumor Cell (CTC) and Cancer Stem Cell Group", "Clinical Development Job Opportunities - Europe", "Epigenetics", "Molecular Diagnostics Professional Network", "Molecular Diagnostics for Cancer Drug Development Forum", "NYC Women in Biotech", "Oncology Drug Development (Premier Group For Cancer Drug Development)", "Oncology Pharma\u2122", "Personalized Medicine", "Personalized Oncology Medicine - Global Group", "Professionals in the Pharmaceutical and Biotech Industry", "Svenskar i New York", "Translational Medicine Alliance"]}, "name": {"family_name": "Forslund", "given_name": "Ann"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" 
style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nSenior Scientist, Oncology Biomarkers\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/johnson-&-johnson?trk=ppro_cprof\"><span class=\"org summary\">Johnson and Johnson</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nAssociate at Dept of Molecular Genetics\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/einstein-medical-center-philadelphia?trk=ppro_cprof\"><span class=\"org summary\">Albert Einstein Medical Center</span></a>\n</li>\n<li>\nAssociate Research Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/columbia-university?trk=ppro_cprof\"><span class=\"org summary\">Columbia University</span></a>\n</li>\n<li>\nPost Doctoral Research Fellow\n<span class=\"at\">at </span>\nMemorial Sloan Kettering Cancer Center\n</li>\n</ul><div class=\"showhide-block\" id=\"morepast\">\n<ul class=\"past\"><li>\nResearch Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/sahlgrenska-university-hospital?trk=ppro_cprof\"><span class=\"org summary\">Sahlgrenska University Hospital</span></a>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nColumbia University - Columbia Business School\n</li>\n<li>\nColumbia University - Columbia Business School\n</li>\n<li>\nG\u00f6teborgs universitet\n</li>\n</ul><div class=\"showhide-block\" id=\"moreedu\">\n<ul><li>\n<div 
name=\"education\">\nUniversit\u00e4t Regensburg\n</div>\n</li>\n<li>\n<div name=\"education\">\nG\u00f6teborgs universitet\n</div>\n</li>\n<li>\n<div name=\"education\">\nThe University of Georgia\n</div>\n</li>\n</ul><p class=\"seeall showhide-link\">see less</p>\n</div>\n<p class=\"seeall showhide-link\">see all</p>\n</dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>244</strong> connections\n</p>\n</dd>\n</dl>", "locality": "Antwerp Area, Belgium", "skills": ["Molecular Biology", "Biomarkers"], "industry": "Pharmaceuticals", "interval": 20, "experience": [{"org": "Johnson and Johnson", "title": "Senior Scientist, Oncology Biomarkers", "end": "Present", "start": "November 2009", "desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."}, {"org": "Albert Einstein Medical Center", "title": "Associate at Dept of Molecular Genetics", "start": "September 2008", "desc": "Single Cell Gene expression."}, {"org": "Columbia University", "title": "Associate Research Scientist", "start": "August 2006", "desc": "Work on peptide to restore wt p53 function in cancer."}, {"org": "Memorial Sloan Kettering Cancer Center", "title": "Post Doctoral Research Fellow", "start": "January 2003", "desc": "Molecular profiling of colorectal cancer."}, {"org": "Sahlgrenska University Hospital", "title": "Research Scientist", "start": "November 2001", "desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."}], "summary": "Ph.D. 
scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine.", "url": "http://be.linkedin.com/in/00001", "also_view": [{"url": "http://www.linkedin.com/pub/peter-king/4/993/a16", "id": "pub-peter-king-4-993-a16"}, {"url": "http://www.linkedin.com/pub/hans-winkler/1/1ab/78a", "id": "pub-hans-winkler-1-1ab-78a"}, {"url": "http://de.linkedin.com/pub/michael-koslowski/26/964/99b", "id": "pub-michael-koslowski-26-964-99b"}, {"url": "http://de.linkedin.com/pub/werner-seiz/b/14/436", "id": "pub-werner-seiz-b-14-436"}, {"url": "http://de.linkedin.com/pub/miro-venturi/7/725/217", "id": "pub-miro-venturi-7-725-217"}, {"url": "http://ch.linkedin.com/pub/lisa-d-amato/3/808/267", "id": "pub-lisa-d-amato-3-808-267"}, {"url": "http://www.linkedin.com/pub/june-kaplow-ph-d/2/382/924", "id": "pub-june-kaplow-ph-d-2-382-924"}, {"url": "http://fr.linkedin.com/pub/fabien-schmidlin/b/b73/4b2", "id": "pub-fabien-schmidlin-b-b73-4b2"}, {"url": "http://be.linkedin.com/pub/tine-casneuf/2/563/884", "id": "pub-tine-casneuf-2-563-884"}, {"url": "http://be.linkedin.com/pub/jeroen-aerssens/0/b9a/6ba", "id": "pub-jeroen-aerssens-0-b9a-6ba"}], "specilities": "Biomarkers in Oncology, Cancer Genomics, Molecular Profiling of Cancer, Translational Cancer Research, Early Development Drug Discovery", "events": [{"from": "Sahlgrenska University Hospital", "to": "Memorial Sloan Kettering Cancer Center", "title1": "Research Scientist", "start": 24022, "title2": "Post Doctoral Research Fellow", "end": 24036}, {"from": "Memorial Sloan Kettering Cancer Center", "to": "Columbia University", "title1": "Post Doctoral Research Fellow", "start": 24036, "title2": "Associate Research Scientist", "end": 24079}, {"from": "Columbia University", "to": "Albert Einstein Medical Center", "title1": "Associate Research Scientist", "start": 24079, "title2": "Associate at Dept of Molecular Genetics", "end": 24104}, {"from": "Albert 
Einstein Medical Center", "to": "Johnson and Johnson", "title1": "Associate at Dept of Molecular Genetics", "start": 24104, "title2": "Senior Scientist, Oncology Biomarkers", "end": 24118}]}
{"_id": "in-00006", "interests": "personal genomics, nanotechnology", "education": [{"major": "Biophysics", "end": "2009", "name": "Harvard University", "degree": "Ph.D", "start": "2004", "desc": ""}, {"major": "Computer Science", "end": "2003", "name": "Yale University", "degree": "B.S.", "start": "1999", "desc": ""}], "name": {"family_name": "Douglas", "given_name": "Shawn"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nAssistant Professor\n<span class=\"at\">at </span>\nUCSF\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nTechnology Development Fellow\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/wyss-institute-for-biologically-inspired-engineering?trk=ppro_cprof\"><span class=\"org summary\">Wyss Institute for Biologically Inspired Engineering</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nHarvard University\n</li>\n<li>\nYale University\n</li>\n</ul></dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>164</strong> connections\n</p>\n</dd>\n<dt class=\"websites\">Websites</dt>\n<dd class=\"websites\">\n<ul><li>\n\nCompany Website\n\n</li>\n<li>\n\nPersonal Website\n\n</li>\n<li>\n\nBIOMOD\n\n</li>\n</ul></dd>\n</dl>", "locality": "San Francisco, California", "skills": ["DNA", "Nanotechnology", "Molecular Biology", "Software Development"], "industry": "Research", "interval": 0, "experience": [{"org": "UCSF", "title": "Assistant Professor", "end": "Present", "start": "September 2012"}, {"org": "Wyss 
Institute for Biologically Inspired Engineering", "title": "Technology Development Fellow", "start": "May 2009"}], "summary": "I am interested in inventing new methods to construct and manipulate biological molecules at the nanometer scale, toward developing new scientific tools and therapeutic devices.", "url": "http://www.linkedin.com/in/00006", "also_view": [{"url": "http://www.linkedin.com/pub/george-church/1/630/2b8", "id": "pub-george-church-1-630-2b8"}, {"url": "http://www.linkedin.com/pub/andrew-hessel/4/4b0/290", "id": "pub-andrew-hessel-4-4b0-290"}, {"url": "http://www.linkedin.com/pub/ayis-antoniou/0/216/630", "id": "pub-ayis-antoniou-0-216-630"}, {"url": "http://uk.linkedin.com/pub/matthew-bellis/35/973/888", "id": "pub-matthew-bellis-35-973-888"}, {"url": "http://www.linkedin.com/pub/john-mulligan-ph-d/7/5a3/5aa", "id": "pub-john-mulligan-ph-d-7-5a3-5aa"}, {"url": "http://www.linkedin.com/pub/yang-mao/38/621/a83", "id": "pub-yang-mao-38-621-a83"}, {"url": "http://www.linkedin.com/pub/sidney-wang/25/3b8/b84", "id": "pub-sidney-wang-25-3b8-b84"}, {"url": "http://www.linkedin.com/pub/yang-mao/9/815/369", "id": "pub-yang-mao-9-815-369"}, {"url": "http://www.linkedin.com/pub/j-markson/32/572/10", "id": "pub-j-markson-32-572-10"}], "homepage": {"BIOMOD": ["http://biomod.net/"], "Company Website": ["http://bionano.ucsf.edu/"], "Personal Website": ["http://www.shawndouglas.com/"]}, "events": [{"from": "Wyss Institute for Biologically Inspired Engineering", "to": "UCSF", "title1": "Technology Development Fellow", "start": 24112, "title2": "Assistant Professor", "end": 24152}]}
Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in something called JSON Lines format rather than JSON format: each line is a complete JSON object on its own.
Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON format, as I thought at one point), here's how to do that:
import json

path_to_file = "sample_input.json"
cleaned_file = "cleaned.json"

# Fields to keep.
fields = ["skills", "industry", "summary", "education", "experience"]

# Clean profiles in JSON Lines format file.
with open(path_to_file, encoding='UTF8') as inf, \
     open(cleaned_file, 'w', encoding='UTF8') as outf:
    for line in inf:
        profile = json.loads(line)  # Read a profile object.
        for key in list(profile.keys()):  # Remove unwanted fields from it.
            if key not in fields:
                del profile[key]
        outf.write(json.dumps(profile) + '\n')  # Write cleaned profile to new file.

# Test whether it worked.
with open(cleaned_file, encoding='UTF8') as cleaned:
    for line in cleaned:
        profile = json.loads(line)
        print(json.dumps(profile, indent=4))
You are dumping a new JSON object into the file every time you call json.dump(profile, f). That does not produce valid JSON, because the objects are written back to back instead of being embedded in an enclosing structure such as an array.
E.g. {}{} instead of [{},{}]
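A minimal sketch of the difference, using small made-up objects rather than the real profiles: parsing two objects written back to back fails with the same "Extra data" message, while the same objects separated by newlines parse fine one line at a time.

```python
import json

# Two objects written back to back, as repeated json.dump() calls produce.
concatenated = json.dumps({"a": 1}) + json.dumps({"b": 2})

try:
    json.loads(concatenated)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg  # The short reason, without position details.

# The same objects, one per line (JSON Lines), parsed line by line.
json_lines = json.dumps({"a": 1}) + "\n" + json.dumps({"b": 2}) + "\n"
parsed = [json.loads(line) for line in json_lines.splitlines()]

print(error_message)
print(parsed)
```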
As for a solution: given the size of your JSON, reading or writing it while holding everything in memory is a bad idea.
I would probably try the library https://pypi.org/project/jsonstreams/ or something similar.
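If a single valid JSON array is wanted rather than JSON Lines, the streaming idea can also be sketched with the standard library alone. This is not the jsonstreams API; write_json_array and clean are hypothetical helper names, and the io.StringIO buffers stand in for the real input and output files.

```python
import io
import json


def clean(lines, fields):
    """Yield each profile from JSON Lines input, keeping only the wanted keys."""
    for line in lines:
        profile = json.loads(line)
        yield {key: value for key, value in profile.items() if key in fields}


def write_json_array(profiles, out):
    """Write an iterable of dicts to `out` as one JSON array, one item at a time."""
    out.write("[")
    for index, profile in enumerate(profiles):
        if index:
            out.write(",\n")  # Separator goes before every item except the first.
        json.dump(profile, out)
    out.write("]\n")


# Tiny made-up input; in practice these would be the opened 9GB file and cleaned.json.
source = io.StringIO(
    '{"skills": ["DNA"], "url": "x"}\n'
    '{"industry": "Research", "interval": 0}\n'
)
target = io.StringIO()
write_json_array(clean(source, {"skills", "industry"}), target)

result = json.loads(target.getvalue())
print(result)
```

Only one profile is held in memory at a time while writing. Reading the resulting array back with json.load would load all of it at once, so JSON Lines is likely the easier format if the output will also be processed incrementally.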

How can I declare a list of map defined types in Cassandra

I want to declare a list of objects in Cassandra, and I have already created the object type:
CREATE TYPE profiles.educations (
    major text,
    end text,
    name text,
    degree text,
    start text,
    desce text
);
How can I declare a list of this educations type?
I ask because I have a JSON file in this format:
{
    ...
    "educations": [
        {
            "start": "2009",
            "major": "Business Administration and Management, General",
            "end": "2010",
            "name": "Gordon Institute of Business Science - University of Pretoria",
            "degree": "PDBA"
        },
        {
            "start": "2002",
            "major": "Marketing Management",
            "end": "2006",
            "name": "University of Pretoria/Universiteit van Pretoria",
            "degree": "B. com with specialization in Marketing Management"
        },
        {
            "major": "Finanzas",
            "end": "2013",
            "name": "Universidad de Los Andes",
            "degree": "Maestr\u00eda en Finanzas",
            "start": "2011",
            "desce": ""
        }]
    ...
}
