Convert pandas dataframe to JSON schema - python

I have a dataframe
import pandas as pd
data = {
    "ID": [123123, 222222, 333333],
    "Main Authors": ["[Jim Allen, Tim H]", "[Rob Garder, Harry S, Tim H]", "[Wo Shu, Tee Ru, Fuu Wan, Gee Han]"],
    "Abstract": ["This is paper about hehe", "This paper is very nice", "Hello there paper from kellogs"],
    "paper IDs": ["[123768, 123123]", "[123432, 34345, 353545, 454545]", "[123123, 3433434, 55656655, 988899]"],
}
df = pd.DataFrame(data)
and I am trying to export it to a JSON schema. I do so via
df.to_json(orient='records')
'[{"ID":123123,"Main Authors":"[Jim Allen, Tim H]","Abstract":"This is paper about hehe","paper IDs":"[123768, 123123]"},
{"ID":222222,"Main Authors":"[Rob Garder, Harry S, Tim H]","Abstract":"This paper is very nice","paper IDs":"[123432, 34345, 353545, 454545]"},
{"ID":333333,"Main Authors":"[Wo Shu, Tee Ru, Fuu Wan, Gee Han]","Abstract":"Hello there paper from kellogs","paper IDs":"[123123, 3433434, 55656655, 988899]"}]'
but this is not in the right format for JSON. How can I get my output to look like this
{"ID": "123123", "Main Authors": ["Jim Allen", "Tim H"], "Abstract": "This is paper about hehe", "paper IDs": ["123768", "123123"]}
{and so on for paper 2...}
I can't find an easy way to achieve this schema with the basic functions.

to_json returns a proper JSON document. What you want is not a JSON document.
Add lines=True to the call:
df.to_json(orient='records', lines=True)
The output you desire is not valid JSON. It's a very common way to stream JSON objects though: write one unindented JSON object per line.
Streaming JSON is an old technique, used to write JSON records to logs, send them over the network etc. There's no specification for this, but a lot of people tried to hijack it, even creating sites that mirrored Douglas Crockford's original JSON site, or mimicking the language of RFCs.
Streaming JSON formats are used a lot in IoT and event processing applications, where events will arrive over a long period of time.
PS: I remembered seeing a question about json-seq a few months ago. It seems there was an attempt to standardize streaming JSON in RFC 7464 as JSON Text Sequences, using the media type application/json-seq.
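As a minimal sketch of the lines=True approach, assuming the bracketed strings from the question should first be split into real lists, the conversion could look like the following. split_bracketed is a hypothetical helper written for this example, not a pandas function, and only two rows and three columns of the question's data are used.

```python
import json

import pandas as pd

data = {
    "ID": [123123, 222222],
    "Main Authors": ["[Jim Allen, Tim H]", "[Rob Garder, Harry S, Tim H]"],
    "paper IDs": ["[123768, 123123]", "[123432, 34345, 353545, 454545]"],
}
df = pd.DataFrame(data)


def split_bracketed(text):
    """Hypothetical helper: turn the string '[a, b]' into the list ['a', 'b']."""
    inner = text.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


# Replace the bracketed strings with real lists so they serialize as JSON arrays.
for column in ["Main Authors", "paper IDs"]:
    df[column] = df[column].apply(split_bracketed)

# The desired output shows the ID as a string, so convert it explicitly.
df["ID"] = df["ID"].astype(str)

# One JSON object per line instead of a single JSON array.
json_lines = df.to_json(orient="records", lines=True)
print(json_lines)

# Each line parses on its own as a JSON object.
records = [json.loads(line) for line in json_lines.splitlines()]
```

The simple comma split is an assumption that holds for the sample data only; names that themselves contain commas would need a different parsing rule.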

You can convert the DataFrame to a list of dictionaries first.
import pandas as pd
data = {
    "ID": [123123, 222222, 333333],
    "Main Authors": [["Jim Allen", "Tim H"], ["Rob Garder", "Harry S", "Tim H"], ["Wo Shu", "Tee Ru", "Fuu Wan", "Gee Han"]],
    "Abstract": ["This is paper about hehe", "This paper is very nice", "Hello there paper from kellogs"],
    "paper IDs": [[123768, 123123], [123432, 34345, 353545, 454545], [123123, 3433434, 55656655, 988899]],
}
df = pd.DataFrame(data)
df.to_dict('records')
The result:
[{'ID': 123123,
'Main Authors': ['Jim Allen', 'Tim H'],
'Abstract': 'This is paper about hehe',
'paper IDs': [123768, 123123]},
{'ID': 222222,
'Main Authors': ['Rob Garder', 'Harry S', 'Tim H'],
'Abstract': 'This paper is very nice',
'paper IDs': [123432, 34345, 353545, 454545]},
{'ID': 333333,
'Main Authors': ['Wo Shu', 'Tee Ru', 'Fuu Wan', 'Gee Han'],
'Abstract': 'Hello there paper from kellogs',
'paper IDs': [123123, 3433434, 55656655, 988899]}]
Is that what you are looking for?
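Starting from a list of dictionaries like the one above, the one-object-per-line output from the question can be produced with the standard json module. This is only a sketch: the records are typed in by hand here rather than taken from df.to_dict('records'), and stringify_ids is a hypothetical helper that assumes the IDs should become strings, as in the question's desired output.

```python
import json

records = [
    {"ID": 123123, "Main Authors": ["Jim Allen", "Tim H"],
     "Abstract": "This is paper about hehe", "paper IDs": [123768, 123123]},
    {"ID": 222222, "Main Authors": ["Rob Garder", "Harry S", "Tim H"],
     "Abstract": "This paper is very nice", "paper IDs": [123432, 34345, 353545, 454545]},
]


def stringify_ids(record):
    """Hypothetical helper: return a copy of the record with its IDs as strings."""
    converted = dict(record)
    converted["ID"] = str(record["ID"])
    converted["paper IDs"] = [str(paper_id) for paper_id in record["paper IDs"]]
    return converted


# Serialize each record separately so every line is a complete JSON object.
lines = [json.dumps(stringify_ids(record)) for record in records]
print("\n".join(lines))
```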

Related

Pass text to a Python script and return the result using R JSON

I have a string in R that I would like to pass to python in order to compute something and return the result back into R.
I have the following which "works" but not as I would like.
The code below passes a string from R to Python, uses OpenAI to generate the text, and then loads the result back into R.
library(reticulate)
computePythonFunction <- "
def print_openai_response():
    import openai
    openai.api_key = 'ej-powurjf___OpenAI_API_KEY___HGAJjswe' # you will need an API key
    prompt = 'write me a poem about the sea'
    response = openai.Completion.create(engine = 'text-davinci-003', prompt = prompt, max_tokens=1000)
    #response['choices'][0]['text']
    print(response)
"
py_run_string(computePythonFunction)
py$print_openai_response()
library("rjson")
fromJSON(as.character(py$print_openai_response()))
I would like to store the results in R objects. For example, here is one output from the Python script:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": "\n\nThe sea glitters like stars in the night \nWith hues, vibrant and bright\nThe waves flow gentle, serene, and divine \nLike the sun's most gentle shine\n\nAs the sea reaches, so wide, so vast \nAn adventure awaits, and a pleasure, not passed\nWhite sands, with seaweed green \nForms a kingdom of the sea\n\nConnecting different land and tide \nThe sea churns, dancing with the sun's pride\nAs a tempest forms, raging and wild \nThe sea turns, its colors so mild\n\nA raging storm, so wild and deep \nProtecting the creatures that no one can see \nThe sea is a living breathing soul \nA true and untouchable goal \n\nThe sea is a beauty that no one can describe \nAnd it's power, no one can deny \nAn ever-lasting bond, timeless and free \nThe love of the sea, is a love, to keep"
}
],
"created": 1670525403,
"id": "cmpl-6LGG3hDNzeTZ5VFbkyjwfkHH7rDkE",
"model": "text-davinci-003",
"object": "text_completion",
"usage": {
"completion_tokens": 210,
"prompt_tokens": 7,
"total_tokens": 217
}
}
I am interested in the generated text, but I am also interested in completion_tokens, prompt_tokens and total_tokens.
I thought about saving the Python code as a script and then passing the argument to it, such as:
myPythin.py arg1.
How can I return the JSON output from the model to an R object? The only input which changes/varies in the python code is the prompt variable.

Python - manipulating pyodbc.fetchall() into a pandas usable format

I'm writing a program that obtains data from a database using pyodbc, the end goal being to analyze this data with pandas.
As it stands, my program connects to the database and collects the data I need, but I'm having trouble organizing or formatting this data so that I can analyze it with pandas, or simply write it out cleanly to a .csv file (I know I can do this with pandas as well).
Here is the basis of my simple program:
from Logger import Logger
import pyodbc
from configparser import ConfigParser
from connectDB import connectDatabase, disconnectDatabase
config = ConfigParser()
config.read('config.ini')
getNeedlesPlaintiffs = config.get('QUERIES', 'pullNeedlesPlaintiffs')
getNeedlesDefendants = config.get('QUERIES', 'pullNeedlesDefendants')
def pullNeedlesData():
    Logger.writeAndPrintLine("Connecting to needles db...", 0)
    cnxn = connectDatabase()
    if cnxn:
        cursor=cnxn.cursor()
        Logger.writeAndPrintLine("Connection successful. Getting Plaintiffs...", 0)
        cursor.execute(getNeedlesPlaintiffs)
        with open('needlesPlaintiffs.csv', 'w') as f:
            for row in cursor.fetchall():
                row = str(row)
                f.write(row)
        f.close()
        Logger.writeAndPrintLine("Plaintiffs written to file, getting Defendants...", 0)
        cursor.execute(getNeedlesDefendants)
        with open('needlesDefendants.csv', 'w') as d:
            for row in cursor.fetchall():
                row = str(row)
                d.write(row)
        d.close()
        disconnectDatabase(cnxn)
        Logger.writeAndPrintLine("Defendants obtained, written to file.", 0)
    else:
        Logger.writeAndPrintLine("Connection to Needles DB Failed.", 2)
if __name__ == "__main__":
    pullNeedlesData()
However, the output I'm getting in the .csv (and console) is simply unworkable. I would like to parse my data into a list of dictionaries, so that I can more easily use it for analysis with pandas.
For example, something like this (which I can then json.loads() into a pandas dataframe):
text_data = '[{"lname": "jones", "fname": "matt", "dob": "01-02-1990", "addr1": "28 sheffield dr"},\
{"lname": "kalinski", "fname": "fred", "dob": "01-02-1980", "addr1": "28 purple st"}, \
{"lname": "kyle", "fname": "ken", "dob": "05-01-1978", "addr1": "28 carlisle dr"}, \
{"lname": "jones", "fname": "matt", "dob": "01-02-1990", "addr1": "new address"}, \
{"lname": "kalinski", "fname": "fred", "dob": "01-02-1980", "addr1": "28 purple st"}, \
{"lname": "kyle", "fname": "ken", "dob": "05-01-1979", "addr1": "other address"}]'
Where I am now, I'm simply at a loss for how one would go about parsing this data from pyodbc.fetchall() into what I know I can work with- a list of dictionaries. Additionally, I would eventually like to print results to csv in a readable way.
My data is currently returned in a format like this:
(238384, 'Mr. Nathan Brown', 'Person', datetime.date(1989, 2, 3), '41 Fake Rd 1 \r\nTownName, State 13827')(283928, 'Mr. Logan Green', 'Person', datetime.date(2003, 5, 18), '36 county rd \r\nTownName, State 14432')(38272, 'Mrs. Penellope Blue', 'Person', datetime.date(1988, 1, 27), '123 fake st \r\nTownName, State, 14280)(...)
I realize I need to create an empty list object, then parse each row into a dictionary, and add it to the list- but I've never had to work with data on this scale and I'm wondering if there's a library or something that makes this type of work easier to accomplish.
Thank you for any insights.
Why not just import the data directly into pandas?
df = pd.read_sql_query(sql_query, db.connection)
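A self-contained sketch of that idea follows. It uses an in-memory sqlite3 database as a stand-in for the pyodbc connection, since both follow the Python DB-API; the table name, column names and rows are made up for the example. It also shows how a list of dictionaries can be built by hand from cursor.description, in case that intermediate form is still wanted.

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for the pyodbc connection here (assumption for the example).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE plaintiffs (lname TEXT, fname TEXT, dob TEXT)")
connection.executemany(
    "INSERT INTO plaintiffs VALUES (?, ?, ?)",
    [("jones", "matt", "01-02-1990"), ("kalinski", "fred", "01-02-1980")],
)

sql_query = "SELECT lname, fname, dob FROM plaintiffs ORDER BY lname"

# Option 1: let pandas run the query and build the DataFrame directly.
df = pd.read_sql_query(sql_query, connection)

# Option 2: build a list of dictionaries from the cursor metadata.
cursor = connection.cursor()
cursor.execute(sql_query)
column_names = [column[0] for column in cursor.description]
rows = [dict(zip(column_names, row)) for row in cursor.fetchall()]

connection.close()
print(df)
print(rows)
```

Once the data is in a DataFrame, df.to_csv('plaintiffs.csv', index=False) writes a readable CSV with a header row. With a raw pyodbc connection, pandas may warn that it only officially supports SQLAlchemy connectables and sqlite3 connections.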

How to get fortnite stats in python

So I was trying to find something to code, and I decided to use Python to get Fortnite stats. I came across the fortnite_python library and it works, but it displays item codes for items in the shop when I want it to display the names. Does anyone know how to convert them, or how to display the names in the first place? This is my code.
fortnite = Fortnite('c954ed23-756d-4843-8f99-cfe850d2ed0c')
store = fortnite.store()
fortnite.store()
It outputs something like this
[<StoreItem 12511>,
To print out the attributes of a Python object, you can use __dict__, e.g.:
from fortnite_python import Fortnite
from json import dumps
fortnite = Fortnite('Your API Key')
# ninjas_account_id = fortnite.player('ninja')
# print(f'ninjas_account: {ninjas_account_id}') # ninjas_account: 4735ce91-3292-4caf-8a5b-17789b40f79c
store = fortnite.store()
example_store_item = store[0]
print(dumps(example_store_item.__dict__, indent=2))
Output:
{
"_data": {
"imageUrl": "https://trackercdn.com/legacycdn/fortnite/237112511_large.png",
"manifestId": 12511,
"name": "Dragacorn",
"rarity": "marvel",
"storeCategory": "BRSpecialFeatured",
"vBucks": 0
},
"id": 12511,
"image_url": "https://trackercdn.com/legacycdn/fortnite/237112511_large.png",
"name": "Dragacorn",
"rarity": "marvel",
"store_category": "BRSpecialFeatured",
"v_bucks": 0
}
So it looks like you want to use the name attribute of StoreItem:
for store_item in store:
    print(store_item.name)
Output:
Dragacorn
Hulk Smashers
Domino
Unstoppable Force
Scootin'
Captain America
Cable
Probability Dagger
Chimichanga!
Daywalker's Kata
Psi-blade
Snap
Psylocke
Psi-Rider
The Devil's Wings
Daredevil
Meaty Mallets
Silver Surfer
Dayflier
Silver Surfer's Surfboard
Ravenpool
Silver Surfer Pickaxe
Grand Salute
Cuddlepool
Blade
Daredevil's Billy Clubs
Mecha Team
Tricera Ops
Combo Cleaver
Mecha Team Leader
Dino
Triassic
Rex
Cap Kick
Skully
Gold Digger
Windmill Floss
Bold Stance
Jungle Scout
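The same inspection technique works on any ordinary Python object. The sketch below uses a hypothetical stand-in StoreItem class with made-up values, not the class from fortnite_python, so it runs without an API key.

```python
from json import dumps


class StoreItem:
    """Hypothetical stand-in for a store item; not the library's real class."""

    def __init__(self, item_id, name, rarity):
        self.id = item_id
        self.name = name
        self.rarity = rarity

    def __repr__(self):
        # Mimics the id-only display seen in the question.
        return f"<StoreItem {self.id}>"


# Made-up sample items for illustration.
store = [StoreItem(1, "Dragacorn", "marvel"), StoreItem(2, "Domino", "marvel")]

# The repr only shows the id, which hides the other attributes.
print(store)

# vars(obj) returns obj.__dict__, so every instance attribute becomes visible.
print(dumps(vars(store[0]), indent=2))

# Once the attribute name is known, read it directly.
names = [item.name for item in store]
print(names)
```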
It seems that the library doesn't contain a function to get the names. Also, this is what the class of an item from the store looks like:
class StoreItem(Domain):
    """Object containing store items attributes"""
and that's it.

'DataFrame' object is not callable PYTHON

I have code that should write information collected with Selenium to Excel. I have one list with some information, and I need to write all of it to Excel. I have a solution, but when I tried to use it I got 'DataFrame' object is not callable. How can I solve it?
All of this code runs inside a loop:
for schools in List: #in the List i have data from excel file with Name of schools
    data = pd.DataFrame()
    data({
        "School Name":School_list_result[0::17],
        "Principal":School_list_result[1::17],
        "Principal's E-mail":School_list_result[2::17],
        "Type":School_list_result[8::17],
        "Grade Span": School_list_result[3::17],
        "Address":School_list_result[4::17],
        "Phone":School_list_result[14::17],
        "Website":School_list_result[13::17],
        "Associations/Communities":School_list_result[5::17],
        "GreatSchools Summary Rating":School_list_result[6::17],
        "U.S.News Rankings":School_list_result[12::17],
        "Total # Students":School_list_result[15::17],
        "Full-Time Teachers":School_list_result[16::17],
        "Student/Teacher Ratio":School_list_result[17::17],
        "Charter":School_list_result[9::17],
        "Enrollment by Race/Ethnicity": School_list_result[7::17],
        "Enrollment by Gender":School_list_result[10::17],
        "Enrollment by Grade":School_list_result[11::17],
    })
    data.to_excel("D:\Schools.xlsx")
In School_list_result I have this data:
'Cape Elizabeth High School',
'Mr. Jeffrey Shedd',
'No data.',
'9-12',
'345 Ocean House Road, Cape Elizabeth, ME 04107',
'Cape Elizabeth Public Schools',
'8/10',
'White\n91%\nAsian\n3%\nTwo or more races\n3%\nHispanic\n3%\nBlack\n1%',
'Regular school',
'No',
' Male Female\n Students 281 252',
' 9 10 11 12\n Students 139 135 117 142',
'#5,667 in National Rankings',
'https://cehs.cape.k12.me.us/',
'Tel: (207)799-3309',
'516 students',
'47 teachers',
'11:1',
Please follow the documented syntax for creating a DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
So your code should be modified as:
for schools in List: #in the List i have data from excel file with Name of schools
    # Each school contributes 18 fields (indices 0 to 17 in the sample data),
    # so the slices step by 18 to keep every column the same length.
    data = pd.DataFrame(data={
        "School Name": School_list_result[0::18],
        "Principal": School_list_result[1::18],
        "Principal's E-mail": School_list_result[2::18],
        "Type": School_list_result[8::18],
        "Grade Span": School_list_result[3::18],
        "Address": School_list_result[4::18],
        "Phone": School_list_result[14::18],
        "Website": School_list_result[13::18],
        "Associations/Communities": School_list_result[5::18],
        "GreatSchools Summary Rating": School_list_result[6::18],
        "U.S.News Rankings": School_list_result[12::18],
        "Total # Students": School_list_result[15::18],
        "Full-Time Teachers": School_list_result[16::18],
        "Student/Teacher Ratio": School_list_result[17::18],
        "Charter": School_list_result[9::18],
        "Enrollment by Race/Ethnicity": School_list_result[7::18],
        "Enrollment by Gender": School_list_result[10::18],
        "Enrollment by Grade": School_list_result[11::18],
    })
Do you want to append to an existing xlsx file?
First, create the dictionary and then pass it to the DataFrame constructor, like this:
r = {"column1":["data"], "column2":["data"]}
data = pd.DataFrame(r)
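To make the difference concrete, here is a small sketch that builds a DataFrame from a flat list through the constructor and then reproduces the error by calling an existing DataFrame instance. The flat list, its three-field record width and the column names are hypothetical and much smaller than the data in the question.

```python
import pandas as pd

FIELDS_PER_SCHOOL = 3  # hypothetical record width for this small example
flat = ["School A", "Principal A", "9-12", "School B", "Principal B", "K-8"]

# Correct: pass the dictionary to the DataFrame constructor.
frame = pd.DataFrame({
    "School Name": flat[0::FIELDS_PER_SCHOOL],
    "Principal": flat[1::FIELDS_PER_SCHOOL],
    "Grade Span": flat[2::FIELDS_PER_SCHOOL],
})
print(frame)

# Incorrect: calling an existing DataFrame instance raises TypeError.
try:
    frame({"School Name": []})
except TypeError as error:
    message = str(error)
    print(message)
```

The slice step has to equal the number of fields per record; otherwise the columns end up with different lengths or misaligned values.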

Extracting names from JSON file

I'm trying to get only the names of the playlists from a JSON file that I have, but I cannot make it work.
{'playlists': [{'description': '',
'lastModifiedDate': '2018-11-20',
'name': 'Piano',
'numberOfFollowers': 0,
'tracks': [{'artistName': 'Kenzie Smith Piano',
'trackName': "You've Got a Friend in Me (From "
'"Toy Story")'},
{'artistName': 'Kenzie Smith Piano',
'trackName': 'A Whole New World (From "Aladdin")'},
{'artistName': 'Kenzie Smith Piano',
'trackName': 'Can You Feel the Love Tonight? (From '
'"The Lion King")'},
{'artistName': 'Kenzie Smith Piano',
'trackName': "He's a Pirate / The Black Pearl "
'(From "Pirates of the Caribbean")'},
{'artistName': 'Kenzie Smith Piano',
'trackName': "You'll be in My Heart (From "
'"Tarzan") [Soft Version]'},
import json
from pprint import pprint
json_data=open('C:/Users/alvar/Desktop/Alvaro/Nueva carpeta/Playlist.json', encoding="utf8").read()
playlist = json.loads(json_data)
pprint(playlist)
Here is where it is not working:
for names in playlist_list:
    print(names['name'])
    print '\n'
What I want is to extract only the names of the playlists.
The error occurs because you are not accessing the dictionary key 'playlists':
for plst in playlist['playlists']:
    print(plst['name'])
# Piano
You are iterating over the wrong object.
Do not forget that json.loads(json_data) returns the object as it is stored. In your case, it's a dict with only one key: 'playlists'. You have to access this element with loaded_json['playlists'] and then iterate over the list of playlists.
Here, loaded_json is of type Dict[str, List[dict]]. Be careful with JSON and nested data structures.
Try:
loaded_json = json.loads(json_data)  # type: Dict[str, List[dict]]
for playlist in loaded_json['playlists']:  # type: dict
    print('{}\n'.format(playlist['name']))
By doing this, you will get all the playlists' names.
Documentation: JSON encoder and decoder
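A self-contained version of this lookup, using a small inline JSON string instead of the Playlist.json file, might look like the sketch below. The second playlist and the shortened track data are made up for the example; only the 'playlists', 'name', 'tracks' and 'trackName' keys come from the question.

```python
import json

# Inline sample shaped like the question's file (contents partly made up).
json_data = """
{"playlists": [
    {"name": "Piano",
     "tracks": [{"artistName": "Kenzie Smith Piano", "trackName": "A Whole New World"}]},
    {"name": "Running", "tracks": []}
]}
"""

loaded_json = json.loads(json_data)

# The top-level object is a dict; the list of playlists sits under 'playlists'.
playlist_names = [playlist["name"] for playlist in loaded_json["playlists"]]
print(playlist_names)

# The same pattern goes one level deeper for the track names.
track_names = [
    track["trackName"]
    for playlist in loaded_json["playlists"]
    for track in playlist["tracks"]
]
print(track_names)
```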
