I'd like to use the to_json() function to serialize a pandas dataframe while encapsulating each row in a root 'Person' element.
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df.to_json(orient='records')
'[{"Name":"tom","Age":10},{"Name":"nick","Age":15},{"Name":"juli","Age":14}]'
I'd like the to_json() output to be:
'[{"Person":{"Name":"tom","Age":10}},{"Person":{"Name":"nick","Age":15}},{"Person":{"Name":"juli","Age":14}}]'
I'm thinking this can be achieved with dataframe.apply() but haven't been able to figure it out.
Thx.
Use a list comprehension to create a list of dicts from df.to_dict:
In [4370]: d = [{'Person': i} for i in df.to_dict(orient='records')]
Convert that list to JSON using json.dumps:
In [4372]: import json
In [4373]: j = json.dumps(d)
In [4374]: print(j)
[{"Person": {"Name": "tom", "Age": 10}}, {"Person": {"Name": "nick", "Age": 15}}, {"Person": {"Name": "juli", "Age": 14}}]
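If you'd rather stay with to_json(), as the question asks, a minimal sketch is to round-trip through it; the json.loads step also guarantees plain Python types before re-wrapping:
import json

# Parse the records emitted by to_json, wrap each one under 'Person', re-serialize
wrapped = [{'Person': rec} for rec in json.loads(df.to_json(orient='records'))]
print(json.dumps(wrapped))
# [{"Person": {"Name": "tom", "Age": 10}}, {"Person": {"Name": "nick", "Age": 15}}, {"Person": {"Name": "juli", "Age": 14}}]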
I suppose you want to use Person as an index or identifier for each record; otherwise, including Person as a fixed string key in every nested dict would be redundant. If that is the case, you can pass 'index' as the orient argument, which keys each record by the DataFrame's index.
>>> import pandas as pd
>>> data = [['tom', 10], ['nick', 15], ['juli', 14]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df.to_json(orient='index')
'{"0":{"Name":"tom","Age":10},"1":{"Name":"nick","Age":15},"2":{"Name":"juli","Age":14}}'
You can also set the index to whatever labels you want. I hope this is what you're after.
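For example, a minimal sketch that relabels the rows before serializing (note that orient='index' requires unique index labels, so a literal repeated 'Person' key would not work here):
>>> df.index = ['person_0', 'person_1', 'person_2']  # hypothetical labels
>>> df.to_json(orient='index')
'{"person_0":{"Name":"tom","Age":10},"person_1":{"Name":"nick","Age":15},"person_2":{"Name":"juli","Age":14}}'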
Related
I've got a little issue while coding a script that takes a CSV string and is supposed to select a column name and value based on the input. The CSV string contains names of NBA players, their universities etc. When the input is "name" && "Andre Brown", it should search for those values in the given CSV string. I have a rough code laid out, but I am unsure how to implement the where method. Any ideas?
import csv
import pandas as pd
import io

class MySelectQuery:
    def __init__(self, table, columns, where):
        self.table = table
        self.columns = columns
        self.where = where

    def __str__(self):
        return f"SELECT {self.columns} FROM {self.table} WHERE {self.where}"
csvString = "name,year_start,year_end,position,height,weight,birth_date,college\nAlaa Abdelnaby,1991,1995,F-C,6-10,240,'June 24, 1968',Duke University\nZaid Abdul-Aziz,1969,1978,C-F,6-9,235,'April 7, 1946',Iowa State University\nKareem Abdul-Jabbar,1970,1989,C,7-2,225,'April 16, 1947','University of California, Los Angeles\nMahmoud Abdul-Rauf,1991,2001,G,6-1,162,'March 9, 1969',Louisiana State University\n"
df = pd.read_csv(io.StringIO(csvString), error_bad_lines=False)
where = "name = 'Alaa Abdelnaby' AND year_start = 1991"
df = df.query(where)
print(df)
The CSV string is being transformed into a pandas DataFrame, which should then find the values based on the input. However, I get the error "name 'where' not defined". I believe everything up to the df = part is correct; now I need help implementing the where method. (I've seen one other solution on SO but wasn't able to understand or figure that out.)
# importing pandas
import pandas as pd

record = {
    'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya'],
    'Age': [21, 19, 20, 18, 17, 21],
    'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
    'Percentage': [88, 92, 95, 70, 65, 78]}

# create a dataframe
dataframe = pd.DataFrame(record, columns=['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", dataframe)

options = ['Math', 'Science']

# selecting rows based on condition
rslt_df = dataframe[(dataframe['Age'] == 21) &
                    dataframe['Stream'].isin(options)]
print('\nResult dataframe :\n', rslt_df)
Output:
Given Dataframe :
        Name  Age    Stream  Percentage
0      Ankit   21      Math          88
1       Amit   19  Commerce          92
2  Aishwarya   20   Science          95
3   Priyanka   18      Math          70
4      Priya   17      Math          65
5    Shaurya   21   Science          78

Result dataframe :
      Name  Age   Stream  Percentage
0    Ankit   21     Math          88
5  Shaurya   21  Science          78
Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/
Sometimes Googling does the trick ;)
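The same boolean-mask pattern carries over directly to the question's NBA frame (a sketch, assuming df was parsed from csvString as in the question):
# Filter rows with a boolean mask instead of df.query
rslt = df[(df['name'] == 'Alaa Abdelnaby') & (df['year_start'] == 1991)]
print(rslt)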
You need the double = there, and df.query() expects Python-style lowercase and rather than SQL's AND. So it should be:
where = "name == 'Alaa Abdelnaby' and year_start == 1991"
I am using the GridDB Python Client, and I have a container that stores my database. I need the DataFrame object converted to a list of lists. The read_sql_query function offered by pandas returns a DataFrame, but I need the result as a list of lists instead. The first element in the list of lists is the header of the DataFrame (the column names), and the other elements are the rows of the DataFrame. Is there a way I could do that? Please help.
Here is the code for the container and the part where the program reads SQL queries:
#...
import griddb_python as griddb
import pandas as pd
from pprint import pprint

factory = griddb.StoreFactory.get_instance()

# Initialize container
try:
    gridstore = factory.get_store(host="127.0.0.1", port="8080",
                                  cluster_name="Cls36", username="root",
                                  password="")
    conInfo = griddb.ContainerInfo("Fresher_Students",
                                   [["id", griddb.Type.INTEGER],
                                    ["First Name", griddb.Type.STRING],
                                    ["Last Name", griddb.Type.STRING],
                                    ["Gender", griddb.Type.STRING],
                                    ["Course", griddb.Type.STRING]],
                                   griddb.ContainerType.COLLECTION, True)
    cont = gridstore.put_container(conInfo)
    cont.create_index("id", griddb.IndexType.DEFAULT)

    data = pd.read_csv("fresher_students.csv")
    # Add data
    for i in range(len(data)):
        ret = cont.put(data.iloc[i, :])
    print("Data added successfully")
except griddb.GSException as e:
    print(e)

sql_statement = ('SELECT * FROM Fresher_Students')
sql_query = pd.read_sql_query(sql_statement, cont)

def convert_to_lol(query):
    # Code goes here
    # ...
    pass

LOL = convert_to_lol(sql_query.head())  # Not Laughing Out Loud, it's List of Lists
pprint(LOL)
#...
I want to get something that looks like this:
[["id", "First Name", "Last Name", "Gender", "Course"],
[0, "Catherine", "Ghua", "F", "EEE"],
[1, "Jake", "Jonathan", "M", "BMS"],
[2, "Paul", "Smith", "M", "BFA"],
[3, "Laura", "Williams", "F", "MBBS"],
[4, "Felix", "Johnson", "M", "BSW"],
[5, "Vivian", "Davis", "F", "BFD"]]
[UPDATED]
The easiest way I know about (for any DF):
df = pd.DataFrame({'id': [2, 3, 4], 'age': [24, 42, 13]})
[df.columns.tolist()] + df.values.tolist()
output:
[['id', 'age'], [2, 24], [3, 42], [4, 13]]
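Wrapped up as the convert_to_lol helper from the question, a minimal sketch:
def convert_to_lol(frame):
    # Header row first, then every data row as a plain Python list
    return [frame.columns.tolist()] + frame.values.tolist()

LOL = convert_to_lol(sql_query)
pprint(LOL)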
How can I add the outputs of different for loops into one DataFrame? For example, I have scraped data from a website and have lists of names, emails and phone numbers, each produced by its own loop. I want to add all the outputs into a table in a single DataFrame.
I am able to do it for one single loop, but not for multiple loops.
Please look at the code and output below.
By removing zip from the for loop, I get the error "Too many values to unpack".
Loop
phone = soup.find_all(class_="directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
## Output - List of Numbers
Code for df
df = list()
for name, mail, phn in zip(faculty_name, email, phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df
An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you could probably do:
import pandas as pd

D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
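A self-contained demonstration of the same pattern, using hypothetical stand-ins for the scraped tags (only the .text attribute is mimicked):
import pandas as pd
from types import SimpleNamespace

# Hypothetical stand-ins for the BeautifulSoup tags
faculty_name = [SimpleNamespace(text=' Alice '), SimpleNamespace(text=' Bob ')]
email = [SimpleNamespace(text='a@x.edu'), SimpleNamespace(text='b@x.edu')]
phone = [SimpleNamespace(text='123'), SimpleNamespace(text='456')]

D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
print(pd.DataFrame(D))
#     name     mail phone
# 0  Alice  a@x.edu   123
# 1    Bob  b@x.edu   456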
Another way, with a lambda function:
import pandas as pd

text_strip = lambda s: s.text.strip()
D = {
    'name': list(map(text_strip, faculty_name)),
    'mail': list(map(text_strip, email)),
    'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length, you may try this (though I am not sure it is very efficient):
import pandas as pd

columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
D = {c_name: [None] * max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element
df = pd.DataFrame(D)
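A shorter sketch for the unequal-length case is itertools.zip_longest, which pads the shorter lists with None (this assumes the .text.strip() extraction has already been applied, so the lists hold plain strings):
import pandas as pd
from itertools import zip_longest

# Each row pairs one element from every list; missing entries become None
rows = list(zip_longest(faculty_name, email, phone, fillvalue=None))
df = pd.DataFrame(rows, columns=['name', 'mail', 'phone'])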
Try this:
data = {'name': [name.text.strip() for name in faculty_name],
        'mail': [mail.text.strip() for mail in email],
        'phn': [phn.text.strip() for phn in phone]}
df = pd.DataFrame.from_dict(data)
I have JSON data; given below is a sample of it.
[{
    "Type": "Fruit",
    "Names": "Apple;Orange;Papaya"
}, {
    "Type": "Veggie",
    "Names": "Cucumber;Spinach;Tomato"
}]
I have to read the Names and match each item of Names against a column in another df.
I am stuck at converting the value of the Names key into a list that can be used in the pattern. The code I tried is
df1 = pd.DataFrame(data)
PriList = df1['Names'].str.split(";", n=1, expand=True)
Pripat = '|'.join(r"\b{}\b".format(x) for x in PriList)
df['Match'] = df['MasterList'].str.findall('(' + Pripat + ')').str.join(', ')
The issue is with Pripat; its content is
\bApple, Orange\b
If I give the Names in a list like below
Prilist=['Apple','Orange','Papaya']
the code works fine...
Please help.
You'll need to call str.split and then flatten the result using itertools.chain.
First, do
df2 = df1.loc[df1.Type.eq('Fruit')]
Now,
from itertools import chain
prilist = list(chain.from_iterable(df2.Names.str.split(';').values))
There's also stack (which is slower):
prilist = df2.Names.str.split(';', expand=True).stack().tolist()
print(prilist)
['Apple', 'Orange', 'Papaya']
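On pandas 0.25+ there is also Series.explode, a concise alternative sketch:
# Split each Names value into a list, then flatten one element per row
prilist = df2.Names.str.split(';').explode().tolist()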
df2 = df1.loc[df1.Type.eq('Fruit')]
out_list = ';'.join(df2['Names'].values).split(';')
print(out_list)
# ['Apple', 'Orange', 'Papaya']
I have a JSON file where each object contains some subset of a superset of keys, e.g. the 'ideal' case where all the keys of the superset are present in an object:
{
    "firstName": "foo",
    "lastName": "bar",
    "age": 20,
    "email": "email#example.com"
}
However, some objects are like this:
{
    "firstName": "name",
    "age": 40,
    "email": "email#example.com"
}
What's the optimal way to write only the objects that contain every key of the superset to a CSV?
If it were simply a case of a key having a null value, I think I'd just use .dropna with pandas and it'd omit that row from the CSV.
Should I impute the missing keys so that each object contains the superset, but with null values? If so, how?
As you suggested, reading into a pandas DataFrame should do the trick. pd.read_json() will leave a NaN for any key not contained in a given JSON record. So try:
a = pd.read_json(json_string, orient='records')
a.dropna(inplace=True)
a.to_csv(filename,index=False)
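If you later decide only some keys are mandatory, dropna can be restricted to them; a sketch, where required here happens to be the question's full superset:
required = ['firstName', 'lastName', 'age', 'email']
# Drop only the rows missing one of the required keys
a = a.dropna(subset=required)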
Suppose you have
json_string ='[{ "firstName": "foo", "lastName": "bar", "age": 20, "email":"email#example.com"}, {"firstName": "name", "age": 40,"email":"email#example.com"}]'
Then you can
import json

l = json.loads(json_string)
df = pd.DataFrame(l)
Which yields
age email firstName lastName
0 20 email#example.com foo bar
1 40 email#example.com name NaN
Then,
>>> df.to_dict('records')
[{'age': 20,
'email': 'email#example.com',
'firstName': 'foo',
'lastName': 'bar'},
{'age': 40,
'email': 'email#example.com',
'firstName': 'name',
'lastName': nan}]
or
>>> df.to_json()
'{"age":{"0":20,"1":40},"email":{"0":"email#example.com","1":"email#example.com"},"firstName":{"0":"foo","1":"name"},"lastName":{"0":"bar","1":null}}'
The good thing about having a data frame is that you can parse/manipulate the data however you want before making it a dictionary/json again.
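For the question's stated goal, writing only the complete records to CSV, a minimal sketch continuing from df above:
# Keep rows with no NaN, then write them out ('complete_records.csv' is a hypothetical filename)
df.dropna().to_csv('complete_records.csv', index=False)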
Test for all the values you want:
import json

x = json.loads(json_string)
if 'firstName' in x and 'lastName' in x and 'age' in x and 'email' in x:
    print('we got all the values')
else:
    print('one or more values are missing')
Or, a prettier way to do it:
fields = ['firstName', 'lastName', 'age', 'email']
if all(f in x for f in fields):
    print('we got all the fields')
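Note that json.loads on the sample above returns a list of objects, so in practice you would apply the test per object; a sketch:
import json

fields = ['firstName', 'lastName', 'age', 'email']
records = json.loads(json_string)
# Keep only the objects that contain every field
complete = [r for r in records if all(f in r for f in fields)]
print(complete)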