MemoryError in Python when saving a list to a DataFrame

I was trying to save a Python list containing JSON strings into a DataFrame in a Jupyter notebook:
df = pd.io.json.json_normalize(mon_list)
df[['gfmsStr','_id']]
But then I received this error:
MemoryError
After that, every other cell I run also raises the memory error. I am wondering what caused this and whether there is any way to increase the available memory to avoid the error.
Thanks!
Update:
The contents of mon_list look like the following:
mon_list[1]
[{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
{'name': {'given': 'Mose', 'family': 'Regner'}},
{'id': 2, 'name': 'Faye Raker'}]

Do you really have a list? Or do you have a JSON file? What format is the "mon_list" variable in?
This is how you convert a list to a DataFrame:
import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# Calling the DataFrame constructor on the list
df = pd.DataFrame(lst)
https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/
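If the list itself fits in memory but json_normalize runs out while building the frame, one workaround is to normalize in chunks and keep only the columns you need, so no single intermediate frame holds every field. A minimal sketch, assuming mon_list holds flat dicts with the two fields from the question (the sample data here is made up):

```python
import pandas as pd

# Hypothetical stand-in for mon_list: flat dicts with the two
# columns the question keeps ('gfmsStr' and '_id').
mon_list = [{'gfmsStr': 'x' * 10, '_id': str(i)} for i in range(10_000)]

# Normalize in chunks and keep only the needed columns, so the
# intermediate frames stay small.
chunk_size = 2_000
parts = [
    pd.json_normalize(mon_list[i:i + chunk_size])[['gfmsStr', '_id']]
    for i in range(0, len(mon_list), chunk_size)
]
df = pd.concat(parts, ignore_index=True)
print(df.shape)  # (10000, 2)
```

Note that pd.json_normalize is the current spelling; pd.io.json.json_normalize is the older path for the same function.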

Related

How to convert columns of numpy arrays to lists when using .to_dict

I would like to take my pandas DataFrame and convert it to a list of dictionaries. I can do this using the pandas to_dict('records') function. However, this function returns any column values that are arrays as numpy arrays. I would like the content of the returned list of dictionaries to be base Python objects rather than numpy arrays.
I understand I could iterate my outputted dictionaries but I was wondering if there is something more clever to do this.
Here is some sample code that shows my problem:
import pandas as pd
import numpy as np

data = pd.concat([
    pd.Series(['a--b', 'c--d', 'e--f'], name='key'),
    pd.Series(['123', '456', '789'], name='code'),
    pd.Series([np.array(['123', '098']), np.array(['000', '999']),
               np.array(['789', '432'])], name='codes')
], axis=1)
output = data.to_dict('records')

# this prints <class 'numpy.ndarray'>
print(type(output[0]['codes']))
output, in this case, looks like this:
[{'key': 'a--b', 'code': '123', 'codes': array(['123', '098'], dtype='<U3')},
{'key': 'c--d', 'code': '456', 'codes': array(['000', '999'], dtype='<U3')},
{'key': 'e--f', 'code': '789', 'codes': array(['789', '432'], dtype='<U3')}]
I would like for that print statement to print a list. I understand I could simply do the following:
for row in output:
    row['codes'] = row['codes'].tolist()

# this now prints <class 'list'>, which is what I want
print(type(output[0]['codes']))
However, my dataframe is of course much more complicated than this, and I have multiple columns that are numpy arrays. I know I could expand the snippet above to check which columns are of array type and cast them with tolist(), but I'm wondering if there is something snappier or more clever; perhaps something provided by pandas that is optimized?
To be clear, here is the output I need to have:
print(output)
[{'key': 'a--b', 'code': '123', 'codes': ['123', '098']},
{'key': 'c--d', 'code': '456', 'codes': ['000', '999']},
{'key': 'e--f', 'code': '789', 'codes': ['789', '432']}]
Let us first use applymap to convert the numpy arrays to Python lists, then use to_dict:
cols = ['codes']
data.assign(**data[cols].applymap(list)).to_dict('records')
[{'key': 'a--b', 'code': '123', 'codes': ['123', '098']},
{'key': 'c--d', 'code': '456', 'codes': ['000', '999']},
{'key': 'e--f', 'code': '789', 'codes': ['789', '432']}]
I ended up creating a list of the numpy-typed column names:
np_fields = ['codes']
and then I replaced each field in place in my dataframe:
for col in np_fields:
    data[col] = data[col].map(np.ndarray.tolist)
I then called data.to_dict('records') once that was complete.
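If maintaining the column list by hand is a concern, the ndarray columns can also be detected automatically. A sketch, assuming each column is homogeneously typed so the first row is representative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'key': ['a--b', 'c--d'],
    'codes': [np.array(['123', '098']), np.array(['000', '999'])],
})

# Detect columns whose first value is an ndarray (assumption: every
# cell in a column has the same type), then convert them in place.
np_cols = [c for c in data.columns if isinstance(data[c].iloc[0], np.ndarray)]
for col in np_cols:
    data[col] = data[col].map(np.ndarray.tolist)

output = data.to_dict('records')
print(output[0]['codes'])  # ['123', '098']
```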

Pandas: Convert dictionary to dataframe where keys and values are the columns

I have a dictionary like so:
d = {'3a0fe308-b78d-4080-a68b-84fdcbf5411e': 'SUCCEEDED-HALL-IC_GBI', '7c975c26-f9fc-4579-822d-a1042b82cb17': 'SUCCEEDED-AEN-IC_GBI', '9ff20206-a841-4dbf-a736-a35fcec604f3': 'SUCCEEDED-ESP2-IC_GBI'}
I would like to convert my dictionary into something like this, with all the keys in one list and all the values in another, so I can make a DataFrame:
d = {'key': ['3a0fe308-b78d-4080-a68b-84fdcbf5411e', '7c975c26-f9fc-4579-822d-a1042b82cb17', '9ff20206-a841-4dbf-a736-a35fcec604f3'],
     'value': ['SUCCEEDED-HALL-IC_GBI', 'SUCCEEDED-AEN-IC_GBI', 'SUCCEEDED-ESP2-IC_GBI']}
What would be the best way to go about this?
You can easily create a DataFrame like this:
import pandas as pd
d = {'3a0fe308-b78d-4080-a68b-84fdcbf5411e': 'SUCCEEDED-HALL-IC_GBI',
     '7c975c26-f9fc-4579-822d-a1042b82cb17': 'SUCCEEDED-AEN-IC_GBI',
     '9ff20206-a841-4dbf-a736-a35fcec604f3': 'SUCCEEDED-ESP2-IC_GBI'}
table = pd.DataFrame(d.items(), columns=['key', 'value'])
If you just want to rearrange your dictionary, you could do this:
d2 = {'key': list(d.keys()), 'value': list(d.values())}
Since you tagged pandas, try:
pd.Series(d).reset_index(name='value').to_dict('list')
Output:
{'index': ['3a0fe308-b78d-4080-a68b-84fdcbf5411e',
'7c975c26-f9fc-4579-822d-a1042b82cb17',
'9ff20206-a841-4dbf-a736-a35fcec604f3'],
'value': ['SUCCEEDED-HALL-IC_GBI',
'SUCCEEDED-AEN-IC_GBI',
'SUCCEEDED-ESP2-IC_GBI']}
Pure python:
{'key':list(d.keys()), 'value': list(d.values())}
output:
{'key': ['3a0fe308-b78d-4080-a68b-84fdcbf5411e',
'7c975c26-f9fc-4579-822d-a1042b82cb17',
'9ff20206-a841-4dbf-a736-a35fcec604f3'],
'value': ['SUCCEEDED-HALL-IC_GBI',
'SUCCEEDED-AEN-IC_GBI',
'SUCCEEDED-ESP2-IC_GBI']}
You can create the DataFrame by zipping the keys and values with the zip function:
import pandas as pd
df = pd.DataFrame(list(zip(d.keys(),d.values())), columns=['key','value'])
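Another option is DataFrame.from_dict with orient='index', which puts the dictionary keys on the index; reset_index then turns them into a regular column. A sketch with shortened stand-in keys:

```python
import pandas as pd

# Shortened stand-in keys for illustration.
d = {'k1': 'SUCCEEDED-HALL-IC_GBI', 'k2': 'SUCCEEDED-AEN-IC_GBI'}

# orient='index' makes each key a row; reset_index exposes the keys
# as a column, which we rename to 'key'.
df = (pd.DataFrame.from_dict(d, orient='index', columns=['value'])
        .reset_index()
        .rename(columns={'index': 'key'}))
print(df.to_dict('list'))
# {'key': ['k1', 'k2'], 'value': ['SUCCEEDED-HALL-IC_GBI', 'SUCCEEDED-AEN-IC_GBI']}
```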

Traverse through a JSON file

Hi, I have a large JSON file. I'm reading the data from the file and storing it in a list. I need to extract some elements from it, so I wrote this code:
l = len(alldata_json)
for i in range(l):
    df_school_us.loc[i, 'schoolName'] = alldata_json[i].get('schoolName')
    data_address = alldata_json[i].get('addressLocations')
    df_school_us.loc[i, 'Latitude'] = data_address[0].get('Location').get('latitude')
    df_school_us.loc[i, 'Longitude'] = data_address[0].get('Location').get('longitude')
    print("i= ", i)
len(alldata_json) returns 87598, and alldata_json contains my JSON data. I feel that running a for loop over this many rows is not an optimized approach. Can you suggest how to do it without the for loop?
df = pd.DataFrame(alldata_json)
df2 = pd.concat([df.drop('addressLocations', axis=1),
                 df['addressLocations'].apply(pd.Series)], axis=1)
Extracting countryCode, latitude, and longitude
import pandas as pd

data = [{'locationType': 'ab',
         'address': {'countryCode': 'IN',
                     'city': 'Mumbai',
                     'zipCode': '5000',
                     'schoolNumber': '2252'},
         'Location': {'latitude': 19.0760,
                      'longitude': 72.8777},
         'names': [{'languageCode': 'IN', 'name': 'DPS'},
                   {'languageCode': 'IN', 'name': 'DPS'}]}]
df = pd.DataFrame(data)
df2 = pd.concat([df['address'].apply(pd.Series)['countryCode'],
                 df['Location'].apply(pd.Series)[['latitude', 'longitude']]],
                axis=1)
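An alternative for this kind of nested record is pd.json_normalize (pd.io.json.json_normalize in older pandas), which flattens nested dicts into dotted column names in one call. A sketch on a trimmed version of the sample record:

```python
import pandas as pd

data = [{'locationType': 'ab',
         'address': {'countryCode': 'IN', 'city': 'Mumbai'},
         'Location': {'latitude': 19.0760, 'longitude': 72.8777}}]

# Nested dicts become dotted column names such as 'address.countryCode'.
df = pd.json_normalize(data)
print(df.loc[0, 'address.countryCode'])  # IN
```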

seeking help regarding converting data from nested JSON to flat json in python

I am looking to convert nested JSON into flat JSON using Python.
The data comes from an API response; the number of keys/columns can be up to 100, and the overall row count can be around 100k:
[{"Name":"John", "Location":{"City":"Los Angeles","State":"CA"}},{"Name":"Sam", "Location":{"City":"Chicago","State":"IL"}}]
I did come across this
(Python flatten multilevel JSON)
but it flattens the JSON completely, so everything ends up under a single level, which is not what I want. I also thought of applying it to the data one array at a time in a loop, but that puts a lot of load on the system. The output I am looking for is:
[{"Name":"John", "City":"Los Angeles","State":"CA"},{"Name":"Sam", "City":"Chicago","State":"IL"}]
Use unpacking with dict.pop:
[{**d.pop("Location"), **d} for d in l]
Output:
[{'City': 'Los Angeles', 'Name': 'John', 'State': 'CA'},
{'City': 'Chicago', 'Name': 'Sam', 'State': 'IL'}]
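If the data is headed for pandas anyway, the same shape can be produced with pd.json_normalize plus a column rename, which avoids mutating the input dicts the way dict.pop does:

```python
import pandas as pd

l = [{"Name": "John", "Location": {"City": "Los Angeles", "State": "CA"}},
     {"Name": "Sam", "Location": {"City": "Chicago", "State": "IL"}}]

df = pd.json_normalize(l)
# json_normalize names the nested columns 'Location.City' and
# 'Location.State'; strip the prefix to match the desired output.
df.columns = [c.split('.')[-1] for c in df.columns]
flat = df.to_dict('records')
print(flat)
# [{'Name': 'John', 'City': 'Los Angeles', 'State': 'CA'},
#  {'Name': 'Sam', 'City': 'Chicago', 'State': 'IL'}]
```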

Convert list of Dictionaries to a Dataframe [duplicate]

This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 4 years ago.
I am facing a basic problem: converting a list of dictionaries obtained from parsing a column containing text in JSON format. Below is a brief snapshot of the data:
[{u'PAGE TYPE': u'used-serp.model.brand.city'},
{u'BODY TYPE': u'MPV Cars',
u'ENGINE CAPACITY': u'1461',
u'FUEL TYPE': u' Diesel',
u'MODEL NAME': u'Renault Lodgy',
u'OEM NAME': u'Renault',
u'PAGE TYPE': u'New-ModelPage.OverviewTab'},
{u'PAGE TYPE': u'used-serp.brand.city'},
{u'BODY TYPE': u'SUV Cars',
u'ENGINE CAPACITY': u'2477',
u'FUEL TYPE': u' Diesel',
u'MODEL NAME': u'Mitsubishi Pajero',
u'OEM NAME': u'Mitsubishi',
u'PAGE TYPE': u'New-ModelPage.OverviewTab'},
{u'BODY TYPE': u'Hatchback Cars',
u'ENGINE CAPACITY': u'1198',
u'FUEL TYPE': u' Petrol , Diesel',
u'MODEL NAME': u'Volkswagen Polo',
u'OEM NAME': u'Volkswagen',
u'PAGE TYPE': u'New-ModelPage.GalleryTab'},
Furthermore, the code I am using to parse it is below:
stdf_noncookie = []
stdf_noncookiejson = []
for index, row in df_noncookie.iterrows():
    try:
        loop_data = json.loads(row['attributes'])
        stdf_noncookie.append(loop_data)
    except ValueError:
        loop_nondata = row['attributes']
        stdf_noncookiejson.append(loop_nondata)
stdf_noncookie is the list of dictionaries I am trying to convert into a pandas DataFrame. 'attributes' is the column with text in JSON format. I have tried to get some learning from this link, but it was not able to solve my problem. Any suggestions/tips for converting a list of dictionaries to a pandas DataFrame would be helpful.
To convert your list of dicts to a pandas DataFrame, use the following:
df = pd.DataFrame.from_records(stdf_noncookie)
pandas.DataFrame.from_records
DataFrame.from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)
You can set the index, name the columns, etc., as you read the data in.
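For instance, the columns= argument can select and order fields as the records are read (the records below are made up from the question's snapshot):

```python
import pandas as pd

# Made-up records mirroring the question's parsed 'attributes' column.
records = [{'PAGE TYPE': 'used-serp.brand.city', 'OEM NAME': 'Renault'},
           {'PAGE TYPE': 'New-ModelPage.OverviewTab', 'OEM NAME': 'Mitsubishi'}]

# columns= selects and orders the columns while reading the records.
df = pd.DataFrame.from_records(records, columns=['OEM NAME', 'PAGE TYPE'])
print(df.columns.tolist())  # ['OEM NAME', 'PAGE TYPE']
```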
If you're working with JSON you can also use the read_json method:
stdf_noncookiejson = pd.read_json(data)
pandas.read_json
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False)
Reference this answer.
Assuming d is your list of dictionaries, simply use:
df = pd.DataFrame(d)
Simply, you can use the pandas DataFrame constructor.
import pandas as pd
print (pd.DataFrame(data))
I finally found a way to convert a list of dicts to a pandas DataFrame. Below is the code:
Method A
stdf_noncookie = df_noncookie['attributes'].apply(json.loads)
stdf_noncookie = stdf_noncookie.apply(pd.Series)
Method B
stdf_noncookie = df_noncookie['attributes'].apply(json.loads)
stdf_noncookie = pd.DataFrame(stdf_noncookie.tolist())
Method A is much quicker than Method B. I will create another post asking about the difference between the two methods. Also, on some datasets Method B does not work.
I was able to do it with a list comprehension. My problem was that I had left my dicts JSON-encoded, so they looked like strings:
d = r.zrangebyscore('live-ticks', '-inf', time.time())
dform = [json.loads(i) for i in d]
df = pd.DataFrame(dform)
