How to extract list of dictionaries from Pandas column - python

I have a dataframe that I extracted from an API. One of its columns holds the data I need, but that data is structured as a list of dictionaries. I could get the data I care about from that dictionary using this chunk of code:
for k, v in d.items():
    for i, j in v.items():
        if isinstance(j, list):
            for l in range(len(j)):
                for k in j[l]:
                    print(j[l])
This gives me a structure where each row dictionary holds a 'values' list, so I'd need to get each of those 'values' entries out of the list of dictionaries and then organize them in a dataframe, one row per item (for example, the first item in the list of dictionaries would become the first row). Once I get to the point of having that structure, how could I build such a dataframe?
Raw data:
data = {'rows': [{'values': ['Tesla Inc (TSLA)', '$1056.78', '$1199.78', '13.53%'], 'children': []}, {'values': ['Taiwan Semiconductor Manufacturing Company Limited (TSM)', '$120.31', '$128.80', '7.06%'], 'children': []}]}

You can use pandas. First cast your data to a pd.DataFrame, then use apply(pd.Series) to expand the lists inside the 'values' column into separate columns, and use the set_axis method to rename the columns:
import pandas as pd
data = {'rows': [{'values': ['Tesla Inc (TSLA)', '$1056.78', '$1199.78', '13.53%'], 'children': []}, {'values': ['Taiwan Semiconductor Manufacturing Company Limited (TSM)', '$120.31', '$128.80', '7.06%'], 'children': []}]}
out = pd.DataFrame(data['rows'])['values'].apply(pd.Series).set_axis(['name','price','price_n','pct'], axis=1)
Output:
                                                name     price   price_n     pct
0                                   Tesla Inc (TSLA)  $1056.78  $1199.78  13.53%
1  Taiwan Semiconductor Manufacturing Company Lim...   $120.31   $128.80   7.06%
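A slightly faster alternative, since apply(pd.Series) can be slow on larger frames, is to build the dataframe directly from the 'values' lists (a minimal sketch, assuming the same data dict as above):
import pandas as pd
# one row per entry in data['rows'], columns named up front
out = pd.DataFrame([row['values'] for row in data['rows']],
                   columns=['name', 'price', 'price_n', 'pct'])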

Related

Pandas : Create new columns from a list of dictionaries in another column

I have a list of arbitrary number of dictionaries in each cell of a pandas column.
df['Amenities'][0]
[{'Description': 'Basketball Court(s)'},
{'Description': 'Bike Rack / Bike Storage'},
{'Description': 'Bike Rental'},
{'Description': 'Business Center'},
{'Description': 'Clubhouse'},
{'Description': 'Community Garden'},
{'Description': 'Complex Wifi '},
{'Description': 'Courtesy Patrol/Officer'},
{'Description': 'Dog Park'},
{'Description': 'Health Club / Fitness Center'},
{'Description': 'Jacuzzi'},
{'Description': 'Pet Friendly'},
{'Description': 'Pet Park / Dog Run'},
{'Description': 'Pool'}]
I'd like to do the following.
1) Iterate over the list of dicts, unpack them, and create columns with value 1 (amenity exists).
2) On subsequent iterations, check whether the column label already exists; if so, set the cell value to 1, or create a new column if it doesn't exist.
3) Fill the remaining columns with 0.
Basically, I am trying to create features that hold values 0 and 1 from a list of dictionaries.
The code below creates new columns based on the dict values, but the part around checking whether the column exists, creating a new one if it doesn't, and assigning the 1s and 0s needs a bit of thinking.
for i, row in df.iterrows():
    dict_obj = row['Amenities']
    for key, val in dict_obj.items():
        if val in df.columns:
            df.loc[i, val] = 1
        else:
            .......
The expected outcome would be something like this: one indicator column per amenity, holding 1 where the amenity exists and 0 otherwise.
One way is to explode the column Amenities, then create a dataframe from it, use str.get_dummies on the 'Description' column, and sum on level=0 of the index, like:
# data example
df = pd.DataFrame({
    'Amenities': [
        [{'Description': 'Basketball Court(s)'},
         {'Description': 'Bike Rental'}],
        [{'Description': 'Basketball Court(s)'},
         {'Description': 'Clubhouse'},
         {'Description': 'Community Garden'}]
    ]})

# explode so each dict gets its own row; the original row index is preserved
s = df['Amenities'].explode()

# create a dataframe, use get_dummies and sum on level=0 of the index
# (on pandas >= 2.0 use .groupby(level=0).sum(); sum(level=0) was removed)
df_ = pd.DataFrame(s.tolist(), s.index)['Description'].str.get_dummies().sum(level=0)
print(df_)
   Basketball Court(s)  Bike Rental  Clubhouse  Community Garden
0                    1            1          0                 0
1                    1            0          1                 1
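An equivalent one-liner, if you'd rather skip get_dummies, is pd.crosstab on the exploded series; .str.get pulls the 'Description' key out of each dict (a sketch against the same example df):
s = df['Amenities'].explode()
# cross-tabulate original row index vs. amenity name; the counts serve as 0/1 indicators
out = pd.crosstab(s.index, s.str.get('Description'))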
Your code was a great start and very close!
As you said, you need to iterate through the dictionaries. The solution is to use .loc to create the new column on your dataframe (for the amenity currently being processed) if it doesn't yet exist or set its value if it does.
import pandas as pd

df = pd.DataFrame(
    {
        "Amenities": [
            [
                {"Description": "Basketball Court(s)"},
                {"Description": "Bike Rack / Bike Storage"},
                {"Description": "Bike Rental"},
            ],
            [
                {"Description": "Basketball Court(s)"},
                {"Description": "Courtesy Patrol/Officer"},
                {"Description": "Dog Park"},
            ],
        ]
    }
)

for i, row in df.iterrows():
    amenities_list = row["Amenities"]
    for amenity in amenities_list:
        for k, v in amenity.items():
            df.loc[i, v] = 1

df = df.drop(columns="Amenities")
df = df.fillna(0).astype({i: "int" for i in df.columns})
Short description:
i is the row index and v is the name of the amenity (a string). df.loc[] takes a row index and a column index, and creates a new column if the column index is not yet present.
After the for loop, we just drop the no-longer-needed "Amenities" column, replace all NA values with 0, and then convert all columns to integers (NaN only exists for floats, so by default the new columns are floats to begin with).
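To see the .loc enlargement behaviour in isolation, here is a minimal sketch (the column names are made up for illustration):
import pandas as pd

demo = pd.DataFrame(index=[0, 1])
demo.loc[0, "Pool"] = 1      # "Pool" does not exist yet, so .loc creates the column
demo.loc[1, "Dog Park"] = 1  # same for "Dog Park"; cells never set stay NaN
print(demo)
#    Pool  Dog Park
# 0   1.0       NaN
# 1   NaN       1.0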

Import nested MongoDB to Pandas

I have a collection of heavily nested docs in MongoDB that I want to flatten and import into Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) goes through json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other json parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to stick with the transformation via json, would be to chunk bigger selections into smaller groups of documents and iterate the parsing.
At this point I wondered, and that is my main question here: maybe there is a smarter way to do the unnesting without taking the detour through json, directly in MongoDB or in Pandas or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped the columns '_id' and 'sId' for readability):
   eventId  stage.value    stage.name  q_color  q_length  q_printed
1        2            1  'FirstStage'      124   'Short'          1
My code so far (which runs into memory problems - see above):
import json
import pandas as pd
from bson import json_util
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize

# 'events' is the pymongo collection, defined elsewhere

def load_events(filter='sId', id=6833, all=False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

# t is the record list pulled from the cursor, i.e. t = list(cursor)
df = pandas.DataFrame(t)
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
# step 1: flat list of (name, value) pairs per row; a missing 'value' key defaults to 1
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
# step 2: dict per row, keyed by the quality name
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
# step 3: collect every property name that occurs anywhere, one column per name
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.NaN)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values are used to create a flat list of pairs (tuples). In the second step, a dictionary built from those tuples takes its keys from the first and its values from the second position of each tuple. Then all existing property names are extracted once using a set, and each property gets its own column in a loop. Inside the loop, the value of each pair is mapped to the respective column cell.
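Traced on the example doc above, the intermediate q_ cell goes through roughly these shapes (a sketch; pairs and the dict are per-row cell values):
# one cell of df['quality'], taken from the example doc
quality = [
    {'type': {'value': 2, 'Name': 'Color'}, 'value': '124'},
    {'type': {'value': 7, 'Name': 'Length'}, 'value': 'Short'},
    {'type': {'value': 15, 'Name': 'Printed'}},
]
# step 1: flat list of (name, value) pairs; missing values default to 1
pairs = [(m['type']['Name'], m.get('value', 1)) for m in quality]
# [('Color', '124'), ('Length', 'Short'), ('Printed', 1)]
# step 2: dict keyed by quality name
q = dict(pairs)
# {'Color': '124', 'Length': 'Short', 'Printed': 1}
# step 3: each key then becomes a lowercased q_<name> column, e.g. q_color = '124'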

Updating a dictionary with values and predefined keys

I want to create a dictionary that has predefined keys, like this:
dict = {'state':'', 'county': ''}
and read through and get values from a spreadsheet, like this:
for row in range(rowNum):
    for col in range(colNum):
and update the values for the keys 'state' (sheet.cell_value(row, 1)) and 'county' (sheet.cell_value(row, 3)) like this:
dict[{}]
I am confused on how to get the state value with the state key and the county value with the county key. Any suggestions?
Desired outcome would look like this:
>>>print dict
[
{'state':'NC', 'county': 'Nash County'},
{'state':'VA', 'county': 'Albemarle County'},
{'state':'GA', 'county': 'Cook County'},....
]
I made a few assumptions regarding your question. You mentioned in the comments that state is at index 1 and county is at index 3; what is at index 2? I assumed that they occur sequentially. In addition, there needs to be a way to map the headings to the data columns, hence I used a list, as it maintains order.
# A list containing the headings that you are interested in,
# in the order in which you expect them in your spreadsheet
list_of_headings = ['state', 'county']

# Simulating your spreadsheet
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]

list_of_dictionaries = []
for i in range(len(spreadsheet)):
    dictionary = {}
    for j in range(len(spreadsheet[i])):
        dictionary[list_of_headings[j]] = spreadsheet[i][j]
    list_of_dictionaries.append(dictionary)
print(list_of_dictionaries)
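The same pairing can be written more compactly with zip, which walks the headings and each row's values in lockstep (a sketch over the same simulated spreadsheet):
list_of_dictionaries = [dict(zip(list_of_headings, row)) for row in spreadsheet]
# [{'state': 'NC', 'county': 'Nash County'}, {'state': 'VA', 'county': 'Albemarle County'}, ...]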
raqib's answer is partially correct but had to be modified for use with an actual spreadsheet with rows and columns and the xlrd module. What I did was first use xlrd methods to grab the cell values that I wanted and put them into a list (similar to the spreadsheet variable raqib has shown above). Note that the parameters sI and cI are the column index values I picked out in a previous step: sI = state index and cI = county index.
rows = []
for row in range(rowNum):
    # only the state and county columns are needed, so one append per row
    rows.append([str(sheet.cell_value(row, sI)), str(sheet.cell_value(row, cI))])
Now that I have a list of the states and counties, I can apply raqib's solution:
list_of_headings = ['state', 'county']
fipsDic = []
print(len(rows))
for i in range(len(rows)):
    temp = {}
    for j in range(len(rows[i])):
        temp[list_of_headings[j]] = rows[i][j]
    fipsDic.append(temp)
The result is a nice dictionary list that looks like this:
[{'county': 'Minnehaha County', 'state': 'SD'}, {'county': 'Minnehaha County', 'state': 'SD', ...}]

JSON to Pandas Dataframe not knowing if JSON will have all the columns of the dataframe

I am doing a research project and trying to pull thousands of quarterly results for companies from the SEC EDGAR API.
Each result is a list of dictionaries structured as follows:
[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}...]
I want each result to be a row of a pandas dataframe. The issue is that each result may not have the same fields, depending on the data available. I would like to check whether each column (field) of the dataframe is present among a result's fields and, if it is, add the result's value to the row; if not, I would like to add np.NaN. How would I go about doing this?
A list/dict comprehension ought to work here:
In [11]: s
Out[11]:
[[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
  {'field': 'otherliabilities', 'value': 13700000000.0},
  {'field': 'propertyplantequipmentnet', 'value': 15789000000.0}],
 [{'field': 'othercurrentliabilities', 'value': 6886000000.0}]]
In [12]: pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
Out[12]:
   othercurrentliabilities  otherliabilities  propertyplantequipmentnet
0             6.886000e+09      1.370000e+10               1.578900e+10
1             6.886000e+09               NaN                        NaN
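If you also want the frame to carry a fixed set of expected columns no matter which fields the API returned, reindex after construction; absent columns are created and filled with NaN (a sketch; expected_fields is a hypothetical list of the column names you care about):
expected_fields = ['othercurrentliabilities', 'otherliabilities',
                   'propertyplantequipmentnet', 'totalassets']  # hypothetical column set
df = pd.DataFrame([{d['field']: d['value'] for d in row} for row in s])
df = df.reindex(columns=expected_fields)  # missing fields become NaN columns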
Make a list of df.result.rows[x]['values'] like below:
s = []
for x in range(df.result.totalrows[0]):
    s = s + [df.result.rows[x]['values']]
    print(x)
df1 = pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
df1
will give you the result.

Python Dataframe contains a list of dictionaries, need to create new dataframe with dictionary items

I have a Python dataframe that contains a list of dictionaries (for certain rows):
In[1]:
cards_df.head()
Out[1]:
card_id labels
0 'cid_1' []
1 'cid_2' []
3 'cid_3' [{'id': 'lid_a', 'name': 'lname_a'}, {'id': 'lid_b', 'name': 'lname_b'}]
4 'cid_4' [{'id': 'lid_c', 'name': 'lname_c'}]
I would like to create a new dataframe that expands the list of dictionary items into separate rows:
card_id label_id label_name
0 cid_3 lid_a lname_a
1 cid_3 lid_b lname_b
2 cid_4 lid_c lname_c
Use pd.Series.str.len to produce the repeat counts to pass to np.repeat. These are used to repeat the values of df.card_id.values, which becomes the first column of our new dataframe.
Then use pd.Series.sum on df['labels'] to concatenate all the lists into a single list. That new list is perfect for passing to the pd.DataFrame constructor. All that's left is to prepend a prefix to each column name and join with the column we created above.
pd.DataFrame(dict(
    card_id=df.card_id.values.repeat(df['labels'].str.len()),
)).join(pd.DataFrame(df['labels'].sum()).add_prefix('label_'))

  card_id label_id label_name
0   cid_3    lid_a    lname_a
1   cid_3    lid_b    lname_b
2   cid_4    lid_c    lname_c
Setup
df = pd.DataFrame(dict(
    card_id=['cid_1', 'cid_2', 'cid_3', 'cid_4'],
    labels=[
        [],
        [],
        [
            {'id': 'lid_a', 'name': 'lname_a'},
            {'id': 'lid_b', 'name': 'lname_b'}
        ],
        [{'id': 'lid_c', 'name': 'lname_c'}],
    ]
))
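On pandas 0.25+, explode gives a shorter route to the same result (a sketch against the Setup frame above; rows with empty label lists fall out via dropna):
s = df.set_index('card_id')['labels'].explode().dropna()
# one row per label dict, realigned with card_id, then prefixed and reset
out = pd.DataFrame(s.tolist(), index=s.index).add_prefix('label_').reset_index()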
You could also do this with a comprehension over the rows of your dataframe, building one record per label:
pd.DataFrame([{'card_id': row['card_id'],
               'label_id': label['id'],
               'label_name': label['name']}
              for i, row in df.iterrows()
              for label in row['labels']])
