Pandas list of JSON element to python array

I have a dataframe like so:
Store     matches
Murphy's  [{'domain': 'murphyscolumbus.com', 'location': 'Columbus, OH'}, {'domain': 'murphystampa.com', 'location': 'Tampa, FL'}]
Bill's    [{'domain': 'billsdallas.com', 'location': 'Dallas, TX'}, {'domain': 'billsorlando.com', 'location': 'Orlando, FL'}]
What I want is a dataframe like so:
Store     domains
Murphy's  ['murphyscolumbus.com', 'murphystampa.com']
Bill's    ['billsdallas.com', 'billsorlando.com']
I'm hoping for something less computationally expensive than a for loop that steps through every row, as the dataframe is quite large.

Try:
from ast import literal_eval
# apply literal_eval if necessary
df["matches"] = df["matches"].apply(literal_eval)
df["domains"] = df.pop("matches").apply(lambda x: [d["domain"] for d in x])
print(df)
Prints:
      Store                                  domains
0  Murphy's  [murphyscolumbus.com, murphystampa.com]
1    Bill's      [billsdallas.com, billsorlando.com]

Related

Comparing values of two dictionary's items

I need to compare the values of the items in two different dictionaries.
Let's say that dictionary RawData has items that represent phone numbers and number names.
RawData, for example, has items like: {'name': 'Customer Service', 'number': '123987546'} {'name': 'Switchboard', 'number': '48621364'}
Now, I've got dictionary FilteredData, which already contains some items from RawData: {'name': 'IT-support', 'number': '32136994'} {'name': 'Company Customer Service', 'number': '123987546'}
As you can see, Customer Service and Company Customer Service both have the same values, but different keys. In my project, there might be hundreds of similar duplicates, and we only want unique numbers to end up in FilteredData.
FilteredData is what we will be using later in the code, and RawData will be discarded.
Their names (keys) can be close duplicates, but not their numbers (values).
There are two ways to do this.
A. Remove the duplicate items in RawData, before appending them into FilteredData.
B. Append them into FilteredData, and go through the numbers(values) there, removing the duplicates. Can I use a set here to do that? It would work on a list, obviously.
I'm not looking for the most time-efficient solution. I'd like the most simple and easy to learn one, if and when someone takes over my job someday. In my project it's mandatory for the next guy working on the code to get a quick grip of it.
I've already looked at sets, and tried to tackle the problem by nesting two for loops, but something tells me there's gotta be an easier way.
Of course I might have missed the obvious solution here.
Thanks in advance!

I hope I understand your problem here:
data = [{'name': 'Customer Service', 'number': '123987546'}, {'name': 'Switchboard', 'number': '48621364'}]
newdata = [{'name': 'IT-support', 'number': '32136994'}, {'name': 'Company Customer Service', 'number': '123987546'}]
def main():
    # collect the numbers that are already present
    numbers = set()
    for entry in data:
        numbers.add(entry['number'])
    # append only the entries whose number has not been seen yet
    for entry in newdata:
        if entry['number'] not in numbers:
            data.append(entry)
    print(data)
main()
Output:
[{'name': 'Customer Service', 'number': '123987546'},
{'name': 'Switchboard', 'number': '48621364'},
{'name': 'IT-support', 'number': '32136994'}]

What you can do is take dict.values(), build a set from them to remove duplicates, then go through the old dictionary, find the first key with each value, and add it to a new one. Keep the set around: when you get the next dict entry, try adding its value to the set and check whether the set is longer than it was before. If it is, the element is unique and you can add it to the dict.
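The set-length check described above could be sketched roughly as follows. This is only an illustration adapted to the list-of-dicts layout from the question: the helper name merge_unique is made up here, and the sample data is copied from the question.

```python
def merge_unique(filtered, raw):
    """Return filtered plus every raw entry whose number is not present yet."""
    seen = {entry['number'] for entry in filtered}
    result = list(filtered)
    for entry in raw:
        before = len(seen)
        seen.add(entry['number'])
        if len(seen) > before:  # the set grew, so this number is new
            result.append(entry)
    return result

RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'},
]
FilteredData = [
    {'name': 'IT-support', 'number': '32136994'},
    {'name': 'Company Customer Service', 'number': '123987546'},
]
merged = merge_unique(FilteredData, RawData)
```

A plain membership test (entry['number'] not in seen) would do the same job and is probably easier for the next maintainer to read; the length comparison is shown only because it is the approach this answer describes.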

If you're willing to change how FilteredData is currently structured, you can just use a dict with the number as the key:
RawData = [
{'name': 'Customer Service', 'number': '123987546'},
{'name': 'Switchboard', 'number': '48621364'}
]
# Change how FilteredData is structured
FilteredDataMap = {
'32136994':
{'name': 'IT-support', 'number': '32136994'},
'123987546':
{'name': 'Company Customer Service', 'number': '123987546'}
}
for item in RawData:
    number = item.get('number')
    if number not in FilteredDataMap:
        FilteredDataMap[number] = item
# If you need the list of items
FilteredData = list(FilteredDataMap.values())
You can just pull the actual list from the Map using .values()

I take it the numbers are unique. Then another solution would be to take advantage of the uniqueness of dictionary keys. This means converting each list of dictionaries to a dictionary of 'number: name' pairs. Then you simply need to update RawData with FilteredData.
RawData = [
{'name': 'Customer Service', 'number': '123987546'},
{'name': 'Switchboard', 'number': '48621364'}
]
FilteredData = [
{'name': 'IT-support', 'number': '32136994'},
{'name': 'Company Customer Service', 'number': '123987546'}
]
def convert_list(input_list):
    return {item['number']: item['name'] for item in input_list}
def unconvert_dict(input_dict):
    return [{'name': val, 'number': key} for key, val in input_dict.items()]
NewRawData = convert_list(RawData)
NewFilteredData = convert_list(FilteredData)
# dict.update() works in place and returns None, so update a copy
DesiredResultConverted = dict(NewRawData)
DesiredResultConverted.update(NewFilteredData)
DesiredResult = unconvert_dict(DesiredResultConverted)
In this example, the variables will have the following values:
NewRawData = {'123987546':'Customer Service', '48621364': 'Switchboard'}
NewFilteredData = {'32136994': 'IT-support', '123987546': 'Company Customer Service'}
When you update NewRawData with NewFilteredData, Company Customer Service will overwrite Customer Service as the value associated with the key 123987546. So,
DesiredResultConverted = {'123987546':'Company Customer Service', '48621364': 'Switchboard', '32136994': 'IT-support'}
Then, if you still prefer the original format, you can "unconvert" back.

Filtering through a list with embedded dictionaries

I've got a JSON-format list of dictionaries, each containing a nested dictionary. It looks like the following:
[{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
The number of entries within the list can be up to 100. I plan to present the 'name' for each entry, one result at a time, for those that have London as a town. The rest are of no use to me. I'm a beginner at Python, so I would appreciate a suggestion on how to go about this efficiently. I initially thought it would be best to remove all entries that don't have London and then go through the rest one by one.
I also wondered if it might be quicker to not filter but to cycle through the entire json and select the names of entries that have the town as London.

You can use filter:
data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
london_dicts = filter(lambda d: d['venue']['town'] == 'London', data)
for d in london_dicts:
    print(d)
This is efficient because:
The loop itself is written in C (in the case of CPython), although the lambda is still called once per item
filter returns an iterator (in Python 3), which means that the results are produced one by one as required
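Since the question asks for the names one result at a time, the same lazy behaviour could be sketched with a generator. The function name names_in_town is invented for this illustration; the data is the sample from the question.

```python
data = [
    {"id": 13, "name": "Albert", "venue": {"id": 123, "town": "Birmingham"}, "month": "February"},
    {"id": 17, "name": "Alfred", "venue": {"id": 456, "town": "London"}, "month": "February"},
    {"id": 20, "name": "David", "venue": {"id": 14, "town": "Southampton"}, "month": "June"},
    {"id": 17, "name": "Mary", "venue": {"id": 56, "town": "London"}, "month": "December"},
]

def names_in_town(entries, town):
    """Yield the name of each entry whose venue is in the given town."""
    for entry in entries:
        if entry['venue']['town'] == town:
            yield entry['name']

london_names = names_in_town(data, 'London')
first = next(london_names)      # fetch a single result on demand
remaining = list(london_names)  # or drain whatever is left
```

With at most 100 entries either approach should be fast enough, so readability is likely the better deciding factor.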

One way is to use list comprehension:
>>> data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
>>> [d for d in data if d['venue']['town'] == 'London']
[{'id': 17,
'name': 'Alfred',
'venue': {'id': 456, 'town': 'London'},
'month': 'February'},
{'id': 17,
'name': 'Mary',
'venue': {'id': 56, 'town': 'London'},
'month': 'December'}]

Separate pd DataFrame Rows that are dictionaries into columns

I am extracting some data from an API and having challenges transforming it into a proper dataframe.
The resulting DataFrame df is arranged as such:
Index Column
0 {'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
1 {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}
I am trying to split the emails into one column and the list into a separate column:
Index Column1 Column2
0 email#email.com [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]
Ideally, each 'action'/'date' would have its own separate row, however I believe I can do the further unpacking myself.
After looking around I tried/failed lots of solutions such as:
df.apply(pd.Series) # does nothing
pd.DataFrame(df['column'].values.tolist()) # makes each dictionary key a separate column,
# where most of the rows are NaN except the one that has the paired value
Edit:
Since several people asked about the initial format of the data from the API, it's a list of dictionaries:
[{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},{'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
Thanks

One naive way of doing this is as below:
import pandas as pd
inp = [{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
, {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
# DataFrame.set_value was removed in pandas 1.0, so collect the rows first
rows = []
for each in inp:  # iterate through the list of dicts
    for k, v in each.items():  # take each key/value pair
        for eachv in v:  # the value is a list, so iterate through it
            rows.append({'Column1': k, 'Column2': str(eachv)})
df = pd.DataFrame(rows)
I am sure there might be a better way of writing this. Hope this helps :)

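Assuming you have already read it as a dataframe, you can use the following:
import ast
# literal_eval is only needed if the cells are strings rather than dicts
df['Column'] = df['Column'].apply(ast.literal_eval)
# dict views are not indexable in Python 3, so take the first item via an iterator
df['email'] = df['Column'].apply(lambda x: next(iter(x)))
df['value'] = df['Column'].apply(lambda x: next(iter(x.values())))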
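For the one-row-per-action layout mentioned in the question, one option is to flatten the raw API list before building the dataframe. The sketch below assumes the list-of-dicts input shown in the question's edit; the column name 'email' is chosen here purely for illustration.

```python
inp = [
    {'email#email.com': [{'action': 'data', 'date': 'date'},
                         {'action': 'data', 'date': 'date'}]},
    {'different-email#email.com': [{'action': 'data', 'date': 'date'}]},
]

# One flat dict per (email, event) pair
rows = [
    {'email': email, **event}
    for record in inp
    for email, events in record.items()
    for event in events
]
```

Passing rows to pd.DataFrame(rows) should then give one row per event with email, action and date columns.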

JSON to Pandas Dataframe not knowing if JSON will have all the columns of the dataframe

I am doing a research project and trying to pull thousands of quarterly results for companies from the SEC EDGAR API.
Each result is a list of dictionaries structured as follows:
[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}...]
I want each result to be a row of a pandas dataframe. The issue is that each result may not have the same fields due to the data available. I would like to check if the column(field) of the dataframe is present in one of the results field and if it is add the result value to the row. If not, I would like to add an np.NaN. How would I go about doing this?

A list/dict comprehension ought to work here:
In [11]: s
Out[11]:
[[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}],
[{'field': 'othercurrentliabilities', 'value': 6886000000.0}]]
In [12]: pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
Out[12]:
othercurrentliabilities otherliabilities propertyplantequipmentnet
0 6.886000e+09 1.370000e+10 1.578900e+10
1 6.886000e+09 NaN NaN
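If the set of columns is known in advance, the check described in the question (use the value when the field is present, NaN otherwise) could be sketched in plain Python as below. The columns list and the helper name to_row are assumptions made for this example.

```python
import math

# Assumed list of expected columns; in practice this would come from the dataframe
columns = ['othercurrentliabilities', 'otherliabilities', 'propertyplantequipmentnet']

def to_row(result, columns):
    """Map one API result to a dict holding every expected column."""
    fields = {d['field']: d['value'] for d in result}
    return {col: fields.get(col, math.nan) for col in columns}

results = [
    [{'field': 'othercurrentliabilities', 'value': 6886000000.0},
     {'field': 'otherliabilities', 'value': 13700000000.0},
     {'field': 'propertyplantequipmentnet', 'value': 15789000000.0}],
    [{'field': 'othercurrentliabilities', 'value': 6886000000.0}],
]
rows = [to_row(result, columns) for result in results]
```

Unlike the comprehension above, this also drops any field that is not in columns, which may or may not be what you want.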

Make a list of df.result.rows[x]['values'], as below:
s = []
for x in range(df.result.totalrows[0]):
    s.append(df.result.rows[x]['values'])
df1 = pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
df1
This will give you the result.

Use list of indices to manipulate a nested dictionary

I'm trying to perform operations on a nested dictionary (data retrieved from a yaml file):
data = {'services': {'web': {'name': 'x'}}, 'networks': {'prod': 'value'}}
I'm trying to modify the above using the inputs like:
{'services.web.name': 'new'}
I converted the above to a list of indices ['services', 'web', 'name']. But I'm not able to/not sure how to perform the below operation in a loop:
data['services']['web']['name'] = 'new'
That way I can modify the data dict. There are other values I plan to change in the above dictionary (it is an extensive one), so I need a solution that also works in cases where I have to change, e.g.:
data['services2']['web2']['networks']['local'].
Is there an easy way to do this? Any help is appreciated.

You may iterate over the keys while moving a reference:
data = {'networks': {'prod': 'value'}, 'services': {'web': {'name': 'x'}}}
modification = {'services.web.name': 'new'}
for key, value in modification.items():
    keyparts = key.split('.')
    to_modify = data
    for keypart in keyparts[:-1]:
        to_modify = to_modify[keypart]
    to_modify[keyparts[-1]] = value
print(data)
Giving:
{'networks': {'prod': 'value'}, 'services': {'web': {'name': 'new'}}}
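The same reference-walking idea can be wrapped in a small helper so it can be reused for any dotted path. The name set_by_path is made up for this sketch, and it assumes every intermediate key already exists.

```python
def set_by_path(tree, dotted_key, value):
    """Set tree[k1][k2]...[kn] = value for a key like 'k1.k2.kn'."""
    *parents, last = dotted_key.split('.')
    node = tree
    for part in parents:
        node = node[part]  # raises KeyError if an intermediate key is missing
    node[last] = value

data = {'services': {'web': {'name': 'x'}}, 'networks': {'prod': 'value'}}
set_by_path(data, 'services.web.name', 'new')
set_by_path(data, 'networks.prod', 'other')
```

If missing intermediate levels should be created rather than raising, node.setdefault(part, {}) could replace the plain lookup.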
