I have a JSON file where each object contains some subset of a superset of keys, e.g. the 'ideal' case where all the keys of the superset are present in an object:
{
    "firstName": "foo",
    "lastName": "bar",
    "age": 20,
    "email": "email#example.com"
}
However, some objects are like this:
{
    "firstName": "name",
    "age": 40,
    "email": "email#example.com"
}
What's the optimal way to write only the objects that contain every key of the superset to a CSV?
If it were simply a case of a key having a null value, I think I'd just use .dropna with pandas and it would omit that row from the CSV.
Should I impute the missing keys so that each object contains the superset, but with null values? If so, how?
As you suggested, reading into a pandas DataFrame should do the trick. pandas' pd.read_json() will leave a NaN for any value not contained in a given JSON record. So try:
import pandas as pd

a = pd.read_json(json_string, orient='records')
a.dropna(inplace=True)
a.to_csv(filename, index=False)
Suppose you have
json_string ='[{ "firstName": "foo", "lastName": "bar", "age": 20, "email":"email#example.com"}, {"firstName": "name", "age": 40,"email":"email#example.com"}]'
Then you can
import json
import pandas as pd

l = json.loads(json_string)
df = pd.DataFrame(l)
Which yields
   age              email firstName lastName
0   20  email#example.com       foo      bar
1   40  email#example.com      name      NaN
Then,
>>> df.to_dict('records')
[{'age': 20,
'email': 'email#example.com',
'firstName': 'foo',
'lastName': 'bar'},
{'age': 40,
'email': 'email#example.com',
'firstName': 'name',
'lastName': nan}]
or
>>> df.to_json()
'{"age":{"0":20,"1":40},"email":{"0":"email#example.com","1":"email#example.com"},"firstName":{"0":"foo","1":"name"},"lastName":{"0":"bar","1":null}}'
The good thing about having a data frame is that you can parse/manipulate the data however you want before making it a dictionary/json again.
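For instance, to tie this back to the original question, here is a minimal sketch (the output filename is just an example) that keeps only the rows containing every key of the superset and writes them to a CSV:

import json
import pandas as pd

df = pd.DataFrame(json.loads(json_string))  # json_string as defined above

# Rows missing any key of the superset contain NaN, so dropna(subset=...) drops them.
complete = df.dropna(subset=['firstName', 'lastName', 'age', 'email'])
complete.to_csv('complete_records.csv', index=False)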
Test each object for all the keys you want:
x = json.loads(json_string)  # assuming x is a single object; loop over the list if you have several
if 'firstName' in x and 'lastName' in x and 'age' in x and 'email' in x:
    print('we got all the values')
else:
    print('one or more values are missing')
Or, a prettier way to do it:
fields = ['firstName', 'lastName', 'age', 'email']
if all(f in x for f in fields):
    print('we got all the fields')
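If you'd rather skip pandas entirely, here is a minimal sketch of the same idea (the output filename is just an example): apply the all() check to every object in the list and write only the complete ones with csv.DictWriter:

import csv
import json

fields = ['firstName', 'lastName', 'age', 'email']
objects = json.loads(json_string)  # the list of objects, as in the example above

complete = [obj for obj in objects if all(f in obj for f in fields)]

with open('complete_records.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(complete)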
Related
I have a script that retrieves user data from a CSV (~2.5m rows) and record data from Salesforce via API (~2m records), and matches them based on a unique user_id.
For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship between users and records, so a user_id should only appear on one record.
To try to increase performance, both lists are sorted ascending by user_id, and I break the inner loop if record['user_id'] > user['user_id'], as that means there is no relevant record.
It works, but matching the two datasets is slow, taking ~1.5 hours. Is there a faster method of performing the matching to retrieve the relevant record_id?
Here is an example of the data, current function, and expected result:
users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, 'record_id': None},
    {"user_id": 22222, "name": "Customer B", "age": 18, 'record_id': None},
    {"user_id": 33333, "name": "Customer C", "age": 66, 'record_id': None}
]
records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"}
]
upload = []

for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            break
    if user['record_id']:
        upload.append(user)

print(upload)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
Create a dictionary that maps from a user's id to its corresponding dictionary. Then, you can add the relevant record_id fields using a for loop. Finally, you can remove the entries without an assigned record_id using a list comprehension.
This doesn't require any preprocessing (e.g. sorting) to obtain speedup; the efficiency gain comes from the fact that lookups in a large dictionary are faster than searching a large list:
user_id_mapping = {entry["user_id"]: entry for entry in users}

for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

result = [item for item in user_id_mapping.values() if item["record_id"] is not None]
print(result)
print(result)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
That being said, if you have to execute similar flavors of this operation repeatedly, I would recommend using some sort of database rather than performing this in Python.
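For example, a rough sketch of that idea with the standard library's sqlite3 (table and column names here are just placeholders); an indexed JOIN then replaces the Python-side matching:

import sqlite3

conn = sqlite3.connect(':memory:')  # or a file path for a persistent database
conn.execute('CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)')
conn.execute('CREATE TABLE records (user_id INTEGER PRIMARY KEY, record_id TEXT)')

conn.executemany('INSERT INTO users VALUES (?, ?, ?)',
                 [(u['user_id'], u['name'], u['age']) for u in users])
conn.executemany('INSERT INTO records VALUES (?, ?)',
                 [(r['user_id'], r['record_id']) for r in records])

# An inner join keeps only the users that have a matching record.
rows = conn.execute('''
    SELECT u.user_id, u.name, u.age, r.record_id
    FROM users u
    JOIN records r ON r.user_id = u.user_id
''').fetchall()
print(rows)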
You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:
import pandas as pd
users = pd.read_csv('csv file')
records = pd.DataFrame('result of salesforce query')
result = users.drop('record_id', axis=1).merge(records, on='user_id')
If you want to keep the users which have no matching value in records, change the merge to
merge(records, on='user_id', how='left')
To output the result as a list of dictionaries, use to_dict():
result.to_dict('records')
Note - it may be possible to execute your Salesforce query directly into a dataframe. See for example this Q&A
For scalability, you can use pandas dataframes, like so:
result = pd.merge(pd.DataFrame(users), pd.DataFrame(records), on='user_id').to_dict('records')
If you want to keep the entries which do not have a record_id, you can add how='left' to the arguments of the merge function.
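For instance, with the sample users and records from the question, a runnable sketch might look like this:

import pandas as pd

users_df = pd.DataFrame(users).drop('record_id', axis=1)  # drop the placeholder column
records_df = pd.DataFrame(records)

# The default inner merge keeps only users with a matching record;
# pass how='left' to keep every user instead.
result = users_df.merge(records_df, on='user_id').to_dict('records')
print(result)
# [{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
#  {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}]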
Your approach isn't unreasonable, but removing each record after it's used has a cost, and sorting your two lists ahead of time also has a cost. These costs may add up to more than you think.
One possible approach would be to NOT sort the lists, but instead build a dict of record_ids, e.g.:
rdict = {r['user_id']: r['record_id'] for r in records}

for user in users:
    user_id = user['user_id']
    record_id = rdict.get(user_id)
    if record_id:
        user['record_id'] = record_id
        upload.append(user)
This way you're paying the price once for building the hash, and everything else is very efficient.
Given a list of dictionaries such as:
list_ = [
    {'name': 'date',
     'value': '2021-01-01'},
    {'name': 'length',
     'value': '500'},
    {'name': 'server',
     'value': 'g.com'}
]
How can I access the value where the key name == length?
I want to avoid iteration if possible, and just be able to check if the key called 'length' exists, and if so, get its value.
With iteration, and using next, you could do:
list_ = [
{'name': 'date',
'value': '2021-01-01'
},
{'name': 'length',
'value': '500'
},
{'name': 'server',
'value': 'g.com'
}
]
res = next(dic["value"] for dic in list_ if dic.get("name", "") == "length")
print(res)
Output
500
As an alternative, if the "names" are unique you could build a dictionary to avoid further iterations on list_, as follows:
lookup = {d["name"] : d["value"] for d in list_}
res = lookup["length"]
print(res)
Output
500
Notice that if you need a second key such as "server", you won't need to iterate, just do:
lookup["server"] # g.com
It sure is hard to find an element in a list without iterating through it. That's the first solution I will show:
list(filter(lambda element: element['name'] == 'length', list_))[0]['value']
This will filter your list down to the elements whose name is 'length', take the first element of that filtered list, and then select its 'value'.
Now, if you had a better data structure, you wouldn't have to iterate. In order to create that better data structure, unfortunately, we will have to iterate the list. A list of dicts with "name" and "value" could really just be a single dict where "name" is the key and "value" is the value. To create that dict:
dict_ = {item['name']:item['value'] for item in list_}
then you can just select 'length'
dict_['length']
I am calling an API and saving the responses as dictionaries with response.to_dict() in a new column for later reference.
Sample dataframe:
dict1 = {'thing': 200,
'other thing': 18,
'available_data': {'premium': {'emails': 1}},
'query': {'names': [{'first': 'John','last': 'Smith'}]}}
dict2 = {'thing': 123,
'other thing': 13,
'available_data': {'premium': {'emails': 1}},
'query': {'names': [{'first': 'Foo','last': 'Bar'}]}}
dict_frame = pd.DataFrame({'customers':['John','Foo'],
'api_response':[dict1,dict2]})
print(dict_frame)
customers api_response
0 John {'thing': 200, 'other thing': 18, 'available_d...
1 Foo {'thing': 123, 'other thing': 13, 'available_d...
We can see that the data is still a dict type:
type(dict_frame.loc[1,'api_response'])
dict
However, if I save it to a file and reload it, the data comes back as a string.
# save to file
dict_frame.to_csv('mydicts.csv')
# reload dataframe
dict_frame = pd.read_csv('mydicts.csv')
# check type
type(dict_frame.loc[1,'api_response'])
#it's a string
str
With some googling, I see there is a standard-library module that can convert it back to a dict:
from ast import literal_eval
python_dict = literal_eval(first_dict)
It works, but I have a feeling there's a way to avoid this in the first place. Any advice?
I tried dtype={'api_response': dict} while reading in the CSV, but got TypeError: dtype '<class 'dict'>' not understood.
That is a limitation of the CSV format: everything is converted to text, and pandas must guess the data type when it reads the text back in. You can specify a converter:
from ast import literal_eval
dict_frame_csv = pd.read_csv('mydicts.csv', converters={'api_response': literal_eval})
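If you control the file format, one way to avoid the problem in the first place is to use a format that preserves Python objects, e.g. pickle (the filename here is just an example):

# Pickle keeps the dict objects intact, so no conversion is needed on reload.
dict_frame.to_pickle('mydicts.pkl')
reloaded = pd.read_pickle('mydicts.pkl')
type(reloaded.loc[1, 'api_response'])  # dict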
I have a Collection with heavily nested docs in MongoDB, I want to flatten and import to Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other JSON parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be to chunk bigger selections into smaller groups of documents and iterate over the parsing.
At this point I thought - and that is my main question here - maybe there is a smarter way to do the unnesting without the detour through json, directly in MongoDB or in pandas, or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name    q_color  q_length  q_printed
1        2            1  'FirstStage'      124   'Short'          1
My code so far (which runs into memory problems - see above):
def load_events(filter = 'sId', id = 6833, all = False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = pandas.DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

df = pandas.DataFrame(t)  # t is the record list, i.e. list(cursor)
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
# Pair each quality name with its value, defaulting to 1 when no value is present.
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
# Collect all quality names that occur anywhere in the data.
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.nan)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values are used to create a flat list of pairs (tuples). In the second step a dictionary is built from those tuples, taking its keys from the first and its values from the second position of each tuple. Then all existing property names are extracted once using a set. Each property gets a new column in a loop, and inside the loop the value of each pair is mapped to the respective column cells.
Currently I have a question about Python pandas: I want to filter a dataframe dynamically using a URL query string.
For example:
CSV:
url: http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded:
filtered_data = data[
    (data['Name'] == 'Sam') &
    (data['Age'] == 21) &
    (data['Gender'] == 'male')
]
I don't want to hard-code the filter keys like this, because the CSV file can change at any time with different column headers.
Any suggestions?
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
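For instance, with a small made-up DataFrame standing in for the CSV data:

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Name': ['Sam', 'Alex', 'Sam'],
    'Age': [21, 30, 21],
    'Gender': ['male', 'male', 'female'],
})

query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
print(data[np.all(filters, axis=0)])  # only the first row matches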
Use df.query. For example:
df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'Male'"
filtered_data = df.query(conditions)
You can build the conditions string dynamically using string formatting, like
conditions = " and ".join("{} == {}".format(col, val)
                          for col, val in zip(df.columns, values))
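One thing to watch out for when building the string dynamically: string values need quotes inside the query expression. A small sketch (assuming the filters arrive as a plain dict) using !r so strings come out quoted:

params = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}

conditions = " and ".join("{} == {!r}".format(col, val) for col, val in params.items())
# "Name == 'Sam' and Age == 21 and Gender == 'male'"
filtered_data = df.query(conditions)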
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
'Name': ['Sam'],
'Age': ['21'], # Note that Age is a string
'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
    data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which may have been loaded as an integer. In that case, you could load the CSV file as a string via pd.read_csv(filename, dtype=object), or convert to string before comparison:
for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, take the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male -- which leads to the structure:
args = {
'Name': ['Sam', 'Ben'], # There are 2 names
'Age': ['21'],
'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.
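Putting it together, a minimal sketch (assuming the standard library's urllib.parse for parsing the URL and data already loaded with pd.read_csv):

from urllib.parse import parse_qs, urlparse

url = 'http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male'
args = parse_qs(urlparse(url).query)
# {'Name': ['Sam', 'Ben'], 'Age': ['21'], 'Gender': ['male']}

for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]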