validation check - dictionary value types [duplicate] - python

This question already has answers here:
How do I parse a string to a float or int?
(32 answers)
Closed 7 months ago.
After converting my CSV to a list of dictionaries with pandas, a sample of the data looks like this:
[{'Name': '1234', 'Age': 20},
{'Name': 'Alice', 'Age': 30.1},
{'Name': '5678', 'Age': 41.0},
{'Name': 'Bob 1', 'Age': 14},
{'Name': '!##$%', 'Age': 65}]
My goal is to run a validation check on whether the columns contain strings. I'm trying to use the pandera or schema libraries to achieve this, as the CSV may contain a million rows. Therefore, I am trying to convert the dicts as follows.
[{'Name': 1234, 'Age': 20},
{'Name': 'Alice', 'Age': 30.1},
{'Name': 5678, 'Age': 41.0},
{'Name': 'Bob 1', 'Age': 14},
{'Name': '!##$%', 'Age': 65}]
After converting the CSV data to dicts, I use the following code to check whether Name is a string.
import pandas as pd
from schema import Schema, And, Use, Optional, SchemaError

schema = Schema([{'Name': str,
                  'Age': float}])
validated = schema.validate(data)  # 'data' is the list of dicts above, not the dict builtin
Is it possible?

Is it possible?
For sure. You can use the int constructor to convert those strings to integers where possible.
for element in list_:
    try:
        element["Name"] = int(element["Name"])
    except ValueError:
        pass
A faster way of doing it would be to use the isdigit method of str:
for element in list_:
    if element["Name"].isdigit():  # otherwise no need to convert
        element["Name"] = int(element["Name"])
That way you don't have to enter the try/except block at all.
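Since the question also mentions pandera for million-row CSVs, here is a minimal sketch of validating the types on the DataFrame itself instead of on a list of dicts (assuming pandera is installed; people.csv is a placeholder path). Note that read_csv would otherwise parse a Name like '1234' as a number, which is where the mixed types come from:
import pandas as pd
import pandera as pa

# Placeholder path; force Name to stay a string so '1234' isn't parsed as a number
df = pd.read_csv('people.csv', dtype={'Name': str})

schema = pa.DataFrameSchema({
    'Name': pa.Column(str),                # fails if Name is not a string column
    'Age': pa.Column(float, coerce=True),  # coerces ints like 20 to 20.0
})

validated = schema.validate(df)  # raises pandera.errors.SchemaError on failure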

Related

Convert Nested JSON into Dataframe

I have a nested JSON like below. I want to convert it into a pandas DataFrame. As part of that, I also need to parse out only the weight value; I don't need the unit.
I also want the number values converted from string to numeric.
Any help would be appreciated. I'm relatively new to Python. Thank you.
JSON Example:
{'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'},
'gender': 'male'}
Sample output below:
id   name  weight  gender
123  joe   100     male
use " from pandas.io.json import json_normalize ".
id   name  weight.number  weight.unit  gender
123  joe   100            lbs          male
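To get from that flattened output to the sample output in the question, you can drop and rename the dotted columns (a sketch; the column names follow the json_normalize output above):
import pandas as pd
from pandas.io.json import json_normalize  # pandas.json_normalize in newer versions

df = json_normalize({'id': '123', 'name': 'joe',
                     'weight': {'number': '100', 'unit': 'lbs'},
                     'gender': 'male'})
df = df.drop(columns=['weight.unit']).rename(columns={'weight.number': 'weight'})
df['weight'] = pd.to_numeric(df['weight'])  # '100' -> 100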
If you want to discard the weight unit, just flatten the JSON first:
temp = {'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}
temp['weight'] = temp['weight']['number']
then turn it into a dataframe:
pd.DataFrame([temp])  # wrap temp in a list; a dict of all scalar values needs an index
Something like this should do the trick:
import pandas as pd

json_data = [{'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}]
# convert the data to a DataFrame
df = pd.DataFrame.from_records(json_data)
# convert id to an int
df['id'] = df['id'].apply(int)
# get the 'number' field of weight and convert it to an int
df['weight'] = df['weight'].apply(lambda x: int(x['number']))
df

Comparing values of two dictionaries' items

I need to compare the values of the items in two different dictionaries.
Let's say that dictionary RawData has items that represent phone numbers and their names.
RawData, for example, has items like: {'name': 'Customer Service', 'number': '123987546'} {'name': 'Switchboard', 'number': '48621364'}
Now I have dictionary FilteredData, which already contains some items from RawData: {'name': 'IT-support', 'number': '32136994'} {'name': 'Company Customer Service', 'number': '123987546'}
As you can see, Customer Service and Company Customer Service both have the same values, but different keys. In my project, there might be hundreds of similar duplicates, and we only want unique numbers to end up in FilteredData.
FilteredData is what we will be using later in the code, and RawData will be discarded.
Their names (keys) can be close duplicates, but not their numbers (values).
There are two ways to do this.
A. Remove the duplicate items in RawData, before appending them into FilteredData.
B. Append them into FilteredData, and go through the numbers(values) there, removing the duplicates. Can I use a set here to do that? It would work on a list, obviously.
I'm not looking for the most time-efficient solution. I'd like the simplest and easiest one to learn, for if and when someone takes over my job someday. In my project it's mandatory that the next person working on the code gets a quick grip of it.
I've already looked at sets, and tried to tackle the problem by nesting two for loops, but something tells me there has to be an easier way.
Of course I might have missed the obvious solution here.
Thanks in advance!
I hope I understand your problem here:
data = [{'name': 'Customer Service', 'number': '123987546'}, {'name': 'Switchboard', 'number': '48621364'}]
newdata = [{'name': 'IT-support', 'number': '32136994'}, {'name': 'Company Customer Service', 'number': '123987546'}]

def main():
    numbers = set()
    for entry in data:
        numbers.add(entry['number'])
    for entry in newdata:
        if entry['number'] not in numbers:
            data.append(entry)
    print(data)

main()
Output:
[{'name': 'Customer Service', 'number': '123987546'},
{'name': 'Switchboard', 'number': '48621364'},
{'name': 'IT-support', 'number': '32136994'}]
What you can do is take dict.values(), build a set from them to remove duplicates, then go through the old dictionary, find the first key with each value, and add it to a new one. Keep the set around: when you get the next dict entry, try adding its number to the set and check whether the set's length grew. If it did, the number is unique and you can add the entry to the dict.
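A sketch of that length-check idea, reusing the example data from the question:
RawData = [{'name': 'Customer Service', 'number': '123987546'},
           {'name': 'Switchboard', 'number': '48621364'}]
FilteredData = [{'name': 'IT-support', 'number': '32136994'},
                {'name': 'Company Customer Service', 'number': '123987546'}]

seen = set()
result = []
for entry in FilteredData + RawData:  # FilteredData entries win on duplicate numbers
    before = len(seen)
    seen.add(entry['number'])
    if len(seen) > before:            # the set grew, so this number is new
        result.append(entry)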
If you're willing to change how FilteredData is currently structured, you can just use a dict and use the number as your key:
RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'}
]

# Change how FilteredData is structured
FilteredDataMap = {
    '32136994': {'name': 'IT-support', 'number': '32136994'},
    '123987546': {'name': 'Company Customer Service', 'number': '123987546'}
}

for item in RawData:
    number = item.get('number')
    if number not in FilteredDataMap:
        FilteredDataMap[number] = item

# If you need the list of items
FilteredData = list(FilteredDataMap.values())
You can just pull the actual list from the Map using .values()
I take it the numbers are unique. Then another solution would be to take advantage of the uniqueness of dictionary keys. This means converting each list of dictionaries into a dictionary of number: name pairs. Then you simply need to update RawData with FilteredData.
RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'}
]
FilteredData = [
    {'name': 'IT-support', 'number': '32136994'},
    {'name': 'Company Customer Service', 'number': '123987546'}
]

def convert_list(input_list):
    return {item['number']: item['name'] for item in input_list}

def unconvert_dict(input_dict):
    return [{'name': val, 'number': key} for key, val in input_dict.items()]

NewRawData = convert_list(RawData)
NewFilteredData = convert_list(FilteredData)
NewRawData.update(NewFilteredData)  # update() mutates in place and returns None
DesiredResultConverted = NewRawData
DesiredResult = unconvert_dict(DesiredResultConverted)
In this example, the variables will have the following values:
NewRawData = {'123987546':'Customer Service', '48621364': 'Switchboard'}
NewFilteredData = {'32136994': 'IT-support', '123987546': 'Company Customer Service'}
When you update NewRawData with NewFilteredData, Company Customer Service will overwrite Customer Service as the value associated with the key 123987546. So,
DesiredResultConverted = {'123987546':'Company Customer Service', '48621364': 'Switchboard', '32136994': 'IT-support'}
Then, if you still prefer the original format, you can "unconvert" back.

json_normalize None handling change (pandas 0.23)

I have JSONs containing nested values that are sometimes None, and the behavior has changed between pandas 0.22.0 and pandas 0.23.0.
In 0.22.0:
from pandas.io.json import json_normalize
my_json = {'event': {'name': 'Bob', 'id': '12345','id2': None},
'id': '12345', 'labels': []}
json_normalize(my_json)
gives:
event.id  event.id2  event.name  id     labels
12345     None       Bob         12345  []
which I want.
In 0.23.0:
from pandas.io.json import json_normalize
my_json = {'event': {'name': 'Bob', 'id': '12345','id2': None},
'id': '12345', 'labels': []}
json_normalize(my_json)
raises KeyError: 'id2'.
Toggling errors='ignore' does nothing, and it's not really feasible to change the nested Nones to placeholder values. Does anyone know how to achieve the prior behavior with the update?

Use list of indices to manipulate a nested dictionary

I'm trying to perform operations on a nested dictionary (data retrieved from a yaml file):
data = {'services': {'web': {'name': 'x'}}, 'networks': {'prod': 'value'}}
I'm trying to modify the above using the inputs like:
{'services.web.name': 'new'}
I converted the above into a list of keys, ['services', 'web', 'name'], but I'm not able to (or not sure how to) perform the below operation in a loop:
data['services']['web']['name'] = new
That way I can modify the data dict. There are other values I plan to change in the above dictionary (it is an extensive one), so I need a solution that also works in cases where I have to change, e.g.:
data['services2']['web2']['networks']['local']
Is there an easy way to do this? Any help is appreciated.
You may iterate over the keys while moving a reference:
data = {'networks': {'prod': 'value'}, 'services': {'web': {'name': 'x'}}}
modification = {'services.web.name': 'new'}

for key, value in modification.items():
    keyparts = key.split('.')
    to_modify = data
    for keypart in keyparts[:-1]:
        to_modify = to_modify[keypart]
    to_modify[keyparts[-1]] = value

print(data)
Giving:
{'networks': {'prod': 'value'}, 'services': {'web': {'name': 'new'}}}
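A more compact variant of the same walk uses functools.reduce to follow the key path (set_by_path is a made-up helper name; the dotted-key input format is the same):
from functools import reduce

def set_by_path(data, dotted_key, value):
    # walk to the parent dict, then assign the final key
    *parents, last = dotted_key.split('.')
    reduce(dict.__getitem__, parents, data)[last] = value

data = {'networks': {'prod': 'value'}, 'services': {'web': {'name': 'x'}}}
set_by_path(data, 'services.web.name', 'new')
print(data)  # {'networks': {'prod': 'value'}, 'services': {'web': {'name': 'new'}}}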

dict.update overwrites existing keys, how to avoid?

When using the update function to merge two dictionaries in Python, entries that share a key are apparently overwritten.
A simple example:
simple_dict_one = {'name': "tom", 'age': 20}
simple_dict_two = {'name': "lisa", 'age': 17}
simple_dict_one.update(simple_dict_two)
After the dicts are merged, the following dict remains:
{'age': 17, 'name': 'lisa'}
So if you have the same key in both dicts, only one remains (the last one, apparently).
If I have a lot of names from several sources, I would probably want a temp dict from each of those and then add it to a bigger combined dict.
Is there a way to merge two dicts and still keep all the keys? I guess you are only supposed to have one unique key, but then how would I merge two dicts without losing data?
Well, I have several sources I gather information from, for example an LDAP database and other sources, where I have Python functions that each create a temp dict, but I want a complete dict at the end that sort of concatenates or displays all the information gathered from all the sources... so I would have one dict holding all the info.
What you are trying to do with the 'merging' is not quite ideal. As you said yourself:
I guess you are only supposed to have one unique key
That makes it relatively, and unnecessarily, hard to gather all your information in one dict.
What you could do, instead of calling .update() on the existing dict, is add a sub-dict. Its key could be the name of the source from which you gathered the information; its value could be the dict you receive from that source, and if you need to store more than one dict from the same source, you can store them in a list.
Example
>>> data = {}
>>> person_1 = {'name': 'lisa', 'age': 17}
>>> person_2 = {'name': 'tom', 'age': 20}
>>> data['people'] = [person_1, person_2]
>>> data
{'people': [{'age': 17, 'name': 'lisa'}, {'age': 20, 'name': 'tom'}]}
Then whenever you need to add newly gathered information, you just add a new entry to the data dict
>>> ldap_data = {'foo': 1, 'bar': 'baz'} # just some dummy data
>>> data['ldap_data'] = ldap_data
>>> data
{'people': [{'age': 17, 'name': 'lisa'}, {'age': 20, 'name': 'tom'}],
'ldap_data': {'foo': 1, 'bar': 'baz'}}
The source-specific data is easily extractable from the data dict
>>> data['people']
[{'age': 17, 'name': 'lisa'}, {'age': 20, 'name': 'tom'}]
>>> data['ldap_data']
{'foo': 1, 'bar': 'baz'}
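If the same source can deliver data more than once, a collections.defaultdict keeps the append logic short (a sketch; the source names here are made up):
>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> data['people'].append({'name': 'lisa', 'age': 17})
>>> data['people'].append({'name': 'tom', 'age': 20})
>>> data['ldap_data'].append({'foo': 1, 'bar': 'baz'})
>>> dict(data)
{'people': [{'name': 'lisa', 'age': 17}, {'name': 'tom', 'age': 20}],
 'ldap_data': [{'foo': 1, 'bar': 'baz'}]}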
