RDD to DF conversion

RDD to DF conversion - python

I am new to Pyspark. My code looks something like below. I am not sure why df.collect() is showing None values for all the string values.
>> rdd = sc.parallelize([{'name': 'test', 'age': {"id": 326, "first_name": "Will", "last_name": "Cur"}},
{'name': 'test2', 'age': {"id": 751, "first_name": "Will", "last_name": "Mc"}}])
>> rdd.collect()
[{'name': 'test', 'age': {'id': 326, 'first_name': 'Will', 'last_name': 'Cur'}}, {'name': 'test2', 'age': {'id': 751, 'first_name': 'Will', 'last_name': 'Mc'}}]
>> df = spark.createDataFrame(rdd)
>> df.collect()
[Row(age={'last_name': None, 'first_name': None, 'id': 326}, name='test'), Row(age={'last_name': None, 'first_name': None, 'id': 751}, name='test2')]

For complex data structures, Spark might have difficulty in inferring the schema from the RDD, so you can instead provide a schema to make sure that the conversion is done properly:
df = spark.createDataFrame(
rdd,
'name string, age struct<id:int, first_name:string, last_name:string>'
)
df.collect()
# [Row(name='test', age=Row(id=326, first_name='Will', last_name='Cur')),
# Row(name='test2', age=Row(id=751, first_name='Will', last_name='Mc'))]

Related

Changing schema of avro file when writing to it in append mode

I'm looking for a way to modify the schema of an avro file in python. Taking the following example, using the fastavro package, first write out some initial records, with corresponding schema:
from fastavro import writer, parse_schema
schema = {
'name': 'test',
'type': 'record',
'fields': [
{'name': 'id', 'type': 'int'},
{'name': 'val', 'type': 'long'},
],
}
records = [
{u'id': 1, u'val': 0.2},
{u'id': 2, u'val': 3.1},
]
with open('test.avro', 'wb') as f:
writer(f, parse_schema(schema), records)
Uhoh, I've got some more records, but they contain None values. I'd like to append these records to the avro file, and modify my schema accordingly:
more_records = [
{u'id': 3, u'val': 1.5},
{u'id': 2, u'val': None},
]
schema['fields'][1]['type'] = ['long', 'null']
with open('test.avro', 'a+b') as f:
writer(f, parse_schema(schema), more_records)
Instead of overwriting the schema, this results in an error:
ValueError: Provided schema {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': ['long', 'null']}], '__fastavro_parsed': True, '__named_schemas': {'test': {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': ['long', 'null']}]}}} does not match file writer_schema {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': 'long'}], '__fastavro_parsed': True, '__named_schemas': {'test': {'type': 'record', 'name': 'test', 'fields': [{'name': 'id', 'type': 'int'}, {'name': 'val', 'type': 'long'}]}}}
Is there a workaround for this? The fastavro docs for this suggest it's not possible, but I'm hoping someone knows of a way!
Cheers

The append API in fastavro does not currently support this. You could open an issue in that repository and discuss if something like this makes sense.

How do I access a specific value in a nested Python dictionary?

I am trying to figure out how to filter for the dictionaries that have a status of "awaiting_delivery". I am not sure how to do this (or if it is impossible). I am new to python and programming. I am using Python 3.8.5 on VS Code on Ubuntu 20.04. The data below is sample data that I created that resembles json data from an API. Any help on how to filter for "status" would be great. Thank you.
nested_dict = {
'list_data': [
{
'id': 189530,
'total': 40.05,
'user_data': {
'id': 1001,
'first_name': 'jane',
'last_name': 'doe'
},
'status': 'future_delivery'
},
{
'id': 286524,
'total': 264.89,
'user_data': {
'id': 1002,
'first_name': 'john',
'last_name': 'doe'
},
'status': 'awaiting_delivery'
},
{
'id': 368725,
'total': 1054.98,
'user_data': {
'id': 1003,
'first_name': 'chris',
'last_name': 'nobody'
},
'status': 'awaiting_delivery'
},
{
'id': 422955,
'total': 4892.78,
'user_data': {
'id': 1004,
'first_name': 'mary',
'last_name': 'madeup'
},
'status': 'future_delivery'
}
],
'current_page': 1,
'total': 2,
'first': 1,
'last': 5,
'per_page': 20
}
#confirm that nested_dict is a dictionary
print(type(nested_dict))
#create a list(int_list) from the nested_dict dictionary
int_list = nested_dict['list_data']
#confirm that int_list is a list
print(type(int_list))
#create the int_dict dictionary from the int_list list
for int_dict in int_list:
print(int_dict)
#this is my attempt at filtering the int_dict dictionar for all orders with a status of awaiting_delivery
for order in int_dict:
int_dict.get('status')
print(order)
Output from Terminal Follows:
<class 'dict'>
<class 'list'>
{'id': 189530, 'total': 40.05, 'user_data': {'id': 1001, 'first_name': 'jane', 'last_name': 'doe'}, 'status': 'future_delivery'}
{'id': 286524, 'total': 264.89, 'user_data': {'id': 1002, 'first_name': 'john', 'last_name': 'doe'}, 'status': 'awaiting_delivery'}
{'id': 368725, 'total': 1054.98, 'user_data': {'id': 1003, 'first_name': 'chris', 'last_name': 'nobody'}, 'status': 'awaiting_delivery'}
{'id': 422955, 'total': 4892.78, 'user_data': {'id': 1004, 'first_name': 'mary', 'last_name': 'madeup'}, 'status': 'future_delivery'}
id
total
user_data
status

You can obtain a filtered list of dicts by doing conditional list comprehension on your list of dicts:
# filter the data
list_data_filtered = [entry for entry in nested_dict['list_data']
if entry['status'] == 'awaiting_delivery']
# print out the results
for entry in list_data_filtered:
print(entry)
# results
# {'id': 286524, 'total': 264.89, 'user_data': {'id': 1002, 'first_name': 'john', 'last_name': 'doe'}, 'status': 'awaiting_delivery'}
# {'id': 368725, 'total': 1054.98, 'user_data': {'id': 1003, 'first_name': 'chris', 'last_name': 'nobody'}, 'status': 'awaiting_delivery'}

How to rename keys in a dictionary and make a dataframe of it?

I have a complex situation which I hope to solve and which might profit us all. I collected data from my API, added a pagination and inserted the complete data package in a tuple named q1 and finally I have made a dictionary named dict_1of that tuple which looks like this:
dict_1 = {100: {'ID': 100, 'DKSTGFase': None, 'DK': False, 'KM': None,
'Country: {'Name': GE', 'City': {'Name': 'Berlin'}},
'Type': {'Name': '219'}, 'DKObject': {'Name': '8555', 'Object': {'Name': 'Car'}},
'Order': {'OrderId': 101, 'CreatedOn': '2018-07-06T16:54:36.783+02:00',
'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
'Name': Audi, 'Client': {‘1’ }}, 'DKComponent': {'Name': ‘John’}},
{200: {'ID': 200, 'DKSTGFase': None, 'DK': False, ' KM ': None,
'Country: {'Name': ES', 'City': {'Name': 'Madrid'}}, 'Type': {'Name': '220'},
'DKObject': {'Name': '8556', 'Object': {'Name': 'Car'}},
'Order': {'OrderId': 102, 'CreatedOn': '2018-07-06T16:54:36.783+02:00',
'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
'Name': Mercedes, 'Client': {‘2’ }}, 'DKComponent': {'Name': ‘Sergio’}},
Please note that in the above dictionary I have just stated 2 records. The actual dictionary has 1400 records till it reaches ID 1500.
Now I want to 2 things:
I want to change some keys for all the records. key DK has to become DK1. Key Name in Country has to become Name1 and Name in Object has to become 'Name2'
The second thing I want is to make a dataFrame of the whole bunch of data. My expected outcome is:
This is my code:
q1 = response_2.json()
next_link = q1['#odata.nextLink']
q1 = [tuple(q1.values())]
while next_link:
new_response = requests.get(next_link, headers=headers, proxies=proxies)
new_data = new_response.json()
q1.append(tuple(new_data.values()))
next_link = new_data.get('#odata.nextLink', None)
dict_1 = {
record['ID']: record
for tup in q1
for record in tup[2]
}
#print(dict_1)
for x in dict_1.values():
x['DK1'] = x['DK']
x['Country']['Name1'] = x['Country']['Name']
x['Object']['Name2'] = x['Object']['Name']
df = pd.DataFrame(dict_1)
When i run this I receive the following Error:
Traceback (most recent call last):
File "c:\data\FF\Desktop\Python\PythongMySQL\Talky.py", line 57, in <module>
x['Country']['Name1'] = x['Country']['Name']
TypeError: 'NoneType' object is not subscriptable

working code
lists=[]
alldict=[{100: {'ID': 100, 'DKSTGFase': None, 'DK': False, 'KM': None,
'Country': {'Name': 'GE', 'City': {'Name': 'Berlin'}},
'Type': {'Name': '219'}, 'DKObject': {'Name': '8555', 'Object': {'Name': 'Car'}},
'Order': {'OrderId': 101, 'CreatedOn': '2018-07-06T16:54:36.783+02:00',
'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
'Name': 'Audi', 'Client': {'1' }}, 'DKComponent': {'Name': 'John'}}}]
for eachdict in alldict:
key=list(eachdict.keys())[0]
eachdict[key]['DK1']=eachdict[key]['DK']
del eachdict[key]['DK']
eachdict[key]['Country']['Name1']=eachdict[key]['Country']['Name']
del eachdict[key]['Country']['Name']
eachdict[key]['DKObject']['Object']['Name2']=eachdict[key]['DKObject']['Object']['Name']
del eachdict[key]['DKObject']['Object']['Name']
lists.append([key, eachdict[key]['DK1'], eachdict[key]['KM'], eachdict[key]['Country']['Name1'],
eachdict[key]['Country']['City']['Name'], eachdict[key]['DKObject']['Object']['Name2'], eachdict[key]['Order']['Client']])
pd.DataFrame(lists, columns=[<columnNamesHere>])
Output:
{100: {'ID': 100,
'DKSTGFase': None,
'KM': None,
'Country': {'City': {'Name': 'Berlin'}, 'Name1': 'GE'},
'Type': {'Name': '219'},
'DKObject': {'Name': '8555', 'Object': {'Name2': 'Car'}},
'Order': {'OrderId': 101,
'CreatedOn': '2018-07-06T16:54:36.783+02:00',
'ModifiedOn': '2018-07-06T16:54:36.783+02:00',
'Name': 'Audi',
'Client': {'1'}},
'DKComponent': {'Name': 'John'},
'DK1': False}}

Add dictionary item in list to a dictionary

Hi guys I'm pretty lost with this simple problem. I have a dictionary and a list of dictionaries in python and I want to loop over the list to add each dictionary to the first dictionary but somehow it just adds the last dictionary with the solution I came up with. I'm using Python 3.6.5
This is what I've tried:
res = []
dictionary = {"id": 1, "name": "Jhon"}
dictionary_2 = [
{"surname": "Doe", "email": "jd#example.com"},
{"surname": "Morrison", "email": "jm#example.com"},
{"surname": "Targetson", "email": "jt#example.com"}
]
for info in dictionary_2:
aux_dict = dictionary
aux_dict["extra"] = info
res.append(aux_dict)
print(res)
What I expect is:
[{'id': 1, 'name': 'Jhon', 'extra': {'surname': 'Doe', 'email': 'jd#example.com'}},
{'id': 1, 'name': 'Jhon', 'extra': {'surname': 'Morrison', 'email': 'jm#example.com'}},
{'id': 1, 'name': 'Jhon', 'extra': {'surname': 'Targetson', 'email': 'jt#example.com'}}]
And this is what I get
[{'id': 1, 'name': 'Jhon', 'extra': {'surname': 'Targetson', 'email': 'jt#example.com'}},
{'id': 1, 'name': 'Jhon', 'extra': {'surname': 'Targetson', 'email': 'jt#example.com'}},
{'id': 1, 'name': 'Jhon', 'extra': {'surname': 'Targetson', 'email': 'jt#example.com'}}]
This is probably a duplicate of some other question but I can't manage to find it

This is because you keep adding the same aux_dict to res.
What you probably want to do is make a copy of dictionary; just assigning it to aux_dict does not make a copy.
This is how you make a (shallow) copy:
aux_dict = dictionary.copy()
That would be sufficient in your case.

You can achieve this in one line using list comprehension and dict constructor:
dictionary = {"id": 1, "name": "Jhon"}
dictionary_2 = [
{"surname": "Doe", "email": "jd#example.com"},
{"surname": "Morrison", "email": "jm#example.com"},
{"surname": "Targetson", "email": "jt#example.com"}
]
# ...
res = [dict(dictionary, extra=item) for item in dictionary_2]
# ...
print(res)

Dictionary keys and values

I'm working with Python dictionaries and there's something I don't understand when I use the function dict.values() and dict.keys().
Why do they give as a result also the "description" of the function? Am I missing something here?
participant = {"name": "Lisa", "age": 16, "activities": [{"name": "running", "duration": 340},{"name": "walking", "duration": 790}]}
print(participant.values())
print(participant.keys())
The print gives these results:
dict_values([[{'duration': 340, 'name': 'running'}, {'duration': 790, 'name': 'walking'}], 'Lisa', 16])
dict_keys(['activities', 'name', 'age'])
I don't want 'dict_values' and 'dict_keys' in the result. What am I doing wrong?

For this purpose you can use keyword list:
list(participant.keys()) # ['name', 'activities', 'age']
list(participant.values())
# ['Lisa', [{'name': 'running', 'duration': 340}, {'name': 'walking', 'duration': 790}], 16]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

RDD to DF conversion - python

Related

Changing schema of avro file when writing to it in append mode

How do I access a specific value in a nested Python dictionary?

How to rename keys in a dictionary and make a dataframe of it?

Add dictionary item in list to a dictionary

Dictionary keys and values

Categories

Resources