connecting two dictionaries and storing it into an RDD - python

I have a dictionary users with 1748 elements (showing only the first 12):
defaultdict(int,
{'470520068': 1,
'2176120173': 1,
'145087572': 3,
'23047147': 1,
'526506000': 1,
'326311693': 1,
'851106379': 4,
'161900469': 1,
'3222966471': 1,
'2562842034': 1,
'18658617': 1,
'73654065': 4,})
and another dictionary partition with 452743 elements (showing the first 42):
{'609232972': 4,
'975151075': 4,
'14247572': 4,
'2987788788': 4,
'3064695250': 2,
'54097674': 3,
'510333371': 0,
'34150587': 4,
'26170001': 0,
'1339755391': 3,
'419536996': 4,
'2558131184': 2,
'23068646': 6,
'2781517567': 3,
'701206260771905541': 4,
'754263126': 4,
'33799684': 0,
'1625984816': 4,
'4893416104': 3,
'263520530': 3,
'60625681': 4,
'470528618': 3,
'4512063372': 6,
'933683112': 3,
'402379005': 4,
'1015823005': 2,
'244673821': 0,
'3279677882': 4,
'16206240': 4,
'3243924564': 6,
'2438275574': 6,
'205941266': 3,
'330723222': 1,
'3037002897': 0,
'75454729': 0,
'3033154947': 6,
'67475302': 3,
'922914019': 6,
'2598199242': 6,
'2382444216': 3,
'1388012203': 4,
'3950452641': 5,}
The keys in users (all unique) all appear in partition, where they are repeated with different values (partition also contains some extra keys that are of no use here). What I want is a new dictionary final that pairs each key of users with its matching values from partition. For example, if '145087572' is a key in users and the same key is repeated two or three times in partition with different values, as in {'145087572':2, '145087572':3,'145087572':7}, then I should get all three elements in final. I also have to store this dictionary as a key:value RDD.
Here's what I tried:
user_key=list(users.keys())
final=[]
for x in user_key:
    s={x:partition.get(x) for x in partition}
    final.append(s)
After running this code my laptop stops responding (the code still shows [*]) and I have to restart it. Is there a problem with my code, and is there a more efficient way to do this?

First, a dictionary cannot hold duplicate keys: a duplicate key's value is overwritten by the last value assigned to that key.
Now let's analyze your code:
user_key=list(users.keys()) # here you get all the keys, say (1, 2, 3)
final=[]
for x in user_key: # you are iterating over the keys, so x will be 1, 2, 3
    s={x:partition.get(x) for x in partition} # this is the reason for the hang
    ''' Broken down, the line above looks like this:
    s = {}
    for x in partition:
        s[x] = partition.get(x)
    The outer for loop and the comprehension use the same variable name x.
    Inside the comprehension, x refers to the keys of partition, not to the
    current key of users, so every pass copies the whole partition
    dictionary instead of looking up a single user key.
    '''
    final.append(s)
Now the reason for the hang: say you have 10 keys in the users dictionary. The outer for loop iterates 10 times, and each time the comprehension iterates over all the keys of partition and builds a full copy of it. With 1748 keys in users and 452743 entries in partition, that is roughly 791 million stored entries, which exhausts memory and eventually makes your system hang.
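A scaled-down sketch of the same pattern makes the effect visible. The tiny users and partition dictionaries below are made-up stand-ins for the real data, and matched is a name introduced only for this illustration:

```python
# Tiny stand-ins for the real dictionaries (values are illustrative only).
users = {"a": 1, "b": 3}
partition = {"a": 4, "b": 2, "c": 0, "d": 6}

final = []
for x in users:
    # The comprehension's x is its own variable: it walks over every key
    # of partition, so s is a full copy of partition on each pass.
    s = {x: partition.get(x) for x in partition}
    final.append(s)

print(len(final))             # one copy per user key
print(final[0] == partition)  # each copy equals the whole partition dict

# Looking up only the user keys keeps the result as small as users itself.
matched = {key: partition[key] for key in users if key in partition}
print(matched)
```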
I think this will work for you: store the partition data in a Python defaultdict(list).
from collections import defaultdict
user_keys = users.keys()
part_dict = defaultdict(list)
# partition = [[key1, value], [key2, value], ...]
# store your partition data this way (a list of [key, value] pairs)
for key, value in partition:
    # defaultdict(list) creates the empty list on first access
    part_dict[key].append(value)
# part_dict = {key1: [1, 2, 3], key2: [1, 2, 3], key3: [4, 5], ...}
final = []
for x in user_keys:
    for value in part_dict[x]:
        final.append([x, value])
        # if you want the result in dictionary format (I don't think it's required), you can use
        # final.append({x: value})
        # final = [{key1: 1}, {key2: 2}, ...]
# final = [[key1, 1], [key1, 2], [key1, 3], ...]
The above code is not tested; some minor changes may be required.
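For reference, here is a small runnable version of this grouping idea. It assumes the partition data is available as a list of (key, value) pairs, so that repeated keys survive. The sample data is invented, apart from the key '145087572' and its values 2, 3, 7 taken from the question, and partition_pairs is a hypothetical name:

```python
from collections import defaultdict

# Hypothetical sample data; the real inputs are far larger.
users = {"145087572": 3, "23047147": 1}
partition_pairs = [
    ("145087572", 2),
    ("609232972", 4),
    ("145087572", 3),
    ("23047147", 1),
    ("145087572", 7),
]

# Group every value under its key; repeated keys accumulate in a list.
part_dict = defaultdict(list)
for key, value in partition_pairs:
    part_dict[key].append(value)

# Keep only the keys present in users, as flat (key, value) pairs.
final = [(key, value) for key in users for value in part_dict.get(key, [])]
print(final)
```

If a SparkContext is available (called sc here, which is an assumption about the environment), sc.parallelize(final) should turn this list of pairs into a key/value RDD.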

Related

How to append (key, value) with a loop in Python

I want to create a new dict with a loop, but I can't find a way to add keys and values inside the loop with append. I tried something like this, but I'm still looking for the right way.
frigo = {"mangue" : 2, "orange" : 8, "cassoulet" : 1, "thon" : 2, "coca" : 8, "fenouil" : 1, "lait" : 3}
new_frigo = {}
for i, (key, value) in enumerate(frigo.items()):
    print(i, key, value)
    new_frigo[i].append{key,value}
There's already a python function for that:
new_frigo.update(frigo)
No need for a loop! dict.update(other_dict) just goes and adds all content of the other_dict to the dict.
Anyway, if you wanted for some reason to do it with a loop,
for key, value in frigo.items():
    new_frigo[key] = value
would do that. Using an i here makes no sense - a dictionary new_frigo doesn't have indices, but keys.
You can use update to append the key and values in the dictionary as follows:
frigo = {"mangue": 2, "orange": 8, "cassoulet": 1, "thon": 2, "coca": 8, "fenouil": 1, "lait": 3}
new_frigo = {}
for (key, value) in frigo.items():
    new_frigo.update({key:value})
print(new_frigo)
Result:
{'mangue': 2, 'orange': 8, 'cassoulet': 1, 'thon': 2, 'coca': 8, 'fenouil': 1, 'lait': 3}
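If the index from enumerate was actually wanted, one possible reading of the question is a dictionary keyed by position. This is only a guess at the intent, and the name indexed_frigo is made up for the sketch:

```python
frigo = {"mangue": 2, "orange": 8, "cassoulet": 1}

# Plain copy: same keys, same values, but a separate dictionary object.
new_frigo = dict(frigo)

# Position-keyed variant: index -> (key, value) pair.
indexed_frigo = {i: (key, value) for i, (key, value) in enumerate(frigo.items())}

print(new_frigo)
print(indexed_frigo)
```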

How to write a nested python dict to csv with row being key, value, key, value

I have a nested dict that looks like:
{KeyA: {'ItemA': 1, 'ItemB': 2, 'ItemC': 3, 'ItemD': 4, 'ItemE': 5, 'ItemF': 6},
 KeyB: {'ItemR': 2, 'ItemQ': 3, 'ItemG': 4, 'ItemZ': 5, 'ItemX': 6, 'ItemY': 7}}
I would like to output this to a csv where the desired row format is:
ItemA, 1, ItemB, 2, ItemC, 3, ItemD, 4, ItemE, 5, ItemF, 6
I've managed to get a row that's keys and then another below it with the associated value with the below code:
for item in myDict:
    item = myDict[x]
    itemVals = item.values()
    wr.writerow(item)
    wr.writerow(itemVals)
    x += 1
I've tried a number of ways of reformatting this and keep running into subscriptable errors every which way I try.
The length of the top level dict could be large, up to 30k nested dicts. The nested dicts are a constant length of 6 key:value pairs, currently.
What's a clean way to achieve this?
Here is an implementation with loops. Each pair is written as the nested key followed by its value:
myDict = {'KeyA': {'ItemA': 1, 'ItemB': 2, 'ItemC': 3, 'ItemD': 4, 'ItemE': 5, 'ItemF': 6},
          'KeyB': {'ItemR': 2, 'ItemQ': 3, 'ItemG': 4, 'ItemZ': 5, 'ItemX': 6, 'ItemY': 7}}
with open("output.csv", "w") as file:
    for key in myDict:
        for nestedKey in myDict[key]:
            # write the nested key (e.g. ItemA) followed by its value
            file.write(nestedKey + "," + str(myDict[key][nestedKey]) + ",")
        file.write("\n")
output.csv:
ItemA,1,ItemB,2,ItemC,3,ItemD,4,ItemE,5,ItemF,6,
ItemR,2,ItemQ,3,ItemG,4,ItemZ,5,ItemX,6,ItemY,7,
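An alternative sketch uses the standard csv module, which handles quoting and avoids a trailing comma. The function name write_flat_rows and the shortened sample data are assumptions made for this example, and an in-memory buffer stands in for the output file:

```python
import csv
import io
from itertools import chain

my_dict = {
    "KeyA": {"ItemA": 1, "ItemB": 2, "ItemC": 3},
    "KeyB": {"ItemR": 2, "ItemQ": 3, "ItemG": 4},
}

def write_flat_rows(nested, out):
    """Write one CSV row per nested dict: key, value, key, value, ..."""
    writer = csv.writer(out)
    for inner in nested.values():
        # items() yields (key, value) pairs; chain flattens them into one row
        writer.writerow(list(chain.from_iterable(inner.items())))

buffer = io.StringIO()  # stands in for a real output file
write_flat_rows(my_dict, buffer)
print(buffer.getvalue())
```

To write to disk instead, the buffer could be swapped for a file opened with open("output.csv", "w", newline=""). Since rows are written one at a time, the approach should scale to the 30k nested dicts mentioned in the question.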

Updating key values in dictionaries

I am trying to write code for the following:
The idea is to have a storage/inventory dictionary and then have the key values be reduced by certain household tasks. E.g. cleaning, cooking etc.
This would be the storage dictionary:
cupboard= {"cookies":30,
"coffee":3,
"washingpowder": 5,
"cleaningspray": 5,
'Pasta': 0.5,
'Tomato': 4,
'Beef': 2,
'Potato': 2,
'Flour': 0.2,
'Milk': 1,
"Burger buns": 6}
Now this is the code that I wrote to try to reduce a single key's value (the idea is that the action "cleaning" reduces the key "cleaningspray" by one cleaning unit = 0.5):
cleaning_amount = 0.5
def cleaning(room):
    while cupboard["cleaningspray"] <0.5:
        cleaned = {key: cupboard.get(key) - cleaning_amount for key in cupboard}
        return cupboard
livingroom = 1*cleaning_amount
cleaning(livingroom)
print(cupboard)
but it returns this, which is the same dictionary as before, with no updated values
{'cookies': 30, 'coffee': 3, 'washingpowder': 5, 'cleaningspray': 5, 'Pasta': 0.5, 'Tomato': 4, 'Beef': 2, 'Potato': 2, 'Flour': 0.2, 'Milk': 1, 'Burger buns': 6}
Can anybody help?
Thank you!!
picture attached to see indents etc.
I guess you want to decrease the "cleaningspray" amount depending on the room size (or other factors). I would do it like this:
cleaning_amount = 0.5
def cleaning(cleaning_factor):
    if cupboard["cleaningspray"] > 0.5:
        # reduce the amount of cleaning spray depending on the cleaning_factor and the global cleaning_amount
        cupboard["cleaningspray"] -= cleaning_factor * cleaning_amount
livingroom_cleaning_factor = 1
cleaning(livingroom_cleaning_factor)
print(cupboard)
Output:
{'cookies': 30, 'coffee': 3, 'washingpowder': 5, 'cleaningspray': 4.5, 'Pasta': 0.5, 'Tomato': 4, 'Beef': 2, 'Potato': 2, 'Flour': 0.2, 'Milk': 1, 'Burger buns': 6}
I believe the values don't change because the comprehension builds a new dictionary instead of modifying the original one.
For example:
list_values = [1, 2, 3, 4, 5]
new_variable = [num + 1 for num in list_values]
print("list_values", list_values) # The original list_values variable doesn't change
print("new_variable", new_variable) # This new variable holds the required value
This returns:
list_values [1, 2, 3, 4, 5]
new_variable [2, 3, 4, 5, 6]
So to fix the problem, you can use the new variable.
Now that the concept is clear (I hope), in your case it would be something like this:
def cleaning():
    while cupboard["cleaningspray"] > 0.5: # Also here, I believe you intended `>`
        # and not `<` as in the original code
        cleaned = {key: cupboard.get(key) - cleaning_amount for key in cupboard}
        return cleaned
We return the new variable, cleaned, so it can be assigned back to the original dictionary variable if required:
cupboard = cleaning()
EDIT:
Also, as @d-k-bo commented, if you intend the operation to be carried out only once, an if statement would also do the job:
if cupboard["cleaningspray"] > 0.5: # Again assuming you intended '>' and not '<'
Otherwise, you should keep the return statement outside the while loop.
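Putting the two ideas together, a minimal sketch could return a new dictionary in which only the cleaning spray is reduced. The function name use_cleaning_spray, its parameters, and the shortened cupboard are assumptions made for this example, not part of the original code:

```python
cleaning_amount = 0.5

def use_cleaning_spray(cupboard, cleaning_factor):
    """Return a new cupboard with only the cleaning spray reduced."""
    needed = cleaning_factor * cleaning_amount
    if cupboard["cleaningspray"] < needed:
        return dict(cupboard)  # not enough spray: return an unchanged copy
    updated = dict(cupboard)   # shallow copy, so the input is left untouched
    updated["cleaningspray"] -= needed
    return updated

cupboard = {"cookies": 30, "cleaningspray": 5}
cupboard = use_cleaning_spray(cupboard, 1)
print(cupboard)
```

Passing the dictionary in and assigning the result back avoids relying on a global variable, at the cost of one shallow copy per call.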

Dataframe with fixed length (over writing)

I wrote code that generates a large amount of data in each round, so I only need to store data for the last 10 rounds. How can I create a dataframe that drops the oldest object when I add a new one (overwriting)? The order of observations, from old to new, should be maintained. Is there a simple function or data format for this?
Thanks in advance!
You could use this function:
def ins(arr, item):
    if len(arr) < 10:
        arr.insert(0, item)
    else:
        arr.pop()
        arr.insert(0, item)
ex = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'a')
print(ex)
# ['a', 1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'b')
print(ex)
# ['b', 'a', 1, 2, 3, 4, 5, 6, 7, 8]
In order for this to work you MUST pass a list as argument to the function ins(), so that the new item is inserted and the 10th is removed (if there is one).
(I considered that the question is not pandas specific, but rather a way to store a maximum amount of items in an array)
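The standard library also offers a container with this behaviour built in: collections.deque with a maxlen. The sketch below keeps the old-to-new order asked for in the question. The name last_rounds and the integer payloads are placeholders for the real per-round data:

```python
from collections import deque

# A deque with maxlen discards items from the opposite end once it is full.
last_rounds = deque(maxlen=10)

for round_number in range(1, 14):  # 13 rounds of made-up data
    last_rounds.append(round_number)  # newest item goes on the right

print(list(last_rounds))  # oldest first, newest last
```

Each stored item could just as well be a per-round DataFrame. In that case, pandas.concat(list(last_rounds)) would be one way to combine the retained rounds, assuming pandas is in use.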

Get a list of tuples (or lists) where the lists with the same elements are grouped?

I have a dictionary in Python with several lists, and I am trying to get a list of tuples (or lists) that groups the keys whose lists contain the same elements, regardless of order. For example:
dict_1 = {
"pv_0": [1, 2, 3, 4, 5],
"pv_1": [2, 4, 6, 8, 10],
"pv_2": [1, 3, 5, 7, 9],
"pv_3": [3, 4, 1, 2, 5],
"pv_4": [2, 3, 4, 5, 6],
"pv_5": [3, 4, 5, 6, 2],
"pv_6": [1, 2, 3, 5, 4],
"pv_7": [5, 9, 7, 3, 1],
"pv_8": [2, 4, 6, 8, 10],
"pv_9": [1, 3, 5, 6, 7],
}
I wish to obtain the following result:
Result = [
("pv_0", "pv_3", "pv_6"),
("pv_2", "pv_7"),
("pv_1", "pv_8"),
("pv_4", "pv_5"),
("pv_9",),
]
How do I solve this problem?
from operator import itemgetter
from itertools import groupby
# create a new dictionary where the value is a hashed immutable set
d = {k: hash(frozenset(v)) for k, v in dict_1.items()}
{'pv_0': -3779889356588604112,
'pv_1': 2564111202014126800,
'pv_2': 777379418226018803,
'pv_3': -3779889356588604112,
'pv_4': 8713515799959436501,
'pv_5': 8713515799959436501,
'pv_6': -3779889356588604112,
'pv_7': 777379418226018803,
'pv_8': 2564111202014126800,
'pv_9': -6160949303479789752}
first = itemgetter(0) # operator to grab first item of iterable
second = itemgetter(1) # operator to grab second item of iterable
[list(map(first, v)) for _, v in groupby(sorted(d.items(), key=second), key=second)]
[['pv_9'],
['pv_0', 'pv_3', 'pv_6'],
['pv_2', 'pv_7'],
['pv_1', 'pv_8'],
['pv_4', 'pv_5']]
The final list comprehension grabs all the key/value pairs from the dictionary and sorts them by the value. It then passes that in to the groupby function from itertools and tells it to group by the value of the dictionary. The output of this is then fed in to a map function which grabs the first item from each pair in the group which would be the corresponding key.
From what I can tell, you want a tuple of keys where each value is the same.
def get_matching_keys(data: dict) -> list:
    # first, make everything a set
    for key in data:
        data[key] = set(data[key])  # makes order irrelevant
    results = []
    duplicates = []
    for key, value in data.items():
        if key in duplicates: continue  # we already did this
        result = [key]
        duplicates.append(key)
        for key2, value2 in data.items():
            if key == key2: continue  # skip the same key
            else:
                if value == value2:
                    result.append(key2)
                    duplicates.append(key2)  # make sure we don't do it again
        results.append(result)
    return results
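A single-pass variant of the same idea is sketched below. It uses each list's frozenset directly as a dictionary key, which avoids both the nested loop and any reliance on hash values being distinct. The function name group_keys_by_elements is made up for this example. Note that a frozenset also ignores repeated elements within a list, which may or may not be wanted:

```python
from collections import defaultdict

dict_1 = {
    "pv_0": [1, 2, 3, 4, 5],
    "pv_1": [2, 4, 6, 8, 10],
    "pv_2": [1, 3, 5, 7, 9],
    "pv_3": [3, 4, 1, 2, 5],
    "pv_4": [2, 3, 4, 5, 6],
    "pv_5": [3, 4, 5, 6, 2],
    "pv_6": [1, 2, 3, 5, 4],
    "pv_7": [5, 9, 7, 3, 1],
    "pv_8": [2, 4, 6, 8, 10],
    "pv_9": [1, 3, 5, 6, 7],
}

def group_keys_by_elements(data):
    """Group keys whose lists hold the same elements, ignoring order."""
    groups = defaultdict(list)
    for key, values in data.items():
        # frozenset is hashable and order-insensitive, so it works as a key
        groups[frozenset(values)].append(key)
    return [tuple(keys) for keys in groups.values()]

result = group_keys_by_elements(dict_1)
print(result)
```

Groups come out in the order in which each distinct set of elements first appears, and the input dictionary is left unmodified.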
