Need help understanding the behavior of a for loop - python

I am working through a tutorial on sets in Python 2.7 and have run into a for-loop behavior that I do not understand. I would like to know why two approaches produce different outputs.
The object of the exercise is to use a for loop to produce a set, cities, from a dictionary whose keys are frozensets of city pairs.
The data comes from the following dictionary:
flight_distances = {
frozenset(['Atlanta', 'Chicago']): 590.0,
frozenset(['Atlanta', 'Dallas']): 720.0,
frozenset(['Atlanta', 'Houston']): 700.0,
frozenset(['Atlanta', 'New York']): 750.0,
frozenset(['Austin', 'Dallas']): 180.0,
frozenset(['Austin', 'Houston']): 150.0,
frozenset(['Boston', 'Chicago']): 850.0,
frozenset(['Boston', 'Miami']): 1260.0,
frozenset(['Boston', 'New York']): 190.0,
frozenset(['Chicago', 'Denver']): 920.0,
frozenset(['Chicago', 'Houston']): 940.0,
frozenset(['Chicago', 'Los Angeles']): 1740.0,
frozenset(['Chicago', 'New York']): 710.0,
frozenset(['Chicago', 'Seattle']): 1730.0,
frozenset(['Dallas', 'Denver']): 660.0,
frozenset(['Dallas', 'Los Angeles']): 1240.0,
frozenset(['Dallas', 'New York']): 1370.0,
frozenset(['Denver', 'Los Angeles']): 830.0,
frozenset(['Denver', 'New York']): 1630.0,
frozenset(['Denver', 'Seattle']): 1020.0,
frozenset(['Houston', 'Los Angeles']): 1370.0,
frozenset(['Houston', 'Miami']): 970.0,
frozenset(['Houston', 'San Francisco']): 1640.0,
frozenset(['Los Angeles', 'New York']): 2450.0,
frozenset(['Los Angeles', 'San Francisco']): 350.0,
frozenset(['Los Angeles', 'Seattle']): 960.0,
frozenset(['Miami', 'New York']): 1090.0,
frozenset(['New York', 'San Francisco']): 2570.0,
frozenset(['San Francisco', 'Seattle']): 680.0,
}
There is also a test list that will create the intended set as a check:
flying_circus_cities = [
'Houston', 'Chicago', 'Miami', 'Boston', 'Dallas', 'Denver',
'New York', 'Los Angeles', 'San Francisco', 'Atlanta',
'Seattle', 'Austin'
]
When the code is written in the following form, the loop produces the intended result.
cities = set()
for pair in flight_distances:
    cities = cities.union(pair)
print cities
print "Check:", cities == set(flying_circus_cities)
Output:
set(['Houston', 'Chicago', 'Miami', 'Boston', 'Dallas', 'Denver', 'New York', 'Los Angeles', 'San Francisco', 'Atlanta', 'Seattle', 'Austin'])
Check: True
However, if I rewrite the loop as a comprehension, using either of the following, I get a different result.
cities = set()
cities = {pair for pair in flight_distances}
print cities
print "Check:", cities == set(flying_circus_cities)
or
cities = set()
cities = cities.union(pair for pair in flight_distances)
print cities
print "Check:", cities == set(flying_circus_cities)
Output for both:
set([frozenset(['Atlanta', 'Dallas']), frozenset(['San Francisco', 'New York']), frozenset(['Denver', 'Chicago']), frozenset(['Houston', 'San Francisco']), frozenset(['San Francisco', 'Austin']), frozenset(['Seattle', 'Los Angeles']), frozenset(['Boston', 'New York']), frozenset(['Houston', 'Atlanta']), frozenset(['New York', 'Chicago']), frozenset(['San Francisco', 'Seattle']), frozenset(['Austin', 'Dallas']), frozenset(['New York', 'Dallas']), frozenset(['Houston', 'Chicago']), frozenset(['Seattle', 'Denver']), frozenset(['Seattle', 'Chicago']), frozenset(['Miami', 'New York']), frozenset(['Los Angeles', 'Denver']), frozenset(['Miami', 'Houston']), frozenset(['San Francisco', 'Los Angeles']), frozenset(['New York', 'Denver']), frozenset(['Atlanta', 'Chicago']), frozenset(['Boston', 'Chicago']), frozenset(['Houston', 'Austin']), frozenset(['Houston', 'Los Angeles']), frozenset(['New York', 'Los Angeles']), frozenset(['Atlanta', 'New York']), frozenset(['Denver', 'Dallas']), frozenset(['Los Angeles', 'Dallas']), frozenset(['Los Angeles', 'Chicago'])])
Check: False
I cannot figure out why the for loop in the first example unpacks the pairs as intended so that it produces a set with one instance of each city, while trying to write the loop as a comprehension pulls out the frozenset([city1, city2]) pairs and places them in the set instead.
I do not understand why pair would give the city strings in the first instance but passes the frozenset in the second instance.
Can someone explain the different behavior?
Note: As explained by Holt and donkopotamus, the comprehension yields the frozenset keys themselves, so the result (or the single iterable passed to union()) contains each frozenset as one element, producing a set of frozensets. The standard for loop instead passes one pair at a time to union(), which treats the pair itself as the iterable and adds its individual cities, reassigning cities on each pass.
They further explained that the * operator unpacks the generator into separate arguments to union(), one per pair, which produces the desired behavior.
cities = cities.union(*(set(pair) for pair in flight_distances))

The expression:
cities = set()
cities = cities.union(pair for pair in flight_distances)
will take the union of the empty set {} with a single collection whose elements are the pairs
{pair_0, pair_1, pair_2, ..., pair_n}
leaving you with a set of sets.
In contrast, the following will give you all of the cities flown to:
>>> set.union(*(set(pair) for pair in flight_distances))
{'Atlanta',
'Austin',
'Boston',
'Chicago',
'Dallas',
'Denver',
'Houston',
'Los Angeles',
'Miami',
'New York',
'San Francisco',
'Seattle'}
Here we transform each of the frozen set keys into a plain set and find the union.
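If the goal is specifically to write this as a comprehension, a nested set comprehension is one possible sketch. It uses a three-entry subset of the question's dictionary for brevity, and the name same_cities is only illustrative.

```python
# A small subset of the question's dictionary, for brevity.
flight_distances = {
    frozenset(['Atlanta', 'Chicago']): 590.0,
    frozenset(['Atlanta', 'Dallas']): 720.0,
    frozenset(['Austin', 'Dallas']): 180.0,
}

# The outer loop walks the frozenset keys; the inner loop walks each key's cities.
cities = {city for pair in flight_distances for city in pair}

# Passing the keys as separate arguments to union() gives the same result.
same_cities = set().union(*flight_distances)

print(sorted(cities))  # ['Atlanta', 'Austin', 'Chicago', 'Dallas']
```

The inner "for city in pair" is what flattens each frozenset; without it, the comprehension collects the frozensets themselves.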

In the first version, pair is a frozenset on each iteration, so you take a union with it, while in your version you try to take a union with a collection of frozensets.
The first case comes down to (union with a frozenset at each iteration):
cities = set()
cities = cities.union(frozenset(['Atlanta', 'Chicago']))
cities = cities.union(frozenset(['Atlanta', 'Dallas']))
...
So you have (mathematically):
cities = {} # Empty set
cities = {} U {'Atlanta', 'Chicago'} = {'Atlanta', 'Chicago'}
cities = {'Atlanta', 'Chicago'} U {'Atlanta', 'Dallas'} = {'Atlanta', 'Chicago', 'Dallas'}
...
In your (last) case, you are doing the following (one union with a sequence of frozensets):
cities = set()
cities = cities.union([frozenset(['Atlanta', 'Chicago']), frozenset(['Atlanta', 'Dallas']), ...])
So you have:
cities = {}
cities = {} U {{'Atlanta', 'Chicago'}, {'Atlanta', 'Dallas'}, ...}
= {{'Atlanta', 'Chicago'}, {'Atlanta', 'Dallas'}, ...} # Nothing disappears
Since no two pairs are identical, you get a set of all the pairs in your initial dictionary, because you are passing .union() a collection of pairs (sets of cities), not a set of cities.
From a more abstract point of view, you are trying to obtain:
S = {} U S1 U S2 U S3 U ... U Sn = (((({} U S1) U S2) U S3) U ...) U Sn
But you are computing:
S = {} U {S1, S2, S3, ..., Sn}
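The two formulas above can be tried side by side. This is a minimal sketch using only two of the pairs; functools.reduce stands in for the explicit for loop, and the names folded and nested are made up for the illustration.

```python
from functools import reduce

pairs = [frozenset(['Atlanta', 'Chicago']), frozenset(['Atlanta', 'Dallas'])]

# One union per pair: ((set() U S1) U S2), which is what the for loop does.
folded = reduce(lambda acc, pair: acc.union(pair), pairs, set())

# A single union with the whole list: every pair becomes one element.
nested = set().union(pairs)

print(sorted(folded))  # ['Atlanta', 'Chicago', 'Dallas']
print(len(nested))     # 2
```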

Related

Matching part of a string with a value in two pandas dataframes

Given the following df with street names:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
And df2, which contains the streets to match against and their corresponding county:
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
How can I create a column that gives the county for each street in df (the state column in the desired output) by matching df (street1) against df2 (street2)? The match does not have to be exact, but at least one word must match.
The following dataframe is an example of what I want to obtain:
desiredoutput = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown'], 'state': ["Utuado", "NA", "NA", "Bayamon"]})
A naive approach, perhaps, but it works well.
import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
output = {'street1': [], 'county': []}
streets1 = df['street1']
streets2 = df2['street2']
county = df2['county']
for street in streets1:
    matched = False  # reset for every street before scanning df2
    for index, street2 in enumerate(streets2):
        if street2 in street:  # substring match
            output['street1'].append(street)
            output['county'].append(county[index])
            matched = True
    if not matched:
        output['street1'].append(street)
        output['county'].append('NA')
print(output)
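Since the question asks for a match on at least one word, a word-level lookup is another option. The following is only a sketch in plain Python: county_by_street is a hypothetical dictionary assumed to be built from df2's street2 and county columns, and find_county is a made-up helper name. It assumes each street2 entry is a single word, so multi-word street names would need a different approach.

```python
# Hypothetical lookup, assumed to be built from df2's street2 and county columns.
county_by_street = {'Angeles': 'Utuado', 'Caguana': 'Utuado', 'Levitown': 'Bayamon'}


def find_county(street, lookup, default='NA'):
    """Return the county of the first word in street that is a key of lookup."""
    for word in street.split():
        if word in lookup:
            return lookup[word]
    return default


streets = ['36 Angeles', 'New York', 'Rice Street', 'Levitown']
counties = [find_county(street, county_by_street) for street in streets]
print(counties)  # ['Utuado', 'NA', 'NA', 'Bayamon']
```

With the DataFrames from the question, the lookup could presumably be built with dict(zip(df2['street2'], df2['county'])) and applied to df['street1'] with map. Each word lookup is a dictionary access, so the cost no longer grows with the number of rows in df2.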

How to convert dataframe to nested dictionary?

I am running Python 3.7 and Pandas 1.1.3 and have a DataFrame which looks like this:
location = {'city_id': [22000,25000,27000,35000],
'population': [3971883,2720546,667137,1323],
'region_name': ['California','Illinois','Massachusetts','Georgia'],
'city_name': ['Los Angeles','Chicago','Boston','Boston'],
}
df = pd.DataFrame(location, columns = ['city_id', 'population','region_name', 'city_name'])
I want to transform this dataframe into a dict that looks like:
{
'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}
}
If the same city name appears in different regions, the nested dictionary should be sorted by population. For example, Boston is in both Massachusetts and Georgia; the city in Massachusetts is bigger, so it is output first.
My code is:
result = df.groupby(['city_name'])[['region_name', 'city_id']].apply(lambda x: x.set_index('region_name').to_dict()).to_dict()
Output:
{'Boston': {'city_id': {'Massachusetts': 27000, 'Georgia': 35000}},
'Chicago': {'city_id': {'Illinois': 25000}},
'Los Angeles': {'city_id': {'California': 22000}}}
As you can see, an extra 'city_id' key is added to the dictionary.
Please tell me how I should change my code to get the expected result.
Just chain another apply() call onto your current solution:
result = df.groupby(['city_name'])[['region_name', 'city_id']].apply(lambda x: x.set_index('region_name').to_dict()).apply(lambda x: list(x.values())[0]).to_dict()
Now if you print result, you will get your expected output:
{'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}}
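The answer above does not sort by population explicitly. One way to make that ordering explicit is sketched below in plain Python, working on the question's location dictionary rather than on the DataFrame (that substitution is an assumption made to keep the example self-contained). It relies on dictionaries preserving insertion order, which holds for the Python 3.7 mentioned in the question.

```python
location = {'city_id': [22000, 25000, 27000, 35000],
            'population': [3971883, 2720546, 667137, 1323],
            'region_name': ['California', 'Illinois', 'Massachusetts', 'Georgia'],
            'city_name': ['Los Angeles', 'Chicago', 'Boston', 'Boston']}

# Turn the column-oriented dict into one (city, region, id, population) tuple per row.
rows = zip(location['city_name'], location['region_name'],
           location['city_id'], location['population'])

result = {}
# Visit rows from largest to smallest population so each inner dict is filled in that order.
for city, region, city_id, _population in sorted(rows, key=lambda row: row[3], reverse=True):
    result.setdefault(city, {})[region] = city_id

print(result)
```

Note that the outer dictionary also ends up ordered by population here, not alphabetically by city name.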

Python Print Function City name

I'm trying to print 'Phoenix' from this list, which contains a dictionary, but I can't extract the specific name.
test = [{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}, 1000, 2000, 3000, ['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]]
print(test[0]) gives me all the city names. How do I display just 'Phoenix'?
With test[0] you're just accessing the first element of the list test, which in this case is the dictionary.
You need to use the key - in this case Arizona:
print(test[0]['Arizona'])
Output is Phoenix.
You should read up a bit on the dictionary data structure. But here you go:
test[0]['Arizona']
Try running the code snippet below; it will most likely show you the answer.
test = [{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}, 1000, 2000, 3000,
        ['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]]
print(test[0]['Arizona'])
for index in range(len(test)):
    print("test[" + str(index) + "]:")
    print(str(test[index]))
    print()
The output will be like the below:
Phoenix
test[0]:
{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}
test[1]:
1000
test[2]:
2000
test[3]:
3000
test[4]:
['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]
The output shows that test is a list of 5 elements and that element 0 is a dictionary. Thus, to output Phoenix, you have to use test[0]['Arizona'].
If you use just test[0], you only access the first element of the test list:
{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}
test[0] is a dictionary. To get 'Phoenix', which is the value of the key 'Arizona', use that key:
test[0]['Arizona']
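The same chaining works for deeper nesting: each pair of brackets steps one level further in. A short sketch using the list from the question (the variable names capital and sock_colour are just illustrative):

```python
test = [{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}, 1000, 2000, 3000,
        ['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]]

capital = test[0]['Arizona']        # list index, then dictionary key
sock_colour = test[4][3]['socks1']  # list index, list index, then dictionary key

print(capital, sock_colour)  # Phoenix red
```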

Efficient looping through dictionary with keys as tuple

I have a very large dictionary with 200 million keys. Each key is a tuple of two integers. I want to find the key for which a "query integer" lies between the two integers of the tuple.
Currently, I loop through all dictionary keys and check whether the query integer lies within each tuple's range. It works, but each lookup takes around 1-2 minutes and I need to perform around 1 million such queries. An example of the dictionary and the code I have written follow:
Sample dictionary:
[{ (3547237440, 3547237503) : {'state': 'seoul teukbyeolsi', 'country': 'korea (south)', 'country_code': 'kr', 'city': 'seoul'} },
{ (403044176, 403044235) : {'state': 'california', 'country': 'united states', 'country_code': 'us', 'city': 'pleasanton'} },
{ (3423161600, 3423161615) : {'state': 'kansas', 'country': 'united states', 'country_code': 'us', 'city': 'lenexa'} },
{ (3640467200, 3640467455) : {'state': 'california', 'country': 'united states', 'country_code': 'us', 'city': 'san jose'} },
{ (853650485, 853650485) : {'state': 'colorado', 'country': 'united states', 'country_code': 'us', 'city': 'arvada'} },
{ (2054872064, 2054872319) : {'state': 'tainan', 'country': 'taiwan', 'country_code': 'tw', 'city': 'tainan'} },
{ (1760399104, 1760399193) : {'state': 'texas', 'country': 'united states', 'country_code': 'us', 'city': 'dallas'} },
{ (2904302140, 2904302143) : {'state': 'iowa', 'country': 'united states', 'country_code': 'us', 'city': 'hampton'} },
{ (816078080, 816078335) : {'state': 'district of columbia', 'country': 'united states', 'country_code': 'us', 'city': 'washington'} },
{ (2061589204, 2061589207) : {'state': 'zhejiang', 'country': 'china', 'country_code': 'cn', 'city': 'hangzhou'} }]
The code I have written:
import ipaddress

ipint = int(ipaddress.IPv4Address(ip))
for k in ip_dict.keys():
    if ipint >= k[0] and ipint <= k[1]:
        print(ip_dict[k]['country'], ip_dict[k]['country_code'], ip_dict[k]['state'])
where ip is just an IP address such as '192.168.0.1'.
If anyone could provide a hint about a more efficient way to perform this task, it would be much appreciated.
Thanks
I suggest using another data structure with good query complexity, such as a tree.
Maybe you can try this library I just found: https://pypi.org/project/rangetree/
As they say, it is optimized for lookups but not for insertions, so if you insert once and look up many times it should be OK.
Another solution is to use a sorted list instead of a dict and build an index over it, then run a binary search on this index for each query. This can be less optimal if the ranges are not regular, so I prefer the first solution.
Create an index for each of the two integers, as a sorted list like this:
[(left_int, [list_of_row_ids_that_have_this_left_int]),
 (another_greater_left_int, [...])]
You can then find all rows whose left int is less than or equal to the searched one in O(log n).
A binary search will do here.
Do the same for the right int (rows whose right int is greater than or equal to the searched one).
Keep the rest of the data in a list of tuples.
In more detail:
data = [( (3547237440, 3547237503), {'state': 'seoul'} ), ...]
left_idx = [(3547237440, [0, 43]), (9547237440, [3])]
# 0, 43, 3 are indices in the data list
# search
max_left_idx = binary_search(left_idx, 3444444)
# all rows referred to by left_idx[0] ... left_idx[max_left_idx] have left_int <= query
min_right_idx = ...
# all rows referred to by right_idx[min_right_idx] ... right_idx[-1] have right_int >= query
# intersect the two sets of row ids to get the rows that pass the range check
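If the ranges never overlap (an assumption, though it is typical for IP geolocation tables), a single sorted list plus the standard bisect module is enough. The sketch below uses three ranges taken from the sample data with shortened info dictionaries; lookup, ranges and starts are illustrative names, not part of any library.

```python
import bisect
import ipaddress

# Three ranges from the sample data; assumed not to overlap.
ip_dict = {
    (403044176, 403044235): {'city': 'pleasanton', 'country_code': 'us'},
    (816078080, 816078335): {'city': 'washington', 'country_code': 'us'},
    (853650485, 853650485): {'city': 'arvada', 'country_code': 'us'},
}

# Sort the ranges once; keep the start values in a parallel list for bisect.
ranges = sorted(ip_dict)
starts = [start for start, _end in ranges]


def lookup(ipint):
    """Return the info dict whose range contains ipint, or None if there is none."""
    # Position of the last range whose start is <= ipint.
    pos = bisect.bisect_right(starts, ipint) - 1
    if pos < 0:
        return None
    start, end = ranges[pos]
    return ip_dict[(start, end)] if ipint <= end else None


ipint = int(ipaddress.IPv4Address('48.164.93.10'))
print(lookup(ipint))
```

Sorting costs O(n log n) once; after that each query is a binary search, O(log n), instead of a scan over every key.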

Find value of dictionary inside a dictionary

I have some data like this:
{'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
From my scarce knowledge of Python this looks like a dictionary within a dictionary.
But either way how can I use "NY" as a key to fetch the value "New York"?
It's a dictionary with one key-value pair. The value is a list of dictionaries.
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
To find the name for an abbreviation, iterate over the dictionaries in the list and compare each abbrev value for a match:
for city in d['cities']:  # iterate over the inner list
    if city['abbrev'] == 'NY':  # check for a match
        print(city['name'])  # print the matching "name"
Instead of the print you can also save the dictionary containing the abbreviation, or return it.
When a dataset is not shaped for your needs, you can build another dictionary from it instead of using it as-is: use a dictionary comprehension that takes each sub-dictionary's value under one fixed key ('abbrev') as the new key and its value under another ('name') as the new value.
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
newd = {sd["abbrev"]:sd["name"] for sd in d['cities']}
print(newd)
results in:
{'NY': 'New York', 'BO': 'Boston'}
and of course: print(newd['NY']) yields New York
Once the dictionary is built, you can reuse it as many times as you need with great lookup speed. Build other specialized dictionaries from the original dataset whenever needed.
Use next() with a generator expression that filters the sub-dictionaries on the 'abbrev' key:
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'},
                {'abbrev': 'BO', 'name': 'Boston'}]}
city_name = next(city['name'] for city in d['cities']
                 if city['abbrev'] == 'NY')
print(city_name)
Output:
New York
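One caveat with next(): without a second argument it raises StopIteration when nothing matches. A possible variant passes a default instead; find_city_name is a hypothetical helper name used only for this sketch.

```python
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'},
                {'abbrev': 'BO', 'name': 'Boston'}]}


def find_city_name(data, abbrev, default=None):
    """Return the name of the first city whose 'abbrev' matches, else default."""
    return next(
        (city['name'] for city in data['cities'] if city['abbrev'] == abbrev),
        default,
    )


print(find_city_name(d, 'NY'))  # New York
print(find_city_name(d, 'LA'))  # None
```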
I think that I understand your problem.
'NY' is a value, not a key.
Maybe you need something like {'cities': {'NY': 'New York', 'BO': 'Boston'}}, so you could type myvar['cities']['NY'] and it would return 'New York'.
If you have to use x = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]} you could create a function:
def search(abbrev):
    for city in x['cities']:
        if city['abbrev'] == abbrev:
            return city['name']
Output:
>>> search('NY')
'New York'
>>> search('BO')
'Boston'
P.S. I use Python 3.6.
With this variant you can also search by name and get the abbreviation back:
def search(term):
    for city in x['cities']:
        # term may be either an abbreviation or a full name
        if city['abbrev'] == term:
            return city['name'], city['abbrev']
        if city['name'] == term:
            return city['name'], city['abbrev']
