I would like to get the URL of a video with maximum resolution.
Given the following dictionary, what would be the easiest way to get the URL of the video with the maximum size?
Is it best to split the size string on the underscore delimiter, convert the numbers to int, and then get the value?
videos = {
'size_476_306': 'https://www.....',
'size_560_360': 'https://www.....',
'size_644_414': 'https://www.....',
'size_720_480': 'https://www.....',
}
Solved
I couldn't figure out lambda, so I implemented it in a different way.
size_array = []
video_array = []
for size, v_url in videos.items():
    max_size_info = size.split('_')
    split_size = int(max_size_info[1])  # the width component of 'size_width_height'
    size_array.append(split_size)
    video_array.append(v_url)
# the URL stored at the same index as the largest width is the one we want
max_size = size_array.index(max(size_array))
print(size_array[max_size], video_array[max_size])
Is it best to split the size string on the underscore delimiter, convert the numbers to int, and then get the value?
Yes, that's the way I'd do it:
>>> max(videos.items(), key=lambda i: int.__mul__(*map(int, i[0].split("_")[1:])))
('size_720_480', 'https://www.....')
Here's a slightly more verbose version with a named function:
>>> def get_resolution(key: str) -> int:
... """
... Gets the resolution (as pixel area) from 'size_width_height'.
... e.g. get_resolution("size_720_480") -> 345600
... """
... _, width, height = key.split("_")
... return int(width) * int(height)
...
>>> max(videos, key=get_resolution)
'size_720_480'
Given that expression that gives us the largest key, we can easily get the corresponding value:
>>> videos[max(videos, key=get_resolution)]
'https://www.....'
Or we could get the key, value pair by taking the max of items(), here using a much simpler lambda that just translates key, value into get_resolution(key):
>>> max(videos.items(), key=lambda i: get_resolution(i[0]))
('size_720_480', 'https://www.....')
How can I get the keys from the ftp_json dictionary with the largest date by mask from the daily_updated list?
daily_updated = ('kgrd', 'cvhd', 'metd')
ftp_json = {'kgrd0118.arj': 'Jan-18-2007',
'kgrd0623.arj': 'Jun-23-2005',
'kgrd0624.arj': 'Jun-24-2005',
'cvhd0629.ARJ': 'Jan-29-2021',
'cvhd1026.arj': 'Oct-26-2015',
'cvhd1125.ARJ': 'Nov-25-2019',
'cvhd0222.ARJ': 'Feb-22-2022',
'metd0228.ARJ': 'Feb-28-2022',
'metd0321.ARJ': 'Mar-26-2021',
}
result = ['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']
You can take advantage of the key parameter of the max (and min) built-in functions to impose an ordering criterion. Before that you need to turn the strings containing the dates into datetime objects, which come with their own ordering (__lt__ etc.) implementation. See the documentation for the date formatting directives used by strptime.
Notice that a minimum date object is needed: it is used as a "fake" value so that entries not matching the current mask cannot interfere with the search for the maximum. I simply fixed it as the minimum among all the dates.
import datetime
daily_updated = ('kgrd', 'cvhd', 'metd')
ftp_json = {'kgrd0118.arj': 'Jan-18-2007',
'kgrd0623.arj': 'Jun-23-2005',
'kgrd0624.arj': 'Jun-24-2005',
'cvhd0629.ARJ': 'Jan-29-2021',
'cvhd1026.arj': 'Oct-26-2015',
'cvhd1125.ARJ': 'Nov-25-2019',
'cvhd0222.ARJ': 'Feb-22-2022',
'metd0228.ARJ': 'Feb-28-2022',
'metd0321.ARJ': 'Mar-26-2021',
}
def date_formatter(mydate):
return datetime.datetime.strptime(mydate, '%b-%d-%Y').date()
# smallest date
day_zero = min(date_formatter(d) for d in ftp_json.values())
# get the maximum for each mask
m = [max(ftp_json.items(), key=lambda pair: date_formatter(pair[1]) if pair[0].startswith(pattern) else day_zero) for pattern in daily_updated]
print([i for i, _ in m])
Output
['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']
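As a side note, a minimal variant (assuming you don't need the sentinel to come from the data itself): the standard library already provides a smallest possible date, datetime.date.min, which compares less than any real date and works as the same kind of "fake" value:
# hypothetical alternative to scanning the data for the minimum
day_zero = datetime.date.min  # 0001-01-01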
EDIT
To keep it more readable (and not single-line-like), a key factory (a function that builds and returns the comparison function) can be introduced and passed to the key parameter of max (or min).
# ...
def date_formatter(mydate):
return datetime.datetime.strptime(mydate, '%b-%d-%Y').date()
# smallest date
day_zero = min(date_formatter(d) for d in ftp_json.values())
# key factory containing the logic of the comparison criteria
def ordering(pattern):
def _wrapper(pair):
if pair[0].startswith(pattern):
# cast to date-object if the "mask"/pattern is correct
return date_formatter(pair[1])
else:
# return default smallest date-object -> will not influence the max-function
return day_zero
return _wrapper
# get the maximum for each mask
m = [max(ftp_json.items(), key=ordering(pattern)) for pattern in daily_updated]
This can no doubt be done more simply, but I think this example is a descriptive way to do this with the standard library.
from datetime import datetime
ftp_json = {
"kgrd0118.arj": "Jan-18-2007",
"kgrd0623.arj": "Jun-23-2005",
"kgrd0624.arj": "Jun-24-2005",
"cvhd0629.ARJ": "Jan-29-2021",
"cvhd1026.arj": "Oct-26-2015",
"cvhd1125.ARJ": "Nov-25-2019",
"cvhd0222.ARJ": "Feb-22-2022",
"metd0228.ARJ": "Feb-28-2022",
"metd0321.ARJ": "Mar-26-2021",
}
max_dates = {} # New dict for storing running maximums.
for k, v in ftp_json.items():
d = datetime.strptime(v, "%b-%d-%Y") # Use datetime for comparison.
    # setdefault returns the previously stored tuple for this prefix,
    # or stores and returns the current one if none was set yet.
    maxk, maxv, maxd = max_dates.setdefault(k[:4], (k, v, d))
    if d > maxd:  # Update the values if the current date is more recent.
max_dates[k[:4]] = (k, v, d)
# Validate we stored the correct values.
assert [v[0] for v in max_dates.values()] == [
"kgrd0118.arj",
"cvhd0222.ARJ",
"metd0228.ARJ",
]
How can I do a very explicit sort on a list in Python? What I mean is, items are supposed to be sorted a very specific way, not just alphabetically or numerically. The input I would be receiving looks something like this:
h43948fh4349f84 ./.file.html
dsfj940j90f94jf ./abcd.ppt
f9j3049fj349f0j ./abcd_FF_000000001.jpg
f0f9049jf043930 ./abcd_FF_000000002.jpg
j909jdsa094jf49 ./abcd_FF_000000003.jpg
jf4398fj9348fjj ./abcd_FFinit.jpg
9834jf9483fj43f ./abcd_MM_000000001.jpg
fj09jw93fj930fj ./abcd_MM_000000002.jpg
fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
vyr89r8y898r839 ./abcd_MMinit.jpg
The list should be sorted:
html file first
ppt file second
FFinit file third
MMinit file fourth
The rest of the numbered files in the order of FF/MM
Example output for this would look like:
h43948fh4349f84 ./.file.html
dsfj940j90f94jf ./abcd.ppt
jf4398fj9348fjj ./abcd_FFinit.jpg
vyr89r8y898r839 ./abcd_MMinit.jpg
f9j3049fj349f0j ./abcd_FF_000000001.jpg
9834jf9483fj43f ./abcd_MM_000000001.jpg
f0f9049jf043930 ./abcd_FF_000000002.jpg
fj09jw93fj930fj ./abcd_MM_000000002.jpg
j909jdsa094jf49 ./abcd_FF_000000003.jpg
fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
You need to define a key function, to guide the sorting. When comparing values to see what goes where, the result of the key function is then used instead of the values directly.
The key function can return anything, but here a tuple would be helpful. Tuples are compared lexicographically, meaning that only their first elements are compared unless they are equal, after which the second elements are used. If those are equal too, further elements are compared, until there are no more elements or an order has been determined.
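For example, a quick illustration of how those comparisons behave (tuples of different lengths are fine; Python only compares as far as it needs to):
>>> (0,) < (4, '000000001', 'FF')
True
>>> (4, '000000001', 'FF') < (4, '000000002', 'FF')
True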
For your case, you could produce a number in the first location, to order the 'special' entries, then for the remainder return the number in the second position and the FF or MM string in the last:
def key(filename):
if filename.endswith('.html'):
return (0,) # html first
if filename.endswith('.ppt'):
return (1,) # ppt second
if filename.endswith('FFinit.jpg'):
return (2,) # FFinit third
if filename.endswith('MMinit.jpg'):
        return (3,)  # MMinit fourth
# take last two parts between _ characters, ignoring the extension
_, FFMM, number = filename.rpartition('.')[0].rsplit('_', 2)
# rest is sorted on the number (compared here lexicographically) and FF/MM
return (4, number, FFMM)
Note that the tuples don't even need to be of equal length.
This produces the expected output:
>>> from pprint import pprint
>>> lines = '''\
... h43948fh4349f84 ./.file.html
... dsfj940j90f94jf ./abcd.ppt
... f9j3049fj349f0j ./abcd_FF_000000001.jpg
... f0f9049jf043930 ./abcd_FF_000000002.jpg
... j909jdsa094jf49 ./abcd_FF_000000003.jpg
... jf4398fj9348fjj ./abcd_FFinit.jpg
... 9834jf9483fj43f ./abcd_MM_000000001.jpg
... fj09jw93fj930fj ./abcd_MM_000000002.jpg
... fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
... vyr89r8y898r839 ./abcd_MMinit.jpg
... '''.splitlines()
>>> pprint(sorted(lines, key=key))
['h43948fh4349f84 ./.file.html',
'dsfj940j90f94jf ./abcd.ppt',
'jf4398fj9348fjj ./abcd_FFinit.jpg',
'vyr89r8y898r839 ./abcd_MMinit.jpg',
'f9j3049fj349f0j ./abcd_FF_000000001.jpg',
'9834jf9483fj43f ./abcd_MM_000000001.jpg',
'f0f9049jf043930 ./abcd_FF_000000002.jpg',
'fj09jw93fj930fj ./abcd_MM_000000002.jpg',
'j909jdsa094jf49 ./abcd_FF_000000003.jpg',
'fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg']
You can use the key argument to sort(). The function you pass as key accepts an element of the list and returns a value that can be compared with other return values to determine the sorting order. One possibility is to assign a number to each criterion, exactly as you describe in your question.
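For instance, a minimal sketch of that idea (the rank helper and the lines list are illustrative names, not from the question; the slice positions assume the fixed-width names shown above):
def rank(line):
    # one value per criterion, in the required order
    if line.endswith('.html'):
        return (0,)
    if line.endswith('.ppt'):
        return (1,)
    if line.endswith('FFinit.jpg'):
        return (2,)
    if line.endswith('MMinit.jpg'):
        return (3,)
    # numbered files: order by the number, then FF before MM
    return (4, int(line[-13:-4]), line[-16:-14])

lines.sort(key=rank)  # lines being the list of input strings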
Use sorted and a custom key function.
strings = ['h43948fh4349f84 ./.file.html',
'dsfj940j90f94jf ./abcd.ppt',
'f9j3049fj349f0j ./abcd_FF_000000001.jpg',
'f0f9049jf043930 ./abcd_FF_000000002.jpg',
'j909jdsa094jf49 ./abcd_FF_000000003.jpg',
'jf4398fj9348fjj ./abcd_FFinit.jpg',
'9834jf9483fj43f ./abcd_MM_000000001.jpg',
'fj09jw93fj930fj ./abcd_MM_000000002.jpg',
'fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg',
'vyr89r8y898r839 ./abcd_MMinit.jpg']
def key(string):
if string.endswith('html'):
return 0,
elif string.endswith('ppt'):
return 1,
elif string.endswith('FFinit.jpg'):
return 2,
elif string.endswith('MMinit.jpg'):
return 3,
elif string[-16:-14] == 'FF':
return 4, int(string[-13:-4]), 0
elif string[-16:-14] == 'MM':
return 4, int(string[-13:-4]), 1
result = sorted(strings, key=key)
for string in result:
print(string)
Out:
h43948fh4349f84 ./.file.html
dsfj940j90f94jf ./abcd.ppt
jf4398fj9348fjj ./abcd_FFinit.jpg
vyr89r8y898r839 ./abcd_MMinit.jpg
f9j3049fj349f0j ./abcd_FF_000000001.jpg
9834jf9483fj43f ./abcd_MM_000000001.jpg
f0f9049jf043930 ./abcd_FF_000000002.jpg
fj09jw93fj930fj ./abcd_MM_000000002.jpg
j909jdsa094jf49 ./abcd_FF_000000003.jpg
fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
I assumed the last ordering point just looked at the number before the file extension (e.g. 000001).
def custom_key(x):
substring_order = ['.html','.ppt','FFinit','MMinit']
other_order = lambda x: int(x.split('_')[-1].split('.')[0])+len(substring_order)
for i,o in enumerate(substring_order):
if o in x:
return i
return other_order(x)
sorted_list = sorted(strings, key=custom_key)  # strings is the list of lines shown above
import pprint
pprint.pprint(sorted_list)
Out:
['h43948fh4349f84 ./.file.html',
'dsfj940j90f94jf ./abcd.ppt',
'jf4398fj9348fjj ./abcd_FFinit.jpg',
'vyr89r8y898r839 ./abcd_MMinit.jpg',
'f9j3049fj349f0j ./abcd_FF_000000001.jpg',
'9834jf9483fj43f ./abcd_MM_000000001.jpg',
'f0f9049jf043930 ./abcd_FF_000000002.jpg',
'fj09jw93fj930fj ./abcd_MM_000000002.jpg',
'j909jdsa094jf49 ./abcd_FF_000000003.jpg',
'fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg']
I have a script to pull random numbers from a set of values. However, it broke today because min() and max() compare values in lexicographic order (so 200 is considered greater than 10000). How can I avoid lexicographic order here? Using len as the key is on the right track but not quite right, and I couldn't find any other key(s) that would help.
data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
First try:
highest_value = os.path.splitext(max(data_set))[0]
lowest_value = os.path.splitext(min(data_set))[0]
returns: lowest_value = 10000 highest_value = 6800
Second try:
highest_value = os.path.splitext(max(data_set,key=len))[0]
lowest_value = os.path.splitext(min(data_set,key=len))[0]
returns: lowest_value = 1600 highest_value = 10000
Thanks.
You can use key to order by the numeric part of the file:
data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
highest = max(data_set, key=lambda x: int(x.split('.')[0]))
lowest = min(data_set, key=lambda x: int(x.split('.')[0]))
print(highest) # >> 21005.csv
print(lowest) # >> 1600.csv
You were close. Rather than using the result of splitext with the len function, use the int function instead:
>>> from os.path import splitext
>>> data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
>>> def convert_to_int(file_name):
return int(splitext(file_name)[0])
>>> min(data_set, key=convert_to_int)
'1600.csv'
>>> max(data_set, key=convert_to_int)
'21005.csv'
Of course, this solution assumes that your file name will consist solely of numerical values.
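If that assumption might not hold, one option (a sketch, not part of the original answer) is to filter out non-numeric names first with str.isdigit:
>>> numeric_only = [f for f in data_set if splitext(f)[0].isdigit()]
>>> max(numeric_only, key=convert_to_int)
'21005.csv'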
My quantized data, 100M in size:
(1424411938, [3885, 7898])
(3333333333, [3885, 7898])
Desired result:
(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])
So what I want is to transform the data so that I group 3885 (for example) with all the data[0] values that have it. Here is what I did in Python:
def prepare(data):
result = []
for point_id, cluster in data:
for index, c in enumerate(cluster):
found = 0
for res in result:
if c == res[0]:
found = 1
if(found == 0):
result.append((c, []))
for res in result:
if c == res[0]:
res[1].append(point_id)
return result
but when I mapPartitions()'ed the data RDD with prepare(), it seems to do what I want only within each partition, and thus returns a bigger result than desired.
For example, if the 1st record in the start was in the 1st partition and the 2nd in the 2nd, then I would get as a result:
(3885, [3333333333])
(7898, [3333333333])
(3885, [1424411938])
(7898, [1424411938])
How to modify my prepare() to get the desired effect? Alternatively, how to process the result that prepare() produces, so that I can get the desired result?
As you may already have noticed from the code, I do not care about speed at all.
Here is a way to create the data:
data = []
from random import randint
for i in range(0, 10):
data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)
You can use a bunch of basic pyspark transformations to achieve this.
>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
We used flatMap to get a key, value pair for every item in x[1], changing each line of data to the format (a, x[0]), where a is each item in x[1]. To understand flatMap better, you can look at the documentation.
>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
We just grouped all key, value pairs by their keys and used the tuple function to convert the resulting iterable to a tuple.
>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
Since you said you can use [:150] to keep the first 150 elements, I guess this would be the proper usage:
r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])[:150]))
I tried to be as explanatory as possible. I hope this helps.
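As a side note that goes beyond the question: groupByKey materializes every value for a key at once, which can be memory-hungry on large data. A commonly suggested alternative is reduceByKey, which combines values as it goes; a sketch using only standard RDD methods (the order of ids inside each list depends on partitioning, so it may vary):
>>> r3 = r.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)
>>> r3.collect()
[(3885, [1424411938, 3333333333]), (7898, [1424411938, 3333333333])]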
I have a dictionary where the key is a string and the values of the key are a set of strings that also contain the key (word chaining). I'm having trouble finding the max depth of a graph, which would be the set with the most elements in the dictionary, and I'm trying to print out that max graph as well.
Right now my code prints:
{'DOG': [],
'HIPPOPOTIMUS': [],
'POT': ['SUPERPOT', 'HIPPOPOTIMUS'],
'SUPERPOT': []}
1
Where 1 is my maximum dictionary depth. I was expecting the depth to be two, but there appears to be only one layer to the graph of 'POT'.
How can I find the maximum value set from the set of keys in a dictionary?
import pprint
def dict_depth(d, depth=0):
if not isinstance(d, dict) or not d:
return depth
print(max(dict_depth(v, depth + 1) for k, v in d.items()))
def main():
for keyCheck in wordDict:
for keyCompare in wordDict:
if keyCheck in keyCompare:
if keyCheck != keyCompare:
wordDict[keyCheck].append(keyCompare)
if __name__ == "__main__":
#load the words into a dictionary
wordDict = dict((x.strip(), []) for x in open("testwordlist.txt"))
main()
pprint.pprint (wordDict)
dict_depth(wordDict)
testwordlist.txt:
POT
SUPERPOT
HIPPOPOTIMUS
DOG
The "depth" of a dictionary will naturally be 1 plus the maximum depth of its entries. You've defined the depth of a non-dictionary to be zero. Since your top-level dictionary doesn't contain any dictionaries of its own, the depth of your dictionary is clearly 1. Your function reports that value correctly.
However, your function isn't written for the data format you're providing it. We can easily come up with inputs where the depth of the substring chains is more than just one. For example:
DOG
DOGMA
DOGMATIC
DOGHOUSE
POT
Output of your current script:
{'DOG': ['DOGMATIC', 'DOGMA', 'DOGHOUSE'],
'DOGHOUSE': [],
'DOGMA': ['DOGMATIC'],
'DOGMATIC': [],
'POT': []}
1
I think you want to get 2 for that input because the longest substring chain is DOG → DOGMA → DOGMATIC, which contains two hops.
To get the depth of a dictionary as you've structured it, you want to calculate the chain length for each word. That's 1 plus the maximum chain length among the words that contain it (or 0 if nothing contains it), which gives us the following two functions:
def word_chain_length(d, w):
if len(d[w]) == 0:
return 0
return 1 + max(word_chain_length(d, ww) for ww in d[w])
def dict_depth(d):
print(max(word_chain_length(d, w) for w in d))
The word_chain_length function given here isn't particularly efficient. It may end up calculating the lengths of the same chain multiple times if a string is a substring of many words. Dynamic programming is a simple way to mitigate that, which I'll leave as an exercise.
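For completeness, here is one way to do that memoization (a sketch, not part of the original answer): functools.lru_cache caches results per argument, and defining the cached helper inside dict_depth keeps the cache scoped to a single dictionary so we can key on the word alone:
from functools import lru_cache

def dict_depth(d):
    @lru_cache(maxsize=None)
    def chain_length(w):
        # same logic as word_chain_length, but each word is computed only once
        if not d[w]:
            return 0
        return 1 + max(chain_length(ww) for ww in d[w])
    print(max(chain_length(w) for w in d))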
Sorry, my examples won't be in Python because my Python is rusty, but you should get the idea.
Let's say this is a binary tree (written in C++):
int depth(TreeNode* root){
if(!root) return 0;
return 1+max(depth(root->left), depth(root->right));
}
Simple. Now let's expand this to more than just a left and a right.
(golang code)
type Dic map[string]interface{}

func depthfunc(dic Dic) int {
	if dic == nil {
		return 0
	}
	level := make([]int, 0)
	for _, anotherDic := range dic {
		depth := 1
		if inner, ok := anotherDic.(Dic); ok { // check if it goes down further
			depth = 1 + depthfunc(inner)
		}
		level = append(level, depth)
	}
	// find max
	max := 0
	for _, value := range level {
		if value > max {
			max = value
		}
	}
	return max
}
The idea is that you just go down each dictionary until there are no more dictionaries to go down, adding 1 for each level you traverse.
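Since everything else on this page is Python, here is a rough Python rendering of the same idea (my sketch, not part of the original answer):
def depth(dic):
    # a non-dict (or an empty dict) contributes nothing to the depth
    if not isinstance(dic, dict) or not dic:
        return 0
    # 1 for this level, plus the deepest of the nested values
    return 1 + max(depth(value) for value in dic.values())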