How to create value pairs with lambda in pyspark?

I am trying to convert a pyspark rdd like this:
before:
[
[('169', '5'), ('2471', '6'), ('48516', '10')],
[('58', '7'), ('163', '7')],
[('172', '5'), ('186', '4'), ('236', '6')]
]
after:
[
[('169', '5'), ('2471', '6')],
[('169', '5'),('48516', '10')],
[('2471', '6'), ('48516', '10')],
[('58', '7'), ('163', '7')],
[('172', '5'), ('186', '4')],
[('172', '5'), ('236', '6')],
[('186', '4'), ('236', '6')]
]
The idea is to go through each row and emit every pairwise combination of its elements as a new row. I tried to work out a solution myself from lambda tutorials, but without success. May I ask for some help? If this duplicates another question, I apologize. Thanks!

I'd use flatMap with itertools.combinations:
from itertools import combinations
rdd.flatMap(lambda xs: combinations(xs, 2))
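To see what this produces without a Spark cluster, here is a minimal plain-Python sketch. It assumes that chaining the per-row results with itertools.chain.from_iterable is a fair local stand-in for what flatMap does on an RDD; the helper name pairwise_rows is made up for illustration.

```python
from itertools import chain, combinations

rows = [
    [('169', '5'), ('2471', '6'), ('48516', '10')],
    [('58', '7'), ('163', '7')],
    [('172', '5'), ('186', '4'), ('236', '6')],
]

def pairwise_rows(rows):
    """Hypothetical local stand-in for rdd.flatMap(lambda xs: combinations(xs, 2))."""
    # combinations(xs, 2) yields every 2-element combination of one row;
    # chain.from_iterable flattens the per-row results into one sequence.
    return list(chain.from_iterable(combinations(xs, 2) for xs in rows))

pairs = pairwise_rows(rows)
for pair in pairs:
    print(pair)
```

Note that combinations yields each pair as a tuple rather than a list. If lists are needed as in the question, returning [list(p) for p in combinations(xs, 2)] from the lambda should work.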

Related

How to sort a list of tuples but with one specific tuple being the first?

I'm building an application to find the best path for a delivery.
The delivery person sends me their path:
[
('0', '1'),
('1', '2'),
('0', '2'),
('2', '0')
]
... where every pair of numbers is a location and smaller numbers are closer. They also send me their starting point, for example: 2.
I wrote a function to sort from lowest to highest:
def lowToHigh(trajet):
    trajet_opti = trajet
    print(sorted(trajet_opti))
lowToHigh([
('0', '1'),
('1', '2'),
('0', '2'),
('2', '0')
])
The output is like this:
[('0', '1'), ('0', '2'), ('1', '2'), ('2', '0')]
I need a function that puts the tuple with the starting number first:
def starting_tuple():
    starting_number = 2
    .
    .
    .
Which returns something like this:
[('2', '0'), ('0', '1'), ('0', '2'), ('1', '2')]

Sort with a key that adds another tuple element representing whether the list item equals the starting point.
>>> path = [
... ('0', '1'),
... ('1', '2'),
... ('0', '2'),
... ('2', '0')
... ]
>>> sorted(path, key=lambda c: (c[0] != '2', c))
[('2', '0'), ('0', '1'), ('0', '2'), ('1', '2')]
The expression c[0] != '2' will be False (0) for the starting point and True (1) for all others, which will force the starting point to come at the start of the list. If there are multiple starting points, they will be sorted normally relative to each other.
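As a sketch, the same key can be wrapped in a function that takes the starting point as a parameter. The name start_first is hypothetical, and the starting point is assumed to be a string, like the tuple elements in the question.

```python
def start_first(path, start):
    """Sort path, placing tuples whose first element equals start at the front."""
    # False sorts before True, so matching tuples come first;
    # ties are broken by comparing the tuples themselves.
    return sorted(path, key=lambda c: (c[0] != start, c))

path = [('0', '1'), ('1', '2'), ('0', '2'), ('2', '0')]
result = start_first(path, '2')
print(result)
```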

Round off some values of a tuple

I have tuples like this (I'm not sure whether this is called a list of tuples or not):
ratings = [('5', 45.58139534883721), ('4', 27.44186046511628), ('3', 20.0), ('2', 5.116279069767442), ('1', 1.8604651162790697)]
I want to round (or truncate, either is fine) the second value to 2 decimal places, like this:
[('5', 45.58), ('4', 27.44), ('3', 20.0), ('2', 5.11), ('1', 1.86)]
I tried something like this:
l = tuple([round(x,2) if isinstance(x, float) else x for x in ratings])
But this doesn't seem to work. What can I try?

Round the 2nd element of your tuples only:
ratings = [('5', 45.58139534883721), ('4', 27.44186046511628), ('3', 20.0), ('2', 5.116279069767442), ('1', 1.8604651162790697)]
l = [(item[0],round(item[1],2)) for item in ratings]
# [('5', 45.58), ('4', 27.44), ('3', 20.0), ('2', 5.12), ('1', 1.86)]
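Note that round gives 5.12 for the fourth value, while the expected output in the question shows 5.11, which is what truncation would give. If truncation is what is wanted, one possible sketch uses math.trunc; the helper name truncate is made up here, and because floats are binary, a value sitting just below a decimal boundary may truncate one step lower than expected.

```python
import math

ratings = [('5', 45.58139534883721), ('4', 27.44186046511628), ('3', 20.0),
           ('2', 5.116279069767442), ('1', 1.8604651162790697)]

def truncate(value, places=2):
    """Hypothetical helper: drop digits beyond `places` decimals without rounding."""
    factor = 10 ** places
    # math.trunc discards the fractional part, rounding toward zero.
    return math.trunc(value * factor) / factor

truncated = [(label, truncate(score)) for label, score in ratings]
print(truncated)
```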

Unpacking a list of tuples [closed]

How can I unpack the following list?
[('1', 'GENERAL', '1'), ('1.1', 'RELATED DOCUMENTS', '1'), ('1.2', 'SUMMARY', '1'), ('1.3', 'DEFINITIONS', '1'), ('1.4', 'INFORMATIONAL SUBMITTALS', '2'), ('1.5', 'GENERAL COORDINATION PROCEDURES', '2'), ('1.6', 'COORDINATION DRAWINGS', '3'), ('1.7', 'REQUESTS FOR INFORMATION (RFIs)', '4'), ('1.8', 'PROJECT MEETINGS', '6')]
[[('2', 'PRODUCTS – NOT APPLICABLE', '10')]]
Following a solution from another post, I tried:
Part, Title, Page = zip(*text_good[0])
But got the error
too many values to unpack (expected 3)
And I also tried
Part1[a].append(Part for Part, Title, Page in text_good[0])
Part2[a].append(Part for Part, Title, Page in text_good[1])
Part3[a].append(Part for Part, Title, Page in text_good[2])
But this seemed to return a spot in memory and I could not open the array because I received an error stating it is not pickable.
Thanks
Update:
Assignment of text_good
for i in range(0, len(text_between_parts)):
    text_good[i].append(re.findall(r'\s*(\b\d+(?:[.]\d+)?)\W+\s*(.*?)\s*(\b\d+\b)', text_between_parts[i]))
Update 2: When I do text_good[0] I get
[[('1', 'GENERAL', '1'), ('1.1', 'RELATED DOCUMENTS', '1'), ('1.2', 'SUMMARY', '1'), ('1.3', 'DEFINITIONS', '1'), ('1.4', 'INFORMATIONAL SUBMITTALS', '2'), ('1.5', 'GENERAL COORDINATION PROCEDURES', '2'), ('1.6', 'COORDINATION DRAWINGS', '3'), ('1.7', 'REQUESTS FOR INFORMATION (RFIs)', '4'), ('1.8', 'PROJECT MEETINGS', '6')]]
and when I do text_good[0][0] I get
[('1', 'GENERAL', '1'), ('1.1', 'RELATED DOCUMENTS', '1'), ('1.2', 'SUMMARY', '1'), ('1.3', 'DEFINITIONS', '1'), ('1.4', 'INFORMATIONAL SUBMITTALS', '2'), ('1.5', 'GENERAL COORDINATION PROCEDURES', '2'), ('1.6', 'COORDINATION DRAWINGS', '3'), ('1.7', 'REQUESTS FOR INFORMATION (RFIs)', '4'), ('1.8', 'PROJECT MEETINGS', '6')]
Notice the extra bracket when I do text_good[0].

OK, I think we need a little clarification here first. I'm not sure exactly what the list is, so I will make the following assumption (if it is wrong, please let me know so I can fix it):
text_good = [[('1', 'GENERAL', '1'), ('1.1', 'RELATED DOCUMENTS', '1'), ('1.2', 'SUMMARY', '1'), ('1.3', 'DEFINITIONS', '1'), ('1.4', 'INFORMATIONAL SUBMITTALS', '2'), ('1.5', 'GENERAL COORDINATION PROCEDURES', '2'), ('1.6', 'COORDINATION DRAWINGS', '3'), ('1.7', 'REQUESTS FOR INFORMATION (RFIs)', '4'), ('1.8', 'PROJECT MEETINGS', '6')], [('2', 'PRODUCTS - NOT APPLICABLE', '10')]]
Where now if I do text_good[0] I get:
[('1', 'GENERAL', '1'),
('1.1', 'RELATED DOCUMENTS', '1'),
('1.2', 'SUMMARY', '1'),
('1.3', 'DEFINITIONS', '1'),
('1.4', 'INFORMATIONAL SUBMITTALS', '2'),
('1.5', 'GENERAL COORDINATION PROCEDURES', '2'),
('1.6', 'COORDINATION DRAWINGS', '3'),
('1.7', 'REQUESTS FOR INFORMATION (RFIs)', '4'),
('1.8', 'PROJECT MEETINGS', '6')]
and text_good[1] would be:
[('2', 'PRODUCTS - NOT APPLICABLE', '10')]
To me this looks like a list of tuples where ('1', 'GENERAL', '1') corresponds to Part, Title, Page, in that order.
If that is the case, you can do something like this:
Parts, Title, Page = zip(*[t for l in text_good for t in l])
Where in this case you get:
print Parts # ('1', '1.1', '1.2', '1.3', '1.4', '1.5', '1.6', '1.7', '1.8', '2')
print Title # ('GENERAL',
# 'RELATED DOCUMENTS',
# 'SUMMARY',
# 'DEFINITIONS',
# 'INFORMATIONAL SUBMITTALS',
# 'GENERAL COORDINATION PROCEDURES',
# 'COORDINATION DRAWINGS',
# 'REQUESTS FOR INFORMATION (RFIs)',
# 'PROJECT MEETINGS',
# 'PRODUCTS - NOT APPLICABLE')
print Page # ('1', '1', '1', '1', '2', '2', '3', '4', '6', '10')
Final Edit:
Because @JStuff has a list of lists of lists of tuples, we technically need 3 for loops to be able to extract the definitions he wants.
Parts, Title, Page = zip(*[t for l in text_good for ll in l for t in ll]) # Yay for list comprehension?
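For reference, here is a self-contained Python 3 sketch of the triple-nested case. The shape of text_good is an assumption based on the question's second update (each text_good[i] wraps its list of tuples in one extra list), and the data is shortened for illustration.

```python
from itertools import chain

# Assumed shape: list -> list -> list -> 3-tuples (shortened sample data).
text_good = [
    [[('1', 'GENERAL', '1'), ('1.1', 'RELATED DOCUMENTS', '1')]],
    [[('2', 'PRODUCTS - NOT APPLICABLE', '10')]],
]

# Flatten two levels of nesting to get a flat iterable of 3-tuples.
rows = chain.from_iterable(chain.from_iterable(text_good))

# zip(*rows) transposes the rows into three columns.
parts, titles, pages = zip(*rows)

print(parts)
print(titles)
print(pages)
```

Without the zip(*...) step, the flattened list would contain one tuple per row, and unpacking it into three names fails as soon as there are more or fewer than three rows.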

Python Look-And-Say Regex

I'm trying to write a regex for the Look-and-Say sequence in python. The idea is to split a given string into same-digit substrings. With trial and error, I have '((\d)\\2*)'.
For the pattern 11244455221116 this gives [('11', '1'), ('2', '2'), ('444', '4'), ('55', '5'), ('22', '2'), ('111', '1'), ('6', '6')] as expected. This works, but looks clumsy. Is there a cleaner way to do this, with or without regex?

You could use itertools.groupby:
import itertools as IT
text = '11244455221116'
print([(''.join(group), key) for key, group in IT.groupby(text)])
yields
[('11', '1'), ('2', '2'), ('444', '4'), ('55', '5'), ('22', '2'), ('111', '1'), ('6', '6')]
But re.findall is faster:
In [67]: %timeit [(''.join(group), key)for key, group in IT.groupby(text*100)]
1000 loops, best of 3: 528 us per loop
In [68]: %timeit re.findall(r'((\d)\2*)', text*100)
1000 loops, best of 3: 219 us per loop

Instead of splitting your string, you can do a replace with a lambda function:
re.sub(r'(\d)\1*', lambda x: str(len(x.group(0)))+x.group(1), '112224355')
result: 2132141325
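The same step can also be written without regex. This is a sketch using itertools.groupby; the function names look_and_say and sequence are made up for illustration.

```python
from itertools import groupby

def look_and_say(term):
    """Return the next look-and-say term for a string of digits."""
    # groupby yields (digit, run) for each run of identical characters;
    # each run becomes "<run length><digit>".
    return ''.join(str(len(list(group))) + digit for digit, group in groupby(term))

def sequence(seed, n):
    """Return the first n terms of the sequence starting from seed."""
    terms = [seed]
    for _ in range(n - 1):
        terms.append(look_and_say(terms[-1]))
    return terms

print(look_and_say('112224355'))
print(sequence('1', 5))
```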

Why am I not getting the result of sorted function in expected order?

print activities
activities = sorted(activities,key = lambda item:item[1])
print activities
Here, activities is a list of tuples of the form (start_number, finish_number). I expected the output of the code above to be the list sorted in increasing order of finish_number. When I ran it in the shell, I got the following output. I am not sure why the second list is not sorted in increasing order of finish_number. Please help me understand this.
[('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9'), ('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16')]
[('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16'), ('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9')]

You are sorting strings instead of integers: in that case, 10 is "smaller" than 4. To sort on integers, convert it to this:
activities = sorted(activities,key = lambda item:int(item[1]))
print activities
Results in:
[('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9'), ('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16')]

Your items are being compared as strings, not as numbers. Thus, since the 1 character comes before 4 lexicographically, it makes sense that 10 comes before 4.
You need to cast the value to an int first:
activities = sorted(activities,key = lambda item:int(item[1]))

You are sorting strings, not numbers. Strings get sorted character by character.
So, for example '40' is greater than '100' because character 4 is larger than 1.
You can fix this on the fly by simply casting the item as an integer.
activities = sorted(activities,key = lambda item: int(item[1]))

It's because you're not storing the number as a number, but as a string. The string '10' comes before the string '2'. Try:
activities = sorted(activities, key=lambda i: int(i[1]))

Look for a BROADER solution to your problem: Convert your data from str to int immediately on input, work with it as int (otherwise you'll continually be bumping into little problems like this), and format your data as str for output.
This principle applies generally, e.g. when working with non-ASCII string data, do UTF-8 -> unicode -> UTF-8; don't try to manipulate undecoded text.
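A minimal sketch of that convert-early approach, assuming the raw input arrives as tuples of digit strings like those in the question. The sample data and the "start-finish" output format are invented for illustration.

```python
# Raw input as it might arrive: tuples of digit strings (sample data).
raw = [('6', '10'), ('1', '4'), ('12', '16'), ('3', '9')]

# Convert to int once, at the boundary.
activities = [(int(start), int(finish)) for start, finish in raw]

# From here on, plain numeric comparison works; no int() in the key.
activities.sort(key=lambda item: item[1])

# Convert back to str only when producing output.
formatted = ['{}-{}'.format(start, finish) for start, finish in activities]
print(formatted)
```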
