Working with a pair in lists - python

I extract two values from a dataframe via:
date = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1), 'Planned_Date']
division = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1), 'Ber']
After that extract, I transform these values...
x = date.tolist()
y = division.tolist()
... and append them to a list
time.extend((x, y))
My result in PyCharm (after running the .extend in a loop over several values) is:
[[100], ['A'], [200], ['A'], [100], ['B']]
My first question: why is the result not like this instead?
[([100], ['A']), ([200], ['A']), ([100], ['B'])]
My second question: I want to calculate the average of all first items (the integers), both overall and per exec (exec = A, B).
The result would be: All: 133.33 | A: 150 | B: 100
How can I access all "first values" of the pairs in my list [(firstvalue, secondvalue), ...]?
For example:
time= np.round(np.mean(timeCleaned[ACCESS_ALL_"FIRST"_VALUES_IN_MY_LIST]), 2)
Thank you!
edit: Variable names.

extend unpacks and appends each item of an iterable to your list. Use append instead:
time.append((x, y))
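For the second question: once time holds (value, label) pairs via append, a list comprehension can pull out all first values. A minimal sketch, assuming each pair still contains the one-element lists that tolist() produced:
import numpy as np

# time looks like [([100], ['A']), ([200], ['A']), ([100], ['B'])]
values = [pair[0][0] for pair in time]   # all first items: [100, 200, 100]
labels = [pair[1][0] for pair in time]   # all second items: ['A', 'A', 'B']

overall = np.round(np.mean(values), 2)   # 133.33

# per-exec averages: {'A': 150.0, 'B': 100.0}
per_exec = {}
for value, label in zip(values, labels):
    per_exec.setdefault(label, []).append(value)
per_exec = {k: np.round(np.mean(v), 2) for k, v in per_exec.items()}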

Related

Drop duplicate lists within a nested list value in a column

I have a pandas dataframe with nested lists as values in a column as follows:
sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
'single_id':[[1234],[5678],[91011],[121314]],
'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['XYZAV','ADS23','ABC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],[['JFAJKA','HSJD123'],['JFAJKA','HSJD123']]],
'multi_id':[[[2167,2147,29481],[2167,2147,29481]],[2313,57567,2321,7898],[1123,8775],[[5237,43512],[5237,43512]]]})
As you can see above, in some columns, the same list is repeated twice or more.
So, I would like to remove the duplicated list and only retain one copy of the list.
I was trying something like the below:
for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'], sample_df['multi_item_list'], sample_df['multi_id'])):
    if (any(isinstance(i, list) for i in multi_item)) == False:
        for j, item_list in enumerate(multi_item):
            if single[0] in item_list:
                pos = item_list.index(single[0])
                sample_df.at[i, 'multi_item_list'] = [item_list]
                sample_df.at[i, 'multi_id'] = [multi_id[j]]
    else:
        print("under nested list")
        for j, item_list in enumerate(zip(multi_item, multi_id)):
            if single[0] in multi_item[j]:
                pos = multi_item[j].index(single[0])
                sample_df.at[i, 'multi_item_list'][j] = single[0]
                sample_df.at[i, 'multi_id'][j] = multi_id[j][pos]
            else:
                sample_df.at[i, 'multi_item_list'][j] = np.nan
                sample_df.at[i, 'multi_id'][j] = np.nan
But this assigns NaN to the whole column value. I expect only that specific duplicated list (within the nested list) to be removed.
My expected output is as below:
In the data it looks like removing duplicates is equivalent to keeping the first element in any list of lists while any standard lists are kept as they are. If this is true, then you can solve it as follows:
def get_first_list(x):
    if isinstance(x[0], list):
        return [x[0]]
    return x

for c in ['multi_item_list', 'multi_id']:
    sample_df[c] = sample_df[c].apply(get_first_list)
Result:
single_proj_name single_item_list single_id multi_proj_name multi_item_list multi_id
0 [jsfk] [ABC_123] [1234] [AAA, VVVV, SASD] [[XYZAV, ADS23, ABC_123]] [[2167, 2147, 29481]]
1 [fhjk] [DEF123] [5678] [QEWWQ, SFA, JKKK, fhjk] [XYZAV, DEF123, ABC_123, SAJKF] [2313, 57567, 2321, 7898]
2 [ERRW] [FAS324] [91011] [ERRW, TTTT] [QWER12, FAS324] [1123, 8775]
3 [SJBAK] [HSJD123] [121314] [SJBAK, YYYY] [[JFAJKA, HSJD123]] [[5237, 43512]]
To handle the case where there can be more than a single unique list the get_first_list method can be adjusted to:
def get_first_list(x):
    if isinstance(x[0], list):
        new_x = []
        for i in x:
            if i not in new_x:
                new_x.append(i)
        return new_x
    return x
This will keep the order of the sublists while removing any sublist duplicates.
More concisely, with the np.unique function:
cols = ['multi_item_list', 'multi_id']
sample_df[cols] = sample_df[cols].apply(lambda x: [np.unique(a, axis=0) if type(a[0]) == list else a for a in x.values])
In [382]: sample_df
Out[382]:
single_proj_name single_item_list single_id multi_proj_name \
0 [jsfk] [ABC_123] [1234] [AAA, VVVV, SASD]
1 [fhjk] [DEF123] [5678] [QEWWQ, SFA, JKKK, fhjk]
2 [ERRW] [FAS324] [91011] [ERRW, TTTT]
3 [SJBAK] [HSJD123] [121314] [SJBAK, YYYY]
multi_item_list multi_id
0 [[XYZAV, ADS23, ABC_123]] [[2167, 2147, 29481]]
1 [XYZAV, DEF123, ABC_123, SAJKF] [2313, 57567, 2321, 7898]
2 [QWER12, FAS324] [1123, 8775]
3 [[JFAJKA, HSJD123]] [[5237, 43512]]

Compare tuple values record wise

I have an ordered list of tuples (it is 2-dimensional: one column holds my endings, which I want to compare, and the other holds the complete URLs). I have to compare each ending with the following one; if they are the same, save the first value to another list and repeat. I want to compare every item with the following one to see whether they are equal or not.
tuple:
[('https://www.topart-online.com/de/Rose%2C-Micle%2C-kupfer%2C-52cm%2C-Oe-9cm/c-KAT240/a-XH0124KP', '/a-XH0124KP'), ('https://www.topart-online.com/de/Rose%2C-Micle%2C-kupfer%2C-52cm%2C-Oe-9cm/c-KAT183/a-XH0124KP', '/a-XH0124KP'), ('https://www.topart-online.com/de/Rose%2C-Micle%2C-kupfer%2C-52cm%2C-Oe-9cm/c-KAT173/a-XH0124KP', '/a-XH0124KP'), ('https://www.topart-online.com/de/Liguster-Zweig-50cm-mit-Glitter/c-KAT184/a-XM0721', '/a-XM0721'), ('https://www.topart-online.com/de/3D-Stern-schwarz-mit-Glitter%2C-7%2C5-cm---SUPER-DEAL/c-KAT14/a-XM1633ZW', '/a-XM1633ZW'), ('https://www.topart-online.com/de/Christbaumschmuck%2C-Zweige%2C-gold-30-cm----SUPER-DEAL/c-KAT14/a-XP0091', '/a-XP0091')]
I want to compare the product number extracted from the URL, because every product can possibly be found under multiple URLs.
My sorting attempt:
sized = len(complete_links2) - 1
for index, tuple in enumerate(complete_links2):
    index = k
    k = index + 1
    if k < sized:
        while complete_links2[index][1] == complete_links2[k][1]:
            k += 1
        if complete_links2[index][1] == complete_links2[k][1]:
            k -= 1
        not_rep_links.append(complete_links2[index])
complete_links3 = [a_tuple[0] for a_tuple in not_rep_links]
My problem is that some unique links also get filtered out, because my logic is not really good.
I also tried with a set and with unpacking the tuples, but I don't know how to continue.
I am still a bit confused, but is this what you want?
list_ = [
('https://www.topart-online.com/de/Rose%2C-Micle%2C-kupfer%2C-52cm%2C-Oe-9cm/c-KAT240/a-XH0124KP', '/a-XH0124KP'),
('https://www.topart-online.com/de/Rose%2C-Micle%2C-kupfer%2C-52cm%2C-Oe-9cm/c-KAT183/a-XH0124KP', '/a-XH0124KP'),
('https://www.topart-online.com/de/Rose%2C-Micle%2C-kupfer%2C-52cm%2C-Oe-9cm/c-KAT173/a-XH0124KP', '/a-XH0124KP'),
('https://www.topart-online.com/de/Liguster-Zweig-50cm-mit-Glitter/c-KAT184/a-XM0721', '/a-XM0721'),
('https://www.topart-online.com/de/3D-Stern-schwarz-mit-Glitter%2C-7%2C5-cm---SUPER-DEAL/c-KAT14/a-XM1633ZW', '/a-XM1633ZW'),
('https://www.topart-online.com/de/Christbaumschmuck%2C-Zweige%2C-gold-30-cm----SUPER-DEAL/c-KAT14/a-XP0091', '/a-XP0091')
]
products = []
links = []
for item in list_:
    if item[1] not in products:
        products.append(item[1])
        links.append(item[0])
print(links)
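A dictionary keyed on the product suffix achieves the same first-seen filtering in a single pass; a sketch, relying on dicts preserving insertion order (Python 3.7+):
first_per_product = {}
for url, product in list_:
    # setdefault stores only the first URL seen for each product suffix
    first_per_product.setdefault(product, url)
links = list(first_per_product.values())
print(links)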

Compare a list with a date in pandas in Python

Hello, I have this list (each element is a datetime64[ns] Series):
b = [[2018-12-14, 2019-01-11, 2019-01-25, 2019-02-08, 2019-02-22, 2019-07-26],
     [2018-06-14, 2018-07-11, 2018-07-25, 2018-08-08, 2018-08-22, 2019-01-26],
     [2017-12-14, 2018-01-11, 2018-01-25, 2018-02-08, 2018-02-22, 2018-07-26]]
and I want to know if it's possible to compare this list of dates with another date. I am doing it like this:
r = df.loc[(b[1] > vdate)]
with:
vdate = dt.date(2018, 9, 19)
the output is correct because it selects the values that satisfy the condition. But the problem is that I want to do that for all the list values. Something like:
r = df.loc[(b > vdate)] # Without [1]
but this gets an error as output, as I expected.
I tried a for loop and it seems like it works, but I am not sure:
g = []
for i in range(len(b)):
    r = df.loc[(b[i] > vdate)]
    g.append(r)
Thank you so much for your time and any help would be perfect.
One may use the apply function as suggested by @Joseph Developer, but a simple list comprehension does not require you to write a function. The following will give you a list of booleans telling you whether or not each date is greater than vdate:
is_after_b = [x > vdate for x in b]
And if you want to include this directly in your DataFrame, you may write:
df['is_after_b'] = [x > vdate for x in df.b]
This assumes that b is a column of df, which, by the way, would make sure that the length of b matches your DataFrame.
EDIT
I did not consider that b was a list of lists; you would need to flatten b by using:
flat_b = [item for sublist in b for item in sublist]
And you can now use :
is_after_b = [x > vdate for x in flat_b]
If you want to go through the entire list, just use the following method:
ds['new_list'] = ds['list_dates'].apply(function)
The .apply() method processes your list through a function.
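As a concrete sketch of that approach (the ds column and the comparison function here are hypothetical stand-ins):
import datetime as dt
import pandas as pd

vdate = dt.date(2018, 9, 19)
ds = pd.DataFrame({'list_dates': [[dt.date(2018, 12, 14), dt.date(2018, 6, 14)],
                                  [dt.date(2017, 12, 14), dt.date(2018, 2, 22)]]})

# per row, flag which dates fall after vdate
ds['new_list'] = ds['list_dates'].apply(lambda dates: [d > vdate for d in dates])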

How to sort a list in a very specific way in Python?

How can I do a very explicit sort on a list in Python? What I mean is that items are supposed to be sorted in a very specific way, not just alphabetically or numerically. The input I would be receiving looks something like this:
h43948fh4349f84 ./.file.html
dsfj940j90f94jf ./abcd.ppt
f9j3049fj349f0j ./abcd_FF_000000001.jpg
f0f9049jf043930 ./abcd_FF_000000002.jpg
j909jdsa094jf49 ./abcd_FF_000000003.jpg
jf4398fj9348fjj ./abcd_FFinit.jpg
9834jf9483fj43f ./abcd_MM_000000001.jpg
fj09jw93fj930fj ./abcd_MM_000000002.jpg
fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
vyr89r8y898r839 ./abcd_MMinit.jpg
The list should be sorted:
html file first
ppt file second
FFinit file third
MMinit file fourth
The rest of the numbered files in the order of FF/MM
Example output for this would look like:
h43948fh4349f84 ./.file.html
dsfj940j90f94jf ./abcd.ppt
jf4398fj9348fjj ./abcd_FFinit.jpg
vyr89r8y898r839 ./abcd_MMinit.jpg
f9j3049fj349f0j ./abcd_FF_000000001.jpg
9834jf9483fj43f ./abcd_MM_000000001.jpg
f0f9049jf043930 ./abcd_FF_000000002.jpg
fj09jw93fj930fj ./abcd_MM_000000002.jpg
j909jdsa094jf49 ./abcd_FF_000000003.jpg
fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
You need to define a key function, to guide the sorting. When comparing values to see what goes where, the result of the key function is then used instead of the values directly.
The key function can return anything, but here a tuple would be helpful. Tuples are compared lexicographically, meaning that only their first elements are compared unless they are equal, after which the second elements are used. If those are equal too, further elements are compared, until there are no more elements or an order has been determined.
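A quick interpreter check makes the lexicographic rule concrete:
>>> (1,) < (2, 'anything')  # first elements differ, so nothing else is compared
True
>>> (4, '000000001', 'FF') < (4, '000000001', 'MM')  # tie on the first two, then 'FF' < 'MM'
True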
For your case, you could produce a number in the first location, to order the 'special' entries, then for the remainder return the number in the second position and the FF or MM string in the last:
def key(filename):
    if filename.endswith('.html'):
        return (0,)  # html first
    if filename.endswith('.ppt'):
        return (1,)  # ppt second
    if filename.endswith('FFinit.jpg'):
        return (2,)  # FFinit third
    if filename.endswith('MMinit.jpg'):
        return (3,)  # MMinit fourth
    # take the last two parts between _ characters, ignoring the extension
    _, FFMM, number = filename.rpartition('.')[0].rsplit('_', 2)
    # the rest is sorted on the number (compared lexicographically here) and FF/MM
    return (4, number, FFMM)
Note that the tuples don't even need to be of equal length.
This produces the expected output:
>>> from pprint import pprint
>>> lines = '''\
... h43948fh4349f84 ./.file.html
... dsfj940j90f94jf ./abcd.ppt
... f9j3049fj349f0j ./abcd_FF_000000001.jpg
... f0f9049jf043930 ./abcd_FF_000000002.jpg
... j909jdsa094jf49 ./abcd_FF_000000003.jpg
... jf4398fj9348fjj ./abcd_FFinit.jpg
... 9834jf9483fj43f ./abcd_MM_000000001.jpg
... fj09jw93fj930fj ./abcd_MM_000000002.jpg
... fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
... vyr89r8y898r839 ./abcd_MMinit.jpg
... '''.splitlines()
>>> pprint(sorted(lines, key=key))
['h43948fh4349f84 ./.file.html',
'dsfj940j90f94jf ./abcd.ppt',
'jf4398fj9348fjj ./abcd_FFinit.jpg',
'vyr89r8y898r839 ./abcd_MMinit.jpg',
'f9j3049fj349f0j ./abcd_FF_000000001.jpg',
'9834jf9483fj43f ./abcd_MM_000000001.jpg',
'f0f9049jf043930 ./abcd_FF_000000002.jpg',
'fj09jw93fj930fj ./abcd_MM_000000002.jpg',
'j909jdsa094jf49 ./abcd_FF_000000003.jpg',
'fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg']
You can use the key argument to sort(). This method of the list class accepts an element of the list and returns a value that can be compared to other return values to determine sorting order. One possibility is to assign a number to each criteria exactly as you describe in your question.
Use sorted and a custom key function.
strings = ['h43948fh4349f84 ./.file.html',
'dsfj940j90f94jf ./abcd.ppt',
'f9j3049fj349f0j ./abcd_FF_000000001.jpg',
'f0f9049jf043930 ./abcd_FF_000000002.jpg',
'j909jdsa094jf49 ./abcd_FF_000000003.jpg',
'jf4398fj9348fjj ./abcd_FFinit.jpg',
'9834jf9483fj43f ./abcd_MM_000000001.jpg',
'fj09jw93fj930fj ./abcd_MM_000000002.jpg',
'fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg',
'vyr89r8y898r839 ./abcd_MMinit.jpg']
def key(string):
    if string.endswith('html'):
        return 0,
    elif string.endswith('ppt'):
        return 1,
    elif string.endswith('FFinit.jpg'):
        return 2,
    elif string.endswith('MMinit.jpg'):
        return 3,
    elif string[-16:-14] == 'FF':
        return 4, int(string[-13:-4]), 0
    elif string[-16:-14] == 'MM':
        return 4, int(string[-13:-4]), 1

result = sorted(strings, key=key)
for string in result:
    print(string)
Out:
h43948fh4349f84 ./.file.html
dsfj940j90f94jf ./abcd.ppt
jf4398fj9348fjj ./abcd_FFinit.jpg
vyr89r8y898r839 ./abcd_MMinit.jpg
f9j3049fj349f0j ./abcd_FF_000000001.jpg
9834jf9483fj43f ./abcd_MM_000000001.jpg
f0f9049jf043930 ./abcd_FF_000000002.jpg
fj09jw93fj930fj ./abcd_MM_000000002.jpg
j909jdsa094jf49 ./abcd_FF_000000003.jpg
fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg
I assumed the last ordering point just looked at the number before the file extension (e.g. 000001)
def custom_key(x):
    substring_order = ['.html', '.ppt', 'FFinit', 'MMinit']
    other_order = lambda x: int(x.split('_')[-1].split('.')[0]) + len(substring_order)
    for i, o in enumerate(substring_order):
        if o in x:
            return i
    return other_order(x)

sorted_list = sorted(data, key=custom_key)

import pprint
pprint.pprint(sorted_list)
Out:
['h43948fh4349f84 ./.file.html',
'dsfj940j90f94jf ./abcd.ppt',
'jf4398fj9348fjj ./abcd_FFinit.jpg',
'vyr89r8y898r839 ./abcd_MMinit.jpg',
'f9j3049fj349f0j ./abcd_FF_000000001.jpg',
'9834jf9483fj43f ./abcd_MM_000000001.jpg',
'f0f9049jf043930 ./abcd_FF_000000002.jpg',
'fj09jw93fj930fj ./abcd_MM_000000002.jpg',
'j909jdsa094jf49 ./abcd_FF_000000003.jpg',
'fjdsjfd89s8hs9h ./abcd_MM_000000003.jpg']

Prepare my bigdata with Spark via Python

My quantized data, 100M in size:
('1424411938', [3885, 7898])
('3333333333', [3885, 7898])
Desired result:
(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])
So what I want is to transform the data so that I group 3885 (for example) with all the data[0] values that have it. Here is what I did in Python:
def prepare(data):
    result = []
    for point_id, cluster in data:
        for index, c in enumerate(cluster):
            found = 0
            for res in result:
                if c == res[0]:
                    found = 1
            if found == 0:
                result.append((c, []))
            for res in result:
                if c == res[0]:
                    res[1].append(point_id)
    return result
but when I mapPartitions()'ed the data RDD with prepare(), it seems to do what I want only within the current partition, and thus returns a bigger result than desired.
For example, if the first record was in the first partition and the second in the second, then I would get as a result:
(3885, [3333333333])
(7898, [3333333333])
(3885, [1424411938])
(7898, [1424411938])
How to modify my prepare() to get the desired effect? Alternatively, how to process the result that prepare() produces, so that I can get the desired result?
As you may already have noticed from the code, I do not care about speed at all.
Here is a way to create the data:
data = []
from random import randint
for i in xrange(0, 10):
    data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)
You can use a bunch of basic pyspark transformations to achieve this.
>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
We used flatMap to get a key-value pair for every item in x[1], changing each record's format to (a, x[0]), where a is each item of x[1]. To understand flatMap better, you can look at the documentation.
>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
We just grouped all key-value pairs by their keys and used the tuple function to convert the iterable to a tuple.
>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
As you said, you can use [:150] to keep the first 150 elements; I guess this would be the proper usage:
r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])[:150]))
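As a side note, groupByKey materializes every value for a key before the [:150] cut; if memory matters, aggregateByKey can cap each list during aggregation instead. A hedged sketch of that alternative:
# cap each key's list at 150 while aggregating, rather than truncating afterwards
r2 = r.aggregateByKey(
    [],                                                   # start value: empty list per key
    lambda acc, v: acc + [v] if len(acc) < 150 else acc,  # fold a value in within a partition
    lambda a, b: (a + b)[:150])                           # merge partial lists across partitions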
I tried to be as explanatory as possible. I hope this helps.
