I have two arrays, and I want to loop through the second one, keeping only the sub-arrays whose first element equals an element of the first array.
a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56, 23, 54], [10, 8, 52, 30, 15, 47, 109],
     [11, 81, 152, 54, 112, 78, 167], [13, 82, 84, 63, 24, 26, 78],
     [18, 182, 25, 63, 96, 104, 74]]
I have two different arrays, a and b. I would like to look through each of the sub-arrays within b, keep the ones whose first value equals a value in array a, and collect them into a new array, c.
The result I am looking for is:
c = [[10, 8, 52, 30, 15, 47, 109],[11, 81, 152, 54, 112, 78, 167],[13, 82, 84, 63, 24, 26, 78]]
Does Python have a tool for this, similar to Excel's MATCH()?
I tried looping in a manner such as:
for i in a:
    if i in b:
        print(b)
But because there are other elements within the array, this way is not working. Any help would be greatly appreciated.
Further explanation of the problem:
a = [5, 6, 7, 9, 12]
I read in an Excel file using xlrd (b_csv_data):
Start  Count  Error  Constant  Result1  Result2  Result3  Result4
5      41     0      45        23       54       66       19
5.4    44     1      21        52       35       6        50
6      16     1      42        95       39       1        13
6.9    50     1      22        71       86       59       97
7      38     1      43        50       47       83       67
8      26     1      29        100      63       15       40
9      46     0      28        85       9        27       81
12     43     0      21        74       78       20       85
Next, I created a loop to read in a select number of rows. For simplicity, the file above only has a few rows; my actual file has about 100.
for r in range(1, 7):  # skipping headers and only wanting the first few rows to start
    b_raw = b_csv_data.row_values(r)
    b = np.array(b_raw)  # I created this b numpy array from the line of code above
Use np.isin -
In [8]: b[np.isin(b[:,0],a)]
Out[8]:
array([[ 10,   8,  52,  30,  15,  47, 109],
       [ 11,  81, 152,  54, 112,  78, 167],
       [ 13,  82,  84,  63,  24,  26,  78]])
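Note that np.isin needs b as a 2-D NumPy array; if you start from the plain lists in the question, convert first. A minimal self-contained sketch:

import numpy as np

a = [10, 11, 12, 13, 14]
b = np.array([[9, 23, 45, 67, 56, 23, 54],
              [10, 8, 52, 30, 15, 47, 109],
              [11, 81, 152, 54, 112, 78, 167],
              [13, 82, 84, 63, 24, 26, 78],
              [18, 182, 25, 63, 96, 104, 74]])

c = b[np.isin(b[:, 0], a)]  # keep rows whose first column appears in a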
With sorted a, we can also use np.searchsorted -
a = np.asarray(a)  # a must be an array (and sorted) for the fancy indexing below
idx = np.searchsorted(a, b[:,0])
idx[idx == len(a)] = 0  # clip out-of-bounds positions to a valid index
out = b[a[idx] == b[:,0]]
If you have an array with a different number of elements per row, which is essentially an array of lists, you need to modify the slicing part. So, in that case, get the first element of each row -
b0 = [bi[0] for bi in b]
Then, use b0 to replace all instances of b[:,0] in earlier posted methods.
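A sketch of that ragged case, keeping the rows as plain Python lists (the data here is shortened for illustration):

import numpy as np

a = [10, 11, 13]
b = [[9, 23, 45], [10, 8], [11, 81, 152, 54], [13, 82, 84, 63, 24], [18]]

b0 = [bi[0] for bi in b]  # first element of every row
mask = np.isin(b0, a)     # same membership test, applied to b0
c = [row for row, keep in zip(b, mask) if keep]
print(c)  # [[10, 8], [11, 81, 152, 54], [13, 82, 84, 63, 24]]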
Use a list comprehension:
c = [l for l in b if l[0] in a]
Output:
[[10, 8, 52, 30, 15, 47, 109], [11, 81, 152, 54, 112, 78, 167], [13, 82, 84, 63, 24, 26, 78]]
If your lists or arrays are considerably large, using numpy.isin can be significantly faster:
b[np.isin(b[:, 0], a), :]
Benchmark:
import timeit
import numpy as np

a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56], [10, 8, 52, 30, 15], [11, 81, 152, 54, 112],
     [13, 82, 84, 63, 24], [18, 182, 25, 63, 96]]
list_comp, np_isin = [], []
for i in range(1, 100):
    a_test = a * i
    b_test = b * i
    list_comp.append(timeit.timeit('[l for l in b_test if l[0] in a_test]',
                                   number=10, globals=globals()))
    a_arr = np.array(a_test)
    b_arr = np.array(b_test)
    np_isin.append(timeit.timeit('b_arr[np.isin(b_arr[:, 0], a_arr), :]',
                                 number=10, globals=globals()))
There is no clear-cut threshold, but I would recommend the list comprehension if b is shorter than about 100 rows; otherwise, NumPy is the way to go.
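To see where that crossover sits on your machine, the timings collected above can be plotted (a sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

sizes = [len(b) * i for i in range(1, 100)]  # rows in b_test at each step
plt.plot(sizes, list_comp, label='list comprehension')
plt.plot(sizes, np_isin, label='np.isin')
plt.xlabel('rows in b')
plt.ylabel('time for 10 runs (s)')
plt.legend()
plt.show()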
You are doing it in reverse. It is better to loop through the elements of b and check whether each one is present in a. If yes, then print that element of b. See the answer below.
a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56, 23, 54], [10, 8, 52, 30, 15, 47, 109], [11, 81, 152, 54, 112, 78, 167], [13, 82, 84, 63, 24, 26, 78], [18, 182, 25, 63, 96, 104, 74]]
for bb in b:  # if you want to check only whether the first element of bb is in a
    if bb[0] in a:
        print(bb)

for bb in b:  # if you want to check whether any element of bb is in a
    for bbb in bb:
        if bbb in a:
            print(bb)
Output:
[10, 8, 52, 30, 15, 47, 109]
[11, 81, 152, 54, 112, 78, 167]
[13, 82, 84, 63, 24, 26, 78]
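One small refinement, not in the original answer: bb[0] in a scans the list a on every iteration. Converting a to a set first makes each membership test O(1) on average, which starts to matter once a grows:

a_set = set(a)
c = [bb for bb in b if bb[0] in a_set]  # same result, faster lookups
print(c)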
This is similar to previous questions about how to expand a list-based column across several columns, but the solutions I'm seeing don't seem to work for Dask. Note that the actual DataFrames I'm working with are too large to hold in memory, so converting to pandas first is not an option.
I have a df with a column that contains lists:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [np.random.randint(100, size=4) for _ in range(20)]})
dask_df = dd.from_pandas(df, chunksize=10)
dask_df['a'].compute()
0 [52, 38, 59, 78]
1 [79, 71, 13, 63]
2 [15, 81, 79, 76]
3 [53, 4, 94, 62]
4 [91, 34, 26, 92]
5 [96, 1, 69, 27]
6 [84, 91, 96, 68]
7 [93, 56, 45, 40]
8 [54, 1, 96, 76]
9 [27, 11, 79, 7]
10 [27, 60, 78, 23]
11 [56, 61, 88, 68]
12 [81, 10, 79, 65]
13 [34, 49, 30, 3]
14 [32, 46, 53, 62]
15 [20, 46, 87, 31]
16 [89, 9, 11, 4]
17 [26, 46, 19, 27]
18 [79, 44, 45, 56]
19 [22, 18, 31, 90]
Name: a, dtype: object
According to this solution, if this were a pd.DataFrame I could do something like this, but on the Dask DataFrame it raises an error:
new_dask_df = dask_df['a'].apply(pd.Series)
ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra: [1, 2, 3]
Missing: []
There's another solution listed here:
import dask.array as da
import dask.dataframe as dd
x = da.ones((4, 2), chunks=(2, 2))
df = dd.io.from_dask_array(x, columns=['a', 'b'])
df.compute()
So for dask I tried:
df = dd.io.from_dask_array(dask_df.values)
but that just spits out the same DF I have from before:
(screenshot of the unchanged DataFrame: https://i.stack.imgur.com/T099A.png)
Not really sure why, as the types of the example x and of the values in my df are the same:
print(type(dask_df.values), type(x))
<class 'dask.array.core.Array'> <class 'dask.array.core.Array'>
print(type(dask_df.values.compute()[0]), type(x.compute()[0]))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
Edit: I kind of have a working solution, but it involves iterating through each groupby object. It feels like there should be a better way:
dask_groups = dask_df.explode('a').reset_index().groupby('index')
final_df = []
for idx in dask_df.index.values.compute():
    group = dask_groups.get_group(idx).drop(columns='index').compute()
    group_size = list(range(len(group)))
    row = group.transpose()
    row.columns = group_size
    row['index'] = idx
    final_df.append(dd.from_pandas(row, chunksize=10))
final_df = dd.concat(final_df).set_index('index')
In this case dask doesn't know what to expect from the outcome, so it's best to specify meta explicitly:
# this is a short-cut to use the existing pandas df
# in actual code it is sufficient to provide an
# empty series with the expected dtype
meta = df['a'].apply(pd.Series)
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
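As the comment in the snippet says, the pandas round-trip is only a shortcut; the meta can also be built directly as empty columns with the expected dtype, which avoids computing anything on the pandas side. A sketch, assuming the lists always expand to four int64 columns:

import pandas as pd

# hypothetical meta: one empty int64 Series per expected output column
meta = pd.DataFrame({i: pd.Series(dtype='int64') for i in range(4)})
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)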
I got a working solution. My original function created a list, which resulted in the column of lists above. Changing the applied function to return a dask bag seems to do the trick:
import random
import numpy as np
import pandas as pd
import dask.bag as db
import dask.dataframe as dd

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()

test_df = dd.from_pandas(pd.DataFrame({'a': [random.choice(['a', 'b', 'c']) for _ in range(20)]}), chunksize=10)
test_df.head()

mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()
But I'm not sure this solves the in-memory issue, as I'm now holding a list of groupby results.
Here's how to expand a list-like column across multiple columns manually:
dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]
print(dask_df.head())
a a0 a1 a2 a3
0 [71, 16, 0, 10] 71 16 0 10
1 [59, 65, 99, 74] 59 65 99 74
2 [83, 26, 33, 38] 83 26 33 38
3 [70, 5, 19, 37] 70 5 19 37
4 [0, 59, 4, 80] 0 59 4 80
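If the lists are longer or their length might change, the same expansion can be written as a loop (a sketch, assuming every list has the same known length):

n_elements = 4  # known length of the lists in column "a"
for i in range(n_elements):
    dask_df[f"a{i}"] = dask_df["a"].str[i]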
SultanOrazbayev's answer seems more elegant.
I'm having trouble writing this piece of code.
I need to build lists that take 3 consecutive values, then skip the next 3, and so on.
The expected output should be something like:
output1 = [1,2,3,7,8,9,13,14,15,....67,68,69]
output2 = [4,5,6,10,11,12...70,71,72]
Any ideas how I can achieve that?
Use two loops -- one for each group of three, and one for each item within that group. For example:
>>> [i*6 + j for i in range(12) for j in range(1, 4)]
[1, 2, 3, 7, 8, 9, 13, 14, 15, 19, 20, 21, 25, 26, 27, 31, 32, 33, 37, 38, 39, 43, 44, 45, 49, 50, 51, 55, 56, 57, 61, 62, 63, 67, 68, 69]
>>> [i*6 + j for i in range(12) for j in range(4, 7)]
[4, 5, 6, 10, 11, 12, 16, 17, 18, 22, 23, 24, 28, 29, 30, 34, 35, 36, 40, 41, 42, 46, 47, 48, 52, 53, 54, 58, 59, 60, 64, 65, 66, 70, 71, 72]
Suppose you want n values every n values, for a given number of sets, starting at start. Just change start and the number of sets you need. In the example below the list starts at 1, so the first set is [1, 2, 3], and we need 12 sets, each containing 3 consecutive elements.
Method 1
n = 3
start = 1
total = 12
# 2*n*i + start is the first element of the i-th set of n values (arithmetic progression)
print([j for i in range(total) for j in range(2*n*i + start, 2*n*i + start+n)])
# Or
print(sum([list(range(2*n*i + start, 2*n*i + start+n)) for i in range(total)], []))
Method 2 (NumPy does the operations in C, so it's fast)
import numpy as np
n = 3
start = 1
total = 12
# One liner
print(
(np.arange(start, start + n, step=1)[:, np.newaxis] + np.arange(0, total, 1) * 2*n).transpose().reshape(-1)
)
############## EXPLANATION OF THE ABOVE ONE-LINER ##############
# np.arange start, start+1, ... start + n - 1
first_set = np.arange(start, start + n, step=1)
# [1 2 3]
# np.arange 0, 2*n, 4*n, 6*n, ....
multiple_to_add = np.arange(0, total, 1) * 2*n
print(multiple_to_add)
# broadcast first_set using np.newaxis and add it to each element in multiple_to_add
each_set_as_col = first_set[:, np.newaxis] + multiple_to_add
# [[ 1 7 13 19 25 31 37 43 49 55 61 67]
# [ 2 8 14 20 26 32 38 44 50 56 62 68]
# [ 3 9 15 21 27 33 39 45 51 57 63 69]]
# invert rows and columns
each_set_as_row = each_set_as_col.transpose()
# [[ 1 2 3]
# [ 7 8 9]
# [13 14 15]
# [19 20 21]
# [25 26 27]
# [31 32 33]
# [37 38 39]
# [43 44 45]
# [49 50 51]
# [55 56 57]
# [61 62 63]
# [67 68 69]]
merge_all_set_in_single_row = each_set_as_row.reshape(-1)
# array([ 1, 2, 3, 7, 8, 9, 13, 14, 15, 19, 20, 21, 25, 26, 27, 31, 32,
# 33, 37, 38, 39, 43, 44, 45, 49, 50, 51, 55, 56, 57, 61, 62, 63, 67,
# 68, 69])
To make the logic understandable (sometimes the Pythonic methods look like 'magic'), here's a naive algorithm to do that:
output1 = []
output2 = []
for i in range(1, 100):  # change the range as you like
    if (i - 1) % 6 < 3:
        output1.append(i)
    else:
        output2.append(i)
What's going on here:
Initializing two empty lists.
Iterate through integers in a range.
How to tell if i should go to output1 or output2:
I can see that 3 consecutive numbers go to output1, then 3 consecutive to output2.
This tells me I can use the modulo operator % (doing % 6).
The rest is simple logic to get the exact result wanted.
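Once the modulo test is clear, the whole loop collapses into two comprehensions; a sketch (73 is chosen as the bound so the lists end at 69 and 72, as in the question):

output1 = [i for i in range(1, 73) if (i - 1) % 6 < 3]
output2 = [i for i in range(1, 73) if (i - 1) % 6 >= 3]
print(output1[:6], '...', output1[-3:])  # [1, 2, 3, 7, 8, 9] ... [67, 68, 69]
print(output2[:6], '...', output2[-3:])  # [4, 5, 6, 10, 11, 12] ... [70, 71, 72]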
I have a list of numbers, and from it I want to create 3 more lists that contain the maximum, the average, and the 5th largest number of each block. My original list, overdraw, is a list of blocks: each block has 6 numbers, and there are 3 blocks in total (a 3x6 matrix).
overdraw:
[[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
I know how to calculate the max, average, and 5th largest in this list, but I want the answer in a specific format: I know the max, average, and 5th largest values of each block, but I want each of them printed 4 times. I know all the values:
Max = [45, 76, 54]
Average = [24, 37, 34]
Largest(5th) = [14, 23, 22]
my approach:
overdraw = [[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
x = [sorted(block, reverse=True) for block in overdraw]  # sort each block in descending order
max = [x[i][0] for i in range(0, len(x))]  # max (note: this shadows the built-in max)
largest = [x[i][4] for i in range(0, len(x))]  # 5th largest
average = [sum(x[i])/len(x[i]) for i in range(0, len(x))]  # average (unrounded, so 36.83 rather than 37)
print("max: ", max)
print("5th largest: ", largest)
print("average: ", average)
Running this code reproduces the values above, but I want the output in this format:
Average = [24, 24, 24, 24, 37, 37, 37, 37, 34, 34, 34, 34]
Max = [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
Largest(5th) = [14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
As you can see each average, max, and the largest number is printed 4 times in their respective list. So can anyone help with this answer?
What about using pandas.DataFrame.explode?
import pandas as pd
df = pd.DataFrame({
    'OvIdx'       : 3 * [range(4)],
    'Average'     : average,
    'Max'         : max,  # should be renamed/assigned as max_ instead
    'Largest(5th)': largest
}).explode('OvIdx').set_index('OvIdx').astype(int)
print(df)
which shows
Average Max Largest(5th)
OvIdx
0 24 45 14
1 24 45 14
2 24 45 14
3 24 45 14
0 36 76 23
1 36 76 23
2 36 76 23
3 36 76 23
0 34 54 22
1 34 54 22
2 34 54 22
3 34 54 22
From here, you can still do all the calculations you want and/or get a NumPy array by doing df.values.
Following your comment, you can also get each column as an individual list, e.g.
>>> df.Average.tolist()
[24, 24, 24, 24, 36, 36, 36, 36, 34, 34, 34, 34]
>>> df.Max.tolist()
[45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
>>> df['Largest(5th)'].tolist() # as string key since the name is a little bit exotic
[14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
This approach starts to be a little bit overkill, though it stays readable.
A solution that returns lists like you specified:
import itertools
n_times = 4
overdraw = [[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
y = [sorted(block, reverse=True) for block in overdraw]
maximum = list(itertools.chain(*[[max(x)]*n_times for x in y]))
average = list(itertools.chain(*[[int(round(sum(x)/len(x)))]*n_times for x in y]))
fifth_largest = list(itertools.chain(*[[x[4]]*n_times for x in y]))
print(f"Average = {average}")
print(f"Max = {maximum}")
print(f"Largest(5th): {fifth_largest}")
Outputs:
Average = [24, 24, 24, 24, 37, 37, 37, 37, 34, 34, 34, 34]
Max = [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
Largest(5th): [14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
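For completeness, NumPy's np.repeat does this per-block repetition directly; a sketch built on the per-block values already computed in the question:

import numpy as np

n_times = 4
maximum = np.repeat([45, 76, 54], n_times).tolist()
average = np.repeat([24, 37, 34], n_times).tolist()
fifth_largest = np.repeat([14, 23, 22], n_times).tolist()
print(maximum)  # [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]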
The list comprehension:
def getBiggerNumber(input_number, generated_number):
    return [x for x in generated_number if x > input_number]
The results from the list comprehension:
Generated Numbers : [7, 9, 14, 18, 27, 41, 44, 46, 54, 55, 57, 57, 57, 64, 65, 81, 82, 82, 83, 95]
Enter a number 1-100: 44
Your number: 44
Numbers greater than 44 : [46, 54, 55, 57, 57, 57, 64, 65, 81, 82, 82, 83, 95]
This code is what I tried to get the same result as the above.
for x in generated_number:
    if x > input_number:
        print(x)
The results I get from this is:
Random Numbers : [6, 12, 17, 24, 25, 26, 40, 43, 44, 45, 50, 51, 62, 65, 72, 75, 77, 91, 93, 98]
Please enter a number 1 through 100: 66
Your number is : 66
72
75
77
91
93
98
72
75
77
91
93
98
Numbers greater than 66 : None
As you can see, step by step:
def doThing(input_number, generated_number):
    return [x for x in generated_number if x > input_number]

print(doThing(10, [100, 10, 20, 40]))

def doSameThing(input_number, generated_number):
    res = []
    for x in generated_number:
        if x > input_number:  # filter on the argument, not a hard-coded 10
            res.append(x)
    return res

print(doSameThing(10, [100, 10, 20, 40]))
You are filtering by x > input_number; a list comprehension is just syntactic sugar for that loop.
The equivalent of the list comprehension:
def getBiggerNumber(input_number, generated_number):
    return [x for x in generated_number if x > input_number]

print(getBiggerNumber(44, [20, 66, 100]))

def same(xnum, ylist, lst=None):
    if lst is None:  # avoid the mutable-default-argument pitfall
        lst = []
    for x in ylist:
        if x > xnum:
            lst.append(x)
    return lst

print(same(44, [20, 66, 100]))
OUTPUT:
[66, 100]
[66, 100]
Consider a numpy array of the form:
a = np.random.uniform(0., 100., (10, 1000))
and a list of indexes to elements in that array that I want to keep track of:
idx_s = [0, 5, 7, 9, 12, 17, 19, 32, 33, 35, 36, 39, 40, 41, 42, 45, 47, 51, 53, 57, 59, 60, 61, 62, 63, 65, 66, 70, 71, 73, 75, 81, 83, 85, 87, 88, 89, 90, 91, 93, 94, 96, 98, 100, 106, 107, 108, 118, 119, 121, 124, 126, 127, 128, 129, 133, 135, 138, 142, 143, 144, 146, 147, 150]
I also have a list of indexes of elements I need to remove from a:
idx_d = [4, 12, 18, 20, 21, 22, 26, 28, 29, 31, 37, 43, 48, 54, 58, 74, 80, 86, 99, 109, 110, 113, 117, 134, 139, 140, 141, 148, 154, 156, 160, 166, 169, 175, 183, 194, 198, 199, 219, 220, 237, 239, 241, 250]
which I delete with:
a_d = np.delete(a, idx_d, axis=1)
But this process alters the indexes. The entries of idx_s no longer point to the same elements in a_d as they did in a, since np.delete() shifted everything after each removed column. For example: if I delete the element at index 4 from a, then every index after 4 in idx_s is displaced by 1 in a_d.
                 v  Index 5 points to 'f' in a
       0 1 2 3 4 5 6
a   -> a b c d e f g ...  # remove the element at index 4 ('e') from a
a_d -> a b c d f g h ...  # now index 5 no longer points to 'f' in a_d, but to 'g'
       0 1 2 3 4 5 6
How do I update the idx_s list of indexes, so that the same elements that were pointed in a are pointed in a_d?
If an element in idx_s is also present in idx_d (and was thus removed from a and is not present in a_d), its index should also be discarded.
You could use np.searchsorted to get the shifts for each element in idx_s and then simply subtract those from idx_s for the new shifted-down values, like so -
idx_s - np.searchsorted(idx_d, idx_s)
If idx_d is not already sorted, we need to feed in a sorted version. Thus, for simplicity assuming these as arrays, we would have -
idx_s = idx_s[~np.in1d(idx_s, idx_d)]
out = idx_s - np.searchsorted(np.sort(idx_d), idx_s)
A sample run to help get a better picture -
In [530]: idx_s
Out[530]: array([19, 5, 17, 9, 12, 7, 0])
In [531]: idx_d
Out[531]: array([12, 4, 18])
In [532]: idx_s = idx_s[~np.in1d(idx_s, idx_d)] # Remove matching ones
In [533]: idx_s
Out[533]: array([19, 5, 17, 9, 7, 0])
In [534]: idx_s - np.searchsorted(np.sort(idx_d), idx_s) # Updated idx_s
Out[534]: array([16, 4, 15, 8, 6, 0])
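A quick way to convince yourself the update is right: the columns selected by the new indexes in a_d must equal the columns the old (surviving) indexes selected in a. A self-contained sketch reusing the question's names:

import numpy as np

a = np.random.uniform(0., 100., (10, 1000))
idx_s = np.array([0, 5, 7, 9, 12, 17, 19])
idx_d = np.array([4, 12, 18])

a_d = np.delete(a, idx_d, axis=1)
kept = idx_s[~np.in1d(idx_s, idx_d)]                # drop indexes that were deleted
new_idx = kept - np.searchsorted(np.sort(idx_d), kept)

assert np.array_equal(a[:, kept], a_d[:, new_idx])  # same elements, new positions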
idx_s = [0, 5, 7, 9, 12, 17, 19]
idx_d = [4, 12, 18]

def worker(a, v, i=0):
    # assumes both lists are sorted; i is the running shift
    if not a:
        return []
    elif not v:
        return [x - i for x in a]  # no deletions left: just apply the shift
    elif a[0] == v[0]:
        return worker(a[1:], v[1:], i + 1)  # this index was deleted: discard it
    elif a[0] < v[0]:
        return [a[0] - i] + worker(a[1:], v, i)
    else:
        return worker(a, v[1:], i + 1)  # a deleted index lies below a[0]: bump the shift

worker(idx_s, idx_d)
# [0, 4, 6, 8, 15, 16]
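Note that the recursive version will hit Python's recursion limit once the index lists grow to a few hundred elements. An iterative equivalent of the same two-pointer idea (a sketch, assuming both lists are sorted as above):

def worker_iter(idx_s, idx_d):
    out, j, shift = [], 0, 0
    for x in idx_s:
        # advance past every deleted index below x, counting the shift
        while j < len(idx_d) and idx_d[j] < x:
            j += 1
            shift += 1
        if j < len(idx_d) and idx_d[j] == x:
            continue  # x itself was deleted: discard it
        out.append(x - shift)
    return out

print(worker_iter([0, 5, 7, 9, 12, 17, 19], [4, 12, 18]))
# [0, 4, 6, 8, 15, 16]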