Let's say I have two very large lists (e.g. 10 million rows) with some values or strings. I would like to figure out how many items from list1 are in list2.
As such this can be done by:
true_count = 0
false_count = 0
for i, x in enumerate(list1):
    print(i)
    if x in list2:
        true_count += 1
    else:
        false_count += 1
print(true_count)
print(false_count)
This will do the trick; however, with 10 million rows it could take quite some time, because every `x in list2` test scans the whole list. Is there some sweet function I don't know about that can do this much faster, or something entirely different?
Using Pandas
Here's how you can do it with a pandas DataFrame. Note that this checks whether every item of list2 occurs in list1; it does not count the matches.
import pandas as pd
import random
list1 = [random.randint(1,10) for i in range(10)]
list2 = [random.randint(1,10) for i in range(10)]
df1 = pd.DataFrame({'list1':list1})
df2 = pd.DataFrame({'list2':list2})
print (df1)
print (df2)
print (all(df2.list2.isin(df1.list1).astype(int)))
I am just picking 10 rows and generating 10 random numbers:
List 1:
list1
0 3
1 5
2 4
3 1
4 5
5 2
6 1
7 4
8 2
9 5
List 2:
list2
0 2
1 3
2 2
3 4
4 3
5 5
6 5
7 1
8 4
9 1
The output of the final print statement will be:
True
The random lists I checked against are:
list1 = [random.randint(1,100000) for i in range(10000000)]
list2 = [random.randint(1,100000) for i in range(5000000)]
I ran a test with 10 million random numbers in list1 and 5 million random numbers in list2; on my Mac the result came back in about 2.21 seconds.
Using Set
Alternatively, you can convert the lists into sets and check whether one set is a subset of the other.
set1 = set(list1)
set2 = set(list2)
print (set2.issubset(set1))
Comparing the two runs, the set approach is also fast: it came back in about 1.66 seconds.
You can convert the lists to sets and compute the length of the intersection between them.
len(set(list1) & set(list2))
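Note that the intersection counts distinct values only. If every occurrence in list1 should be counted, as the loop in the question does, a minimal sketch (assuming the items are hashable) is to build a set from list2 once and test each item of list1 against it. The function name count_membership and the sample data are illustrative and not taken from the posts above.

```python
def count_membership(list1, list2):
    """Return (found, missing): how many items of list1 do / do not occur in list2."""
    lookup = set(list2)  # set membership tests take constant time on average
    found = sum(1 for x in list1 if x in lookup)
    return found, len(list1) - found


list1 = [3, 5, 4, 1, 5, 2, 1, 4, 2, 5]
list2 = [2, 3, 2, 4, 3, 9]

true_count, false_count = count_membership(list1, list2)
print(true_count, false_count)          # 5 5
print(len(set(list1) & set(list2)))     # 3 distinct common values
```

The set is built once, so the whole count is roughly linear in the combined length of both lists instead of quadratic.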
You can also use NumPy: convert both lists with np.array().
After that, both lists are np.ndarray objects, and because they have only one dimension you can use np.intersect1d() and count the common items with .size. Like the set intersection, this counts each distinct common value once.
import numpy as np
lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]
a_list=np.array(lst)
b_list=np.array(lst2)
c = np.intersect1d(a_list, b_list)
print (c.size)
Related
I have a list of 5 elements (which could grow to 50000). I want to sum every pairwise combination of elements from the list and create a dataframe from the results, so I wrote the following code:
import pandas as pd
x = list(range(1,5))
t = []
for i in x:
    for j in x:
        t.append((i,j,i+j))
df = pd.DataFrame(t)
The above code generates the correct results but takes very long to execute when the list has more elements. I am looking for the fastest way to do the same thing.
Option 1: combinations can be obtained through the pandas merge() method without using explicit loops (how='cross' requires pandas 1.2 or later).
import numpy as np
import pandas as pd
x = np.arange(1, 5+1)
df = pd.DataFrame(x, columns=['x']).merge(pd.Series(x, name='y'), how='cross')
df['sum'] = df.x.add(df.y)
print(df)
   x  y  sum
0  1  1    2
1  1  2    3
2  1  3    4
3  1  4    5
4  1  5    6
5  2  1    3
6  2  2    4
...
Option 2: with itertools.product()
import itertools
import pandas as pd
num = 5
# product() yields every (i, j) pair; the columns are named 0 and 1 by default
df = pd.DataFrame(itertools.product(range(1,num+1),range(1,num+1)))
df['sum'] = df[0].add(df[1])
print(df)
A list comprehension can make it faster. You can use t=[(i,j,i+j) for i in x for j in x] instead of the nested for loop, because a comprehension avoids the repeated t.append lookups and calls that the explicit loop performs. Here's the updated code that replaces the nested loops.
import pandas as pd
x = list(range(1,5))
t = [(i,j,i+j) for i in x for j in x]
df = pd.DataFrame(t)
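A rough way to check the speed claim on your own machine is to time both versions with timeit. This is only a sketch: the list size of 200 and the function names with_loop and with_comprehension are assumptions made for the benchmark, and the absolute timings will differ between machines and Python versions.

```python
import timeit

x = list(range(1, 201))  # hypothetical size, chosen only for the benchmark


def with_loop():
    """Nested for loops with explicit append, as in the question."""
    t = []
    for i in x:
        for j in x:
            t.append((i, j, i + j))
    return t


def with_comprehension():
    """The same rows built with a single list comprehension."""
    return [(i, j, i + j) for i in x for j in x]


# Both approaches must produce identical rows.
same_result = with_loop() == with_comprehension()

# Time a few repetitions of each version.
loop_seconds = timeit.timeit(with_loop, number=5)
comp_seconds = timeit.timeit(with_comprehension, number=5)
print(f"same result: {same_result}")
print(f"loop: {loop_seconds:.3f}s  comprehension: {comp_seconds:.3f}s")
```

Both versions still do the same quadratic amount of work, so for very large lists a vectorized approach such as the merge-based one is likely to matter more than the choice between loop and comprehension.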
I have defined a function to create a dataframe, but I get a whole list in each column. How can I get each element of the list as a separate row in the dataframe, as shown below?
a = [1, 2, 3, 4]
def function():
    result = []
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
        result.append({'number': number, 'operation': operation})
        df = pd.DataFrame(result, columns=['number','operation'])
    return df
function()
Result:
         number        operation
0  [1, 2, 3, 4]  [8, 16, 24, 32]
What I really want:
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
Can anyone help me please? :)
Your problems are twofold. First, you are pushing the entire list of values (instead of the "current" value) into the result list on each pass through your for loop. Second, you are overwriting the dataframe on each pass as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = [{'number' : i, 'operation' : 8*i} for i in a]
    df = pd.DataFrame(result)
    return df
print(function())
Output:
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
import numpy as np
import pandas as pd
a = [1, 2, 3, 4]
def function():
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
        # rot90 followed by flipud transposes the 2 x 4 array into 4 x 2
        v = np.rot90(np.array((number, operation)))
        result = np.flipud(v)
        df = pd.DataFrame(result, columns=['number','operation'])
    return df
print(function())
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
You are almost there. Just replace number = [i for i in a] with number = a[i], and operation = [8*i for i in a] with operation = 8 * a[i].
(FYI: there is no need to create the pandas dataframe inside the loop. You get the same output by creating it once, after the loop.)
See the code below:
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = []
    for i in range(0, len(a)):
        number = a[i]
        operation = 8*a[i]
        result.append({'number': number, 'operation': operation})
    # build the dataframe once, after all rows have been collected
    df = pd.DataFrame(result, columns=['number','operation'])
    return df
function()
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
I have some lists such as
list1 = ['hi',2,3,4]
list2 = ['hello', 7,1,8]
list3 = ['morning',7,2,1]
Where 'hi', 'hello' and 'morning' are strings, while the rest are numbers.
However, when I try to stack them up as:
matrix = np.vstack((list1,list2,list3))
the numbers become strings. In particular, they become numpy.str_.
How do I solve this? I tried replacing the items and changing their type, but nothing works.
edit
I made a mistake above! In my original problem, the first list is actually a list of headings, so for example
list1 = ['hi', 'number of hours', 'number of days', 'ideas']
So the first column (in the vertically stacked array) is a column of strings. The other columns have a string as their first element and then numbers.
You could use Pandas DataFrames, they allow for heterogeneous data:
>>> pandas.DataFrame([list1, list2, list3])
         0  1  2  3
0       hi  2  3  4
1    hello  7  1  8
2  morning  7  2  1
If you want to name the columns, you can do that too (here list0 = ['hi', 'nb_hours', 'nb_days', 'ideas'] is the list of headings):
pandas.DataFrame([list1, list2, list3], columns=list0)
        hi  nb_hours  nb_days  ideas
0       hi         2        3      4
1    hello         7        1      8
2  morning         7        2      1
A NumPy array holds a single element type. Since numbers can be written as strings but strings cannot, in general, be written as numbers, np.vstack converts everything to strings, and all elements of your matrix end up with a string type.
If you want to have a matrix of integers, you can:
1- Extract the submatrix that holds your numbers and then convert it to integers
2- Or extract only the numbers from your lists and stack those directly.
import numpy as np
list1 = ['hi',2,3,4]
list2 = ['hello', 7,1,8]
list3 = ['morning',7,2,1]
matrix = np.vstack((list1,list2,list3))
# First: slice off the string column, then convert the rest to integers
# (astype is used because map() returns a lazy iterator in Python 3)
m = matrix[:,1:].astype(np.int32)
# [[2 3 4] [7 1 8] [7 2 1]]
# Second: stack only the numeric part of each list
m = np.vstack((list1[1:],list2[1:],list3[1:]))
# [[2 3 4] [7 1 8] [7 2 1]]
edit (Answer to comment)
I'll call the title list list0:
list0 = ['hi', 'nb_hours', 'nb_days', 'ideas']
It's basically the same idea:
1- Stack everything, then extract the submatrix (here we take neither the first row nor the first column: [1:,1:])
matrix = np.vstack((list0,list1,list2,list3))
matrix_nb = matrix[1:,1:].astype(np.int32)
2- Skip list0 and stack only the other lists (without their first element, [1:]):
m = np.vstack((list1[1:],list2[1:],list3[1:]))
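If the strings and the numbers need to stay in one array, another possibility is an object array. The following is a sketch, not part of the answers above: it assumes it is acceptable to build the array with np.array(..., dtype=object) instead of np.vstack, and the names labels and numbers are illustrative.

```python
import numpy as np

list1 = ['hi', 2, 3, 4]
list2 = ['hello', 7, 1, 8]
list3 = ['morning', 7, 2, 1]

# dtype=object stores references to the original Python objects,
# so the integers are not converted to strings.
matrix = np.array([list1, list2, list3], dtype=object)

labels = matrix[:, 0]                # the string column
numbers = matrix[:, 1:].astype(int)  # numeric block, converted for arithmetic

print(type(matrix[0, 1]))                # <class 'int'>
print(numbers.sum(axis=1).tolist())      # [9, 16, 10]
```

Object arrays give up most of NumPy's speed, so converting the numeric block with astype before doing arithmetic, as above, is usually worthwhile.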
I just started programming in Python a few days ago, and now I'm trying to make a program that generates a random list and then picks out the duplicate elements. The problem is that I don't get any duplicate numbers in my list.
This is my code:
import random
def generar_listas (numeros, rango):
    lista = [random.sample(range(numeros), rango)]
    print("\n", lista, sep="")
    return
def texto_1 ():
    texto = "You must set some parameters to generate two random lists"
    print(texto)
    return
texto_1()
generar_listas(int(input("\nMaximum number: ")), int(input("Length: ")))
For example, if I choose 20 and 20 for random.sample, it gives me all the numbers in that range, just in random order. I want a list of random numbers that includes duplicates.
What you want is fairly simple: a random list of numbers that contains some duplicates. That is easy to do with something like numpy.
1. Generate a range of integers from 0 to 9.
2. Sample randomly (with replacement) from that range.
Like this:
import numpy as np
print(np.random.choice(10, 10, replace=True))
Result:
[5 4 8 7 0 8 7 3 0 0]
If you want the list to be ordered, just use the built-in function sorted():
sorted([5, 4, 8, 7, 0, 8, 7, 3, 0, 0])
[0, 0, 0, 3, 4, 5, 7, 7, 8, 8]
If you don't want to use numpy, you can use the following (after import random):
print([random.choice(range(10)) for i in range(10)])
[7, 3, 7, 4, 8, 0, 4, 0, 3, 7]
random.randrange is what you want.
>>> [random.randrange(10) for i in range(5)]
[3, 2, 2, 5, 7]
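The standard library can also sample with replacement in a single call through random.choices. The sketch below additionally picks out the duplicated values, which is what the question is ultimately after; the fixed seed and the variable name duplicates are assumptions added for illustration.

```python
import random

random.seed(0)  # fixed seed only so that the sketch is repeatable

# Draw 10 values from 0..9 with replacement, so repeats are possible.
lista = random.choices(range(10), k=10)

# Values that occur more than once in the list, in ascending order.
duplicates = sorted({n for n in lista if lista.count(n) > 1})

print(lista)
print(duplicates)
```

random.sample, used in the question, draws without replacement, which is why it never returns the same position of the range twice.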
I know that the order of the keys is not guaranteed and that's OK, but what exactly does it mean that the order of the values is not guaranteed as well*?
For example, I am representing a matrix as a dictionary, like this:
signatures_dict = {}
M = 3
for i in range(1, M):
    row = []
    for j in range(1, 5):
        row.append(j)
    signatures_dict[i] = row
print(signatures_dict)
Are the columns of my matrix correctly constructed? Let's say I have 3 rows and at this signatures_dict[i] = row line, row will always have 1, 2, 3, 4, 5. What will signatures_dict be?
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
or something like
1 2 3 4 5
1 4 3 2 5
5 1 3 4 2
? I am worried about cross-platform support.
In my application, the rows are words and the columns documents, so can I say that the first column is the first document?
*Are order of keys() and values() in python dictionary guaranteed to be the same?
You are guaranteed to have 1 2 3 4 5 in each row; the elements will not be reordered. The lack of ordering of values() refers to the fact that if you call signatures_dict.values(), the rows could come out in any order (this applies to Python versions before 3.7; from 3.7 on, dicts keep insertion order). But the values are the rows, not the elements of each row. Each row is a list, and lists maintain their order.
If you want a dict which maintains order, Python has that too: https://docs.python.org/2/library/collections.html#collections.OrderedDict
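A small check of both points, as a sketch: the loop bounds below are chosen to give 3 rows of 1 to 5 as described in the question's text, and the names rows_intact, key_order and paired are illustrative.

```python
signatures_dict = {}
for i in range(1, 4):
    signatures_dict[i] = [j for j in range(1, 6)]

# Each value is a list, and a list always keeps the order it was built in.
rows_intact = all(row == [1, 2, 3, 4, 5] for row in signatures_dict.values())

# In Python 3.7+ the keys iterate in insertion order.
key_order = list(signatures_dict)

# keys() and values() correspond position by position, so zipping them
# gives the same pairs as items().
paired = list(zip(signatures_dict.keys(), signatures_dict.values()))

print(rows_intact)                               # True
print(key_order)                                 # [1, 2, 3]
print(paired == list(signatures_dict.items()))   # True
```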
Why not use a list of lists as your matrix? It would keep whatever order you gave it:
In [1]: matrix = [[i for i in range(4)] for _ in range(4)]
In [2]: matrix
Out[2]: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
In [3]: matrix[0][0]
Out[3]: 0
In [4]: matrix[3][2]
Out[4]: 2