I have list vector_list of length 800,000, where the elements are lists of size 768. I'm trying to add 768 columns to a pandas dataframe where each column is 800,000 long and represents an element from each list. Here's my code:
active = pd.DataFrame()
for i in range(len(vector_list[0])):
element_list = []
for j in range(len(vector_list)):
element_list.append(vector_list[j][i])
active['Element {}'.format(i)] = element_list
Just to reiterate,
len(vector_list) = 800,000
len(vector_list[0]) = 768
Is there a more clever, faster way to do this?
Directly pass the list to DataFrame constructor.
import pandas as pd
_list = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(_list)
print(df.head())
Output
0 1
0 1 2
1 3 4
2 5 6
3 7 8
Related
I'm trying to randomize all the rows of my DataFrame but with no success.
What I want to do is from this matrix
A= [ 1 2 3
4 5 6
7 8 9 ]
to this
A_random=[ 4 5 6
7 8 9
1 2 3 ]
I've tried with np. random.shuffle but it doesn't work.
I'm working in Google Colaboratory environment.
If you want to make this work with np.random.shuffle, then one way would be to extract the rows into an ArrayLike structure, shuffle them in place and then recreate the DataFrame:
A = pandas.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
extracted_rows = A.values.tolist() # Each row is an array element, so rows will remain fixed but their order shuffled
np.random.shuffle(extracted_rows)
A_random = pandas.DataFrame(extracted_rows)
Let's say I have two very large lists (e.g. 10 million rows) with some values or strings. I would like to figure out how many items from list1 are in list2.
As such this can be done by:
true_count = 0
false_count = 0
for i, x in enumerate(list1):
print(i)
if x in list2:
true_count += 1
else:
false_count += 1
print(true_count)
print(false_count)
This will do the trick, however, if you have 10 million rows, this could take quite some time. Is there some sweet function I don't know about that can do this much faster, or something entirely different?
Using Pandas
Here's how you will do it using Pandas dataframe.
import pandas as pd
import random
list1 = [random.randint(1,10) for i in range(10)]
list2 = [random.randint(1,10) for i in range(10)]
df1 = pd.DataFrame({'list1':list1})
df2 = pd.DataFrame({'list2':list2})
print (df1)
print (df2)
print (all(df2.list2.isin(df1.list1).astype(int)))
I am just picking 10 rows and generating 10 random numbers:
List 1:
list1
0 3
1 5
2 4
3 1
4 5
5 2
6 1
7 4
8 2
9 5
List 2:
list2
0 2
1 3
2 2
3 4
4 3
5 5
6 5
7 1
8 4
9 1
The output of the if statement will be:
True
The random lists I checked against are:
list1 = [random.randint(1,100000) for i in range(10000000)]
list2 = [random.randint(1,100000) for i in range(5000000)]
Ran a test with 10 mil. random numbers in list1, 5 mil. random numbers in list2, result on my mac came back in 2.207757880999999 seconds
Using Set
Alternate, you can also convert the list into a set and check if one set is a subset of the other.
set1 = set(list1)
set2 = set(list2)
print (set2.issubset(set1))
Comparing the results of the run, set is also fast. It came back in 1.6564296570000003 seconds
You can convert the lists to sets and compute the length of the intersection between them.
len(set(list1) & set(list2))
You will have to use Numpy array to translate the lists into a np.array()
After that, both lists will be considered as np.array objects, and because they have only one dimension you can use np.intersect() and count the common items with .size
import numpy as np
lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]
a_list=np.array(lst)
b_list=np.array(lst2)
c = np.intersect1d(a_list, b_list)
print (c.size)
Normally when you want to create a turn a set of data into a Data Frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one-by-one isn't going work. Instead I decided to make a single list and iteratively put a certain chunk of each row onto a Data Frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list, it looks like this lst = [0, 2, …, 9]
for i in range(len(lst)):
df.iloc[:, i] = df.iloc[:, i]\
.append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
df.iloc[:, i] = df.iloc[:, i]\
.append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
a b c d e
0 0 NaN NaN NaN NaN
1 1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2,-1, order='F'), columns = [*'abcde'])
print(df)
Output:
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
I have defined a function to create a dataframe, but I get two lists in each column, how could I get each element of the list as a separate row in the dataframe as shown below.
a = [1, 2, 3, 4]
def function():
result = []
for i in range(0, len(a)):
number = [i for i in a]
operation = [8*i for i in a]
result.append({'number': number, 'operation': operation})
df = pd.DataFrame(result, columns=['number','operation'])
return df
function()
Result:
number operation
0 [1, 2, 3, 4] [8, 16, 24, 32]
What I really want to:
number operation
0 1 8
1 2 16
2 3 24
3 4 34
Can anyone help me please? :)
Your problems are twofold, firstly you are pushing the entire list of values (instead of the "current" value) into the result array on each pass through your for loop, and secondly you are overwriting the dataframe each time as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
result = [{'number' : i, 'operation' : 8*i} for i in a]
df = pd.DataFrame(result)
return df
print(function())
Output:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
import numpy as np
a = [1, 2, 3, 4]
def function():
for i in range(0, len(a)):
number = [i for i in a]
operation = [8*i for i in a]
v=np.rot90(np.array((number,operation)))
result=np.flipud(v)
df = pd.DataFrame(result, columns=['number','operation'])
return df
print (function())
number operation
0 1 8
1 2 16
2 3 24
3 4 32
You are almost there. Just replace number = [i for i in a] with number = a[i] and operation = [8*i for i in a] with operation = 8 * a[i]
(FYI: No need to create pandas dataframe inside loop. You can get same output with pandas dataframe creation outside loop)
Refer to the below code:
a = [1, 2, 3, 4]
def function():
result = []
for i in range(0, len(a)):
number = a[i]
operation = 8*a[i]
result.append({'number': number, 'operation': operation})
df = pd.DataFrame(res, columns=['number','operation'])
return df
function()
number operation
0 1 8
1 2 16
2 3 24
3 4 32
I have this DataFrame. In Column ArraysDate contains many elements. I want to be able to number and run the for loop in the array of java. I have not found any solution, please tell me some ideas?.
Ex with CustomerNumber = 4 , then ArraysDate have 3 elements ,and understood i1,i2,i3,i4 to use calculations in ArraysDate.
Thanks you
CustomerNumber ArraysDate
1 [ 1 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]
If I understand correctly, you want to get an array of data from 'ArraysDate' based on column 'CustomerNumber'.
Basically, you can use loc
import pandas as pd
data = {'c': [1, 2, 3, 4], 'date': [[1,2],[3],[0],[2,60,30,40]]}
df = pd.DataFrame(data)
df.loc[df['c']==4, 'date']
df.loc[df['c']==4, 'date'] = df.loc[df['c']==4, 'date'].apply(lambda i: sum(i))
Result:
[2, 60, 30, 40]
c date
0 1 [1, 2]
1 2 [3]
2 3 [0]
3 4 132
You can use the lambda to sum all items in the array per row.
Step 1: Create a dataframe
import pandas as pd
import numpy as np
d = {'ID': [[1,2,3],[1,2,43]]}
df = pd.DataFrame(data=d)
Step 2: Sum the items in the array
df['ID2']=df['ID'].apply(lambda x: sum(x))
df