I have defined a function that creates a dataframe, but each column ends up holding a whole list. How can I get each element of the list as a separate row in the dataframe, as shown below?
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = []
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
        result.append({'number': number, 'operation': operation})
        df = pd.DataFrame(result, columns=['number','operation'])
        return df
function()
Result:
         number        operation
0  [1, 2, 3, 4]  [8, 16, 24, 32]
What I really want:
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
Can anyone help me please? :)
There are two problems. First, on each pass through your for loop you append the entire list of values (instead of the "current" value) to the result list. Second, you rebuild the dataframe on every pass as well. It is simpler to use a list comprehension to generate the rows for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = [{'number': i, 'operation': 8*i} for i in a]
    df = pd.DataFrame(result)
    return df
print(function())
Output:
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
import numpy as np
import pandas as pd
a = [1, 2, 3, 4]
def function():
    # Build each column once; no loop over the indices is needed
    number = [i for i in a]
    operation = [8*i for i in a]
    # rot90 followed by flipud transposes the 2 x 4 array into 4 x 2
    v = np.rot90(np.array((number, operation)))
    result = np.flipud(v)
    df = pd.DataFrame(result, columns=['number','operation'])
    return df
print(function())
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
You are almost there. Just replace number = [i for i in a] with number = a[i], and operation = [8*i for i in a] with operation = 8 * a[i].
(FYI: there is no need to create the pandas dataframe inside the loop. You get the same output by creating it once, after the loop.)
Refer to the code below:
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = []
    for i in range(0, len(a)):
        number = a[i]
        operation = 8*a[i]
        result.append({'number': number, 'operation': operation})
    # Build the dataframe once, after the loop has collected every row
    df = pd.DataFrame(result, columns=['number','operation'])
    return df
function()
   number  operation
0       1          8
1       2         16
2       3         24
3       4         32
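For what it's worth, the loop can be avoided altogether by letting pandas do the arithmetic on the whole column. This is only a minimal sketch that reuses the list a and the column names from the question; it is an alternative to the answers above, not a correction of them.

```python
import pandas as pd

a = [1, 2, 3, 4]

# Build the 'number' column straight from the list, then derive
# 'operation' with element-wise arithmetic on the whole column.
df = pd.DataFrame({'number': a})
df['operation'] = df['number'] * 8

print(df)
```

Each list element becomes one row because the list is passed as a column, rather than being stored as a single cell value.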
Related
I am learning pandas and moving my Python code over to it. I want to compare every value with each of the values after it, using a function: the first with the second and so on, then the second with the third, but not with the first, because that pair has already been compared. In plain Python I use two nested loops over a list:
def match_values(a, b):
    pass  # do some stuff...
l = ['a', 'b', 'c']
length = len(l)
for i in range(length):
    for j in range(i + 1, length):  # starts after i, not from the start!
        if match_values(l[i], l[j]):
            pass  # do some stuff...
How do I do something similar in pandas when my list is a column in a dataframe? Do I simply reference every value as before, or is there a clever "vector-style" way to do this quickly and efficiently?
Thanks in advance,
Jo
Can you please check this? It compares each value with the values that follow it and stores the result as a list for each row (1 for a match, 0 otherwise).
>>> import pandas as pd
>>> import numpy as np
>>> val = [16,19,15,19,15]
>>> df = pd.DataFrame({'val': val})
>>> df
   val
0   16
1   19
2   15
3   19
4   15
>>>
>>>
>>> df['match'] = df.apply(lambda x: [ (1 if (x['val'] == df.loc[idx, 'val']) else 0) for idx in range(x.name+1, len(df)) ], axis=1)
>>> df
   val         match
0   16  [0, 0, 0, 0]
1   19     [0, 1, 0]
2   15        [0, 1]
3   19           [0]
4   15            []
Yes. pandas is built on NumPy, so comparisons are vectorized:
df['columnname'] > 5
This returns a boolean Series. If you also want the matching rows of the dataframe:
df[df['columnname'] > 5]
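If the comparison itself can be written as an element-wise operation (plain equality is assumed here as a stand-in for match_values), the nested loops can be replaced by NumPy broadcasting. This is only a sketch on the small val column from the answer above. It builds an n x n grid, so memory grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [16, 19, 15, 19, 15]})
vals = df['val'].to_numpy()

# equal[i, j] is True when vals[i] == vals[j]; broadcasting builds the n x n grid
equal = vals[:, None] == vals[None, :]

# Keep only the pairs with j > i, so each pair is compared once and
# no value is compared with itself
later_only = np.triu(equal, k=1)

# Row and column positions of the matching pairs
i_idx, j_idx = np.nonzero(later_only)
pairs = list(zip(i_idx.tolist(), j_idx.tolist()))
print(pairs)  # [(1, 3), (2, 4)]
```

For a comparison that cannot be vectorized, the nested loops (or itertools.combinations over the column values) remain the straightforward option.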
Normally, when you want to turn a set of data into a data frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one by one isn't going to work. Instead I decided to make a single list and iteratively put a chunk of it into each column of a data frame.
Here is an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list, it looks like this lst = [0, 1, …, 9]
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
   a   b   c   d   e
0  0 NaN NaN NaN NaN
1  1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC, you can reshape the list in column-major (Fortran) order:
import numpy as np
import pandas as pd
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
# order='F' fills the 2 x 5 array column by column
df = pd.DataFrame(alst.reshape(2, -1, order='F'), columns=[*'abcde'])
print(df)
Output:
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
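To make the reshape step a little more concrete, here is a small sketch comparing two equivalent ways of cutting the list into columns. The name rows_per_column is just an illustrative variable, assumed to be 2 as in the question's example.

```python
import numpy as np
import pandas as pd

lst = list(range(10))
rows_per_column = 2  # assumed chunk size, as in the question's example

arr = np.array(lst)

# Option 1: fill the array column by column (Fortran order)
by_columns = arr.reshape(rows_per_column, -1, order='F')

# Option 2: cut the list into consecutive chunks (one chunk per row),
# then transpose so that each chunk becomes a column
by_chunks = arr.reshape(-1, rows_per_column).T

df = pd.DataFrame(by_columns, columns=list('abcde'))
print(df)
```

Both options put lst[0:2] in the first column, lst[2:4] in the second, and so on, which is the allocation the loop in the question was aiming for.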
I have a list, vector_list, of length 800,000 whose elements are lists of length 768. I'm trying to add 768 columns to a pandas dataframe, where each column is 800,000 long and holds one element from each list. Here's my code:
active = pd.DataFrame()
for i in range(len(vector_list[0])):
    element_list = []
    for j in range(len(vector_list)):
        element_list.append(vector_list[j][i])
    active['Element {}'.format(i)] = element_list
Just to reiterate,
len(vector_list) = 800,000
len(vector_list[0]) = 768
Is there a more clever, faster way to do this?
Pass the list directly to the DataFrame constructor:
import pandas as pd
_list = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(_list)
print(df.head())
Output
   0  1
0  1  2
1  3  4
2  5  6
3  7  8
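To also get the 'Element i' column names from the question, the names can be passed through the columns argument. A minimal sketch follows, using a tiny stand-in for vector_list since the real data is not shown.

```python
import numpy as np
import pandas as pd

# Small stand-in for vector_list (the real one is 800,000 x 768)
vector_list = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Convert once to a 2-D array: one row per inner list
arr = np.asarray(vector_list)

# One column name per position in the inner lists
columns = ['Element {}'.format(i) for i in range(arr.shape[1])]
active = pd.DataFrame(arr, columns=columns)

print(active)
```

Here each inner list becomes a row and each position within the lists becomes a column, which matches the layout built by the nested loops in the question.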
I have a dataframe with an index and multiple columns. I also have a few lists containing index values sampled according to certain criteria. I want to create label columns based on whether or not a row's index is present in a given list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and since I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
   A        2_name
0  8             A
1  8             A
2  3             B
3  7             B
4  7             A
5  0             B
6  4             B
7  2             A
8  5  not_assigned
9  2  not_assigned
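If more label groups are added later, nesting numpy.where calls gets hard to read. One possible alternative is numpy.select, which takes a list of conditions and a matching list of labels. The sketch below reuses the list names from the sample; the dataframe contents are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10)})
random_m, random_y = [0, 1], [2, 3]
model_m, model_y = [7, 4], [5, 6]

# Conditions are checked in order; the first one that is True wins,
# which keeps the "A before B" priority of the original function
conditions = [
    df.index.isin(random_m + model_m),
    df.index.isin(random_y + model_y),
]
labels = ['A', 'B']

df['2_name'] = np.select(conditions, labels, default='not_assigned')
print(df)
```

As with the nested numpy.where version, the membership test runs once over the whole index, instead of calling a Python function for every row.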
Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = object()  # sentinel that never equals a real element
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen
collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
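import pandas
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)  # zero wherever a value repeats the previous one
y[0] = 1            # the first difference is NaN; mark the first element as a change
result = x[y != 0]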
You can use DataFrame's diff and indexing:
>>> import pandas as pd
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
    0
0   1
2   2
6   3
10  1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
The same works for a Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0     1
2     2
6     3
10    4
12    5
dtype: int64
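Applied to the column from the question, the shift mask can be sketched as below. Unlike the diff approach, the comparison with shift does not rely on subtraction, so it also works for non-numeric values. The letters series is only an illustrative example.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])

# True wherever a value differs from the one before it. The first row is
# always kept, because comparing with the shifted-in NaN gives True.
starts_new_run = s != s.shift()

# Drop the original positions so the result is numbered 0, 1, 2, ...
collapsed = s[starts_new_run].reset_index(drop=True)
print(collapsed.tolist())  # [1, 2, 3, 1]

# The same mask works on strings
letters = pd.Series(list('aabbbca'))
collapsed_letters = letters[letters != letters.shift()].tolist()
print(collapsed_letters)  # ['a', 'b', 'c', 'a']
```

One caveat: NaN never compares equal to NaN, so a run of consecutive missing values is not collapsed by this mask.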