remove elements from list based on index in pandas Dataframe - python

How can I remove the first n elements from each list in a pandas DataFrame column, where n comes from another column of the same row?
Suppose the DataFrame looks like this:
df:
values size
0 [1,2,3,4,5,6,7] 2 #delete first 2 elements from list
1 [1,2,3,4] 3 #delete first 3 elements from list
2 [9,8,7,6,5,4,3] 5 #delete first 5 elements from list
The expected output is:
df:
values size
0 [3,4,5,6,7] 2
1 [4] 3
2 [4,3] 5

Use list comprehension with indexing:
df['values'] = [i[j:] for i, j in zip(df['values'], df['size'])]
print (df)
values size
0 [3, 4, 5, 6, 7] 2
1 [4] 3
2 [4, 3] 5
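For reference, here is a minimal self-contained sketch of the same list comprehension, rebuilding the sample frame from the question. The fourth row is an added illustration (not in the original data): slicing past the end of a list returns an empty list rather than raising, so a size larger than the list length should be safe.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row whose
# size (4) is larger than the list it applies to.
df = pd.DataFrame({
    "values": [[1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4], [9, 8, 7, 6, 5, 4, 3], [1, 2]],
    "size": [2, 3, 5, 4],
})

# Drop the first n elements of each list, pairing the columns row by row.
df["values"] = [values[n:] for values, n in zip(df["values"], df["size"])]

print(df)
```

Pairing the two columns with zip avoids the per-row overhead of DataFrame.apply, which is usually the slower option on larger frames.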

Using df.apply
import pandas as pd
df = pd.DataFrame({"values": [[1,2,3,4,5,6,7], [1,2,3,4], [9,8,7,6,5,4,3]], "size": [2, 3, 5]})
df["values"] = df.apply(lambda x: x["values"][x['size']:], axis=1)
print(df)
Output:
size values
0 2 [3, 4, 5, 6, 7]
1 3 [4]
2 5 [4, 3]

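Using map from base Python, you could do the following (wrapping the result in a list rather than a Series avoids index alignment surprises when the frame does not have a default index):
dat['values'] = list(map(lambda x, y: x[y:], dat['values'], dat['size']))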
which returns
dat
Out[34]:
values size
0 [3, 4, 5, 6, 7] 2
1 [4] 3
2 [4, 3] 5

Related

count number of elements in a list inside a dataframe

Assume that we have a dataframe and inside the dataframe in a column we have lists. How can I count the number per list? For example
A B
(1,2,3) (1,2,3,4)
(1) (1,2,3)
I would like to create 2 new columns with the count of each column. something like the following
A B C D
(1,2,3) (1,2,3,4) 3 4
(1) (1,2,3) 1 3
where C corresponds to the number of the elements in the column A for that row, and D for the number of elements in the list in column B for that row
I cannot just do
df['A'] = len(df['A'])
Because that returns the len of my dataframe
You can use the .apply method on the Series for the column df['A'].
>>> import pandas as pd
>>> pd.DataFrame({"column": [[1, 2], [1], [1, 2, 3]]})
column
0 [1, 2]
1 [1]
2 [1, 2, 3]
>>> df = pd.DataFrame({"column": [[1, 2], [1], [1, 2, 3]]})
>>> df["column"].apply
<bound method Series.apply of 0 [1, 2]
1 [1]
2 [1, 2, 3]
Name: column, dtype: object>
>>> df["column"].apply(len)
0 2
1 1
2 3
Name: column, dtype: int64
>>> df["column"] = df["column"].apply(len)
>>>
See Python Pandas, apply function for a more general discussion of apply.
You can use pandas' apply with the len function on each column, as shown below, to obtain what you are looking for:
# import the package
import pandas as pd
# create a sample dataframe
df = pd.DataFrame(
{
'A':[[1,2,3],[32,4],[45,67,23,54,3],[],[0]],
'B':[[2],[3],[2,3],[5,6,1],[98,44]]
},
index=['z','y','m','n','o']
)
# computing lengths of lists in the column
df['items_in_A'] = df['A'].apply(len)
df['items_in_B'] = df['B'].apply(len)
# check the output
print(df)
Output:
A B items_in_A items_in_B
z [1, 2, 3] [2] 3 1
y [32, 4] [3] 2 1
m [45, 67, 23, 54, 3] [2, 3] 5 2
n [] [5, 6, 1] 0 3
o [0] [98, 44] 1 2
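As a possible alternative to apply(len), the .str accessor's len() method also works on list-like elements, not only on strings. A small sketch using the tuples from the question (the column names C and D follow the question's expected output):

```python
import pandas as pd

# Sample data modelled on the question: each cell holds a tuple.
df = pd.DataFrame({
    "A": [(1, 2, 3), (1,)],
    "B": [(1, 2, 3, 4), (1, 2, 3)],
})

# Series.str.len() calls len() on every element, so it counts
# the items of each tuple or list.
df["C"] = df["A"].str.len()
df["D"] = df["B"].str.len()

print(df)
```

Unlike apply(len), str.len() should return NaN for missing cells instead of raising, which can be convenient when the column has gaps.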

Assign a list to a pandas dataframe element

I want to add a column to a data frame and set each of its elements to a list. After running the code below, nothing changed:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
df['C'] = 0
for i in range(len(df)):
    lst = [6,7,8]
    # chained indexing: these assignments write to a temporary copy
    df.iloc[i]['C'] = []
    df.iloc[i]['C'] = lst
Also, based on Assigning a list value to pandas dataframe, I tried df.at[i,'C'] on the above code, and the following error appeared: 'setting an array element with a sequence.'
You can use np.tile with np.ndarray.tolist:
import numpy as np
l = len(df)
df['C'] = np.tile([6,7,8],(l,1)).tolist()
df
A B C
0 1 4 [6, 7, 8]
1 2 5 [6, 7, 8]
2 3 6 [6, 7, 8]
One idea is to use a list comprehension:
lst = [6,7,8]
df['C'] = [lst for _ in df.index]
print (df)
A B C
0 1 4 [6, 7, 8]
1 2 5 [6, 7, 8]
2 3 6 [6, 7, 8]
Your own approach worked for me after initializing the column with an empty string and setting each cell by position:
df['C'] = ''
for i in range(len(df)):
    lst = [6,7,8]
    df.iloc[i, df.columns.get_loc('C')] = lst
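One thing to watch with the list comprehension [lst for _ in df.index]: every row ends up holding the same list object, so an in-place change to one cell shows up in all of them. A sketch of a variant that gives each row its own copy, assuming independent lists are what you want:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
lst = [6, 7, 8]

# lst.copy() creates a new list per row, so the cells do not alias each other.
df["C"] = [lst.copy() for _ in df.index]

# Mutating the list stored in the first row leaves the other rows
# (and the original lst) untouched.
df.at[0, "C"].append(9)

print(df)
```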

Python - pick a value from a list basing on another list

I've got a dataframe. In column A there is a list of integers, in column B - an integer. I want to pick n-th value of the column A list, where n is a number from column B. So if in columns A there is [1,5,6,3,4] and in column B: 2, I want to get '6'.
I tried this:
result = [y[x] for y in df['A'] for x in df['B']]
But it doesn't work. Please help.
Use zip with list comprehension:
df['new'] = [y[x] for x, y in zip(df['B'], df['A'])]
print (df)
A B new
0 [1, 2, 3, 4, 5] 1 2
1 [1, 2, 3, 4] 2 3
You can also use apply, i.e.
df = pd.DataFrame({'A':[[1,2,3,4,5],[1,2,3,4]],'B':[1,2]})
A B
0 [1, 2, 3, 4, 5] 1
1 [1, 2, 3, 4] 2
# np.array is not needed here; it only helps when column B also holds lists:
# df.apply(lambda x: np.array(x['A'])[x['B']], axis=1)
df.apply(lambda x: x['A'][x['B']], axis=1)  # thanks @Zero
0 2
1 3
dtype: int64
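Both answers above raise an IndexError if a value in B points past the end of its list. A hedged sketch that guards against this; the helper name pick and the third row are illustrative assumptions, not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [[1, 5, 6, 3, 4], [1, 2, 3, 4], [7]],
    "B": [2, 0, 3],  # the last index is deliberately out of range
})

def pick(values, n):
    """Hypothetical helper: return values[n], or None if n is out of range."""
    return values[n] if -len(values) <= n < len(values) else None

# Pair each list with its index row by row, as in the zip answer.
picked = [pick(values, n) for values, n in zip(df["A"], df["B"])]

# Note: pandas may store the None as NaN and convert the column to float.
df["new"] = picked

print(picked)
```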

Find all duplicate rows in a pandas dataframe

I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:
col
1 | 1
2 | 2
3 | 1
4 | 1
5 | 2
I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].
First filter all duplicated rows, then either use groupby with apply or convert the index with to_series:
df = df[df.col.duplicated(keep=False)]
a = df.groupby('col').apply(lambda x: list(x.index))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
a = df.index.to_series().groupby(df.col).apply(list)
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
And if you need nested lists:
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print (L)
[[1, 3, 4], [2, 5]]
If you need to use only the first column, you can select it by position with iloc:
a = (df[df.iloc[:,0].duplicated(keep=False)]
       .groupby(df.iloc[:,0]).apply(lambda x: list(x.index)))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
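Since the question asks for a solution that does not depend on knowing the column names or their number, here is a sketch that compares whole rows instead of a single column. The two-column sample frame is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical frame with two columns; rows 1, 3, 4 and rows 2, 5 are identical.
df = pd.DataFrame(
    {"a": [1, 2, 1, 1, 2, 3], "b": ["x", "y", "x", "x", "y", "z"]},
    index=[1, 2, 3, 4, 5, 6],
)

# duplicated() with no subset compares all columns; keep=False marks
# every member of a duplicate group, not just the repeats.
dups = df[df.duplicated(keep=False)]

# Group on all columns at once; .groups maps each key to its index labels.
grouped = dups.groupby(list(dups.columns)).groups
index_groups = [labels.tolist() for labels in grouped.values()]

print(index_groups)
```

Passing list(dups.columns) to groupby keeps the code independent of the actual column names.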

How to make a multi-dimensional column into a single valued vector for training data in sklearn pandas

I have a data set in which a certain column is a combination of several independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
Thus the column itself is composed of multiple values, and I need to pass it as a vector into a machine learning algorithm. I cannot really combine the values to assign a single value like:
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector as
id age marks
1 5 1
2 7 2
3 4 3
and then subsequently pass it to the algorithm, as the number of such combinations would be infinite, and that might not really capture the real meaning of the data.
How should I handle a situation where an individual feature is a combination of multiple features?
Note:
The values in column marks are just examples; it could be any list of values: a list of integers, a list of strings, or a string composed of multiple strings separated by commas.
You can use pd.factorize on tuples.
Assuming marks is a column of lists:
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
Setup for df:
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)
UPDATE: I think we can use CountVectorizer in this case:
assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
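# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())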
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
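If pulling in scikit-learn and nltk is not desirable, a similar indicator matrix can probably be built with pandas alone. This is a swapped-in alternative to the CountVectorizer approach, not the same method: it records presence (0 or 1) rather than counts, and it assumes marks holds lists of strings, as in the setup shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": [["3", "6", "7"], ["1", "2"], ["34", "78", "2"], ["3", "6", "7"]],
})

# Join each list into one '|'-separated string, then let get_dummies
# split it back into one indicator column per distinct value.
indicators = df["marks"].str.join("|").str.get_dummies(sep="|")

print(indicators)
```

The resulting columns can be joined back to the other features with pd.concat([df[["id", "age"]], indicators], axis=1) before being passed to a model.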
OLD answer:
You can first convert each list to a string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
This will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whichever is most appropriate for the function you need to call), use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))  # list() is needed in Python 3, where zip returns an iterator
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign the result back so the column is replaced
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5
