Filter list-valued columns - python

I have this kind of dataset:
id value cond1 cond2
a 1 ['a','b'] [1,2]
b 1 ['a'] [1]
a 2 ['b'] [2]
a 3 ['a','b'] [1,2]
b 3 ['a','b'] [1,2]
I would like to extract all the rows that satisfy both conditions, with something like
df.loc[(df['cond1']==['a','b']) & (df['cond2']==[1,2])]
However, this syntax produces
ValueError: ('Lengths must match to compare', (100,), (1,))
or this if I use isin:
SystemError: <built-in method view of numpy.ndarray object at 0x7f1e4da064e0> returned a result with an error set
How do I do this correctly?
Thanks!

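pandas interprets the list on the right-hand side as an array-like, so it attempts an element-by-element comparison against the whole column and fails because the lengths differ, as seen in the error. One workaround is to convert each cell to a tuple and compare against a tuple: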
df.loc[(df["cond1"].map(tuple) == ("a", "b")) & (df["cond2"].map(tuple) == (1, 2))]
id value cond1 cond2
0 a 1 [a, b] [1, 2]
3 a 3 [a, b] [1, 2]
4 b 3 [a, b] [1, 2]
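
For reference, here is a self-contained sketch of the same filter that compares each cell to the target list explicitly instead of relying on tuple comparison. The helper name cells_equal is hypothetical (it is not a pandas function), and the frame is rebuilt from the sample rows in the question:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": ["a", "b", "a", "a", "b"],
    "value": [1, 1, 2, 3, 3],
    "cond1": [["a", "b"], ["a"], ["b"], ["a", "b"], ["a", "b"]],
    "cond2": [[1, 2], [1], [2], [1, 2], [1, 2]],
})

def cells_equal(series, target):
    """Hypothetical helper: True where a cell equals `target` as a whole list."""
    # Plain Python list equality, evaluated once per cell.
    return pd.Series([cell == target for cell in series], index=series.index)

# Combine the two per-cell masks and select the matching rows.
mask = cells_equal(df["cond1"], ["a", "b"]) & cells_equal(df["cond2"], [1, 2])
result = df.loc[mask]
print(result)
```

Because the mask keeps the original index, it can be combined with any other boolean condition on df.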

Related

Get only the users that contain a certain list column

I have the following dataframe
df = pd.DataFrame({'Id':['1','2','3'],'List_Origin':[['A','B'],['B','C'],['A','B']]})
How can I get only the Ids whose List_Origin contains only certain values, for example 'A','B'? I would appreciate it if the solution avoided loops.
Desired end result:
end_df = pd.DataFrame({'Id':['1','3'],'List_Origin':[['A','B'],['A','B']]})
You can use apply and check like below:
>>> df[df['List_Origin'].apply(lambda x: x==['A', 'B'] or x==['A,B'])]
Id List_Origin
0 1 [A,B]
2 3 [A, B]
Unfortunately, when using lists, you cannot vectorize. You must use a loop.
I am assuming first that you have ['A', 'B'] and not ['A,B'] in the first row:
end_df = df[[x==['A', 'B'] for x in df['List_Origin']]]
output:
Id List_Origin
0 1 [A, B]
2 3 [A, B]
If, really, you have a mix of ['A', 'B'] and ['A,B'], then use:
end_df = df[[','.join(x)=='A,B' for x in df['List_Origin']]]
output:
Id List_Origin
0 1 [A,B]
2 3 [A, B]
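
If "contains only 'A','B'" is meant regardless of order, one possible variant is to compare sets instead of lists. This is a sketch under that assumption; the third row is deliberately changed to ['B', 'A'] here to show the difference, and note that set comparison also ignores duplicates:

```python
import pandas as pd

# Same shape as the question's frame, but row 3 is reordered on purpose (assumption).
df = pd.DataFrame({
    "Id": ["1", "2", "3"],
    "List_Origin": [["A", "B"], ["B", "C"], ["B", "A"]],
})

wanted = {"A", "B"}

# Still a Python-level loop: one set comparison per row.
mask = [set(x) == wanted for x in df["List_Origin"]]
end_df = df[mask]
print(end_df)
```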

Replace elements of a pandas series with a list containing a single string

I am trying to replace the empty lists in a pandas Series with a list containing a single string. Here is what I have:
a = pd.Series([
[],
[],
['a'],
['a'],
["a","b"]
])
The desired output is as follows:
b = pd.Series([
['INCOMPLETE'],
['INCOMPLETE'],
['1'],
['1'],
["1","2"]
])
When I try to replace the empty lists using boolean indexing, my single-string list is automatically coerced to a plain string:
a[a.map(len) == 0 ] = ['INCOMPLETE']
0 INCOMPLETE
1 INCOMPLETE
2 [a]
3 [a]
4 [a, b]
In contrast, the manual replacement a[0] = ['INCOMPLETE'] works.
Does anyone have a workaround?
Use a lambda function with if-else to replace the empty lists; an empty list is treated as False in a boolean context:
a = a.apply(lambda x: x if x else ['INCOMPLETE'])
print (a)
0 [INCOMPLETE]
1 [INCOMPLETE]
2 [a]
3 [a]
4 [a, b]
dtype: object
You can't easily assign a list in pandas (pandas is not made to work with lists as items). You need to loop here:
b = pd.Series([x if x else ['INCOMPLETE'] for x in a], index=a.index)
output:
0 [INCOMPLETE]
1 [INCOMPLETE]
2 [a]
3 [a]
4 [a, b]
dtype: object
import pandas as pd
a = pd.Series([
[],
[],
['a'],
['a'],
["a","b"]
])
convert_num = lambda x:list(map(lambda y:ord(y)-ord('a')+1,x))
map_data = lambda x:'Incomplete' if x==[] else convert_num(x)
a = a.apply(map_data)
0 Incomplete
1 Incomplete
2 [1]
3 [1]
4 [1, 2]
dtype: object
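
None of the outputs above matches the desired Series b exactly, where the letters also become numeric strings. A minimal sketch that combines the ideas, assuming every item is a single lowercase letter and that 'a' maps to '1', 'b' to '2', and so on (the helper name convert is hypothetical):

```python
import pandas as pd

a = pd.Series([[], [], ["a"], ["a"], ["a", "b"]])

def convert(items):
    """Hypothetical helper: placeholder for empty lists, else letter -> '1', '2', ..."""
    if not items:
        return ["INCOMPLETE"]
    # Assumes each item is one lowercase ASCII letter.
    return [str(ord(ch) - ord("a") + 1) for ch in items]

# Build a new Series cell by cell so each list is kept as a single item.
b = pd.Series([convert(x) for x in a], index=a.index)
print(b)
```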

Pandas dataframe selecting with index and condition on a column

I have been trying for a while to solve this problem:
I have a DataFrame like this:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([['A', 2, 3], ['B', 5, 6], ['C', 8, 9]]),columns=['a', 'b', 'c'])
j=[0,2]
But when I try to select just a part of it, filtering by a list of index labels and a condition on a column, I get an error:
df[df.loc[j]['a']=='A']
There is something wrong, but I don't see what the problem is. Can you help me?
This is the error message:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
The boolean mask is built from the filtered DataFrame but applied to the original one, so their indices differ and the error is raised.
You need to apply the mask to the filtered DataFrame:
df1 = df.loc[j]
print (df1)
a b c
0 A 2 3
2 C 8 9
out = df1[df1['a']=='A']
print(out)
a b c
0 A 2 3
Your approach also works if you align the index of the filtered mask with the original index using Series.reindex:
out = df[(df.loc[j, 'a']=='A').reindex(df.index, fill_value=False)]
print(out)
a b c
0 A 2 3
Or a nicer solution:
out = df[(df['a'] == 'A') & (df.index.isin(j))]
print(out)
a b c
0 A 2 3
The boolean array and the DataFrame should be the same length. Here your df has length 3, but the boolean array df.loc[j]['a']=='A' has length 2.
You should do:
>>> df.loc[j][df.loc[j]['a']=='A']
a b c
0 A 2 3
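
To make the index mismatch concrete, here is a small self-contained sketch. It builds the frame from a dict rather than a NumPy array (an assumption made only to keep the example short) and shows the mask before and after reindexing:

```python
import pandas as pd

df = pd.DataFrame({"a": ["A", "B", "C"], "b": [2, 5, 8], "c": [3, 6, 9]})
j = [0, 2]

# Mask computed on the subset: its index is [0, 2], not [0, 1, 2].
partial_mask = df.loc[j, "a"] == "A"

# Align the mask to the full index; rows outside j become False.
full_mask = partial_mask.reindex(df.index, fill_value=False)

out = df[full_mask]
print(out)
```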

Pandas Set element of a new column as a list (iterable) raise ValueError: setting an array element with a sequence

I want to, at the same time, create a new column in a pandas dataframe and set its first value to a list.
I want to transform this dataframe
df = pd.DataFrame.from_dict({'a':[1,2],'b':[3,4]})
a b
0 1 3
1 2 4
into this one
a b c
0 1 3 [2,3]
1 2 4 NaN
I tried:
df.loc[0, 'c'] = [2,3]
df.loc[0, 'c'] = np.array([2,3])
df.loc[0, 'c'] = [[2,3]]
df.at[0,'c'] = [2,3]
df.at[0,'d'] = [[2,3]]
It does not work.
How should I proceed?
If the first element of a series is a list, then the series must be of type object (not the most efficient for numerical computations). This should work, however.
df = df.assign(c=None)
df.loc[0, 'c'] = [2, 3]
>>> df
a b c
0 1 3 [2, 3]
1 2 4 None
If you really need the remaining values of column c to be NaNs instead of None, use this:
df.loc[1:, 'c'] = np.nan
The problem seems to be related to the dtype of the c column. If you convert it to dtype object, you can use iat, at or loc to set a cell to a list (older pandas versions also offered set_value, which has since been removed).
df2 = (
    df.assign(c=np.nan)
    .assign(c=lambda x: x.c.astype(object))
)
df2.at[0, 'c'] = [2, 3]
print(df2)
a b c
0 1 3 [2, 3]
1 2 4 NaN
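
Putting the pieces together, a minimal runnable sketch, assuming the goal is an object-dtype column c with a list in the first row and NaN elsewhere:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Create the new column with dtype object so a cell can hold a list.
df["c"] = pd.Series([np.nan] * len(df), index=df.index, dtype=object)

# .at sets exactly one cell, so the list is stored as a single item.
df.at[0, "c"] = [2, 3]
print(df)
```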

python remove element from list without changing index of other elements

I have a list L = [a,a,b,b,c,c]. I want to remove the first 'b' so that L becomes [a,a,b,c,c]. In the new L, the index of the first 'c' is 3. Is there any way to remove the first 'b' from L and still get 4 as the index of the first 'c'?
Thanks in advance.
It isn't possible to completely remove an element while retaining the indices of the other elements, as in your example. The indices of the elements represent their positions in the list. If you have a list [a, a, b, b, c, c] and you remove the first b to get [a, a, b, c, c] then the indices adjust because they represent the positions of the elements in the list, which have changed with the removal of an element.
However, depending on what your use case is, there are ways you can get the behavior you want by using a different data structure. For example, you could use a dictionary of integers to objects (whatever objects you need in the list). For example, the original list could be represented instead as {0: a, 1: a, 2: b, 3: b, 4: c, 5: c}. If you remove the b at 'index' (rather, with a key of) 2, you will get {0: a, 1: a, 3: b, 4: c, 5: c}, which seems to be the behavior you are looking for.
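
The dictionary idea described above can be sketched as follows. The variable names indexed and first_c are illustrative only:

```python
L = ["a", "a", "b", "b", "c", "c"]

# Map each original position to its element.
indexed = dict(enumerate(L))

# Remove the first 'b' (original position 2); the other keys are untouched.
del indexed[2]

# Find the original position of the first remaining 'c'.
first_c = next(k for k, v in indexed.items() if v == "c")
print(indexed)
print(first_c)
```

Dictionaries preserve insertion order in Python 3.7+, so iterating over indexed still visits the elements in their original order.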
Perhaps you can get your desired effect with pandas:
>>> import pandas as pd
>>> L = ['a','a','b','b','c','c']
>>> df = pd.DataFrame(L)
>>> df
0
0 a
1 a
2 b
3 b
4 c
5 c
[6 rows x 1 columns]
>>> df = df.drop(3)
>>> df
0
0 a
1 a
2 b
4 c
5 c
[5 rows x 1 columns]
>>> df.loc[4]
0 c
Name: 4, dtype: object
>>> df.loc[5]
0 c
Name: 5, dtype: object
