Get only the users that contain a certain list column - python

I have the following dataframe
df = pd.DataFrame({'Id':['1','2','3'],'List_Origin':[['A','B'],['B','C'],['A','B']]})
How could I get only the Ids whose List_Origin is a certain list, for example ['A', 'B']? I would appreciate it if the solution avoided loops.
Desired end result:
end_df = pd.DataFrame({'Id':['1','3'],'List_Origin':[['A','B'],['A','B']]})

You can use apply and check like below:
>>> df[df['List_Origin'].apply(lambda x: x==['A', 'B'] or x==['A,B'])]
Id List_Origin
0 1 [A,B]
2 3 [A, B]

Unfortunately, when using lists, you cannot vectorize. You must use a loop.
I am assuming first that you have ['A', 'B'] and not ['A,B'] in the first row:
end_df = df[[x==['A', 'B'] for x in df['List_Origin']]]
output:
Id List_Origin
0 1 [A, B]
2 3 [A, B]
If you really have a mix of ['A', 'B'] and ['A,B'], then use:
end_df = df[[','.join(x)=='A,B' for x in df['List_Origin']]]
output:
Id List_Origin
0 1 [A,B]
2 3 [A, B]
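Since the question asked to avoid loops, another option (a sketch that assumes every entry is a proper list such as ['A', 'B'], not the single string 'A,B') is to convert each list to a tuple and match it with isin:
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3'],
                   'List_Origin': [['A', 'B'], ['B', 'C'], ['A', 'B']]})

# Tuples are hashable, so each list can be turned into a tuple and matched
# against one or more target tuples with isin.
end_df = df[df['List_Origin'].map(tuple).isin([('A', 'B')])]
print(end_df)
output:
Id List_Origin
0 1 [A, B]
2 3 [A, B]
Note that map still iterates over the rows internally, so this removes the explicit loop from your code rather than truly vectorizing the comparison.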

Related

Replace elements of a pandas series with a list containing a single string

I am trying to replace the empty lists in a pandas Series with a list containing a single string. Here is what I have:
a = pd.Series([
[],
[],
['a'],
['a'],
["a","b"]
])
The desired output is as follows:
b = pd.Series([
['INCOMPLETE'],
['INCOMPLETE'],
['1'],
['1'],
["1","2"]
])
When I try to replace the empty lists using boolean indexing, I get an automatic coercion of my list containing a single string to just a string:
a[a.map(len) == 0 ] = ['INCOMPLETE']
0 INCOMPLETE
1 INCOMPLETE
2 [a]
3 [a]
4 [a, b]
In contrast, the manual replacement a[0] = ['INCOMPLETE'] works.
Does anyone have a workaround?
Use a lambda function with if-else to replace the empty lists, because an empty list is evaluated as False in the comparison:
a = a.apply(lambda x: x if x else ['INCOMPLETE'])
print (a)
0 [INCOMPLETE]
1 [INCOMPLETE]
2 [a]
3 [a]
4 [a, b]
dtype: object
You can't easily assign a list in pandas (pandas is not made to work with lists as items), you need to loop here:
b = pd.Series([x if x else ['INCOMPLETE'] for x in a], index=a.index)
output:
0 [INCOMPLETE]
1 [INCOMPLETE]
2 [a]
3 [a]
4 [a, b]
dtype: object
import pandas as pd
a = pd.Series([
[],
[],
['a'],
['a'],
["a","b"]
])
convert_num = lambda x: list(map(lambda y: ord(y) - ord('a') + 1, x))  # map letters to their 1-based alphabet positions
map_data = lambda x: 'Incomplete' if x == [] else convert_num(x)  # empty lists become the string 'Incomplete'
a = a.apply(map_data)
0 Incomplete
1 Incomplete
2 [1]
3 [1]
4 [1, 2]
dtype: object
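If the goal is the exact desired output from the question, with the numbers as strings and the placeholder wrapped in a list, a small variation of the same idea could look like this (a sketch, assuming the letters are always lowercase ASCII):
import pandas as pd

a = pd.Series([[], [], ['a'], ['a'], ['a', 'b']])

# Map each letter to its 1-based alphabet position as a string; empty lists
# become ['INCOMPLETE'] so every element stays a list.
b = a.apply(lambda x: [str(ord(c) - ord('a') + 1) for c in x] if x else ['INCOMPLETE'])
print(b)
0 [INCOMPLETE]
1 [INCOMPLETE]
2 [1]
3 [1]
4 [1, 2]
dtype: object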

Compare consecutive rows and delete based on condition

I would like to compare consecutive rows of the column 'one' and delete rows based on this condition:
if 2 or more consecutive rows are the same, keep them
if one row is different from both the previous and the next row, delete it
Example df:
a = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
print output would be:
one two three
0 A B C
1 A B C
2 B B C
3 C B C
4 C B C
5 C B C
Expected output would be:
one two three
0 A B C
1 A B C
3 C B C
4 C B C
5 C B C
So the row at index 2 will be deleted.
I've tried using shift but I am stuck, because the way I am doing it now, it also deletes the first and last rows. Can someone please tell me a better way of doing this? Or maybe how to apply shift but ignore the first and last rows?
#First I take only the one column
df = df['one']
#Then apply shift
df.loc[df.shift(-1) == df]
With the above code I get this, which is not correct because it also deletes the first and last rows:
0 A
3 C
4 C
Try shifting up and down:
mask = (df.one == df.one.shift(-1)) | (df.one == df.one.shift(1))
adj_df = df[mask]
You could use shift in both directions (and you need an all condition to check that all the columns are the same):
df[(df.shift(1) == df).all(axis=1) | (df.shift(-1) == df).all(axis=1)]
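If the condition really is "keep runs of 2 or more identical consecutive values", a run-length check with groupby/transform is another way to express it (a sketch based on the example dataframe above):
import pandas as pd

a = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],
     ['C', 'B', 'C'], ['C', 'B', 'C'], ['C', 'B', 'C']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

# Label each run of consecutive equal values in 'one', then keep only rows
# that belong to a run of length 2 or more.
run_id = (df['one'] != df['one'].shift()).cumsum()
keep = df.groupby(run_id)['one'].transform('size') >= 2
print(df[keep])
one two three
0 A B C
1 A B C
3 C B C
4 C B C
5 C B C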

Function Value with Combination(or Permutation) of Variables and Assign to Dataframe

I have n variables. Suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store the result at the corresponding row and column of a dataframe.
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x:np.nan for x in indexes}, index=indexes)
If I apply sum (the function can be anything), then the result I want to get is like this:
a b c
a 2 3 4
b 3 4 5
c 4 5 6
I can only think of iterating all the variables, apply the function one by one, and use the index of the iterators to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series for that effect. In such cases, pandas uses the series indices as columns in the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
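If the function is something other than a plain sum, the same idea generalizes by nesting apply (a sketch; f is a hypothetical placeholder for whatever two-argument function you want to apply):
import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3})
f = lambda x, y: x + y  # replace with any two-argument function

# The outer apply walks over the values; the inner apply pairs each value
# with every other value, producing one row of the result per element.
df = s.apply(lambda x: s.apply(lambda y: f(x, y)))
print(df)
a b c
a 2 3 4
b 3 4 5
c 4 5 6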
I believe you need a broadcast sum of an array created from the variables if performance is important:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a,b,c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print (df)
a b c
a 2 3 4
b 3 4 5
c 4 5 6
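As a side note, the same broadcast can also be written with the ufunc's outer method, which some find more explicit for pairwise operations (a sketch using the same variables as above):
import numpy as np
import pandas as pd

a, b, c = 1, 2, 3
indexes = ['a', 'b', 'c']
arr = np.array([a, b, c])

# np.add.outer computes all pairwise sums, equivalent to arr + arr[:, None].
df = pd.DataFrame(np.add.outer(arr, arr), index=indexes, columns=indexes)
print(df)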

How to add multiple columns to dataframe by function

If I have a df such as this:
a b
0 1 3
1 2 4
I can use df['c'] = '' and df['d'] = -1 to add 2 columns and get this:
a b c d
0 1 3 -1
1 2 4 -1
How can I put the code inside a function, so I can apply that function to df and add all the columns at once, instead of adding them one by one separately as above? Thanks
Create a dictionary:
dictionary = {'c': '', 'd': -1}

def new_columns(df, dictionary):
    return df.assign(**dictionary)
then call it with your df:
df = new_columns(df, dictionary)
or just ( if you don't need a function call, not sure what your use case is) :
df.assign(**dictionary)
def update_df(a_df, new_cols_names, new_cols_vals):
    for n, v in zip(new_cols_names, new_cols_vals):
        a_df[n] = v

update_df(df, ['c', 'd', 'e'], ['', 5, 6])
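If some of the new columns should be computed from existing ones, assign also accepts callables, so everything can still be added in a single call (a sketch; the column names and values here are only illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Constant columns and derived columns can be mixed in one assign call;
# the lambda receives the dataframe, including columns assigned earlier in the same call.
df = df.assign(c='', d=-1, e=lambda x: x['a'] + x['b'])
print(df)
a b c d e
0 1 3 -1 4
1 2 4 -1 6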

How to Access Element of Pandas Series that is a List

I have a DataFrame series where each row contains a list of strings. I'd like to create another series that is the last string in the list for that row.
So one row may have a list, e.g.
['a', 'b', 'c', 'd']
I'd like to create another pandas series made up of the last element of each row's list, normally accessed with a -1 reference, in this case 'd'. The lists for each observation (i.e. row) are of varying length. How can this be done?
I believe you need indexing with str; it works with all iterables:
df = pd.DataFrame({'col':[['a', 'b', 'c', 'd'],['a', 'b'],['a'], []]})
df['last'] = df['col'].str[-1]
print (df)
col last
0 [a, b, c, d] d
1 [a, b] b
2 [a] a
3 [] NaN
strings are iterables too:
df = pd.DataFrame({'col':['abcd','ab','a', '']})
df['last'] = df['col'].str[-1]
print (df)
col last
0 abcd d
1 ab b
2 a a
3 NaN
Why not convert the list column into an info dataframe? Then you can use the index for a join:
Infodf = pd.DataFrame(df.col.values.tolist(), index=df.index)
Infodf
Out[494]:
0 1 2 3
0 a b c d
1 a b None None
2 a None None None
3 None None None None
I think I overlooked the question, and both PiR and Jez provided valuable suggestions to help me achieve the final result:
Infodf.ffill(axis=1).iloc[:, -1]
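Putting the two steps together on the example frame from the first answer, the expand-and-ffill route might look like this (a sketch; the empty list ends up as NaN, just as with str[-1]):
import pandas as pd

df = pd.DataFrame({'col': [['a', 'b', 'c', 'd'], ['a', 'b'], ['a'], []]})

# Expand the lists into a wide frame, forward-fill across the columns, then
# take the last column to recover the last element of each list.
Infodf = pd.DataFrame(df['col'].tolist(), index=df.index)
df['last'] = Infodf.ffill(axis=1).iloc[:, -1]
print(df)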
