Creating a new column from two columns with apply() - python

I want to create a column s['C'] using apply() with a Pandas DataFrame.
My dataset is similar to this:
[In]:
s=pd.DataFrame({'A':['hello', 'good', 'my', 'pandas','wrong'],
'B':[['all', 'say', 'hello'],
['good', 'for', 'you'],
['so','hard'],
['pandas'],
[]]})
[Out]:
A B
0 hello [all, say, hello]
1 good [good, for, you]
2 my [so, hard]
3 pandas [pandas]
4 wrong []
I need to create a column s['C'] where the value of each row is a list of ones and zeros: for each element of the list in column B, 1 if it equals the word in column A and 0 otherwise, keeping the positions of column B. My output should look like this:
[Out]:
A B C
0 hello [all, say, hello] [0, 0, 1]
1 good [good, for, you] [1, 0, 0]
2 my [so, hard] [0, 0]
3 pandas [pandas] [1]
4 wrong [] [0]
I've been trying with a function and apply, but I still haven't figured out where the error is.
[In]:
def func(valueA,listB):
    new_list=[]
    for i in listB:
        if listB[i] == valueA:
            new_list.append(1)
        else:
            new_list.append(0)
    return new_list
s['C']=s.apply( lambda x: func(x.loc[:,'A'], x.loc[:,'B']))
The error is: Too many indexers
And I also tried with:
[In]:
list=[]
listC=[]
for i in s['A']:
    for j in s['B'][i]:
        if s['A'][i] == s['B'][i][j]:
            list.append(1)
        else:
            list.append(0)
    listC.append(list)
s['C']=listC
The error is: KeyError: 'hello'
Any suggestions?

If you are working with pandas 0.25+, explode is an option:
(s.explode('B')
.assign(C=lambda x: x['A'].eq(x['B']).astype(int))
.groupby(level=0).agg({'A':'first','B':list,'C':list})
)
Output:
A B C
0 hello [all, say, hello] [0, 0, 1]
1 good [good, for, you] [1, 0, 0]
2 my [so, hard] [0, 0]
3 pandas [pandas] [1]
4 wrong [nan] [0]
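Note that explode turns the empty list into NaN, which is why B shows [nan] in the last row. A minimal sketch of a variant, assuming you only want the new column and want to leave B untouched: compute C on the exploded frame and assign the grouped result back to s (the name exploded is just illustrative).

```python
import pandas as pd

s = pd.DataFrame({
    'A': ['hello', 'good', 'my', 'pandas', 'wrong'],
    'B': [['all', 'say', 'hello'], ['good', 'for', 'you'],
          ['so', 'hard'], ['pandas'], []],
})

# One row per list element; the empty list becomes a single NaN row
exploded = s.explode('B')
exploded['C'] = exploded['A'].eq(exploded['B']).astype(int)

# Collect the flags back into one list per original row (index level 0)
s['C'] = exploded.groupby(level=0)['C'].agg(list)
print(s)
```

Because only C is taken from the exploded frame, the original lists in B (including the empty one) are preserved.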
Option 2: Based on your logic, you can do a list comprehension. This should work with any version of pandas:
s['C'] = [[x==a for x in b] if b else [0] for a,b in zip(s['A'],s['B'])]
Output:
A B C
0 hello [all, say, hello] [False, False, True]
1 good [good, for, you] [True, False, False]
2 my [so, hard] [False, False]
3 pandas [pandas] [True]
4 wrong [] [0]
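The comprehension above yields booleans rather than the ones and zeros asked for. A small sketch of the same idea that wraps each comparison in int(), assuming the desired result for an empty list is [0] as in the question:

```python
import pandas as pd

s = pd.DataFrame({
    'A': ['hello', 'good', 'my', 'pandas', 'wrong'],
    'B': [['all', 'say', 'hello'], ['good', 'for', 'you'],
          ['so', 'hard'], ['pandas'], []],
})

# int(True) == 1 and int(False) == 0; an empty list falls back to [0]
s['C'] = [[int(x == a) for x in b] if b else [0]
          for a, b in zip(s['A'], s['B'])]
print(s)
```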

With apply, it would be:
s['c'] = s.apply(lambda x: [int(x.A == i) for i in x.B], axis=1)
s
A B c
0 hello [all, say, hello] [0, 0, 1]
1 good [good, for, you] [1, 0, 0]
2 my [so, hard] [0, 0]
3 pandas [pandas] [1]
4 wrong [] []

I could get your function to work with some minor changes:
def func(valueA, listB):
    new_list = []
    for i in range(len(listB)):  # changed "in listB" to "in range(len(listB))" so that i is an index
        if listB[i] == valueA:
            new_list.append(1)
        else:
            new_list.append(0)
    return new_list
and adding the parameter axis=1 to the apply call:
s['C'] = s.apply(lambda x: func(x.A, x.B), axis=1)
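To see why the original call failed: without axis=1, apply hands each column to the function as a Series, so x.loc[:, 'A'] asks a one-dimensional object for two indexers. The sketch below, on a shortened version of the data, shows what the function receives in each mode; per_column and per_row are illustrative names, and func is rewritten to iterate over the items directly instead of over indices.

```python
import pandas as pd

s = pd.DataFrame({
    'A': ['hello', 'good'],
    'B': [['all', 'say', 'hello'], ['good', 'for', 'you']],
})

# Default axis=0: the function receives one column at a time,
# so x.name is a column label
per_column = s.apply(lambda x: x.name)

# axis=1: the function receives one row at a time,
# so x.name is a row label and x['A'], x['B'] are single values
per_row = s.apply(lambda x: x.name, axis=1)

def func(valueA, listB):
    # Iterate over the items themselves; no index lookup needed
    return [1 if item == valueA else 0 for item in listB]

s['C'] = s.apply(lambda x: func(x['A'], x['B']), axis=1)
print(per_column.tolist(), per_row.tolist())
print(s)
```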

Another approach that requires numpy for easy indexing:
import numpy as np
def create_vector(word, vector):
    out = np.zeros(len(vector))
    indices = [i for i, x in enumerate(vector) if x == word]
    out[indices] = 1
    return out.astype(int)
s['C'] = s.apply(lambda x: create_vector(x.A, x.B), axis=1)
# Output
# A B C
# 0 hello [all, say, hello] [0, 0, 1]
# 1 good [good, for, you] [1, 0, 0]
# 2 my [so, hard] [0, 0]
# 3 pandas [pandas] [1]
# 4 wrong [] []
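A shorter variant of the same numpy idea, as a sketch: compare an object array against the word elementwise and cast the boolean result to int. Using dtype=object here is my assumption to keep the comparison elementwise even when the list is empty; like the function above, it returns an empty array for an empty list rather than [0].

```python
import numpy as np
import pandas as pd

def create_vector(word, items):
    # Elementwise comparison on an object array, then True/False -> 1/0
    return (np.array(items, dtype=object) == word).astype(int)

s = pd.DataFrame({
    'A': ['hello', 'good', 'my', 'pandas', 'wrong'],
    'B': [['all', 'say', 'hello'], ['good', 'for', 'you'],
          ['so', 'hard'], ['pandas'], []],
})
s['C'] = s.apply(lambda x: create_vector(x.A, x.B), axis=1)
print(s)
```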

Related

How to find index of a list in a new dataframe by mapping?

I have a dataframe with a column that contains lists:
df = pd.DataFrame({'Item': [['hi', 'hello', 'bye'], ['school', 'pen'], ['hate', 'love', 'feelings', 'sad']]})
print(df)
Item
0 [hi, hello, bye]
1 [school, pen]
2 [hate, love, feelings, sad]
Expected output:
mapped_value
0 [0, 1, 2]
1 [0, 1]
2 [0, 1, 2, 3]
I tried using map(). I also used
df['mapped value'] = [i for i, x in enumerate(df['Item'][0])]
df
which gives me the wrong output. I need the indices for every list, but nothing works. Can someone please guide me?
You can use a nested list comprehension:
df['mapped value'] = [[i for i, _ in enumerate(items)] for items in df['Item']]
Or a lambda function:
df['mapped value'] = df['Item'].apply(lambda items: [i for i, _ in enumerate(items)])
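Since only the positions are needed and the elements themselves are ignored, the same result can probably be written more directly with range(len(...)). A minimal sketch using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Item': [['hi', 'hello', 'bye'],
                            ['school', 'pen'],
                            ['hate', 'love', 'feelings', 'sad']]})

# The positions of a list are simply 0 .. len(list) - 1
df['mapped value'] = df['Item'].apply(lambda items: list(range(len(items))))
print(df)
```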

Pandas Multi-index set value based on three different condition

The objective is to create a new multiindex column based on 3 conditions on column B.
Condition for B:
if B<0
    CONDITION_B='l'
elif B<-1
    CONDITION_B='L'
else
    CONDITION_B='g'
Naively, I thought we could simply create two different masks and replace the values as suggested:
# Handle CONDITION_B='l' and CONDITION_B='g'
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
and then
# CONDITION_B='L'
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
As expected, this throws an error:
TypeError: sequence item 1: expected str instance, bool found
How can I handle the 3 different conditions?
Expected output
ONE TWO
B B
g L
l l
l g
g l
L L
The code to produce the error is
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df= pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
IIUC:
np.select() is ideal in this case:
conditions=[
    df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),
    df.loc[:,idx[:,'B']].lt(-1),
    df.loc[:,idx[:,'B']].ge(0)
]
labels=['l','L','g']
out=pd.DataFrame(np.select(conditions,labels),columns=df.loc[:,idx[:,'B']].columns)
OR
via np.where():
s=np.where(df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),'l',np.where(df.loc[:,idx[:,'B']].lt(-1),'L','g'))
out=pd.DataFrame(s,columns=df.loc[:,idx[:,'B']].columns)
output of out:
One Two
B B
0 g L
1 l l
2 l g
3 g l
4 L L
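Two details of the np.select approach are worth spelling out. np.select uses the first condition that is true, so if the stricter test (B < -1) comes first, the second test can simply be B < 0. Also, a value of exactly -1 matches none of the three conditions listed above and would fall through to np.select's default of 0, which does not share a dtype with the string labels. The sketch below is one way to handle both, assuming that -1 should count as 'l' and passing default='g' explicitly; the variable b is just shorthand introduced here.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
columns = pd.MultiIndex.from_arrays([['One', 'One', 'Two', 'Two'],
                                     ['A', 'B', 'A', 'B']])
df = pd.DataFrame(np.random.randn(5, 4), columns=columns)
idx = pd.IndexSlice

# Select the B sub-columns once instead of repeating the slice
b = df.loc[:, idx[:, 'B']]

# Order matters: the first matching condition wins
conditions = [b.lt(-1).to_numpy(), b.lt(0).to_numpy()]
labels = ['L', 'l']

out = pd.DataFrame(np.select(conditions, labels, default='g'),
                   index=b.index, columns=b.columns)
print(out)
```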
I don't fully understand what you want to do but try something like this:
df = pd.DataFrame({'B': [ 0, -1, -2, -2, -1, 0, 0, -1, -1, -2]})
df['ONE'] = np.where(df['B'] < 0, 'l', 'g')
df['TWO'] = np.where(df['B'] < -1, 'L', df['ONE'])
df = df.set_index(['ONE', 'TWO'])
Output result:
>>> df
B
ONE TWO
g g 0
l l -1
L -2
L -2
l -1
g g 0
g 0
l l -1
l -1
L -2

Pandas - add a row at the end of a for loop iteration

So I have a for loop that gets a series of values and makes some tests:
list = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'])
for value in list:
    if value > 3:
        df['columnX']="A"
    else:
        df['columnX']="B"
        df['columnZ']="Another value only to be filled in this condition"
    df['columnY']=value-1
How can I do this and keep all the values in a single row for each loop iteration, regardless of the outcome of the if? Can I keep some columns empty?
I mean something like the following process:
[create empty row] -> [process] -> [fill column X] -> [process] -> [fill column Y if true] ...
Like:
[index columnX columnY columnZ]
[0 A 0 NULL ]
[1 A 1 NULL ]
[2 B 2 "..." ]
[3 B 3 "..." ]
[4 B 4 "..." ]
I am not sure I understand exactly, but I think this may be a solution:
list = [1, 2, 3, 4, 5, 6]
d = {'columnX':[],'columnY':[]}
for value in list:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    d['columnY'].append(value-1)
df = pd.DataFrame(d)
For the second question, just add another condition (condition and xxx are placeholders for your own test and value):
list = [1, 2, 3, 4, 5, 6]
d = {'columnX':[],'columnY':[], 'columnZ':[]}
for value in list:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    if condition:
        d['columnZ'].append(xxx)
    else:
        d['columnZ'].append(None)
df = pd.DataFrame(d)
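A related pattern, sketched here under the assumption that each loop iteration should become exactly one row: build one dict per iteration, leave out the keys that do not apply, and construct the DataFrame once at the end. Keys missing from a row show up as NaN. The names values, rows and row are illustrative.

```python
import pandas as pd

values = [1, 2, 3, 4, 5, 6]
rows = []
for value in values:
    # One dict per iteration; columnZ is only set in the else-case
    row = {'columnX': 'A' if value > 3 else 'B', 'columnY': value - 1}
    if value <= 3:
        row['columnZ'] = 'Another value only to be filled in this condition'
    rows.append(row)

# Passing columns= fixes the column order; missing keys become NaN
df = pd.DataFrame(rows, columns=['columnX', 'columnY', 'columnZ'])
print(df)
```

Building the frame once from a list is generally cheaper than growing a DataFrame row by row inside the loop.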
According to the example you have given I have changed your code a bit to achieve the result you shared:
list = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'])
for index, value in enumerate(list):
    temp = []
    if value > 3:
        #df['columnX']="A"
        temp.append("A")
        temp.append(None)
    else:
        #df['columnX']="B"
        temp.append("B")
        temp.append("Another value") # or you can add any conditions
    #df['columnY']=value-1
    temp.append(value-1)
    df.loc[index] = temp
print(df)
This produces the result:
columnX columnY columnZ
0 B Another value 0.0
1 B Another value 1.0
2 B Another value 2.0
3 A None 3.0
4 A None 4.0
5 A None 5.0
df.index is printed as : Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
You can initialize your DataFrame with an index sized to the input list, then use np.where to fill the columns:
In [111]: lst = [1, 2, 3, 4, 5, 6]
...: df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'], index=range(len(lst)))
In [112]: int_arr = np.array(lst)
In [113]: df['columnX'] = np.where(int_arr > 3, 'A', 'B')
In [114]: df['columnZ'] = np.where(int_arr > 3, df['columnZ'], '...')
In [115]: df['columnY'] = int_arr - 1
In [116]: df
Out[116]:
columnX columnY columnZ
0 B 0 ...
1 B 1 ...
2 B 2 ...
3 A 3 NaN
4 A 4 NaN
5 A 5 NaN

What is happening here? Looping by line and looping by index

I am attempting to take an array of strings and transform it into an array of separate words (with the same number of columns). But the two loops are giving me very different results, and this means I can't access any of the values in the array, really.
array1 = [
    ["yes is a good thing","no is a bad thing"],
    ["maybe is a good","certainly is a bad"]
]
w2, h2 = 2,15;
array2 = [[0 for x in range(w2)] for y in range(h2)]
for column in range(len(array1[0])):
    for row in range(len(array1)):
        array2[1:][column] += str(array1[row][column]).split()
for line in array2: #LOOP 1
    print(line)
for column in range(len(array2[0])): #LOOP 2
    for row in range(len(array2)):
        print(array2[row][column])
The results:
Loop 1 (This is what I'd like to be represented in the second loop)
[0, 0]
[0, 0, 'yes', 'is', 'a', 'good', 'thing', 'maybe', 'is', 'a', 'good']
[0, 0, 'no', 'is', 'a', 'bad', 'thing', 'certainly', 'is', 'a', 'bad']
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
Loop 2:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Basically I want an array with two columns, and then the relevant separate words going down each column. Expected output:
yes no
is is
a a
good bad
thing thing
maybe certainly
is is
a a
good bad
You can produce your output directly from the columns:
array1 = [
["yes is a good thing", "no is a bad thing"],
["maybe is a good", "certainly is a bad"]
]
words = [[word for line in col for word in line.split()] for col in zip(*array1)]
transposed = list(zip(*words))
zip(*iterable) transposes a matrix, moving columns to rows and vice versa.
Demo:
>>> array1 = [
... ["yes is a good thing", "no is a bad thing"],
... ["maybe is a good", "certainly is a bad"]
... ]
>>> words = [[word for line in col for word in line.split()] for col in zip(*array1)]
>>> transposed = list(zip(*words))
>>> for row in transposed:
...     print('{:8} {:8}'.format(*row))
...
yes      no
is       is
a        a
good     bad
thing    thing
maybe    certainly
is       is
a        a
good     bad
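As for what the original loops were doing: array2[1:] builds a new outer list, but its elements are the same inner lists as in array2. So array2[1:][column] += words extends the inner list array2[1 + column] in place, which is why rows 1 and 2 grew in Loop 1. Loop 2 only walks range(len(array2[0])), and array2[0] still has two elements, so it prints the first two entries of every row, all zeros. A small sketch of the aliasing, with a shorter grid than in the question (tail is an illustrative name):

```python
# 4 rows of [0, 0], built the same way as array2 in the question
array2 = [[0, 0] for _ in range(4)]

# Slicing copies the outer list only; the inner lists are shared
tail = array2[1:]
tail[0] += ['yes', 'is']   # in-place extend of the list that is also array2[1]

print(array2[1])        # the second row has grown
print(len(array2[0]))   # the first row still has 2 entries, so Loop 2 sees 2 columns
```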

Insert list of lists into single column of pandas df

I am trying to place multiple lists into a single column of a Pandas df. My list of lists is very long, so I cannot do so manually.
The desired output would look like this:
list_of_lists = [[1,2,3],[3,4,5],[5,6,7],...]
df = pd.DataFrame(list_of_lists)
>>> df
0
0 [1,2,3]
1 [3,4,5]
2 [5,6,7]
3 ...
Thank you for the assistance.
You can assign it by wrapping it in a Series vector if you're trying to add to an existing df:
In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[7]:
a b c
0 -1.675422 -0.696623 -1.025674
1 0.032192 0.582190 0.214029
2 -0.134230 0.991172 -0.177654
3 -1.688784 1.275275 0.029581
4 -0.528649 0.858710 -0.244512
In [9]:
df['new_col'] = pd.Series([[1,2,3],[3,4,5],[5,6,7]])
df
Out[9]:
a b c new_col
0 -1.675422 -0.696623 -1.025674 [1, 2, 3]
1 0.032192 0.582190 0.214029 [3, 4, 5]
2 -0.134230 0.991172 -0.177654 [5, 6, 7]
3 -1.688784 1.275275 0.029581 NaN
4 -0.528649 0.858710 -0.244512 NaN
What about
df = pd.DataFrame({0: [[1,2,3],[3,4,5],[5,6,7]]})
The above solutions were helpful but wanted to add a little bit in case they didn't quite do the trick for someone...
pd.Series will not accept a np.ndarray that looks like a list-of-lists, e.g. one-hot labels array([[1, 0, 0], [0, 1, 0], ..., [0, 0, 1]]).
So in this case one can wrap the variable with list():
df['new_col'] = pd.Series(list(one_hot_labels))
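A short sketch of the ndarray case, assuming a small one-hot array built with np.eye for illustration. list(arr) gives one 1-D array per row, while arr.tolist() gives plain Python lists. Passing index=df.index keeps the new Series aligned with the existing rows; as the earlier example shows, rows without a matching index label end up as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot labels, one row per record
one_hot_labels = np.eye(3, dtype=int)
df = pd.DataFrame({'id': [10, 20, 30]})

# Each cell holds a 1-D numpy array
df['as_arrays'] = pd.Series(list(one_hot_labels), index=df.index)

# Each cell holds a plain Python list
df['as_lists'] = pd.Series(one_hot_labels.tolist(), index=df.index)
print(df)
```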
