When I run the following code I get a KeyError: ('a', 'occurred at index a'). How can I apply this function, or something similar, over the DataFrame without encountering this issue?
Running Python 3.6, pandas v0.22.0.
import numpy as np
import pandas as pd

def add(a, b):
    return a + b

df = pd.DataFrame(np.random.randn(3, 3),
                  columns=['a', 'b', 'c'])
df.apply(lambda x: add(x['a'], x['c']))
I think you need the parameter axis=1 so apply processes by rows:
axis: {0 or 'index', 1 or 'columns'}, default 0
0 or index: apply function to each column
1 or columns: apply function to each row
df = df.apply(lambda x: add(x['a'], x['c']), axis=1)
print (df)
0 -0.802652
1 0.145142
2 -1.160743
dtype: float64
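If it helps to see the difference, here is a small sketch (with a made-up frame) contrasting the two axis values:

```python
import pandas as pd

# Hypothetical frame just to illustrate the axis parameter
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0 (default): the function receives each *column* as a Series
col_sums = df.apply(lambda col: col.sum())          # a -> 3, b -> 30

# axis=1: the function receives each *row* as a Series,
# so row['a'] and row['b'] are valid lookups
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_sums.tolist())  # [3, 30]
print(row_sums.tolist())  # [11, 22]
```

With the default axis=0, each column Series is indexed by the row labels, so `x['a']` raises the KeyError you saw.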
You don't even need apply; you can add the columns directly. The output will be a Series either way:
df = df['a'] + df['c']
for example:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df = df['a'] + df['c']
print(df)
# 0 6
# 1 8
# dtype: int64
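One hedge worth knowing: plain + propagates NaN. If your columns can contain missing values, Series.add accepts a fill_value (sketch below with an invented frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'c': [5.0, 6.0]})

# Plain + propagates NaN; Series.add can substitute a fill value instead
plain = df['a'] + df['c']                    # second row becomes NaN
filled = df['a'].add(df['c'], fill_value=0)  # treats the NaN as 0 -> 6.0

print(plain.tolist())   # [6.0, nan]
print(filled.tolist())  # [6.0, 6.0]
```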
You can try this:
import numpy as np
import pandas as pd

def add(df):
    return df.a + df.b

df = pd.DataFrame(np.random.randn(3, 3),
                  columns=['a', 'b', 'c'])
df.apply(add, axis=1)
where of course you can substitute any function that takes the columns of df as inputs.
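For instance, a hypothetical substitute (the function name and the weighting are invented for illustration):

```python
import pandas as pd

# Any row-wise function of the columns works in place of add
def weighted(row):
    return 2 * row.a - row.c

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [0.0, 0.0], 'c': [3.0, 4.0]})
result = df.apply(weighted, axis=1)
print(result.tolist())  # [-1.0, 0.0]
```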
Related
When I use the DataFrame.assign() method in my own function foobar, it has no effect on the "global" DataFrame.
#!/usr/bin/env python3
import pandas as pd

def foobar(df):
    # has no effect on the "global" df
    df.assign(Z=lambda x: x.A + x.B)
    return df

data = {'A': range(3),
        'B': range(3)}
df = pd.DataFrame(data)
df = foobar(df)
# There is no 'Z' column in this df
print(df)
The resulting output:
A B
0 0 0
1 1 1
2 2 2
I assume this has something to do with the difference between views and copies in Pandas. But I am not sure how to handle this the right and elegant Pandas way.
Pandas assign returns a new DataFrame, so you need to assign the result back to df. Try this:
def foobar(df):
    df = df.assign(Z=lambda x: x.A + x.B)
    return df
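For completeness, a sketch of both idioms: rebinding the assign result versus assigning a column in place, which does mutate the frame the caller passed in (the column names Z and Z2 are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': range(3), 'B': range(3)})

# 1) keep the functional style: rebind the result of assign
df = df.assign(Z=lambda x: x.A + x.B)

# 2) or mutate the frame directly inside the function
def foobar(frame):
    frame['Z2'] = frame['A'] + frame['B']  # modifies the caller's object
    return frame

foobar(df)
print(df['Z'].tolist())   # [0, 2, 4]
print(df['Z2'].tolist())  # [0, 2, 4]
```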
I want to concatenate items that are stored in list format in a dataframe.
I have the data frame below; when I print DataFrame.head(), it shows:
A B
1 [1,2,3,4]
2 [5,6,7,8]
Expected result (convert from a list to a string separated by commas):
A B
1 1,2,3,4
2 5,6,7,8
You could do:
import pandas as pd

data = [[1, [1, 2, 3, 4]],
        [2, [5, 6, 7, 8]]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['B'] = [','.join(map(str, lst)) for lst in df.B]
print(df.head(2))
Output
A B
0 1 1,2,3,4
1 2 5,6,7,8
You can use the map or apply methods for this:
import pandas as pd

data = [[1, [1, 2, 3, 4]],
        [2, [5, 6, 7, 8]]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['B'] = df['B'].map(lambda x: ",".join(map(str, x)))
# or
# df['B'] = df['B'].apply(lambda x: ",".join(map(str, x)))
print(df.head(2))
df = pd.DataFrame([['1',[1,2,3,4]],['2',[5,6,7,8]]], columns=list('AB'))
This is a generic way to convert lists to strings. In your example the lists hold ints, but they could hold any type that can be represented as a string; join the elements with ','.join(map(str, a_list)). Then iterate through the rows of the specific column that contains the lists you want to join:
for i, row in df.iterrows():
    df.loc[i, 'B'] = ','.join(map(str, row['B']))
I have a dataframe with a lot of columns using the suffix '_o'. Is there a way to drop all columns whose labels end with '_o'?
In this post I've seen a way to drop columns that start with something using the filter function. But how do I drop the ones that end with something?
Pandonic
df = df.loc[:, ~df.columns.str.endswith('_o')]
df = df[df.columns[~df.columns.str.endswith('_o')]]
List comprehensions
df = df[[x for x in df if not x.endswith('_o')]]
df = df.drop([x for x in df if x.endswith('_o')], axis=1)
To use df.filter() properly here, you could give it a regex with a negative lookbehind:
>>> df = pd.DataFrame({'a': [1, 2], 'a_o': [2, 3], 'o_b': [4, 5]})
>>> df.filter(regex=r'.*(?<!_o)$')
a o_b
0 1 4
1 2 5
This can be done by re-assigning the dataframe with only the needed columns
df = df.iloc[:, [not o.endswith('_o') for o in df.columns]]
I am facing an issue where passing a numpy array to a DataFrame without column names initializes it properly, whereas if I pass column names, the DataFrame is empty.
x = np.array([(1, '1'), (2, '2')], dtype = 'i4,S1')
df = pd.DataFrame(x)
In []: df
Out[]:
f0 f1
0 1 1
1 2 2
df2 = pd.DataFrame(x, columns=['a', 'b'])
In []: df2
Out[]:
Empty DataFrame
Columns: [a, b]
Index: []
I think you need to specify the column names in the dtype parameter; see DataFrame from structured or record array:
x = np.array([(1, '1'), (2, '2')], dtype=[('a', 'i4'),('b', 'S1')])
df2 = pd.DataFrame(x)
print (df2)
a b
0 1 b'1'
1 2 b'2'
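Note the b'1' in the output above: under Python 3 the 'S1' field comes back as bytes. A hedged follow-up, either decoding the column afterwards or declaring a unicode field ('U1') from the start:

```python
import numpy as np
import pandas as pd

x = np.array([(1, '1'), (2, '2')], dtype=[('a', 'i4'), ('b', 'S1')])
df2 = pd.DataFrame(x)

# 'S1' is a bytes dtype under Python 3, so decode to get plain strings
df2['b'] = df2['b'].str.decode('utf-8')
print(df2['b'].tolist())  # ['1', '2']

# Alternatively declare a unicode field ('U1') and skip the decode
y = np.array([(1, '1'), (2, '2')], dtype=[('a', 'i4'), ('b', 'U1')])
print(pd.DataFrame(y)['b'].tolist())  # ['1', '2']
```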
Another solution without parameter dtype:
x = np.array([(1, '1'), (2, '2')])
df2 = pd.DataFrame(x, columns=['a', 'b'])
print (df2)
a b
0 1 1
1 2 2
It's the dtype param; without specifying it, this works as expected.
See the example at documentation DataFrame
import numpy as np
import pandas as pd

x = np.array([(1, "11"), (2, "22")])
df = pd.DataFrame(x)
print(df)

df2 = pd.DataFrame(x, columns=['a', 'b'])
print(df2)
I have a problem with replacing text in a df. I tried to use the df.replace() function but in my case it failed. Here is my example:
df = pd.DataFrame({'col_a':['A', 'B', 'C'], 'col_b':['_world1_', '-world1_', '*world1_']})
df = df.replace(to_replace='world1', value='world2')
Unfortunately this code doesn't change anything; I still have world1 in my df.
Does anyone have any suggestions?
Use vectorised str.replace to replace string matches in your text:
In [245]:
df = pd.DataFrame({'col_a':['A', 'B', 'C'], 'col_b':['_world1_', '-world1_', '*world1_']})
df['col_b'] = df['col_b'].str.replace('world1', 'world2')
df
Out[245]:
col_a col_b
0 A _world2_
1 B -world2_
2 C *world2_
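One caveat: in old pandas versions str.replace treats the pattern as a regex by default, while recent versions default to regex=False. Passing the flag explicitly keeps the behaviour stable across versions (sketch):

```python
import pandas as pd

df = pd.DataFrame({'col_b': ['_world1_', '-world1_', '*world1_']})

# Explicit regex=False: the pattern is a plain substring, so a literal
# '*' in the data needs no escaping and the flag is unambiguous
df['col_b'] = df['col_b'].str.replace('world1', 'world2', regex=False)
print(df['col_b'].tolist())  # ['_world2_', '-world2_', '*world2_']
```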
The value you want to replace does not exist: df.replace matches entire cell values by default, and no cell equals 'world1' exactly. Replacing a full cell value works:

import pandas as pd

df = pd.DataFrame({'col_a': ['A', 'B', 'C'], 'col_b': ['_world1_', '-world1_', '*world1_']})
print(df)
df = df.replace(to_replace='*world1_', value='world2')
print(df)
Here you go:
df.col_b = df.apply(lambda x: x.col_b.replace('world1', 'world2'), axis=1)
In [13]: df
Out[13]:
col_a col_b
0 A _world2_
1 B -world2_
2 C *world2_
There could be many more options; however, the replace function you are referring to can also be used with a regex:
In [21]: df.replace('(world1)','world2',regex=True)
Out[21]:
col_a col_b
0 A _world2_
1 B -world2_
2 C *world2_
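The regex form also accepts a dict, which is handy for several substitutions in one call (the second pattern here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col_a': ['A', 'B'], 'col_b': ['_world1_', '-world1_']})

# Map several regex patterns to replacements in one call
df = df.replace({'world1': 'world2', r'^-': '+'}, regex=True)
print(df['col_b'].tolist())  # ['_world2_', '+world2_']
```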