Given a DataFrame like this:
>>> df
   0  1  2
0  2  3  5
1  3  4  7
and a function that returns multiple results, like this:
def sumprod(x, y, z):
    return x + y + z, x * y * z
I want to add new columns, so the result would be:
>>> df
   0  1  2  sum  prod
0  2  3  5   10    30
1  3  4  7   14    84
I have been successful with functions that return one result:
df["sum"] = df.apply(sum, axis=1)
but not if it returns more than one result.
One way to do this is to pass the columns of the DataFrame to the function by unpacking the transpose of the array:
>>> df['sum'], df['prod'] = sumprod(*df.values.T)
>>> df
   0  1  2  sum  prod
0  2  3  5   10    30
1  3  4  7   14    84
sumprod returns a tuple of columns and, since Python supports multiple assignment, you can assign them to new column labels as above.
You could write df['sum'], df['prod'] = sumprod(df[0], df[1], df[2]) to get the same result. This is clearer and is preferable if you need to pass the columns to the function in a particular order. On the other hand, it's a lot more verbose if you have a lot of columns to pass to the function.
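Alternatively, if you want to stay with apply, newer pandas versions (0.23+) can expand a tuple-returning row function directly into columns. A minimal sketch, reusing the sumprod function above:
# result_type='expand' turns each returned tuple into columns,
# which are then assigned to the two new labels at once.
df[['sum', 'prod']] = df.apply(lambda row: sumprod(*row), axis=1, result_type='expand')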
Related
I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):
Dataframe A:
   I  II
0  1   2
1  3   1
Dataframe B:
   I  II
0  5   6
1  3   1
New Dataframe:
   I  II
0  1   2
1  3   1
2  5   6
How can I do this?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
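As a side note, pandas 1.0 added an ignore_index parameter to drop_duplicates itself, so the index fix can be folded in; a minimal sketch:
# ignore_index=True renumbers the result 0..n-1, replacing the
# separate reset_index(drop=True) step.
pandas.concat([df1, df2]).drop_duplicates(ignore_index=True)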
If DataFrame A already contains duplicate rows that you want to keep, concatenating and then dropping duplicates will remove those rows as well.
In that case, you need to create a helper column with a cumulative count before dropping duplicates. It all depends on your use case, but this situation is common in time-series data.
Here is an example:
df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])
df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

# The cumulative count distinguishes genuine repeats within each frame,
# so drop_duplicates only removes rows duplicated across the two frames.
df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
         date  id  value
0  11/20/2015   4     24
1  11/20/2015   4     24
2  11/20/2015   6     34
1  11/20/2015   6     14
I'm surprised that pandas doesn't offer a native solution for this task.
I don't think that it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices, then use a list comprehension to translate from index to row location (a boolean mask), which you need in order to access rows with iloc. Below is a function that performs the task. If you don't choose a specific column (col) to check for duplicates, the indexes are used, as you requested. If you do choose a specific column, be aware that existing duplicate entries in a will remain in the result.
import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and not isinstance(a, pd.DataFrame))
            or (b is not None and not isinstance(b, pd.DataFrame))):
        raise ValueError('a and b must be of type pandas.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    # Keep only the rows of b whose key does not already appear in a.
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
    return pd.concat([a, b.iloc[take_rows, :]])
# Usage
a = pd.DataFrame([[1, 2, 3], [1, 5, 6], [1, 12, 13]], index=[1000, 2000, 5000])
b = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[1000, 2000, 3000])

append_non_duplicates(a, b)
#        0   1   2
# 1000   1   2   3    <- from a
# 2000   1   5   6    <- from a
# 5000   1  12  13    <- from a
# 3000   7   8   9    <- from b

append_non_duplicates(a, b, 0)
#        0   1   2
# 1000   1   2   3    <- from a
# 2000   1   5   6    <- from a
# 5000   1  12  13    <- from a
# 2000   4   5   6    <- from b
# 3000   7   8   9    <- from b
Another option:
concatenation = pd.concat([
    dfA,
    dfB[~dfB['I'].isin(dfA['I'])],  # <-- rows of dfB whose value in column 'I' does not appear in dfA
], ignore_index=True)  # renumber the index so the result reads 0, 1, 2
The resulting concatenation will be:
   I  II
0  1   2
1  3   1
2  5   6
How can I create a function that squares a specific column value from a dataframe in pandas?
It should look like this:
def func(dataframe, column, value)
Suppose you have a DataFrame named df.
Just create a function:
def func(data, column, val):
    return data[column] ** val
Now just call the function, passing the DataFrame, the column name, and the value as parameters:
func(df, 'col3', 2)
By the way, you can do this without creating a function at all:
df['column name'] ** 2
I suppose that you wanted to square only those values in the given column which are equal to the value parameter:
def func(dataframe, column, value):
    s = dataframe[column]
    dataframe[column] = s.mask(s == value, lambda x: x**2)
Note:
This function changes the dataframe in place, so in accordance with Python conventions it returns None. (Why? Because there is no return statement.)
The explanation:
In the first command (in the body of the function definition) we assign the appropriate column (i.e. a series) to the variable s.
In the second command we apply the method .mask() with 2 arguments:
the first argument is the condition that selects which elements to replace;
the second argument is a function, applied only to the elements satisfying that condition.
A test:
>>> df
   A  B  C  D
0  4  4  3  4
1  3  3  4  4
2  4  4  2  2
3  3  2  3  4
4  4  2  4  3
>>> func(df, "D", 4)
>>> df
   A  B  C   D
0  4  4  3  16
1  3  3  4  16
2  4  4  2   2
3  3  2  3  16
4  4  2  4   3
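For comparison, the same in-place behaviour can be written with boolean indexing and .loc; a minimal sketch, not the original answer's method:
# Square only the cells of the chosen column that equal the given value.
def func(dataframe, column, value):
    dataframe.loc[dataframe[column] == value, column] **= 2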
I would like to know whether I can get some help in "translating" a multi-dimensional list into a single column of a frame in pandas.
I found help here for translating a multi-dimensional list into a frame with multiple columns, but I need to translate the data into a single one.
Suppose I have the following list of lists:
x = [[1, 2, 3], [4, 5, 6]]
If I create a frame from it, I get:
frame = pd.DataFrame(x)
   0  1  2
0  1  2  3
1  4  5  6
But my desired outcome is:
0
1
2
3
4
5
6
with the zero as column header.
I can of course get the result with a for loop, but that seems slow to me. Is there a pythonic/pandas way to get it?
Thanks for the help.
You can use np.concatenate:
import numpy as np
import pandas as pd

x = [[1, 2, 3], [4, 5, 6]]
frame = pd.DataFrame(np.concatenate(x))
print(frame)
Output:
   0
0  1
1  2
2  3
3  4
4  5
5  6
First it is necessary to flatten the values of the lists and pass them to the DataFrame constructor:
df = pd.DataFrame([z for y in x for z in y])
Or:
from itertools import chain
df = pd.DataFrame(list(chain.from_iterable(x)))
print(df)
   0
0  1
1  2
2  3
3  4
4  5
5  6
If you use numpy you can utilize the method ravel():
pd.DataFrame(np.array(x).ravel())
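A pure-pandas alternative (a sketch, assuming pandas 0.25+ for explode) flattens the list of lists without numpy:
# explode() gives one row per inner-list element; the result is
# object dtype, so cast back to int before building the frame.
frame = pd.Series(x).explode().reset_index(drop=True).astype(int).to_frame()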
So I have an extremely simple dataframe:
values
1
1
1
2
2
I want to add a new column and, for each row, assign the number of times that row's value occurs, so the table would look like:
values unique_sum
1 3
1 3
1 3
2 2
2 2
I have seen some examples in R, but for Python and pandas I have not come across anything and am stuck. I can list the value counts using .value_counts(), and I have tried groupby routines but cannot fathom it.
Just use map to map your column onto its value_counts:
>>> x
   A
0  1
1  1
2  1
3  2
4  2
>>> x['unique'] = x.A.map(x.A.value_counts())
>>> x
   A  unique
0  1       3
1  1       3
2  1       3
3  2       2
4  2       2
(I named the column A instead of values. values is not a great choice for a column name, because DataFrames have a special attribute called values, which prevents you from getting the column with x.values; you'd have to use x['values'] instead.)
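An equivalent groupby spelling (a sketch, not the answer's method) broadcasts each group's size back to its rows with transform:
# 'size' counts the rows in each group of equal A values and
# aligns the counts with the original index.
x['unique'] = x.groupby('A')['A'].transform('size')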
I need to create a new column at the end of a data frame, where the values in that new column are the result of applying some function whose parameters come from other columns: specifically, from another column but a different row. For example, if my data frame had two columns containing values x_i and y_i respectively, my third column would be f(x_(i-1), y_(i-1)).
I know that to create a new column, the easiest way would be to do something like
df['new_row'] = ...
But I'm not sure what to set it to.
How do I do this?
Something like this? Or is your function more complicated?
print(df)
   0  1  2  3
0  1  2  3  4
df[4] = df[2] * df[3] / .3
print(df)
   0  1  2  3     4
0  1  2  3  4  40.0
Here's an example:
df['new_col'] = df['old_col'] * df['old_col']
Or if you wrote a custom function that took in two arrays, such as:
def f(arr1, arr2):
    new_arr = ...  # put your logic here
    return new_arr
You could try:
df['new_col'] = f(df['old_col'], df['old_col2'])
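Note that the question asks for values from the previous row, which neither snippet above addresses. A minimal sketch, assuming columns named x and y and a vectorized f, uses shift to align row i with row i-1:
# shift(1) moves each column down one row, so row i receives the
# values from row i-1; the first row becomes NaN.
df['new_col'] = f(df['x'].shift(1), df['y'].shift(1))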
Given a file containing pairs of numbers:
1 3
3 3
43 4
2 3
you can read each line and compute n raised to the power r:
with open("file", 'r') as f:
    for line in f:
        n, r = line.split()
        formula = pow(int(n), int(r))
        print("{:4}{:4}{:9}".format(n, r, formula))
Output:
1   3           1
3   3          27
43  4     3418801
2   3           8
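For completeness, a pandas version of the same computation; a sketch assuming the same whitespace-separated file with no header row:
import pandas as pd

# Read the two columns, then compute n**r vectorized.
df = pd.read_csv("file", sep=r"\s+", header=None, names=["n", "r"])
df["result"] = df["n"] ** df["r"]
print(df)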