I want to use Pandas apply to create a new column, and I want this functionality to be fail-safe even if the DataFrame is empty. Here is a minimal example that works as expected:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])  # two columns
add = lambda x: x['a'] + x['b']  # add columns a and b
df['c'] = df.apply(add, axis=1)  # creates new column c, as anticipated
However, it gets problematic when df happens to be empty. Consider the following example, where the DataFrame is now empty but otherwise identical:
df = pd.DataFrame(columns=['a', 'b'])  # two columns, but no values
df['c'] = df.apply(add, axis=1)  # raises an error!
How can I execute this last statement safely, such that it just appends a column 'c' to the DataFrame, even if df is empty?
Interestingly enough, this works on its own:
df.apply(add, axis=1)
but its result cannot be assigned as column 'c'.
If you want to create a new column c based on the sum of columns a and b, then you can just do the following:
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated :)
That way, you don't need to bind a lambda expression to the name add (PEP 8 recommends against assigning a lambda expression to a name; use def instead).
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b']) # two columns
print(df)
a b
0 1 2
1 3 4
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated
print(df)
a b c
0 1 2 3
1 3 4 7
df = pd.DataFrame(columns=['a', 'b']) # two columns, but no values
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated
print(df)
Empty DataFrame
Columns: [a, b, c]
Index: []
Even if the DataFrame is empty, the method above still works.
If one axis (rows or columns) is empty, the apply function returns an empty result. Your lambda function returns a pandas.Series, so to handle an empty pandas.DataFrame you need to be explicit about the result type of the apply call and use the reduce mode.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
This will work:
df = pd.DataFrame(columns=['a','b'])
df['c'] = df.apply(add, axis=1, result_type='reduce')
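For completeness, here is a minimal sketch (reusing the add lambda from the question) showing that with result_type='reduce' the same assignment succeeds on an empty frame and still behaves as before on a populated one:

import numpy as np
import pandas as pd

add = lambda x: x['a'] + x['b']

# empty frame: apply returns an empty Series, so the assignment succeeds
df_empty = pd.DataFrame(columns=['a', 'b'])
df_empty['c'] = df_empty.apply(add, axis=1, result_type='reduce')
print(df_empty.columns.tolist())  # ['a', 'b', 'c']

# populated frame: unchanged behaviour
df_full = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])
df_full['c'] = df_full.apply(add, axis=1, result_type='reduce')
print(df_full['c'].tolist())  # [3, 7]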
Related
The goal of the following code is to go through each row in df_label, extract the app1 and app2 names, filter df_all using those two names, concatenate the result, and return it as a dataframe. Here is the code:
def create_dataset(se):
    # extracting the names of applications
    app1 = se.app1
    app2 = se.app2
    # extracting each application from df_all
    df1 = df_all[df_all.workload == app1]
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == app2]
    df2.columns = df2.columns + '_1'
    # combining workloads to create the pairs dataframe
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    return df3
df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all the dataframes returned from apply. However, while display(df3) shows the correct dataframe, what comes back through apply is no longer a dataframe but a series: a series whose elements each seem to be a whole dataframe. Any ideas what I am doing wrong?
When you select a single column, you'll get a Series instead of a DataFrame, so df1 and df2 will both be Series.
However, concatenating them on axis=1 should produce a DataFrame (whereas combining them on axis=0 would produce a series). For example:
df = pd.DataFrame({'a':[1,2],'b':[3,4]})
df1 = df['a']
df2 = df['b']
>>> pd.concat([df1,df2],axis=1)
a b
0 1 3
1 2 4
>>> pd.concat([df1,df2],axis=0)
0 1
1 2
0 3
1 4
dtype: int64
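Back in the question's setting, a sketch of how you could actually collect everything into df_pairs (assuming df_label with app1/app2 columns and df_all with a workload column, as in the question) is to concatenate the per-row results yourself instead of assigning the apply result directly:

import pandas as pd

# apply returns a Series whose elements are the per-row DataFrames,
# so gather them into a list and let concat stitch them together
results = df_label.apply(create_dataset, axis=1)
df_pairs = pd.concat(results.tolist(), ignore_index=True)

# equivalently, skip apply and iterate over the rows explicitly
df_pairs = pd.concat(
    [create_dataset(row) for _, row in df_label.iterrows()],
    ignore_index=True,
)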
On a single dataframe, I can drop columns using the conventional df = df.drop('column name', axis=1). But when I try to loop over multiple dataframes and apply drop() to each one, the changes do not persist. I know there is an inplace=True argument that I can use, but I am confused about what is fundamentally going on inside the for loop.
Example:
df_1 = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
df_1
A B
0 1 4
1 2 5
2 3 6
df_2 = pd.DataFrame({'A':[10,20,30], 'C':[40,50,60]})
df_2
A C
0 10 40
1 20 50
2 30 60
# this is the behavior I am looking for.
df_1 = df_1.drop('A', axis=1)
df_1
B
0 4
1 5
2 6
# when I put 2 dataframes in a for loop, I do not get the same output.
df_1 = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
df_2 = pd.DataFrame({'A':[10,20,30], 'C':[40,50,60]})
full_data = [df_1, df_2]
# I expect this code to apply the "drop" for each of these dataframes in the same way
# as above without the need for the "inplace" argument.
for dataset in full_data:
    dataset = dataset.drop('A', axis=1)
# the column 'A' should have been dropped for each dataframe while inside the loop
# but it wasn't. Why?
df_1
A B
0 1 4
1 2 5
2 3 6
The issue doesn't have to do with loop scoping specifically; it is a basic Python assignment-rules issue. See the following:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
In [3]: another = df
In [4]: another is df
Out[4]: True
In [5]: another = another.drop('A', axis=1)
In [6]: another is df
Out[6]: False
In this example, you can see that assigning the result of the drop operation to another binds a new object to the identifier another. It does not modify the df object in place. Using the inplace=True keyword, on the other hand, does:
In [7]: another = df
In [8]: another.drop('A', axis=1, inplace=True)
In [9]: another is df
Out[9]: True
Essentially, there is no way to do what you are trying to do, which is to loop over a list of objects and modify their contents by re-assigning to the loop variable. The reason the inplace=True argument works is that it invokes a method on the dataframe itself, giving pandas control over how the result is stored.
Check out this article on variables and object references for more info.
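If you want to avoid inplace=True, a sketch of a pattern that does work is to rebind through the list index rather than through the loop variable, then rebind the outer names afterwards:

import pandas as pd

df_1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_2 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]})
full_data = [df_1, df_2]

# rebind each list slot to the new object returned by drop
for i in range(len(full_data)):
    full_data[i] = full_data[i].drop('A', axis=1)

df_1, df_2 = full_data  # rebind the outer names too
print(df_1.columns.tolist())  # ['B']
print(df_2.columns.tolist())  # ['C']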
By doing dataset = dataset.drop('A', axis=1) in your loop, you are just rebinding the loop variable, not modifying the underlying dataframe. If you add print(dataset) inside the loop, you will see column A dropped. Try dataset.drop('A', axis=1, inplace=True) in your loop instead:
for dataset in [df_1, df_2]:
    dataset.drop('A', axis=1, inplace=True)
Reassigning the loop variable only rebinds that name inside the loop; it never touches the original dataframes. One way around this is to collect the results in a list defined outside the loop and rebind your variables afterwards:
df_1 = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
df_2 = pd.DataFrame({'A':[10,20,30], 'C':[40,50,60]})
l = []
for dataset in [df_1, df_2]:
    l.append(dataset.drop('A', axis=1))
df_1 = l[0]
df_2 = l[1]
When selecting a single column from a pandas DataFrame (say df.iloc[:, 0], df['A'], or df.A, etc.), the resulting vector is automatically converted to a Series rather than a single-column DataFrame. However, I am writing some functions that take a DataFrame as an input argument, so I prefer to deal with a single-column DataFrame instead of a Series, so the functions can assume, say, that df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame using something like pd.DataFrame(df.iloc[:, 0]). This doesn't seem like the cleanest method. Is there a more elegant way to index a DataFrame directly so that the result is a single-column DataFrame instead of a Series?
As #Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
A B
0 1 2
1 3 4
In [12]: df[['A']]
In [13]: df[[0]]  # worked positionally in older pandas; raises KeyError with string column labels in recent versions
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
A
0 1
1 3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
A 0
0 1 2
1 3 4
In [18]: df[[0]]  # ambiguous: with an integer column present, this selects by label
Out[18]:
   0
0  2
1  4
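To make the contrast concrete, a small sketch using the same frame: the list-based [] lookup resolves 0 as a column label, while iloc resolves it as a position:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])

print(df[[0]])          # label-based: the column literally named 0
#    0
# 0  2
# 1  4

print(df.iloc[:, [0]])  # position-based: the first column, 'A'
#    A
# 0  1
# 1  3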
As Andy Hayden recommends, using .iloc/.loc to index out a (single-columned) dataframe is the way to go; another point to note is how to express the index positions.
Pass the index labels/positions as a list to get a DataFrame back; passing a bare label or position returns a pandas.core.series.Series instead.
Input:
A_1 = train_data.loc[:,'Fraudster']
print('A_1 is of type', type(A_1))
A_2 = train_data.loc[:, ['Fraudster']]
print('A_2 is of type', type(A_2))
A_3 = train_data.iloc[:,12]
print('A_3 is of type', type(A_3))
A_4 = train_data.iloc[:,[12]]
print('A_4 is of type', type(A_4))
Output:
A_1 is of type <class 'pandas.core.series.Series'>
A_2 is of type <class 'pandas.core.frame.DataFrame'>
A_3 is of type <class 'pandas.core.series.Series'>
A_4 is of type <class 'pandas.core.frame.DataFrame'>
These three approaches have been mentioned:
pd.DataFrame(df.loc[:, 'A']) # Approach of the original post
df.loc[:, ['A']]             # Approach 2 (note: use iloc for positional indexing)
df[['A']] # Approach 3
pd.Series.to_frame() is another approach.
Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.
# Basic use case:
df['A'].to_frame()
# Use case 2 (this will give you pretty output in a Jupyter Notebook):
df['A'].describe().to_frame()
# Use case 3:
df['A'].str.strip().to_frame()
# Use case 4:
def some_function(num):
    ...
df['A'].apply(some_function).to_frame()
You can use df.iloc[:, 0:1]; in this case the resulting vector will be a DataFrame and not a Series.
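A minimal sketch of that behaviour (slicing with 0:1 keeps the result two-dimensional, while a scalar position collapses it to a Series):

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

print(type(df.iloc[:, 0]))    # <class 'pandas.core.series.Series'>
print(type(df.iloc[:, 0:1]))  # <class 'pandas.core.frame.DataFrame'>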
(Talking about pandas 1.3.4)
I'd like to add a little more context to the answers involving .to_frame(). If you select a single row of a data frame and call .to_frame() on it, the index will be made up of the original column names and you'll get numeric column names. You can tack a .T onto the end to transpose that back into the original data frame's format (see below).
import pandas as pd
print(pd.__version__) #1.3.4
df = pd.DataFrame({
"col1": ["a", "b", "c"],
"col2": [1, 2, 3]
})
# series
df.loc[0, ["col1", "col2"]]
# dataframe (column names are along the index; not what I wanted)
df.loc[0, ["col1", "col2"]].to_frame()
# 0
# col1 a
# col2 1
# looks like an actual single-row dataframe.
# To me, this is the true answer to the question
# because the output matches the format of the
# original dataframe.
df.loc[0, ["col1", "col2"]].to_frame().T
# col1 col2
# 0 a 1
# this works really well with .to_dict(orient="records") which is
# what I'm ultimately after by selecting a single row
df.loc[0, ["col1", "col2"]].to_frame().T.to_dict(orient="records")
# [{'col1': 'a', 'col2': 1}]
I have a pandas dataframe with two columns of data. Now I want to add a label spanning both columns, like in the picture below:
Because the two columns don't have the same values, I can't use groupby. I just want to add the label AAA on top of both columns. So, how do I do it? Thank you
Reassign the columns attribute with a newly constructed pd.MultiIndex:
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
Consider the dataframe df
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])
print(df)
value time
hostname 1 1
tmserver 1 1
Then
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
print(df)
AAA
value time
hostname 1 1
tmserver 1 1
If you need to create a MultiIndex in the columns, the simplest way is:
df.columns = [['AAA'] * len(df.columns), df.columns]
It is similar to MultiIndex.from_arrays; it is also possible to add the names parameter:
n = ['a','b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)
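A quick check of what that produces, reusing the example frame from above (a sketch; the level names a and b are arbitrary):

import pandas as pd

df = pd.DataFrame(1, index=['hostname', 'tmserver'], columns=['value', 'time'])

n = ['a', 'b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)
print(df.columns)
# MultiIndex([('AAA', 'value'),
#             ('AAA',  'time')],
#            names=['a', 'b'])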