How to filter DataFrame using isna? - python

This seems super basic and yet I am failing to filter this dataframe. As you can see from the screenshot I load a very basic set of data. I check if any values in column 'Col3' is na. And finally I try to filter the dataframe using that. I am hoping to get returned just the second column (with index 1). But as you can see I get all 5 rows but the values for Col3 are now all NaN.
I am using Python 3.7.3 and Pandas 1.1.4
Trying wwnde's suggestion to use brackets instead of .loc did not seem to work:

Try
data( now that you didnt give me sample data)
df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [np.nan, 6, np.nan, 8]})
df[df['x2'].isna()]
group1 group2 base x1 x2
0 0 2 0 3 NaN
2 1 3 2 5 NaN
Use loc accessor if you need to call particular columns
df.loc[df['x2'].isna(),:'base']#base and preceding columns
group1 group2 base
0 0 2 0
2 1 3 2
or
df.loc[df['x2'].isna(),['base','x1']]
base x1
0 0 3
2 2 5

Related

Find value smaller but closest to current value

I have a very large pandas dataframe that contains two columns, column A and column B. For each value in column A, I would like to find the largest value in column B that is less than the corresponding value in column A. Note that each value in column B can be mapped to many values in column A.
Here's an example with a smaller dataset. Let's say I have the following dataframe:
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
I would like to find some third column -- say c -- so that
c = [nil, 2, 5, 1, 2, 2].
Note that each entry in c is strictly less than the corresponding value in c.
Upon researching, I think that I want something similar to pandas.merge_asof, but I can't quite get the query correct. In particular, I'm struggling because I only have one dataframe and not two. Perhaps I can form a second dataframe from my current one to get what I need, but I can't quite get it right. Any help is appreciated.
Yes, it is doable using pandas.merge_asof. Explanation as comments in the code -
import pandas as pd
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
# merge_asof requires the keys to be sorted
adf = df[['a']].sort_values(by='a')
bdf = df[['b']].sort_values(by='b')
# your example wants 'strictly less' so we also add 'allow_exact_matches=False'
cdf_ordered = pd.merge_asof(adf, bdf, left_on='a', right_on='b', allow_exact_matches=False, direction='backward')
# rename the dataframe |a|b| -> |a|c|
cdf_ordered = cdf_ordered.rename(columns={'b': 'c'})
# since c is based on sorted a, we merge with original dataframe column a
new_df = pd.merge(df, cdf_ordered, on='a')
print(new_df)
"""
a b c
0 1 5 NaN
1 5 2 2.0
2 7 7 5.0
3 2 5 1.0
4 3 1 2.0
5 4 9 2.0
"""

Pandas group by function to do different methods if index in list

I'm wondering if its possible to create your own groupby function that runs a different method for the output in a single column depending on if the index is in some list or not. For example:
df = pd.DataFrame({'ID' : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
'Data' : [5, 7, 6, 13, 14, 11, 10, 2, 4, 3]})
some_list = [2, 3]
I want to group by ID column, and return an average of the Data column (df.groupby('ID').mean() for most values) However, if ID is in some_list then I would like the average to be calculated as the sum of Data divided by 4 (df.groupby('ID').sum()/4). The output for the above would look as below:
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
I know I could do both methods separately and join into one column after doing the groupby, but I was wondering if its possible to do this in one step? Maybe with df.groupby('ID').apply(function)?
I've looked at this question, but it didn't help me.
Try groupby with apply and a condition:
df.groupby('ID', as_index=False)['Data'].apply(lambda x: x.sum() / 4 if x.name in some_list else x.mean())
Output:
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
If performance is important dont use groupby.apply, you can filter rows for aggregate sum with division and for aggregate mean:
s = df[df['ID'].isin(some_list)].groupby('ID')['Data'].sum().div(4)
df = s.combine_first(df.groupby('ID')['Data'].mean()).reset_index()
print (df)
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
Here is alternative solution:
df = df.groupby('ID')['Data'].agg(['sum','mean']).reset_index()
df['Value'] = np.where(df['ID'].isin(some_list), df.pop('sum').div(4), df.pop('mean'))
print (df)
ID Value
0 1 6.00
1 2 12.00
2 3 2.25

Pandas - Recode variable based on value_counts() results

This is my dataframe:
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4],
'col2': [1, 3, 2, 4, 6, 5, 7]})
I try to recode values based on how often they appear in my dataset, here I want to relabel every value which occurs only once to "other". This is the desired output:
#desired
"col1": [1,1,1,2,2,"other", "other"]
I tried this but it did not work:
df["recoded"] = np.where(df["col1"].value_counts() > 1, df["col1"], "other")
My idea is to save the value counts and filter them and then loop over the result array, but this seems overly complicated. Is there an easy "pythonic/pandas" way to archieve this?
You are close - need Series.map for same length of Series like original DataFrame:
df["recoded"] = np.where(df["col1"].map(df["col1"].value_counts()) > 1, df["col1"], "other")
Or use GroupBy.transform with count values by GroupBy.size:
df["recoded"] = np.where(df.groupby('col1')["col1"].transform('size') > 1,
df["col1"],
"other")
If need check duplicates use Series.duplicated with keep=False for return mask by all duplicates:
df["recoded"] = np.where(df["col1"].duplicated(keep=False), df["col1"], "other")
print (df)
0 1 1 1
1 1 3 1
2 1 2 1
3 2 4 2
4 2 6 2
5 3 5 other
6 4 7 other

Best way to add multiple list to existing dataframe [duplicate]

I'm trying to figure out how to add multiple columns to pandas simultaneously with Pandas. I would like to do this in one step rather than multiple repeated steps.
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3] # I thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
Here are several approaches that will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
Then one of the following:
1) Three assignments in one, using list unpacking:
df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]
2) DataFrame conveniently expands a single row to match the index, so you can do this:
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
3) Make a temporary data frame with new columns, then combine with the original data frame later:
df = pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
)
], axis=1
)
4) Similar to the previous, but using join instead of concat (may be less efficient):
df = df.join(pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
))
5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
df = df.join(pd.DataFrame(
{
'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3
}, index=df.index
))
6) Use .assign() with multiple column arguments.
I like this variant on #zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)
7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:
new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols) # add empty cols
df[new_cols] = new_vals # multi-column assignment works for existing cols
8) In the end it's hard to beat three separate assignments:
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3
Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame
You could use assign with a dict of column names and values.
In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
col_1 col_2 col2_new_2 col3_new_3 col_new_1
0 0 4 dogs 3 NaN
1 1 5 dogs 3 NaN
2 2 6 dogs 3 NaN
3 3 7 dogs 3 NaN
My goal when writing Pandas is to write efficient readable code that I can chain. I won't go into why I like chaining so much here, I expound on that in my book, Effective Pandas.
I often want to add new columns in a succinct manner that also allows me to chain. My general rule is that I update or create columns using the .assign method.
To answer your question, I would use the following code:
(df
.assign(column_new_1=np.nan,
column_new_2='dogs',
column_new_3=3
)
)
To go a little further. I often have a dataframe that has new columns that I want to add to my dataframe. Let's assume it looks like say... a dataframe with the three columns you want:
df2 = pd.DataFrame({'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3},
index=df.index
)
In this case I would write the following code:
(df
.assign(**df2)
)
With the use of concat:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Dictionary mapping with .assign():
This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
If you're just trying to initialize the new column values to be empty as you either don't know what the values are going to be or you have many new columns.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
use of list comprehension, pd.DataFrame and pd.concat
pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
df.index, ['column_new_1', 'column_new_2','column_new_3']
)
], axis=1)
if adding a lot of missing columns (a, b, c ,....) with the same value, here 0, i did this:
new_cols = ["a", "b", "c" ]
df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)
It's based on the second variant of the accepted answer.
Just want to point out that option2 in #Matthias Fripp's answer
(2) I wouldn't necessarily expect DataFrame to work this way, but it does
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
is already documented in pandas' own documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
You can use tuple unpacking:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df['col3'], df['col4'] = 'a', 10
Result:
col1 col2 col3 col4
0 1 3 a 10
1 2 4 a 10
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
full code example
import numpy as np
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')
otherwise go for zeros answer with assign
I am not comfortable using "Index" and so on...could come up as below
df.columns
Index(['A123', 'B123'], dtype='object')
df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])
df.rename(columns={
'C':'C123',
'D':'D123',
'E':'E123'
},inplace=True)
df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')
You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
>>> df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
>>> cols = {
'column_new_1':np.nan,
'column_new_2':'dogs',
'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN dogs 3
1 1 5 NaN dogs 3
2 2 6 NaN dogs 3
3 3 7 NaN dogs 3
Not necessarily better than the accepted answer, but it's another approach not yet listed.
import pandas as pd
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
df['col_3'], df['col_4'] = [df.col_1]*2
>> df
col_1 col_2 col_3 col_4
0 4 0 0
1 5 1 1
2 6 2 2
3 7 3 3

how to add columns label on a Pandas DataFrame

I can't understand how can I add column names on a pandas dataframe, an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df than I get
a b c
0 4 4 5
1 1 2 7
2 3 1 9
3 1 4 1
say now that I generate another dataframe just by summing up the columns on the previous one
a = df.sum()
if I type 'a' than I get
a 9
b 11
c 22
That looks like a dataframe without with index and without names on the only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy because he didn't provide me any message of errors. But still if I type 'a' I can't see the columns name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum you can change a = df.sum() by:
a = pandas.DataFrame(df.sum(), columns = ['whatever_name_you_want'])

Categories

Resources