No exception raised when accessing wrong column labels in Pandas? - python

Accessing a Pandas DataFrame in some cases does not raise an exception even when the column labels do not exist.
How should I check for these cases, to avoid reading wrong results?
a = pd.DataFrame(np.zeros((5,2)), columns=['la', 'lb'])
a
Out[349]:
la lb
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
a.loc[:, 'lc'] # Raised exception as expected.
a.loc[:, ['la', 'lb', 'lc']] # Not expected.
Out[353]:
la lb lc
0 0.0 0.0 NaN
1 0.0 0.0 NaN
2 0.0 0.0 NaN
3 0.0 0.0 NaN
4 0.0 0.0 NaN
a.loc[:, ['la', 'wrong_lb', 'lc']] # Not expected.
Out[354]:
la wrong_lb lc
0 0.0 NaN NaN
1 0.0 NaN NaN
2 0.0 NaN NaN
3 0.0 NaN NaN
4 0.0 NaN NaN
Update: There is a suggested duplicate question (Safe label-based selection in DataFrame), but it's about row selection; my question is about column selection.

It looks like, because at least one of the columns exists, the selection returns an enlarged DataFrame, as in a reindex operation.
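For reference, a minimal sketch showing that an explicit reindex over the requested labels produces the same enlarged frame, which is essentially what the list-based .loc lookup does here (newer pandas versions raise a KeyError for missing list labels in .loc, while reindex still enlarges silently):
a.reindex(columns=['la', 'lb', 'lc'])  # adds an all-NaN 'lc' column, no exception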
You could define a helper function that validates the columns and handles whether each column exists or not. Here I construct a pandas Index object from the passed-in iterable and call intersection to return only the labels common to the existing DataFrame and the passed-in iterable:
In [80]:
def val_cols(cols):
    return pd.Index(cols).intersection(a.columns)
a.loc[:, val_cols(['la', 'lb', 'lc'])]
Out[80]:
la lb
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
This also handles completely missing columns:
In [81]:
a.loc[:, val_cols(['x', 'y'])]
Out[81]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
This also handles your latter case:
In [83]:
a.loc[:, val_cols(['la', 'wrong_lb', 'lc'])]
Out[83]:
la
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
Update:
In the case where you just want to test whether all the columns are valid, you can iterate over each column in the list and append the invalid ('duff') columns:
In [93]:
def val_cols(cols):
    duff = []
    for col in cols:
        try:
            a[col]
        except KeyError:
            duff.append(col)
    return duff
invalid = val_cols(['la','x', 'y'])
print(invalid)
['x', 'y']
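If you would rather fail fast than silently drop the bad labels, here is a small sketch of a stricter variant (require_cols is a hypothetical helper, not part of the answer above):
def require_cols(df, cols):
    # raise immediately if any requested label is missing
    missing = pd.Index(cols).difference(df.columns)
    if not missing.empty:
        raise KeyError(f'columns not found: {list(missing)}')
    return df.loc[:, list(cols)]

require_cols(a, ['la', 'wrong_lb', 'lc'])  # raises KeyError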

Related

Conditionally Set Values Greater Than 0 To 1

I have a dataframe that looks like this, with many more date columns
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 7.0 0.0
1 Catherine 0.0 13.0 17.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 3.0 0.0
4 Christine 0.0 0.0 0.0
I would like to set values in each column after the AUTHOR to 1 when the value is greater than 0, so the resulting table would look like this:
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
I tried the following line of code but got an error, which makes sense, as I need to figure out how to apply this code only to the date columns while keeping the AUTHOR column in my table.
Counts[Counts != 0] = 1
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
You can select the date columns first, then mask on those columns:
cols = df.drop(columns='AUTHOR').columns
# or
cols = df.filter(regex=r'\d{4}-\d{2}-\d{2}').columns
# or
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].mask(df[cols] != 0, 1)
print(df)
AUTHOR 2022-07-01 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
Since you only want to exclude the first column, you can set it as the index, build your booleans, and reset the index at the end.
df = df.set_index('AUTHOR').pipe(lambda g: g.mask(g > 0, 1)).reset_index()
df
AUTHOR 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0
1 Cathrine 1.0 1.0

Replace dataframe rows with identical rows from another dataframe on a column value

I have a dataframe data in which I took a subset of it g2_data to perform some operations on. How would I go about replacing values in the original dataframe with values from the subset, using values from one of the columns as the reference?
The column structure from data is retained in the subset g2_data shown below.
data:
idx group x1 y1
0 27 1 0.0 0.0
1 28 1 0.0 0.0
2 29 1 0.0 0.0
3 73 1 0.0 0.0
4 74 1 0.0 0.0
... ... ... ...
14612 14674 8 0.0 0.0
14613 14697 8 0.0 0.0
14614 14698 8 0.0 0.0
14615 14721 8 0.0 0.0
14616 14722 8 0.0 0.0
[14617 rows x 4 columns]
g2_data:
idx group x1 y1
1125 1227 2 115.0 0.0
1126 1228 2 0.0 220.0
1127 1260 2 0.0 0.0
1128 1294 2 0.0 0.0
1129 1295 2 0.0 0.0
... ... ... ...
3269 3277 2 0.0 0.0
3270 3308 2 0.0 0.0
3271 3309 2 0.0 0.0
3272 3342 2 0.0 0.0
3273 3343 2 0.0 0.0
[2149 rows x 4 columns]
Replace rows in Dataframe using index from another Dataframe has an answer that does this using the rows' index values, but I would like to do it using the values from the idx column, in case I need to reset the index in the subset later on (i.e. starting from 0 instead of keeping the index values from the original dataframe). It is important to note that the values in the idx column are all unique, as they pertain to info about each observation.
This probably isn't optimal, but you can convert g2_data to a dictionary and then map the other columns based on idx, filtering the update to just those ids in the g2_data subset.
g2_data_dict = g2_data.set_index('idx').to_dict()
g2_data_ids = g2_data['idx'].to_list()
for k in g2_data_dict.keys():
    # restrict the update to rows whose idx appears in the g2_data subset
    data.loc[data['idx'].isin(g2_data_ids), k] = data['idx'].map(g2_data_dict[k])
Use combine_first:
out = g2_data.set_index('idx').combine_first(data.set_index('idx')).reset_index()
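If you only want to overwrite the matching rows of data in place, a sketch using DataFrame.update with idx as the alignment key also works (assuming, as stated, that the idx values are unique; note that update skips NaN values coming from g2_data):
out = data.set_index('idx')
out.update(g2_data.set_index('idx'))  # aligns on idx, modifies out in place
out = out.reset_index()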

How to assign pandas dataframe to slice of other dataframe

I have Excel spreadsheets with data, one for each year. Alas, the columns change slightly over the years. What I want is one dataframe with all the data, with the missing columns filled with predefined data. I wrote a small example program to test that.
import numpy as np
import pandas as pd
# Initialize three dataframes
df1 = pd.DataFrame([[1,2], [11,22],[111,222]], columns=['een', 'twee'])
df2 = pd.DataFrame([[3,4], [33,44],[333,444]], columns=['een', 'drie'])
df3 = pd.DataFrame([[5,6], [55,66],[555,666]], columns=['twee', 'vier'])
# Store these in a dictionary and print for verification
d = {'df1': df1, 'df2': df2, 'df3': df3}
for key in d:
    print(d[key])
    print()
# Create a list of all columns, as order is relevant a Set is not used
cols = []
# Count total number of rows
nrows = 0
# Loop thru each dataframe to determine total number of rows and columns
for key in d:
    df = d[key]
    nrows += len(df)
    for col in df.columns:
        if col not in cols:
            cols += [col]
# Create total dataframe, fill with default (zeros)
data = pd.DataFrame(np.zeros((nrows, len(cols))), columns=cols)
# Assign dataframe to each slice
c = 0
for key in d:
    data.loc[c:c+len(d[key])-1, d[key].columns] = d[key]
    c += len(d[key])
print(data)
The dataframes are initialized all right but there is something weird with the assignment to the slice of the data dataframe. What I wanted (and expected) is:
een twee drie vier
0 1.0 2.0 0.0 0.0
1 11.0 22.0 0.0 0.0
2 111.0 222.0 0.0 0.0
3 3.0 0.0 4.0 0.0
4 33.0 0.0 44.0 0.0
5 333.0 0.0 444.0 0.0
6 0.0 5.0 0.0 6.0
7 0.0 55.0 0.0 66.0
8 0.0 555.0 0.0 666.0
But this is what I got:
een twee drie vier
0 1.0 2.0 0.0 0.0
1 11.0 22.0 0.0 0.0
2 111.0 222.0 0.0 0.0
3 NaN 0.0 NaN 0.0
4 NaN 0.0 NaN 0.0
5 NaN 0.0 NaN 0.0
6 0.0 NaN 0.0 NaN
7 0.0 NaN 0.0 NaN
8 0.0 NaN 0.0 NaN
The location AND the data of the first dataframe are correctly assigned. However, the second dataframe is assigned to the correct location, but not its contents: NaN is assigned instead. This also happens for the third dataframe: correct location but missing data. I have tried assigning d[key].loc[0:2, d[key].columns] and some more fanciful expressions to the data slice, but all return NaN. How can I get the contents of the dataframes assigned to data as well?
Per the comments, you can use:
pd.concat([df1, df2, df3])
OR
pd.concat([df1, df2, df3]).fillna(0)
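For context, the NaNs in the original loop come from index alignment: each d[key] keeps its own 0..2 index, so when it is assigned into rows 3..5 of data there are no matching row labels and pandas fills with NaN. If you want to keep the preallocated-frame approach, here is a sketch of the same loop with the index stripped via .to_numpy():
c = 0
for key in d:
    # .to_numpy() drops the index, so the values are placed positionally
    data.loc[c:c+len(d[key])-1, d[key].columns] = d[key].to_numpy()
    c += len(d[key])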

Pandas: cell-wise fillna(method = 'pad') of a list of DataFrame

Basically, I'm trying to do something like this but for a fillna instead of a sum.
I have a list of df's, each with same colunms/indexes, ordered over time:
import numpy as np
import pandas as pd
np.random.seed(0)
df_list = []
for index in range(3):
    a = pd.DataFrame(np.random.randint(3, size=(5,3)), columns=list('abc'))
    mask = np.random.choice([True, False], size=a.shape)
    df_list.append(a.mask(mask))
Now, I want to replace the numpy.nan cells of the i-th DataFrame in df_list with the value of the same cell in the (i-1)-th DataFrame in df_list.
so if the first DataFrame is:
a b c
0 NaN 1.0 0.0
1 1.0 1.0 NaN
2 0.0 NaN 0.0
3 NaN 0.0 2.0
4 NaN 2.0 2.0
and the 2nd is:
a b c
0 0.0 NaN NaN
1 NaN NaN NaN
2 0.0 1.0 NaN
3 NaN NaN 2.0
4 0.0 NaN 2.0
Then the output output_list should be a list of the same length as df_list, also with DataFrames as elements.
The first entry of output_list is the same as the first entry of df_list.
The second entry of output_list is:
a b c
0 0.0 1.0 0.0
1 1.0 1.0 NaN
2 0.0 1.0 0.0
3 NaN 0.0 2.0
4 0.0 2.0 2.0
I believe the update functionality is very good for this, see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
It is a method that specifically allows you to update a DataFrame, in your case only the NaN-elements of it.
In particular, you could use it like this:
new_df_list = df_list[:1]
for df_new, df_old in zip(df_list[1:], df_list[:-1]):
    df_new.update(df_old, overwrite=False)
    new_df_list.append(df_new)
This will give you the desired output.
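One caveat: DataFrame.update modifies df_new in place, so the loop above also changes the frames stored in df_list. If you need df_list left untouched, a sketch with explicit copies:
new_df_list = [df_list[0].copy()]
for df_cur in df_list[1:]:
    filled = df_cur.copy()
    filled.update(new_df_list[-1], overwrite=False)  # fill NaNs from the previous result
    new_df_list.append(filled)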

Sort rows of a dataframe in descending order of NaN counts

I'm trying to sort the following Pandas DataFrame:
RHS age height shoe_size weight
0 weight NaN 0.0 0.0 1.0
1 shoe_size NaN 0.0 1.0 NaN
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
in such a way that the rows with a greater number of NaNs are positioned first.
More precisely, in the above df, the row with index 1 (2 NaNs) should come before the row with index 0 (1 NaN).
What I do now is:
df.sort_values(by=['age', 'height', 'shoe_size', 'weight'], na_position="first")
Using sort_values on the per-row NaN counts and iloc-based accessing:
df = df.iloc[df.isnull().sum(1).sort_values(ascending=0).index]
print(df)
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
2 shoe_size 3.0 0.0 0.0 NaN
0 weight NaN 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
3 weight 3.0 0.0 0.0 1.0
df.isnull().sum(1) counts the NaNs and the rows are accessed based on this sorted count.
@ayhan offered a nice little improvement to the solution above, involving pd.Series.argsort:
df = df.iloc[df.isnull().sum(axis=1).mul(-1).argsort()]
print(df)
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
0 weight NaN 0.0 0.0 1.0
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
df.isnull().sum().sort_values(ascending=False)
Here's a one-liner that will do it:
df.assign(Count_NA = lambda x: x.isnull().sum(axis=1)).sort_values('Count_NA', ascending=False).drop('Count_NA', axis=1)
# RHS age height shoe_size weight
# 1 shoe_size NaN 0.0 1.0 NaN
# 0 weight NaN 0.0 0.0 1.0
# 2 shoe_size 3.0 0.0 0.0 NaN
# 3 weight 3.0 0.0 0.0 1.0
# 4 age 3.0 0.0 0.0 1.0
This works by assigning a temporary column ("Count_NA") to count the NAs in each row, sorting on that column, and then dropping it, all in the same expression.
You can add a column of the number of null values, sort by that column, then drop the column. It's up to you if you want to use .reset_index(drop=True) to reset the row count.
df['null_count'] = df.isnull().sum(axis=1)
df.sort_values('null_count', ascending=False).drop('null_count', axis=1)
# returns
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
0 weight NaN 0.0 0.0 1.0
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
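If you also want sequential row labels after the sort, chain the reset mentioned above:
df.sort_values('null_count', ascending=False).drop('null_count', axis=1).reset_index(drop=True)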
