Creating a new column based on multiple columns - python

I'm trying to create a new column based on other columns existing in my df.
My new column, col, should be 1 if there is at least one 1 in columns A ~ E.
If all values in columns A ~ E are 0, then the value of col should be 0.
I've attached an image for a better understanding.
What is the most efficient way to do this with Python, without using a loop? Thanks.

If you need to test all columns, use DataFrame.max or DataFrame.any, casting the result to integers so True/False maps to 1/0:
df['col'] = df.max(axis=1)
df['col'] = df.any(axis=1).astype(int)
Or, if you need to test only the columns between A and E, add DataFrame.loc:
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
If you need to specify the columns by a list, use a subset:
cols = ['A','B','C','D','E']
df['col'] = df[cols].max(axis=1)
df['col'] = df[cols].any(axis=1).astype(int)
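For illustration, here is a minimal self-contained sketch of the approach above, using a small made-up frame (the sample data is hypothetical, not the asker's):
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [0, 0, 0],
                   'C': [1, 0, 0],
                   'D': [0, 0, 0],
                   'E': [0, 1, 0]})

# any() flags rows with at least one 1; astype(int) maps True/False to 1/0
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
print(df['col'].tolist())  # [1, 1, 0]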

Related

pandas: concatenate multiple columns together with pipe while skipping the empty values

Hi, I want to concatenate multiple columns together using a pipe as the connector in pandas, and if a column has a blank value then skip that column.
I tried the following code, but it does not skip the values when they are empty; it still adds a '|' to connect with the other fields. What I want is to skip the empty field completely.
for example: currently it gives me 'N|911|WALLACE|AVE||||MT|031|000600'
while I want 'N|911|WALLACE|AVE|MT|031|000600'
df['key'] = df[['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir','fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']].agg('|'.join, axis=1)
Can anybody help me with this?
cols = ['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir','fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']
df['key'] = df[cols].apply(lambda row: '|'.join(x for x in row if x), axis=1, raw=True)
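As a quick sanity check, here is a small runnable sketch of that approach. It assumes the blank values are empty strings (if they are NaN, fill them with '' first), and the column names below are shortened, hypothetical stand-ins for the ones in the question:
import pandas as pd

df = pd.DataFrame({'predir': ['N'], 'prim_range': ['911'],
                   'prim_name': ['WALLACE'], 'addr_suffix': ['AVE'],
                   'postdir': [''], 'unit_desig': [''], 'sec_range': [''],
                   'st': ['MT'], 'fips_county': ['031'], 'blk': ['000600']})

cols = df.columns.tolist()
# join only the non-empty values; empty strings are falsy and get skipped
df['key'] = df[cols].apply(lambda row: '|'.join(x for x in row if x), axis=1)
print(df['key'][0])  # N|911|WALLACE|AVE|MT|031|000600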
You can use melt to flatten your dataframe, drop the null values, then group by index and finally concatenate the values:
cols = ['fl_predir', 'fl_prim_range', 'fl_prim_name', 'fl_addr_suffix',
        'fl_postdir', 'fl_unit_desig', 'fl_sec_range', 'fl_st',
        'fl_fips_county', 'blk']
df['key'] = (df[cols].melt(ignore_index=False)['value'].dropna()
                     .astype(str).groupby(level=0).agg('|'.join))
Output:
>>> df['key']
0 N|911|WALLACE|AVE|MT|31|600
Name: key, dtype: object
Alternative (Pandas < 1.1.0):
df['key'] = (df[cols].unstack().dropna().astype(str)
                     .groupby(level=1).agg('|'.join))

How to change the column type of all columns except the first in Pandas?

I have a 6,000 column table that is loaded into a pandas DataFrame. The first column is an ID, the rest are numeric variables. All the columns are currently strings and I need to convert all but the first column to integer.
Many of the functions I've found either don't allow passing a list of column names or drop the first column entirely.
You can do:
df.astype({col: int for col in df.columns[1:]})
An easy trick when you want to perform an operation on all columns but a few is to set the columns you want to ignore as the index:
ignore = ['col1']
df = (df.set_index(ignore, append=True)
        .astype(float)
        .reset_index(ignore)
      )
This should work with any operation even if it doesn't support specifying on which columns to work.
Example input:
df = pd.DataFrame({'col1': list('ABC'),
                   'col2': list('123'),
                   'col3': list('456'),
                   })
Output:
>>> df.dtypes
col1 object
col2 float64
col3 float64
dtype: object
Try something like:
df.loc[:, df.columns != 'ID'].astype(int)
Note that this returns the converted columns as a new DataFrame rather than modifying df in place, so assign the result if you want to keep it.
Some code that can be used for the general case where you want to convert dtypes:
# select the columns that need to be converted
cols = df.select_dtypes(include=['float64']).columns.to_list()
cols = ...  # exclude certain columns from cols here, e.g. the first column
df = df.astype({col: int for col in cols})
You can select str columns and exclude the first column in your case. The idea is basically the same.
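A rough sketch of that idea, with a tiny hypothetical frame (the column names and the 'object' dtype selection are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b'], 'x': ['1', '2'], 'y': ['3', '4']})

# select the string (object) columns, drop the ID column, convert the rest
cols = df.select_dtypes(include=['object']).columns.to_list()
cols = [c for c in cols if c != 'id']
df = df.astype({col: int for col in cols})
print(df.dtypes)  # id stays object, x and y become int64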

pandas df masking specific row by list

I have a pandas df which has 7,000 rows * 7 columns, and I have a list (row_list) that consists of the values that I want to filter on in df.
What I want to do is to filter the rows of df that contain a corresponding value from the list.
This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names='A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)

boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
row_list is a list of single-element lists, so isin never finds a match; passing the column df1.A compares against the scalar values directly. And let us know the result. If it doesn't work, show a sample of df and df1.A.
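A minimal sketch of that fix; df and df1 below are small hypothetical stand-ins for the CSV data in the question:
import pandas as pd

df = pd.DataFrame({'D': ['x', 'y', 'z'], 'E': [1, 2, 3]})
df1 = pd.DataFrame({'A': ['y', 'z']})

# pass a column of scalar values, not a list of single-element lists
boolean_series = df.D.isin(df1.A)
filtered_df = df[boolean_series]
print(filtered_df)
#    D  E
# 1  y  2
# 2  z  3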
Some alternatives:
(1) generate a separate dataframe for each condition, concatenate them, then drop duplicates (slow);
(2) write a custom function that adds a boolean column (default False, set to True when the condition is fulfilled), then filter on that column;
(3) keep a list of the indices of all rows containing your row_list values, then filter with iloc based on that index list (see the sketch below).
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
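As a sketch of option (3) under the same assumptions (hypothetical column name 'D' and made-up values):
import pandas as pd

df = pd.DataFrame({'D': ['a', 'b', 'c', 'd'], 'E': [1, 2, 3, 4]})
row_list = ['b', 'd']

# collect the positional indices of matching rows, then select them with iloc
idx = [i for i, value in enumerate(df['D']) if value in row_list]
filtered_df = df.iloc[idx]
print(filtered_df)
#    D  E
# 1  b  2
# 3  d  4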

Creating a flag for rows with missing data

I have a pandas dataframe and want to create a new column.
This new column would return 1 if all columns in the row have a value (are not NaN).
If there is a NaN in any one of the columns in the row, it would return 0.
Does anyone have guidance on how to go about this?
I have used the code below to count the instances of non-NaN values in the row, which could possibly be used in an if statement, or is there a simpler way?
code_count.apply(lambda x: x.count(), axis=1)
code_count['count_languages'] = code_count.apply(lambda x: x.count(), axis=1)
Use DataFrame.notna to test for non-missing values, together with DataFrame.all to check whether all values in a row are True, then convert the mask to 1/0 with Series.view (note that Series.view is deprecated in recent pandas versions, so the astype variant below is generally preferred):
code_count['count_languages'] = code_count.notna().all(axis=1).view('i1')
Or Series.astype:
code_count['count_languages'] = code_count.notna().all(axis=1).astype('int')
Or numpy.where:
code_count['count_languages'] = np.where(code_count.notna().all(axis=1), 1, 0)
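A tiny runnable check of the astype variant, with made-up sample data:
import numpy as np
import pandas as pd

code_count = pd.DataFrame({'python': [1, np.nan, 3],
                           'r': [2, 5, np.nan]})

# 1 where the whole row is non-NaN, otherwise 0
code_count['count_languages'] = code_count.notna().all(axis=1).astype(int)
print(code_count['count_languages'].tolist())  # [1, 0, 0]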

Select columns based on != condition

I have a dataframe and I have a list of some column names that correspond to the dataframe. How do I filter the dataframe so that it excludes those column names, i.e. I want the dataframe columns that are outside the specified list.
I tried the following:
quant_vair = X != true_binary_cols
but I get the error: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns=["col1", "col2"])
You can either drop the columns from the dataframe, or create a list that does not contain all these columns:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
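A short usage sketch of the list-comprehension approach; the frame and true_binary_cols below are hypothetical:
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
true_binary_cols = ['b']

# keep every column whose name is not in true_binary_cols
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
print(df_filtered.columns.tolist())  # ['a', 'c']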
