How to delete columns based on condition - python

I want to delete the columns that start with the word "TYPE" and do not contain "_1".
df =
  TYPE_1 TYPE_2 TYPE_3 COL1
  aaa    asb    bbb    123
The result should be:
df =
  TYPE_1 COL1
  aaa    123
Currently I am deleting these columns manually; however, this approach is not very efficient if the number of columns is big:
df = df.drop(["TYPE_2","TYPE_3"], axis=1)

A list comprehension can be used. Note: axis=1 denotes that we are referring to columns, and inplace=True can also be used, as per the pandas.DataFrame.drop docs.
droplist = [i for i in df.columns if i.startswith('TYPE') and '_1' not in i]
df.drop(droplist, axis=1, inplace=True)

This is the fifth answer, but I wanted to showcase the power of the DataFrame filter method, which filters by column names with a regex. This selects the columns that don't start with TYPE or that have _1 somewhere in them.
df.filter(regex='^(?!TYPE)|_1')
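As a quick check, a minimal sketch run against the sample frame above (the column values are just placeholders):

import pandas as pd

df = pd.DataFrame({'TYPE_1': ['aaa'], 'TYPE_2': ['asb'],
                   'TYPE_3': ['bbb'], 'COL1': [123]})

# keep columns that either do not start with TYPE or contain _1
print(df.filter(regex='^(?!TYPE)|_1'))
#   TYPE_1  COL1
# 0    aaa   123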

Easy:
unwanted = [column for column in df.columns
            if column.startswith("TYPE") and "_1" not in column]
df = df.drop(columns=unwanted)

t_cols = [c for c in df.columns.values if c.startswith('TYPE_') and not c == 'TYPE_1']
df = df.drop(t_cols, axis=1)
Should do the job

Related

Unnamed columns - rename - pandas

I am trying to rename unnamed columns in my data frame.
The value of the 1st row in such a column is expected to become the name of that column. If a column doesn't contain Unnamed, its name should remain unchanged.
I try to achieve it this way:
for col in df.columns:
    if 'Unnamed' in col:
        df = df.rename(columns=df.iloc[0])
        break
In this case each column is renamed. Any ideas what I am doing wrong?
Use Index.where with str.contains; it replaces values where the mask is False, so invert the mask with ~:
df = pd.DataFrame({'Unnamed 1':['a', 2], 'col':['b',8]})
df.columns = df.columns.where(~df.columns.str.contains('Unnamed'), df.iloc[0])
print (df)
   a col
0  a   b
1  2   8
Your solution can be changed to a loop over the Series holding the first row:
new = []
for col, v in df.iloc[0].items():
    if 'Unnamed' in col:
        new.append(v)
    else:
        new.append(col)
df.columns = new
The same as a list comprehension:
df.columns = [v if 'Unnamed' in col else col for col, v in df.iloc[0].items()]
You can rename unnamed columns as shown below:
df = df.rename(columns={'Unnamed: 0': 'NewName1'})
If there are multiple unnamed columns, then map each one based on the occurrence of the unnamed column:
df = df.rename(columns={'Unnamed: 0': 'NewName1', 'Unnamed: 1': 'NewName2'})
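If there are many such columns, a dictionary comprehension can build the mapping programmatically (a sketch; the NewName pattern and the sample frame are just illustrations):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['Unnamed: 0', 'keep', 'Unnamed: 1'])

# build {'Unnamed: 0': 'NewName1', 'Unnamed: 1': 'NewName2', ...} automatically
mapping = {c: f'NewName{i + 1}'
           for i, c in enumerate(c for c in df.columns if c.startswith('Unnamed'))}
df = df.rename(columns=mapping)
print(df.columns.tolist())  # ['NewName1', 'keep', 'NewName2']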

How to change the column type of all columns except the first in Pandas?

I have a 6,000 column table that is loaded into a pandas DataFrame. The first column is an ID, the rest are numeric variables. All the columns are currently strings and I need to convert all but the first column to integer.
Many of the functions I've found don't allow passing a list of column names or drop the first column entirely.
You can do:
df = df.astype({col: int for col in df.columns[1:]})
Note that astype returns a new DataFrame, so the result has to be assigned back.
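A quick sketch with a toy frame (the column names here are made up):

import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b'], 'x': ['1', '2'], 'y': ['3', '4']})
df = df.astype({col: int for col in df.columns[1:]})
print(df.dtypes)
# ID    object
# x      int64
# y      int64
# dtype: object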
An easy trick when you want to perform an operation on all columns but a few is to set the columns to ignore as the index:
ignore = ['col1']
df = (df.set_index(ignore, append=True)
.astype(float)
.reset_index(ignore)
)
This should work with any operation even if it doesn't support specifying on which columns to work.
Example input:
df = pd.DataFrame({'col1': list('ABC'),
                   'col2': list('123'),
                   'col3': list('456'),
                   })
Output:
>>> df.dtypes
col1     object
col2    float64
col3    float64
dtype: object
Try something like:
df.loc[:, df.columns != 'ID'].astype(int)
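Note that this returns a new object rather than modifying df in place, so assign the converted columns back; a minimal sketch, assuming the first column is literally named 'ID':

import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b'], 'v1': ['1', '2'], 'v2': ['3', '4']})
other = df.columns != 'ID'  # boolean mask over the column labels
df[df.columns[other]] = df.loc[:, other].astype(int)  # replace the converted columns
print(df.dtypes)  # ID stays object, v1 and v2 become int64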
Some code that could be used for general cases where you want to convert dtypes
# select columns that need to be converted
cols = df.select_dtypes(include=['float64']).columns.to_list()
cols = ...  # here exclude certain columns in cols, e.g. the first one
df = df.astype({col: int for col in cols})
You can select str columns and exclude the first column in your case. The idea is basically the same.
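A sketch of that idea (assuming the string columns were read in as object dtype; Index.difference drops the first column from the selection):

import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b'], 'x': ['1', '2'], 'y': ['3', '4']})

# all string (object) columns, minus the first column of the frame
cols = df.select_dtypes(include=['object']).columns.difference([df.columns[0]])
df = df.astype({col: int for col in cols})
print(df.dtypes)  # ID stays object, x and y become int64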

Alternative way to merge two dataframes in python

Let's take a simple example. I have this first dataframe:
df = pd.DataFrame(dict(Name=['abc','def','ghi'],NoMatter=['X','X','X']))
df
Name NoMatter
0 abc X
1 def X
2 ghi X
For some reason, I would like to use a for loop which adds a column Value to df and does some processing, using another dataframe that changes at each iteration:
# structure of the for loop I would like to use:
for i in range(something):
    # add the column Value to df from df_value
    # other treatment, not useful here
# appearance of df_value (which changes at each iteration of the for loop):
Name Value
0 abc 1
1 def 2
2 ghi 3
However, I would prefer not to use merging, because that would require deleting the Value column added in the previous iteration before adding the one for the current iteration. Is there a way to add the Value column to df with just an assignment, starting like this:
df['Value'] = XXX
Expected output :
Name NoMatter Value
0 abc X 1
1 def X 2
2 ghi X 3
[EDIT]
I don't want to use merging because at the fourth iteration of the for loop, df would have the columns:
Name NoMatter Value1 Value2 Value3 Value4
Whereas I just want to have :
Name NoMatter Value4
I could delete the previous column each time, but it does not seem very efficient. This is why I'm just looking for a way to assign values to the Value column, not to add a new column each time. Something like the Excel VLOOKUP function applied to df using df_value's data.
3 ways to join dataframes
df1.append(df2)                      # stacks the rows of df2 under df1 (columns should be identical; deprecated in recent pandas in favor of pd.concat)
pd.concat([df1, df2], axis=1)        # puts the columns of df2 next to df1 (rows should be identical)
df1.join(df2, on=col1, how='inner')  # SQL-style join of the columns in df1 with the columns of
                                     # df2 where the rows for col1 have identical values;
                                     # how can be one of 'left', 'right', 'outer', 'inner'
Here's the solution for your problem.
import pandas as pd
df = pd.DataFrame(dict(Name=['abc','def','ghi'],NoMatter=['X','X','X']))
df1 = pd.DataFrame(dict(Name=['abc','def','ghi'],Value=[1,2,3]))
new_df=pd.merge(df, df1, on='Name')
new_df
The correct way is @UmerRana's answer, because iterating over a dataframe has terrible performance. If you really have to do it, it is possible to address an individual cell, but never pretend I advised you to do so:
df = pd.DataFrame(dict(Name=['abc','def','ghi'], NoMatter=['X','X','X']))
df1 = pd.DataFrame(dict(Name=['abc','def','ghi'], Value=[1,2,3]))
df['Value'] = 0  # initialize a new column of integers (hence the 0)
ix = df.columns.get_loc('Value')
for i in range(len(df)):  # perf is terrible!
    df.iloc[i, ix] = df1['Value'][i]
After seeing your example code, and if you cannot avoid the loop, I think this would be the least bad way:
import numpy as np

newcol = np.zeros(something, dtype='int')  # set the correct type
for i in range(something):
    # compute a value
    newcol[i] = value_for_i_iteration
df['Value'] = newcol  # assign the array to the new column
Maybe not the best way, but this solution works and replaces the Value column at each iteration (no need to delete the Value column before each new iteration):
# similar to the Excel VLOOKUP function
def vlookup(df, ref, col_ref, col_goal):
    return pd.DataFrame(df[df.apply(lambda x: ref == x[col_ref], axis=1)][col_goal]).iloc[0, 0]

df['Value'] = df['Name'].apply(lambda x: vlookup(df_value, x, 'Name', 'Value'))
# Output:
  Name NoMatter  Value
0  abc        X      1
1  def        X      2
2  ghi        X      3
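A more idiomatic alternative, sketched under the same column names, is Series.map with an index lookup; it overwrites df['Value'] each time without any merging:

import pandas as pd

df = pd.DataFrame(dict(Name=['abc','def','ghi'], NoMatter=['X','X','X']))
df_value = pd.DataFrame(dict(Name=['abc','def','ghi'], Value=[1,2,3]))

# behaves like VLOOKUP: look each Name up in df_value and fetch its Value
df['Value'] = df['Name'].map(df_value.set_index('Name')['Value'])
print(df)
#   Name NoMatter  Value
# 0  abc        X      1
# 1  def        X      2
# 2  ghi        X      3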

Change dataframe row names

I have a df which looks like:
BBG.LON.123.S_CAR_ADJ_DPS 343.94325
BBG.LON.436.S_CAR_ADJ_DPS 236.51530
I am trying to rename the row names (removing the '_CAR_ADJ_DPS' element of each row name) and to name the index 'id', so my resulting df looks like:
id
BBG.LON.123.S 343.94325
BBG.LON.436.S 236.51530
I have tried using the following line without success:
pd.DataFrame(pd.Series(np.unique([row.split('_')[0] for row in df.rows]), name='id'))
What can I try next?
I think you can use str.split with rename_axis (new in pandas 0.18.0):
print (df)
                                   a
BBG.LON.123.S_CAR_ADJ_DPS  343.94325
BBG.LON.436.S_CAR_ADJ_DPS  236.51530
df.index = df.index.str.split('_').str[0]
df = df.rename_axis('id')
# if using pandas below 0.18.0:
# df.index.name = 'id'
print (df)
                       a
id
BBG.LON.123.S  343.94325
BBG.LON.436.S  236.51530
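An equivalent sketch, if the suffix is always the literal string '_CAR_ADJ_DPS', uses str.replace on the index:

import pandas as pd

df = pd.DataFrame({'a': [343.94325, 236.51530]},
                  index=['BBG.LON.123.S_CAR_ADJ_DPS', 'BBG.LON.436.S_CAR_ADJ_DPS'])

# literal (non-regex) removal of the fixed suffix, then name the index
df.index = df.index.str.replace('_CAR_ADJ_DPS', '', regex=False)
df = df.rename_axis('id')
print(df)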
You may also be interested in str.extract to pull out the entries as columns:
In [11]: df[0].str.extract(r'(?P<A>.*)\.(?P<B>.*)\.(?P<C>\d+)\.(?P<D>.)_.*', expand=True)
Out[11]:
     A    B    C  D
0  BBG  LON  123  S
1  BBG  LON  436  S

Data selection using pandas

I have a file where the separator (delimiter) is ';'. I read that file into a pandas dataframe df. Now, I want to select some rows from df using criteria from column c in df. The format of the data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df, where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know I can write a for loop and iterate over each object in the dataframe, but I was wondering if I could do something in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat = df.iloc[:, c].str.split('|')  # .ix is deprecated; .iloc selects column c by position
But, I do not know how to index 'dat'. 'dat' is a Pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance.
Here is an approach:
Data
df = pd.DataFrame({'c': ['history|science', 'science|chemistry', 'geography|science', 'biology|IT'],
                   'col2': range(4)})
Out[433]:
                   c  col2
0    history|science     0
1  science|chemistry     1
2  geography|science     2
3         biology|IT     3
lst = ['geography', 'biology','IT']
Resolution
You can use list comprehension:
df.loc[pd.Series([not x.split('|')[0] in lst for x in df.c.tolist()])]
Out[444]:
                   c  col2
0    history|science     0
1  science|chemistry     1
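A fully vectorized variant of the same mask, using the df and lst defined above (a sketch with pandas string methods, no Python-level loop):

# first token before '|' for each row; keep rows whose token is not in lst
mask = ~df['c'].str.split('|').str[0].isin(lst)
df.loc[mask]
#                    c  col2
# 0    history|science     0
# 1  science|chemistry     1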
