Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this:
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are for handling pandas DataFrames. How should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas DataFrame in a for loop?

I think the simplest way is to select all columns that do not start with a number, using filter with a regex (^ matches the start of the string and \D matches any non-digit character):
df1 = df.filter(regex=r'^\D')
A similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and exclude the columns that start with a digit:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to keep the structure of your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
'1B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D3':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b

An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
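The string-based checks above assume every column label is a string. A minimal sketch that also tolerates non-string labels (for example integer column names) converts each label with str() before testing its first character. The frame and its column names here are made up for illustration:

```python
import pandas as pd

# Hypothetical frame mixing string and integer column labels.
df = pd.DataFrame({'2A': [1, 2], 'C': [3, 4], 7: [5, 6], 'D3': [7, 8]})

# str(col)[:1] is '' for an empty label, and ''.isdigit() is False,
# so an empty label is kept instead of raising IndexError.
keep = [col for col in df.columns if not str(col)[:1].isdigit()]
df1 = df[keep]

print(df1.columns.tolist())
```

Whether integer labels should count as "beginning with a number" is a design choice; this sketch treats them that way and drops them.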

Related

Pandas, duplicate a row based on a condition

I have a dataframe like this -
What I want to do is, whenever there is 'X' in Col3, that row should get duplicated and 'X' should be changed to 'Z'. The result must look like this -
I did try a few approaches, but nothing worked!
Can somebody please guide on how to do this.

First filter the rows by boolean indexing and set Col3 to Z with DataFrame.assign, then join the result with the original using concat. Sort the index with DataFrame.sort_index using the stable mergesort algorithm so that each copy follows its original row, and finally create a default RangeIndex with DataFrame.reset_index and drop=True:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'Col3':list('aXcdXf'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = (pd.concat([df, df[df['Col3'].eq('X')].assign(Col3 = 'Z')])
.sort_index(kind='mergesort')
.reset_index(drop=True))
print (df)
B C Col3 D E F
0 4 7 a 1 5 a
1 5 8 X 3 3 a
2 5 8 Z 3 3 a
3 4 9 c 5 6 a
4 5 4 d 7 9 b
5 5 2 X 1 2 b
6 5 2 Z 1 2 b
7 4 3 f 0 4 b
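The same steps can be wrapped in a small helper. This is only a sketch: the function name duplicate_with_replacement and its parameters are invented here, and it assumes the frame has a unique index, since the stable sort relies on each copy sharing the index label of its original row.

```python
import pandas as pd

def duplicate_with_replacement(df, column, old, new):
    """Return a frame where each row with df[column] == old is
    followed by a copy whose value in that column is new."""
    copies = df[df[column].eq(old)].assign(**{column: new})
    return (pd.concat([df, copies])
              .sort_index(kind='mergesort')   # stable: original before copy
              .reset_index(drop=True))

df = pd.DataFrame({'B': [4, 5, 4], 'Col3': list('aXc')})
out = duplicate_with_replacement(df, 'Col3', 'X', 'Z')
print(out)
```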

Best way to move a column in pandas dataframe to last column in large dataframe

I have a pandas dataframe with more than 100 columns.
For example in the following df:
df['A','B','C','D','E','date','G','H','F','I']
How can I move date to be the last column, assuming the dataframe is large and I can't write out all the column names manually?

You can try this:
new_cols = [col for col in df.columns if col != 'date'] + ['date']
df = df[new_cols]
Test data:
cols = ['A','B','C','D','E','date','G','H','F','I']
df = pd.DataFrame([np.arange(len(cols))],
columns=cols)
print(df)
# A B C D E date G H F I
# 0 0 1 2 3 4 5 6 7 8 9
Output of the code:
A B C D E G H F I date
0 0 1 2 3 4 6 7 8 9 5

Use pandas.DataFrame.pop and pandas.concat:
print(df)
col1 col2 col3
0 1 11 111
1 2 22 222
2 3 33 333
s = df.pop('col1')
new_df = pd.concat([df, s], axis=1)
print(new_df)
Output:
col2 col3 col1
0 11 111 1
1 22 222 2
2 33 333 3

This way:
# copy so that adding a column does not trigger SettingWithCopyWarning
df_new = df.loc[:, df.columns != 'date'].copy()
df_new['date'] = df['date']

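Reindexing with a reordered column index should do the job:
original = df.columns
# remove 'date' from its current position, then append it at the end
new_cols = original.delete(original.get_loc('date')).append(pd.Index(['date']))
df = df.reindex(columns=new_cols)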

You can use reindex and union:
df.reindex(df.columns[df.columns != 'date'].union(['date']), axis=1)
Let's only work with the index headers and not the complete dataframe.
Then, use reindex to reorder the columns.
Output using @QuangHoang's setup:
A B C D E F G H I date
0 0 1 2 3 4 8 6 7 9 5
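Note that Index.union sorts its result by default, so the other columns come out in alphabetical order (F moves ahead of G and H in the output above), and date lands last only because lowercase letters sort after uppercase ones. If the original order of the remaining columns matters, one possible sketch uses Index.drop, which leaves the remaining labels in place:

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E', 'date', 'G', 'H', 'F', 'I']
df = pd.DataFrame([np.arange(len(cols))], columns=cols)

# Index.drop removes the label without reordering the rest.
new_cols = df.columns.drop('date').tolist() + ['date']
df = df.reindex(columns=new_cols)

print(df.columns.tolist())
```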

You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'date')
Hope that helps.
P.S : The package can be found here. https://pypi.org/project/movecolumn/

Using isin() to determine what should be printed

Right now I have two dataframes (data1 and data2)
I would like to print a column of string values in the dataframe called data1, based on whether the ID exists in both data2 and data1.
What I am doing now gives me a boolean list (True or False if the ID exists in the both dataframes but not the column of strings).
print(data2['id'].isin(data1.id).to_string())
yields
0 True
1 True
2 True
3 True
4 True
5 True
Any ideas would be appreciated.
Here is a sample of data1
'user_id', 'id', 'rating', 'unix_timestamp'
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
And data2 contains something like this
'id', 'title', 'release_date',
'video_release_date', 'imdb_url'
37|Nadja (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Nadja%20(1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
38|Net, The (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Net,%20The%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|0|0
39|Strange Days (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Strange%20Days%20(1995)|0|1|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0|0|0

If all values of ids are unique:
I think you need merge with an inner join. From data2 select only the id column. The on parameter can be omitted, because merge then joins on all shared columns, which here is only id:
df = pd.merge(data1, data2[['id']])
Sample:
data1 = pd.DataFrame({'id':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3]})
print (data1)
B C id
0 4 7 a
1 5 8 b
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
data2 = pd.DataFrame({'id':list('frcdeg'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],})
print (data2)
D E id
0 1 5 f
1 3 3 r
2 5 6 c
3 7 9 d
4 1 2 e
5 0 4 g
df = pd.merge(data1, data2[['id']])
print (df)
B C id
0 4 9 c
1 5 4 d
2 5 2 e
3 4 3 f
If id values are duplicated in either DataFrame, use the other answer. Some similar solutions:
df = data1[data1['id'].isin(set(data1['id']) & set(data2['id']))]
ids = set(data1['id']) & set(data2['id'])
df = data2.query('id in @ids')
df = data1[np.isin(data1['id'], np.intersect1d(data1['id'], data2['id']))]
Sample:
data1 = pd.DataFrame({'id':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3]})
print (data1)
B C id
0 4 7 a
1 5 8 b
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
data2 = pd.DataFrame({'id':list('fecdef'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],})
print (data2)
D E id
0 1 5 f
1 3 3 e
2 5 6 c
3 7 9 d
4 1 2 e
5 0 4 f
df = data1[data1['id'].isin(set(data1['id']) & set(data2['id']))]
print (df)
B C id
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
EDIT:
To return only the title column of data2, build the mask from data2 so that it aligns with data2's rows:
df = data2.loc[data2['id'].isin(set(data1['id']) & set(data2['id'])), ['title']]
ids = set(data1['id']) & set(data2['id'])
df = data2.query('id in @ids')[['title']]
df = data2.loc[np.isin(data2['id'], np.intersect1d(data1['id'], data2['id'])), ['title']]

You can compute the set intersection of the two columns -
ids = set(data1['id']).intersection(data2['id'])
Or,
ids = np.intersect1d(data1['id'], data2['id'])
Next, query/filter out relevant rows.
data1.loc[data1['id'].isin(ids), 'id']
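Put together for the original question (printing the titles in data2 whose id also appears in data1), a minimal sketch might look like this. The frames reuse the column names from the question, but the rows and titles are made up so that some ids match:

```python
import pandas as pd

# Small made-up frames with the same column names as in the question.
data1 = pd.DataFrame({'user_id': [196, 186, 22],
                      'id': [242, 302, 377],
                      'rating': [3, 3, 1]})
data2 = pd.DataFrame({'id': [37, 242, 377],
                      'title': ['Movie A', 'Movie B', 'Movie C']})

# The mask is built from data2, so it aligns with data2's rows.
mask = data2['id'].isin(data1['id'])
titles = data2.loc[mask, 'title']

print(titles.to_string(index=False))
```

Passing data1['id'] directly to isin is enough here; computing the intersection first is not required for the filter to work.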

How to join column values in pandas MultiIndex DataFrame?

How can I join values in columns with the same name in MultiIndex pandas DataFrame?
data = [['1','1','2','3','4'],['2','5','6','7','8']]
df = pd.DataFrame(data, columns=['id','A','B','A','B'])
df = df.set_index('id')
df.columns = pd.MultiIndex.from_tuples([('result','A'),('result','B'),('student','A'),('student','B')])
df
result student
A B A B
id
1 1 2 3 4
2 5 6 7 8
Desired results:
A B
id
1 "1 3" "2 4"
2 "5 7" "6 8"

I am not completely sure what you are asking. If you have two separate dataframes then you should be able to just use pd.concat.
pd.concat([df1, df2], axis=1)
If you have one dataframe then just drop the top level of the index.
df.columns = df.columns.droplevel(0)

New answer:
To join values by the second level of the column MultiIndex, use groupby with agg:
# select the columns named in the list
df = df[['result','student']]
df1 = df.astype(str).groupby(level=1, axis=1).agg(' '.join)
print (df1)
A B
id
1 1 3 2 4
2 5 7 6 8
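In recent pandas releases, passing axis=1 to groupby is deprecated. A possible equivalent, sketched here with the data from the question, transposes the frame, groups on the index level, and transposes back:

```python
import pandas as pd

data = [['1', '1', '2', '3', '4'], ['2', '5', '6', '7', '8']]
df = pd.DataFrame(data, columns=['id', 'A', 'B', 'A', 'B']).set_index('id')
df.columns = pd.MultiIndex.from_tuples(
    [('result', 'A'), ('result', 'B'), ('student', 'A'), ('student', 'B')])

# Transpose so the column levels become row levels, group on the
# second level, join the strings in each group, then transpose back.
df1 = df.astype(str).T.groupby(level=1).agg(lambda s: ' '.join(s)).T

print(df1)
```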
Old answer:
You can use sort_index to sort the columns and then droplevel to remove the first level of the MultiIndex.
But this produces duplicate column names.
print (df)
result student col
A B A B A B
id
1 1 2 3 4 6 7
2 5 6 7 8 2 1
#select columns define in list
df = df[['result','student']]
print (df)
result student
A B A B
id
1 1 2 3 4
2 5 6 7 8
df = df.sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(0)
print (df)
A A B B
id
1 1 3 2 4
2 5 7 6 8
A better option is to create unique column names by map with join:
df = df.sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
print (df)
result_A student_A result_B student_B
id
1 1 3 2 4
2 5 7 6 8
df = pd.concat([df['result'],df['student']], axis=1).sort_index(axis=1)
print (df)
A A B B
id
1 1 3 2 4
2 5 7 6 8

How to extract rows in a pandas dataframe NOT in a subset dataframe

I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows are correct and most rows are correct,
ie number of rows in subDF + number of rows in DF2 = number of rows in DF
but I get rows with NaN values that do not exist in the original DF
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I did DF.dropna() before and the result still produced NaN

You need merge with an outer join and boolean indexing, because DataFrame.isin requires both the values and the index labels to match:
DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
# nothing is filtered out: isin compares row 0 of DF with row 0 of SubDF, and they differ
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4

Another way, borrowing the setup from @jezrael. This compares index labels only, so it assumes the subset keeps the index labels of the rows it was taken from (here label 2):
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]}, index=[2])
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
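If the subset does share its index labels with the original frame, a shorter order-preserving variant is a boolean mask built with Index.isin. A sketch, in which sub is sliced from df on the assumption that this is how the subset was produced:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Assumed: sub was taken from df, so it keeps the index label 2.
sub = df.loc[[2]]

# Keep the rows whose index label does not appear in sub.
# A boolean mask preserves the original row order.
df_extract = df.loc[~df.index.isin(sub.index)]

print(df_extract)
```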
