Initial df is:
df =
a a a a
1 2 3 4
5 6 7 8
9 1 2 3
Desired output:
df =
b_1 c_1 b_2 c_2
1 2 3 4
5 6 7 8
9 1 2 3
I can do it the long way: select the odd and even columns separately, rename them, and concat. But I'm looking for a quicker solution.
Try this:
df = pd.DataFrame({'a': [], 'b': [], 'c': [], 'd': [], 'e': [], 'f': [], 'g': [], 'h': []})
df.columns = ['b_' + str(i // 2 + 1) if i % 2 == 0 else 'c_' + str(i // 2 + 1)
              for i in range(df.shape[1])]
print(df.columns)
output:
Index(['b_1', 'c_1', 'b_2', 'c_2', 'b_3', 'c_3', 'b_4', 'c_4'], dtype='object')
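Applied to the 4-column frame from the question (reconstructed here, since duplicate column labels are legal in pandas), the same comprehension gives the desired names:

```python
import pandas as pd

# the question's frame, with all four columns named 'a'
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 1, 2, 3]],
                  columns=['a', 'a', 'a', 'a'])

# even positions become b_1, b_2, ...; odd positions become c_1, c_2, ...
df.columns = ['b_' + str(i // 2 + 1) if i % 2 == 0 else 'c_' + str(i // 2 + 1)
              for i in range(df.shape[1])]
print(df.columns.tolist())  # ['b_1', 'c_1', 'b_2', 'c_2']
```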
I have two dataframes with similar columns:
df1 = (a, b, c, d)
df2 = (a, b, c, d)
I want concat or merge some columns of them like below in df3
df3 = (a_1, a_2, b_1, b_2)
How can I put them side by side as they are (without any change), and how can I merge them on a shared key like d? I tried adding them to a list and concatenating, but I don't know how to give them new names. I don't want multi-level column names.
dfs = []
for ii, tdf in enumerate(mydfs):
    tdf = tdf.sort_values(by="fid", ascending=False)
    for _col in ["fid", "pred_text1"]:
        new_col = _col + str(ii)  # the name I want, but I don't know how to apply it
        dfs.append(tdf[_col])
df = pd.concat(dfs, axis=1)
Without a look at your actual dataframes it is hard to be exact, so I am generating sample dataframes to demonstrate how the code works:
import pandas as pd
import re
df1 = pd.DataFrame({"a":[1,2,4], "b":[2,4,5], "c":[5,6,7], "d":[1,2,3]})
df2 = pd.DataFrame({"a":[6,7,5], "b":[3,4,8], "c":[6,3,9], "d":[1,2,3]})
mergedDf = (df1.merge(df2, how="left", on="d")
               .rename(columns=lambda x: re.sub(r"(.+)_x", r"\1_1", x))
               .rename(columns=lambda x: re.sub(r"(.+)_y", r"\1_2", x)))
mergedDf
which results in:
   a_1  b_1  c_1  d  a_2  b_2  c_2
0    1    2    5  1    6    3    6
1    2    4    6  2    7    4    3
2    4    5    7  3    5    8    9
If you are interested in dropping other columns you can use the code below:
mergedDf.iloc[:, ~mergedDf.columns.str.startswith("c")]
which results in:
   a_1  b_1  d  a_2  b_2
0    1    2  1    6    3
1    2    4  2    7    4
2    4    5  3    5    8
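As a side note, merge's suffixes parameter can produce the _1/_2 names directly, without the re.sub renames (same sample frames as above):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 4], "b": [2, 4, 5], "c": [5, 6, 7], "d": [1, 2, 3]})
df2 = pd.DataFrame({"a": [6, 7, 5], "b": [3, 4, 8], "c": [6, 3, 9], "d": [1, 2, 3]})

# suffixes renames the overlapping columns in one step
mergedDf = df1.merge(df2, how="left", on="d", suffixes=("_1", "_2"))
print(mergedDf.columns.tolist())
# ['a_1', 'b_1', 'c_1', 'd', 'a_2', 'b_2', 'c_2']
```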
I have a dataframe with a column Col3. Whenever there is 'X' in Col3, that row should get duplicated and 'X' should be changed to 'Z' in the copy. I did try a few approaches, but nothing worked. Can somebody please explain how to do this?
You can filter the rows first by boolean indexing and set Z in Col3 with DataFrame.assign, join back to the original with concat, sort the index with DataFrame.sort_index using the stable mergesort algorithm, and finally create a default RangeIndex with DataFrame.reset_index and drop=True:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'Col3':list('aXcdXf'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = (pd.concat([df, df[df['Col3'].eq('X')].assign(Col3 = 'Z')])
.sort_index(kind='mergesort')
.reset_index(drop=True))
print (df)
B C Col3 D E F
0 4 7 a 1 5 a
1 5 8 X 3 3 a
2 5 8 Z 3 3 a
3 4 9 c 5 6 a
4 5 4 d 7 9 b
5 5 2 X 1 2 b
6 5 2 Z 1 2 b
7 4 3 f 0 4 b
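An alternative sketch of the same duplication using Index.repeat instead of concat plus sort_index (small hypothetical frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({'B': [4, 5, 4], 'C': [7, 8, 9], 'Col3': ['a', 'X', 'c']})

# repeat each row once, or twice where Col3 == 'X'
out = df.loc[df.index.repeat(df['Col3'].eq('X') + 1)].copy()
# the second copy of each duplicated index label becomes 'Z'
out.loc[out.index.duplicated(), 'Col3'] = 'Z'
out = out.reset_index(drop=True)
print(out['Col3'].tolist())  # ['a', 'X', 'Z', 'c']
```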
I have a pandas dataframe with more than 100 columns.
For example in the following df:
df['A','B','C','D','E','date','G','H','F','I']
How can I move date to be the last column, assuming the dataframe is large and I can't write out all the column names manually?
You can try this:
new_cols = [col for col in df.columns if col != 'date'] + ['date']
df = df[new_cols]
Test data:
import numpy as np
import pandas as pd

cols = ['A','B','C','D','E','date','G','H','F','I']
df = pd.DataFrame([np.arange(len(cols))], columns=cols)
print(df)
# A B C D E date G H F I
# 0 0 1 2 3 4 5 6 7 8 9
Output of the code:
A B C D E G H F I date
0 0 1 2 3 4 6 7 8 9 5
Use pandas.DataFrame.pop and pandas.concat:
print(df)
col1 col2 col3
0 1 11 111
1 2 22 222
2 3 33 333
s = df.pop('col1')
new_df = pd.concat([df, s], axis=1)
print(new_df)
Output:
col2 col3 col1
0 11 111 1
1 22 222 2
2 33 333 3
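A variant of the same pop idea, assuming mutating df in place is acceptable: reassigning the popped column appends it at the end, with no concat needed:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [11, 22, 33], 'col3': [111, 222, 333]})

# pop removes the column and returns it; assignment puts it back at the end
df['col1'] = df.pop('col1')
print(df.columns.tolist())  # ['col2', 'col3', 'col1']
```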
This way:
df_new = df.loc[:, df.columns != 'date'].copy()
df_new['date'] = df['date']
Simple reindexing should do the job: remove 'date' from the column index, append it at the end, and reindex with the result:
original = df.columns
new_cols = original.delete(original.get_loc('date')).append(pd.Index(['date']))
df = df.reindex(columns=new_cols)
You can use reindex and union. Note that union sorts the result, so this only lands 'date' at the end because lowercase names sort after uppercase ones:
df.reindex(df.columns[df.columns != 'date'].union(['date']), axis=1)
Let's work only with the column headers rather than the complete dataframe, then use reindex to reorder the columns.
Output using #QuangHoang setup:
A B C D E F G H I date
0 0 1 2 3 4 8 6 7 9 5
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df, 'date')
Hope that helps.
P.S.: The package can be found here: https://pypi.org/project/movecolumn/
I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
if col begins with a number:
df.drop(col)
I'm not sure what are the best practices when it comes to handling pandas DataFrames, how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest way is to select all columns that do not start with a number, using filter with a regex (^ matches the start of the string and \D matches a non-digit):
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and match digits:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
if col[0].isnumeric():
df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
'1B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D3':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
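The same selection can also be phrased as a drop, which may read more naturally when you think of it as removing columns (sample frame assumed):

```python
import pandas as pd

df = pd.DataFrame({'2A': ['a'], '1B': [4], 'C': [7], 'D3': [1]})

# drop every column whose name starts with a digit
df1 = df.drop(columns=[c for c in df.columns if c[0].isdigit()])
print(df1.columns.tolist())  # ['C', 'D3']
```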
I have 3 files representing the same dataset split in 3 and I need to concatenate:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2')
df3 = pandas.read_csv('path3')
df = pandas.concat([df1,df2,df3])
But this will keep the headers in the middle of the dataset, I need to remove the headers (column names) from the 2nd and 3rd file. How do I do that?
I think you need numpy.concatenate with DataFrame constructor:
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
Another solution is to replace the column names in df2 and df3:
df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)
Samples:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
A B F
0 8 8 3
1 7 7 0
df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
E R T
0 4 2 5
df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
H T R
0 2 2 2
1 1 0 8
2 4 0 9
print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
[7 7 0]
[4 2 5]
[2 2 2]
[1 0 8]
[4 0 9]]
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
You can use the skiprows argument of read_csv for the second and third files. Note that skipping only the header line would make the first data row become the header, so also pass header=None and reuse the column names from the first file:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2', skiprows=1, header=None, names=df1.columns)
df3 = pandas.read_csv('path3', skiprows=1, header=None, names=df1.columns)
df = pandas.concat([df1, df2, df3], ignore_index=True)
Been working on this recently myself; the most compact thing I came up with is to collect the frames in a list and concat them, rebuilding the index:
import pandas as pd
frame_list = [df1, df2, df3]
frame_frame = pd.concat(frame_list, ignore_index=True)
Use:
df = pd.merge(df1, df2, how='outer')
An outer merge keeps rows that appear in either or both of df1 and df2 (a union).
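To illustrate what how='outer' does here (hypothetical frames), indicator=True adds a _merge column showing which side each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [2, 5], 'B': [4, 6]})

# union of rows: left-only, shared, and right-only rows all survive
df = pd.merge(df1, df2, how='outer', indicator=True)
print(df)
```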