How to change dataframe column names without changing the values? [duplicate] - python

This question already has answers here:
Renaming column names in Pandas
(35 answers)
Closed 2 years ago.
I have a bunch of CSV files which are read as dataframes. For each dataframe, I want to change some column names, if a specific column exists in a dataframe:
column_name_update_map = {'aa': 'xx'; 'bb': 'yy'}
In such a map, if 'aa' or 'bb' exists in a dataframe, I want to change the aa to xx, and 'bb' to 'yy'. No values should be changed.
for file in files:
print('Current file: ', file)
df = pd.read_csv(file, sep='\t')
df = df.replace(np.nan, '', regex=True)
for index, row in df.iterrows():
pass
I don't think I should use the inner loop, but if I have to do, what's the right way to change the column name only?

You can use rename in dataframes
column_name_update_map = {'aa': 'xx', 'bb': 'yy'}
df = df.rename(columns=column_name_update_map)

To rename specific columns then follow this code.
Code:
import pandas as pd
import numpy as np
#creating sample dataframe
df=pd.DataFrame({'aa':[1, 2], 'bb':[3, 4], 'c':[5, 6], '':[7, 8]})
#replace columns 'aa' to 'xx', 'bb' to 'yy' and '' to 'NaN'
df.rename(columns={'aa':'xx', 'bb':'yy', '':np.nan}, inplace=True)
#display resulting dataframe
print(df)
I hope it would be helpful.

Related

How to extract a value after colon in all the rows from a pandas dataframe column? [duplicate]

This question already has answers here:
access value from dict stored in python df
(3 answers)
Closed 3 months ago.
Edit: the dummy dataframe is edited
I have a pandas data frame with the below kind of column with 200 rows.
Let's say the name of df is data.
-----------------------------------|
B
-----------------------------------|
{'animal':'cat', 'bird':'peacock'...}
I want to extract the value of animal to a separate column C for all the rows.
I tried the below code but it doesn't work.
data['C'] = data["B"].apply(lambda x: x.split(':')[-2] if ':' in x else x)
Please help.
The dictionary is unpacked with pd.json_normalize
import pandas as pd
data = pd.DataFrame({'B': [{0: {'animal': 'cat', 'bird': 'peacock'}}]})
data['C'] = pd.json_normalize(data['B'])['0.animal']
I'm not totally sure of the structure of your data. Does this look right?
import pandas as pd
import re
df = pd.DataFrame({
"B": ["'animal':'cat'", "'bird':'peacock'"]
})
df["C"] = df.B.apply(lambda x: re.sub(r".*?\:(.*$)", r"\1", x))

Add column containing list value if column contains string in list

I'm trying to scan a particular column in a dataframe, eg df['x'] for values which I have in a separate list list = ['y', 'z', 'a', 'b']. How do I make pandas load a new column with the list value if df['x'] contains any, or more than one of the values from the list?
Thanks!
Use this:
In [720]: import pandas as pd
In [719]: if df['x'].str.contains('|'.join(list)).any():
...: df = pd.concat([df, pd.DataFrame(list)], axis=1))
...:

How to change specific row value in dataframe using pandas? [duplicate]

This question already has answers here:
Set value for particular cell in pandas DataFrame using index
(23 answers)
Closed 2 years ago.
Here I attached my data frame.I am trying to change specific value of row.but I am not getting succeed.Any leads would be appreciated.
df.replace(to_replace ="Agriculture, forestry and fishing ",
value ="Agriculture")
Image of My data frame
Try this:
df['Name'] = df['Name'].str.replace('Agriculture, forestry and fishing', 'Agriculture')
This should work for any data type:
df.loc[df.loc[:, 'Name']=='Agriculture, forestry and fishing', 'Name'] = 'Agriculture'
You can easily get all the columns names with calling: df.columns
then you can copy this list and replace the name of any column and reassign the list to df.columns.
For example:
import pandas as pd
df = pd.DataFrame(data=[[1, 2], [10, 20], [100, 200]], columns=['A', 'B'])
df.columns
the output will be in a jupyter notebook:
Index(['C', 'D'], dtype='object')
so you copy that list and then replace what you want to change and reassign it
df.columns = ['C', 'D']
and then you will get a dataframe with the name of columns changed from A and B to C and D, you check this by calling
df.head()

Appending a column to data frame using Pandas in python

I'm trying some operations on Excel file using pandas. I want to extract some columns from a excel file and add another column to those extracted columns. And want to write all the columns to new excel file. To do this I have to append new column to old columns.
Here is my code-
import pandas as pd
#Reading ExcelFIle
#Work.xlsx is input file
ex_file = 'Work.xlsx'
data = pd.read_excel(ex_file,'Data')
#Create subset of columns by extracting columns D,I,J,AU from the file
data_subset_columns = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
#Compute new column 'Percentage'
#'Num Labels' and 'Num Tracks' are two different columns in given file
data['Percentage'] = data['Num Labels'] / data['Num Tracks']
data1 = data['Percentage']
print data1
#Here I'm trying to append data['Percentage'] to data_subset_columns
Final_data = data_subset_columns.append(data1)
print Final_data
Final_data.to_excel('111.xlsx')
No error is shown. But Final_data is not giving me expected results. ( Data not getting appended)
There is no need to explicitly append columns in pandas. When you calculate a new column, it is included in the dataframe. When you export it to excel, the new column will be included.
Try this, assuming 'Num Labels' and 'Num Tracks' are in "D,I,J,AU" [otherwise add them]:
import pandas as pd
data_subset = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
data_subset['Percentage'] = data_subset['Num Labels'] / data_subset['Num Tracks']
data_subset.to_excel('111.xlsx')
The append function of a dataframe adds rows, not columns to the dataframe. Well, it does add columns if the appended rows have more columns than in the source dataframe.
DataFrame.append(other, ignore_index=False, verify_integrity=False)[source]
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
I think you are looking for something like concat.
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george

Multiple columns with the same name in Pandas

I am creating a dataframe from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names say a.
So after forming dataframe and when I do df['a'] which value will it return? It does not return all values.
Also only one of the values will have a string rest will be None. How can I get that column?
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
by default, all of your 'a' columns get named 'a.0'...'a.N' as specified above.
if you used mangle_dupe_cols=False, importing this csv would produce an error.
you can get all of your columns with
df.filter(like='a')
demonstration
from StringIO import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my genes expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
create a list of the duplicated columns in my dataframe (refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
if list_of_all_columns.count(column) > 1 and not column in duplicated_columns_list:
duplicated_columns_list.append(column)
duplicated_columns_list
Use the function .index() that helps me to find the first element that is duplicated on each iteration and underscore it:
for column in duplicated_columns_list:
list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop helps me to underscore all of the duplicated columns and now every column has a distinct name.
This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.
Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind corresponds to the index of the column according how they are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.

Categories

Resources