Replace method in Pandas - python

How do I replace multiple column names with different values? Ideally I want to remove certain characters in column names and replace others.
I have to run my Jupyter notebook twice in order to get this code to work. Does anyone know the reason for this? Also, how would I go about simplifying this code? (I am aware I can just chain .replace(), but that doesn't solve my problem.) The snippet posted below may not be enough to go off of; please view the following link to my notebook if needed: https://datalore.jetbrains.com/notebook/iBhSV0RbfC66p84tZsnC24/w3Z6tCriPC5v5XwqCDQpWf/
for col in df.columns:
    df.rename(columns={col: col.replace('Deaths - ', '').replace(' - Sex: Both - Age: All Ages (Number)', '')}, inplace=True)
    df.rename(columns={col: col.replace(' (deaths)', '')}, inplace=True)
    df.rename(columns={col: col.replace(' ', '_')}, inplace=True)
for col in df.columns:
    df.rename(columns={col: col.lower()}, inplace=True)

I can't comment due to 'points', but firstly: avoid inplace (see https://github.com/pandas-dev/pandas/issues/16529#issuecomment-443763553).
As for needing two runs: after the first inplace rename inside the loop, col no longer matches the current column name, so the later rename calls in the same iteration are no-ops until the next run.
For the replacement of multiple values in multiple column names you can use:
df = df.rename(columns = { ... }).rename(columns = lambda x: x.lower())
which renames using the given dictionary and then converts everything to lower case.
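A runnable sketch of that pattern, using a made-up column name modeled on the ones in the question:

```python
import pandas as pd

# hypothetical frame with the kind of column names described in the question
df = pd.DataFrame(columns=['Country',
                           'Deaths - Meningitis - Sex: Both - Age: All Ages (Number)'])

# one rename pass with a dict, then a second pass to lower-case everything
df = df.rename(columns={
    'Deaths - Meningitis - Sex: Both - Age: All Ages (Number)': 'Meningitis',
}).rename(columns=lambda x: x.lower())

print(df.columns.tolist())  # ['country', 'meningitis']
```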

Aside from replace, as you had done, here are a few other ways:
# create a list of columns and assign to the DF
cols = ['col1','col2','col3','col4']
df.columns = cols
or
# create a dictionary of current values to the new values
# and update using map
d = {'c1': 'col1', 'c2': 'col2', 'c3': 'col3', 'c4': 'col4'}
df.columns = df.columns.map(d)
(note that Index.map with a dict returns NaN for any column not covered by the dict, so the mapping must be complete)
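A quick runnable version of the map approach (column names are made up):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['c1', 'c2', 'c3', 'c4'])

# the dict must cover every column: Index.map with a dict yields NaN for missing keys
d = {'c1': 'col1', 'c2': 'col2', 'c3': 'col3', 'c4': 'col4'}
df.columns = df.columns.map(d)

print(df.columns.tolist())  # ['col1', 'col2', 'col3', 'col4']
```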

After reviewing the comments, this is the solution I came up with. I would love to be able to simplify it a bit more. However, I now only need to run the program once for it to save the column values.
for col in df.columns:
    df = df.rename(columns={col: col.lower()
                                   .replace('deaths - ', '')
                                   .replace(' - sex: both - age: all ages (number)', '')
                                   .replace(' (deaths)', '')
                                   .replace(' ', '_')
                                   .replace('amnesty_international', '')})
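The loop can likely be dropped entirely by chaining the vectorised `.str` methods on `df.columns`. A sketch, with hypothetical column names built from the substrings in the question:

```python
import pandas as pd

# hypothetical columns matching the patterns described in the question
df = pd.DataFrame(columns=['Deaths - Meningitis - Sex: Both - Age: All Ages (Number)',
                           'Executions (Amnesty International)'])

# one pass over the whole Index, no Python-level loop
df.columns = (df.columns
              .str.lower()
              .str.replace('deaths - ', '', regex=False)
              .str.replace(' - sex: both - age: all ages (number)', '', regex=False)
              .str.replace(' (deaths)', '', regex=False)
              .str.replace(' ', '_', regex=False))

print(df.columns.tolist())  # ['meningitis', 'executions_(amnesty_international)']
```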

Related

How to remove all rows of a dataframe column that contain a question mark instead of occupation

This is my attempt:
df['occupation'] = df['occupation'].str.replace('?', '')
df.dropna(subset=['occupation'], inplace=True)
but it is not working. How do I remove all of the rows of the occupation column (read from a CSV file) that contain a ? rather than an occupation?
If you're reading the csv with pd.read_csv(), you can pass na_values.
# to treat '?' as NaN in all columns:
pd.read_csv(fname, na_values='?')
# to treat '?' as NaN in just the occupation column:
pd.read_csv(fname, na_values={'occupation': '?'})
Then, you can dropna or fillna('') on that column as you see fit.
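A self-contained sketch of that approach, using io.StringIO as a stand-in for the CSV file:

```python
import io
import pandas as pd

# stand-in for the CSV file; '?' marks a missing occupation
csv = io.StringIO("name,occupation\nAda,engineer\nBob,?\nCyd,teacher\n")

# treat '?' as NaN only in the occupation column, then drop those rows
df = pd.read_csv(csv, na_values={'occupation': '?'})
df = df.dropna(subset=['occupation'])

print(df['occupation'].tolist())  # ['engineer', 'teacher']
```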
Clean up the white space and use an 'unselect' filter:
import pandas as pd
bugs = ['grasshopper','cricket','ant','spider']
fruit = ['lemon','komquat','watermelon','apple']
squashed = [' ? ','Yes','No','Eww']
df = pd.DataFrame(list(zip(bugs,fruit,squashed)), columns = ['Bugs','Fruit','Squashed'])
print(df.head())
df = df[df['Squashed'].apply(lambda x: x.strip()) != '?']
print('after stripping white space and after unselect')
print(df.head())
Why
The dataframe method .dropna() won't detect blanks (i.e. '') but will look for NaN, NaT, or None.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
However, using .replace() to set the value to None is awkward here: passing None as the replacement value has historically made .replace() fall back to pad/ffill behaviour rather than substituting None.
Better to clean up the white space (which is the simple case) using a lambda on each entry to apply the string transformation.
You can try this...
df = df[df.occupation != "?"]

Change columns names from string to float

I have a dataframe with many columns; some of them are strings and some are numbers. Right now when I print the column names as a list I get something like this:
df.columns.tolist()
>>>['name','code','date','401.2','405.6','507.3'...]
I would like to get the numerical columns as float numbers and not as strings. I haven't found any way to do that yet; is it possible to do something like this?
My goal in the end is to be able to create a list of only the numerical column names, so if you know another way to separate them while they are still strings, that could also work.
Use custom function with try-except statement:
df = pd.DataFrame(columns=['name','code','date','401.2','405.6','507.3'])

def f(x):
    try:
        return float(x)
    except ValueError:
        return x

df.columns = df.columns.map(f)
print(df.columns.tolist())
['name', 'code', 'date', 401.2, 405.6, 507.3]
Using a list comprehension:
df.columns = [float(col) if col.replace('.', '').isnumeric() else col for col in df.columns]
res = df.columns.to_list()
print(res)
Output:
['name', 'code', 'date', 401.2, 405.6, 507.3]
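Either way, once the labels carry their real types, the asker's end goal (a list of only the numerical column names) is a simple type check. A sketch:

```python
import pandas as pd

df = pd.DataFrame(columns=['name', 'code', 'date', '401.2', '405.6', '507.3'])

# convert numeric-looking names to float, leave the rest alone
df.columns = [float(c) if c.replace('.', '', 1).isnumeric() else c
              for c in df.columns]

# the numeric columns are now exactly the float-typed labels
num_cols = [c for c in df.columns if isinstance(c, float)]
print(num_cols)  # [401.2, 405.6, 507.3]
```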

Selecting portion of column name for panda dataframe

If I have more than 200 columns, each with long names and I want to remove the first part of the names, how do I do that using pandas?
You could loop through them and omit the first n characters:
n = 3
li = []
for col in df.columns:
    li.append(col[n:])
df.columns = li
Or perform any other form of string manipulation; I'm not sure what you mean by "remove the first part".
I'd just use rename (note that it returns a new frame unless you assign it back):
n = 5
df = df.rename(columns=lambda x: x[n:])
The lambda here can be anything: you could also strip further whitespace, and in fact any callable works, so you don't even need a lambda.
Use indexing with str:
N = 5
df.columns = df.columns.str[N:]
If you just want to remove a certain number of characters:
df.rename(columns=lambda col: col[n:])
If you want to selectively remove based on a prefix:
# cols = 'a_A', 'a_B', 'b_A'
df.rename(columns=lambda col: col.split('a_')[1] if 'a_' in col else col)
How complicated your rules are is up to you.
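A runnable sketch contrasting the two rules above, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame(columns=['a_A', 'a_B', 'b_A'])

# rule 1: drop a fixed number of leading characters from every name
sliced = df.rename(columns=lambda col: col[2:])
print(sliced.columns.tolist())   # ['A', 'B', 'A']

# rule 2: drop only a known prefix, leave other names untouched
stripped = df.rename(columns=lambda col: col.split('a_')[1] if 'a_' in col else col)
print(stripped.columns.tolist())  # ['A', 'B', 'b_A']
```

Note that rule 1 can produce duplicate labels (pandas allows them, but they cause surprises later), which is one reason to prefer the prefix-based rule.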

Get rid of columns that do not have headers

I'm trying to read in a CSV file that has columns without headers. Currently, my solution is
df = pd.read_csv("test.csv")
df = df[[col for col in df.columns if 'Unnamed' not in col]]
This seems a little hacky, and would fail if the file contains columns with the word 'Unnamed' in them. Is there a better way to do this?
The usecols argument of the read_csv function accepts a callable function as input. If you provide a function that evaluates to False for your undesired column headers, then these columns are dropped.
func = lambda x: not x.startswith('Unnamed: ')
df = pd.read_csv('test.csv', usecols=func)
I guess that this solution is not really fundamentally different from your original solution though.
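A self-contained demonstration, with io.StringIO standing in for the file (the empty header field becomes 'Unnamed: 2' on read):

```python
import io
import pandas as pd

# a CSV with a stray unlabeled column
csv = io.StringIO("a,b,,c\n1,2,3,4\n")

# usecols accepts a callable: keep only columns whose name is not auto-generated
df = pd.read_csv(csv, usecols=lambda name: not name.startswith('Unnamed'))

print(df.columns.tolist())  # ['a', 'b', 'c']
```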
Maybe you could rename those columns first?
df = pd.read_csv("test.csv")
df.columns = df.columns.str.replace('^Unnamed:.*', '', regex=True)
df = df[[col for col in df.columns if col]]
Still pretty hacky, but at least this replaces only the names which start with "Unnamed:" with '' before filtering them out.

Trying to remove commas and dollars signs with Pandas in Python

Trying to remove the commas and dollar signs from the columns. But when I do, the table prints them out and still has them in there. Is there a different way to remove the commas and dollar signs using a pandas function? I was unable to find anything in the API docs, or maybe I was looking in the wrong place.
import pandas as pd
import pandas_datareader.data as web
players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')
df1 = pd.DataFrame(players[0])
df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')
print (df1.head(10))
You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '')
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace('$', '')
df1['Avg_Annual'] = df1['Avg_Annual'].astype(int)
alternately;
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '').str.replace('$', '').astype(int)
if you want to prioritize time spent typing over readability.
Shamelessly stolen from this answer... but, that answer is only about changing one character and doesn't complete the coolness: since it takes a dictionary, you can replace any number of characters at once, as well as in any number of columns.
# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']
# pass them to df.replace(), specifying each char and its replacement:
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
#shivsn caught that you need to use regex=True; you already knew about replace (but also didn't show trying to use it on multiple columns or both the dollar sign and comma simultaneously).
This answer is simply spelling out the details I found from others in one place for those like me (e.g. noobs to Python and pandas). Hope it's helpful.
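A runnable sketch of the dictionary approach, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Salary': ['$1,000', '$2,500'],
                   'Bonus':  ['$100', '$2,000']})

# one replace() call strips both characters from both columns
cols = ['Salary', 'Bonus']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
df[cols] = df[cols].astype(int)

print(df['Salary'].tolist())  # [1000, 2500]
```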
#bernie's answer is spot on for your problem. Here's my take on the general problem of loading numerical data in pandas.
Often the source of the data is reports generated for direct consumption. Hence the presence of extra formatting like %, thousand's separator, currency symbols etc. All of these are useful for reading but causes problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one then cast it back to appropriate numerical formats. Having a boilerplate function which retains only [0-9.] is tempting but causes problems where the thousand's separator and decimal gets swapped, also in case of scientific notation. Here's my code which I wrap into a function and apply as needed.
df[col] = df[col].astype(str)  # cast to string
# all the string surgery goes in here (note: substring replacement lives under .str)
df[col] = df[col].str.replace('$', '', regex=False)
df[col] = df[col].str.replace(',', '', regex=False)  # assuming ',' is the thousand's separator in your locale
df[col] = df[col].str.replace('%', '', regex=False)
df[col] = df[col].astype(float)  # cast back to appropriate type
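The boilerplate above, wrapped into a function as the answer describes (a sketch; the function name and sample data are made up):

```python
import pandas as pd

def to_number(s: pd.Series) -> pd.Series:
    """Strip report formatting ($, thousands separators, %) and cast to float."""
    s = s.astype(str)
    s = s.str.replace('$', '', regex=False)
    s = s.str.replace(',', '', regex=False)  # assumes ',' is the thousands separator
    s = s.str.replace('%', '', regex=False)
    return s.astype(float)

df = pd.DataFrame({'rev': ['$1,200.50', '$300']})
df['rev'] = to_number(df['rev'])
print(df['rev'].tolist())  # [1200.5, 300.0]
```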
This worked for me. Adding "|" means or:
df['Salary'] = df['Salary'].str.replace(r'\$|,', '', regex=True)
I used this logic
df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
When I got to this problem, this was how I got out of it.
df['Salary'] = df['Salary'].str.replace('$', '', regex=False).astype(float)
