I'm trying to remove the commas and dollar signs from the columns, but when I do, the table still prints them out with the symbols in there. Is there a different way to remove the commas and dollar signs using a pandas function? I was unable to find anything in the API docs, or maybe I was looking in the wrong place.
import pandas as pd
import pandas_datareader.data as web
players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')
df1 = pd.DataFrame(players[0])
df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')
print (df1.head(10))
You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '')
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace('$', '')
df1['Avg_Annual'] = df1['Avg_Annual'].astype(int)
Alternately:
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '').str.replace('$', '').astype(int)
if you want to prioritize time spent typing over readability.
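A hedged variation (not from the original answer): if the column might also contain stray non-numeric entries, pd.to_numeric with errors='coerce' turns them into NaN instead of raising on the cast:
# sketch: anything that still fails to parse becomes NaN instead of raising an error
df1['Avg_Annual'] = pd.to_numeric(
    df1['Avg_Annual'].str.replace('$', '', regex=False).str.replace(',', '', regex=False),
    errors='coerce')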
Shamelessly stolen from this answer... but, that answer is only about changing one character and doesn't complete the coolness: since it takes a dictionary, you can replace any number of characters at once, as well as in any number of columns.
# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']
# pass them to df.replace(), specifying each char and its replacement:
df[cols] = df[cols].replace({'\$': '', ',': ''}, regex=True)
#shivsn caught that you need to use regex=True; you already knew about replace (but also didn't show trying to use it on multiple columns or both the dollar sign and comma simultaneously).
This answer is simply spelling out the details I found from others in one place for those like me (e.g. noobs to python and pandas). Hope it's helpful.
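For illustration, here is a minimal, self-contained sketch of the dictionary-based replace on a toy DataFrame (the column names and values are made up):
# toy data just to show the dictionary replace across several columns at once
toy = pd.DataFrame({'col1': ['$1,000', '$2,500'], 'col2': ['$300', '$4,750']})
toy[['col1', 'col2']] = toy[['col1', 'col2']].replace({r'\$': '', ',': ''}, regex=True)
print(toy)  # values are now '1000', '2500', '300', '4750' (still strings)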
#bernie's answer is spot on for your problem. Here's my take on the general problem of loading numerical data in pandas.
Often the source of the data is reports generated for direct consumption. Hence the presence of extra formatting like %, thousands separators, currency symbols, etc. All of these are useful for reading but cause problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one, then cast it back to the appropriate numerical format. Having a boilerplate function which retains only [0-9.] is tempting, but it causes problems where the thousands separator and decimal are swapped, and also in the case of scientific notation. Here's my code, which I wrap into a function and apply as needed.
df[col] = df[col].astype(str)  # cast to string
# all the string surgery goes in here
df[col] = df[col].str.replace('$', '', regex=False)
df[col] = df[col].str.replace(',', '', regex=False)  # assuming ',' is the thousands separator in your locale
df[col] = df[col].str.replace('%', '', regex=False)
df[col] = df[col].astype(float)  # cast back to the appropriate type
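Wrapped into a reusable function, the same idea might look like this (a sketch; the symbol list and target dtype are assumptions you would adjust per column):
def clean_numeric(series, symbols=('$', ',', '%'), dtype=float):
    # strip each formatting symbol literally, then cast back to a numeric type
    s = series.astype(str)
    for sym in symbols:
        s = s.str.replace(sym, '', regex=False)
    return s.astype(dtype)

df[col] = clean_numeric(df[col])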
This worked for me. Adding "|" means "or":
df['Salary'].str.replace(r'\$|,', '', regex=True)
I used this logic
df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
When I got to this problem, this was how I got out of it.
df['Salary'] = df['Salary'].str.replace("$",'').astype(float)
Related
I have a small dataframe with entries regarding motorsport balance of performance.
I try to get rid of the string after "#"
This is working fine with the code:
for col in df_engine.columns[1:]:
    df_engine[col] = df_engine[col].str.rstrip(r"[\ \# \d.[0-9]+]")
but it is leaving the last column unchanged, and I do not understand why.
The Ferrari column also has a NaN entry as last position, just as additional info.
Can anyone provide some help?
Thank you in advance!
rstrip does not work with regex. As per the documentation:
to_strip : str or None, default None
    Specifying the set of characters to be removed. All combinations of this set of characters will be stripped. If None then whitespaces are removed.
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-9]+]")
'1.76 # 0.88'
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-8]+]") # It's not treated as regex, instead All combinations of characters(`[\ \# \d.[0-8]+]`) stripped
'1.76'
You could use the replace method instead.
for col in df.columns[1:]:
    df[col] = df[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)
What about str.split() ?
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
The function splits a Series into DataFrame columns (when expand=True) using the separator provided.
The following example splits the Series df_engine[col] and produces a DataFrame. The first column of the new DataFrame contains the values preceding the first separator char '#' found in each value:
df_engine[col].str.split('#', expand=True)[0]
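A sketch of applying that inside the loop from the question, assuming you also want to trim the space left in front of the '#':
for col in df_engine.columns[1:]:
    # keep only the text before the first '#', then trim the trailing whitespace
    df_engine[col] = df_engine[col].str.split('#', expand=True)[0].str.strip()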
How do I replace multiple column names with different values? Ideally I want to remove certain characters in column names and replace others.
I have to run my jupyter notebook twice in order to get this code to work. Does anyone know the reason for this? Also, how would I go about simplifying this code (I am aware of just nesting .replace(), however that doesn't solve my problem). The snippet posted below may not be enough to go off of; please view the following link to my notebook if needed: https://datalore.jetbrains.com/notebook/iBhSV0RbfC66p84tZsnC24/w3Z6tCriPC5v5XwqCDQpWf/
for col in df.columns:
    df.rename(columns={col: col.replace('Deaths - ', '').replace(' - Sex: Both - Age: All Ages (Number)', '')}, inplace=True)
    df.rename(columns={col: col.replace(' (deaths)', '')}, inplace=True)
    df.rename(columns={col: col.replace(' ', '_')}, inplace=True)
for col in df.columns:
    df.rename(columns={col: col.lower()}, inplace=True)
I can't comment due to 'points', but firstly: avoid inplace (see https://github.com/pandas-dev/pandas/issues/16529#issuecomment-443763553).
For the replacement of multiple values in multiple columns you can use:
df = df.rename(columns = { ... }).rename(columns = lambda x: x.lower())
This will rename using the given dictionary and then convert everything to lower case.
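For example (the specific column names here are hypothetical, just to show the shape of the call):
df = df.rename(columns={'Deaths - Meningitis - Sex: Both - Age: All Ages (Number)': 'meningitis'}) \
       .rename(columns=lambda x: x.lower().replace(' ', '_'))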
Aside from replace, as you had done, here are a few other ways:
# create a list of columns and assign to the DF
cols = ['col1','col2','col3','col4']
df.columns = cols
or
# create a dictionary of current values to the new values
# and update using map
d = {'c1': 'col1', 'c2': 'col2', 'c3': 'col3', 'c4': 'col4'}
df.columns = df.columns.map(d)
After reviewing the comments, this is the solution I came up with. I would love to be able to simplify it a bit more. However, I only need to run the program once for it to save the column values.
for col in df.columns:
    df = df.rename(columns={col: col.lower()
                            .replace('deaths - ', '')
                            .replace(' - sex: both - age: all ages (number)', '')
                            .replace(' (deaths)', '')
                            .replace(' ', '_')
                            .replace('amnesty_international', '')})
I converted a CSV file to a pandas DataFrame, but found that all the content is str with a pattern like ="content".
I tried using df.replace to substitute '=' and '"'. The code is like:
df.replace("=","", inplace = True)
df.replace('"',"", inplace = True)
However, this code runs without any error messages, yet nothing is replaced in the DataFrame.
Strangely, it works when I use:
df[column] = df[column].str.replace('=','')
df[column] = df[column].str.replace('"','')
Is there any possible way to replace/substitute the equals and double-quote signs using DataFrame methods? I am also curious why the df.replace method isn't working.
Sorry, I can only provide the pic, since the original data and code are in a notebook with the internet and USB functions locked.
Thanks for the help
Because .replace('=', '') requires the cell value to be exactly '=', which is obviously not true in your case.
You may instead use it with regex:
df = pd.DataFrame({'a': ['="abc"', '="bcd"'], 'b': ['="uef"', '="hdd"'], 'c':[1,3]})
df.replace([r'^="', r'"$'], '', regex=True, inplace=True)
print(df)
     a    b  c
0  abc  uef  1
1  bcd  hdd  3
Two regular expressions are used here, with the first taking care of the head and the second the tail.
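The same cleanup can also be written as a single alternation pattern, if you prefer (a stylistic variant, not required):
df.replace(r'^="|"$', '', regex=True, inplace=True)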
I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
however, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression...
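For instance (the words here are made up):
new_df = df[~df["col"].str.contains(r"foo|bar", regex=True)]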
If the above throws a ValueError or TypeError, the reason is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
I was having trouble with the not (~) symbol as well, so here's another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
You can use apply and a lambda:
df[df["col"].apply(lambda x: word not in x)]
Or, if you want to define a more complex rule, you can use and:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
The main answers are already posted above; I am adding a framework to find multiple words and exclude those rows from the DataFrame.
Here, 'word1', 'word2', 'word3', 'word4' is the list of patterns to search for, df is the DataFrame, and column_a is a column name from df.
values_to_remove = ['word1','word2','word3','word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
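One hedged caveat: if any of the words could contain regex metacharacters (., $, +, etc.), escaping them first keeps the match literal:
import re
pattern = '|'.join(re.escape(w) for w in values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]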
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
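For example, either of these avoids the error (a sketch using the same toy frame):
~df["second"].fillna("").str.contains(word)    # treat NaN as an empty string first
~df["second"].str.contains(word, na=False)     # or tell contains how to handle NaN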
To negate your query use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
In addition to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
Somehow '.contains' didn't work for me, but when I tried '.isin' as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), it worked. Adding further: if you want to look at the entire dataframe and remove the rows which have the specific word (or set of words), just use the loop below.
for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]
Just remove the ~ to get the dataframe that contains the word.
To complement the answers above, if someone wants to remove all the rows with strings, one could do:
df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
I'm looking for a solution to remove/turn off the 2 spaces between columns that df.to_string creates automatically.
Example:
from pandas import DataFrame
df = DataFrame()
df = df.append({'a': '12345', 'b': '12345'}, ignore_index=True)
df.to_string(index=False, header=False)
'12345  12345'
For clarity, the result is: '12345..12345' where the dots represent actual spaces.
I already looked through the pandas.set_option and pandas.to_string documentation.
EDIT: The above example is overly simplified. I am working with an existing df that has spaces all over the place, and the output text files are consumed by another blackbox program that relies on character widths for each line. I've already figured out how to reformat the columns with formatters and make sure my columns are not cut off by pandas defaults, so I am 90% there (minus these automatic spaces).
FYI here are some good links on to_string() formatting and data-truncation:
Convert to date using formatters parameter in pandas to_string
https://github.com/pandas-dev/pandas/issues/9784
Appreciate the help!
You can use the pd.Series.str.cat method, which accepts a sep keyword argument. By default sep is set to '' so there is no separation between values. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.cat.html
You can also use pd.Series.str.strip to remove any leading or trailing whitespace from each value. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.strip.html
Here's an example based on what you have:
df = pd.DataFrame({'a': ['12345'], 'b': ['12345']})
df.iloc[0].fillna('').str.strip().str.cat(sep=' ')
Note that fillna('') is required if there are any empty values.
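If you need this for every row rather than just the first one, a row-wise version might look like this (a sketch; sep='' removes the separator entirely):
lines = df.fillna('').apply(lambda row: row.str.strip().str.cat(sep=''), axis=1)
print(lines.tolist())  # ['1234512345'] for the toy frame above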
Even though this post is old, just in case someone else comes across it nowadays like me:
df.to_string(header=False, index=False).strip().replace(' ', '')
I also had the same problem. There is a justify option in to_string() which is supposed to help in this case. But I ended up doing it the old way:
[row['a'] + row['b'] for index, row in df.iterrows()]