I have a data frame where all the columns are supposed to be numbers, but some of them were read in with commas. I know a single column can be fixed with
df['x'] = df['x'].str.replace(',', '')
However, this works only for Series objects, not for an entire data frame. Is there an elegant way to apply it to the whole data frame, since every single entry should be a number?
P.S.: To ensure I can use str.replace, I first converted the data frame to strings with
df = df.astype(str)
So I understand I will have to convert everything back to numeric once the commas are removed.
Numeric columns contain no commas, so converting to strings is not necessary; just use DataFrame.replace with regex=True for substring replacement:
df = df.replace(',','', regex=True)
Or:
df.replace(',','', regex=True, inplace=True)
Finally, convert the string columns to numeric (thanks @anki_91):
c = df.select_dtypes(object).columns
df[c] = df[c].apply(pd.to_numeric,errors='coerce')
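Put together, a minimal runnable sketch of this answer (the column names and data here are made up):

```python
import pandas as pd

# Toy frame where the numbers were read in as strings with thousands separators
df = pd.DataFrame({"x": ["1,200", "3,400"], "y": ["5,600", "7"]})

# Remove every comma, then convert the remaining object columns to numbers
df = df.replace(",", "", regex=True)
c = df.select_dtypes(object).columns
df[c] = df[c].apply(pd.to_numeric, errors="coerce")

print(df["x"].sum())  # 4600
```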
Well, you can simply do:
df = df.apply(lambda x: x.str.replace(',', ''))
Hope it helps!
In case you want to manipulate just one column:
df.column_name = df.column_name.apply(lambda x : x.replace(',',''))
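One caveat for the apply approaches above: .str.replace only works on string columns, so an all-string frame is assumed. A minimal sketch with invented data:

```python
import pandas as pd

# All columns must already be strings for .str.replace to work
df = pd.DataFrame({"a": ["1,000", "2,000"], "b": ["3", "4,500"]})
df = df.apply(lambda x: x.str.replace(",", ""))

# Convert back to numbers afterwards
df = df.apply(pd.to_numeric, errors="coerce")
print(df["a"].sum())  # 3000
```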
I have a pandas dataframe where one column contains sets of strings (each row is a (single) set of strings). However, when I save this dataframe to csv and read it back into a pandas dataframe later, each set of strings in this particular column seems to be saved as a single string. For example, the value in this particular row should be a single set of strings, but it seems to have been read in as a single string:
I need to access this data as a python set of strings, is there a way to turn this back into a set? Or better yet, have pandas read this back in as a set?
You can parse the string back into a set with ast.literal_eval. Note that simply wrapping the string in set() would not work; that produces a set of the individual characters, not the original strings.
import ast
string = "{'+-0-', '0---', '+0+-', '0-0-', '++++', '+++0', '+++-', '+---', '0+++', '0++0', '0+00', '+-+-', '000-', '+00-'}"
new_set = ast.literal_eval(string)
I think you could use a different separator when converting the dataframe to CSV.
import pandas as pd
df = pd.DataFrame(["{'Ramesh','Suresh','Sachin','Venkat'}"],columns=['set'])
print('Old df \n', df)
df.to_csv('mycsv.csv', sep= ';', index=False)
new_df = pd.read_csv('mycsv.csv', sep= ';')
print('New df \n',new_df)
You can use Series.apply, I think:
Let's say your column of sets was called column_of_sets. Assuming you've already read the csv, now do this to convert back to sets.
df['column_of_sets'] = df['column_of_sets'].apply(eval)
I'm taking eval from @Cabara's comment. I think it is the best bet, though ast.literal_eval is safer if the data is untrusted.
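A hedged sketch combining both ideas: ast.literal_eval is a safer stand-in for eval, and the converters parameter of read_csv lets pandas do the parsing while reading. The column name tags and the in-memory buffer are made up for the example:

```python
import ast
import io

import pandas as pd

# Round-trip a column of sets through CSV
df = pd.DataFrame({"tags": [{"a", "b"}, {"c"}]})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# ast.literal_eval parses the "{'a', 'b'}" repr back into a real set,
# without eval's arbitrary-code-execution risk
new_df = pd.read_csv(buf, converters={"tags": ast.literal_eval})
print(type(new_df.loc[0, "tags"]))  # <class 'set'>
```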
Can anyone tell me how can I remove all 'A's and other data like this from the data frame? and I also want to remove XXXX rows from the data frame.
Use Series.str.len with Series.ne to perform boolean indexing.
If you want to delete the rows where name is 'A':
df[df['name'].ne('A') & df['year'].ne('XXXX')]
Or, to detect when the length of the string in column name is greater than one:
df[df['name'].str.len().gt(1) & df['year'].ne('XXXX')]
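A small self-contained sketch of the boolean-indexing approach (data invented to match the question's shape):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "Alice", "B", "Bob"],
                   "year": ["2001", "XXXX", "2002", "2003"]})

# Keep rows whose name is longer than one character and whose year is not 'XXXX'
out = df[df["name"].str.len().gt(1) & df["year"].ne("XXXX")]
print(out["name"].tolist())  # ['Bob']
```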
In order to remove all the lines where in column name you have 1-character long string just do:
df = df.drop(df.index[df["name"].str.len().eq(1)], axis=0)
Similarly for the XXXX rows:
df = df.drop(df.index[df["year"].eq("XXXX")], axis=0)
And combined:
df = df.drop(df.index[df["name"].str.len().eq(1) | df["year"].eq("XXXX")],axis=0)
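A quick sketch of the drop-based version on invented data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "Bob", "C", "Ann"],
                   "year": ["2001", "XXXX", "2002", "2003"]})

# Drop one-character names and 'XXXX' years in a single pass
df = df.drop(df.index[df["name"].str.len().eq(1) | df["year"].eq("XXXX")], axis=0)
print(df["name"].tolist())  # ['Ann']
```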
I have a program that applies pd.groupby().agg('sum') to a bunch of pandas.DataFrame objects, all in the same format. The code works on every dataframe except this one (picture: df1), which produces an odd result (picture: result1).
I tried:
df = df.groupby('Mapping')[list(df)].agg('sum')
This code works fine for other dataframes, such as df2 (pictures: df2, result2), but not for df1.
Could somebody tell me why it turned out that way for df1?
The problem in the first dataframe is the commas in values that should be numeric; I think Python is not recognizing those columns as numeric. Did you try replacing the commas?
It seems that in df1, most of the numeric columns are actually str. You can tell by the commas (,) that delimit thousands. Try:
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.astype(str).str.replace(',', ''))
df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric)
The first line removes the commas from the second, third, etc. columns; note that apply passes each column as a Series, so use the .str accessor (calling str(x) would stringify the whole column at once). The second line converts them to numeric data types. This could actually be a one-liner, but I wrote it in two lines for readability's sake.
Once this is done, you can try your groupby code.
It's good practice to check the data types of your columns as soon as you load them. You can do so with df1.dtypes.
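A sketch of the failure mode and the fix, with made-up data: summing string columns concatenates them instead of adding, so convert before grouping:

```python
import pandas as pd

# Toy version of df1: the numeric column was read in as strings with commas
df = pd.DataFrame({"Mapping": ["x", "x", "y"],
                   "Sales": ["1,000", "2,000", "500"]})

# String columns sum by concatenation ('1,0002,000'), so convert first
df["Sales"] = pd.to_numeric(df["Sales"].str.replace(",", "", regex=False))

result = df.groupby("Mapping")["Sales"].agg("sum")
print(result["x"])  # 3000
```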
I have a pandas df where one of my columns has faulty values that I want to clean.
The Faulty Values are negative and end with <, example '-2.44<'.
How do I fix this without affecting other columns? My index is Date-Time
I have tried to convert the column to numeric data.
df.values = pd.to_numeric(df.values, errors='coerce')
There are no error messages. But, I'd like to replace them with removing '<'.
Use Series.str.rstrip to remove '<' from the right side. Also note that if your column is literally named values, df.values is the DataFrame attribute that returns the underlying NumPy array, so select the column with brackets instead:
df['values'] = pd.to_numeric(df['values'].str.rstrip('<'), errors='coerce')
Or, more generally, use Series.str.strip, which accepts several characters to remove:
df['values'] = pd.to_numeric(df['values'].str.strip('<>'), errors='coerce')
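A minimal sketch, assuming the faulty column looks like the question's example:

```python
import pandas as pd

s = pd.Series(["-2.44<", "1.5", "-3<"])

# Strip the trailing '<', then coerce anything unparsable to NaN
cleaned = pd.to_numeric(s.str.rstrip("<"), errors="coerce")
print(cleaned.tolist())  # [-2.44, 1.5, -3.0]
```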
I have three types of columns in my dataframe: numeric, string and datetime.
I need to add the element | to the end of every value as a separator
I have tried:
df['column'] = (df['column']+ '|')
but it does not work for the datetime columns and I have to add .astype(str) to the numeric columns which may result in formatting issues later.
Any other suggestions?
You can use DataFrame.to_csv() with sep="|" if you want to create a CSV file.
Further documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Not too sure why you would want to do this, but if you want to make a CSV file with | as the delimiter, you can set that with the df.to_csv('out.csv', sep='|') method. Otherwise, I think a cleaner way is to use a lambda with an f-string:
df['column'] = df['column'].apply(lambda x: f"{x}|")
The f-string converts each value to a string for you, so no separate .astype(str) is needed.
This may help you in this case:
df['column'] = df['column'].astype(str) + "|"
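A sketch showing that this works uniformly across numeric, string and datetime columns (column names invented):

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2],
                   "s": ["a", "b"],
                   "d": pd.to_datetime(["2020-01-01", "2020-01-02"])})

# astype(str) handles every dtype, including datetime64
for col in df.columns:
    df[col] = df[col].astype(str) + "|"

print(df.loc[0, "n"])  # '1|'
```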