Beginner question incoming.
I have a dataframe derived from an Excel file with a column that I will call "input".
This column holds floats (e.g. 7.4, 8.1, 2.2, ...). However, there are also some wrong values, such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..".
I would like to clean the column to generate only numeric float values.
I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:
for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\+|#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')
I have also tried the following, but it then replaces every value in the column with None:
for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d+(?:.\d+)$', str(col_input)) is None:
        new_df["input"] = None
How do I get rid of the dots?
Thanks!
You can simply use pandas.to_numeric and pass errors='coerce', without any loop:
from io import StringIO
import pandas as pd
s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""
df = pd.read_csv(StringIO(s))
df['input'] = pd.to_numeric(df['input'], errors='coerce')
# Outputs:
print(df)
input
0 7.4
1 8.1
2 2.2
3 NaN
4 NaN
5 NaN
df.dropna(inplace=True)
print(df)
input
0 7.4
1 8.1
2 2.2
If you need to clean up multiple mixed columns, use:
cols = ['input', ...]  # put the names of the columns concerned here
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)
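Note that errors='coerce' also takes care of the stray "." and ".." values from the question; a minimal sketch with made-up data:
import pandas as pd

# a hypothetical column mixing valid floats with stray '.' and '..' values
s = pd.Series(['7.4', '.', '..', '8.1'])
print(pd.to_numeric(s, errors='coerce'))
# 0    7.4
# 1    NaN
# 2    NaN
# 3    8.1
# dtype: float64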
By default to_csv writes a CSV like
,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
But I want it to write like this:
a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
How do I achieve this? I can't set index=False because I want to preserve the index. I just want to remove the leading comma.
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.to_csv("test.csv") # this results in the first example above.
It is possible: first write only the columns without the index, then append the data without the header:
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'], index=list('XYZ'))
pd.DataFrame(columns=df.columns).to_csv("test.csv", index=False)
#alternative for empty df
#df.iloc[:0].to_csv("test.csv", index=False)
df.to_csv("test.csv", header=None, mode='a')
df = pd.read_csv("test.csv")
print (df)
a b c
X 0.0 0.0 0.0
Y 0.0 0.0 0.0
Z 0.0 0.0 0.0
Alternatively, try resetting the index so it becomes a column in the data frame, named index. This works with multiple indexes as well.
df = df.reset_index()
df.to_csv('output.csv', index = False)
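A quick check of what this produces with the question's frame (a sketch; note the index column is now named 'index' in the header):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)), columns=['a', 'b', 'c'])
df = df.reset_index()
print(df.to_csv(index=False))
# index,a,b,c
# 0,0.0,0.0,0.0
# 1,0.0,0.0,0.0
# 2,0.0,0.0,0.0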
Simply set a name for your index: df.index.name = 'blah'. This name will appear as the first name in the headers.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.index.name = 'my_index'
print(df.to_csv())
yields
my_index,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
However, if (as per your comment) you wish to have 3 comma-separated names in the header while there are 4 comma-separated values in the rows of the CSV, you'll have to handcraft it. It will NOT be compliant with any standard CSV format, though.
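A minimal sketch of such a handcrafted file, reusing the question's frame (the file name is just an example):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)), columns=['a', 'b', 'c'])

# write the 3-name header by hand, then each row prefixed with its index value
with open('test.csv', 'w') as f:
    f.write(','.join(df.columns) + '\n')                      # a,b,c
    for idx, row in df.iterrows():
        f.write(f'{idx},' + ','.join(map(str, row)) + '\n')   # 0,0.0,0.0,0.0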
This is my code:
for col in df:
    if col.startswith('event'):
        df[col].fillna(0, inplace=True)
        df[col] = df[col].map(lambda x: re.sub(r"\D", "", str(x)))
I have event columns event_0 to event_10. When I fill the NaNs with this code, it sets the NaN cells in all event columns to 0, except in event_0, the first column of that selection, which also contains NaNs but is left unchanged.
I made these columns from 'events' column with following code:
event_seperator = lambda x: pd.Series(str(x).strip().split('\n')).add_prefix('event_')
df_events = df['events'].apply(event_seperator)
df = pd.concat([df.drop(columns=['events']), df_events], axis=1)
Please tell me what is wrong. You can see the dataframe before the change in the picture.
I don't know why that happened since I made all those columns the same.
Your data suggests this is precisely what has not been done.
You have a few options depending on what you are trying to achieve.
1. Convert all non-numeric values to 0
Use pd.to_numeric with errors='coerce':
df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
2. Replace either string ('nan') or null (NaN) values with 0
Use pd.Series.replace to turn the string 'nan' into a real NaN, then fill as before:
df[col] = df[col].replace('nan', np.nan).fillna(0)
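To see why the replace step is needed, a small sketch with made-up data: fillna only touches real NaN values, not the string 'nan'.
import numpy as np
import pandas as pd

# a hypothetical column mixing a real NaN with the string 'nan'
s = pd.Series([np.nan, 'nan', '7'])
print(s.fillna(0).tolist())                         # [0, 'nan', '7'] -- the string survives
print(s.replace('nan', np.nan).fillna(0).tolist())  # [0, 0, '7']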
So I'm working with Pandas and I have multiple words (i.e. strings) in one cell, and I need to put every word into a new row while keeping the coordinated data. I've found a method which could help me, but it works with numbers, not strings.
So what method do I need to use?
Simple example of my table:
id  name      method
1   adenosis  mammography, mri
And I need it to be:
id  name      method
1   adenosis  mammography
              mri
Thanks!
UPDATE:
That's what I'm trying to do, following @jezrael's proposal:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)
But I have this type of error:
TypeError: repeat() takes exactly 2 arguments (3 given)
You can use read_excel + split + stack + drop + join + reset_index:
# define the columns that need to be split by ',' and flattened
cols = ['Condition description','Relevant Modality']
# read the Excel file into a dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)
df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
Condition description Relevant Modality
0 0 Fibroadenomas are the most common cause of a b... Mammography
1 NaN US
2 NaN MRI
1 0 Papillomas are benign neoplasms Mammography
1 arising in a duct US
2 either centrally or peripherally within the b... MRI
3 leading to a nipple discharge. As they are of... NaN
4 the discharge may be bloodstained. NaN
2 0 OK Mammography
3 0 breast cancer Mammography
1 NaN US
4 0 breast inflammation Mammography
1 NaN US
# remove the original columns
df = df.drop(cols, axis=1)
# create a MultiIndex in the original df to align the rows
df.index = [df.index, [0] * len(df.index)]
# join the original df to the flattened columns, then drop the MultiIndex
df = df1.join(df).reset_index(drop=True)
#print (df)
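As an aside, newer pandas (0.25+) has DataFrame.explode, which gives a shorter route to the same reshape; a sketch with hypothetical single-row data:
import pandas as pd

df = pd.DataFrame({'id': [1], 'name': ['adenosis'],
                   'method': ['mammography, mri']})
df['method'] = df['method'].str.split(', ')   # turn each cell into a list
print(df.explode('method'))
#    id      name       method
# 0   1  adenosis  mammography
# 0   1  adenosis          mri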
The previous answer is correct; I think you should use the id as the reference.
An easier way could possibly be to just parse the method string into a list:
method_list = method.split(',')
method_list = np.asarray(method_list)
If you have any trouble with indexing when initializing your DataFrame, just set the index explicitly:
pd.DataFrame(data, index=[0, 0])
df.set_index('id')
Passing the list as the value for your method key will automatically create a copy of both the index ('id') and 'name':
id method name
1 mammography adenosis
1 mri adenosis
I hope this helps, all the best
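A minimal sketch of that broadcasting idea, using the example data from the question:
import pandas as pd

# split the comma-separated methods into a list
method_list = 'mammography, mri'.split(', ')

# scalar values ('id', 'name') are broadcast to the length of the list
df = pd.DataFrame({'id': 1, 'name': 'adenosis', 'method': method_list})
print(df)
#    id      name       method
# 0   1  adenosis  mammography
# 1   1  adenosis          mri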
I searched online posts, but what I found was all about how to round only the float columns in a mixed dataframe; my problem is how to round float values inside a string-typed column.
Say my dataframe like this:
pd.DataFrame({'a':[1.1111,2.2222, 'aaaa'], 'b':['bbbb', 2.2222,3.3333], 'c':[3.3333,'cccc', 4.4444]})
Looking for an output like
pd.DataFrame({'a':[1.1,2.2, 'aaaa'], 'b':['bbbb', 2.2,3.3], 'c':[3.3,'cccc', 4.4]})
---- Above is the question itself ----
---- The reason why I need this is below ----
I have 3 CSV files, each with string headers and float values, and with different numbers of rows and columns.
I need to append the 3 into one dataframe, then export it as a new CSV, with the parts separated by an empty row.
My 3 dataframes (One, Two, Three) and the desired combined output are shown in screenshots (not reproduced here).
Please note that the output dataframe contains the headers from all 3 sub-dataframes.
So, when I import them, the first CSV is of course loaded by pd.read_csv with no issue.
Then I used .append(pd.Series([np.NaN])) to create an empty row as the separator.
Then the second CSV is loaded and I used pd.append(); but if I don't pass header=None to read_csv(), the second one will not line up under the first one, because the CSV files have uneven rows and columns.
So, two options:
1. Pass header=None to read_csv(). Then I can't simply use round(), as df = df.round() does not work; I need a way to round only the numeric values in each column. Note also that with header=None, all column dtypes are object (per df.dtypes).
2. Don't pass header=None to read_csv(). Then I can round each dataframe, but I have trouble combining them horizontally with their headers.
Any suggestion?
CSV example:
import pandas as pd
import io
exp = io.StringIO("""
month;abc;cba;fef;sefe;yjy;gtht
100;0.45384534;0.43455;0.56385;0.5353;0.523453;0.53553
200;0.453453;0.453453;0.645396;0.76786;0.36327;0.453659
""")
df = pd.read_csv(exp, sep=";", header=None)
print(df.dtypes)
df = df.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
print(df)
There is a simple way to loop over every single element in a dataframe: applymap. Combined with isinstance, which tests for a specific type, you get the following.
df = pd.DataFrame({'a':[1.1111,2.2222, 'aaaa'], 'b':['bbbb', 2.2222,3.3333], 'c':[3.3333,'cccc', 4.4444]})
df.dtypes
a object
b object
c object
dtype: object
df2 = df.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
This yields the following dataframe:
a b c
0 1.1 bbbb 3.3
1 2.2 2.2 cccc
2 aaaa 3.3 4.4
With the following dtypes unchanged
df2.dtypes
a object
b object
c object
dtype: object
As for the other example in your question, I noticed that even the numbers are saved as strings. There is a method that converts the strings of a Series to floats: pd.to_numeric.
From your exp, I get the following:
df = pd.read_csv(exp, sep=";", header=None)
df2 = df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1)
df3 = df2.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
I have a dataframe with certain columns and rows, and I need to add a prefix to the values in one of its columns if they meet a certain condition:
df = pd.DataFrame({'col':['a',0,2,3,5],'col2':['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
And I need to add a prefix to df.col2 based on the values in the Samples dataframe, so I tried it with np.where as follows:
df['col2'] = np.where(df.col2.isin(samples.Sam), 'Yes' + df.col2, 'Non_' + df.col2)
This doesn't return what I am asking for; instead it throws an error:
TypeError: can only perform ops with scalar values
In the end, the dataframe should look like this:
>>>df.head()
col col2
a Yes_PFD_1
0 no_PFD_2
2 no_PFD_3
3 no_PFD_4
5 Yes_PFD_5
Your code worked fine for me once I changed the capitalization of 'samples':
import pandas as pd
import numpy as np
df = pd.DataFrame({'col':['a',0,2,3,5],'col2': ['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
df['col2']
Outputs:
0 YesPFD_1
1 Non_PFD_2
2 Non_PFD_3
3 Non_PFD_4
4 YesPFD_5
Name: col2, dtype: object
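Note that the expected output in the question has an underscore after 'Yes' (and 'no_' rather than 'Non_'); if that is the format you want, just adjust the prefix strings, e.g. (a sketch rebuilding df, since the snippet above overwrote col2):
df = pd.DataFrame({'col': ['a', 0, 2, 3, 5],
                   'col2': ['PFD_1', 'PFD_2', 'PFD_3', 'PFD_4', 'PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam), 'Yes_' + df.col2, 'Non_' + df.col2)
print(df['col2'].tolist())
# ['Yes_PFD_1', 'Non_PFD_2', 'Non_PFD_3', 'Non_PFD_4', 'Yes_PFD_5']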