I have a pandas df where one of my columns has faulty values, and I want to clean them up.
The faulty values are negative and end with <, for example '-2.44<'.
How do I fix this without affecting other columns? My index is Date-Time.
I have tried to convert the column to numeric data:
df['values'] = pd.to_numeric(df['values'], errors='coerce')
There are no error messages, but this coerces the faulty values to NaN; I'd rather fix them by removing the '<'.
Use Series.str.rstrip to remove < from the right side:
df['values'] = pd.to_numeric(df['values'].str.rstrip('<'), errors='coerce')
Or, more generally, use Series.str.strip, where you can list additional characters to remove:
df['values'] = pd.to_numeric(df['values'].str.strip('<>'), errors='coerce')
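A minimal end-to-end sketch of that fix (the column name 'values' and the sample rows are made up to mirror the question):

import pandas as pd

# Hypothetical data: a Date-Time index and a column where some entries end with '<'
df = pd.DataFrame({'values': ['1.50', '-2.44<', '3.00']},
                  index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

# Strip the trailing '<' first, then convert; clean values pass through unchanged
df['values'] = pd.to_numeric(df['values'].str.rstrip('<'), errors='coerce')
print(df['values'])  # -2.44 is kept instead of being coerced to NaN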
I have a very large dataframe where I only want to change the values in a small, contiguous subset of columns. In those columns the values are either integers or null. All I want is to replace the 0s and nulls with 'No' and everything else with 'Yes', only in those columns.
In R, this can be done basically with a one-liner:
df <- df %>%
  mutate_at(vars(MCI:BNP), ~factor(case_when(. > 0 ~ 'Yes',
                                             TRUE ~ 'No')))
But we're working in Python, and I can't quite figure out the equivalent using pandas. I've been messing around with loc and iloc, which work fine when changing a single column, but I must be missing something when it comes to modifying multiple columns. The answers I've found on other Stack Overflow questions have mostly been about changing the value in a single column based on some set of conditions.
col1 = df.columns.get_loc("MCI")
col2 = df.columns.get_loc("BNP")
df.iloc[:, col1:col2+1]  # iloc's end is exclusive, so +1 to include BNP
will get me the columns I want, but trying to call loc doesn't work with multidimensional keys. I even tried it with the columns as a list instead of by integer index, by creating an extra variable
binary_var = ['MCI','PVD','CVA','DEMENTIA','CPD','RD','PUD','MLD','DWOC','DWC','HoP','RND','MALIGNANCY','SLD','MST','HIV','AKF',
'ARMD','ASPHY','DEP','DWLK','DRUGA','DUOULC','FALL','FECAL','FLDELEX','FRAIL','GASTRICULC','GASTROULC','GLAU','HYPERKAL',
'HYPTEN','HYPOKAL','HYPOTHYR','HYPOXE','IMMUNOS','ISCHRT','LIPIDMETA','LOSWIGT','LOWBAK','MALNUT','OSTEO','PARKIN',
'PNEUM','RF','SEIZ','SD','TUML','UI','VI','MENTAL','FUROSEMIDE','METOPROLOL','ASPIRIN','OMEPRAZOLE','LISINOPRIL','DIGOXIN',
'ALDOSTERONE_ANTAGONIST','ACE_INHIBITOR','ANGIOTENSIN_RECEPTOR_BLOCKERS','BETA_BLOCKERSDIURETICHoP','BUN','CREATININE',
'SODIUM','POTASSIUM','HEMOGLOBIN','WBC_COUNT','CHLORIDE','ALBUMIN','TROPONIN','BNP']
df.loc[df[binary_var] == 0, binary_var]
But then it just can't find the index for those column names at all. I think pandas also has problems converting columns that were originally integers into No/Yes. I don't need to do this in place; I'm probably just missing something simple that pandas has built in, hopefully.
In a very pseudo-code description, all I really want is this:
if (df.iloc[:, col1:col2] == 0 or df.iloc[:, col1:col2].isnull()):
    df ONLY in that subset of columns = 'No'
else:
    df ONLY in that subset of columns = 'Yes'
Use:
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
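This works because NaN > 0 evaluates to False, so nulls automatically fall into the 'No' branch. A self-contained sketch (the three columns here stand in for the full MCI:BNP range):

import numpy as np
import pandas as pd

# Hypothetical frame: integer-or-null columns between MCI and BNP
df = pd.DataFrame({'MCI': [1, 0, None], 'PVD': [0, 2, 1], 'BNP': [None, 3, 0]})

# loc's label slice is inclusive on both ends, unlike iloc
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
print(df)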
I have a large number of time series, with blanks on certain dates for some of them. I read them with xlwings from an Excel sheet:
Y0 = xw.Range('SomeRangeinXLsheet').options(pd.DataFrame, index=True, header=3).value
I'm trying to create a filter to run regressions on those series, so I have to take out the void dates. If I do:
print(Y0.iloc[:,[i]]==Y0.iloc[:,[i]])
I get a proper series of true/false for my column number i, fine.
I'm then stuck: I can't find a way to filter the whole df with the True/False values for that column, or even just to extract that clean series as a pd.Series. I need them one by one, so that I can align the dates of my independent variables with those of each of these series separately.
Thank you for your help.
I believe you want to use df.dropna()
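A minimal sketch of that per column, since the series are needed one by one (Y0 and the column index i are the question's names; the sample data is made up):

import pandas as pd

# Hypothetical series with blanks on some dates
Y0 = pd.DataFrame({'A': [1.0, None, 3.0], 'B': [None, 2.0, 4.0]},
                  index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))
X = pd.DataFrame({'x1': [0.1, 0.2, 0.3]}, index=Y0.index)  # hypothetical regressors

i = 0
y = Y0.iloc[:, i].dropna()  # clean pd.Series for column i
X_i = X.loc[y.index]        # regressors restricted to the surviving dates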
I am not sure if I understood your problem, but if you want to check for nulls in a specific column and drop those rows, you can try this:
import pandas as pd
df = df[pd.notnull(df['column_name'])]
For deleting NaNs, df.dropna() should work, as suggested in the previous answer. If it is not working, you can try replacing the NaNs with a placeholder value and then deleting the rows that contain that placeholder:
import numpy as np
df['column_name'] = df['column_name'].replace(np.nan, 'delete-it')
df = df[df['column_name'] != 'delete-it']
Hope this helps!
How do I convert a DataFrame column containing number strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but I haven't had any success.
What do you recommend?
If I correctly understand, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to treat - as NaN:
m = pd.read_csv(..., thousands='.', na_values='-')
The real problem is the parentheses around negative values.
You could use a Python function to convert the values. A possible alternative is to post-process the DataFrame column-wise:
import numpy as np
m = pd.read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        # drop the thousands dots, rewrite (x) as -x, then cast to float
        m[col] = m[col].str.replace(r'\.', '', regex=True).str.replace(r'\((.*)\)', r'-\1', regex=True).astype('float64')
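As a sketch of the "Python function" route via read_csv's converters parameter (the helper name and the inline sample CSV are hypothetical):

import pandas as pd
from io import StringIO

def parse_val(s):
    # hypothetical helper: '7.629.352' -> 7629352.0, '(21.808.956)' -> -21808956.0, '-' -> NaN
    s = s.strip()
    if s == '-':
        return float('nan')
    neg = s.startswith('(') and s.endswith(')')
    s = s.strip('()').replace('.', '')
    return -float(s) if neg else float(s)

csv = StringIO('a\n7.629.352\n(21.808.956)\n-\n')
m = pd.read_csv(csv, converters={'a': parse_val})
print(m)  # 7629352.0, -21808956.0, NaN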
I have a data frame where all the columns are supposed to be numbers. While reading it, some of them were read with commas. I know a single column can be fixed by
df['x']=df['x'].str.replace(',','')
However, this works only for Series objects, not for an entire DataFrame. Is there an elegant way to apply it to the entire data frame, since every single entry in it should be a number?
P.S.: To ensure I can use str.replace, I first converted the data frame to strings with
df = df.astype(str)
So I understand I will then have to convert everything back to numeric once the commas are removed.
Numeric columns contain no commas, so converting to strings is not necessary; just use DataFrame.replace with regex=True for substring replacement:
df = df.replace(',','', regex=True)
Or:
df.replace(',','', regex=True, inplace=True)
And last, convert the string columns to numeric (thank you @anki_91):
c = df.select_dtypes(object).columns
df[c] = df[c].apply(pd.to_numeric,errors='coerce')
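A compact end-to-end sketch of that answer (the two columns are made up):

import pandas as pd

# Hypothetical frame: one column read in with commas, one already numeric
df = pd.DataFrame({'x': ['1,234', '5,678'], 'y': [1.0, 2.0]})

df = df.replace(',', '', regex=True)                  # substring replacement
c = df.select_dtypes(object).columns                  # string columns only
df[c] = df[c].apply(pd.to_numeric, errors='coerce')   # back to numbers
print(df.dtypes)  # x and y both numeric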
Well, you can simply do:
df = df.apply(lambda x: x.str.replace(',', ''))  # assumes every column holds strings
Hope it helps!
In case you want to manipulate just one column:
df['column_name'] = df['column_name'].apply(lambda x: x.replace(',', ''))
I have been trying to convert values with commas in a pandas DataFrame to floats, with little success. I also tried .replace(',', ''), but it doesn't work. How can I change the Close_y column to float and the Date column to date values so that I can plot them? Any help would be appreciated.
Convert 'Date' using to_datetime; for the other column, use str.replace(',', '.') and then cast the type:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Close_y'] = df['Close_y'].str.replace(',','.').astype(float)
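A quick sketch with made-up rows showing both conversions (this assumes the commas in Close_y are decimal separators):

import pandas as pd

# Hypothetical data in the question's shape: US-style dates, decimal commas
df = pd.DataFrame({'Date': ['01/02/2020', '01/03/2020'],
                   'Close_y': ['12,34', '56,78']})

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Close_y'] = df['Close_y'].str.replace(',', '.').astype(float)
print(df.dtypes)  # Date: datetime64[ns], Close_y: float64 -- ready to plot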
DataFrame.replace looks for exact matches of the whole cell value (unless you pass regex=True); what you're trying to do is replace a substring inside each string, which is what Series.str.replace does.
pandas.read_clipboard implements the same kwargs as pandas.read_table, which has options for the thousands and parse_dates kwargs.
Try loading your data with:
df = pd.read_clipboard(thousands=',', parse_dates=[0])
This assumes the Date column is at index 0. If you have a large amount of data, you may also try the infer_datetime_format kwarg to speed things up.
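Since read_csv takes the same kwargs, here is a reproducible sketch (the inline text stands in for the clipboard contents, where the commas are thousands separators):

import pandas as pd
from io import StringIO

# Stand-in for what would be on the clipboard
data = StringIO('Date,Close_y\n01/02/2020,"1,234"\n01/03/2020,"5,678"\n')
df = pd.read_csv(data, thousands=',', parse_dates=[0])
print(df.dtypes)  # Date parsed as datetime, Close_y as an integer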