Pandas -- Replace dirty strings with int - python

I am trying to do some machine learning practice, but the ID column of my dataframe is giving me trouble. I have this:
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
I want this:
0 001002
1 001003
2 001005
3 001006
4 001008
My idea is to use a replace function, ID.replace('[LP]', '', inplace=True), but this doesn't actually change the series. Anyone know a good way to convert this column?

You can use replace
df
Out[656]:
Val
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
df.Val.replace({'LP':''},regex=True)
Out[657]:
0 001002
1 001003
2 001005
3 001006
4 001008
Name: Val, dtype: object
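If you also want the change to stick and end up with integers for modelling, here is a short sketch (not part of the answer above; it assumes the column is called ID and that the leading zeros don't matter): assign the result back and cast.
import pandas as pd

df = pd.DataFrame({'ID': ['LP001002', 'LP001003', 'LP001005']})
# replace() returns a new Series, so assign it back to the column.
df['ID'] = df['ID'].replace({'LP': ''}, regex=True)
# Casting to int drops the leading zeros ('001002' becomes 1002).
df['ID'] = df['ID'].astype(int)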

Here's something that will work for the example as given:
import pandas as pd
df = pd.DataFrame({'colname': ['LP001002', 'LP001003']})
# Slice off the 0th and 1st character of the string
df['colname'] = [x[2:] for x in df['colname']]
If this is your index, you can first copy it into a column with df['my_index'] = df.index and then apply the same slicing.
In general, you might consider using something like the LabelEncoder from scikit-learn to convert non-numeric elements to numeric ones.
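For reference, a minimal sketch of that label-encoder idea (it assumes scikit-learn is installed; the column name is just the one used above):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'colname': ['LP001002', 'LP001003', 'LP001005']})
le = LabelEncoder()
# Each distinct ID is mapped to an arbitrary integer code (0, 1, 2, ...).
df['colname_encoded'] = le.fit_transform(df['colname'])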

Related

How can I drop a column if the last row is NaN

I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping the column if the last row is NaN. The reason for this is that I'm using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want data whose most recent value is NaN, as it means the series is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
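A self-contained version of that call (a sketch built on the sample frame from the question): with axis=1, subset lists the row labels to check, so passing the last index label drops any column whose final value is NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
# Only the last row is inspected; B is dropped because its last value is NaN.
cleaned = df.dropna(axis=1, subset=[df.index[-1]], how="any")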
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A":[np.nan, 1,"x",4],
"B":["t",2,"y",np.nan],
"C":["x",3,"z",6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean Series to select the column to drop
df.drop(df.loc[:,df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
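An equivalent spelling (a sketch, not from the answer above) collects the offending column labels first and drops them by name:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
# Labels of the columns whose last value is NaN, dropped by name.
result = df.drop(columns=df.columns[df.iloc[-1].isna()])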
for i in range(temp_df.shape[1] - 1, -1, -1):
    if temp_df.iloc[-1, i] == 'nan':
        temp_df = temp_df.drop(temp_df.columns[i], axis=1)
This will work for you.
Basically what I'm doing here is looping over all columns in reverse (so that dropping a column doesn't shift the positions still to be checked) and checking if the last entry is the string 'nan', then dropping that column. Note that the comparison with 'nan' only matches missing values that were read in as text.
temp_df.shape[1]
is the number of columns.
temp_df.drop(temp_df.columns[i], axis=1)
temp_df.columns[i] is the label of the i-th column and axis=1 says you want to drop a column rather than a row.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that you can compare against anything you wish.
If the last row holds real NaN values rather than the text 'nan', use pandas' pd.isnull(), which works like this:
for i in range(temp_df.shape[1] - 1, -1, -1):
    if pd.isnull(temp_df.iloc[-1, i]):
        temp_df = temp_df.drop(temp_df.columns[i], axis=1)

How to get part of a string in a dataframe column?

I have a dataframe like this.
print(df)
 ID ... Control
0 PDF-1 ... NaN
1 PDF-3 ... NaN
2 PDF-4 ... NaN
I want to get only the number part of the ID column. So the result will be:
1
3
4
How can I get just part of each string in the dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
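If you then want the result kept in the frame as numbers, a small sketch (not part of the answer above; regex=False just makes it a plain-text replacement):
import pandas as pd

df = pd.DataFrame({'ID': ['PDF-1', 'PDF-3', 'PDF-4']})
# Strip the prefix and convert to integers so the column can be used numerically.
df['ID_num'] = df['ID'].str.replace('PDF-', '', regex=False).astype(int)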
Could you please try the following.
df['ID'].replace(regex=True,to_replace=r'([^\d])',value=r'')
You can refer to the documentation for df.replace.
Basically this uses a regex to remove everything apart from digits in the column named ID: \d denotes a digit, and [^\d] matches everything that is not a digit.
Another possibility using Regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
0
0 1
1 2
2 3
3 4
4 5
ID
0 PDF-1
1 PDF-2
2 PDF-3
3 PDF-4
4 PDF-5
Find "PDF-" ,and replace it with nothing
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, I'd convert the column to a string with no index.
print(df['ID'].to_string(index=False))

Parse out the last sequence of numbers from a pandas column to create a new column

I have a dataframe with codes like the following and would like to create a new column that has the last sequence of numbers parsed out.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of the string and pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
If you want the value as a string of digits, you can use str.extract or str.findall as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
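Going back to the pd.to_numeric version above: if you would rather not end up with floats (a sketch, assuming a pandas version with nullable integer support), cast the result to the nullable Int64 dtype so the missing entry stays as <NA> while the rest remain integers.
import pandas as pd

df = pd.DataFrame({'codes': ['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL']})
# Extract trailing digits, coerce to numbers, then keep them as nullable integers.
out = pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce').astype('Int64')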
import re
import pandas as pd

def get_trailing_digits(s):
    match = re.search("[0-9]+$", s)
    return match.group(0) if match else None

original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a series/data frame with get_trailing_digits as the function.
eg.
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)

Replacing categorical strings with ints in pandas

I have a series containing data like
0 a
1 ab
2 b
3 a
And I want to replace any row containing 'b' with 1, and all others with 0. I've tried:
one = labels.str.contains('b')
zero = ~labels.str.contains('b')
labels.ix[one] = 1
labels.ix[zero] = 0
And this does the trick but it gives this pesky warning
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
I know I've seen this warning the last few times I've used pandas. Could you please give the recommended approach? My method gives the desired result, but what should I do instead? Also, I think Python is supposed to be an 'if it makes logical sense and you type it, it will run' kind of language, but my solution seems perfectly logical in the human-readable sense, so it feels very non-pythonic that it raises a warning.
Try this:
ds = pd.Series(['a','ab','b','a'])
ds
0 a
1 ab
2 b
3 a
dtype: object
ds.apply(lambda x: 1 if 'b' in x else 0)
0 0
1 1
2 1
3 0
dtype: int64
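Another common spelling of the same idea (a sketch, not taken from the answers here) is to cast the boolean mask from str.contains directly to integers:
import pandas as pd

labels = pd.Series(['a', 'ab', 'b', 'a'])
# True/False becomes 1/0 when cast to int.
result = labels.str.contains('b').astype(int)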
You can use numpy.where. The output is a numpy.ndarray, so you have to use the Series constructor:
import pandas as pd
import numpy as np
ser = pd.Series(['a','ab','b','a'])
print(ser)
0 a
1 ab
2 b
3 a
dtype: object
print(np.where(ser.str.contains('b'), 1, 0))
[0 1 1 0]
print(type(np.where(ser.str.contains('b'), 1, 0)))
<class 'numpy.ndarray'>
print(pd.Series(np.where(ser.str.contains('b'), 1, 0), index=ser.index))
0 0
1 1
2 1
3 0
dtype: int32
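As for the warning itself (a sketch, not from the answers above): it usually means labels was produced by slicing another DataFrame. Taking an explicit .copy() and assigning through .loc avoids the ambiguity:
import pandas as pd

labels = pd.Series(['a', 'ab', 'b', 'a'])
mask = labels.str.contains('b')
out = labels.copy()          # work on an explicit copy, not a view of a slice
out.loc[mask] = 1
out.loc[~mask] = 0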

How can I delete portion of data in python pandas dataframe imported from csv?

I am trying to take the days out of the column Days_To_Maturity. So instead of Days 0, it will just be 0. I have tried a few things but am wondering if there is an easy way to do this built into python. Thanks
In[12]:
from pandas import *
XYZ = read_csv('XYZ')
df_XYZ = DataFrame(XYZ)
df_XYZ.head()
Out[12]:
Dates Days_To_Maturity Yield
0 5/1/2002 Days 0 0.00
1 5/1/2002 Days 1 0.06
2 5/1/2002 Days 2 0.12
3 5/1/2002 Days 3 0.18
4 5/1/2002 Days 4 0.23
5 rows × 3 columns
You can explore the possibility of using the .str methods: you can extract the numbers using a regex, take a slice with .str.slice, or, like in this example, replace 'Days ' with an empty string:
In [109]:
df.Days_To_Maturity.str.replace('Days ','').astype(int)
Out[109]:
0 0
1 1
2 2
3 3
4 4
Name: Days_To_Maturity, dtype: int32
I think the solution you are looking for is in the "converters" option of the read_csv function of pandas. From help(pandas.read_csv):
converters: dict. optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
So instead of read_csv('XYZ') you would make a custom converter:
myconverter = { 'Days_To_Maturity': lambda x: x.split(' ')[1] }
read_csv('XYZ', converters=myconverter)
This should work. Please let me know if it helps!
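A self-contained version of that idea (a sketch; the in-memory CSV just mirrors the sample shown above):
from io import StringIO
import pandas as pd

csv_data = StringIO(
    "Dates,Days_To_Maturity,Yield\n"
    "5/1/2002,Days 0,0.00\n"
    "5/1/2002,Days 1,0.06\n"
)
# The converter runs on each raw string value of the column as it is read.
myconverter = {'Days_To_Maturity': lambda x: int(x.split(' ')[1])}
df = pd.read_csv(csv_data, converters=myconverter)
print(df)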
