Left justify pandas string column with pattern - python

I have a large pandas dataset with a messy string column which contains, for example:
72.1
61
25.73.20
33.12
I'd like to fill the gaps so that each value matches a pattern like XX.XX.XX (where each X is a digit):
72.10.00
61.00.00
25.73.20
33.12.00
thank you!

How about defining base_str = '00.00.00' and then padding each value with the remainder of base_str:
import pandas as pd

base_str = '00.00.00'
df = pd.DataFrame({'ms_str': ['72.1', '61', '25.73.20', '33.12']})
print(df)
df['ms_str'] = df['ms_str'].apply(lambda x: x + base_str[len(x):])
print(df)
Output:
     ms_str
0      72.1
1        61
2  25.73.20
3     33.12
     ms_str
0  72.10.00
1  61.00.00
2  25.73.20
3  33.12.00

Here is a vectorized solution that works for this particular pattern. First fill with zeros on the right, then replace every third character with a dot:
df['col'].str.ljust(8, fillchar='0').str.replace(r'(..).', r'\1.', regex=True)
Output:
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Name: col, dtype: object
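Applied end to end to the sample data (reusing the ms_str column name from the first answer; the line above uses a generic col), a minimal runnable sketch:
import pandas as pd

df = pd.DataFrame({'ms_str': ['72.1', '61', '25.73.20', '33.12']})
df['ms_str'] = (df['ms_str']
                .str.ljust(8, fillchar='0')
                .str.replace(r'(..).', r'\1.', regex=True))
print(df)
#      ms_str
# 0  72.10.00
# 1  61.00.00
# 2  25.73.20
# 3  33.12.00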

Related

Replacing Special Symbols in Pandas Dataframe

Can anyone help me remove the extra characters from this dataframe column? Should I use the string replace method?
For example,
`$7.99` -> `7.99`
`$16.99` -> `16.99`
`$0.99` -> `0.99`
You can remove the first character with .str[1:]:
df["Column1"] = df["Column1"].str[1:].astype(float)
print(df)
Prints:
Column1
0 7.99
1 16.99
2 0.99
Dataframe used:
Column1
0 $7.99
1 $16.99
2 $0.99
IIUC: consider the following dataframe:
df = pd.DataFrame(['$7.99', '$3.45', '$56.99'])
you can use str.replace to do this:
df[0].str.replace('$', '', regex=False)
Output:
0 7.99
1 3.45
2 56.99
Name: 0, dtype: object
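If the goal is numeric values rather than cleaned strings, astype(float) can be chained on afterwards; a minimal sketch assuming the same single-column frame as above:
import pandas as pd

df = pd.DataFrame(['$7.99', '$3.45', '$56.99'])
# strip the dollar sign, then convert the remaining text to float
df[0] = df[0].str.replace('$', '', regex=False).astype(float)
print(df)
#        0
# 0   7.99
# 1   3.45
# 2  56.99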

Parse Out Last Sequence Of Numbers From Pandas Column to create new column

I have a dataframe with codes like the following and would like to create a new column that has the last sequence of numbers parsed out.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of the string, then pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
If you want the value as a string of digits, you may use str.extract or str.findall as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
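If you would rather end up with integers plus a missing value instead of floats, one option (assuming pandas 0.24 or later, which introduced the nullable Int64 dtype) is:
# extract the trailing digits, coerce to numbers, then use the nullable integer dtype
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce').astype('Int64')
# 0       2
# 1       2
# 2       2
# 3      12
# 4    <NA>
# Name: 0, dtype: Int64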
import re
import pandas as pd

def get_trailing_digits(s):
    match = re.search("[0-9]+$", s)
    return match.group(0) if match else None

original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a Series/DataFrame with get_trailing_digits as the function, e.g.:
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)

How to standardize strings between rows in a pandas DataFrame?

I have the following pandas DataFrame in Python3.x:
import pandas as pd
dict1 = {
'ID':['first', 'second', 'third', 'fourth', 'fifth'],
'pattern':['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE']
}
df = pd.DataFrame(dict1)
>>> df
ID pattern
0 first AAABCDEE
1 second ABBBBD
2 third CCCDE
3 fourth AA
4 fifth ABCDE
There are two columns, ID and pattern. The longest string in pattern is in the first row, 'AAABCDEE', which has length 8.
My goal is to standardize the strings so that they are all the same length, padded on the right with ? characters.
Here is what the output should look like:
>>> df
ID pattern
0 first AAABCDEE
1 second ABBBBD??
2 third CCCDE???
3 fourth AA??????
4 fifth ABCDE???
If I was able to make the trailing spaces NaN, then I could try something like:
df = df.applymap(lambda x: int(x) if pd.notnull(x) else str("?"))
But I'm not sure how one would efficiently (1) find the longest string in pattern and (2) then add NaN at the end of the strings up to this length? This may be a convoluted approach...
You can use Series.str.ljust for this, after acquiring the max string length in the column.
df.pattern.str.ljust(df.pattern.str.len().max(), '?')
# 0 AAABCDEE
# 1 ABBBBD??
# 2 CCCDE???
# 3 AA??????
# 4 ABCDE???
# Name: pattern, dtype: object
In the pandas 0.22.0 source it can be seen that ljust is entirely equivalent to pad with side='right', so pick whichever you find clearer.
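A quick sanity check on the sample frame that the two calls agree (just a verification sketch, not part of the original answer):
width = df.pattern.str.len().max()
print(df.pattern.str.ljust(width, '?').equals(
    df.pattern.str.pad(width, side='right', fillchar='?')))  # True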
You can use str.pad:
df.pattern.str.pad(width=df.pattern.str.len().max(),side='right',fillchar='?')
Out[1154]:
0 AAABCDEE
1 ABBBBD??
2 CCCDE???
3 AA??????
4 ABCDE???
Name: pattern, dtype: object
Python 3.6 f-string
n = df.pattern.str.len().max()
df.assign(pattern=[f'{i:?<{n}s}' for i in df.pattern])
ID pattern
0 first AAABCDEE
1 second ABBBBD??
2 third CCCDE???
3 fourth AA??????
4 fifth ABCDE???

How to concatenate all (string) values in a given pandas dataframe row to one string?

I have a pandas dataframe that looks like this:
     0     1   2     3        4
0    I  want  to  join  strings
1  But  only  in   row        1
The desired output should look like this:
     0     1   2    3  4                       5
1  But  only  in  row  1  I want to join strings
How to concatenate those strings to a joint string?
IIUC, by using apply with join:
df.apply(lambda x :' '.join(x.astype(str)),1)
Out[348]:
0 I want to join strings
1 But only in row 1
dtype: object
Then you can assign it:
df1=df.iloc[1:]
df1['5']=df.apply(lambda x :' '.join(x.astype(str)),1)[0]
df1
Out[361]:
     0     1   2    3  4                       5
1  But  only  in  row  1  I want to join strings
For timing:
%timeit df.apply(lambda x : x.str.cat(),1)
1 loop, best of 3: 759 ms per loop
%timeit df.apply(lambda x : ''.join(x),1)
1 loop, best of 3: 376 ms per loop
df.shape
Out[381]: (3000, 2000)
Use str.cat to join the first row, and assign to the second.
i = df.iloc[1:].copy() # the copy is needed to prevent chained assignment
i[df.shape[1]] = df.iloc[0].str.cat(sep=' ')
i
     0     1   2    3  4                       5
1  But  only  in  row  1  I want to join strings
Another alternative is to add a space to every cell and then sum across each row:
df[5] = df.add(' ').sum(axis=1).shift(1)
Result:
     0     1   2     3        4                        5
0    I  want  to  join  strings                      NaN
1  But  only  in   row        1  I want to join strings
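Note that each concatenated string keeps a trailing space from add(' '); if that matters, a str.strip() can be slotted in before the shift (a small variation on the line above):
df[5] = df.add(' ').sum(axis=1).str.strip().shift(1)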
If your dataset is less than perfect and you want to exclude 'nan' values, you can use this:
df.apply(lambda x :' '.join(x for x in x.astype(str) if x != "nan"),1)
I found this particularly helpful in joining columns containing parts of addresses together where some parts like SubLocation (e.g. apartment #) aren't relevant for all addresses.
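For example, a minimal sketch with a made-up address frame (the column names and values here are hypothetical):
import numpy as np
import pandas as pd

addr = pd.DataFrame({
    'Street': ['12 Main St', '9 Elm Ave'],
    'SubLocation': ['Apt 4', np.nan],        # missing for some addresses
    'City': ['Springfield', 'Shelbyville'],
})
# astype(str) turns missing values into the string 'nan', which the join skips
addr['Full'] = addr.apply(
    lambda x: ' '.join(v for v in x.astype(str) if v != 'nan'), axis=1)
# 0    12 Main St Apt 4 Springfield
# 1           9 Elm Ave Shelbyville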

Python, pandas: how to remove greater than sign

Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example; the actual data I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
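That still leaves strings; if you also want integers you can chain astype(int), with the caveat that this maps '<2' to 2 rather than the 1 the question asked for (a minimal follow-up sketch):
df.applymap(lambda x: x.replace('<', '')).astype(int)
#    A
# 0  1
# 1  2
# 2  3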
Here are two other ways of doing this which may be helpful going forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
Outputs
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
Another option is a regex substitution that keeps only digits and dots before converting:
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3
