Changing values in a column of a data set using regex in pandas - python

This is a subset of a data frame:
Index duration
1 4 months20mg 1X D
2 1 years10 1X D
3 2 weeks10 mg
4 8 years300 MG 1X D
5 20 days
6 10 months
The output should be like this:
Index duration
1 4 month
2 1 year
3 2 week
4 8 year
5 20 day
6 10 month
This is my code:
df.dosage_duration.replace(r'year[0-9a-zA-z]*' , 'year', regex=True)
df.dosage_duration.replace(r'day[0-9a-zA-z]*' , 'day', regex=True)
df.dosage_duration.replace(r'month[0-9a-zA-z]*' , 'month', regex=True)
df.dosage_duration.replace(r'week[0-9a-zA-z]*' , 'week', regex=True)
But it does not work. Any suggestions?

There are two problems.
The first is that your regular expression doesn't match everything you want to replace. Look at months20mg 1X D: there is a space in the part you want to remove, and [0-9a-zA-z]* stops at the first space. A pattern such as 'year.*' matches through to the end of the string.
The second is that you are calling replace without storing the result. replace returns a new Series, so either assign the result back to the column or, if you want to keep the call the way you have it, specify inplace=True.
You can also use a single call if you use a slightly extended regular expression. We can use \1 to refer to the first matching group for the regular expression. The groups are indicated by the parentheses:
df.dosage_duration.replace(r'(year|month|week|day).*' , r'\1',
regex=True, inplace=True)
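As a minimal sketch of the single-call approach, the snippet below rebuilds the sample column from the question (the DataFrame construction is an assumption, since the question only shows printed output) and assigns the result back to the column instead of relying on inplace=True:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed construction).
df = pd.DataFrame({
    "dosage_duration": [
        "4 months20mg 1X D",
        "1 years10 1X D",
        "2 weeks10 mg",
        "8 years300 MG 1X D",
        "20 days",
        "10 months",
    ]
})

# Keep the unit name (group 1) and drop everything that follows it.
df["dosage_duration"] = df["dosage_duration"].replace(
    r"(year|month|week|day).*", r"\1", regex=True
)
print(df)
```

Assigning back makes it explicit that the column is being overwritten, which some readers find easier to follow than inplace=True.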

Related

How to get some string of dataframe column?

I have dataframe like this.
print(df)
[ ID ... Control
0 PDF-1 ... NaN
1 PDF-3 ... NaN
2 PDF-4 ... NaN
I want to get only the number from the ID column, so the result would be:
1
3
4
How can I extract part of each string in a dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
Could you please try the following:
df['ID'].replace(regex=True,to_replace=r'([^\d])',value=r'')
See the documentation for df.replace.
This uses a regex to remove everything except digits from the ID column: \d denotes a digit, and [^\d] matches any character that is not a digit.
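A short sketch of this digit-only replacement, assuming a three-row ID column like the one in the question. It uses \D, the standard shorthand for [^\d], and converts the result to integers as an optional extra step:

```python
import pandas as pd

# Assumed sample data matching the IDs shown in the question.
df = pd.DataFrame({"ID": ["PDF-1", "PDF-3", "PDF-4"]})

# \D is shorthand for [^\d]: any character that is not a digit.
digits = df["ID"].replace(r"\D", "", regex=True)

# The result is still text; convert it if real numbers are needed.
numbers = digits.astype(int)
print(numbers.tolist())
```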
Another possibility using Regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
0
0 1
1 2
2 3
3 4
4 5
ID
0 PDF-1
1 PDF-2
2 PDF-3
3 PDF-4
4 PDF-5
Find "PDF-" and replace it with nothing:
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, convert the column to a string with no index:
print(df['ID'].to_string(index=False))

How to drop parentheses within column or data frame

df =
A B
1 5
2 6)
(3 7
4 8
To remove parentheses I did:
df.A = df.A.str.replace(r"\(.*\)","")
But no result. I have checked a lot of replies here, but still same result.
I would appreciate help removing parentheses from the whole data set, or at least from a column.
To remove parentheses from the whole data set, use a regex character class [...]:
In [15]: df.apply(lambda s: s.str.replace(r'[()]', '', regex=True))
Out[15]:
A B
0 1 5
1 2 6
2 3 7
3 4 8
Or use df.replace(r'[()]', '', regex=True), which is more concise.
If you want a regex, you can use r"[()]" instead of alternation groups, as long as you only need to replace one character at a time.
df.A = df.A.str.replace(r"[()]", "", regex=True)
I find it easier to read and alter if needed.
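The sketch below illustrates why the pattern from the question leaves the data untouched while a character class works. The DataFrame is an assumed reconstruction of the printed sample, with both columns stored as strings:

```python
import pandas as pd

# Assumed reconstruction of the sample data; values are strings.
df = pd.DataFrame({"A": ["1", "2", "(3", "4"], "B": ["5", "6)", "7", "8"]})

# r"\(.*\)" needs "(" and ")" in the same cell, so nothing matches here.
unchanged = df["A"].str.replace(r"\(.*\)", "", regex=True)

# A character class removes each parenthesis independently.
cleaned = df.replace(r"[()]", "", regex=True)
print(cleaned)
```

Passing regex=True explicitly matters on recent pandas versions, where Series.str.replace treats the pattern as a literal string by default.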

Extract part of a string with regex before hyphen followed by digits

I have a dataframe test with a column category containing a complex pattern of words, characters and digits. I need to extract the hyphen-separated words that come before a hyphen followed by digits into a new column, sub_category.
I'm not a regex expert and have spent too much time fighting it, so I would appreciate your help!
test = pd.DataFrame({
'id': ['1','2','3','4'],
'category': ['worda-wordb-1234.ds.er89.',
'worda-4567.we.77-ty','wordc-wordd-5698/de/','wordc-2356/rt/']
})
Desired output:
id category sub_category
0 1 worda-wordb-1234.ds.er worda-wordb
1 2 worda-4567.we.ty worda
2 3 wordc-wordd-5698/de/ wordc-wordd
3 4 wordc-2356/rt/ wordc
Use str.extract:
test['sub-category'] = test.category.str.extract(r'(.*)-\d+')
id category sub-category
0 1 worda-wordb-1234.ds.er89. worda-wordb
1 2 worda-4567.we.77-ty worda
2 3 wordc-wordd-5698/de/ wordc-wordd
3 4 wordc-2356/rt/ wordc
What you want is simply the start of the string and as many non-digits as possible, except for the final hyphen. This should do the trick:
^\D+?(?=-\d)
Explanation:
^ matches the start of the string
\D+? matches non-digits, but in a non-greedy manner
(?=-\d) matches a hyphen followed by a digit; this forces the previous match to stop.
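To apply this pattern in pandas, one option is str.extract. This is only a sketch: the answer gives the bare regex, so wrapping \D+? in a capture group and passing expand=False are my assumptions about how to plug it in (str.extract requires at least one group).

```python
import pandas as pd

test = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "category": [
        "worda-wordb-1234.ds.er89.",
        "worda-4567.we.77-ty",
        "wordc-wordd-5698/de/",
        "wordc-2356/rt/",
    ],
})

# The lazy \D+? grows until the lookahead sees a hyphen followed by a digit.
# expand=False returns a Series, since there is only one capture group.
test["sub_category"] = test["category"].str.extract(
    r"^(\D+?)(?=-\d)", expand=False
)
print(test)
```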
You can also do this with split():
>>> df
id category
0 1 worda-wordb-1234.ds.er89.
1 2 worda-4567.we.77-ty
2 3 wordc-wordd-5698/de/
3 4 wordc-2356/rt/
Resulting output:
>>> df['sub_category'] = df.category.str.split(r'-\d+',expand=True)[0]
>>> df
id category sub_category
0 1 worda-wordb-1234.ds.er89. worda-wordb
1 2 worda-4567.we.77-ty worda
2 3 wordc-wordd-5698/de/ wordc-wordd
3 4 wordc-2356/rt/ wordc
Or, as @jezrael suggested, use split() with a small change that limits the number of splits; only one is needed here:
df['sub_category'] = df.category.str.split(r'-\d+',n=1).str[0]

How to split column values of a pandas data frame into rows separated by “,”

I am trying to split the comma-separated column values of a pandas dataframe into separate rows.
The original data: [image: original pandas dataframe]
The desired output: [image: desired output]
I have tried several ways.
Explode/stack a Series of strings
newdf['Month'] = newdf['Month'].apply(list)
Using the above code I get [j,a,n,,f,e,b], and then I used
pd.Dataframe({'Month':np.concatenate(newdf['Month'].values), 'cust.no':newdf['cust.no'].repeat(newdf['cust no.'].apply(len))})
The output has each letter in a separate row. As a result, the row counts do not match "cust no." and I get an error.
I know there are several functions available, but I couldn't find one that can efficiently break down the values.
You can always just use a regex (regular expression) to identify all text before the comma.
Assuming your original dataframe is called data, meaning your months column is data['Months'], you can use the regular expression r'(.+?),' to select everything before the comma.
data['Months'] = data['Months'].str.extract(r'(.+?),', expand=True)
You can always test regex at https://pythex.org/. Try entering your months column in the test string box, and (.+?), as the regular expression.
Setup
df = pd.DataFrame({'id': [1,2,3,4], 'month': ['Jan,Fev', 'Feb,July', 'Jun,Aug', 'July,Mar']})
id month
0 1 Jan,Fev
1 2 Feb,July
2 3 Jun,Aug
3 4 July,Mar
str.split+pd.DataFrame()+stack
df = df.set_index('id')
pd.DataFrame(df.month.str.split(',').to_dict()).T.stack().reset_index(level=0, name='month')
level_0 month
0 1 Jan
1 1 Fev
0 2 Feb
1 2 July
0 3 Jun
1 3 Aug
0 4 July
1 4 Mar
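As an alternative sketch that is not part of the answers above, pandas 0.25 and later provide DataFrame.explode, which turns each list element into its own row. The variable name long_df is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "month": ["Jan,Fev", "Feb,July", "Jun,Aug", "July,Mar"],
})

# Split each string into a list, then give every list element its own row.
long_df = (
    df.assign(month=df["month"].str.split(","))
      .explode("month")
      .reset_index(drop=True)
)
print(long_df)
```

The id value is repeated for every month that came from the same original row, which keeps the two columns aligned.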

How to standardize strings between rows in a pandas DataFrame?

I have the following pandas DataFrame in Python3.x:
import pandas as pd
dict1 = {
'ID':['first', 'second', 'third', 'fourth', 'fifth'],
'pattern':['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE']
}
df = pd.DataFrame(dict1)
>>> df
ID pattern
0 first AAABCDEE
1 second ABBBBD
2 third CCCDE
3 fourth AA
4 fifth ABCDE
There are two columns, ID and pattern. The string in pattern with the longest length is in the first row, len('AAABCDEE'), which is length 8.
My goal is to standardize the strings such that these are the same length, with the trailing spaces as ?.
Here is what the output should look like:
>>> df
ID pattern
0 first AAABCDEE
1 second ABBBBD??
2 third CCCDE???
3 fourth AA??????
4 fifth ABCDE???
If I was able to make the trailing spaces NaN, then I could try something like:
df = df.applymap(lambda x: int(x) if pd.notnull(x) else str("?"))
But I'm not sure how one would efficiently (1) find the longest string in pattern and (2) then add NaN at the end of the strings up to this length. This may be a convoluted approach...
You can use Series.str.ljust for this, after acquiring the max string length in the column.
df.pattern.str.ljust(df.pattern.str.len().max(), '?')
# 0 AAABCDEE
# 1 ABBBBD??
# 2 CCCDE???
# 3 AA??????
# 4 ABCDE???
# Name: pattern, dtype: object
The source for pandas 0.22.0 shows that ljust is entirely equivalent to pad with side='right', so pick whichever you find clearer.
You can use str.pad:
df.pattern.str.pad(width=df.pattern.str.len().max(),side='right',fillchar='?')
Out[1154]:
0 AAABCDEE
1 ABBBBD??
2 CCCDE???
3 AA??????
4 ABCDE???
Name: pattern, dtype: object
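A quick sketch to check that the ljust and pad calls above give the same result on the sample data. The variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["first", "second", "third", "fourth", "fifth"],
    "pattern": ["AAABCDEE", "ABBBBD", "CCCDE", "AA", "ABCDE"],
})

# Length of the longest string in the column.
width = df["pattern"].str.len().max()

with_ljust = df["pattern"].str.ljust(width, "?")
with_pad = df["pattern"].str.pad(width=width, side="right", fillchar="?")

# Both calls right-fill each string with "?" up to the same width.
print(with_ljust.equals(with_pad))
```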
Python 3.6 f-string
n = df.pattern.str.len().max()
df.assign(pattern=[f'{i:?<{n}s}' for i in df.pattern])
ID pattern
0 first AAABCDEE
1 second ABBBBD??
2 third CCCDE???
3 fourth AA??????
4 fifth ABCDE???
