Split one column into two by specific characters in Python

I use Python 3 and need to split a price column that mixes price_value and price_unit together in a dataframe. The example data looks like 20dollar/m2/month or 1.8dollar/m2/day, and I want to split it on the word dollar into this format:
price_value price_unit
20 dollar/m2/month
1.8 dollar/m2/day
I have tried the following code:
Option 1:
df['price_value'] = df['price'].apply(lambda row: row.split('dollar')[0])
df['price_unit'] = df['price'].apply(lambda row: row.split('dollar')[-1])
Option 2:
df['price_value'], df['price_unit'] = df["price"].str.split('dollar', 1).str
But I get:
price_value price_unit
20 /m2/month
1.8 /m2/day
How can I split them correctly? Thanks.

You may use str.extract with an r'(?P<price_value>.*?)(?P<price_unit>dollar.*)' regex:
>>> import pandas as pd
>>> df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price'])
>>> df['price'].str.extract(r'(?P<price_value>.*?)(?P<price_unit>dollar.*)')
price_value price_unit
0 20 dollar/m2/month
1 1.8 dollar/m2/day
Details
(?P<price_value>.*?) - Group "price_value": any 0+ chars other than line break chars as few as possible
(?P<price_unit>dollar.*) - Group "price_unit": dollar and any 0+ chars other than line break chars as many as possible.
I assume that you do not have any line breaks in the input, but if you happen to have any, prepend the pattern with the inline DOTALL modifier, (?s): r'(?s)(?P<price_value>.*?)(?P<price_unit>dollar.*)'
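For example, a minimal sketch of the (?s) variant, using a made-up value that contains a line break (not part of the original data):
>>> import pandas as pd
>>> df2 = pd.DataFrame(data=['20\ndollar/m2/month'], columns=['price'])  # hypothetical multi-line value
>>> df2['price'].str.extract(r'(?s)(?P<price_value>.*?)(?P<price_unit>dollar.*)').to_dict('records')
[{'price_value': '20\n', 'price_unit': 'dollar/m2/month'}]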
To add the newly extracted columns to the existing data frame, you may also use
df[['price_value', 'price_unit']] = df['price'].str.extract(r'(.*?)(dollar.*)')
Here, named capturing groups are not necessary since you define the column names beforehand.

You could do:
df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price_unit'])
# split by capture group
result = df['price_unit'].str.split('(dollar.*$)', expand=True).drop(2, axis=1)
# rename columns
result.columns = ['price_value', 'price_unit']
print(result)
Output
price_value price_unit
0 20 dollar/m2/month
1 1.8 dollar/m2/day

Related

Find if the string (sentence) contains any of a list of strings in another column in Python

I want to check whether the Sentence column contains any keyword from the Keyword column (case-insensitive).
I also have a problem when importing the file from CSV: the keyword list comes back as a string with quotes in it, so when I try to join with str.join('|') it adds a | between every character.
Sentence = ["Clear is very good","Fill- low light, compact","stripping topsoil"]
Keyword =[['Clearing', 'grubbing','clear','grub'],['Borrow,', 'Fill', 'and', 'Compaction'],['Fall']]
df = pd.DataFrame({'Sentence': Sentence, 'Keyword': Keyword})
My expected output would be:
df['Match'] = [True,True,False]
You can try DataFrame.apply on rows
import re
df['Match'] = df.apply(lambda row: bool(re.search('|'.join(row['Keyword']), row['Sentence'], re.IGNORECASE)), axis=1)
print(df)
Sentence Keyword Match
0 Clear is very good [Clearing, grubbing, clear, grub] True
1 Fill- low light, compact [Borrow,, Fill, and, Compaction] True
2 stripping topsoil [Fall] False
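As a side note on the CSV issue mentioned in the question: if the Keyword column is read back from a CSV, it arrives as a plain string that only looks like a list, and joining a string with '|' puts a | between every character. A minimal sketch of one way to handle that, plus re.escape for keywords containing regex metacharacters (the raw value below is an assumption about what the CSV cell looks like):
import ast
import re

raw = "['Clearing', 'grubbing', 'clear', 'grub']"    # hypothetical cell as read from CSV
keywords = ast.literal_eval(raw)                     # convert the string back to a real list
pattern = '|'.join(re.escape(k) for k in keywords)   # escape any regex metacharacters
print(bool(re.search(pattern, "Clear is very good", re.IGNORECASE)))  # True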

Pandas split columns on first % sign, on 2nd letter

We have the following dataframe
# raw_df
print(raw_df.to_dict())
{'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'}, 'Grade': {1: 'D+D', 2: 'BF', 3: 'B-F'}}
We are trying to split these 2 columns into 4 columns. The Edge column should split after the first %, and the Grade column should split before the 2nd capital letter appears. The output should look like:
output_df
edge_1 edge_2 grade_1 grade_2
-1.9% -2.2% D+ D
+5.8% -9.4% B F
+3.5% -7.2% B- F
We have raw_df[['t1_grade', 't2_grade']] = raw_df['Grade'].str.extractall(r'([A-Z])').unstack() to split the Grade column; however, the + and - are dropped here, which is a problem. We are also not sure how to split the Edge column after the first % appears.
We can use str.extract as follows:
df["edge_1"] = df["Edge"].str.extract(r'^([+-]?\d+(?:\.\d+)?%)')
df["edge_2"] = df["Edge"].str.extract(r'([+-]?\d+(?:\.\d+)?%)$')
df["grade_1"] = df["Grade"].str.extract(r'^([A-Z][+-]?)')
df["grade_2"] = df["Grade"].str.extract(r'([A-Z][+-]?)$')
The strategy here is to extract the first/last percentage/grade from the two current columns using regex.
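A self-contained sketch with the dictionary from the question, using expand=False so each extract returns a Series (the printed check in the comment is not part of the original answer):
import pandas as pd

df = pd.DataFrame({'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'},
                   'Grade': {1: 'D+D', 2: 'BF', 3: 'B-F'}})
df["edge_1"] = df["Edge"].str.extract(r'^([+-]?\d+(?:\.\d+)?%)', expand=False)
df["edge_2"] = df["Edge"].str.extract(r'([+-]?\d+(?:\.\d+)?%)$', expand=False)
df["grade_1"] = df["Grade"].str.extract(r'^([A-Z][+-]?)', expand=False)
df["grade_2"] = df["Grade"].str.extract(r'([A-Z][+-]?)$', expand=False)
print(df[['edge_1', 'edge_2', 'grade_1', 'grade_2']].values.tolist())
# [['-1.9%', '-2.2%', 'D+', 'D'], ['+5.8%', '-9.4%', 'B', 'F'], ['+3.5%', '-7.2%', 'B-', 'F']]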
Looks like you already have your solution, but here is another idea for splitting Edge without regex:
strip the trailing '%'
split by '%' with expand=True
add back '%'
df[['edge_1', 'edge_2']] = (
    df['Edge'].str.rstrip('%').str.split('%', expand=True).add('%')
)
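For the same raw_df, a self-contained sketch of what this should produce (the renamed df and the printed check are assumptions, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'}})
df[['edge_1', 'edge_2']] = (
    df['Edge'].str.rstrip('%').str.split('%', expand=True).add('%')
)
print(df[['edge_1', 'edge_2']].values.tolist())
# [['-1.9%', '-2.2%'], ['+5.8%', '-9.4%'], ['+3.5%', '-7.2%']]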

How to strip/replace "domain\" from Pandas DataFrame Column?

I have a pandas DataFrame that's being read in from a CSV that has hostnames of computers including the domain they belong to along with a bunch of other columns. I'm trying to strip out the Domain information such that I'm left with ONLY the Hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried using both str.strip() and str.replace(), with a regex as well as a string literal, but I can't seem to target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer, just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
which just adds an r-string to the replace call you already tried.
Without the r-string, the Python literal '.*\\' reaches the regex engine as .*\ with a single trailing backslash, which is not a valid pattern. With the r-string, the pattern stays as .*\\, which the regex engine reads as one literal backslash, so it greedily matches everything up to and including the last backslash, as you expect.
Output:
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
Name: name, dtype: object
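To make the escaping difference concrete, a minimal sketch (not part of the original answer):
import re

plain = '.*\\'   # Python stores 3 characters: . * \  -- an invalid regex (trailing backslash)
raw = r'.*\\'    # Python stores 4 characters: . * \ \ -- one literal backslash for the regex engine
print(len(plain), len(raw))                       # 3 4
print(re.sub(raw, '', r'domain1\computername1'))  # computername1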
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
name
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
No regex approach with ntpath.basename:
import pandas as pd
import ntpath
df = pd.DataFrame({'name':[r'domain1\computername1']})
df["name"] = df["name"].apply(lambda x: ntpath.basename(x))
Results: computername1.
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]

How to check the pattern of a column in a dataframe

I have a dataframe which has some id's. I want to check the pattern of those column values.
Here is how the column looks like-
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish characters and numerics in the pattern above and display an output like 'SSSS99SS' as the pattern of the column, where 'S' represents a character and '9' represents a numeric. This is a large dataset, so I can't predefine the positions the characters and numerics will be in; I want the code to work out those positions. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
    my_string = ''.join(str(9) if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well, as shown below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999
You can use a regular expression:
import re

st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 characters followed by 2 digits and then 4 more characters.
So you can use this on your df to filter it, as in the sketch below.
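For example, a minimal sketch of filtering a dataframe on the id format from the question, 4 letters, 2 digits, 2 letters (the df below and the bad row are made up):
import pandas as pd

df = pd.DataFrame({'id': ['ASDH12HK', 'GHST67KH', 'BAD_ID_1', 'THKI86LK']})  # hypothetical data
mask = df['id'].str.match(r'[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}$')  # anchored at the end to require the exact shape
print(df[mask])  # keeps only the rows matching the pattern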
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] column contains only strings. What I want is to check this column and, if a string contains a specific substring (I'm not really interested in where it appears, as long as it exists), replace it. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records, though they contain some of the substrings I defined, do not get replaced. Is it a regex-related problem?
Thank you in advance.
It is not a regex problem: an expression like ('mdma' or 'xanax' or ...) in x evaluates the or chain first, and or returns its first truthy operand, so only 'mdma' (and likewise only 'weapon') is ever tested. Instead, create a dictionary of lists of substrings keyed by the replacement strings, loop over it, join each list's values with | for a regex OR, check the column with contains, and set the matched rows with loc:
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k

print(df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
