Split column into multiple columns based on content of column in Pandas - python

I have a column with data like this
Ticket NO: 123456789 ; Location ID:ABC123; Type:Network;
Ticket No. 132123456, Location ID:ABC444; Type:App
Tickt#222256789 ; Location ID:AMC121; Type:Network;
I am trying this:
new = data["Description"].str.split(";", n=2, expand=True)
data["Ticket"] = new[0]
data["Location"] = new[1]
data["Type"] = new[2]
# Drop the old column
data.drop(columns=["Description"], inplace=True)
I can split on ";", but how do I split on both ";" and ","?

A more general solution that lets you comfortably perform as much processing as you like. Let's start by defining an example dataframe for easy debugging:
import pandas as pd

df = pd.DataFrame({'Description': [
    'Ticket NO: 123456789 , Location ID:ABC123; Type:Network;',
    'Ticket NO: 123456789 ; Location ID:ABC123; Type:Network;']})
Then, let's define our processing function, where you can do anything you like:
import re

def process(row):
    parts = re.split(r'[,;]', row)
    return pd.Series({'Ticket': parts[0], 'Location': parts[1], 'Type': parts[2]})
In addition to splitting on , or ; and then separating into the 3 sections, you can add code that strips whitespace characters, removes whatever is to the left of the colons, etc. For example, try:
def process(row):
    parts = re.split(r'[,;]', row)
    data = {}
    for part in parts:
        for field in ['Ticket', 'Location', 'Type']:
            if field.lower() in part.lower():
                data[field] = part.split(':')[1].strip()
    return pd.Series(data)
Finally, apply to get the result:
df['Description'].apply(process)
This is much more readable and easily maintainable than doing everything in a single regex, especially as you might end up needing additional processing.
The output of this application will look like this:
      Ticket Location     Type
0  123456789   ABC123  Network
1  123456789   ABC123  Network
To add this output to the original dataframe, simply run:
df[['Ticket', 'Location', 'Type']] = df['Description'].apply(process)

One approach using str.extract:
Ex:
import re

df[['Ticket', 'Location', 'Type']] = df['Description'].str.extract(r"[Ticket\sNO:.#](\d+).*ID:([A-Z0-9]+).*Type:([A-Za-z]+)", flags=re.I)
print(df[['Ticket', 'Location', 'Type']])
Output:
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network

You can use
new = data["Description"].str.split(r"[;,]", n=2, expand=True, regex=True)
new.columns = ['Ticket', 'Location', 'Type']
Output:
>>> new
                 Ticket             Location           Type
0  Ticket NO: 123456789   Location ID:ABC123  Type:Network;
1  Ticket No. 132123456   Location ID:ABC444       Type:App
2       Tickt#222256789   Location ID:AMC121  Type:Network;
The [;,] regex matches either a ; or a , char, and n=2 limits the number of splits to two.
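The split cells still carry their labels (Ticket NO:, Location ID:, Type:). If you only want the values, a small follow-up sketch (an addition, assuming the label and value in each cell are separated by :, . or # as in the sample rows) could strip the prefixes:
# Not part of the original answer: keep only what follows the last ':', '.' or '#'
new = new.apply(lambda col: col.str.split(r'[:#.]', regex=True).str[-1].str.strip(' ;'))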
Another regex Series.str.extract solution:
new[['Ticket', 'Location', 'Type']] = data['Description'].str.extract(r"(?i)Ticke?t\D*(\d+)\W*Location ID\W*(\w+)\W*Type:(\w+)")
>>> new
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network
See the regex demo. Details:
(?i) - case insensitive flag
Ticke?t - Ticket with an optional e
\D* - zero or more non-digit chars
(\d+) - Group 1: one or more digits
\W* - zero or more non-word chars
Location ID - a string
\W* - zero or more non-word chars
(\w+) - Group 2: one or more word chars
\W* - zero or more non-word chars
Type: - a string
(\w+) - Group 3: one or more word chars
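As a quick standalone check of the pattern, here is a sketch using the plain re module on one of the sample rows:
import re

pattern = re.compile(r"(?i)Ticke?t\D*(\d+)\W*Location ID\W*(\w+)\W*Type:(\w+)")
# The 'Tickt#' row exercises both the optional 'e' and the \D* separator
print(pattern.search("Tickt#222256789 ; Location ID:AMC121; Type:Network;").groups())
# ('222256789', 'AMC121', 'Network')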

Related

How to remove numbers that start with 4 zeros from a string column?

I have a column of product names and information. I need to remove the codes from the names; every code starts with four or more zeros. Some names have four or more zeros in the weight, and some codes are joined to the name, as in the example below:
import pandas as pd

data = {
    'Name': ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
    'Name': ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split the strings at the start of the code, which is expressed by the regex (?<!\d)0{4,}. This pattern consumes four or more 0s that are not preceded by any digit. After splitting the string, take the first fragment; str.strip gets rid of a possible trailing space:
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0                 ANOA 250g
1               ANOA 10000g
2    80%c asjw 150000001568
3                  Shivangi
Note that this works for the case where the codes are always at the end of your string.
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*', '', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'(?<!\d)0{4,}\d*', '', regex=True)
Output:
                     Name
0               ANOA 250g
1             ANOA 10000g
2  80%c asjw 150000001568
3                Shivangi
regex1 demo
regex2 demo
Try splitting at each space and checking whether each item has 0000 in it, like:
answer = []
for i in testdf["Name"]:
    answer.append(" ".join([j for j in i.split() if "0000" not in j]))

Replace N digit numbers in a sentence with specific strings for different values of N

I have a bunch of strings in a pandas dataframe that contain numbers in them. I could run the code below and replace them all:
df.feature_col = df.feature_col.str.replace(r'\d+', ' NUM ', regex=True)
But what I need to do is replace any 10-digit number with a string like masked_id, any 16-digit number with account_number, any three-digit number with yet another string, and so on.
How do I go about doing this?
PS: since my data size is less, a less optimal way is also good enough for me.
Another way is replace with regex=True and a dictionary. You can also use somewhat more relaxed match patterns (applied in order) than Tim's:
# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})

# patterns in decreasing length order
# (these would, of course, also replace '12345' with 'ID45' :-) )
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:
     feature_col
0   this has ID7
1   this has ID4
2   this has ID3
3  this has none
You could do a series of replacements, one for each length of number:
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', ' masked_id ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', ' account_number ', regex=True)
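Because each pattern is anchored with \b word boundaries, a longer number is never partially consumed by a shorter rule, so the order of the replacements does not matter here. A quick sketch with hypothetical sample data, quoting the placeholder strings literally:
import pandas as pd

df = pd.DataFrame({'feature_col': ['call 5551234567 re card 4111111111111111',
                                   'code 123']})
for pat, repl in [(r'\b\d{3}\b', ' 3mask '),
                  (r'\b\d{10}\b', ' masked_id '),
                  (r'\b\d{16}\b', ' account_number ')]:
    df.feature_col = df.feature_col.str.replace(pat, repl, regex=True)
print(df.feature_col.tolist())
# ['call  masked_id  re card  account_number ', 'code  3mask ']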

Python - select string based on length from second dataframe

I have a dataframe with website as one of the columns. I am trying to create a clean string column that excludes everything after .com / .net / .org / .edu, etc. My approach is to find where they occur and exclude anything after them by adding the appropriate number of characters:
**string**
https:/amazon.com
google.com/
http:/onlinelearning.edu/home
walmart.net/
https:/target.onlinesales.org/home/goods
https:/target.onlinesales.de/home/goods
**new string**
https:/amazon.com
google.com
http:/onlinelearning.edu
walmart.net
https:/target.onlinesales.org
https:/target.onlinesales.de
For the ones that contain .com:
df['length'] = np.where(df['string'].str.contains('.com', regex=False), df['string'].str.find('.com') + 4, df['string'].str.len())
df['new_string'] = [y[:x] for (x, y) in zip(df['length'], df['string'])]
This is a job for regex. You can use pd.Series.str.replace with a negative lookbehind:
print(df["col"].str.replace(r"(?<!:)/.*", "", regex=True))
Or alternatively list out all your required domains with positive lookbehinds:
print(df["col"].str.replace(r"(?:(?<=com)|(?<=edu)|(?<=org)|(?<=de)|(?<=net))/.*", "", regex=True))
0                https:/amazon.com
1                       google.com
2         http:/onlinelearning.edu
3                      walmart.net
4    https:/target.onlinesales.org
5     https:/target.onlinesales.de
Name: col, dtype: object
You can further refine the pattern to suit more cases.
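For reference, a minimal self-contained sketch of the lookbehind approach (assuming a column named col; the sample strings keep the single slash after the scheme, as in the question):
import pandas as pd

df = pd.DataFrame({"col": ["https:/amazon.com", "google.com/",
                           "http:/onlinelearning.edu/home", "walmart.net/"]})
# Remove everything from the first '/' that is not directly preceded by ':'
print(df["col"].str.replace(r"(?<!:)/.*", "", regex=True))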

Filtering values from string with regex

I am new to regular expressions.
I am trying to filter out values from 2 columns.
The first column look like this:
_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE
The second column look like this:
_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad&gtmon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139
My goal is to extract the gclid values from both columns, so that I would have 2 new columns: Gclid_from_cookie and Gclid_from_url.
What I have so far:
import re

def get_glid_from_source(pattern, data):
    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None
df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)
df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)
I would need to edit the expression so that:
1. Gclid can only start with a letter a-z or A-Z
2. Gclid ends with one of the following - ;, %, & or the end of string
Now after filtering I get values from the second column that have click_id=ZFGe..., because the value I need can appear as either gclid=Value I need or gclid&click_id=Value I need.
EDIT
I have a third column which looks like this:
_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...
I am making a new column in the pandas dataframe where I add TRUE or FALSE values depending on whether these conditions are met in the above column:
If the link has:
.com/fire-start.php or .com/go/ or .com/api/
The new column would have the value FALSE if the pattern is not found in the string; otherwise TRUE is passed.
What I tried:
df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')
But that does not seem to work.
Thank you for your help, appreciate it.
You may use
df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')
See the regex demo
The Gclid_from_cookie can be populated using
df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')
See this regex demo
Note that [gclid|click_id] matches any 1 char defined in the character set, a g, c, l, i, d, |, k or _, not a sequence of chars, hence the non-capturing group (?:...) in my pattern.
The value pattern is [a-zA-Z][^&#]* or [a-zA-Z][^&#;%]* that is quite self-explanatory: an ASCII letter and 0 or more chars other than &, #, ;, %.
As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not] to match any text but not, [^not] matches any char but n, o and t.
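A tiny illustration of the difference, as a sketch with plain re:
import re

s = 'gclid=EAIaIQob&x=1'
# The character class matches one char from the set, so it only grabs the 'd' before '='
print(re.findall(r'[gclid|click_id]=', s))    # ['d=']
# The non-capturing group matches a whole alternative
print(re.findall(r'(?:gclid|click_id)=', s))  # ['gclid=']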
For the TRUE/FALSE validate column, add:
import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/']
df['validate'] = df['_source.request_url'].str.contains("|".join(map(re.escape, filters)))
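A quick self-contained check against toy URLs (a sketch; the real paths from the question are abbreviated here):
import pandas as pd
import re

filters = ['.com/fire-start.php', '.com/go/', '.com/api/']
df = pd.DataFrame({'_source.request_url': ['www.google.com/api/test',
                                           'www.google.com/go/test',
                                           'www.google.com/fire-start.php/test',
                                           'www.google.com/test']})
df['validate'] = df['_source.request_url'].str.contains("|".join(map(re.escape, filters)))
print(df['validate'].tolist())  # [True, True, True, False]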

Group naming with group and nested regex (unit conversion from text file)

Basic Question:
How can you name a python regex group with another group value and nest this within a larger regex group?
Origin of Question:
Given a string such as 'Your favorite song is 1 hour 23 seconds long. My phone only records for 1 h 30 mins and 10 secs.'
What is an elegant solution for extracting the times and converting them to a given unit?
Attempted Solution:
My best guess at a solution would be to create a dictionary and then perform operations on the dictionary to convert to the desired unit.
i.e. convert the given string to this:
string[0]:
{'time1': {'day':0, 'hour':1, 'minutes':0, 'seconds':23, 'milliseconds':0}, 'time2': {'day':0, 'hour':1, 'minutes':30, 'seconds':10, 'milliseconds':0}}
string[1]:
{'time1': {'day':4, 'hour':2, 'minutes':3, 'seconds':6, 'milliseconds':30}}
I have a regex solution, but it isn't doing what I would like:
import re
test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
               'This video is 4 days 2h 3min 6sec 30ms']
year_units = ['year', 'years', 'y']
day_units = ['day', 'days', 'd']
hour_units = ['hour', 'hours', 'h']
min_units = ['minute', 'minutes', 'min', 'mins', 'm']
sec_units = ['second', 'seconds', 'sec', 'secs', 's']
millisec_units = ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms']
all_units = '|'.join(year_units + day_units + hour_units + min_units + sec_units + millisec_units)
print(all_units)
# pattern = r"""(?P<time> # time group beginning
# (?P<value>[\d]+) # value of time unit
# \s* # may or may not be space between digit and unit
# (?P<unit>%s) # unit measurement of time
# \s* # may or may not be space between digit and unit
# )
# \w+""" % all_units
pattern = r""".*(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
).* # may be words in between the times
""" % (all_units)
regex = re.compile(pattern)
for val in test_string:
    match = regex.search(val)
    print(match)
    print(match.groupdict())
This fails miserably because it can't properly deal with the nested groupings, and a group can't be named with the value of another group.
First of all, you can't just write a multiline regex with comments and expect it to match anything if you don't use the re.VERBOSE flag:
regex = re.compile(pattern, re.VERBOSE)
Like you said, the best solution is probably to use a dict:
for val in test_string:
    while True:  # find all times
        match = regex.search(val)  # find the first unit
        if not match:
            break
        matches = {}  # keep track of all units and their values
        while True:
            matches[match.group('unit')] = int(match.group('value'))  # add the match to the dict
            val = val[match.end():]  # remove the matched part so subsequent matches must start at index 0
            m = regex.search(val)
            if not m or m.start() != 0:  # no more matches, or text between this match and the next: abort
                break
            match = m
        print(matches)  # the finished dict
        # output will be like {'h': 1, 'secs': 10, 'mins': 30}
However, the code above won't work just yet. We need to make two adjustments:
The pattern cannot allow just any text between matches. To allow only whitespace and the word "and" between two matches, you can use:
pattern = r"""(?P<time>          # time group beginning
    (?P<value>\d+)               # value of time unit
    \s*                          # may or may not be space between digit and unit
    (?P<unit>%s)                 # unit measurement of time
    \s*                          # may or may not be space between digit and unit
    (?:\band\s+)?                # allow the word "and" between numbers
    )""" % all_units
You have to change the order of your units like so:
year_units = ['years', 'year', 'y'] # yearS before year
day_units = ['days', 'day', 'd'] # dayS before day, etc...
Why? Because if you have text like 3 years and 1 day, it would match 3 year instead of 3 years and.
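Putting both adjustments together, a condensed sketch (it uses re.finditer for brevity, which does not enforce the adjacency check that the loop above performs; the unit lists are abbreviated):
import re

hour_units = ['hours', 'hour', 'h']
min_units = ['minutes', 'minute', 'mins', 'min', 'm']
sec_units = ['seconds', 'second', 'secs', 'sec', 's']
all_units = '|'.join(hour_units + min_units + sec_units)  # longest variants first

pattern = r"""(?P<time>
    (?P<value>\d+)\s*
    (?P<unit>%s)\s*
    (?:\band\s+)?
    )""" % all_units
regex = re.compile(pattern, re.VERBOSE)

val = 'My phone only records for 1h 30 mins and 10 secs.'
print({m.group('unit'): int(m.group('value')) for m in regex.finditer(val)})
# {'h': 1, 'mins': 30, 'secs': 10}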
