Extract multiple patterns from a string in pandas - python

Create new column with the extracted middle and last strings from a column within a dataset.
Data
Status ID
Ok hello_dd
Ok hello_aa_now
No standard_cc
no standard_ee_not
Desired
Status ID type
Ok hello_dd dd
Ok hello_aa_now aa
No standard_cc cc
no standard_ee_not ee
Doing
I am able to extract the last string, however, still researching how to extract the middle string.
df['type'] = df['ID'].str.strip('_').str[-1]
Any suggestion is appreciated.

Assuming you want to extract the string after the first _:
df['type'] = df['ID'].str.extract(r'_([^_]+)')
With split:
df['type'] = df['ID'].str.split('_').str[1]
output:
Status ID type
0 Ok hello_dd dd
1 Ok hello_aa_now aa
2 No standard_cc cc
3 no standard_ee_not ee
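For reference, here is a minimal runnable sketch of the approach above on the sample data from the question. The attempt in the question returns a single character because str.strip('_') only trims underscores at the ends of the string and .str[-1] then takes the last character. The extra column name "last" is only an illustration and is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "Status": ["Ok", "Ok", "No", "no"],
    "ID": ["hello_dd", "hello_aa_now", "standard_cc", "standard_ee_not"],
})

# Token right after the first underscore (the "middle" string).
# expand=False returns a Series instead of a one-column DataFrame.
df["type"] = df["ID"].str.extract(r"_([^_]+)", expand=False)

# Last underscore-separated token, shown for comparison ("last" is a hypothetical column).
df["last"] = df["ID"].str.split("_").str[-1]

print(df)
```

Rows without any underscore would get NaN in "type" with either extract or split.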

Related

How to drop duplicate with priority in pandas

I'm new to pandas and Python, and I want to remove duplicates but with a priority. It's hard to explain, so here is an example to make it clear:
ID Phone Email
0001 0234+ null
0001 null a#.com
0001 0234+ a#.com
How can I remove the duplicates in ID and keep the third row, because it has both phone and email,
instead of removing rows at random? If an ID has no row with both values complete, one of its rows should still remain.
First Drop NaNs in rows and then drop duplicates
df2 = df.dropna(subset=['Phone']).dropna(subset=['Email']).drop_duplicates('ID')
You can just drop the NaN values based on Phone and Email.
df.dropna(subset=['Phone', 'Email'], inplace=True)
I solved this by putting each case into its own data frame: for example, rows where both email and phone have a value go into the first one, rows where only one of them has a value go into the next one, etc.
Then I concatenate them into a new data frame as the final result
and remove the duplicate IDs (because the most important case is at the top, it is the one that is kept).
code:
# drop rows where both phone ("الجوال") and email ("البريد الالكتروني") are null
ff = ff.dropna(subset=["الجوال", 'البريد الالكتروني'] , how="all")
# hh = rows of ff where both phone and email are present
hh = ff.dropna(subset=["الجوال", 'البريد الالكتروني'])
# ss = rows of ff where phone is present (email may be missing)
ss = ff.dropna(subset=["الجوال"])
# yy = rows of ff where email is present (phone may be missing)
yy = ff.dropna(subset=["البريد الالكتروني"])
# to give priority, stack the most important case on top; drop_duplicates on the ID column ("رقم الهوية") then keeps the first row per ID
df1=pd.concat([hh,ss],axis=0)
len(hh) + len(ss)
df2=pd.concat([df1,yy],axis=0)
len(df1) + len(yy)
final= df2.copy()
final= final.drop_duplicates(subset=["رقم الهوية"])
final.to_excel(r'Result.xlsx',index=False)
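A more compact variant of the same priority idea, as a sketch rather than the code above: count how many contact fields each row has filled in, sort so that the most complete rows come first, and only then drop duplicates. It uses the English column names from the question; the row with ID 0002 is made-up sample data added to show that an incomplete ID still keeps one row.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002"],  # "0002" is hypothetical sample data
    "Phone": ["0234+", None, "0234+", None],
    "Email": [None, "a#.com", "a#.com", "b#.com"],
})

# Number of filled contact fields per row (0, 1 or 2).
filled = df[["Phone", "Email"]].notna().sum(axis=1)

# Most complete rows first; mergesort is a stable sort algorithm.
order = filled.sort_values(ascending=False, kind="mergesort").index

# Keep the first (most complete) row per ID, then restore the original row order.
result = df.loc[order].drop_duplicates("ID").sort_index()
print(result)
```

This avoids building one intermediate frame per case, at the cost of treating phone and email as equally important.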

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Where each two-letter code preceding each bubbled section ( ) is the title of the desired column, and are the same in every row. The only data that changes is what is inside the bubbles. I want the data to look like:
pn io ta pt cn cs
2021 302 Yes Blue John Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got an error, error: missing ), unterminated subpattern at position 2, assuming it has something to do with regular expressions.
Is there an easy way to split these ? Thanks.
Use extract with named capturing groups:
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
    expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then constructing a dataframe out of it. I can build it in one nested comprehension:
import re
# non-greedy (.+?) so each match stops at the first closing parenthesis
rows = [{r[0]:r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values
Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
David
Try with extractall:
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, different way of packaging final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall(r'([^\(]+)\(([^\)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])
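On the error in the question itself: "(" and ")" are regex metacharacters, and str.split treats a multi-character pattern as a regular expression by default, which is where "missing ), unterminated subpattern" comes from. The sketch below shows two ways around it, assuming a pandas version where str.split accepts the regex keyword (1.4 or later, as far as I know). The variable names parts, parts_escaped and size are only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]})

# Option 1: treat the delimiter as literal text.
parts = df["Creative"].str.split("cs(", expand=True, regex=False)

# Option 2: keep regex matching but escape the metacharacter.
parts_escaped = df["Creative"].str.split(re.escape("cs("), expand=True)

# Literal replace of the closing parenthesis.
size = parts[1].str.replace(")", "", regex=False)
print(size.tolist())
```

This only fixes the original split and replace calls; for pulling out all six fields at once, the extract and findall answers above are the better fit.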

Separate several columns of data that contain hyphens, removing elements in Python

I have a dataset, df, where I would like to separate strings within Python.
Data
Type Id
aa - generation aa - generation01
aa_led - generation aa_led - generation01
ss - generation ss- generation01
Desired
Type Id
aa aa01
aa_led aa_led01
ss ss01
Doing
I am trying to incorporate this code into my script; it splits by hyphen, but my column names are not preserved.
new = wordstring.strip('-').split('-')
Any suggestion is appreciated
Thank you
If you just want to remove generation from every value in df, you can use applymap:
df = df.applymap(lambda x : x.replace('- generation', '').replace(' ',''))
OUTPUT:
Type Id
0 aa aa01
1 aa_led aa_led01
2 ss ss01
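A variant of the same idea, as a sketch: apply a vectorized str.replace to each column instead of calling a Python function on each cell. This assumes every column holds strings. As far as I know, applymap has been deprecated since pandas 2.1 in favor of DataFrame.map, so this also sidesteps that warning. The name cleaned is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["aa - generation", "aa_led - generation", "ss - generation"],
    "Id": ["aa - generation01", "aa_led - generation01", "ss- generation01"],
})

# Remove "- generation" together with any whitespace around the hyphen,
# one column at a time; column names are left untouched.
cleaned = df.apply(lambda col: col.str.replace(r"\s*-\s*generation", "", regex=True))
print(cleaned)
```

Unlike stripping every space, this leaves spaces elsewhere in the values alone.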

remove unwanted strings from Pandas column

I have a dataframe :
ID Website
1 www.yah.com/?trk
2 www.gle.com
I want to clean unwanted part from the website Url by deleting '?trk' or replacing it by ''
My final Dataframe will be :
ID Website
1 www.yah.com
2 www.gle.com
how can i do it known that i might have other options not only '?trk'
If you want to replace '?trk' only and not the '/' you can:
df['Website'] = df['Website'].str.replace('?trk', '', regex=False)
Check split
df['Website'] = df['Website'].str.split('/').str[0]
df
Out[169]:
ID Website
0 1 www.yah.com
1 2 www.gle.com
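If the unwanted part can be something other than '?trk', one option is to strip any query string with a regular expression. This is a sketch under the assumption that everything from the "?" onwards (plus a "/" directly before it) is unwanted; adjust the pattern if paths should be kept.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Website": ["www.yah.com/?trk", "www.gle.com"]})

# Drop an optional "/" followed by "?" and everything after it.
df["Website"] = df["Website"].str.replace(r"/?\?.*$", "", regex=True)
print(df)
```

Unlike splitting on '/', this keeps a path such as www.yah.com/page intact and only removes the query part.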

strings to column using python

I have entire table as string like below:
a= "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"
Inside the string there are some escaped characters like \xe9e, so the string has to be converted as well.
My expected output is to convert above string to a dataframe
as below:
id date type status description
1 20-Jan-2019 cat1 active customer is under observation
2 18-Feb-2019 cat2 active customer is genuine
My code :
b = a.splitlines()
c = pd.DataFrame([sub.split(";") for sub in b])
I am getting the following output, but I need the first row as my header and also need to convert the ascii to utf-8 text.
0 1 2 3 4 5 6
0 id date type status description None None
1 1 20-Jan-2019 cat1 active customer is underée observation None None
2 2 18-Feb-2019 cat2 active customer is genuine None None
Also, please note that it is creating extra columns with the value None, which should not be the case.
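One way to get the header row for free, as a sketch: let pandas parse the string as a semicolon-delimited file through io.StringIO. The cleanup step assumes the stray "ée" should simply be dropped, which is what the expected output in the question suggests.

```python
import io
import pandas as pd

a = "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"

# Parse the text as a semicolon-delimited file; the first line becomes the header.
df = pd.read_csv(io.StringIO(a), sep=";")

# Assumption: the stray characters are removed rather than converted.
df["description"] = df["description"].str.replace("\xe9e", "", regex=False)
print(df)
```

Because read_csv takes the column count from the header line, no extra None columns are created for this input.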
Here is a bit of a hacky answer, but given that your question isn't really clear, this should hopefully be sufficient.
import pandas as pd
import numpy as np
import re
a="id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"
b=re.split(r'\r\n',a) #split at the line breaks only, so no empty strings end up between rows
del b[-1] #also delete the last index, which we dont need
b[1:]=[re.sub(r'\xe9e', '', b[i]) for i in range(1,len(b))] #get rid of that \xe9e issue
df=pd.DataFrame([b[i].split(';') for i in range(1,len(b))]) #make the dataframe
##list comprehension allows to generalize this if you add to string##
df.columns=b[0].split(';') #split the title words for column names
df['id']=[i for i in range(1,len(b))]
df
This output is presumably what you meant by a dataframe:
