Separate several columns of data that contain hyphens, removing elements in Python - python

I have a dataset, df, where I would like to separate strings within Python.
Data
Type Id
aa - generation aa - generation01
aa_led - generation aa_led - generation01
ss - generation ss- generation01
Desired
Type Id
aa aa01
aa_led aa_led01
ss ss01
Doing
I am trying to incorporate this code into my script. It splits on the hyphen, but my column
names are not preserved.
new = wordstring.strip('-').split('-')
Any suggestion is appreciated
Thank you

If you just want to remove "- generation" from every value in df, you can use applymap (renamed to DataFrame.map in pandas 2.1):
df = df.applymap(lambda x : x.replace('- generation', '').replace(' ',''))
OUTPUT:
Type Id
0 aa aa01
1 aa_led aa_led01
2 ss ss01
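As an alternative sketch, the same cleanup can be done with a regular expression applied column by column, which tolerates inconsistent spacing around the hyphen (as in "ss- generation01"). The frame below is rebuilt from the question's data, and the name cleaned is only illustrative; this assumes every column holds strings.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Type": ["aa - generation", "aa_led - generation", "ss - generation"],
    "Id": ["aa - generation01", "aa_led - generation01", "ss- generation01"],
})

# Remove "- generation" plus any whitespace around the hyphen, in every column.
# Column names are untouched because only the values are rewritten.
cleaned = df.apply(lambda col: col.str.replace(r"\s*-\s*generation", "", regex=True))
print(cleaned)
```

Unlike replacing every space, this pattern only removes the whitespace next to the hyphen, so values that legitimately contain spaces elsewhere would be left alone.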

Related

Extract multiple pattern from string pandas

Create new column with the extracted middle and last strings from a column within a dataset.
Data
Status ID
Ok hello_dd
Ok hello_aa_now
No standard_cc
no standard_ee_not
Desired
Status ID type
Ok hello_dd dd
Ok hello_aa_now aa
No standard_cc cc
no standard_ee_not ee
Doing
I am able to extract the last string, however, still researching how to extract the middle string.
df['type'] = df['ID'].str.strip('_').str[-1]
Any suggestion is appreciated.
Assuming you want to extract the string after the first _:
df['type'] = df['ID'].str.extract(r'_([^_]+)')
With split:
df['type'] = df['ID'].str.split('_').str[1]
output:
Status ID type
0 Ok hello_dd dd
1 Ok hello_aa_now aa
2 No standard_cc cc
3 no standard_ee_not ee
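A self-contained sketch comparing the two approaches on the question's data. The variable names by_extract and by_split are only illustrative. Both pick the segment right after the first underscore, whether or not more segments follow.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Status": ["Ok", "Ok", "No", "no"],
    "ID": ["hello_dd", "hello_aa_now", "standard_cc", "standard_ee_not"],
})

# Regex: capture the run of non-underscore characters after the first "_"
by_extract = df["ID"].str.extract(r"_([^_]+)", expand=False)

# Split: break on "_" and take the second piece (index 1)
by_split = df["ID"].str.split("_").str[1]

df["type"] = by_extract
print(df)
```

One difference worth knowing: for an ID without any underscore, both variants yield a missing value (NaN) rather than raising an error.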

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Each two-letter code preceding a bubbled section ( ) is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the bubbles. I want the data to look like:
pn io ta pt cn cs
2021 302 Yes Blue John Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got this error: "missing ), unterminated subpattern at position 2". I assume it has something to do with regular expressions.
Is there an easy way to split these? Thanks.
Use extract with named capturing groups:
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then constructing a dataframe out of it. I can build it in one nested comprehension:
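import re
# non-greedy (.+?) so each match stops at the first closing parenthesis
rows = [{r[0]:r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
# copy each extracted column into the original data frame
for col in subtable.columns:
    df[col] = subtable[col].values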
Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
David
Try with extractall:
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, different way of packaging final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall(r'([^\(]+)\(([^\)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])
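On the error from the question itself: "missing ), unterminated subpattern" appears because 'cs(' was interpreted as a regular expression, in which an unescaped ( opens a group. A minimal sketch of the original attempt with the patterns treated as literal text, assuming pandas 1.4 or later (where str.split accepts regex=False); the names parts and size are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]})

# regex=False makes pandas treat "cs(" as plain text, not as a pattern
parts = df["Creative"].str.split("cs(", expand=True, regex=False)

# Same idea for the closing parenthesis
size = parts[1].str.replace(")", "", regex=False)
print(size)
```

This only explains and avoids the error; for splitting all six fields at once, the extract and findall answers above are the more direct route.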

Reading in text file with time column which is separated by commas?

I have a txt-file with data that looks like this
A,B,C,Time
xyz,1,MN,14/11/20 17:20:08,296000000
tuv,0,ST,30/12/20 11:11:18,111111111
I read the data in using this code:
df = pd.read_csv('path/to/file',delimiter=',')
This does not work correctly because the Time column itself contains a comma. How can I solve this, and how can I make it work even if I have multiple columns with such a time format?
I would like to get a dataframe which looks like this:
A B C Time
xyz 1 MN 14/11/20 17:20:08,296000000
tuv 0 ST 30/12/20 11:11:18,111111111
Thanks a lot!
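Because each data row has one more field than the header, read_csv uses the first field as the index and shifts the remaining values one column to the left. Use the reset_index(), apply() and drop() methods to undo that: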
df=df.reset_index()
df['Time']=df[['C','Time']].astype(str).apply(','.join,1)
df=df.drop(columns=['C'])
df.columns=['A','B','C','Time']
Now if you print df you will get the desired output:
A B C Time
0 xyz 1 MN 14/11/20 17:20:08,296000000
1 tuv 0 ST 30/12/20 11:11:18,111111111
If you wish to write it back to a txt file, use:
df.to_csv('filename.txt',sep='|',index=False)
Note: you can't use ',' or ' ' as the sep parameter, because both characters occur inside the Time values and would create the same problem when you load the txt/csv file again.
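A different way to get the same result, sketched under the assumption that every row has exactly one extra comma inside the time value: skip the header line and supply the column names yourself, with one extra name for the fractional part. The name "Frac" is hypothetical, and io.StringIO stands in for the real file path.

```python
import io
import pandas as pd

# Stand-in for the text file from the question
raw = (
    "A,B,C,Time\n"
    "xyz,1,MN,14/11/20 17:20:08,296000000\n"
    "tuv,0,ST,30/12/20 11:11:18,111111111\n"
)

# Skip the 4-name header and give 5 names so nothing is shifted into the index.
# "Frac" (hypothetical name) is read as text to keep any leading zeros.
df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,
    header=None,
    names=["A", "B", "C", "Time", "Frac"],
    dtype={"Frac": str},
)

# Glue the two halves of the timestamp back together and drop the helper column
df["Time"] = df["Time"] + "," + df["Frac"]
df = df.drop(columns="Frac")
print(df)
```

With several such time columns, each one would need its own extra helper name in the list, followed by the same join-and-drop step.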

strings to column using python

I have entire table as string like below:
a= "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"
The string contains escaped non-ASCII characters such as \xe9e, so these have to be handled during the conversion.
My expected output is to convert above string to a dataframe
as below:
id date type status description
1 20-Jan-2019 cat1 active customer is under observation
2 18-Feb-2019 cat2 active customer is genuine
My code :
b = a.splitlines()
c = pd.DataFrame([sub.split(";") for sub in b])
I am getting the following output, but I need the first row as my header and also need to convert the escaped characters to UTF-8 text.
0 1 2 3 4 5 6
0 id date type status description None None
1 1 20-Jan-2019 cat1 active customer is underée observation None None
2 2 18-Feb-2019 cat2 active customer is genuine None None
Also, please note that it creates extra columns with the value None, which should not be the case.
Here is a bit of a hacky answer, but given that your question isn't really clear, this should hopefully be sufficient.
import pandas as pd
import numpy as np
import re
a="id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"
b=a.splitlines() #split into lines; handles \r\n without leaving empty entries
b[1:]=[re.sub(r'\xe9e', '', row) for row in b[1:]] #get rid of that \xe9e issue
df=pd.DataFrame([row.split(';') for row in b[1:]]) #make the dataframe
##list comprehension allows to generalize this if you add to string##
df.columns=b[0].split(';') #split the title words for column names
df['id']=df['id'].astype(int) #ids were read as strings; make them integers
df
This output is presumably what you meant by a dataframe.
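Since the string is already in delimited-text form, another option is to let pandas parse it directly. This is a minimal sketch, assuming the goal is simply to drop the "\xe9e" characters as in the expected output; io.StringIO wraps the string so read_csv can treat it like a file.

```python
import io
import pandas as pd

a = "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"

# The first line becomes the header; ";" is the field separator
df = pd.read_csv(io.StringIO(a), sep=";")

# Remove the unwanted characters as plain text (no regex needed)
df["description"] = df["description"].str.replace("\xe9e", "", regex=False)
print(df)
```

Parsing this way also avoids the extra None columns, because the header row fixes the number of columns.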

Extract prefix from string in dataframe column where exists in a list

Looking for some help.
I have a pandas dataframe column and I want to extract the prefix where such prefix exists in a separate list.
pr_list = ['1 FO-','2 IA-']
Column in df is like
PartNumber
ABC
DEF
1 FO-BLABLA
2 IA-EXAMPLE
What I am looking for is to extract the prefix where present, put in a new column and leave the rest of the string in the original column.
PartNumber Prefix
ABC
DEF
BLABLA 1 FO-
EXAMPLE 2 IA-
I have tried things like str.startswith, but I am a bit of a Python novice and wasn't able to get it to work.
much appreciated
EDIT
Both solutions below work on the test data, however I am getting an error
error: nothing to repeat at position 16
Which suggests something askew in my dataset. Not sure what position 16 refers to but looking at both the prefix list and PartNumber column in position 16 nothing seems out of the ordinary?
EDIT 2
I have traced it to an * in pr_list, which seems to be throwing it off. Is * some reserved character? Is there a way to escape it so it is read as text?
You can try:
df['Prefix']=df.PartNumber.str.extract(r'({})'.format('|'.join(pr_list))).fillna('')
df.PartNumber=df.PartNumber.str.replace('|'.join(pr_list),'',regex=True)
print(df)
PartNumber Prefix
0 ABC
1 DEF
2 BLABLA 1 FO-
3 EXAMPLE 2 IA-
Maybe it's not what you are looking for, but it may help.
import pandas as pd
pr_list = ['1 FO-','2 IA-']
df = pd.DataFrame({'PartNumber':['ABC','DEF','1 FO-BLABLA','2 IA-EXAMPLE']})
extr = '|'.join(x for x in pr_list)
df['Prefix'] = df['PartNumber'].str.extract('('+ extr + ')', expand=False).fillna('')
df['PartNumber'] = df['PartNumber'].str.replace('|'.join(pr_list), '', regex=True)
df
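Regarding EDIT 2: yes, * is a regex metacharacter (a repetition operator), and "nothing to repeat" is what the regex engine reports when it has nothing in front of it to repeat. A hedged sketch of the same approach with every prefix passed through re.escape so it is matched as literal text. The prefix "*X-" and the part "*X-WIDGET" are made-up examples, not from the original data, and the ^ anchor is an added assumption that prefixes only count at the start of the string.

```python
import re
import pandas as pd

# "*X-" is a hypothetical prefix containing a regex metacharacter
pr_list = ["1 FO-", "2 IA-", "*X-"]
df = pd.DataFrame(
    {"PartNumber": ["ABC", "DEF", "1 FO-BLABLA", "2 IA-EXAMPLE", "*X-WIDGET"]}
)

# Escape each prefix, join as alternatives, and anchor at the start of the string
pattern = "^(" + "|".join(re.escape(p) for p in pr_list) + ")"

df["Prefix"] = df["PartNumber"].str.extract(pattern, expand=False).fillna("")
df["PartNumber"] = df["PartNumber"].str.replace(pattern, "", regex=True)
print(df)
```

If one prefix is the beginning of another, sorting pr_list by length (longest first) before joining would be a sensible precaution, since regex alternation takes the first alternative that matches.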
