Splitting a column value into three with two delimiters in pandas - Python

I have written an Excel file with one column whose values look like this:
col1
22125051|2/136|Possible Match
nan|3/4|Not Match
22125051|1/26|Match
These values were originally in different columns; I combined them into one using .apply() and .join(), adding a | delimiter to separate the values.
Now I want to split the column back into its parts and put each part into a specific column of an existing .xlsx file.
Say df3 = pd.read_excel('type_primary_data.xlsx')
with the target columns being df3.columns[37], df3.columns[39], and df3.columns[40].
Desired output
svc_no port Result
22125051 2/136 Possible Match
nan 3/4 Not Match
22125051 1/26 Match
I am not sure of the best way to do this in pandas.
UPDATE
It turns out that I need to match the adsl column against an existing .xlsx file.
So, where the adsl matches the said column, I also want to get the svc_no and the comparison result along with the matched adsl.
My output should be:
adsl svc_no port Result
3/4 nan 3/4 Not Match
1/26 22125051 1/26 Match
2/136 22125051 2/136 Possible Match
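A merge keyed on the adsl value gives that shape. Here is a minimal sketch, assuming the combined column has already been split into svc_no/port/Result and that the existing file's column is literally named adsl (the column names here are assumptions):
import pandas as pd

df = pd.DataFrame({'col1': ['22125051|2/136|Possible Match',
                            'nan|3/4|Not Match',
                            '22125051|1/26|Match']})

# split the combined column into its three parts
parts = df['col1'].str.split('|', expand=True)
parts.columns = ['svc_no', 'port', 'Result']

# keep the rows whose port matches an adsl value in the existing file
df3 = pd.read_excel('type_primary_data.xlsx')
out = df3[['adsl']].merge(parts, left_on='adsl', right_on='port')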

Try using the Series.str.split method:
df = df['col1'].str.split('|', expand=True)
Then rename the columns, since they will be integers, with:
df = df.rename(columns={0: 'svc_no', 1: 'port', 2: 'Result'})
Try that. I can't comment because of reputation, but I think that's what you are looking for.
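Putting that together with the target columns from the question (the column positions 37/39/40 come from the question; writing by row order assumes both frames are aligned, which is an assumption here):
parts = df['col1'].str.split('|', expand=True)
df3 = pd.read_excel('type_primary_data.xlsx')
# write the three pieces into the existing frame by column position
df3[df3.columns[37]] = parts[0].values
df3[df3.columns[39]] = parts[1].values
df3[df3.columns[40]] = parts[2].values
df3.to_excel('type_primary_data.xlsx', index=False)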

Option 1
I'm a fan of using extract with named groups in the regex pattern:
pat = r'(?P<svc_no>.*)\|(?P<port>.*)\|(?P<Result>.*)'
df.col1.str.extract(pat, expand=True)
svc_no port Result
0 22125051 2/136 Possible Match
1 nan 3/4 Not Match
2 22125051 1/26 Match
Option 2
cols = dict(enumerate('svc_no port Result'.split()))
df.col1.str.extractall('([^|]+)')[0].unstack().rename(columns=cols)
match svc_no port Result
0 22125051 2/136 Possible Match
1 nan 3/4 Not Match
2 22125051 1/26 Match
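Either option returns a standalone frame; to keep the new columns alongside the original, you can join the result back, for example:
df = df.join(df.col1.str.extract(pat, expand=True))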

Related

Pandas: Filter rows by regex condition

I've read several questions and answers about this, but I must be doing something wrong. I'd appreciate it if someone could point out what it might be.
In my df dataframe, the first column should always contain six digits. I'm loading the dataframe from Excel, and some smart user thought it would be funny to add a disclaimer in the first column.
So I have in the first column something like:
['123456', '456789', '147852', 'In compliance with...']
So I need to filter down to only the valid records. I'm trying:
pat = r'\d{6}'
filter = df[0].str.contains(pat, regex=True)
This returns False for the disclaimer but NaN for the matches, so df[filter] yields nothing.
What am I doing wrong?
You should be able to do that with the following; you need to select the rows based on the regex filter.
Note that the regex you are using will also match strings with more than six digits. I changed it to match exactly six digits.
df = df[df[df.columns[0]].str.contains('^[0-9]{6}$', regex=True)]
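One caveat: Series.str.contains returns NaN for cells that are not strings, which is likely why the matches came back as NaN here (Excel import parses six-digit entries as numbers, not strings). Casting to str first and passing na=False covers both cases; a sketch on made-up data:
import pandas as pd

df = pd.DataFrame({0: [123456, 456789, 147852, 'In compliance with...']})

# cast to str so numeric cells can be tested, and count missing values as non-matches
mask = df[0].astype(str).str.contains(r'^\d{6}$', na=False)
df = df[mask]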

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that match a certain pattern in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (Where xx are 2 numbers).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around a bit with different methods but can't seem to get it.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped parentheses \( \) and $ for end of string, and filter with boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72
df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only (Dxx), anchored at the start of the string, use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
Title
0 (D89)
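On pandas 1.1 and newer, Series.str.fullmatch anchors at both ends, so the trailing $ can be dropped (a sketch, same data):
df2 = df[df['Title'].str.fullmatch(r'\(D\d{2}\)')]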

Get rid of initial spaces at specific cells in Pandas

I am working with a big dataset (more than 2 million rows x 10 columns) that has a column with string values that were filled oddly. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset, I would use a for loop (I know it's far from optimal), but now that's not an option given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
Edit: I can't get rid of all the whitespace, since the strings themselves contain spaces.
df['col1'] = df['col1'].apply(lambda x: x.strip())
might help
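The vectorized string accessor does the same thing without a Python-level lambda and, unlike the bare .strip() call, won't raise on missing values; a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'col1': ['   string   ', ' string ', 'string', None]})

# str.strip trims only leading/trailing whitespace, so spaces inside the strings survive
df['col1'] = df['col1'].str.strip()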

Match all values str column dataframe with other dataframe str column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality, my dataframe 2 has 5000 rows, so I cannot manually copy-paste all of this. Basically, I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
You could get the unique values from the INFO column of the second dataframe, join them into a single regex alternation, and then use Series.str.contains on the TEXT column of the first. (The original snippet used DataFrame.eval for this, but eval does not support string-method calls, so a plain assignment is used here:)
pattern = '|'.join(df2['INFO'].unique())
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)
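If the INFO values can contain regex metacharacters (dots, parentheses, pipes), escape them before joining, or the pattern will misfire; a sketch:
import re

pattern = '|'.join(re.escape(v) for v in df2['INFO'].unique())
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)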

pandas str.extractall find unknown number of groups / regex

After some searching I seem to be coming up a bit blank. I'm also a total regex simpleton...
I have a csv file with data like this:
header1 header2
row1 "asdf (qwer) asdf"
row2 "asdf (hghg) asdf (lkjh)"
row3 "asdf (poiu) mkij (vbnc) yuwuiw (hjgk)"
I've put double quotes around the rows in header2 for clarity that it is one field.
I want to extract each occurrence of words between brackets (). There will be at least one occurrence per row, but I don't know ahead of time how many bracketed words will appear in each line.
Using the wonderful https://www.regextester.com/, I think the regex I need is \(.*?\)
But I keep getting:
ValueError: pattern contains no capture groups
The code I used was:
pattern = r'\(.*?\)'
extracted = df.loc[:, 'header2'].str.extractall(pattern)
Any help appreciated, thanks.
You need to include a capture group inside the parentheses. Also, when using extractall, I'd use unstack so the result matches the structure of your DataFrame:
df.header2.str.extractall(r'\((.*?)\)').unstack()
0
match 0 1 2
0 qwer NaN NaN
1 hghg lkjh NaN
2 poiu vbnc hjgk
If you're concerned about performance, don't use pandas string operations:
import re
pd.DataFrame([re.findall(r'\((.*?)\)', row) for row in df.header2])
0 1 2
0 qwer None None
1 hghg lkjh None
2 poiu vbnc hjgk
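Series.str.findall is a middle ground: it stays within pandas but returns a list per row instead of separate columns, which can be handy if downstream code iterates anyway (a sketch):
df['words'] = df.header2.str.findall(r'\((.*?)\)')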
