pandas str.extractall find unknown number of groups / regex - python

After some searching I seem to be coming up a bit blank. I'm also a total regex simpleton...
I have a csv file with data like this:
header1 header2
row1 "asdf (qwer) asdf"
row2 "asdf (hghg) asdf (lkjh)"
row3 "asdf (poiu) mkij (vbnc) yuwuiw (hjgk)"
I've put double quotes around the rows in header2 for clarity that it is one field.
I want to extract each occurrence of words between brackets (). There will be at least one occurrence per row, but I don't know ahead of time how many occurrences of bracketed words will appear in each line.
Using the wonderful https://www.regextester.com/, I think the regex I need is \(.*?\)
But I keep getting:
ValueError: pattern contains no capture groups
The code I used was:
pattern = r'\(.*?\)'
extracted = df.loc[:, 'header2'].str.extractall(pattern)
Any help appreciated.
thanks

You need to include a capture group inside the parentheses. Also, when using extractall, I'd use unstack so it matches the structure of your DataFrame:
df.header2.str.extractall(r'\((.*?)\)').unstack()
0
match 0 1 2
0 qwer NaN NaN
1 hghg lkjh NaN
2 poiu vbnc hjgk
If you're concerned about performance, don't use pandas string operations:
import re
pd.DataFrame([re.findall(r'\((.*?)\)', row) for row in df.header2])
0 1 2
0 qwer None None
1 hghg lkjh None
2 poiu vbnc hjgk
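For completeness, a minimal reproduction of the accepted approach; the DataFrame below is a hypothetical stand-in for the real CSV, built from the three example rows:
import pandas as pd

df = pd.DataFrame({'header2': ['asdf (qwer) asdf',
                               'asdf (hghg) asdf (lkjh)',
                               'asdf (poiu) mkij (vbnc) yuwuiw (hjgk)']})

# The capture group (.*?) is what extractall returns; the escaped \( and \) only anchor it.
extracted = df['header2'].str.extractall(r'\((.*?)\)').unstack()
print(extracted)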

Related

Get rid of initial spaces at specific cells in Pandas

I am working with a big dataset (more than 2 million rows x 10 columns) that has a column with string values that were filled oddly. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset I would use a for loop (I know it's far from optimal), but now it's not an option given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
Edit: I can't get rid of all whitespace, since the strings themselves contain spaces.
df['col1'].apply(lambda x: x.strip())
might help
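For the full column, a vectorized sketch with the .str accessor avoids both the loop and the per-row lambda; the sample values below are made up, and only the column name col1 comes from the question:
import pandas as pd

# Hypothetical sample mirroring the question: runs of spaces around the real text.
df = pd.DataFrame({'col1': ['   string one   ', 'string two', '  string three ']})

# str.strip removes leading and trailing whitespace only; interior spaces are kept.
df['col1'] = df['col1'].str.strip()
print(df['col1'].tolist())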

Match all values of a str column in one dataframe with another dataframe's str column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality, my real dataframe 2 has 5000 rows. So I cannot manually copy paste all of this. But basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
You could get the unique values from the INFO column of the second dataframe, convert them to a list, join them into one alternation pattern, and then use the eval method on the first dataframe with .str.contains on that pattern:
pattern = '|'.join(df2['INFO'].unique().tolist())
df1.eval("MATCH = TEXT.str.contains(@pattern)", engine='python')
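If the eval route feels awkward, an equivalent sketch without it; the column names come from the question, and str.contains here assumes the INFO values are plain text rather than regex metacharacters:
import pandas as pd

df1 = pd.DataFrame({'ITEM ID': [1, 2, 3, 4],
                    'TEXT': ['some random words', 'another word', 'blah', 'random words']})
df2 = pd.DataFrame({'INDEX': [1, 3], 'INFO': ['random', 'blah']})

# Build one alternation pattern like 'random|blah' and test every TEXT value in one pass.
pattern = '|'.join(df2['INFO'].unique())
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)
print(df1)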

Remove all rows that meet regex condition

Trying to teach myself pandas... and playing around with different dtypes.
I have a df as follows:
df = pd.DataFrame({'ID':[0,2,"bike","cake"], 'Course':['Test','Math','Store','History'] })
print(df)
ID Course
0 0 Test
1 2 Math
2 bike Store
3 cake History
The dtype of ID is of course object. What I want to do is remove any rows from the DF if the ID has a string in it.
I thought this would be as simple as:
df.ID.filter(regex='[\w]*')
but this returns everything. Is there a surefire method for dealing with such things?
You can use to_numeric:
df[pd.to_numeric(df.ID,errors='coerce').notnull()]
Out[450]:
Course ID
0 Test 0
1 Math 2
Another option is to convert the column to string and use str.match:
print(df[df['ID'].astype(str).str.match(r"\d+")])
# Course ID
#0 Test 0
#1 Math 2
Your code does not work, because as stated in the docs for pandas.DataFrame.filter:
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
Wen's answer is the correct (and fastest) way to solve this, but to explain why your regular expression doesn't work, you have to understand what \w means.
\w matches any word character, which includes [a-zA-Z0-9_]. So what you're currently matching includes digits, so everything is matched. A valid regular expression approach would be:
df.loc[df.ID.astype(str).str.match(r'\d+')]
ID Course
0 0 Test
1 2 Math
The second issue is your use of filter. It isn't filtering your ID row, it is filtering your index. A valid solution using filter would be as follows:
df.set_index('ID').filter(regex=r'^\d+$', axis=0)
Course
ID
0 Test
2 Math

Splitting column value into three with two delimiters in pandas

I have written an Excel file with one column containing values like:
col1
22125051|2/136|Possible Match
nan|3/4|Not Match
22125051|1/26|Match
These data were initially in different columns, but I wanted to combine the values of those columns into one, which I did using .apply() and .join(), adding a | delimiter to separate the values.
Now I want to split the column back into its values and put them into specific columns in an existing .xlsx file.
say df3 = pd.read_excel('type_primary_data.xlsx')
and .columns[37], .columns[39], .columns[40]
Desired output
svc_no port Result
22125051 2/136 Possible Match
nan 3/4 Not Match
22125051 1/26 Match
I am not sure what is the best way to do this in pandas.
UPDATE
It turns out that I need to match the adsl column to an existing .xlsx file.
So, when the adsl matches that column, I also want to get the svc_no and comparison result along with the matched adsl.
My output should be:
adsl svc_no port Result
3/4 nan 3/4 Not Match
1/26 22125051 1/26 Match
2/136 22125051 2/136 Possible Match
Try using the .str.split method:
df = df['col1'].str.split('|', expand=True)
Then rename the columns, since they will be numbers, with:
df.rename(columns={'oldname': 'newname'})
Try that. I can't comment because of reputation, but I think that's what you're looking for.
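A fuller sketch of that split-and-rename approach, assuming the sample data above; the column names svc_no, port and Result are taken from the desired output:
import pandas as pd

df = pd.DataFrame({'col1': ['22125051|2/136|Possible Match',
                            'nan|3/4|Not Match',
                            '22125051|1/26|Match']})

# Split on the literal | delimiter into three columns, then give them meaningful names.
parts = df['col1'].str.split('|', expand=True)
parts.columns = ['svc_no', 'port', 'Result']
print(parts)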
Option 1
I'm a fan of using extract with named groups in the regex pattern:
pat = r'(?P<svc_no>.*)\|(?P<port>.*)\|(?P<Result>.*)'
df.col1.str.extract(pat, expand=True)
svc_no port Result
0 22125051 2/136 Possible Match
1 nan 3/4 Not Match
2 22125051 1/26 Match
Option 2
cols = dict(enumerate('svc_no port Result'.split()))
df.col1.str.extractall('([^|]+)')[0].unstack().rename(columns=cols)
match svc_no port Result
0 22125051 2/136 Possible Match
1 nan 3/4 Not Match
2 22125051 1/26 Match

Python Data table

I have a data textfile called “data” that is delimited by comma in each row.
The first three lines of the file look like the following:
“(1,ABC,ABCDE)”,”24”
“(10,ABD,ABC11)”,”12”
“(6,ABE,ABERD)”,”39”
In the file, values in each row are string and integer.
I first read the file:
target=pd.read_csv(’data',sep=',',names=[‘col1’,'col2’])
What I want to see in my Target table eventually is the following 5-column table:
COLA COLB COLC col1 col2
1 ABC ABCDE (1,ABC,ABCDE) 24
10 ABD ABC11 (10,ABD,ABC11) 12
6 ABE ABERD (6,ABE,ABERD) 39
What I tried was:
for index,row in target.iterrows():
tup=tuple(row[0][1:len(row[0])-1].split(","))
target[’COLA'][index]=tup[0]
target[’COLB'][index]=tup[1]
target[’COLC'][index]=tup[2]
This is done to change the string into a tuple so that I can create new columns in the Target datatable. I will delete col1 eventually, but the code above doesn't work for some reason. It crashes...
Your code is peppered with "smart quotes": “” instead of " and ‘’ instead of '. Remove them all and replace them with the corresponding straight ("dumb") quotes, as only straight quotes have the Python meaning you're looking for.
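For illustration, a sketch of the same logic with straight quotes and without the row loop; the file contents are rebuilt in memory here and assume the file itself uses straight double quotes:
import io
import pandas as pd

# Hypothetical stand-in for the 'data' file from the question.
raw = '"(1,ABC,ABCDE)","24"\n"(10,ABD,ABC11)","12"\n"(6,ABE,ABERD)","39"\n'
target = pd.read_csv(io.StringIO(raw), sep=',', names=['col1', 'col2'])

# Strip the surrounding parentheses and split the tuple text into three new columns.
target[['COLA', 'COLB', 'COLC']] = target['col1'].str.strip('()').str.split(',', expand=True)
print(target)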
