I have a dataframe that I created by hand. I am working on code that copies the dataframe and concatenates the new dataframe onto the end of the first one. For now, I need the code to look through each value of the 'Name' column, which contains strings, and if there is a number in the string, increase that number by 1. I need the number converted to an int so that I can write a function that looks through the dataframe and automatically adds 1 to the largest number it finds. An example:
import pandas as pd
import re

data = {'ID': [1, 2, 3, 4],
        'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']}
df = pd.DataFrame(data)
df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
Afterwards, the new df should look like this:
data2 = {'ID': [1, 2, 3, 4, 5, 6, 7, 8],
         'Name': ['BN #1', 'HHC', 'A comp', 'B Comp', 'BN #2', 'HHC', 'A comp', 'B Comp']}
When I run this, I receive a "'NoneType' object is not subscriptable" error. This makes sense, because only the 'BN #1' row has a number and re.search returns None when the pattern does not match, but I cannot figure out how to tell Python to ignore the other rows.
EDIT
Only the first row of each dataframe will increase by 1, so if there is an easier way that does not use re.search, that is fine. I know there are a couple of ways of doing this, but I want to be able to always look through the string value of the BN row and increase it by 1 every time I run the code.
REGEX EDIT
import re
import numpy as np

df2['BaseName'] = [re.sub(r'\d', '', x) for x in df2['Name'].values]
df['BaseName'] = [re.sub(r'\d', '', x) for x in df['Name'].values]

df2['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df2['Name'].values]
# df2['SysNum'] = df2['Name'].get(r'(?<=#)\d').astype(int)
# df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
df['SysNum'] = df['Name'].str.contains(r'(?<=#)\d').astype(int)

m = re.search(r'(?<=#)\d', df2['Name'].iloc[0])
if m:
    df2['SysNum'] = int(m.group(0)) + 1
n = re.search(r'(?<=#)\d', df['Name'].iloc[0])
if n:
    df['SysNum'] = int(n.group(0)) + 1

new_names = df2['BaseName'].unique()
maxes2 = np.zeros((len(new_names),))
for j in range(len(new_names)):
    un2 = new_names[j]
    maxes2[j] = df['SysNum'].loc[df['BaseName'] == un2].max()
    sel = df2['BaseName'] == un2
    df2['SysNum'].loc[sel] = np.linspace(1, sel.sum(), sel.sum())
    df2['SysNum'].loc[sel] += maxes2[j]
    newnames2 = [s + '%d' % num for s, num in zip(df2['BaseName'].loc[sel].values,
                                                  df2['SysNum'].loc[sel].values)]
    df2['Name'].loc[sel] = newnames2
I have this code working for two dataframes, and the numbering works out how I would like it to. Those first two dataframes use a "Name-###" naming convention for all of their rows, which lets the commented-out re.search lines run just fine. The next two dataframes I am working on are like the examples above: only the 'BN #1' row has a number, and the rest of the names do not. When I run the commented-out re.search lines on those, the code tries to convert the NoneTypes to int and fails. When I run the code as it is now, a new number is put on each and every row immediately following the name, but I only need it to add a new number to the row with the # sign. So what I am struggling with is code that: looks through the dataframe for a # sign, turns the number after the # sign into an int, finds the max of those ints and adds 1 to it, writes that new number into the new dataframe, and concatenates the new dataframe onto the old one to build a larger master list.
You can access the value on the first row of the Name column using df['Name'].iloc[0].
Thus, you can search for a sequence of digits after a # sign in that value using
m = re.search(r'#(\d+)', df['Name'].iloc[0])
if m:
    df['SysNum'] = int(m.group(1)) + 1
Output:
>>> df
   ID    Name  SysNum
0   1   BN #1       2
1   2     HHC       2
2   3  A comp       2
3   4  B Comp       2
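If you also need the copy-and-concatenate step from the question, here is a minimal sketch of one way to do it. This is an illustration, not the poster's code: it assumes, as in the example data, that only the 'BN #...' row carries a number.

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']})

# Extract the digits after '#' where present; rows without a '#' become NaN.
nums = df['Name'].str.extract(r'#(\d+)', expand=False).astype(float)
next_num = int(nums.max()) + 1  # largest number found, plus one

# Copy the frame, bump the numbered name, and append the copy to the original.
df2 = df.copy()
df2['Name'] = df2['Name'].str.replace(r'#\d+', '#%d' % next_num, regex=True)
df2['ID'] = df2['ID'] + len(df)
master = pd.concat([df, df2], ignore_index=True)

This reproduces the data2 layout shown earlier, with 'BN #2' in the second copy.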
I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
# Objective: Replace "&FolderCTID", delete all string after
string1 = "&FolderCTID"
# Method 1
df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
# Method 2
df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
# Method 3
df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search on Google and found similar solutions, but none of them worked.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
First, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reassign the output to your original df.

cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df_MasterData[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)

If you want to replace these columns in your target df, then simply do:

df_MasterData[cols_to_slice] = df_MasterData[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
You should first get the index of the string using str.find:

# end position of string1 in each cell (this keeps string1 in the result)
indexes = len(string1) + df_MasterData[i].str.find(string1)
# if you don't want string1 in the result, use the start position instead
indexes = df_MasterData[i].str.find(string1)

Since indexes is a Series (one position per row), slice element-wise:

df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]
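As a quick sanity check on the prefix-keeping idea, here is a minimal sketch using a shortened stand-in URL rather than the full Path1 value:

import pandas as pd

string1 = "&FolderCTID"
s = pd.Series(["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?RootFolder=X&FolderCTID=0x012000&View=Y"])

# keep everything before string1
print(s.str.split(string1).str[0].iloc[0])
# https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?RootFolder=X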
I have a large .csv file that has 11,000,000 rows and 3 columns: id, magh, mixid2.
What I have to do is select the rows with the same id and then check whether these rows have the same mixid2; if they do, I remove the rows, and if they don't, I initialize a class with the information of the selected rows.
This is my code:
import numpy as np
from tqdm import tqdm

obs = obs.set_index('id')
obs = obs.sort_index()

# dropping elements with only one mixid2 and filling S
ID = obs.index.unique()
S = []
good_bye_list = []
for i in tqdm(ID):
    app = obs.loc[i]
    if len(np.unique(app['mixid2'])) != 1:
        # fill the class list ('star' and 'z_in' are defined elsewhere in the script)
        S.append(star(app['magh'].values, app['mixid2'].values, z_in))
    else:
        # drop
        good_bye_list.append(i)
obs = obs.drop(good_bye_list)
The .csv file is very large, so it takes 40 minutes to compute everything.
How can I improve the speed?
Thank you for the help.
This is the .csv file:
id,mixid2,magh
3447001203296326,557,14.25
3447001203296326,573,14.25
3447001203296326,525,14.25
3447001203296326,541,14.25
3447001203296330,540,15.33199977874756
3447001203296330,573,15.33199977874756
3447001203296333,172,17.476999282836914
3447001203296333,140,17.476999282836914
3447001203296333,188,17.476999282836914
3447001203296333,156,17.476999282836914
3447001203296334,566,15.626999855041506
3447001203296334,534,15.626999855041506
3447001203296334,550,15.626999855041506
3447001203296338,623,14.800999641418455
3447001203296338,639,14.800999641418455
3447001203296338,607,14.800999641418455
3447001203296344,521,12.8149995803833
3447001203296344,537,12.8149995803833
3447001203296344,553,12.8149995803833
3447001203296345,620,12.809000015258787
3447001203296345,543,12.809000015258787
3447001203296345,636,12.809000015258787
3447001203296347,558,12.315999984741213
3447001203296347,542,12.315999984741213
3447001203296347,526,12.315999984741213
3447001203296352,615,12.11299991607666
3447001203296352,631,12.11299991607666
3447001203296352,599,12.11299991607666
3447001203296360,540,16.926000595092773
3447001203296360,556,16.926000595092773
3447001203296360,572,16.926000595092773
3447001203296360,524,16.926000595092773
3447001203296367,490,15.80799961090088
3447001203296367,474,15.80799961090088
3447001203296367,458,15.80799961090088
3447001203296369,639,15.175000190734865
3447001203296369,591,15.175000190734865
3447001203296369,623,15.175000190734865
3447001203296369,607,15.175000190734865
3447001203296371,460,14.975000381469727
3447001203296373,582,14.532999992370605
3447001203296373,614,14.532999992370605
3447001203296373,598,14.532999992370605
3447001203296374,184,14.659000396728516
3447001203296374,203,14.659000396728516
3447001203296374,152,14.659000396728516
3447001203296374,136,14.659000396728516
3447001203296374,168,14.659000396728516
3447001203296375,592,14.723999977111815
3447001203296375,608,14.723999977111815
3447001203296375,624,14.723999977111815
3447001203296375,92,14.723999977111815
3447001203296375,76,14.723999977111815
3447001203296375,108,14.723999977111815
3447001203296375,576,14.723999977111815
3447001203296376,132,14.0649995803833
3447001203296376,164,14.0649995803833
3447001203296376,180,14.0649995803833
3447001203296376,148,14.0649995803833
3447001203296377,168,13.810999870300293
3447001203296377,152,13.810999870300293
3447001203296377,136,13.810999870300293
3447001203296377,184,13.810999870300293
3447001203296378,171,13.161999702453613
3447001203296378,187,13.161999702453613
3447001203296378,155,13.161999702453613
3447001203296378,139,13.161999702453613
3447001203296380,565,13.017999649047852
3447001203296380,517,13.017999649047852
3447001203296380,549,13.017999649047852
3447001203296380,533,13.017999649047852
3447001203296383,621,13.079999923706055
3447001203296383,589,13.079999923706055
3447001203296383,605,13.079999923706055
3447001203296384,541,12.732000350952148
3447001203296384,557,12.732000350952148
3447001203296384,525,12.732000350952148
3447001203296385,462,12.784000396728516
3447001203296386,626,12.663999557495115
3447001203296386,610,12.663999557495115
3447001203296386,577,12.663999557495115
3447001203296389,207,12.416000366210938
3447001203296389,255,12.416000366210938
3447001203296389,223,12.416000366210938
3447001203296389,239,12.416000366210938
3447001203296390,607,12.20199966430664
3447001203296390,591,12.20199966430664
3447001203296397,582,16.635000228881836
3447001203296397,598,16.635000228881836
3447001203296397,614,16.635000228881836
3447001203296399,630,17.229999542236328
3447001203296404,598,15.970000267028807
3447001203296404,631,15.970000267028807
3447001203296404,582,15.970000267028807
3447001203296408,540,16.08799934387207
3447001203296408,556,16.08799934387207
3447001203296408,524,16.08799934387207
3447001203296408,572,16.08799934387207
3447001203296409,632,15.84000015258789
3447001203296409,616,15.84000015258789
Hello and welcome to StackOverflow.
In pandas, the rule of thumb is that raw loops are always slower than the dedicated functions. To apply a function to a sub-DataFrame of rows that fulfill certain criteria, you can use groupby.
In your case the function is a bit... unpythonic, as appending to S is a side effect and deleting rows you are currently iterating over is dangerous (with a dictionary, for example, you should never do this). That said, you can create a function like this:
In [37]: def my_func(df):
    ...:     if df['mixid2'].nunique() == 1:
    ...:         return None
    ...:     else:
    ...:         S.append(df['mixid2'])
    ...:         return df
and apply it to your DataFrame via
S = []
obs.groupby('id').apply(my_func)
This iterates over all sub-DataFrames with the same id and drops them if there is exactly one unique value in mixid2; otherwise it appends the values to the list S.
The resulting DataFrame is 3 rows shorter:
Out[38]:
                                       id  mixid2       magh
id
3447001203296326 0       3447001203296326     557  14.250000
                 1       3447001203296326     573  14.250000
...                                   ...     ...        ...
3447001203296409 98      3447001203296409     632  15.840000
                 99      3447001203296409     616  15.840000

[97 rows x 3 columns]
and S contains 28 elements, which you could pass into the star constructor just as you did.
I guess you want to group by mixid2 and, using set_index, exclude all the elements where mixid2 appears only once. To get back the original shape, we use reset_index after the filtering.

df = obs.set_index('mixid2').loc[~obs.groupby('mixid2').count().id.eq(1)].reset_index()
df.shape
(44, 3)
I'm not entirely sure if I understood you correctly, but what you can do is first remove the duplicates in your dataframe and then use the groupby function to get all the remaining data points with the same id:

# drop all duplicates based on id and mixid2
df.drop_duplicates(["id", "mixid2"], inplace=True)

# then iterate over all groups:
for index, grp in df.groupby(["id"]):
    pass  # do stuff with the grp here
Normally it is a good idea to rely on pandas internal functions, since they are mostly optimised quite well.
new_df = obs.groupby(['id', 'mixid2'], as_index=False).agg('count')
new_df = new_df[new_df['magh'] > 1]

Then pass new_df to your function.
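For completeness, here is a vectorized sketch of the whole filtering step (assuming obs is the DataFrame read from the .csv above; the file name is hypothetical):

import pandas as pd

obs = pd.read_csv('obs.csv')  # hypothetical path to the data shown above

# number of distinct mixid2 values per id, broadcast back to every row
nuniq = obs.groupby('id')['mixid2'].transform('nunique')
obs = obs[nuniq > 1]  # keep only ids whose rows do not all share one mixid2

groupby().transform('nunique') avoids the Python-level loop entirely, which is usually the main cost at 11 million rows.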
I have a DataFrame with a column that contains integers and sometimes a string of multiple comma-separated numbers (like "1234567, 89012345, 65425774").
I want to convert such a string into a list of integers so it is easier to search for specific numbers.
In [1]: import pandas as pd
In [2]: raw_input = "1111111111 666 10069759 9695011 9536391,2261003 9312405 15542804 15956127 8409044 9663061 7104622 3273441 3336156 15542815 15434808 3486259 8469323 7124395 15956159 3319393 15956184
: 15956217 13035908 3299927"
In [3]: df = pd.DataFrame({'x':raw_input.split()})
In [4]: df.head()
Out[4]:
                 x
0       1111111111
1              666
2         10069759
3          9695011
4  9536391,2261003
Since your column contains both strings and integers, you probably want something like this:

def to_integers(column_value):
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value

df.loc[:, 'column_name'] = df.loc[:, 'column_name'].apply(to_integers)
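A quick usage sketch for the searching part (assuming the 'x' column from the question; every cell there is a string, so to_integers turns each one into a list of ints):

target = 2261003
lists = df['x'].apply(to_integers)
# handles both list cells and plain int cells
mask = lists.apply(lambda v: target in v if isinstance(v, list) else v == target)
print(df[mask])  # matches row 4: 9536391,2261003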
Your best solution to cases like this, where a column has 1 or more values, is splitting the data into multiple columns.
Try something like

ids = df.ID.str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]
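With the values split into columns, a membership search becomes a row-wise any() (a sketch; the ID1, ID2, ... names come from the loop above, and the split values are still strings):

target = '2261003'
id_cols = ['ID' + str(i + 1) for i in range(ids.shape[1])]
mask = df[id_cols].eq(target).any(axis=1)
print(df[mask])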
I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd

data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)

totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to enter a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appears, the count value will increase by one.
Is there a way to do this?
Perhaps something along the lines of

totalCount = 0
words_of_interest = ['cat', 'dog', 'foo', 'bar']
for row in df['rowName']:
    if type(row) == str:
        if sum([word in row for word in words_of_interest]) > 0:
            totalCount += 1
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
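To extend this to a list of words, one option (a sketch, assuming the words contain no regex metacharacters; otherwise wrap each in re.escape) is to join them into a single alternation pattern:

words_of_interest = ['cat', 'dog', 'foo', 'bar']
pattern = '|'.join(words_of_interest)  # 'cat|dog|foo|bar'
# number of rows containing at least one of the words
total = df['rowName'].astype(str).str.contains(pattern).sum()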
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following:

if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count from an external list:

strings = ['string1', 'string2', 'string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
Here is an example:

import io
import pandas as pd

filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""

df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
            animal  amount
0    ['cat','dog']       2
1  ['cat','horse']       2
Search for the word cat (looping over all columns as Series):

search = "cat"
# sum the True values for each series,
# then wrap a sum around all the sums: sum([2, 0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2