parsing dataframe columns containing functions - python

Python/pandas newbie here. The csv file I'm trying to work with has been populated with data that looks something like this:
A B C D
Option1(item1=12345, item12='string', item345=0.123) 2020-03-16 1.234 Option2(item4=123, item56=234, item678=345)
I'd like it to look like this:
item1 item12 item345 B C item4 item56 item678
12345 'string' 0.123 2020-03-16 1.234 123 234 345
In other words, I want to replace columns A and D with new columns headed by what's on the left of the equal sign, using what's to the right of the equal sign as the corresponding value, and with the Option1() and Option2() parts and the commas stripped out. The columns that don't contain functions should be left as is.
Is there an elegant way to do this?
Actually, at this point, I'd settle for any old way, elegant or not; I've found various ways of dealing with this situation if, say, there were dicts populating columns, but nothing to help me pick it apart if there are functions there. Trying to search for the answer only gives me a bunch of results for how to apply functions to dataframes.

As long as your functions always have the same arguments, this should work.
You can read the CSV with the following (assuming the separators are two or more spaces, which is what I get when I paste the example from your question):
import pandas as pd
df = pd.read_csv('test.csv', sep=r'\s{2,}', index_col=False, engine='python')
If your dataframe is df:
# break out both sides of the equal sign in function into columns
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')
# get rid of the multi-index and put the values after '=' into columns
A_converted = A_vals.unstack(level=-1)[1]
# set column names to values before '='
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])
# same thing for 'D'
D_vals = df['D'].str.extractall(r'([\w\d]+)=([^,\)]*)')
D_converted = D_vals.unstack(level=-1)[1]
D_converted.columns = list(D_vals.unstack(level=-1)[0].values[0])
# join everything together
df = A_converted.join(df.drop(['A','D'], axis=1)).join(D_converted)
Some clarification on the regex: '([\w\d]+)=([^,\)]*)' has two capture groups (each part in parens):
Group 1 ([\w\d]+) is one or more characters (+) that are word characters \w or numbers \d.
= between groups.
Group 2 ([^,\)]*) is 0 or more characters (*) that are not (^) a comma , or paren \).
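As a quick sanity check of the pattern outside pandas (using the sample value from column A in the question):
import re

sample = "Option1(item1=12345, item12='string', item345=0.123)"
print(re.findall(r'([\w\d]+)=([^,\)]*)', sample))
# [('item1', '12345'), ('item12', "'string'"), ('item345', '0.123')]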

I believe you're looking for something along these lines:
contracts = ["Option(conId=384688665, symbol='SPX', lastTradeDateOrContractMonth='20200116', strike=3205.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 200117P03205000', tradingClass='SPX')",
"Option(conId=12345678, symbol='DJX', lastTradeDateOrContractMonth='20200113', strike=1205.0, right='P', multiplier='200', exchange='SMART', currency='USD', localSymbol='DJXX 333117Y13205000', tradingClass='DJX')"]
new_conts = []
columns = []
for i in range (len(contracts)):
mod = contracts[i].replace('Option(','').replace(')','')
contracts[i] = mod
new_cont = contracts[i].split(',')
new_conts.append(new_cont)
for contract in new_conts:
column = []
for i in range (len(contract)):
mod = contract[i].split('=')
contract[i] = mod[1]
column.append(mod[0])
columns.append(column)
print(len(columns[0]))
df = pd.DataFrame(new_conts,columns=columns[0])
df
Output:
conId symbol lastTradeDateOrContractMonth strike right multiplier exchange currency localSymbol tradingClass
0 384688665 'SPX' '20200116' 3205.0 'P' '100' 'SMART' 'USD' 'SPX 200117P03205000' 'SPX'
1 12345678 'DJX' '20200113' 1205.0 'P' '200' 'SMART' 'USD' 'DJXX 333117Y13205000' 'DJX'
Obviously you can then delete unwanted columns, change names, etc.
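For example, a minimal sketch of that cleanup (which columns you drop or rename is up to you; 'contract_id' is just an example target name):
# tidy the headers first (the split('=') above can leave a leading space on the names),
# then drop or rename whatever you don't need
df.columns = df.columns.str.strip()
df = df.drop(columns=['right', 'multiplier'])
df = df.rename(columns={'conId': 'contract_id'})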

Related

How to use writeStream from a pyspark streaming dataframe to chop the values into different columns?

I am trying to ingest some files and each of them is being read as a single column string (which is expected since it is a fixed-width file) and I have to split that single value into different columns. This means that I must access the dataframe but I must use writeStream since it is a streaming dataframe. This is an example of the input:
"64 Apple 32.32128Orange12.1932 Banana 2.45"
Expected dataframe:
64, Apple, 32.32
128, Orange, 12.19
32, Banana, 2.45
Notice that the columns have fixed widths of 3, 6 and 5 characters (these are the values in META_SIZES), so each record is 14 characters long (the sum of the widths).
I tried using foreach as in the following example, but it is not doing anything:
two_d = []

streamingDF = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

def process_row(string):
    rows = round(len(string) / chars_per_row)
    for i in range(rows):
        current_index = 0
        two_d.append([])
        for j in range(len(META_SIZES)):
            two_d[i].append(string[(i * chars_per_row + current_index):(i * chars_per_row + current_index + META_SIZES[j])].strip())
            current_index += META_SIZES[j]
        print(two_d[i])

query = streamingDF.writeStream.foreach(process_row).start()
I will probably do a withColumn to add them instead of the list or use that list and make it a streaming dataframe if possible and better.
Edit: I added an input example and explained META_SIZES
Assuming the inputs are something like the following.
...
"64 Apple 32.32"
"128 Orange 12.19"
"32 Banana 2.45"
...
You can do this.
from pyspark.sql import functions

streamingDF = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

# remove this line if the strings are already utf-8
lines = streamingDF.select(streamingDF['value'].cast('string'))

lengths = (lines.withColumn('Count', functions.split(lines['value'], ' ').getItem(0))
                .withColumn('Fruit', functions.split(lines['value'], ' ').getItem(1))
                .withColumn('Price', functions.split(lines['value'], ' ').getItem(2)))
Note that "value" is set as the default column name when reading a string using readStream. If clouds_config contains anything changing the column name of the input you will have to alter the column name in the code above.

How to combine queries with a single external variable using Pandas

I am trying to accept a variable input of many search terms separated by commas via html form (#search) and query two columns of a dataframe.
Each column query works on its own, but I cannot get them to work together in an and/or way.
First column query:
filtered = df.query('`Drug Name` in @search')
Second column query:
filtered = df.query('BP.str.contains(@search, na=False)', engine='python')
edit
combining like this:
filtered = df.query ("('`Drug Name` in #search') and ('BP.str.contains(#search, na=False)', engine='python')")
Gives the following error, highlighting the python identifier in the engine argument
SyntaxError: Python keyword not valid identifier in numexpr query
edit 2
The dataframe is read from an excel file, with columns:
Drug Name (containing a single drug name), BP, U&E (with long descriptive text entries)
The search terms will be input via html form:
search = request.values.get('searchinput').replace(" ","").split(',')
as a list of drugs which a patient may be on, sometimes with the addition of specific conditions relating to medication use. Sample user input:
Captopril, Paracetamol, kidney disease, chronic
I want the list to be checked against specific drug names and also to check other columns such as BP and U&E for any mention of the search terms.
edit 3
Apologies, but trying to implement the answers given is giving me stacks of errors. What I have below gives me 90% of what I'm after, letting me search both columns including the whole contents of 'BP'. But I can only search a single term via the terminal; if I comment out and swap the lines which collect the user input (taking it from the html form as opposed to the terminal) I get:
TypeError: unhashable type: 'list'
@app.route('/', methods=("POST", "GET"))
def html_table():
    searchterms = []
    #searchterms = request.values.get('searchinput').replace(" ","").split(',')
    searchterms = input("Enter drug...")
    filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
    return render_template('drugsafety.html', tables=[filtered.to_html(classes='data')], titles=['na', 'Drug List'])
<form action="" method="post">
<p><label for="search">Search</label>
<input type="text" name="searchinput"></p>
<p><input type="submit"></p>
</form>
Sample data
The contents of the BP column can be quite long, descriptive and variable but an example is:
Every 12 months – Patients with CKD every 3 to 6 months.
Drug Name BP U&E
Perindopril Every 12 months Not needed
Alendronic Acid Not needed Every 12 months
Allopurinol Whilst titrating - 3 months Not needed
With this line:
searchterms = request.values.get('searchinput')
Entering 'months' into the html form outputs:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – 3 months Not needed
All good.
Entering 'Alendronic Acid' into the html form outputs:
13 Alendronic Acid Not needed Every 12 months
Also good, but entering 'Perindopril, Allopurinol' returns nothing.
If I change the line to:
searchterms = request.values.get('searchinput').replace(" ","").split(',')
I get TypeError: unhashable type: 'list' when the page reloads.
However - If I then change:
filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
to:
filtered = df.query('`Drug Name` in @searchterms')
Then the unhashable type error goes and entering 'Perindopril, Allopurinol'
returns:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – Every 3 months Not needed
But I'm now no longer searching the BP column for the searchterms.
I thought that maybe it's because searchterms is a list ('[]'), so I changed it to a tuple ('()'), but that didn't change anything.
Any help is much appreciated.
I am assuming you want to query two columns and return the row if either query matches.
In this line, the issue is that engine='python' is inside the query string:
filtered = df.query("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")
It should be
df.query("BP.str.contains(@search, na=False)", engine='python')
If you do searchterms = request.values.get('searchinput').replace(" ","").split(','), it converts your string to a list of words, which causes the unhashable type: 'list' error because str.contains expects a string as input.
What you can do is use regex to search for the search terms in the list; it will look something like this:
df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')
What this does is search for all the individual words using regex ('|'.join(@search) will be "searchterm_1|searchterm_2|...", and "|" means or in regex, so it looks for searchterm_1 or searchterm_2 in the BP column value).
To combine the outputs of both queries, you can run those separately and concatenate the results
pd.concat([df.query("`Drug Name` in @search", engine='python'), df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')])
Also, any string-based matching requires your strings to match exactly, including case, so you could lowercase everything in the dataframe and the query. Similarly, for space-separated terms, the .replace() will remove the spaces:
if you do searchterms = request.values.get('searchinput').replace(" ","").split(',') on Every 12 months, it will get converted to "Every12months", so you can remove the .replace() part and just use searchterms = request.values.get('searchinput').split(',').
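Putting those pieces together, a minimal sketch of the whole flow (assuming request is the Flask request object from the question and df is the drug dataframe):
import pandas as pd

# build the term list, then OR the two result sets together
searchterms = [t.strip() for t in request.values.get('searchinput').split(',')]
pattern = '|'.join(searchterms)

filtered = pd.concat([
    df.query('`Drug Name` in @searchterms'),
    df.query('BP.str.contains(@pattern, na=False, case=False)', engine='python'),
]).drop_duplicates()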
Use sets. You can change the text columns to sets and check for intersection with the input. The rest is pure pandas. I never use .query because it is slow.
# change your search from list to set
search = set(request.values.get('searchinput').replace(" ","").split(','))
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
& (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
Demo:
import pandas as pd
search = set(["apple", "banana", "orange"])
df = pd.DataFrame({
"Drug Name": ["I am eating an apple", "We are banana", "nothing is here"],
"BP": ["apple is good", "nothing is here", "nothing is there"],
"Just": [1, 2, 3]
})
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
& (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
# Drug Name BP Just
# 0 I am eating an apple apple is good 1
Updated:
I would want the results to also show We are banana, nothing is here and 2
That requires or, which is pandas' |, instead of and, which is pandas' &.
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
| (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
# Drug Name BP Just
# 0 I am eating an apple apple is good 1
# 1 We are banana nothing is here 2
If you want to search for text in all columns, you can first join all columns, and then check for search terms in each row using str.contains and the regular expression pattern that matches at least one of the terms (term1|term2|...|termN). I've also added flags=re.IGNORECASE to make the search case-insensitive:
import re

# search function
def search(searchterms):
    return df.loc[df.apply(' '.join, axis=1)               # join text in all columns
                    .str.contains(                          # check if it contains
                        '|'.join([                          # regex pattern
                            x.strip()                       # strip spaces
                            for x in searchterms.split(',') # split by ','
                        ]), flags=re.IGNORECASE)]           # case-insensitive
# test search terms
for s in ['Alendronic Acid', 'months', 'Perindopril, Allopurinol']:
    print(f'Search terms: "{s}"')
    print(search(s))
    print('-'*70)
Output:
Search terms: "Alendronic Acid"
Drug Name BP U&E
1 Alendronic Acid Not needed Every 12 months
----------------------------------------------------------------------
Search terms: "months"
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
1 Alendronic Acid Not needed Every 12 months
2 Allopurinol Whilst titrating - 3 months Not needed
----------------------------------------------------------------------
Search terms: "Perindopril, Allopurinol"
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
2 Allopurinol Whilst titrating - 3 months Not needed
----------------------------------------------------------------------
P.S. If you want to limit search to specific columns, here's a version that does that (with the default of searching all columns for convenience):
# search function
def search(searchterms, cols=None):
    # search columns (if None, searches in all columns)
    if cols is None:
        cols = df.columns
    return df.loc[df[cols].apply(' '.join, axis=1)              # join text in cols
                          .str.contains(                        # check if it contains
                              '|'.join([                        # regex pattern
                                  x.strip()                     # remove spaces
                                  for x in searchterms.split(',')  # split by ','
                              ]), flags=re.IGNORECASE)]         # make search case-insensitive
Now if I search for months only in Drug Name and BP, it will not return Alendronic Acid where months is only found in U&E:
search('months', ['Drug Name', 'BP'])
Output:
Drug Name BP U&E
0 Perindopril Every 12 months Not needed
2 Allopurinol Whilst titrating - 3 months Not needed
Without having sample input data, I used a randomly generated dataset as a showcase:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Drug_Name':['Drug1','Drug2','Drug3','Drug2','Drug5','Drug3']*4,
'Inv_Type': ['X', 'Y']*12,
'Quant': np.random.randint(2,20, size=24)})
# Search 1
search = "Drug3"
df.query('Drug_Name == @search')
# Search 2
search2 = "Y"
df.query('Inv_Type.str.contains(@search2, na=False)', engine='python')
# Combined (use boolean operators such as & or | instead of and / or)
df.query('Drug_Name == @search & Inv_Type.str.contains(@search2, na=False)')
Please note that engine='python' should be avoided as stated in the documentation:
Likewise, you can pass engine='python' to evaluate an expression using
Python itself as a backend. This is not recommended as it is
inefficient compared to using numexpr as the engine.
That said, if you are hell-bent on using it, you can do it like this:
mask = df["Inv_Type"].str.contains(search2, na=False)
df.query('Drug_Name == @search & @mask')
Alternatively, you can achieve the same without using .query() at all:
df[(df['Drug_Name']==search) & df['Inv_Type'].str.contains(search2, na=False)]
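Mapped back onto the original question's columns, the same no-query approach might look like this (a sketch; `terms` is assumed to be the list of user-entered search terms):
# plain boolean indexing on the question's columns
pattern = '|'.join(terms)
mask = df['Drug Name'].isin(terms) | df['BP'].str.contains(pattern, na=False, case=False)
filtered = df[mask]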

splitting of urls from a list in dataframe where column name is company_urls

I have a dataframe(df) like this:
company_urls
0 [https://www.linkedin.com/company/gulf-capital...
1 [https://www.linkedin.com/company/gulf-capital...
2 [https://www.linkedin.com/company/fajr-capital...
3 [https://www.linkedin.com/company/goldman-sach...
And df.company_urls[0] is
['https://www.linkedin.com/company/gulf-capital/about/',
'https://www.linkedin.com/company/the-abraaj-group/about/',
'https://www.linkedin.com/company/abu-dhabi-investment-company/about/',
'https://www.linkedin.com/company/national-bank-of-dubai/about/',
'https://www.linkedin.com/company/efg-hermes/about/']
So I have to create a new columns like this:
company_urls company_url1 company_url2 company_url3 ...
0 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/the-abraaj-group/about/...
1 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/gulf-related/about/...
2 [https://www.linkedin.com/company/fajr-capital... https://www.linkedin.com/company/fajr-capital/about/...
3 [https://www.linkedin.com/company/goldman-sach... https://www.linkedin.com/company/goldman-sachs/about/...
How do I do that?
I have created this function for my personal use, and I think it will work for your needs:
a) Specify the df name
b) Specify the column you want to split
c) Specify the delimiter
import numpy as np

def composition_split(dat, col, divider=','):  # set your delimiter here
    """
    Splits the column of interest depending on how many delimiters we have
    and creates all the columns needed to make the split.
    """
    x1 = dat[col].astype(str).apply(lambda x: x.count(divider)).max()
    x2 = ["company_url_" + str(i) for i in np.arange(0, x1 + 1, 1)]
    dat[x2] = dat[col].str.split(divider, expand=True)
    return dat
Basically this will create as many columns as needed, depending on how you specify the delimiter. For example, if the URL can be split 3 times based on a certain delimiter, it will create 3 new columns.
your_new_df = composition_split(df,'col_to_split',',') # for example
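Since company_urls in the question appears to hold actual Python lists rather than delimited strings, an alternative (just a sketch, assuming each cell really is a list) is to expand the lists directly:
import pandas as pd

# expand the list column into one column per URL, padding shorter lists with None
urls = pd.DataFrame(df['company_urls'].tolist(), index=df.index)
urls.columns = ['company_url' + str(i + 1) for i in range(urls.shape[1])]
df = df.join(urls)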

Most efficient method to modify values within large dataframes - Python

Overview: I am working with pandas dataframes of census information, while they only have two columns, they are several hundred thousand rows in length. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block ID resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, there are several place values that are blank even though they have a census block ID in their corresponding row. What I found is that, in several instances, a census block ID that is missing a place value is located within the same city as the surrounding blocks that do have a place value, especially if the bookend place values are the same. As shown above with index 5844 through 5847, those two blocks are located within the same general area as the surrounding blocks, but just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
import pandas

current_state_blockid_df = pandas.DataFrame({'BLOCKID': [60014099004021, 60014100001000, 60014100001001, 60014100001002, 60014301012019, 60014301013000, 60014301013001, 60014301013002, 60014301013003, 60014301013004, 60014301013005, 60014301013006],
                                             'PLACEFP': [53000, '', '', 53000, 11964, '', '', '', '', '', '', 11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        # Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility to speed it up slightly is to eliminate the check for where the gap ends and instead just fill it in with whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only when the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
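For runs of consecutive blanks, one alternative (a sketch, assuming blanks are empty strings as in the question's data) is to treat '' as missing and compare a forward fill with a backward fill, filling only where they agree:
import numpy as np

# treat empty strings as missing, then look at the nearest value on each side
s = df['PLACEFP'].replace('', np.nan)
before = s.ffill()   # last non-blank value above the gap
after = s.bfill()    # first non-blank value below the gap
# fill only where the values bracketing the gap agree
df.loc[s.isna() & (before == after), 'PLACEFP'] = before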
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).
for row in df.itertuples():
    # logic goes here
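For example, a brief sketch of that pattern applied to the question's columns (assuming blanks are empty strings, as in the question's data):
# scan for blank place values with itertuples;
# row.Index is the row label, BLOCKID and PLACEFP are the question's columns
for row in df.itertuples():
    if row.PLACEFP == '':
        print(row.Index, row.BLOCKID)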
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']

    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}

    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i - 1, 'PLACEFP']

    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])

    df_with_blanks['PLACEFP'] = l

    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df
df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964

Pandas split column name

I have a test dataframe that looks something like this:
data = pd.DataFrame([[0,0,0,3,6,5,6,1],[1,1,1,3,4,5,2,0],[2,1,0,3,6,5,6,1],[3,0,0,2,9,4,2,1]], columns=["id", "sex", "split", "group0Low", "group0High", "group1Low", "group1High", "trim"])
grouped = data.groupby(['sex','split']).mean()
stacked = grouped.stack().reset_index(level=2)
stacked.columns = ['group_level', 'mean']
Next, I want to separate out group_level and stack those 2 new factors:
stacked['group'] = stacked.group_level.str[:6]
stacked['level'] = stacked.group_level.str[6:]
This all works fine. My question is this:
This works if my column names ("group0Low", "group0High", "group1Low", "group1High") have something in common with each other.
What if instead my column names were more like "routeLow", "routeHigh", "landmarkLow", "landmarkHigh"? How would I use str to split group_level in this case?
This question is similar to this one posted here: Slice/split string Series at various positions
The difference is all of my column subnames are different and have no commonality (whereas in the other post everything had group or class in the name). Is there a regex string, or some other method, I can use to do this stacking?
Here is another way. It assumes that low/high group ends with the words Low and High respectively, so that we can use .str.endswith() to identify which rows are Low/High.
Here is the sample data
df = pd.DataFrame('group0Low group0High group1Low group1High routeLow routeHigh landmarkLow landmarkHigh'.split(), columns=['group_level'])
df
group_level
0 group0Low
1 group0High
2 group1Low
3 group1High
4 routeLow
5 routeHigh
6 landmarkLow
7 landmarkHigh
Using np.where, we can do the following:
import numpy as np

df['level'] = np.where(df['group_level'].str.endswith('Low'), 'Low', 'High')
df['group'] = np.where(df['group_level'].str.endswith('Low'), df['group_level'].str[:-3], df['group_level'].str[:-4])
df
group_level level group
0 group0Low Low group0
1 group0High High group0
2 group1Low Low group1
3 group1High High group1
4 routeLow Low route
5 routeHigh High route
6 landmarkLow Low landmark
7 landmarkHigh High landmark
I suppose it depends on how general the strings you're working with are. Assuming the levels are always delimited by a capital letter, you can do
In [30]:
s = pd.Series(['routeHigh', 'routeLow', 'landmarkHigh',
'landmarkLow', 'routeMid', 'group0Level'])
s.str.extract('([\d\w]*)([A-Z][\w\d]*)')
Out[30]:
0 1
0 route High
1 route Low
2 landmark High
3 landmark Low
4 route Mid
5 group0 Level
You can even name the columns of the result in the same line by doing
s.str.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
So in your use case you can do
group_level_df = stacked.group_level.str.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
stacked = pd.concat([stacked, group_level_df], axis=1)
Here's another approach which assumes only knowledge of the level names in advance. Suppose you have three levels:
lower = stacked.group_level.str.lower()
for level in ['low', 'mid', 'high']:
    rows_in = lower.str.contains(level)
    stacked.loc[rows_in, 'level'] = level.capitalize()
    stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str.replace(level, '', case=False, regex=True)
Which should work as long as the level doesn't appear in the group name as well, e.g. 'highballHigh'. In cases where group_level didn't contain any of these levels you would end up with null values in the corresponding rows
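If a group name could contain a level word (the 'highballHigh' case), anchoring the match to the end of the string avoids it; a sketch of that variant:
# match the level only at the end of the string, so a group name
# like 'highball' cannot be mistaken for the level 'high'
for level in ['Low', 'Mid', 'High']:
    rows_in = stacked.group_level.str.endswith(level)
    stacked.loc[rows_in, 'level'] = level
    stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str[:-len(level)]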
