Python: strip pair-wise column names

I have a DataFrame with columns that look like this:
df = pd.DataFrame(columns=['(NYSE_close, close)', '(NYSE_close, open)', '(NYSE_close, volume)',
                           '(NASDAQ_close, close)', '(NASDAQ_close, open)', '(NASDAQ_close, volume)'])
df:
(NYSE_close, close) (NYSE_close, open) (NYSE_close, volume) (NASDAQ_close, close) (NASDAQ_close, open) (NASDAQ_close, volume)
I want to remove everything after the underscore and append whatever comes after the comma to get the following:
df:
NYSE_close NYSE_open NYSE_volume NASDAQ_close NASDAQ_open NASDAQ_volume
I tried to strip the column names, but that replaced them with NaN. Any suggestions on how to do this?
Thank you in advance.

You could use re.sub to extract the relevant parts of each column name and rebuild them:
import re
import pandas as pd

df = pd.DataFrame(columns=['(NYSE_close, close)', '(NYSE_close, open)', '(NYSE_close, volume)',
                           '(NASDAQ_close, close)', '(NASDAQ_close, open)', '(NASDAQ_close, volume)'])
# Group 1 captures the prefix up to and including the underscore (e.g. "NYSE_"),
# group 2 captures the word after the comma (e.g. "close"); the replacement joins them.
df.columns = [re.sub(r'\(([^_]+_)\w+, (\w+)\)', r'\1\2', c) for c in df.columns]
Output:
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
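If you'd rather keep it vectorised on the Index itself, the same regex works with the .str.replace accessor (a small variation on the answer above, not a different technique):
import pandas as pd

df = pd.DataFrame(columns=['(NYSE_close, close)', '(NYSE_close, open)', '(NYSE_close, volume)',
                           '(NASDAQ_close, close)', '(NASDAQ_close, open)', '(NASDAQ_close, volume)'])
# Same two capture groups as above, applied to every label at once
df.columns = df.columns.str.replace(r'\(([^_]+_)\w+, (\w+)\)', r'\1\2', regex=True)
print(list(df.columns))
# ['NYSE_close', 'NYSE_open', 'NYSE_volume', 'NASDAQ_close', 'NASDAQ_open', 'NASDAQ_volume']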

You could also pass a converter function to rename:
import re

def cvt_col(x):
    # Replace parentheses, underscores and commas with spaces, then split into words
    s = re.sub('[()_,]', ' ', x).split()
    return s[0] + '_' + s[2]

df.rename(columns=cvt_col)
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []

Use a list comprehension, twice:
step1 = [ent.strip('()').split(',') for ent in df]
df.columns = ["_".join([left.split('_')[0], right.strip()]) for left, right in step1]
df
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []

Related

Want to apply merge function on column A

How can I apply a merge function, or any other method, to column A?
For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into
"(D A|D B|D C)|(A B|A C|A D)|(B|C|D)"
The (B|C|D) group will remain the same, as it has no comma-separated value to merge. Basically, I want to merge the comma-separated values with the rest of the values in their group.
I have below data frame.
import pandas as pd
data = {'A': ['(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
        'B(Expected)': ['(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']}
df = pd.DataFrame(data)
print (df)
My expected result is shown in column B(Expected).
Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product

def split(s):
    # For each comma-separated part, split on '|' and take the cartesian product,
    # then join each combination with a space and the combinations with '|'
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])

df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
A B
0 (A|B|C,D)|(A,B|C|D)|(B|C|D) (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
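To see what the split helper does on a single parenthesised group (a quick illustration, not part of the answer's code):
from itertools import product

s = 'A|B|C,D'                                   # contents of the first group
parts = [x.split('|') for x in s.split(',')]    # [['A', 'B', 'C'], ['D']]
combos = list(product(*parts))                  # [('A', 'D'), ('B', 'D'), ('C', 'D')]
print('|'.join(' '.join(c) for c in combos))    # A D|B D|C D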

Can pandas findall() return a str instead of list?

I have a pandas dataframe containing a lot of variables:
df.columns
Out[0]:
Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE',
'COUNADU_SOIL_P_SAUDPC_150_DA_B_V6_lesion_saudpc_PRATZE',
'CONTRO_SOIL_P_pUNCK_150_DA_B_V6_lesion_p_control_PRATZE',
'COUNJUV_SOIL_P_p_0_100_16_DA_B_V6_lesion_incidence_PRATZE',
'COUNADU_SOIL_P_p_0_100_50_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_p_0_100_128_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_V6_count_nr_spiral_HELYSP',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V10_count_nr_spiral_HELYSP', # and so on
I would like to keep only the number followed by DA, so the first column is 16_DA. I have been using the pandas function findall():
df.columns.str.findall(r'[0-9]*\_DA')
Out[595]:
Index([ ['16_DA'], ['50_DA'], ['128_DA'], ['150_DA'], ['150_DA'],
['16_DA'], ['50_DA'], ['128_DA'], ['50_DA'], ['128_DA'], ['150_DA'],
['150_DA'], ['50_DA'], ['128_DA'],
But this returns a list, which I would like to avoid, so that I end up with a column index looking like this:
df.columns
Out[595]:
Index('16_DA', '50_DA', '128_DA', '150_DA', '150_DA',
'16_DA', '50_DA', '128_DA', '50_DA', '128_DA', '150_DA',
Is there a smoother way to do this?
You can use .str.join(", ") to join all found matches with a comma and space:
df.columns.str.findall(r'\d+_DA').str.join(", ")
Or, just use str.extract to get the first match:
df.columns.str.extract(r'(\d+_DA)', expand=False)
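If the end goal is the renamed Index itself, you can assign the extracted values straight back (a small usage sketch, assuming every column name contains a match):
df.columns = df.columns.str.extract(r'(\d+_DA)', expand=False)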
If you want a single string containing all of the matches, you can flatten the lists returned by findall and join them:
from typing import List

pattern = r'[0-9]*_DA'
# findall returns a list of matches per column; sum(..., []) flattens them into one list
flattened: List[str] = sum(df.columns.str.findall(pattern), [])
output: str = ",".join(flattened)

How to find and replace substrings at the end of column headers

I have the following columns, among others, in my dataframe: 'dom_pop', 'an_dom_n', 'an_dom_ncmplt'. Equivalent columns exist in multiple dataframes, with the suffix changing. For example, in another dataframe they may be called out as 'pa_pop', 'an_pa_n', 'an_pa_ncmplt'. I want to append '_kwh' to these cols across all my dataframes.
I wrote the following code:
cols = ['_n$', '_ncmplt', '_pop']   # the $ is added to indicate a string ending in _n
filterfuel = 'kwh'
for c in cols:
    dfdom.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfdom.columns]
    dfpa.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfpa.columns]
    dfsw.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfsw.columns]
kwh gets appended to the _ncmplt and _pop cols, but not the _n column. If I remove the $, _n gets appended, but then _ncmplt looks like 'an_dom_n_kwh_cmplt'.
For df dom the corrected names should look like 'dom_pop_kwh', 'an_dom_n_kwh', 'an_dom_ncmplt_kwh'.
Why is $ not being recognised as an end-of-string anchor?
str.replace on a plain Python string does a literal substring replacement, so the $ is treated as a literal character rather than an end-of-string anchor. You can use np.where with a regex instead:
import numpy as np

cols = ['_n$', '_ncmplt', '_pop']
filterfuel = 'kwh'
pattern = fr"(?:{'|'.join(cols)})"
for df in [dfdom, dfpa, dfsw]:
    df.columns = np.where(df.columns.str.contains(pattern, regex=True),
                          df.columns + f"_{filterfuel}", df.columns)
Output:
>>> pattern
'(?:_n$|_ncmplt|_pop)'
# dfdom = pd.DataFrame([[0]*4], columns=['dom_pop', 'an_dom_n', 'an_dom_ncmplt', 'hello'])
# After:
>>> dfdom
   dom_pop_kwh  an_dom_n_kwh  an_dom_ncmplt_kwh  hello
0            0             0                  0      0

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string which I defined it as "string1" and I would like to loop through all 4 columns in the dataframe defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: find "&FolderCTID" and delete everything after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search on Google and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
First declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reassign the output to your original df.
cols_to_slice = ['ColumnA', 'ColumnB', 'ColumnC', 'ColumnD']
string1 = "&FolderCTID"
# Index 0 of the expanded split keeps everything before string1
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
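Here is a runnable sketch of that approach on a tiny frame (the example.com URLs are made-up placeholders), keeping the part before string1:
import pandas as pd

string1 = "&FolderCTID"
df = pd.DataFrame({
    'ColumnA': ['https://example.com/a?RootFolder=x&FolderCTID=0x01&View=1'],
    'ColumnB': ['https://example.com/b?RootFolder=y&FolderCTID=0x02&View=2'],
})
cols_to_slice = ['ColumnA', 'ColumnB']
# stack() gives one URL per row, the split keeps the part before the marker,
# unstack(1) restores the original wide shape so it can be assigned back
df[cols_to_slice] = (df[cols_to_slice].stack()
                     .str.split(string1, expand=True)[0]
                     .unstack(1))
print(df.loc[0, 'ColumnA'])  # https://example.com/a?RootFolder=x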
You should first get the position of the string using str.find:
# This keeps string1 itself in the result
indexes = df_MasterData[i].str.find(string1) + len(string1)
# If you don't want string1 included in the result, drop the len(string1) offset:
indexes = df_MasterData[i].str.find(string1)
Now slice each value up to its position (the .str slicer only takes a scalar bound, so use a comprehension; this assumes string1 occurs in every row):
df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]

Renaming index values in pandas dataframe

I need to rename my index values:
Country Date (other columns)
/link1/subpath2/Text by Poe/
/link1/subpath2/Text by Wilde/
/link1/subpath2/Text by Whitman/
Country and Date are my indices. I would like to extract the words Poe, Wilde and Whitman from index column Country in order to have
Country Date (other columns)
Poe
Wilde
Whitman
Currently I am doing it one by one:
df=df.rename(index={'/link1/subpath2/Text by Poe/': 'Poe'})
df=df.rename(index={'/link1/subpath2/Text by Wilde/': 'Wilde'})
df=df.rename(index={'/link1/subpath2/Text by Whitman/': 'Whitman'})
It works, but since I have hundreds of datasets, as you can imagine this is not doable.
You can use str.replace:
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
df['Country'] = df['Country'].str.replace(r'/', '')
If 'Country' is an Index you can do as follows:
df = df.set_index('Country')
df.index = df.index.str.replace(r'/link1/subpath2/Text by ', '')
If it's a MultiIndex you can use .reset_index:
df = df.reset_index()
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
You can always use regex pattern if things get more complicated:
import re
import pandas as pd
df = pd.DataFrame(['foo', 'bar', 'z'], index=['/link1/subpath2/Text by Poe/',
                                              '/link1/subpath2/Text by Wilde/',
                                              '/link1/subpath2/Text by Whitman/'])
name_pattern = re.compile(r'by (\w+)/')
df.index = [name_pattern.findall(idx)[0] for idx in df.index]
df
where name_pattern will capture all groups between 'by ' and '/'
You can use str.extract with a pattern that catches the last word with (\w*), delimited by a whitespace \s before it and by the character / at the end of the string ($) after it. Because it is an index, you need to rebuild it with MultiIndex.from_arrays.
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0).str.extract(r'\s(\w*)/$')[0],
     df.index.get_level_values(1)],
    names=['Country', 'Dates'])
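A minimal, self-contained sketch of the same idea (the Date values here are made up just to form a MultiIndex):
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['/link1/subpath2/Text by Poe/', '/link1/subpath2/Text by Wilde/'],
     ['2020-01-01', '2020-01-02']],
    names=['Country', 'Date'])
df = pd.DataFrame({'value': [1, 2]}, index=idx)

df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0).str.extract(r'\s(\w*)/$')[0],
     df.index.get_level_values(1)],
    names=['Country', 'Date'])
print(df.index.get_level_values('Country').tolist())  # ['Poe', 'Wilde']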
