Splitting a path column in python

Hi, I have a column with paths like this:
path_column = ['C:/Users/Desktop/sample\\1994-QTR1.tsv','C:/Users/Desktop/sample\\1995-QTR1.tsv']
I need to split and get just the file name.
Expected output:
[1994-QTR1,1995-QTR1]
Thanks

Use str.extract:
df['new'] = df['path'].str.extract(r'\\([^\\]*)\.\w+$', expand=False)
The equivalent with rsplit would be much less efficient:
df['new'] = df['path'].str.rsplit('\\', n=1).str[-1].str.rsplit('.', n=1).str[0]
Output:
path new
0 C:/Users/Desktop/sample\1994-QTR1.tsv 1994-QTR1
1 C:/Users/Desktop/sample\1995-QTR1.tsv 1995-QTR1
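A self-contained, runnable version of the extract approach above, using the paths from the question:

```python
import pandas as pd

df = pd.DataFrame({'path': ['C:/Users/Desktop/sample\\1994-QTR1.tsv',
                            'C:/Users/Desktop/sample\\1995-QTR1.tsv']})
# capture everything between the last backslash and the file extension
df['new'] = df['path'].str.extract(r'\\([^\\]*)\.\w+$', expand=False)
print(df['new'].tolist())  # ['1994-QTR1', '1995-QTR1']
```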

Similar to the above, but you don't need to declare the separator.
import os
path = "C:/Users/Desktop/sample\\1994-QTR1.tsv"
name = path.split(os.path.sep)[-1]
print(name)
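One caveat: os.path.sep is platform-dependent ('/' on POSIX, '\\' on Windows), so the split above only strips the backslash-delimited part when run on Windows. A sketch of a portable alternative using pathlib, which accepts both separators regardless of platform:

```python
import pathlib

path = "C:/Users/Desktop/sample\\1994-QTR1.tsv"
# PureWindowsPath treats both "/" and "\\" as separators on any platform,
# and .stem drops the extension as well
name = pathlib.PureWindowsPath(path).stem
print(name)  # 1994-QTR1
```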

Use this, or you can use a regex to match and take what you want.
path.split("\\")[-1].split(".")[0]
Output:
'1994-QTR1'
Edit
new_col = []
for i in path_column:
    new_col.append(i.split("\\")[-1].split(".")[0])
print(new_col)
NOTE: If you need it in a list, you can append it to a new list from the loop.
Output:
['1994-QTR1', '1995-QTR1']
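The same loop can also be written as a list comprehension:

```python
path_column = ['C:/Users/Desktop/sample\\1994-QTR1.tsv',
               'C:/Users/Desktop/sample\\1995-QTR1.tsv']
# same split logic as the loop above, one expression per path
new_col = [p.split("\\")[-1].split(".")[0] for p in path_column]
print(new_col)  # ['1994-QTR1', '1995-QTR1']
```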

You might harness pathlib for this task the following way:
import pathlib
import pandas as pd
def get_stem(path):
    return pathlib.PureWindowsPath(path).stem
df = pd.DataFrame({'paths':['C:/Users/Desktop/sample\\1994-QTR1.tsv','C:/Users/Desktop/sample\\1994-QTR2.tsv','C:/Users/Desktop/sample\\1994-QTR3.tsv']})
df['names'] = df.paths.apply(get_stem)
print(df)
gives output
paths names
0 C:/Users/Desktop/sample\1994-QTR1.tsv 1994-QTR1
1 C:/Users/Desktop/sample\1994-QTR2.tsv 1994-QTR2
2 C:/Users/Desktop/sample\1994-QTR3.tsv 1994-QTR3

Related

Can pandas findall() return a str instead of list?

I have a pandas dataframe containing a lot of variables:
df.columns
Out[0]:
Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE',
'COUNADU_SOIL_P_SAUDPC_150_DA_B_V6_lesion_saudpc_PRATZE',
'CONTRO_SOIL_P_pUNCK_150_DA_B_V6_lesion_p_control_PRATZE',
'COUNJUV_SOIL_P_p_0_100_16_DA_B_V6_lesion_incidence_PRATZE',
'COUNADU_SOIL_P_p_0_100_50_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_p_0_100_128_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_V6_count_nr_spiral_HELYSP',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V10_count_nr_spiral_HELYSP', # and so on
I would like to keep only the number followed by DA, so the first column is 16_DA. I have been using the pandas function findall():
df.columns.str.findall(r'[0-9]*\_DA')
Out[595]:
Index([ ['16_DA'], ['50_DA'], ['128_DA'], ['150_DA'], ['150_DA'],
['16_DA'], ['50_DA'], ['128_DA'], ['50_DA'], ['128_DA'], ['150_DA'],
['150_DA'], ['50_DA'], ['128_DA'],
But this returns a list, which I would like to avoid, so that I end up with a column index looking like this:
df.columns
Out[595]:
Index('16_DA', '50_DA', '128_DA', '150_DA', '150_DA',
'16_DA', '50_DA', '128_DA', '50_DA', '128_DA', '150_DA',
Is there a smoother way to do this?
You can use .str.join(", ") to join all found matches with a comma and space:
df.columns.str.findall(r'\d+_DA').str.join(", ")
Or, just use str.extract to get the first match:
df.columns.str.extract(r'(\d+_DA)', expand=False)
from typing import List
pattern = r'[0-9]*\_DA'
flattened: List[str] = sum(df.columns.str.findall(pattern), [])
output: str = ",".join(flattened)
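For reference, a minimal sketch of what the two approaches do per string, using plain re (the pandas .str methods wrap the same logic):

```python
import re

cols = [
    'COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
    'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
]
# findall returns a list of matches per string; join flattens each to one str
joined = [", ".join(re.findall(r'\d+_DA', c)) for c in cols]
# extract corresponds to taking only the first match
first = [re.search(r'(\d+_DA)', c).group(1) for c in cols]
print(joined)  # ['16_DA', '50_DA']
print(first)   # ['16_DA', '50_DA']
```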

How to strip/replace "domain\" from Pandas DataFrame Column?

I have a pandas DataFrame that's being read in from a CSV that has hostnames of computers including the domain they belong to along with a bunch of other columns. I'm trying to strip out the Domain information such that I'm left with ONLY the Hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried using both str.strip() and str.replace() with a regex as well as a string literal, but I can't seem to correctly target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer, just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
which simply adds the r-string prefix to one of the snippets you already tried.
Without the r-prefix, the Python literal '.*\\' is the three characters .*\ and the trailing lone backslash is an invalid regex escape. With the r-prefix, the pattern contains two backslash characters, which the regex engine interprets as a single literal backslash, as you intended.
Output:
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
Name: name, dtype: object
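A quick way to see the difference between the two string literals, sketched with plain re:

```python
import re

# without the r-prefix, '.*\\' is the three characters . * \ ;
# the trailing lone backslash is an invalid escape in a regex
try:
    re.compile('.*\\')
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False

# with the r-prefix the pattern holds two backslashes, which the
# regex engine reads as a single literal backslash
print(re.sub(r'.*\\', '', 'domain1\\computername1'))  # computername1
```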
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
name
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
No regex approach with ntpath.basename:
import pandas as pd
import ntpath
df = pd.DataFrame({'name':[r'domain1\computername1']})
df["name"] = df["name"].apply(lambda x: ntpath.basename(x))
Results: computername1.
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]

How to extract sub string by defining before and after delimiter

I have data frame which contains the URLs and I want to extract something in between.
df
URL
https://storage.com/vision/Glass2020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg
https://storage.com/vision/Carpet5020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg
https://storage.com/vision/Metal8020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg
desired output would be like this
URL Type
https://storage.com/vision/Glass2020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg Glass2020
https://storage.com/vision/Carpet5020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg Carpet5020
https://storage.com/vision/Metal8020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg Metal8020
I would use df['URL'].str.extract, but I don't understand how to define what comes before and after the delimiter.
One idea is to use Series.str.split and select the second-to-last value by indexing:
df['Type'] = df['URL'].str.split('/').str[-2]
print (df)
URL Type
0 https://storage.com/vision/Glass2020/2020-02-0... Glass2020
1 https://storage.com/vision/Carpet5020/2020-02-... Carpet5020
2 https://storage.com/vision/Metal8020/2020-02-0... Metal8020
EDIT: If you need to match values between specific delimiters instead, use Series.str.extract:
df['Type'] = df['URL'].str.extract('vision/(.+)/2020')
print (df)
URL Type
0 https://storage.com/vision/Glass2020/2020-02-0... Glass2020
1 https://storage.com/vision/Carpet5020/2020-02-... Carpet5020
2 https://storage.com/vision/Metal8020/2020-02-0... Metal8020
Try str.split:
df['Type'] = df.URL.str.split('/').str[-2]
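Both ideas can be sketched on a single string with the standard library:

```python
import re

url = "https://storage.com/vision/Glass2020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg"
# split: the type is the second-to-last "/"-separated segment
by_split = url.split('/')[-2]
# extract: capture whatever sits between "vision/" and the next "/"
by_regex = re.search(r'vision/([^/]+)/', url).group(1)
print(by_split, by_regex)  # Glass2020 Glass2020
```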

Best way to handle path with pandas

When I have a pd.DataFrame with paths, I end up doing a lot of .map(lambda path: Path(path).{method_name}) or .apply(axis=1), e.g.:
from pathlib import Path
import pandas as pd

(
    pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
    .assign(full_path=lambda df: df.apply(lambda row: Path(row.base_dir) / row.file_name, axis=1))
)
base_dir file_name full_path
0 dir_A file_0 dir_A/file_0
1 dir_B file_1 dir_B/file_1
It seems odd to me, especially because pathlib does implement /, so something like df.base_dir / df.file_name would be more pythonic and natural.
I have not found any path type implemented in pandas, is there something I am missing?
EDIT
I have found it may be better to convert the columns to Path once and for all (a sort of astype(Path)); then at least path concatenation with pathlib is vectorized:
(
pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
# this is where I would expect `astype({'base_dir': Path})`
.assign(**{col_name:lambda df: df[col_name].map(Path) for col_name in ["base_dir", "file_name"]})
.assign(full_path=lambda df: df.base_dir / df.file_name)
)
It seems like the easiest way would be:
df.base_dir.map(Path) / df.file_name.map(Path)
It saves the need for a lambda function, but you still need to map to 'Path'.
Alternatively, just do:
df.base_dir.str.cat(df.file_name, sep="/")
The latter won't work on Windows (who cares, right? :) but will probably run faster.
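A runnable sketch comparing the two options (the element-wise / works because pandas applies the operator to each pair of Path objects):

```python
import pandas as pd
from pathlib import Path

df = pd.DataFrame({'base_dir': ['dir_A', 'dir_B'],
                   'file_name': ['file_0', 'file_1']})

# pandas applies "/" element-wise to the object-dtype Series of Paths
full = df.base_dir.map(Path) / df.file_name.map(Path)
# pure string concatenation; faster, but hard-codes the separator
as_str = df.base_dir.str.cat(df.file_name, sep="/")
print([p.as_posix() for p in full])  # ['dir_A/file_0', 'dir_B/file_1']
print(as_str.tolist())               # ['dir_A/file_0', 'dir_B/file_1']
```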
import pandas as pd
import os
df = pd.DataFrame({"p1":["path1"],"p2":["path2"]})
df.apply(lambda x:os.path.join(x.p1, x.p2), axis=1)
Output:
0 path1\path2
dtype: object
Edit:
After being told not to use assign, you can try this (see the .to_json() docs):
import os
import pandas as pd
df = pd.DataFrame({"p1":["path1", "path3"],"p2":["path2", "path4"]})
print(df.to_json(orient="values"))
Output
[["path1","path2"],["path3","path4"]]
From here it's simple, just use map(lambda x:os.path.join(*x), ...) and you get a list of paths.
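Assuming the nested-list output above has been parsed back into a Python list, that last step could look like this sketch:

```python
import os

rows = [["path1", "path2"], ["path3", "path4"]]  # parsed from to_json(orient="values")
# os.path.join uses the platform separator: "/" on POSIX, "\\" on Windows
paths = list(map(lambda x: os.path.join(*x), rows))
print(paths)
```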
Using pandas-path
The pandas-path package provides the functionality you need and more. Just by importing, it adds a .path accessor to pd.Series and pd.Index to make pathlib methods available.
import pandas as pd
import pandas_path
df = pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
# .path accessor added by importing pandas_path
df.base_dir.path / df.file_name.path
#> 0 dir_A/file_0
#> 1 dir_B/file_1
#> dtype: object
Created at 2021-03-06 18:09:44 PST by reprexlite v0.4.2

python pandas - function applied to csv is not persisted

I need to polish a csv dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")
df['LINK'].apply(polish_track_link)
print(df)
this prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign the result back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas methods str.replace, or replace with regex=True, for replacing substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
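The key point is that Series.apply (and .str.replace) return a new Series rather than modifying the column in place, as this sketch shows:

```python
import pandas as pd

df = pd.DataFrame({'TRACK_LINK': ['https://mylink.com//track/abc']})

fixed = df['TRACK_LINK'].str.replace('//track', '/track', regex=False)
# the original column is untouched until the result is assigned back
print(df['TRACK_LINK'].iloc[0])  # https://mylink.com//track/abc
df['TRACK_LINK'] = fixed
print(df['TRACK_LINK'].iloc[0])  # https://mylink.com/track/abc
```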
