Best way to handle path with pandas - python

When I have a pd.DataFrame with paths, I end up doing a lot of .map(lambda path: Path(path).{method_name}) or apply(axis=1) calls, e.g.:
(
pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
.assign(full_path=lambda df: df.apply(lambda row: Path(row.base_dir) / row.file_name, axis=1))
)
base_dir file_name full_path
0 dir_A file_0 dir_A/file_0
1 dir_B file_1 dir_B/file_1
It seems odd to me, especially because pathlib does implement the / operator, so something like df.base_dir / df.file_name would be more pythonic and natural.
I have not found any path type implemented in pandas, is there something I am missing?
EDIT
I have found it may be better to do, once and for all, a sort of astype(Path); then at least path concatenation with pathlib is vectorized:
(
pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
# this is where I would expect `astype({'base_dir': Path})`
# bind col_name per iteration via a default argument; a plain closure would
# make every lambda use the last value of col_name
.assign(**{col_name: (lambda df, c=col_name: df[c].map(Path)) for col_name in ["base_dir", "file_name"]})
.assign(full_path=lambda df: df.base_dir / df.file_name)
)

It seems like the easiest way would be:
df.base_dir.map(Path) / df.file_name.map(Path)
It saves the need for a lambda function, but you still need to map to Path.
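Since a pathlib Path accepts a plain string on the right-hand side of /, mapping just one of the two columns may be enough; a minimal sketch under that assumption:
import pandas as pd
from pathlib import Path

df = pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
# Path / str works element-wise once the left column holds Path objects
full_path = df.base_dir.map(Path) / df.file_name
print(full_path)  # expected: dir_A/file_0 and dir_B/file_1 as Path objects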
Alternatively, just do:
df.base_dir.str.cat(df.file_name, sep="/")
The latter won't work on Windows (who cares, right? :) but will probably run faster.
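If the Windows caveat actually matters, a small variation (a sketch, assuming both columns hold plain strings) is to pass os.sep instead of a hard-coded slash:
import os
df.base_dir.str.cat(df.file_name, sep=os.sep)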

import pandas as pd
import os
df = pd.DataFrame({"p1":["path1"],"p2":["path2"]})
df.apply(lambda x:os.path.join(x.p1, x.p2), axis=1)
Output:
0 path1\path2
dtype: object
Edit:
After being told not to use assign, you can try this.
See .to_json() docs
import os
import pandas as pd
df = pd.DataFrame({"p1":["path1", "path3"],"p2":["path2", "path4"]})
print(df.to_json(orient="values"))
Output
[["path1","path2"],["path3","path4"]]
From here it's simple: just use map(lambda x: os.path.join(*x), ...) and you get a list of paths.
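A rough sketch of that last step (assuming the df above; json.loads turns the to_json string back into a list of rows):
import json
import os

rows = json.loads(df.to_json(orient="values"))       # [["path1", "path2"], ["path3", "path4"]]
paths = list(map(lambda x: os.path.join(*x), rows))  # join each row into a path
print(paths)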

Using pandas-path
The pandas-path package provides the functionality you need and more. Just by importing it, it adds a .path accessor to pd.Series and pd.Index that makes pathlib methods available.
import pandas as pd
import pandas_path
df = pd.DataFrame({'base_dir': ['dir_A', 'dir_B'], 'file_name': ['file_0', 'file_1']})
# .path accessor added by importing pandas_path
df.base_dir.path / df.file_name.path
#> 0 dir_A/file_0
#> 1 dir_B/file_1
#> dtype: object
Created at 2021-03-06 18:09:44 PST by reprexlite v0.4.2

Related

Splitting a path column in python

Hi, I have a column with paths like this:
path_column = ['C:/Users/Desktop/sample\\1994-QTR1.tsv','C:/Users/Desktop/sample\\1995-QTR1.tsv']
I need to split these and get just the file name (without the extension).
Expected output:
[1994-QTR1,1995-QTR1]
Thanks
Use str.extract:
df['new'] = df['path'].str.extract(r'\\([^\\]*)\.\w+$', expand=False)
The equivalent with rsplit would be much less efficient:
df['new'] = df['path'].str.rsplit('\\', n=1).str[-1].str.rsplit('.', n=1).str[0]
Output:
path new
0 C:/Users/Desktop/sample\1994-QTR1.tsv 1994-QTR1
1 C:/Users/Desktop/sample\1995-QTR1.tsv 1995-QTR1
regex demo
Similar to the above, but you don't need to declare the separator.
import os
path = "C:/Users/Desktop/sample\\1994-QTR1.tsv"
name = path.split(os.path.sep)[-1]
print(name)
Use this, or you can use a regex to match and take exactly what you want.
path.split("\\")[-1].split(".")[0]
Output:
'1994-QTR1'
Edit
new_col = []
for i in path_column:
    new_col.append(i.split("\\")[-1].split(".")[0])
print(new_col)
NOTE: If you need it in a list, you can append it to a new list from the loop.
Output:
['1994-QTR1', '1995-QTR1']
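The same thing as a list comprehension, a minor variation on the loop above:
new_col = [p.split("\\")[-1].split(".")[0] for p in path_column]
print(new_col)  # ['1994-QTR1', '1995-QTR1']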
You might harness pathlib for this task in the following way:
import pathlib
import pandas as pd
def get_stem(path):
    return pathlib.PureWindowsPath(path).stem
df = pd.DataFrame({'paths':['C:/Users/Desktop/sample\\1994-QTR1.tsv','C:/Users/Desktop/sample\\1994-QTR2.tsv','C:/Users/Desktop/sample\\1994-QTR3.tsv']})
df['names'] = df.paths.apply(get_stem)
print(df)
gives output
paths names
0 C:/Users/Desktop/sample\1994-QTR1.tsv 1994-QTR1
1 C:/Users/Desktop/sample\1994-QTR2.tsv 1994-QTR2
2 C:/Users/Desktop/sample\1994-QTR3.tsv 1994-QTR3

Python Numpy Select Dynamic statement from string

I am trying to do the following,
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ('Harry','Sally','Megan'), 'Age': (30, 31,'NN')})
a={'target':"Age2",'check':"==30",'iftrue':["Is"]}
condis=[
df['Age'] a['check']
]
df[a['target']]= np.select(condis,a['iftrue'],default=" ")
print(df)
I am stuck trying to convert the a['check'] parameter, which is received as a string, into an expression, so that this,
df['Age'] a['check']
should resolve/compile to
df['Age'] ==30
Could someone give me any ideas on how to achieve this? Maybe I am missing something very basic and simple here.
Thanks.
You can use eval to convert the string into a condition:
check = "==30"
age = "20"
print(eval(age+check))
>>> False
But this is not recommended, because eval is a function to use very carefully: it can execute arbitrary code, which causes security issues and makes debugging hard.
A more proper solution would be, for example, to have one argument for the comparison operator and one for the comparison value:
check_op = np.equal
check_arg = 30
print(check_op(check_arg, 20))
>>> False
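To tie it back to the question, here is a rough sketch of how the operator-based check could feed np.select (numeric ages are assumed here; the original 'NN' entry would need separate handling):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ('Harry', 'Sally', 'Megan'), 'Age': (30, 31, 30)})
a = {'target': 'Age2', 'check_op': np.equal, 'check_arg': 30, 'iftrue': ['Is']}

# build the condition list from the operator and argument instead of a string
condis = [a['check_op'](df['Age'], a['check_arg'])]
df[a['target']] = np.select(condis, a['iftrue'], default=' ')
print(df)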

How does Pandas.read_csv type casting work?

Using pandas.read_csv with parse_dates option and a custom date parser, I find Pandas has a mind of its own about the data type it's reading.
Sample csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual datecleaner is here, but what I do boils down to this:
import pandas as pd
def dateclean(date):
    return str(int(date))  # Note: we return A STRING
df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)
print(df.birth_date)
Output:
0 NaN
1 1625.0
2 1533.0
Name: birth_date, dtype: float64
I get type float64, even when I specified str. Also, take out the first line in the CSV, the one with the empty birth_date, and I get type int. The workaround is easy:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?
Using parse_dates / date_parser looks a bit complicated to me, unless you want to generalise your import across many date columns. I think you have more control with the converters parameter, which is where your dateclean() function fits. You can also experiment with the dtype parameter.
The problem with the original dateclean() function is that it fails on the "" value, because int("") raises ValueError. Pandas seems to fall back to its standard import when it encounters this problem, but with converters it will fail explicitly.
Below is the code to demonstrate a fix:
import pandas as pd
from pathlib import Path
doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)
def dateclean(date):
    try:
        return str(int(date))
    except ValueError:
        return ''
df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)
df2 = pd.read_csv(
    'my.csv',
    converters={'birth_date': dateclean}
)
print(df2.birth_date)
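For the dtype experiment mentioned above, a hedged sketch (the empty field still comes through as NaN unless you post-process it or tweak na_values / keep_default_na):
df3 = pd.read_csv('my.csv', dtype={'birth_date': str})
print(df3.birth_date)  # values stay strings like '1625'; the empty field becomes NaN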
Hope it helps.
The problem is date_parser is designed specifically for conversion to datetime:
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of datetime instances.
There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).
from io import StringIO
import pandas as pd
from toolz import compose
mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')
df = pd.read_csv(mystr,
                 converters={'birth_date': compose(str, int)},
                 engine='python')
print(df.birth_date)
0 NaN
1 1625
2 1533
Name: birth_date, dtype: object
If you need to replace NaN with empty strings, you can post-process with fillna:
print(df.birth_date.fillna(''))
0
1 1625
2 1533
Name: birth_date, dtype: object

python pandas - function applied to csv is not persisted

I need to polish a csv dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")
df['TRACK_LINK'].apply(polish_track_link)
print(df)
this prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign it back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas functions str.replace, or replace with regex=True, for replacing substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/

How to add a column to Pandas based off of other columns

I'm using Pandas and I have a very basic dataframe:
session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z
I wish to add a third column based on the values of the current columns:
df['key'] = urlsafe_b64encode(md5('l' + df['session_id'] + df['datetime']))
But I receive:
TypeError: must be convertible to a buffer, not Series
You need to use pandas.DataFrame.apply. The code below will apply the lambda function to each row of df. You could, of course, define a separate function (if you need to do something more complicated).
import pandas as pd
from io import StringIO
from base64 import urlsafe_b64encode
from hashlib import md5
s = ''' session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z'''
df = pd.read_csv(StringIO(s), sep='\s+')
df['key'] = df.apply(lambda x: urlsafe_b64encode(md5('l' + x['session_id'] + x['datetime'])), axis=1)
Note: I couldn't get the hashing bit working on my machine unfortunately (some unicode error, might be because I'm using Python 3) and I don't have time to debug the inner workings of it, but the pandas part I'm pretty sure about :P
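For reference, a sketch of what the hashing bit might look like on Python 3 (md5 needs bytes, and urlsafe_b64encode wants the digest rather than the hash object):
df['key'] = df.apply(
    lambda x: urlsafe_b64encode(
        md5(('l' + x['session_id'] + x['datetime']).encode()).digest()
    ).decode(),  # decode so the column holds str rather than bytes
    axis=1,
)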
