What data structure is good for maintaining file paths? - python

I'm working on the "Longest Absolute File Path" problem on LeetCode. This is a simple problem that asks "What is the length of the longest absolute file path in a given directory?". My working solution is as follows; the file directory is given as a string.
def lengthLongestPath(self, input):
    """
    :type input: str, the file directory
    :rtype: int
    """
    current_folder_path = [""] * 40
    longest_file_path_size = 0
    for item in input.split("\n"):
        num_tabs = item.count("\t")
        print num_tabs
        if "." not in item:
            current_folder_path[num_tabs] = item.lstrip("\t")
        else:
            absolute_file_path = "/".join(current_folder_path[:num_tabs] + [item.lstrip("\t")])
            print item
            print num_tabs, absolute_file_path, current_folder_path
            longest_file_path_size = max(len(absolute_file_path), longest_file_path_size)
    return longest_file_path_size
This works. However, note that the line current_folder_path = [""] * 40 is very inelegant. It exists only to remember the current file path, and I wonder if there is a way to remove it.

The problem statement does not address some fine points. It is very unclear what path may correspond to the string
a\n\t\tb
Is it a//b or plain illegal? If the former, do we need to normalize it?
I guess it is safe to assume that such paths are illegal. In other words, the path depth only grows by 1, and current_folder_path in fact functions like a stack: you don't need to preinitialize it, just push the name when num_tabs exceeds its size, and pop as necessary.
As a side note, since join is linear in the current accumulated length, the entire algorithm appears quadratic, which violates the time complexity requirement.
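For illustration, here is a minimal sketch of that stack-based idea (not the original poster's code). It keeps cumulative path lengths on the stack instead of joining strings, which also avoids the quadratic cost of join:

def lengthLongestPath(input):
    # stack[i] holds the length of the path prefix at depth i (slashes included)
    stack = []
    longest = 0
    for item in input.split("\n"):
        depth = item.count("\t")
        name = item.lstrip("\t")
        del stack[depth:]                  # drop entries at this depth or deeper
        prefix = stack[-1] if stack else 0
        if "." in name:
            # file: the prefix already counts one '/' per directory level
            longest = max(longest, prefix + len(name))
        else:
            # directory: push its cumulative length (+1 for the trailing '/')
            stack.append(prefix + len(name) + 1)
    return longest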


Parse list of strings for speed

Background
I have a function called get_player_path that takes in a list of strings player_file_list and an int value total_players. For the sake of example I have reduced the list of strings and also set the int value to a very small number.
Each string in player_file_list has either the form year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me, and I am wondering if I can speed up my function get_player_path in any way.
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
player_file_list = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for a final sort; this could be theoretically better but probably worse in practice (a rough sketch follows below).
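For the curious, here is a rough, illustrative sketch of the trie idea (trie_sorted_prefixes is a made-up name for this example; it is unlikely to beat sorted() in practice in CPython):

def trie_sorted_prefixes(player_file_list, total_players):
    root = {}
    seen = 0
    for player_file in player_file_list:
        parts = player_file.split("/")
        prefix = f"{parts[0]}/{parts[1]}/"
        # walk/extend the trie one character at a time
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        if "$" not in node:        # "$" marks the end of a stored prefix
            node["$"] = True
            seen += 1
            if seen == total_players:
                break
    # a depth-first walk visiting children in character order emits the
    # prefixes in lexicographic order, so no final sorted() call is needed
    out = []
    def walk(node, acc):
        for ch in sorted(node):    # per-node child sets are tiny
            if ch == "$":
                out.append(acc)
            else:
                walk(node[ch], acc + ch)
    walk(root, "")
    return out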
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)  # <--- len(date) + len('/') + 1
        file_path = player_file[:end]    # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
I'll leave this solution here, which can be further improved; hope it helps.
player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)
def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
            if len(pfl) == n:
                return pfl
    if n > len(pfl):
        print("not enough matches")
        return
print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort, since your list is already sorted (dicts preserve insertion order in Python 3.7+). If you still need to sort, you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]

Rat with a randomized path in a 2-D array

The problem is similar to the rat-in-a-maze problem. I am given a 2-D array MxN. Each cell of the array is either 1 or 0, where 1 means blocked. I am given 2 points (a starting point and an ending point) and have to go from the start index to the end index. But the catch is: 1) The path should be random. 2) There should be some parameter which allows me to decide how random it can be (i.e. how crazily it should wander before reaching its destination). 3) The path should not intersect itself (like in a snake game).
This algorithm is needed to create a population (randomly) which will be used as input for a genetic model for further optimization.
For now I have used BFS and created one solution. But the problem is that I cannot create any number of random paths with this (which I will later use as a population), and I'm unable to formalize the idea of how random it should be.
This is my code, which only produces the minimum path by using BFS:
def isSafe(x, y, length):
    # is (x, y) inside the (square) room?
    if ((x < length) and (x > -1) and (y < length) and (y > -1)):
        return True
    return False

def path(room, x1, y1, x2, y2, distance):
    # BFS from (x1, y1): write the distance from the start into each reachable open cell
    roomSize = len(room)
    if ((x1 == x2) and (y1 == y2)):
        room[x1][y1] = distance + 1
        return
    queue = [[x1, y1]]
    room[x1][y1] = 0
    start = 0
    end = 0
    while start <= end:
        x, y = queue[start]
        start += 1
        distance = room[x][y]
        for i in [-1, 1]:
            if isSafe(x + i, y, roomSize):
                if room[x + i][y] == "O":
                    queue.append([x + i, y])
                    room[x + i][y] = distance + 1
                    end += 1
        for i in [-1, 1]:
            if isSafe(x, y + i, roomSize):
                if room[x][y + i] == "O":
                    queue.append([x, y + i])
                    room[x][y + i] = distance + 1
                    end += 1

def retrace(array, x1, y1, x2, y2):
    # walk backwards from (x2, y2) following strictly decreasing distances
    roomSize = len(array)
    if not (isSafe(x2, y2, roomSize)):
        print("Wrong Traversing Point")
    if type(array[x2][y2]) == str:
        print("##################No Pipe been installed due to path constrained################")
        return []
    distance = array[x2][y2]
    path = [[x2, y2]]
    x = 0
    while not (array[x2][y2] == 0):
        if ((isSafe(x2 + 1, y2, roomSize)) and type(array[x2 + 1][y2]) == int and array[x2 + 1][y2] == array[x2][y2] - 1):
            x2 += 1
            path.append([x2, y2])
        elif ((isSafe(x2 - 1, y2, roomSize)) and type(array[x2 - 1][y2]) == int and array[x2 - 1][y2] == array[x2][y2] - 1):
            x2 -= 1
            path.append([x2, y2])
        elif ((isSafe(x2, y2 + 1, roomSize)) and type(array[x2][y2 + 1]) == int and array[x2][y2 + 1] == array[x2][y2] - 1):
            y2 += 1
            path.append([x2, y2])
        elif ((isSafe(x2, y2 - 1, roomSize)) and type(array[x2][y2 - 1]) == int and array[x2][y2 - 1] == array[x2][y2] - 1):
            y2 -= 1
            path.append([x2, y2])
    return path
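One possible direction for generating many different paths (an illustrative sketch, not from the original post): a self-avoiding random walk whose wander parameter controls how often it takes a random open neighbour instead of the neighbour closest to the goal. It assumes a square grid with open cells marked "O", as above, and start/goal given as (x, y) tuples; all names are made up for the example:

import random

def random_path(room, start, goal, wander=0.3, max_tries=200):
    """Try to find a self-avoiding path from start to goal.

    wander is the probability of stepping to a random open neighbour instead
    of the neighbour with the smallest Manhattan distance to the goal; higher
    values give longer, more meandering paths.
    """
    n = len(room)

    def neighbours(x, y):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and room[nx][ny] == "O":
                yield nx, ny

    for _ in range(max_tries):
        path = [start]
        visited = {start}
        x, y = start
        while (x, y) != goal:
            options = [p for p in neighbours(x, y) if p not in visited]
            if not options:
                break                     # dead end: abandon this attempt
            if random.random() < wander:
                nxt = random.choice(options)
            else:
                random.shuffle(options)   # break ties randomly
                nxt = min(options, key=lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]))
            path.append(nxt)
            visited.add(nxt)
            x, y = nxt
        else:
            return path                   # while-loop finished normally: goal reached
    return None                           # nothing found within max_tries attempts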

Check if files in dir are the same

I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same? The images were collected through web scraping and have been sequentially renamed, so I cannot compare file names.
I am currently checking whether the hashes are the same, however this is a very long process. I am currently using:
def sameIm(file_name1, file_name2):
    hash = imagehash.average_hash(Image.open(path + file_name1))
    otherhash = imagehash.average_hash(Image.open(path + file_name2))
    return (hash == otherhash)
Then nested loops. Comparing 1 image to 5000+ others takes about 5 minutes, so comparing each to each would take days to compute.
Is there a faster way to do this in Python? I was thinking of parallel processing, but would that still take a long time? Or is there another way to compare files which is faster?
Thanks
There is indeed a much faster way of doing this:
import collections
import glob
import os

import imagehash
from PIL import Image

def dupDetector(dirpath, ext):
    hashes = collections.defaultdict(list)
    for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
        h = imagehash.average_hash(Image.open(fpath))
        hashes[h].append(fpath)
    for h, fpaths in hashes.items():
        if len(fpaths) == 1:
            print(fpaths[0], "is one of a kind")
            continue
        print("The following files are duplicates of each other (with the hash {}): \n\t{}".format(h, '\n\t'.join(fpaths)))
Using the dictionary with the file hash as a key gives you O(1) lookups, which means you don't need to do the pairwise comparisons. You therefore go from a quadratic runtime to a linear runtime (yay!)
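A usage example (the directory name and extension are just placeholders):

# prints the one-of-a-kind files and the groups of duplicates under ./scraped_images
dupDetector("scraped_images", "jpg")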
Why not compute hash only once?
hashes = [imagehash.average_hash(Image.open(path + fn)) for fn in file_names]

def compare_hashes(hash1, hash2):
    return hash1 == hash2
One solution is to keep using the hash but store it in a list of tuples (or a dict, I don't know which is more efficient here) where the first element is the name of the image and the second is the hash. It should take approximately the same 5 minutes.
If you have 5000 images:
You compare the value of the first element of the list to the 4999 others.
Then the second to the 4998 others (as you already checked the first one).
Then the third ...
This "just" makes you do n²/2 comparisons (where n is the number of images); a short sketch follows.
Just use a map structure (a dict) to calculate the hashes for each image, then store each hash as a key and the name of the image as the value.
As a result you would have an array of unique image names.
def get_hash(filename):
    return imagehash.average_hash(Image.open(path + filename))

def get_unique_images(filenames):
    hashes = {}
    for filename in filenames:
        image_hash = get_hash(filename)
        hashes[image_hash] = filename
    return hashes.values()

neo4j: Return all nodes on the longest matching path

Using neo4j 1.9.4 and py2neo 1.6, I have a binary tree like structure in my graph, where each node has up to two children. However, the graph is not complete, so it might look like this, where "(x)" represents nodes and "[y]" represents relations.
            (root)
            /    \
          [0]    [1]
          /        \
        (0)        (1)
        / \          \
      [0] [1]        [1]
      /     \          \
    (00)   (01)       (11)
     /
   [0]
   /
 (000)
I made up some similar example here: http://console.neo4j.org/r/ni6t5b (if it is not working for some reason, you can create the graph using the following command):
CREATE (node1 { name: '1' }),(node2 { name: '11' }),(node3 { name: '10' }),(node4 { name: '0' }),(node5 { name: '01' }),(node6 { name: '00' }),(node7 { name: '10' }),(root { name:'root' }), root-[:`1`]->node1, node1-[:`1`]->node2, node1-[:`0`]->node7, node2-[:`0`]->node3, root-[:`0`]->node4, node4-[:`1`]->node5, node4-[:`0`]->node6
I would like to return all nodes that exist on a (single) specific path.
START root=node(1)
MATCH path = root-[?:`0`]-()-[?:`0`]-()-[?:`0`]-()
RETURN NODES(path)
(Note: let ID(root) = 1) This will return all nodes that are in the path, namely (0), (00) and (000)
However, as I don't know what depth each branch has, I also want to query something like this:
START root=node(1)
MATCH path = root-[?:`1`]-()-[?:`1`]-()-[?:`0`]-()-[?:`0`]-()
RETURN NODES(path)
This should return all nodes that are on the longest possible path, i.e. (1) and (11). In fact, this query does not return anything. How can I achieve that? Note: For each path, I don't know beforehand, if this path has an existing end node or not. I only want to return all nodes on that path that exist.
Additionally, what's the best way to automatically construct such a query using py2neo (Python)? For example, I have a list containing several paths, each of which I need to query. Each path just consists of '0's and '1's:
list_of_paths = ["0010", "101", "11", "101110"]
Thank you, guys!
EDIT: Here (http://wes.skeweredrook.com/cypher-longest-path/) I could find a similar problem. However, the relation type is always the same, which does not hold for my scenario. So I think I can't adopt this solution from there.
EDIT2: The suggestion of Wes does not solve my problem:
START root=node(1)
MATCH path=root-[:`1`|`0`*]->leaf
RETURN path
This query returns all possible paths starting from the root-node. What I want is the nodes of one specific path.
EDIT3: The updated suggestion of Wes does not solve my problem either. It only returns the longest existing path of the whole graph. But I want to query a specific path and return all nodes up to the point where the path no longer exists in the graph. So I could query a really long path while in fact the path already stops at the first node, e.g. root-[1]->()-[0]->()-... This query should only return node (1), as the path stops at that point. (Node (1) has no outgoing relation of type 0.)
EDIT4: I tried to figure out a solution which works, but it is quite dirty.
tree_root, = graph_db.create({"name": "root"})  # create a root
node_list = []
my_path = [1, 1, 0, 1, 1, 1]  # path to query
len_of_path = len(my_path)
# add the corresponding number of nodes to the list and name them n0,..nX
for i in range(0, len_of_path):
    node_list.append("n" + str(i))
# construct the query string
my_query_start = 'START root=node({root_id}) MATCH (root)'
my_query_return = ' RETURN'
for i in range(0, len(node_list)):
    my_query_start += '-[?:`' + str(my_path[i]) + '`]->(' + str(node_list[i]) + ')'
    if i == len(node_list) - 1:
        my_query_return += ' ' + str(node_list[i])
    else:
        my_query_return += ' ' + str(node_list[i]) + ','
# concatenate the query
complete_query = my_query_start + my_query_return
#print "complete_query:", complete_query
# execute the query
query_paths = neo4j.CypherQuery(graph_db, complete_query)
params = {"root_id" : tree_root._id}
my_list_of_nodes = query_paths.execute(**params)
# output results
print "data of my_list_of_nodes:", my_list_of_nodes.data
print "columns of my_list_of_nodes:", my_list_of_nodes.columns
Please guys, don't tell me that this is supposed to be the final solution ;-)
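For what it's worth, the query construction in EDIT4 can be written a bit more compactly; the sketch below only builds the same query text (my_path and the {root_id} parameter as in the post), nothing more:

def build_query(my_path):
    # e.g. build_query([1, 1, 0]) ->
    # 'START root=node({root_id}) MATCH (root)-[?:`1`]->(n0)-[?:`1`]->(n1)-[?:`0`]->(n2) RETURN n0, n1, n2'
    pattern = "".join("-[?:`{}`]->(n{})".format(step, i) for i, step in enumerate(my_path))
    returns = ", ".join("n{}".format(i) for i in range(len(my_path)))
    return "START root=node({root_id}) MATCH (root)" + pattern + " RETURN " + returns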
Can you possibly use a variable length path with an OR for the reltype?
MATCH path=root-***put your partial pattern here***->[:`1`|`0`*]->leaf
WHERE NOT (leaf)-[:`0`|`1`]->()
RETURN path
ORDER BY length(path) DESC
LIMIT 1
I think it solves your example... but is it more complicated than that?

Get actual disk space of a file

How do I get the actual file size on disk in Python (i.e. the actual space it takes up on the hard drive)?
UNIX only:
import os
from collections import namedtuple

_ntuple_diskusage = namedtuple('usage', 'total used free')

def disk_usage(path):
    """Return disk usage statistics about the given path.

    Returned value is a named tuple with attributes 'total', 'used' and
    'free', which are the amount of total, used and free space, in bytes.
    """
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return _ntuple_diskusage(total, used, free)
Usage:
>>> disk_usage('/')
usage(total=21378641920, used=7650934784, free=12641718272)
>>>
Edit 1 - also for Windows: https://code.activestate.com/recipes/577972-disk-usage/?in=user-4178764
Edit 2 - this is also available in Python 3.3+: https://docs.python.org/3/library/shutil.html#shutil.disk_usage
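For reference, the standard-library version can be used like this (Python 3.3+):

import shutil

# shutil.disk_usage returns a named tuple (total, used, free), all in bytes
usage = shutil.disk_usage("/")
print(usage.total, usage.used, usage.free)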
Here is the correct way to get a file's size on disk, on platforms where st_blocks is set:
import os

def size_on_disk(path):
    st = os.stat(path)
    return st.st_blocks * 512
Other answers that indicate to multiply by os.stat(path).st_blksize or os.statvfs(path).f_bsize are simply incorrect.
The Python documentation for os.stat_result.st_blocks very clearly states:
st_blocks
Number of 512-byte blocks allocated for file. This may be smaller than st_size/512 when the file has holes.
Furthermore, the stat(2) man page says the same thing:
blkcnt_t st_blocks; /* Number of 512B blocks allocated */
Update 2021-03-26: Previously, my answer rounded the logical size of the file up to be an integer multiple of the block size. This approach only works if the file is stored in a contiguous sequence of blocks on disk (or if all the blocks are full except for one). Since this is a special case (though common for small files), I have updated my answer to make it more generally correct. However, note that unfortunately the statvfs method and the st_blocks value may not be available on some systems (e.g., Windows 10).
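A quick way to see the difference between the logical size and the size on disk (a throwaway demo for POSIX filesystems that support sparse files; the file name is arbitrary):

import os

# create a 10 MiB sparse file: seek far ahead and write a single byte
with open("sparse.bin", "wb") as f:
    f.seek(10 * 1024 * 1024 - 1)
    f.write(b"\0")

st = os.stat("sparse.bin")
print("logical size: ", st.st_size)          # 10485760
print("size on disk: ", st.st_blocks * 512)  # typically far smaller (e.g. 4096)
os.remove("sparse.bin")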
Call os.stat(filename).st_blocks to get the number of blocks in the file.
Call os.statvfs(filename).f_bsize to get the filesystem block size.
Then compute the correct size on disk, as follows:
num_blocks = os.stat(filename).st_blocks
block_size = os.statvfs(filename).f_bsize
sizeOnDisk = num_blocks*block_size
st = os.stat(…)
du = st.st_blocks * st.st_blksize
Practically 12 years and no answer on how to do this in Windows...
Here's how to find the 'Size on disk' in Windows via ctypes:
import ctypes

def GetSizeOnDisk(path):
    '''https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getcompressedfilesizew'''
    # GetCompressedFileSizeW returns the low-order DWORD of the size on disk and
    # writes the high-order DWORD (needed for files > 4 GB) into the pointer argument
    filesizehigh = ctypes.c_ulong(0)
    low = ctypes.windll.kernel32.GetCompressedFileSizeW(ctypes.c_wchar_p(path), ctypes.pointer(filesizehigh))
    return (filesizehigh.value << 32) + low

'''
>>> os.stat(somecompressedorofflinefile).st_size
943141
>>> GetSizeOnDisk(somecompressedorofflinefile)
671744
>>>
'''
I'm not certain if this gives the size on disk or the logical size:
import os
filename = "/home/tzhx/stuff.wev"
size = os.path.getsize(filename)
If it's not the droid you're looking for, you can round it up by dividing by the cluster size (as a float), then using ceil, then multiplying.
To get the disk usage for a given file/folder, you can do the following:
import os

def disk_usage(path):
    """Return cumulative number of bytes for a given path."""
    # get total usage of current path
    total = os.path.getsize(path)
    # if path is dir, collect children
    if os.path.isdir(path):
        for file_name in os.listdir(path):
            child = os.path.join(path, file_name)
            # recursively get byte use for children
            total += disk_usage(child)
    return total
The function recursively collects byte usage for files nested within a given path, and returns the cumulative use for the entire path.
You could also add a print("{}: {}".format(path, total)) in there if you want the information for each file to be printed.
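For example (the path is just a placeholder):

# total number of bytes reported by getsize for everything under this directory
print(disk_usage("/home/tzhx"))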
