Python Forum
Fastest dict/map method when 'key' is already a hash?
I'm implementing deduplication for a new backup tool based on LVM thin-delta. The destination archive consists only of chunks, together with a manifest file listing each chunk's filename and SHA-256 hash.

Since my keys for each chunk are already hashes (the value is the chunk filename), is there a more efficient way to store and find chunks in Python than dict()?



For example:

chunkidx = {}
with open("manifest","r") as mf:
  for ln in mf:
    chash, chname = ln.strip().split()
    chunkidx[chash] = chname
The above builds a dict, chunkidx, from the manifest file. Python hashes each SHA-256 string again internally before storing the key-value pair, and I assume this adds some overhead.

When using the index:

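import hashlib
import zlib

while True:
  raw = volume.read(bufsize)
  if not raw:  # end of volume reached
    break
  buf    = zlib.compress(raw)
  chash  = hashlib.sha256(buf).hexdigest()  # hash of the compressed chunk
  if chash in chunkidx:
    send_link_to_existing_chunk(chunkidx[chash], chash)
  else:
    send_new_chunk(chname, chash)  # chname: filename for the new chunk (assignment not shown)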
In the above example, the overhead seems to be compounded, because Python must compute an internal hash for each new chash string before it can search the chunkidx dict.

(I realize there is also the hexdigest (ASCII) vs. digest (binary) question, which I'll decide based on overall efficiency; I'm open to suggestions here as well.)
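To make the hexdigest vs. digest choice concrete, here is a minimal sketch that builds the same index both ways and times a lookup that includes creating the key, as the read loop does. The chunk filenames, the index size and the helper names build_indexes and lookup_cost are made up for illustration, and the timings will vary by machine and Python version, so treat it as a way to measure rather than as a result.

```python
import hashlib
import timeit

def build_indexes(n):
    """Build two equivalent indexes: one keyed by hex string, one by raw digest."""
    by_hex, by_raw = {}, {}
    for i in range(n):
        h = hashlib.sha256(str(i).encode())
        name = f"chunk{i:08d}"  # hypothetical chunk filename
        by_hex[h.hexdigest()] = name
        by_raw[h.digest()] = name
    return by_hex, by_raw

by_hex, by_raw = build_indexes(10_000)

probe = hashlib.sha256(b"1234")
hex_key = probe.hexdigest()  # 64-character str
raw_key = probe.digest()     # 32-byte bytes object

# Both key forms identify the same chunk.
assert by_hex[hex_key] == by_raw[raw_key]

def lookup_cost(index, make_key, number=100_000):
    """Time key creation plus the membership test, as in the read loop."""
    return timeit.timeit(lambda: make_key() in index, number=number)

if __name__ == "__main__":
    print("hexdigest:", lookup_cost(by_hex, probe.hexdigest))
    print("digest:   ", lookup_cost(by_raw, probe.digest))
```

The raw digest is half the size of the hex string, so the binary form at least reduces the memory held by the keys; whether it also speeds up lookups is what the timing is meant to show.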

Additional:

My only idea for greater speed (so far) involves storing the first 4 bytes of chash in some sort of bytearray and using that for a pre-search. If there is a match in the bytearray search, then do a lookup and compare of the full hash values. I may be wrong, but this seems like it could enhance overall search speed.
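A rough sketch of that pre-search idea, assuming raw 32-byte digests as keys. The PrefixIndex class, its find method and the sample chunk names are hypothetical, and a sorted array searched with bisect stands in for "some sort of bytearray". Note that CPython's dict already compares stored hash values before it compares full keys, so this extra stage may not gain anything; it would need to be benchmarked against a plain dict.

```python
import bisect
import hashlib
from array import array

class PrefixIndex:
    """Hypothetical two-stage index: 4-byte prefix filter, then full-hash dict."""

    def __init__(self, entries):
        # entries: iterable of (raw 32-byte digest, chunk filename) pairs
        self.full = dict(entries)
        # Sorted, de-duplicated unsigned ints taken from the first 4 bytes
        # of each digest.
        prefixes = {int.from_bytes(d[:4], "big") for d in self.full}
        self.prefixes = array("L", sorted(prefixes))

    def find(self, digest):
        p = int.from_bytes(digest[:4], "big")
        i = bisect.bisect_left(self.prefixes, p)
        if i == len(self.prefixes) or self.prefixes[i] != p:
            return None  # prefix absent, so the full hash cannot be present
        return self.full.get(digest)  # prefix matched: confirm with the full hash

# Usage with made-up chunk data and filenames.
entries = [
    (hashlib.sha256(b"a").digest(), "chunk-a"),
    (hashlib.sha256(b"b").digest(), "chunk-b"),
]
idx = PrefixIndex(entries)
found = idx.find(hashlib.sha256(b"a").digest())
missing = idx.find(hashlib.sha256(b"zzz").digest())
```

The sorted array has to be rebuilt or updated whenever a new chunk is added, which is an extra cost that a plain dict does not have.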


Messages In This Thread
Fastest dict/map method when 'key' is already a hash? - by tasket - Apr-13-2019, 03:35 AM
