### What Deduplication Is

According to Wikipedia, “Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.”

### Methods of Deduplication

  1. File-level deduplication
  2. Block-level deduplication

File-level deduplication watches for multiple copies of the same file, stores the first copy, and links all other references to that first file. Only one copy gets stored on the disk/tape archive. Ultimately, the space you save on disk depends on how many copies of the file existed in the file system.

Let's assume a company with 1000 employees shares a common file, say “data.txt”, which is 10 MB in size. Each employee makes the same changes and saves an identical copy, so 1000 exact copies of the file end up on the server, requiring an estimated 10 GB of storage.

Here the point is: if all the files are identical, why upload every copy to the server? Just save a single copy on the server and put a pointer in each user's folder that points to that copy. That is how the data deduplication technique saves terabytes of storage.
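The idea above can be sketched in a few lines of C. This is a hedged sketch, not a real storage system: it uses a simple FNV-1a hash as a stand-in for a content hash, a fixed-size toy table for the stored copies, and a hypothetical `dedup_files` helper that maps every user's copy to the index of the one stored copy.

```C
#include <stdint.h>

/* FNV-1a 64-bit hash of a file's content (a simple stand-in for a
   stronger content hash; real systems use e.g. SHA-256). */
static uint64_t file_hash(const char *data)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *data; data++) {
        h ^= (unsigned char)*data;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* File-level deduplication: each user's copy either stores the content
   or just points at an earlier identical copy.  owner[i] receives the
   index of the stored copy that user i's file points to. */
int dedup_files(const char *contents[], int n, int owner[])
{
    uint64_t seen[64];            /* hashes of stored copies (toy capacity) */
    int stored = 0;
    for (int i = 0; i < n; i++) {
        uint64_t h = file_hash(contents[i]);
        int j;
        for (j = 0; j < stored; j++)
            if (seen[j] == h) break;     /* duplicate: reuse copy j */
        owner[i] = j;
        if (j == stored)
            seen[stored++] = h;          /* first copy: store it */
    }
    return stored;  /* number of copies actually kept on the server */
}
```

With 1000 identical copies of “data.txt”, `dedup_files` would return 1: a single 10 MB copy plus 1000 small pointers, instead of 10 GB.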

Block-level Deduplication, sometimes called variable block-level deduplication, looks at the data block itself to see if another copy of this block already exists. If so, the second (and subsequent) copies are not stored on the disk/tape, but a link/pointer is created to point to the original copy.

Let's say we have three users, each with 4 data blocks. The green and gray blocks are common to all three users, so only one copy of each is backed up in the data center. The blue, red, and purple blocks are shared between two users, so each of them is also backed up only once. The important point is that the storage required for a block in the data center is just sizeof(block), no matter how many users reference it.
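The three-user example can be worked through numerically. Below is a small sketch (the integer block labels are hypothetical stand-ins for the colored blocks): the data center only has to store as many blocks as there are distinct labels.

```C
/* Count the distinct blocks across all users: that is all the data
   center has to store, regardless of how many users share them. */
int count_unique(const int blocks[], int n)
{
    int uniq[64], count = 0;        /* toy capacity for the sketch */
    for (int i = 0; i < n; i++) {
        int j;
        for (j = 0; j < count; j++)
            if (uniq[j] == blocks[i]) break;  /* already stored */
        if (j == count)
            uniq[count++] = blocks[i];        /* new unique block */
    }
    return count;
}
```

Three users with 4 blocks each reference 12 blocks in total; if only 5 of them are distinct (green, gray, blue, red, purple), the data center stores 5 × sizeof(block) instead of 12 × sizeof(block).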

Block-level deduplication is generally more efficient than file-level deduplication: at the file level, the whole file must be stored again whenever its version changes, whereas at the block level only the changed blocks are stored, which takes comparatively less space.

### Implementation of the Deduplication Algorithm

In Block-level deduplication we parse the data in terms of blocks.

Steps of the Algorithm

1. Start
2. Declare variables
3. Initialize variables
4. Read BLOCKSIZE (1024) bytes from the file in one iteration
5. Repeat until EOF is reached
   5.1 Generate a hash value from strBuff[BLOCKSIZE]
   5.2 if (FirstBlock)
          Make the node the root element
          Increment BlockCtr
       else
          Search for the generated hash in the BST
          if (hash found)
             Create the node
             Add the node to the linked list
             Update the end link of the SLL
          else
             Add the node to the BST
             Increment BlockCtr
6. Calculate the deduplication ratio
7. Display the result for each iteration
8. End

We have a diagrammatic representation that includes the BST and SLL created during block-level parsing of the file.

Here we use two important structures.

Binary Search Tree Structure

```C
typedef struct treeNode
{
    TCHAR tszHash[MAX];
    int iBlockNo;
    struct Sll *structSlink;
    struct Sll *structLlink;
    struct treeNode *left;
    struct treeNode *right;
} treeNode;
```

  1. tszHash <- This field stores the hash generated by the 64-bit CRC hash algorithm
  2. iBlockNo <- Each block has a specific block number; repeated blocks point to that block number
  3. structSlink <- Start-link pointer that holds the address of the first node of the singly linked list
  4. structLlink <- End-link pointer that holds the address of the last node of the singly linked list
  5. left <- Pointer that stores the address of the left child
  6. right <- Pointer that stores the address of the right child
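A minimal sketch of how the BST operations on this structure might look. This is an assumption, not the article's code: it uses `char`/`strcmp` where the article uses `TCHAR`, and drops the block-counter out-parameter that the real `Insert` threads through.

```C
#include <stdlib.h>
#include <string.h>

#define MAX 32

typedef struct treeNode {
    char tszHash[MAX];            /* hash of the block, as a string   */
    int iBlockNo;                 /* block number of the first copy   */
    struct Sll *structSlink;      /* start of the duplicate list      */
    struct Sll *structLlink;      /* end of the duplicate list        */
    struct treeNode *left, *right;
} treeNode;

/* Insert a unique block's hash into the BST, keyed on the hash. */
treeNode *Insert(treeNode *root, const char *hash, int blockNo)
{
    if (root == NULL) {
        treeNode *n = calloc(1, sizeof *n);   /* zeroes both SLL links */
        strncpy(n->tszHash, hash, MAX - 1);
        n->iBlockNo = blockNo;
        return n;
    }
    int cmp = strcmp(hash, root->tszHash);
    if (cmp < 0)      root->left  = Insert(root->left,  hash, blockNo);
    else if (cmp > 0) root->right = Insert(root->right, hash, blockNo);
    return root;
}

/* Find the node holding this hash, or NULL if the block is unique. */
treeNode *Find(treeNode *root, const char *hash)
{
    while (root) {
        int cmp = strcmp(hash, root->tszHash);
        if (cmp == 0) return root;
        root = (cmp < 0) ? root->left : root->right;
    }
    return NULL;
}
```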

Single Linked List Structure

struct Sll
{
     int iRepBlockNo;
     struct Sll *pLink;
};

  1. iRepBlockNo <- Block number of the repeated block; these entries keep track of which blocks are duplicates
  2. pLink <- Pointer that holds the address of the next node
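Appending a duplicate block number to a node's SLL, updating both the start and end links, might look like the sketch below. Only the two list pointers of the tree node are modeled here (`DupList` is a hypothetical stand-in for that pair of fields); keeping the end link makes each append O(1).

```C
#include <stdlib.h>

struct Sll {
    int iRepBlockNo;     /* block number of the repeated block */
    struct Sll *pLink;   /* next duplicate, or NULL            */
};

/* Stand-in for the structSlink/structLlink pair of a tree node. */
struct DupList {
    struct Sll *structSlink;   /* first duplicate */
    struct Sll *structLlink;   /* last duplicate  */
};

void append_duplicate(struct DupList *node, int blockNo)
{
    struct Sll *p = malloc(sizeof *p);
    p->iRepBlockNo = blockNo;
    p->pLink = NULL;
    if (node->structSlink == NULL)
        node->structSlink = p;          /* first duplicate seen     */
    else
        node->structLlink->pLink = p;   /* link after current tail  */
    node->structLlink = p;              /* new tail either way      */
}
```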

Core Logic Loop

  1. As the loop runs, it reads one block (BLOCKSIZE bytes) of data per iteration and generates the hash for it.
  2. The search procedure (Find()) checks whether the hash is already present in the BST.
  3. If the hash already exists:
    • Create the node
    • Add the node to the singly linked list
    • Update the slink and elink pointers of the tree node
  4. If the hash doesn't exist:
    • Create the node
    • Use the hash as the key for comparison
    • Add the node to the BST
  5. Calculate the deduplication percentage, also known as the deduplication ratio.
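The ratio in step 5 can be computed from the two block counters. A small sketch, assuming the common convention that the deduplication ratio is original size over deduplicated size (the helper names are illustrative, not from the article):

```C
/* Deduplication ratio: total blocks over unique blocks.
   E.g. 4 blocks reduced to 2 unique gives a 2.0 : 1 ratio. */
double dedup_ratio(long totalBlocks, long uniqueBlocks)
{
    if (uniqueBlocks == 0) return 0.0;
    return (double)totalBlocks / (double)uniqueBlocks;
}

/* Space saved, as a percentage of the original storage. */
double dedup_savings_percent(long totalBlocks, long uniqueBlocks)
{
    if (totalBlocks == 0) return 0.0;
    return 100.0 * (totalBlocks - uniqueBlocks) / totalBlocks;
}
```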

Syntax

```C
while(is.read(strBuff, BLOCKSIZE))
{
    /* Convert the raw block to a wide string and hash it */
    mbstowcs(wcstring, strBuff, sizeof(strBuff));
    wstring strTempHash = HashString(wcstring).c_str();

    if(iIncCtr == 0)
    {
        /* First block becomes the root of the BST */
        root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
    }
    else
    {
        pTemp = Find(root, strTempHash.c_str());
        if(pTemp == NULL)
        {
            /* Unique block: add a new node to the BST */
            root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
        }
        else
        {
            /* Duplicate block: append its number to the node's SLL */
            pNwLink = (Sll*)malloc(sizeof(Sll));
            pNwLink->pLink = NULL;
            pNwLink->iRepBlockNo = iIncCtr;
            if(pTemp->structSlink == NULL && pTemp->structLlink == NULL)
            {
                pTemp->structSlink = pNwLink;
                pTemp->structLlink = pNwLink;
            }
            else
            {
                pTemp->structLlink->pLink = pNwLink;
                pTemp->structLlink = pNwLink;
            }
        }
    }
    iIncCtr++;
}
```

* The code has a time complexity of **O(n log n)**, where n is the number of blocks (assuming the BST stays reasonably balanced).

Output

The output below shows the result of block-level deduplication on a 50 MB file. Abbreviations used:

  1. SLL <- Singly Linked List
  2. BST <- Binary Search Tree
  3. Dedup<- Deduplication

About me

An enthusiastic software developer at Harman. Loves gaming.