I was looking at a system the other day that had high storage use and utilisation. Clearly there were difficulties writing because the storage was so full, and delete operations were also taking a long time. I could see the deletes were queued, though, and thought that at least meant things should get better in a few hours. It was then I noticed deduplication was running, and it got me thinking about the problems that might cause.
A while ago I mumbled about my understanding of the types of data deduplication over on another blog (technobabble -- Types of De-Duplication), so I won't cover that again here, but if you need to clarify the terms I use, give that a look over. Here I am thinking about "block level" or "filesystem level" deduplication.
So back to the filesystem. It had about a 30% dedupe saving, which was considerable and probably helping a lot. 20TB less 30% is a nice saving on space... at least it is until you only have 500GB free. Now I could see the deletes, but I thought to myself: just how much disk space will actually get freed up, and how much of what is being deleted is just pointers to duplicated blocks?
In the simple case the 30% number suggests that if 1TB of data were deleted, about 700GB would be freed. When you are trying to free space, deduplication is suddenly making you pay extra: you have to delete more than you get back. But is that 30% saved a good indicator of what deleting is likely to do?
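To put a number on that simple case, here is a minimal sketch. It assumes the dedupe saving is spread evenly over every file, which is a big assumption, and the function name is just mine for illustration.

```python
def naive_freed_gb(deleted_logical_gb: int, saving_percent: int) -> int:
    """Space freed if the dedupe saving were spread evenly over every file."""
    # Each logical GB is assumed to occupy (100 - saving_percent)% of a GB on disk.
    return deleted_logical_gb * (100 - saving_percent) // 100


# 1TB (taken as 1000GB) of logical data at a uniform 30% saving.
print(naive_freed_gb(1000, 30), "GB freed under the uniform assumption")
```

The weak point is the word "uniform": the saving is an average over the whole filesystem and says nothing about the particular files being deleted.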
Next I thought about what might be being deleted. When I need space I look at removing old backups and duplicate copies of files that are no longer needed. Why would that habit change? Aim for the low-hanging fruit. Only now that fruit might not be taking up much space, or any at all. It is hollow fruit that deduplication has already freed space from. Now there are deletes queued up to remove links, which will save very little space or none at all. That 1TB I find and delete could easily be 1TB of the 6TB that deduplication has already saved, so deleting it does nothing for me.
Now what should I do? I need space, but deleting duplicates and backups won't give me much. I need to know what deleting some files would do to the disk space. I need an rm --dry-run that could tell me that removing this 1TB of files will only free 100GB of used space.
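As a rough sketch of what such a dry run would have to work out, here is a toy model in Python. Everything in it is an assumption for illustration: files are modelled as lists of block IDs, the block size is fixed at 4096 bytes, and estimate_freed_bytes is a hypothetical name, not a real tool or filesystem API. The idea is that a block is only freed when every reference to it comes from the files being deleted.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size in bytes, purely illustrative


def estimate_freed_bytes(files, to_delete, block_size=BLOCK_SIZE):
    """Return (logical_bytes, freed_bytes) for deleting the named files.

    files maps a file name to the list of block IDs it references.
    This is a toy model; a real filesystem tracks references internally.
    """
    # How many times each block is referenced across the whole filesystem.
    total_refs = Counter(b for blocks in files.values() for b in blocks)

    # How many of those references would go away with this deletion.
    doomed = set(to_delete)
    removed_refs = Counter(b for name in doomed for b in files[name])

    # A block is freed only if no surviving file still points at it.
    freed_blocks = sum(1 for b, n in removed_refs.items() if total_refs[b] == n)
    logical_blocks = sum(len(files[name]) for name in doomed)
    return logical_blocks * block_size, freed_blocks * block_size


# Hypothetical example data: two copies of a report plus a backup holding it.
files = {
    "report.doc": ["a", "b", "c"],
    "report-copy.doc": ["a", "b", "c"],
    "backup.tar": ["a", "b", "c", "d"],
    "scratch.bin": ["x", "y"],
}

for victims in (["report-copy.doc"], ["scratch.bin"]):
    logical, freed = estimate_freed_bytes(files, victims)
    print(victims, "logical:", logical, "freed:", freed)
```

In this made-up data, removing the duplicate report is the hollow fruit: it has a logical size but frees nothing, while the unique scratch file gives back every byte it appears to use.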
Do we have the tools to do that?