Just wondering what other customers are doing.
It's a balancing act between risk, performance and management!
As I recall, it's recommended not to cross the 150 TB boundary. This is not a hard technical limit, but it makes sense when you look at how much time you would need to recover a pool in case of a dual disk failure.
We are moving to pools with an initial size of 100 TB, and we are also trying to divide them by the kind of application that should run on them (random I/O, sequential, backup, etc.). This should lower risk and improve performance, at the cost of higher management overhead.
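As a rough illustration of the recovery-time concern: the sketch below estimates single-drive rebuild time and full-pool restore time. The drive sizes, rebuild rate, and restore rate are illustrative assumptions only, not vendor figures or measured values.

```python
# Rough sketch: how long a drive rebuild or a full pool restore might take.
# All numbers below are illustrative assumptions, not vendor figures.

def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float) -> float:
    """Time to rebuild one failed drive at a sustained rebuild rate."""
    drive_mb = drive_tb * 1_000_000  # TB -> MB (decimal)
    return drive_mb / rebuild_mb_per_s / 3600

# Example: a 4 TB NL-SAS drive rebuilding at ~50 MB/s under production load
print(f"Single-drive rebuild: ~{rebuild_hours(4, 50):.0f} hours")

def pool_restore_hours(pool_tb: float, restore_gb_per_hour: float) -> float:
    """Time to restore a whole pool from backup after a dual-failure data loss."""
    return pool_tb * 1000 / restore_gb_per_hour

# Example: restoring a 100 TB vs 150 TB pool at an assumed 2 TB/hour restore rate
for size_tb in (100, 150):
    print(f"Full restore of a {size_tb} TB pool: ~{pool_restore_hours(size_tb, 2000):.0f} hours")
```

The point is simply that restore time scales linearly with pool size, which is why the "how much data do you want to have to recover?" question drives the boundary more than any technical limit.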
Where did you get the '150TB' boundary recommendation? The best I've ever received is "How much data do you want to have to recover?"
150TB is one of those "educated guesses" based on things like recovery time, manageability and responsiveness to changes in workload profile. That last item is one many people tend to ignore, but all software has some type of upper limit.

In environments with very repeatable workload activity against basically the same data throughout the week, tiering software works best, because that repeatable activity keeps the data on the tier where it belongs all the time. With less repeatable activity, data has to be relocated from one tier to another during the relocation cycle, and sometimes there is a lot of data that needs to move. Tiering software needs to limit the amount of data relocated during a cycle, since it doesn't want that relocation to interfere with server activity (which it considers the highest priority). If the pool is too big, there will not be enough time to relocate all the data to the tier where it belongs before the next cycle.

Fortunately, the relocation limit is not at the subsystem level but at the individual HDT pool level. That means 2 pools of 75TB each could be more responsive than 1 pool of 150TB, assuming one can balance each 75TB pool with activity that doesn't overwhelm that smaller pool. It could also mean the storage devices chosen for each pool should differ, because all environments have "hot" volumes (due to IO activity or internal business reasons) that alter the configuration needed to support the performance of that pool.
Back to the 150TB suggested limit. That seems to work well in repeatable activity environments. In less repeatable ones, that may be too large to provide the responsiveness one desires.
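To make the per-pool relocation point concrete, here's a small sketch comparing one 150TB pool against two 75TB pools. The per-pool relocation budget per cycle and the amount of data needing to change tiers are made-up numbers purely for illustration.

```python
# Sketch: why two smaller HDT pools can settle faster than one big one.
# The relocation budget per cycle and the "data to move" figures are
# illustrative assumptions only.

def cycles_to_settle(data_to_move_tb: float, budget_tb_per_cycle: float) -> float:
    """Relocation cycles needed before all pages sit on the right tier."""
    return data_to_move_tb / budget_tb_per_cycle

budget = 3.0  # assumed TB a single pool can relocate per cycle without hurting host I/O

# One 150 TB pool where 10 TB needs to change tiers after a workload shift
print(cycles_to_settle(10, budget))          # ~3.3 cycles

# Two 75 TB pools, each with ~5 TB to move, relocating in parallel
print(max(cycles_to_settle(5, budget),
          cycles_to_settle(5, budget)))      # ~1.7 cycles
```

Because the budget applies per pool, splitting the workload effectively doubles the relocation capacity per cycle, provided each smaller pool gets a balanced share of the churn.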
We build our pools to support the high-end IOPS on SSD/Flash, the high-end throughput on 10K SAS (the majority of the "repeatable workload"), and about 80% of our cold structured data environment on NL-SAS (although we go conservative and make T3 about 60% of our capacity). This works well for day-to-day operational management, scalability, and keeping our costs down. We have pools that are much larger than 150TB. The RAID group configs I like are as follows: 7+1 or 3+1 for SSD/Flash, 14+2 for 10K SAS, and 4+4 for NL-SAS.
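For what it's worth, here is a quick sketch of how RAID group layouts like those translate into a tier capacity mix. The drive sizes and number of groups per tier are assumptions chosen for illustration, not an actual configuration.

```python
# Sketch: usable capacity from the RAID group layouts mentioned above.
# Drive sizes and the number of groups per tier are illustrative assumptions.

def usable_tb(data_drives: int, drive_tb: float, groups: int) -> float:
    """Usable capacity of 'groups' RAID groups, ignoring spares and formatting overhead."""
    return data_drives * drive_tb * groups

tiers = {
    "T1 SSD/Flash 7+1": usable_tb(7, 0.8, 4),    # e.g. 4 groups of 800 GB SSDs
    "T2 10K SAS 14+2":  usable_tb(14, 1.2, 10),  # e.g. 10 groups of 1.2 TB 10K SAS
    "T3 NL-SAS 4+4":    usable_tb(4, 4.0, 20),   # e.g. 20 mirrored groups of 4 TB NL-SAS
}

total = sum(tiers.values())
for name, tb in tiers.items():
    print(f"{name}: {tb:.0f} TB ({tb / total:.0%} of pool)")
```

With those example numbers, T3 comes out around 60% of the pool's usable capacity, which is roughly the conservative split described above.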