Gravity, Residency and Latency: Balancing the Three Dimensions of Big Data

The concept of big data has been around for over a decade now, and today, data sets are larger than ever.

The explosion in all types of data production continues. From IoT sensors to robots to cloud-based microservices, a variety of telemetry-producing devices churn out massive amounts of data – all of it potentially valuable.

On the consumption side, AI has given us good reason to generate, collect and process vast amounts of data – the more, the better. From autonomous vehicles to preventing application outages, there is hardly an AI use case that cannot be improved with more data.

Where to put all this data remains a matter of concern. The explosion of data threatens to outpace any reductions in storage costs we can squeeze out. Data gravity continues to weigh us down. And no matter how much data we have, we want whatever insights we can draw from them immediately.

The challenge of handling all this data is actually far more complex than most people realize. Many considerations influence the decisions we make about collecting, processing, storing and extracting value from increasingly large, dynamic data sets.

Here are some basics.

Three Dimensions of Big Data Management

The first dimension we must deal with is data gravity. Data gravity refers to the relative cost and time of moving large data sets, compared with the cost and time of moving compute capabilities closer to the data.

If we are moving data into or out of the cloud, there are typically ingress and egress charges that we must take into account. It’s also important to consider the storage cost for those data, given how hot (how quickly accessible) the data need to be.

Bandwidth or network costs can also be a factor, especially when the best option is to move multiple data sets in parallel.

Every bit as important as cost is the consideration of time. How long will it take to move these data from here to there? If we are moving data through a narrow pipe, such time constraints can be prohibitive.
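
To make that concrete, here’s a minimal back-of-the-envelope sketch in Python that estimates the transfer time and egress cost of pulling a data set out of a cloud. The bandwidth figure and the per-gigabyte rate are illustrative assumptions, not any particular provider’s pricing.

    # Back-of-the-envelope data gravity estimate.
    # The bandwidth and per-gigabyte egress rate below are illustrative assumptions.

    def transfer_time_hours(data_tb: float, bandwidth_gbps: float) -> float:
        """Hours needed to move data_tb terabytes over a sustained bandwidth_gbps link."""
        bits = data_tb * 8e12                        # terabytes -> bits
        return bits / (bandwidth_gbps * 1e9) / 3600

    def egress_cost_usd(data_tb: float, rate_per_gb: float = 0.09) -> float:
        """Egress cost at an assumed flat per-gigabyte rate."""
        return data_tb * 1000 * rate_per_gb

    data_tb = 500        # size of the data set we want to move
    bandwidth_gbps = 1   # effective, sustained throughput of the pipe
    print(f"Transfer time: {transfer_time_hours(data_tb, bandwidth_gbps):,.0f} hours")
    print(f"Egress cost:   ${egress_cost_usd(data_tb):,.0f}")

At those assumed numbers, a 500-terabyte data set ties up a 1 Gbps pipe for roughly 46 days and runs up a five-figure egress bill – exactly the sort of arithmetic that argues for moving compute to the data instead.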

The second dimension is data residency. Data residency refers to the regulatory constraints that limit the physical locations where our data can reside.

Some jurisdictions require data sovereignty – for example, to hold data on EU citizens within Europe. Other rules restrict the movement of certain data across borders.

In some cases, data residency limits apply to entire data sets, but more often than not, they apply to specific fields within those data sets. A large file containing Personally Identifiable Information (PII) will carry many regulatory constraints on its movement and use, whereas an anonymized version of the same file may not.
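
As a minimal illustration of that field-level distinction, the Python sketch below pseudonymizes a few assumed PII fields in a record before it crosses a residency boundary, while leaving non-sensitive fields untouched. The field names and the salted-hash scheme are purely illustrative, not a compliance recipe.

    import hashlib

    # Illustrative field-level pseudonymization before a record leaves a restricted region.
    # The field names and the salted-hash scheme are examples, not a compliance recipe.

    PII_FIELDS = {"name", "email", "license_plate"}
    SALT = "replace-with-a-per-deployment-secret"   # assumption: kept out of source control

    def pseudonymize(record: dict) -> dict:
        """Return a copy of the record with PII fields replaced by truncated salted hashes."""
        cleaned = {}
        for key, value in record.items():
            if key in PII_FIELDS and value is not None:
                digest = hashlib.sha256((SALT + str(value)).encode()).hexdigest()
                cleaned[key] = digest[:16]
            else:
                cleaned[key] = value
        return cleaned

    record = {"name": "Ada Lovelace", "email": "ada@example.com", "purchase_total": 42.50}
    print(pseudonymize(record))

Whether a salted hash counts as anonymization or merely pseudonymization under a given regulation is itself a legal question, which is why residency decisions need compliance input as well as engineering.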

The third dimension we should consider is data latency. How quickly can we move a given quantity of data, given the network constraints that apply in a particular situation? Will those limits affect any real-time behavior that business stakeholders require from the data?

The reason data latency is a problem in the first place is the finite speed of light. No matter how good your technology, it is physically impossible for any message to exceed this cosmic speed limit.

Once your network is optimized as much as possible, the only way to reduce latency is to move the endpoints closer together.

Given that the greatest latency any message experiences on Earth is about a quarter second (representing the round-trip time for a geosynchronous satellite), most business applications don’t care much about latency.

In some situations, however, latency is critically important. Real-time multiplayer gaming, real-time stock trading, and other real-time applications such as telesurgery all seek to reduce latency below the quarter-second limit.

Low-latency applications may still be the exception in the enterprise, but situations where latency is an important consideration are exploding. An autonomous vehicle traveling 60 mph covers 22 feet in a quarter second – better not to wait for directions from the cloud, or a pedestrian might be toast!
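
The arithmetic behind those two numbers is easy to check. This quick Python sketch works out the round-trip light time to a geostationary satellite and the distance a 60-mph vehicle covers in a quarter second; the altitude and speed-of-light constants are the standard published values.

    # Quick latency arithmetic behind the figures above.

    SPEED_OF_LIGHT_KM_S = 299_792     # km per second
    GEO_ALTITUDE_KM = 35_786          # altitude of a geostationary orbit

    round_trip_s = 2 * GEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S   # up to the satellite and back
    print(f"Geostationary round trip: {round_trip_s:.2f} s")   # ~0.24 s

    mph = 60
    feet_per_second = mph * 5280 / 3600                        # 88 ft/s
    wait_s = 0.25
    print(f"Distance covered at {mph} mph in {wait_s} s: {feet_per_second * wait_s:.0f} ft")  # 22 ft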

Calculating the Effect of the Three Dimensions

Most organizations struggling with big data have tackled one or more of these dimensions – but typically, they have done so separately. It is important, however, to keep all three dimensions in mind when planning any big data strategy.

Regulatory barriers provide a guardrail, but depending on the specific compliance restrictions, organizations can deal with data residency in a variety of ways. Such calculations should always take data gravity into account as well.

For example, if a strategy requires moving large data sets in order to comply with data residency regulations, it is important to weigh the cost and time constraints of those transfers when deciding whether that particular strategy makes sense.

However, the real wild card in these calculations is the effect of edge computing. Today, organizations already balance data gravity and latency considerations when choosing a content delivery network (CDN). CDNs operate servers at the cloud edge – locations within clouds that are geographically close to end users.

Edge computing extends this idea to include both the near edge and the far edge. The near edge includes telco points of presence, factory data centers, and even retail phone closets – anywhere an organization can locate server equipment to better serve edge-based resources.

While the near edge typically assumes stable power and plenty of processing capacity, we can’t make those assumptions on the far edge. The far edge includes smartphones and other smart devices, as well as IoT sensors and actuators and any other technology endpoints that can interact (even intermittently) with the near edge.

The edge is such a wild card because disruptive innovation there is moving at a rapid pace, so planning ahead involves a degree of strategic guesswork. That said, there are already many examples where edge computing influences all three dimensions of big data management.

Take video surveillance, for example. AI has improved to the point where it can detect suspicious behavior in video feeds in real time. Moving video files from the cameras over a network to the near edge, and from there to the cloud, however, poses serious data gravity challenges.

As a result, most AI inference for video surveillance takes place either on the near edge (e.g., a server closet near the cameras) or on the cameras themselves.

Moving that inference onto or near the devices, in turn, reduces latency – making it possible to flag suspicious behavior before the thieves have made off with the jewelry.

Localizing such inference can also help with regulatory compliance, since video feeds may contain sensitive information such as license plates or even people’s faces.

Any data management strategy for video surveillance must therefore balance gravity, latency, and residency priorities within the same deployment.

Whenever you find yourself faced with multi-dimensional big data management challenges, ask yourself these three questions:

  • What is the business value of the data?
  • What are the costs associated with the data?
  • How real-time does the data need to be?

The answers to those questions, in turn, will involve calculations that balance data gravity, data residency and data latency priorities.

There will never be one right answer, as business considerations vary dramatically from situation to situation. Regardless of the situation, however, it is important to run the numbers across all three dimensions to work out your optimal big data management strategy.
