Data was a hot topic at the "Building the Intelligent Enterprise" panel session at the recent MIT Sloan CIO Symposium in Cambridge, Mass. As panelists discussed, changing market trends, increased digitization and the tremendous growth in data usage are demanding a paradigm shift from traditional, centralized enterprise models to decentralized, edge computing models.
All the data required for an intelligent enterprise has to be collected and processed somehow and somewhere, sometimes in real time, which presents a challenge for companies.
Here, four IT practitioners break down best practices and architectures for managing large data sets and explain how they're taking advantage of edge computing. They were responding to a question posed by moderator Ryan Mallory, senior vice president of global solutions enablement at data center provider Equinix.
Here is Mallory's question to the panel: Having an intelligent enterprise means dealing with a lot of data. Can you provide some best practices for managing large data sets?
Alston Ghafourifar, CEO and co-founder of AI communication company Entefy Inc.
"I think it really depends on the use cases. We live in a multimodal world. Almost everything we do deals with multiple modalities of information. There are lots of different types of information, all being streamed to the same central areas. You can think about it almost like data lake intelligence.
"The hardest part of something like this is actually getting yourself ready for the fact that you actually don't know what information you're going to need in order to predict what you want to predict. In some cases, you don't even necessarily know what you want to predict. You just know you want it to be cheaper, faster, safer -- serve some cost function at the very end.
"So, what we tend to do is design the infrastructure to pool as much diverse information as possible to a centralized core and then understand when it finds something that predicts something else -- and there's a lot of techniques upon which to do that.
"But when the system is looking through this massively unstructured information, the moment it gets to something where it says, 'Oh, I think this is reliable, since I'm getting this over and over again,' it'll take that and automatically pull it out and put it into production at the edge, because the edge is processing the application of information. [The edge] is processing enterprise information in transit, almost like a bus. It doesn't have the benefit of you cleaning it properly, or of you knowing exactly what you're looking for.
Alston Ghafourifar, CEO and co-founder, Entefy Inc.
"Making that transaction and that transition automatic and intelligent is what takes an enterprise further. [An enterprise] could have petabytes of information, but could be bottlenecked in their learning by the 50 or 100 data scientists looking at it. Now, it could say, 'I'm going to create the computing power of 5,000 data scientists to [do] that job for me,' and just automatically push it out. It's almost like a different type of cloud orchestration."
Global head of analytics, reporting, integration and software engineering at oil and natural gas exploration company Devon Energy
"Let me build on that and say the one thing that we're starting to do is use more of what the industry calls a Lambda architecture, where we're both streaming and storing it. It's having something that's pulling data out of your stream to store it in that long-term data store.
"What we're doing in areas like northwest Texas or the panhandle of Oklahoma, where you have extremely limited communication capability, is we're caching that data locally and streaming the events that you're detecting back over the network. So, you're only streaming a very small subset of the data back, caching the data locally and physically moving that data to locations, up to the cloud and doing that big processing, and then sending the small processing models back to the edge.
"One of the things I think you have to do, though, is understand that -- to [Ghafourifar's] point -- you don't know what you don't know yet. And you don't even know what questions you're going to get yet, and you don't know what business problems you're going to have to solve yet. The more you can do to capture all of the data -- so then when you do your data science work, you have it all -- the better. But differentiate what you need for processing versus what you need for storage and for data science work. Those are two different workloads."
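The pattern described above (cache everything locally, stream only detected events over the limited link, process the full data centrally, then send a small model back to the edge) can be sketched roughly as follows. This is an illustrative sketch, not Devon Energy's implementation: `Reading`, `EdgeNode`, `train_threshold_model`, the sensor name and the threshold values are all hypothetical, and the "model" is assumed to be a simple statistical threshold.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Reading:
    sensor_id: str
    value: float


@dataclass
class EdgeNode:
    # Small "model" deployed at the edge: decides whether a reading is an event.
    is_event: Callable[[Reading], bool]
    # Storage path: every reading is kept locally for later data science work.
    local_cache: List[Reading] = field(default_factory=list)
    # Processing path: only detected events go over the constrained network.
    uplink: List[Reading] = field(default_factory=list)

    def ingest(self, reading: Reading) -> None:
        self.local_cache.append(reading)
        if self.is_event(reading):
            self.uplink.append(reading)

    def export_cache(self) -> List[Reading]:
        # Stands in for physically moving the cached data to the cloud.
        batch, self.local_cache = self.local_cache, []
        return batch


def train_threshold_model(batch: List[Reading], k: float = 3.0) -> Callable[[Reading], bool]:
    # Assumed "big processing" step: derive a threshold from the full history.
    values = [r.value for r in batch]
    limit = mean(values) + k * pstdev(values)
    return lambda reading: reading.value > limit


# Hypothetical usage: a hand-picked threshold until a trained model arrives.
node = EdgeNode(is_event=lambda r: r.value > 30.0)
for v in [10.0, 10.5, 9.8, 42.0, 10.1]:
    node.ingest(Reading("well-7", v))

events_sent = len(node.uplink)             # only the spike crosses the network
batch = node.export_cache()                # full history goes to central storage
node.is_event = train_threshold_model(batch)  # small model pushed back to the edge
```

Keeping `ingest` split into a cache write and an event check mirrors the two workloads the speaker distinguishes: storage for data science work versus processing at the edge.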
Vice president of information technology at engineering and construction firm CDM Smith
"I can use construction as a primary example. We have large streams of data that we want to analyze as part of the construction process, because we want to know what's happening in real time. We might have remotely operated vehicles driving or flying around doing LIDAR or radar activity, monitoring things from a visualization standpoint, etc., and that data maybe is getting streamed somewhere and kept. But, in real time, we just want to know what's changing and when it's changing -- like weather patterns and other things going on. We want to analyze all that in real time.
"Now, at the end of the project, that's when the data scientists might say, 'We want to improve our construction process. So, what can we do with that data to help us determine what will make our next construction projects be more successful, take less time and be more cost-effective?'"
Senior vice president of product marketing at business intelligence software provider MicroStrategy
"In terms of [managing large data sets], we try and push down as much of the processing into the Hadoop data structure -- into the database -- as possible. So, we're always pulling as small an amount of data back as possible, rather than push as much data as possible to the edge, which ties into some of the points we've already made.
"I think you should always try to optimize and reduce the amount of information that comes back. For us, we're doing that because we want the response to come back faster."
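The idea of pushing processing down into the data store can be illustrated with a small sketch. It assumes an in-memory SQLite database as a stand-in for the Hadoop or database layer mentioned above; the `sales` table and its columns are hypothetical. Both approaches compute the same totals, but the pushed-down query returns one row per region instead of every raw row.

```python
import sqlite3

# In-memory SQLite stands in for the real data store (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0), ("west", 25.0), ("west", 10.0)],
)

# Without pushdown: pull every row back and aggregate on the client.
all_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in all_rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# With pushdown: the database aggregates and returns only the summary rows.
pushed_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
pushed_totals = dict(pushed_rows)

print(len(all_rows), "rows transferred without pushdown")
print(len(pushed_rows), "rows transferred with pushdown")
```

On a five-row table the difference is trivial; the gap between rows stored and rows returned widens as the table grows, which is the response-time benefit the speaker points to.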