The rising adoption of artificial intelligence (AI), and more recently generative AI (GenAI), has cast a spotlight on the demands that AI workloads place on technology infrastructure.
AI workloads are not only more compute intensive; they also place huge demands on data, storage and network infrastructure when training AI models and performing inferencing tasks, as well as when securing and governing the use of data to comply with data residency and data sovereignty requirements.
At Singapore’s DBS Bank, traditional AI and machine learning workloads are powered by its private cloud, according to its chief data and transformation officer, Nimish Panchmatia, who says the bank has built a technology stack that runs those workloads more efficiently than the hyperscalers’.
But the bank’s GenAI use cases – including DBS-GPT, its employee-facing version of ChatGPT that helps with content generation and writing – run in the public cloud due to scalability requirements that are difficult to address in a private cloud environment.
“Our intent is for GenAI use cases to run on public cloud,” says Panchmatia. “Of course, there are security and risk discussions that need to be had before we go into production. But those are ongoing discussions with cloud providers.”
Peter Marrs, president for Asia-Pacific, Japan and Greater China at Dell Technologies, says organisations that want to harness the full potential of AI will need an AI infrastructure strategy that is integrated across three deployment models – from edge to core datacentre to cloud – rather than as a series of point deployments in different locations.
“There is a proliferation of data capture points as organisations glean data from edge devices, their own products and services, employees, supply chain partners and customers,” says Marrs. “Data needs to stream freely to where it naturally settles in a storage environment. Once it has been leveraged for insights, data needs to be joined by compute for further analysis.”
Daniel Ong, director of solution architecture for Asia-Pacific at Digital Realty, says a well-designed AI infrastructure should combine scalability, openness and purpose-built hardware.
“Rigid, vendor-locked architectures hinder agility and limit the ability to exploit the most cost-effective or performant resources for specific tasks. By prioritising flexibility and choice, organisations are empowered to dynamically navigate changes in the AI landscape with agility and efficiency,” he adds.
Key infrastructure capabilities
The key capabilities of AI infrastructure span compute, storage, networking and software, as well as data governance, security and privacy, and cost management. Here’s a look at what each of those entails.
Compute: Central to an organisation’s AI capabilities is compute power, which needs to be scalable and flexible to train complex models. This often involves the use of graphics processing units (GPUs) for parallel processing tasks and tensor processing units (TPUs) for TensorFlow applications, says Ian Smith, cloud-first lead at Accenture Australia and New Zealand.
Erik Bergenholtz, vice-president for strategy and operations at Oracle, notes that GPUs can process multiple computations simultaneously, and can be clustered to scale massively – which are two key requirements for AI. “Today’s GPUs have been developed and optimised specifically to deliver key capabilities such as parallel and high-speed processing, deep learning, model training, natural language processing, computer vision and inference computation,” he adds.
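As a minimal illustration of the parallel processing Bergenholtz describes, the PyTorch sketch below runs a batched matrix multiplication (the dense linear algebra at the heart of model training and inference) on a GPU when one is available; the tensor sizes are arbitrary placeholders.

```python
import torch

# Use a GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A batched matrix multiplication -- the kind of dense linear algebra
# that dominates model training and inference.
a = torch.randn(32, 1024, 1024, device=device)
b = torch.randn(32, 1024, 1024, device=device)

# On a GPU, the 32 batch entries are processed in parallel across
# thousands of cores; on a CPU, the same call runs far slower.
c = torch.bmm(a, b)
print(c.shape, c.device)
```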
Storage: With AI applications generating and consuming vast amounts of data, Smith notes that high-performance storage solutions that can handle the velocity, volume and variety of AI data are essential. This includes options for data lakes, object storage, and high-speed databases capable of supporting real-time analytics, with the addition of vector and graph databases to support GenAI workflows.
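To make the vector database point concrete, the sketch below shows the core operation such stores perform: nearest-neighbour search over embeddings using cosine similarity. Production systems add indexing, persistence and filtering; the random vectors here are stand-ins for embeddings a real model would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for document embeddings a real pipeline would generate
# with an embedding model: 1,000 vectors of dimension 384.
docs = rng.standard_normal((1000, 384)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit length

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query."""
    q = query / np.linalg.norm(query)
    scores = docs @ q  # cosine similarity via dot product
    return np.argsort(scores)[::-1][:k]

print(top_k(rng.standard_normal(384).astype(np.float32)))
```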
Networking: AI workloads also require robust networking to support the transfer of large datasets and to connect compute resources efficiently. “Low latency and high bandwidth are crucial to mitigate network bottlenecks, and as a result, we are seeing organisations move data closer to where models reside or having integration endpoints for quick inference times,” says Smith.
Software: AI infrastructure requires several software elements to orchestrate AI workloads. These include machine learning frameworks that offer pre-built models, simplifying AI project implementation and accelerating AI development. AI infrastructure software can also deliver capabilities such as AI workload monitoring and management, as well as optimisation and deployment, says Oracle’s Bergenholtz.
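As one example of the pre-built models Bergenholtz refers to, the Hugging Face transformers library lets a team load a published model in a few lines instead of training from scratch; the task and output shown here are illustrative.

```python
# pip install transformers torch
from transformers import pipeline

# A pre-built sentiment model pulled from a model hub -- no training
# code or labelled data required to get started.
classifier = pipeline("sentiment-analysis")

print(classifier("The new AI infrastructure rollout went smoothly."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```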
Data governance, security and privacy: Smith calls for organisations to establish comprehensive data governance frameworks to manage data access, quality and compliance. Security and privacy are paramount, necessitating encryption, secure access mechanisms, and adherence to regulatory requirements. Creating a “trusted” enclave for developing AI capabilities in a secure manner is recommended.
Cost management and optimisation: It is crucial to evaluate the trade-offs between investing in expensive, tailored GPU infrastructure and using platform-as-a-service (PaaS) offerings, which may be cost-effective for small-scale pilots or proofs of concept but can become prohibitively expensive at enterprise scale. Implementing FinOps (financial operations) capabilities to monitor and optimise cloud spending is essential, says Smith.
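A back-of-the-envelope sketch of the trade-off Smith describes might look like the following; every price and throughput figure is a hypothetical placeholder, not a quote from any provider.

```python
# Hypothetical break-even between a pay-per-call PaaS endpoint and a
# dedicated GPU instance. All figures are illustrative placeholders.
PAAS_COST_PER_1K_CALLS = 0.50  # $ per 1,000 inference calls (assumed)
GPU_COST_PER_HOUR = 4.00       # $ per hour, dedicated instance (assumed)
HOURS_PER_MONTH = 730          # always-on billing

def monthly_costs(calls_per_month: int) -> tuple[float, float]:
    """Return (PaaS cost, dedicated cost) in dollars for one month."""
    paas = calls_per_month / 1_000 * PAAS_COST_PER_1K_CALLS
    # A dedicated instance is billed around the clock, fully used or not.
    dedicated = HOURS_PER_MONTH * GPU_COST_PER_HOUR
    return paas, dedicated

for volume in (100_000, 1_000_000, 10_000_000):
    paas, dedicated = monthly_costs(volume)
    cheaper = "PaaS" if paas < dedicated else "dedicated"
    print(f"{volume:>10,} calls/month: PaaS ${paas:,.0f} "
          f"vs dedicated ${dedicated:,.0f} -> {cheaper}")
```

Under these assumed numbers, pay-per-call pricing wins at pilot volumes, while the always-on dedicated instance becomes cheaper once sustained volume passes the break-even point.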
Workload placement
With an increasingly distributed IT environment, AI workloads are likely to span multicloud and hybrid cloud environments. Digital Realty’s Ong notes that the physical location of AI infrastructure is influenced by the following factors; a simple rule-of-thumb sketch in code follows the list.
Data privacy and security: On-premise deployments are preferred for maximum control and security over data that is highly confidential or subject to strict regulations. Conversely, less sensitive data can be colocated or deployed in the cloud.
Latency: Real-time applications such as autonomous vehicles benefit from on-premise or edge deployments to minimise data transfer delays.
Computational requirements: Computationally intensive tasks like deep learning training benefit from dedicated on-premise hardware with high-performance computing capabilities. Less demanding tasks can be handled efficiently in the cloud, offering a cost-effective alternative.
Cost optimisation: On-premise infrastructure requires upfront capital expenditure for hardware, software and ongoing maintenance. Cloud deployments offer a pay-as-you-go model, potentially reducing initial costs. However, long-term cloud usage might exceed on-premise costs for certain workloads with sustained resource requirements.
Scalability and adaptability: Cloud deployments inherently offer on-demand scalability for resources. On-premise infrastructure might necessitate manual hardware upgrades, potentially leading to delays in scaling resources up or down.
Expertise and management: Managing on-premise infrastructure demands dedicated IT staff with expertise in AI hardware and software. Cloud deployments often involve less in-house management overhead, as the provider handles infrastructure maintenance and scaling.
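The toy sketch referenced above encodes these factors as a simple rule of thumb; the rules and thresholds are invented for illustration, and real placement decisions weigh many more variables.

```python
def suggest_placement(sensitive: bool, max_latency_ms: float,
                      training: bool, sustained_load: bool) -> str:
    """Toy rule of thumb mapping the factors above to a deployment
    target. Rules and thresholds are illustrative only."""
    if sensitive:
        return "on-premise"   # maximum control over regulated data
    if max_latency_ms < 10:
        return "edge"         # real-time use cases, e.g. vehicles
    if training and sustained_load:
        return "on-premise"   # sustained heavy compute favours capex
    return "cloud"            # elastic, pay-as-you-go default

# A latency-critical, non-sensitive workload lands at the edge.
print(suggest_placement(sensitive=False, max_latency_ms=5.0,
                        training=False, sustained_load=False))
```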
Dell’s Marrs notes that AI environments are typically a mix of centralised and decentralised infrastructure. While organisations may rely on on-premise IT environments for AI projects, the cloud has been a major resource for data-intensive AI workloads.
“We are seeing the adoption of cloud-based services alongside on-premise infrastructure to support AI initiatives effectively. This hybrid approach allows organisations to leverage the benefits of both centralised and decentralised infrastructure based on their specific needs and requirements,” he adds.
Data movement
AI presents new challenges for data, which is needed to power and improve AI models. Oracle’s Bergenholtz notes that having large volumes of siloed data makes building an effective AI solution difficult.
“The data is ever increasing, from different sources and formats, on-premise and in the cloud, with varying quality levels, and handled by a wide range of tools and platforms,” he adds.
Given the increasingly distributed nature of AI, Marrs says it’s more efficient to move models and compute closer to where the data is being sourced or generated rather than the other way around.
In this federated approach, model training takes place at the edge where the data is, alleviating the need to move data to a centralised data lake. AI inferencing also occurs at the edge, allowing for more flexibility in connected and disconnected states.
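A minimal sketch of this federated pattern, using a toy linear model: each site takes gradient steps on data that never leaves it, and only the model parameters travel back to be averaged, in the style of federated averaging. This is illustrative, not a production federated learning stack.

```python
import numpy as np

rng = np.random.default_rng(42)

def local_step(weights, X, y, lr=0.1):
    """One gradient step of linear regression on a site's local data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Three edge sites, each holding private data that never leaves it.
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    X = rng.standard_normal((200, 2))
    y = X @ true_w + 0.1 * rng.standard_normal(200)
    sites.append((X, y))

weights = np.zeros(2)
for _ in range(50):
    # Each site refines the shared model on its own data...
    local = [local_step(weights, X, y) for X, y in sites]
    # ...and only the parameters are sent back and averaged.
    weights = np.mean(local, axis=0)

print(weights)  # approaches [2.0, -1.0]
```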
However, the paradigm of bringing models to data is evolving with the advent of GenAI, necessitating the movement of data to where large language models (LLMs) are hosted, according to Accenture’s Smith.
“This transition demands a meticulous approach to infrastructure, emphasising low latency, high bandwidth and specialised hardware to accommodate the data-intensive nature of AI workloads,” he says.
Marrs notes that some of the challenges involved in moving large amounts of data are cost, disaster recovery and latency.
“Moving large amounts of data can be costly, especially when you are moving data across locations, such as from on-premise to cloud, and vice versa. Disaster recovery is also a concern – organisations need to ensure that the various versions of data that they are moving across locations are consistent, to avoid using outdated data for analysis.
“Low data latency is also key in use cases where the outcome of your application needs to be real time, such as stock trading or smart vehicles. In these cases, managing the movement of data can be tricky,” he adds.
Sustainability considerations
The computational demands of AI are posing a challenge to sustainability efforts. Datacentres, constantly powering and cooling AI hardware, require a rethink beyond mere technology upgrades. Against this backdrop, Digital Realty’s Ong calls for a fundamental shift in digital architecture.
“Modular datacentres, at the forefront of AI infrastructure, offer a solution by promoting scalability, efficiency and adaptability. As AI applications diversify and increase in complexity, modularity enables efficient infrastructure scaling, eliminating the need for costly and time-consuming overhauls.
“This ensures datacentres can readily adapt to escalating AI needs while maintaining peak performance without breaking the bank or experiencing extended downtime,” he adds.
The benefits of modularity extend beyond scaling. With rapid time-to-market being crucial for AI advancements, Ong says modular designs – with pre-manufactured components – expedite deployment, facilitating AI technology adoption and bolstering an organisation’s competitiveness.
Besides opting for AI solutions that prioritise energy efficiency, organisations can also leverage specialised processors and compact model architectures to minimise carbon emissions, says Smith.
“Additionally, integrating GenAI with technologies that promote decarbonisation can further align AI deployment with sustainability goals. Choosing infrastructure providers and AI models that consider the lifecycle emissions impact – from training to daily usage – enables organisations to adhere to their environmental responsibilities while embracing AI innovation,” he adds.
Anthropic, an AI startup, for example, prioritises partnerships with cloud providers that emphasise the use of renewable energy and achieving carbon neutrality. By actively offsetting its emissions, including those from cloud computing, Anthropic has demonstrated a commitment to reducing its environmental impact.
“Their analysis of their carbon footprint and investment in verified carbon credits to offset emissions exemplify proactive measures towards sustainability,” says Smith. “This example highlights that prioritising sustainability in AI infrastructure selection is crucial for minimising environmental impact and promoting responsible technological development.”
Optimising AI infrastructure with GenAI
GenAI can play a pivotal role in enhancing AI infrastructure in areas like resource provisioning, data quality assurance and auto-scaling. By leveraging LLMs, organisations can dynamically adapt cloud infrastructure to the evolving computational demands of AI models, optimising resource utilisation and reducing operational costs.
“LLMs’ capability to generate and pre-process data, combined with their adaptability to current coding practices and diverse coding styles, significantly improves data quality and model accuracy,” says Smith. “Additionally, GenAI’s auto-scaling features ensure efficient workload management, adjusting resources in real time to handle fluctuations and maintaining performance.”
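The control loop behind such auto-scaling can be conveyed with a simple threshold-based sketch; in the scenario Smith describes, an LLM might propose or tune the policy, but the loop itself looks something like this hypothetical example.

```python
def scale_decision(gpu_util: float, queue_depth: int, replicas: int,
                   min_r: int = 1, max_r: int = 16) -> int:
    """Return the new replica count for an inference service.
    Thresholds are illustrative; real policies are tuned per workload."""
    if (gpu_util > 0.80 or queue_depth > 100) and replicas < max_r:
        return replicas + 1  # scale out under pressure
    if gpu_util < 0.30 and queue_depth == 0 and replicas > min_r:
        return replicas - 1  # scale in when idle to cut cost
    return replicas          # hold steady otherwise

# A saturated service with a deep request queue gains a replica.
print(scale_decision(gpu_util=0.92, queue_depth=140, replicas=4))  # -> 5
```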
Beyond technical optimisations, GenAI also democratises coding, speeds up development through automated code suggestions, and enhances software quality via intelligent debugging and test case generation.
Smith says: “This comprehensive approach not only streamlines development and maintenance processes, but also fosters innovation and sustainability in AI ecosystems, marking a new era of efficiency and adaptability in AI infrastructure management.”