The following piece is part of the U.S.-India AI Fellowship Program’s short-form series.
By: Honson Tran
In today's world, the ability to meet the demand for implementing Artificial Intelligence (AI) in real solutions lies in closing the implementation gap: the shortfall in resources required to move from research to real-world application. These resources include access to compute, but also the technical skills and publicly available tooling needed to put cutting-edge research to practical use. Although NVIDIA Graphics Processing Units (GPUs) and the Compute Unified Device Architecture (CUDA) have helped lower the barrier to entry for computing, high infrastructure costs and limited access to hardware remain obstacles for organizations and AI practitioners who need more compute.
Even with these resource challenges addressed, one must consider how those resources are used. Beyond the implementation gap, engineers must also confront the inefficiencies of the “data-to-insight” pipeline. This pipeline is a multifaceted problem spanning, among others, data science teams and deployment engineers, along with the overhead of stitching their efforts together across every team and project iteration. Addressing it through a better balance between edge and cloud computing, streamlined tools and processes, and a stronger focus on model maintenance will be fundamental to expanding AI to the world.
Stop reinventing the wheel; attach an engine to it.
Image source: Well-Architected machine learning lifecycle | AWS
Experts often discuss AI through the lens of accuracy, speed, and architecture. However, the community must go beyond these metrics and evaluate AI end to end, from the data to the application where a model ultimately resides. In AI deployment, speed of delivery, repeatability, and maintenance are huge factors in scaling, not just how well one model performs relative to another.
The machine learning (ML) lifecycle often involves specialized teams at every step. Data scientists are concerned with creating the most accurate models. ML engineers are concerned with accelerating and securing those models. Hardware engineers focus on fitting models onto various hardware, and software engineers are concerned with packaging the work of many so it can be easily leveraged by the end user. Each of these groups prioritizes a primary objective with tradeoffs in other areas, leading to a snowball effect of difficulty as solutions are developed.
Regardless of the amount of resources available, AI experts and engineers must identify, automate, and accelerate the ‘common denominator’ of ML pipelines, introducing complexity only as needed. To lay this groundwork, infrastructure must be addressed first.
Infrastructure demand and the edge continuum
AI infrastructure is ‘table stakes’ for anyone involved in the AI space. Training and deployment become challenging without set infrastructure and defined pipelines in place. A Google search reveals that discussions on AI infrastructure often focus on cloud-based solutions. However, what about those who don't have access to these higher-capacity forms of computing?
As compute and energy demand continue to scale exponentially, considering alternatives to a cloud-only approach is imperative. While some workloads, such as training, benefit significantly from large pools of compute, not every application requires virtually infinite capacity; many benefit from hardware at the edge. Inference, for example, often gains from edge computing: processing data locally can reduce carbon footprint and network bandwidth demand, improve data security, and lower end-to-end application latency for quicker decision-making. By using both cloud and edge systems, AI workloads can be distributed across many systems, reducing dependence on cloud access. This idea of a hybrid computing stack, also known as the edge continuum, is not new to the industry. However, the challenges edge computing poses are a critical area of discussion when considering infrastructure. If we recognize the benefits of edge AI, what is preventing wider adoption? The development conveniences present in the cloud are not present on the edge, at least not yet.
With a hybrid approach, systems can intelligently orchestrate workloads based on their compute demands. For instance, smaller, less powerful devices can filter and reduce the data they send back to other systems, as the sketch below illustrates. Hybrid computing offers a potential path to meeting infrastructure demand and mitigating resource congestion.
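As a minimal sketch of this idea (not tied to any specific platform), the Python snippet below shows a hypothetical edge node that applies a lightweight local check to each reading and forwards only the "interesting" ones upstream. The device name, threshold, and upload function are illustrative assumptions; a real deployment would replace them with its own inference step and transport layer.

```python
import random
from dataclasses import dataclass


@dataclass
class Reading:
    device_id: str
    value: float  # stand-in for a local inference score


def edge_filter(reading: Reading, threshold: float = 0.8) -> bool:
    """Lightweight local check: only high-value readings leave the device."""
    return reading.value >= threshold


def forward_to_cloud(reading: Reading) -> None:
    """Placeholder for an upload call (HTTP, MQTT, etc.)."""
    print(f"uploading {reading.device_id}: {reading.value:.2f}")


def run_edge_loop(num_samples: int = 20) -> None:
    sent = 0
    for _ in range(num_samples):
        reading = Reading("sensor-01", random.random())
        if edge_filter(reading):
            forward_to_cloud(reading)
            sent += 1
    print(f"forwarded {sent}/{num_samples} readings; the rest stayed on the device")


if __name__ == "__main__":
    run_edge_loop()
```

Even this toy filter captures the core tradeoff: the edge device spends a little local compute to avoid shipping most of its data to the cloud.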
Heterogeneous hardware
Image source: Platform Overview | Latent AI
As mentioned earlier, the edge continuum is not as uniform as the cloud. One of the many benefits of cloud computing is its abstraction layer: the convenient provisioning, deployment, and management tools offered by cloud providers. Moving away from the cloud adds complexity, such as choosing among devices from many hardware vendors, ranging from bulky workstations in remote areas to tiny embedded devices in the field. Each vendor can have its own technology stack, and each platform requires development time to understand how to deploy to it.
Furthermore, each hardware platform may provide several acceleration libraries, requiring more time and effort to identify the best model parameters and deployment settings. Not all models benefit from quantization, and not all models see gains from acceleration libraries (e.g., TensorRT for NVIDIA devices). Metaphorically, a solution's power, size, and performance sit on the same seesaw, and finding the right balance is the challenge. Faster inference speeds might come at the cost of power and memory. Lower memory usage at runtime might reduce inference speed but can also yield significant energy savings. Some solutions prioritize high accuracy without concern for inference speed, whereas others favor faster inference with a smaller footprint.
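One way to ground these tradeoffs is to measure them directly. The sketch below, assuming PyTorch and its built-in dynamic quantization as the toolchain, compares average latency of a small stand-in network before and after int8 quantization of its linear layers. The model and run counts are illustrative; as the text notes, whether quantization helps at all depends on the architecture and the target hardware.

```python
import time

import torch
import torch.nn as nn

# A small stand-in model; a real workload would load its own network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization of the Linear layers to int8 (PyTorch built-in).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


def mean_latency_ms(net: nn.Module, runs: int = 200) -> float:
    """Average single-sample CPU inference latency in milliseconds."""
    x = torch.randn(1, 512)
    with torch.no_grad():
        net(x)  # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            net(x)
    return (time.perf_counter() - start) / runs * 1e3


print(f"fp32 latency: {mean_latency_ms(model):.3f} ms")
print(f"int8 latency: {mean_latency_ms(quantized):.3f} ms")
# Accuracy on a validation set should be compared the same way; gains (or
# regressions) vary heavily by model architecture and target device.
```

The same harness, rerun per device and per acceleration library, is what turns the power-size-performance seesaw from guesswork into numbers.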
For the edge continuum to thrive, hardware platforms should integrate into tools that enable seamless model-to-hardware optimization without requiring deep knowledge of the hardware platform itself. Developers would not have to worry about the complexities of tuning and optimizing the inference pipeline, allowing them to focus more on designing the application and solution itself. Tooling that integrates diverse hardware platforms into a standardized packaging interface is also crucial.
Compute and carbon
With a vast amount of hardware and a near-infinite number of design decisions, developers spend a tedious amount of time retraining and assessing models during the exploration phase. Even with a massive team, the search space of candidate models and solutions that can be covered is limited by the number of team members. In addition, developers still face profiling these models on different hardware platforms, adding another layer of complexity before a solution can be finalized.
In the context of scaling AI, creating the solution wins only half the battle. The goal is for that solution to be maintained and reused in other solutions as well, especially with varying data sets. To address this, infrastructure and tooling should include a highly reproducible and easily configurable experimentation process. Experimentation should feel like turning dials and knobs to tweak settings regardless of the model, rather than changing settings in code that differ across multiple repositories.
A configurable and templated experimentation process can also automate exploration, including hardware-in-the-loop profiling of power usage, accuracy, and the effects of model quantization. Automating this process reduces the time required to explore candidate models. These experiments can also be stored in a database for future comparisons and evaluation against new use cases. A standardized profiling framework further ensures all models are tested identically, enabling apples-to-apples comparison.
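A minimal sketch of the "dials and knobs" idea is shown below: the experiment settings live in a declarative sweep rather than in per-repository code, every combination passes through the same (here, placeholder) run step, and each run emits a record with a uniform schema that can be stored for later comparison. All model names, devices, and fields are hypothetical.

```python
import itertools
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentConfig:
    model_name: str     # which candidate architecture to build
    precision: str      # e.g. "fp32" or "int8"
    batch_size: int
    target_device: str  # e.g. "cpu", "embedded-gpu", "workstation-gpu"


# The "dials and knobs": a declarative sweep instead of code edits per repo.
SWEEP = {
    "model_name": ["resnet18", "mobilenet_v3"],
    "precision": ["fp32", "int8"],
    "batch_size": [1, 8],
    "target_device": ["cpu"],
}


def run_experiment(cfg: ExperimentConfig) -> dict:
    """Placeholder for train/optimize/profile; returns a uniform result record."""
    # In practice this would call the same hardware-in-the-loop profiling
    # harness for every config, so results stay apples-to-apples.
    return {**asdict(cfg), "latency_ms": None, "accuracy": None, "energy_j": None}


results = []
for values in itertools.product(*SWEEP.values()):
    cfg = ExperimentConfig(**dict(zip(SWEEP.keys(), values)))
    results.append(run_experiment(cfg))

# Persisting every record (e.g. in a database) makes later comparisons and
# evaluations against new use cases cheap.
print(json.dumps(results[:2], indent=2))
```

Because every run shares one schema and one harness, adding a new candidate model or target device becomes a one-line change to the sweep rather than a new exploration project.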
AI maintenance
A model left unattended is a model left behind. The era of static legacy code is over; AI models demand a new paradigm for approaching and maintaining software. Tools should be created to monitor the ‘health status’ of deployed models. As mentioned, the biggest hurdles for the edge continuum are the same hurdles cloud providers have already abstracted away. Tools must be developed to continuously monitor deployed model health, including accuracy and performance metrics, outlier detection, and application stability, to ensure optimal performance and longevity, especially on devices with limited compute, system resources, or network connectivity. Solving these challenges enables efficient retraining and policy updates, reducing computational resources and development time while maintaining application efficacy.
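What a lightweight health monitor might look like on a constrained device is sketched below: it tracks a rolling window of one runtime signal (here, prediction confidence) and flags drift against a reference captured at deployment time. The signal choice, window size, and z-score threshold are assumptions for illustration; a production monitor would track several signals and report them upstream rather than printing.

```python
import random
from collections import deque
from statistics import mean


class ModelHealthMonitor:
    """Tracks a rolling window of a runtime signal (e.g. prediction confidence)
    and flags drift relative to a reference captured at deployment time."""

    def __init__(self, ref_mean: float, ref_std: float,
                 window: int = 200, z_limit: float = 3.0):
        self.ref_mean = ref_mean
        self.ref_std = max(ref_std, 1e-9)
        self.scores = deque(maxlen=window)
        self.z_limit = z_limit

    def record(self, score: float) -> None:
        self.scores.append(score)

    def drifted(self) -> bool:
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough evidence yet
        z = abs(mean(self.scores) - self.ref_mean) / self.ref_std
        return z > self.z_limit


# Simulated stream: confidence slowly degrades, as it might after data drift.
monitor = ModelHealthMonitor(ref_mean=0.92, ref_std=0.04)
for step in range(1000):
    score = random.gauss(0.92 - step * 0.0003, 0.04)
    monitor.record(score)
    if monitor.drifted():
        print(f"health alert at step {step}: schedule retraining or investigation")
        break
```

The point is that the check itself is cheap enough to run on the device, so limited connectivity only needs to carry the alert, not the raw data.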
In addition to identifying when to maintain deployments, developing technologies that address how to save resources during maintenance is just as crucial. In limited-connectivity environments, for example, tooling can help compress and send delta updates. Delta updates reduce network costs and keep update sizes small, since only new changes are sent to the device.
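A minimal sketch of the delta idea, assuming PyTorch checkpoints, is shown below: it keeps only the tensors that changed between two versions of a model, compresses both the full and the delta payloads, and compares their sizes. The toy model and the "only the last layer changed" retraining step are assumptions; a real pipeline would add integrity checks, versioning, and a matching apply step on the device.

```python
import io
import zlib

import torch
import torch.nn as nn


def make_model() -> nn.Module:
    return nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))


old_model = make_model()
new_model = make_model()
new_model.load_state_dict(old_model.state_dict())

# Simulate a retraining pass that only touched the final layer.
with torch.no_grad():
    new_model[2].weight.add_(0.01 * torch.randn_like(new_model[2].weight))


def serialized_size(obj) -> int:
    """Size in bytes of the object after serialization and compression."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return len(zlib.compress(buf.getvalue()))


# Delta: keep only the tensors that actually changed between checkpoints.
old_sd, new_sd = old_model.state_dict(), new_model.state_dict()
delta = {k: v for k, v in new_sd.items() if not torch.equal(v, old_sd[k])}

print(f"full update:  {serialized_size(new_sd)} bytes (compressed)")
print(f"delta update: {serialized_size(delta)} bytes (compressed)")
# On the device, the delta can be applied over the existing weights with
# load_state_dict(delta, strict=False).
```

When only a small part of the model changes between maintenance cycles, the payload shrinks accordingly, which is exactly what constrained links at the edge need.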
Conclusion
Lowering the barrier to entry for AI promises an era of unparalleled innovation and progress. Through infrastructural improvements, the United States, India, and the globe can close the implementation gap, cultivating an economy where everyone can contribute to AI.
Harnessing AI's full potential involves strategic resource allocation and effective tooling. By fostering collaboration among academia, industry, and policymakers, societies can harness AI to improve outcomes and tackle humanity's most pressing challenges. In doing so, AI can catalyze technological advancement and be a cornerstone for a more equitable and intelligent future.
Honson Tran is part of the U.S.-India AI Fellowship Program at ORF America. He is currently a Developer Experience Lead at Latent AI.