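The paper argues that unlocking the potential of learning-augmented systems depends less on improving ML/DL algorithms or system features in isolation than on promoting learning-and-system co-design, and it introduces AutoSys as a common framework for that purpose.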

 

1. Motivating Case Studies

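The paper analyzes a Web Search system to draw out lessons that apply to modern systems in general.

- Selection Service : retrieves documents using signals such as keyword matching, semantic similarity, and past popularity

- Ranking Service : predicts which documents are relevant and places them at the top so that users get good search results

- Reranking Service : adds additional content types, such as images, and ranks them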

 

insight

A modern system can be abstracted as a function of controllable system knobs, such as software logic parameters, hardware configuration, actions of an execution sequence, engine selection, and hyper-parameters of ML/DL algorithms and models.
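
As a rough illustration of this abstraction, the toy function below maps a knob configuration to a set of metrics. The knob names (`cache_mb`, `num_threads`, `engine`) and the formulas are invented for this sketch and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Knobs:
    # Hypothetical knobs, loosely matching the categories listed above.
    cache_mb: int      # software logic parameter
    num_threads: int   # hardware configuration
    engine: str        # engine selection


def system(knobs: Knobs) -> dict:
    """Toy stand-in for a real system: knobs in, metrics out."""
    base_latency = {"fast": 20.0, "accurate": 35.0}[knobs.engine]
    latency_ms = (
        base_latency
        + 100.0 / knobs.num_threads
        + 50.0 / (1 + knobs.cache_mb / 64)
    )
    memory_mb = knobs.cache_mb + 8 * knobs.num_threads
    return {"latency_ms": latency_ms, "memory_mb": memory_mb}


metrics = system(Knobs(cache_mb=64, num_threads=4, engine="fast"))
print(metrics)
```

Treating the system as a plain function makes it clear what a learner would have to model: the mapping from knob values to observed metrics.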

 

1) Source of System Complexity

- Heterogeneous classes of decision-making

System operators can control individual knobs, but finding the best knob configuration while accounting for the effect of every combination is difficult.

- Multi-dimensional system evaluation metrics

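As the number of metrics grows, it becomes harder to improve the performance of the real system.

When metrics pursue different goals, optimizing each one separately can produce conflicting results and may yield no real benefit.

- Interactions between subsystems and components

Improving a subsystem may not improve the system as a whole.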
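
The metric conflict described above can be made concrete with a small sketch. The two metric functions and the single `threads` knob are made up for illustration. The weighted sum at the end is only one simple way to trade metrics off, not something the paper prescribes.

```python
def latency_ms(threads: int) -> float:
    # Hypothetical metric: more threads, lower latency.
    return 120.0 / threads + 10.0


def cost_units(threads: int) -> float:
    # Hypothetical metric: more threads, higher cost.
    return 5.0 * threads


candidates = [1, 2, 4, 8, 16]

# Optimizing each metric on its own pulls the knob in opposite directions.
best_for_latency = min(candidates, key=latency_ms)
best_for_cost = min(candidates, key=cost_units)


def combined(threads: int, w_latency: float = 1.0, w_cost: float = 1.0) -> float:
    # Assumed scalarization: a weighted sum of both metrics.
    return w_latency * latency_ms(threads) + w_cost * cost_units(threads)


best_combined = min(candidates, key=combined)
print(best_for_latency, best_for_cost, best_combined)
```

The per-metric optima sit at opposite ends of the knob range, while the combined objective picks a setting in between.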

 

2) Source of Operation Complexity

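- Environment diversity and system dynamics

With the widespread use of cloud computing, modern systems are organized as clusters in data centers.

Designing an optimized system for such a distributed and dynamically changing environment is therefore a challenge.

Hardware is upgraded frequently, and server resources are also shared, so subsystem instances and components can have different resource budgets. Designing with this in mind is a challenge.

Modern systems adopt frequent software update cycles, so their configuration changes quickly.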

- Workload diversity and dynamics

Workloads change dynamically and may be predictable or unpredictable.

Different subsystems behave differently under the same workload.

- Non-trivial system knobs

Because systems and workloads change dynamically, tuning a modern system on prior knowledge alone is difficult.

Dependencies between knobs must also be considered.

 

2. Learning-augmented System Design

1) Principles on Making Systems Learnable

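P1 : Well-specified interfaces that expose the features needed for learning and abstract away system details for generality

Existing interfaces were not built to extract the information that ML/DL needs.

Editing configuration files directly means the system cannot enforce constraints and cannot give feedback on ML/DL actuations.

Parsing raw logs is time-consuming.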

 

So,

- A well-defined control interface is needed

- The control interface should expose features of system behavior so that cause and effect can be captured

- It should remove data outliers to avoid a skewed or misled learning process
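
The three requirements above can be sketched as a small wrapper class. `ControlInterface`, its methods, and the median-based outlier rule are assumptions made for this example. This is not the AutoSys API.

```python
from statistics import median


class ControlInterface:
    """Hypothetical control interface; names and behavior are illustrative."""

    def __init__(self, bounds):
        self.bounds = bounds  # knob name -> (low, high)
        self.knobs = {name: low for name, (low, high) in bounds.items()}
        self.samples = []     # observed (knob snapshot, latency) pairs

    def set_knob(self, name, value):
        # Enforce constraints and report back to the learner,
        # which a direct config-file edit cannot do.
        low, high = self.bounds[name]
        if not low <= value <= high:
            return False
        self.knobs[name] = value
        return True

    def record(self, latency_ms):
        # Expose cause (knob values) together with effect (metric).
        self.samples.append((dict(self.knobs), latency_ms))

    def training_data(self, max_ratio=5.0):
        # Assumed outlier rule: drop samples above max_ratio times the median.
        if not self.samples:
            return []
        med = median(lat for _, lat in self.samples)
        return [(k, lat) for k, lat in self.samples if lat <= max_ratio * med]


ci = ControlInterface({"num_threads": (1, 32)})
accepted = ci.set_knob("num_threads", 8)
rejected = ci.set_knob("num_threads", 64)
for lat in [20.0, 22.0, 21.0, 900.0]:
    ci.record(lat)
clean = ci.training_data()
print(accepted, rejected, len(clean))
```

The out-of-range request is refused with explicit feedback, and the 900 ms sample is filtered out before it can skew training.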

 

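P2 : Monitored ML/DL actuations for preventing and detecting system failures

- Ensuring correctness used to mean reviewing human-written logic, but with ML/DL models it is hard to interpret whether correctness holds

- It is hard to build a complete data set

So, modern systems need a mechanism that monitors ML/DL behavior and validates its effects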

 

2) Principles on Making Learning Manageable

P3 : Modularized learning to reduce learning complexity

- One consideration for large modern systems is to split the learning task into manageable parts

So, with modularized learning, each model learns only the behavior of a subsystem or component

 

Challenges,

- The challenge of modularized learning is deciding on a modularization that minimizes the overall learning cost

- Without knowing how the next module uses the previous module's output, it is hard to reason about the global optimum (data dependency problem)

 

P4 : Resource management for system exploration and model maintenance

- The resources a modern system requires relate to the following two types of jobs

(1) system exploration benchmarks

(2) ML/DL model training

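- Maintaining a learning-augmented system generates a large volume of these jobs for the following two reasons

(1) To match system scale and complexity, modularized learning splits a large cloud system into many ML/DL models

(2) Cloud system dynamics mean that model training must not be treated as a one-shot process. When the assumptions behind a previously trained model change, the model has to be retrained or incrementally updated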

So,

- Keeping a learning-augmented system supplied with good models requires a mechanism for managing the limited resources shared by both job types

ex) prioritizing system exploration benchmarks based on how they are expected to help the predictive model accuracy, and scheduling jobs of heterogeneous requirements for the resource pool
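
The prioritization half of this example can be sketched as a greedy selection under a resource budget. The benchmark names, expected gains, and costs are hypothetical, and ranking by gain per unit cost is an assumed heuristic. The paper does not specify a scheduling policy, and heterogeneous job scheduling is not covered here.

```python
def schedule(benchmarks, budget):
    """Pick benchmarks greedily by expected accuracy gain per unit cost.

    benchmarks: list of (name, expected_gain, cost) tuples.
    budget: total resource units available.
    """
    ranked = sorted(benchmarks, key=lambda b: b[1] / b[2], reverse=True)
    chosen = []
    for name, _gain, cost in ranked:
        if cost <= budget:
            chosen.append(name)
            budget -= cost
    return chosen


plan = schedule(
    [
        ("cache_sweep", 0.10, 4),   # hypothetical benchmark entries
        ("thread_sweep", 0.06, 1),
        ("engine_ab", 0.02, 2),
    ],
    budget=5,
)
print(plan)
```

With a budget of 5 units, the two benchmarks with the best gain-to-cost ratio are scheduled and the third is deferred.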

 

3. AutoSys Framework

1) Training Plane

- Handles the model training requirements that follow from P3 and P4

- Decides whether an existing model should be redesigned or updated, based on feedback from the Inference Plane

- Candidate Generator : iteratively generates training inputs that target learning regions with low inference accuracy or high inference uncertainty

- Trial Manager : abstracts each candidate as a Trial instance. Trials can impose heterogeneous resource requirements in order to support different decision-making scenarios

- Model Trainer : the model architecture should be designed with inference accuracy, inference cost, and training cost in mind
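
One way to picture the Candidate Generator is as an uncertainty-driven selection step. The sketch below uses disagreement within a small ensemble as a stand-in for inference uncertainty. That choice, the function name `pick_candidates`, and the toy models are all assumptions, since the summary does not say how AutoSys measures uncertainty.

```python
from statistics import pstdev


def pick_candidates(ensemble, pool, k):
    """Return the k inputs from pool where ensemble members disagree most."""

    def uncertainty(x):
        # Population standard deviation of the members' predictions.
        return pstdev([model(x) for model in ensemble])

    return sorted(pool, key=uncertainty, reverse=True)[:k]


# Toy ensemble: three hypothetical models that agree near x = 0
# and diverge as x grows.
ensemble = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.1 * x * x,
    lambda x: 2.0 * x - 0.1 * x * x,
]
pool = [0, 1, 2, 5, 10]

next_inputs = pick_candidates(ensemble, pool, k=2)
print(next_inputs)
```

The selected inputs would then be wrapped as trials and run against the target system, so the next round of training covers the region where the models are least reliable.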

 

2) Inference Plane

- Handles the decision-making requirements that follow from P2 and P4

- Inference Runtime : through modularization, the system is represented as a set of models. The Inference Plane can turn the final output into separate actuations for independent system modules

- Rule Engine : to guard against potential learning-induced failures, AutoSys wraps the ML/DL decision algorithms in a deterministic rule-checking engine with human-readable rules. The rules check basic sanity and knob dependencies, and compare predicted values against actual values
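
The three kinds of checks named for the Rule Engine can be sketched as a deterministic function. The knob names, the range limits, and the 25% tolerance are hypothetical values chosen for this example, not rules taken from AutoSys.

```python
def check_decision(knobs, predicted_latency, observed_latency, tolerance=0.25):
    """Return the list of violated rules; an empty list means the decision passes."""
    violations = []

    # Basic sanity: the knob value must lie in an allowed range (assumed 1..64).
    if not 1 <= knobs["num_threads"] <= 64:
        violations.append("num_threads out of range")

    # Knob dependency: the cache cannot be larger than the memory it lives in.
    if knobs["cache_mb"] > knobs["memory_mb"]:
        violations.append("cache_mb exceeds memory_mb")

    # Prediction check: flag a large gap between predicted and actual values.
    if abs(predicted_latency - observed_latency) > tolerance * observed_latency:
        violations.append("prediction deviates from observation")

    return violations


ok = check_decision(
    {"num_threads": 8, "cache_mb": 256, "memory_mb": 1024}, 40.0, 42.0
)
bad = check_decision(
    {"num_threads": 128, "cache_mb": 2048, "memory_mb": 1024}, 40.0, 100.0
)
print(ok, bad)
```

Because every rule is a plain comparison, an operator can read and audit the checks even when the model behind the decision is opaque.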

 

3) Target System

- The target system is wrapped by a control interface

- A system monitor analyzes logs to detect potential health problems

 

critique

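The paper does not include a concrete account of AutoSys in use, so it was hard to understand what actually happens in each Plane. It states what each part should do, but the writing makes it hard to tell what work is actually carried out, which is disappointing. The discussion also relies only on insights from the Web Search case, and I would have liked to see other cases explored as well.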
