# Geometric Exploration for Online Control

NIPS 2020

Abstract

We study the control of an *unknown* linear dynamical system under general convex costs. The objective is minimizing regret vs. the class of disturbance-feedback-controllers, which encompasses all stabilizing linear-dynamical-controllers. In this work, we first consider the case of known cost functions, for which we design the first ...

Introduction

- The authors study the online control of an unknown linear dynamical system under general convex costs.
- For online control of LDS with known cost function, with high probability, Algorithm 2 has regret that scales as $\sqrt{T}$ in the horizon $T$.
- For online control of LDS with bandit feedback, with high probability, Algorithm 5 has regret that scales as $\sqrt{T}$ in the horizon $T$.
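To make the setting concrete, here is a minimal Python sketch of a linear dynamical system driven by a disturbance-feedback controller (DFC). This summary does not spell out the dynamics or the controller, so the forms used below are assumptions taken from the standard setup in the online control literature: dynamics $x_{t+1} = A x_t + B u_t + w_t$ and controls $u_t = \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$. The function name `simulate_dfc`, the quadratic cost, and the toy matrices are hypothetical, and the disturbances are handed to the controller directly, whereas in the unknown-system setting they would have to be estimated.

```python
import numpy as np

def simulate_dfc(A, B, M, disturbances, cost):
    """Roll out x_{t+1} = A x_t + B u_t + w_t (assumed form) under a
    disturbance-feedback controller u_t = sum_{i=1..H} M[i-1] @ w_{t-i},
    starting from x_0 = 0, and return the total cost."""
    H = len(M)
    x = np.zeros(A.shape[0])
    total = 0.0
    for t, w in enumerate(disturbances):
        # The control is a linear function of the last H disturbances only.
        u = np.zeros(B.shape[1])
        for i in range(1, H + 1):
            if t - i >= 0:
                u += M[i - 1] @ disturbances[t - i]
        total += cost(x, u)
        x = A @ x + B @ u + w
    return total

# Toy scalar system (hypothetical numbers, for illustration only).
A = np.array([[0.5]])
B = np.array([[1.0]])
w = [np.array([1.0])] * 3
quadratic = lambda x, u: float(x @ x + u @ u)  # one example of a convex cost

cost_zero_policy = simulate_dfc(A, B, [np.zeros((1, 1))], w, quadratic)
cost_feedback = simulate_dfc(A, B, [np.array([[-0.5]])], w, quadratic)
print(cost_zero_policy, cost_feedback)
```

Regret compares the learner's accumulated cost with the cost of the best fixed policy `M` in hindsight; the sketch only evaluates the cost of a single fixed `M`.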

Highlights

- We study the online control of an unknown linear dynamical system under general convex costs
- We show that [...], which is significantly simpler than the stochastic bandit convex optimization (SBCO) algorithms, achieves [...]
- In Appendix F.2, we show that Theorem 5 holds almost unchanged for this case
- Can we improve the $O(C) \cdot d_x d_u (d_x + d_u) \sqrt{T}$ regret bound in terms of dimension dependence? This looks plausible because the barycentric spanners are constructed by treating the policies as flattened vectors of dimension $d_x d_u H$, so the matrix structure is not exploited.
- Is there a gradient-based algorithm that achieves $\sqrt{T}$-regret? Third, a challenging question is whether $\sqrt{T}$-regret is achievable for nonstochastic control, where the disturbances are adversarial and the cost function changes adversarially over time.
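The remark about flattened vectors can be illustrated with a short Python sketch. It assumes, as in the usual DFC parameterization, that a policy is a list of $H$ matrices of shape $(d_u, d_x)$; the sizes and the helper names `flatten_policy` and `unflatten_policy` are hypothetical. Once flattened, a policy is just a point in $\mathbb{R}^{d_x d_u H}$, and nothing in that vector records which entries shared a row or a column.

```python
import numpy as np

def flatten_policy(M):
    """Stack H matrices of shape (d_u, d_x) into one vector of length d_x*d_u*H."""
    return np.concatenate([M_i.ravel() for M_i in M])

def unflatten_policy(v, d_x, d_u, H):
    """Inverse of flatten_policy: recover the list of H (d_u, d_x) matrices."""
    block = d_u * d_x
    return [v[i * block:(i + 1) * block].reshape(d_u, d_x) for i in range(H)]

d_x, d_u, H = 3, 2, 4  # hypothetical sizes
rng = np.random.default_rng(0)
M = [rng.standard_normal((d_u, d_x)) for _ in range(H)]

v = flatten_policy(M)
print(v.shape)  # dimension of the flattened policy, d_x * d_u * H
```

A spanner built on such vectors has one element per coordinate, which is where the $d_x d_u H$ dimension enters.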

Results

- In Section 4, the authors describe the main contribution: the algorithm for the case of known cost function, and the proof of Theorem 1.
- To eliminate suboptimal policies from $U_r$ and obtain $U_{r+1}$, it suffices to refine the estimate of $B$ only in the directions relevant to the controls of $U_r$. To do this, at the beginning of epoch $r$, the algorithm constructs a barycentric spanner of $U_r$. As the authors explain in Subsection 3.2, this can be done in polynomial time, because the cost function $c$ is known and $U_r$ is convex.
- The authors extend the algorithm and the proof to tackle the general case of a $(\kappa, \gamma)$-strongly stable $A_*$, when the cost function is known.
- Before presenting the algorithm, the authors slightly modify Definition 4 and Theorem 5 to take into account that the state is an affine function of the policy.
- Given an oracle for optimizing linear functions over $S$, for any $C > 1$ one can compute an affine $C$-barycentric spanner for $S$ in polynomial time, using $O(d^2 \log_C(d))$ calls to the oracle.
- Set $M_{t+1} = M_t$. The authors formally state the theorem, which says that after appropriately initializing the input parameters of the SBCO algorithm, Algorithm 5 achieves $\sqrt{T}$-regret.
- There exist $C_1, C_2, C_3, C_4, C_5 = \mathrm{poly}(d_x, d_u, \kappa, \beta, \gamma^{-1}, G, \log T)$ such that, after initializing the SBCO algorithm with $d = d_x \cdot d_u \cdot H$, $D = C_1$, $L = C_2$, $\sigma^2 = C_3$ and $n = T/(2H + 2)$, if the horizon $T \ge C_4$, then with high probability warmup exploration (Algorithm 6 in Appendix E) followed by Algorithm 5 satisfies $R_T \le C_5 \cdot \sqrt{T}$.
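The barycentric spanner step can be sketched in Python. This is an illustration of the standard swap-based construction, not the paper's procedure: it computes an ordinary (linear, not affine) $C$-barycentric spanner, and it replaces the linear optimization oracle with a brute-force scan over a finite set of points, so it does not reflect the $O(d^2 \log_C(d))$ oracle-call bound. The function name `barycentric_spanner` and the random test set are hypothetical.

```python
import numpy as np

def barycentric_spanner(S, C=2.0):
    """Return indices of d rows of S (shape (n, d), rows spanning R^d) such
    that every row of S is a combination of the chosen rows with
    coefficients in [-C, C]. Requires C > 1."""
    n, d = S.shape
    X = np.eye(d)      # columns form the current basis, starting at e_1..e_d
    idx = [-1] * d

    def best_replacement(i):
        # Brute-force stand-in for the oracle: find the point that maximizes
        # |det| when it replaces column i of the current basis.
        best_j, best_val = -1, -1.0
        for j in range(n):
            Y = X.copy()
            Y[:, i] = S[j]
            val = abs(np.linalg.det(Y))
            if val > best_val:
                best_j, best_val = j, val
        return best_j, best_val

    # Phase 1: replace the standard basis vectors one by one with points of S.
    for i in range(d):
        j, _ = best_replacement(i)
        X[:, i] = S[j]
        idx[i] = j

    # Phase 2: swap while some point grows |det| by more than a factor C.
    improved = True
    while improved:
        improved = False
        base = abs(np.linalg.det(X))
        for i in range(d):
            j, val = best_replacement(i)
            if val > C * base:
                X[:, i] = S[j]
                idx[i] = j
                improved = True
                break
    return idx

rng = np.random.default_rng(0)
points = rng.standard_normal((50, 3))
chosen = barycentric_spanner(points, C=2.0)
# Coefficients of every point in the chosen basis; by Cramer's rule they are
# ratios of determinants, so each has absolute value at most C on termination.
coeffs = np.linalg.solve(points[chosen].T, points.T)
print(chosen, np.abs(coeffs).max())
```

Each swap multiplies the absolute determinant by more than $C > 1$, so the loop terminates on a finite set.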

Conclusion

- The authors gave the first polynomial-time algorithms with optimal regret, with respect to the time horizon, for online control of LDS with general convex costs and comparator class the set of DFCs. The authors' main result was a novel geometric exploration scheme for the case where the cost function is known.
- A challenging question is whether $\sqrt{T}$-regret is achievable for nonstochastic control, where the disturbances are adversarial and the cost function changes adversarially over time.
- Can the authors prove regret bounds with respect to interesting nonlinear, yet tractable policy classes?
