Applying the Modular Optimal Learning Testing Environment (MOLTE) to Drug Discovery

2019-06-02
OL
   

TODO

Implementation of MOLTE

1. Introduction to MOLTE

The Modular Optimal Learning Testing Environment (MOLTE) is a MATLAB-based simulator of learning policies for Bayesian ranking and selection (R&S) problems and stochastic multi-armed bandit problems.[1] MOLTE targets decision problems whose alternatives form a discrete set, such as material compositions, drug compounds, product features, and medical diagnoses.

For each problem in the Excel spreadsheet, MOLTE runs the listed policies in a head-to-head comparison numP times (numP is defined in MOLTE.m). On each run, MOLTE generates numTruth different sample paths, which are shared by all policies. It then computes the objective value on each sample path and averages over the numTruth trials to obtain the expected terminal value or the expected cumulative value. The user can specify an online objective function (bandit problems) or an offline objective function (ranking-and-selection or stochastic search problems; see Section 3.1).
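To make the loop above concrete, here is a minimal MATLAB sketch of the comparison procedure as described; sampleTruths and runPolicy are hypothetical placeholders (not MOLTE functions), and the numP/numTruth values are arbitrary.

    % Minimal sketch of the comparison loop described above (hypothetical helpers).
    numP = 50;                               % repetitions per problem/policy pair
    numTruth = 100;                          % sample paths shared by all policies per repetition
    policies = {@policyA, @policyB};         % placeholder policy handles
    offlineVal = zeros(numP, numel(policies));
    onlineVal  = zeros(numP, numel(policies));
    for p = 1:numP
        truths = sampleTruths(numTruth);     % hypothetical: draw numTruth shared sample paths
        for k = 1:numel(policies)
            finalV = zeros(numTruth, 1);
            cumV   = zeros(numTruth, 1);
            for t = 1:numTruth
                % hypothetical: run one policy on one sample path
                [finalV(t), cumV(t)] = runPolicy(policies{k}, truths{t});
            end
            offlineVal(p, k) = mean(finalV); % expected terminal value (offline objective)
            onlineVal(p, k)  = mean(cumV);   % expected cumulative value (online objective)
        end
    end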

2. MOLTE Workflow

  1. Define a problem class (using the standard API)
  2. Define a policy class (using the standard API; optional)
  3. Specify the problem classes and policy classes in the Excel spreadsheet (see Section 3.1)
  4. Run MOLTE in MATLAB (see the sketch after this list)
  5. Collect the generated output files
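Assuming MOLTE.m (mentioned in Section 1) is the entry script, step 4 might look like the following in the MATLAB command window; the exact invocation is an assumption based on this document, not a documented interface.

    % From the MOLTE root directory, after filling in the spreadsheet:
    cd('path/to/MOLTE');   % placeholder path to the MOLTE folder
    MOLTE                  % runs MOLTE.m, which reads the spreadsheet and writes the outputs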

2.1 Virtual Screening with MOLTE

Reference: Negoescu, D. M., & Frazier, P. I. (2009). Optimal Learning for Drug Discovery in Ewing's Sarcoma.

TODO:

3. Problem Input

3.1 Excel Spreadsheet (Specifying Problems and Policies)

Each problem requires the following fields:

  1. Problem class: a problem pre-coded as a file in the folder problemClasses; this entry should match the file name. If you need to specify …
  2. Prior: how the prior is obtained.
    • Default: use the prior predefined by the problem class.
    • MLE: initialize the prior from a Latin hypercube design with maximum likelihood estimation.[2]
    • Given: supplied as parameters or through a prior file (file name: Prior_<problem class>.mat).
    • Uninformative: use a zero mean and an infinite variance.
  3. Measurement Budget: the ratio between the time horizon of the decision process and the number of alternatives.
  4. Belief Model: independent or correlated beliefs.
  5. Offline/Online: whether the objective function maximizes the terminal value or the cumulative (total) value.
  6. Number of Policies: the number of policies to compare, followed by the policy names.

Example:

Problem class | Prior         | Measurement Budget N | Belief Model | Offline/Online | Number of Policies | Policies
PayloadD      | MLE           | 0.2                  | correlated   | Offline        | 4                  | kriging, EXPL, IE(1.7), ThompsonSampling
Branin        | MLE           | 0.1                  | correlated   | Online         | 4                  | OLkgcb, UCBEcb(*), BayesUCB, IE(1.2)
Bubeck4       | uninformative | 5                    | independent  | Online         | 4                  | OLKG, UCB, SR, UCBV
GPR           | Default       | 0.3                  | correlated   | Offline        | 4                  | kriging, kgcb, IE(1), EXPT

3.3 Predefined Problem Classes (23 in total)

  1. Bubeck's experiments (Audibert et al. 2010): we consider Bernoulli distributions with the mean of the best arm always μ1 = 0.5; M is the number of arms.[3]
  2. Asymmetric unimodular function (AUF): x is a controllable parameter ranging from 21 to 120. The objective function depends on a random variable ξ; its parameters and the distribution of ξ are all unknown. ξ is taken to be normally distributed with mean 60. Three noise levels are considered by setting different ratios between the standard deviation and the mean of ξ: HNoise (0.5), MNoise (0.4), LNoise (0.3). Unless explicitly pointed out, experiments use LNoise.
  3. Equal-prior: M = 100. The true values are uniformly distributed over [0, 60], with the same measurement noise and the same prior for every x. (Note on items 4-11 below: all of the standard optimization test functions are flipped in MOLTE to produce maximization problems rather than minimization problems, in line with the R&S and bandit settings; the standard deviation of the additive Gaussian noise is set to 20 percent of the range of the function values.)
  4. Rosenbrock function with additive noise, defined on −3 ≤ x ≤ 3, −3 ≤ y ≤ 3; x and y are uniformly discretized into 13 × 13 alternatives.
  5. Pinter's function with additive noise, defined on −3 ≤ x ≤ 3, −3 ≤ y ≤ 3; x and y are uniformly discretized into 13 × 13 alternatives.
  6. Goldstein-Price's function with additive noise, defined on −3 ≤ x ≤ 3, −3 ≤ y ≤ 3; x and y are uniformly discretized into 13 × 13 alternatives.
  7. Branin's function with additive noise, defined on −5 ≤ x ≤ 10, 0 ≤ y ≤ 15; x and y are uniformly discretized into 15 × 15 alternatives (see the sketch after this list).
  8. Ackley's function with additive noise, defined on −3 ≤ x ≤ 3, −3 ≤ y ≤ 3; x and y are uniformly discretized into 13 × 13 alternatives.
  9. Hyper-ellipsoid function with additive noise, defined on −3 ≤ x ≤ 3, −3 ≤ y ≤ 3; x and y are uniformly discretized into 13 × 13 alternatives.
  10. Rastrigin function with additive noise, defined on −3 ≤ x ≤ 3, −3 ≤ y ≤ 3; x and y are uniformly discretized into 11 × 11 alternatives.
  11. Six-hump camel back function with additive noise, defined on −2 ≤ x ≤ 2, −1 ≤ y ≤ 1; x and y are uniformly discretized into 13 × 13 alternatives.
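As a concrete illustration of items 4 through 11 (referenced from item 7), the sketch below builds a noisy, sign-flipped Branin problem on a 15 × 15 grid, following the note in item 3 that sets the noise standard deviation to 20 percent of the range of the function values. The Branin parameterization is the common textbook one, and the code is not MOLTE's problem-class implementation.

    % Standard Branin function (normally minimized) on -5 <= x <= 10, 0 <= y <= 15.
    a = 1; b = 5.1/(4*pi^2); c = 5/pi; r = 6; s = 10; t = 1/(8*pi);
    branin = @(x, y) a*(y - b*x.^2 + c*x - r).^2 + s*(1 - t)*cos(x) + s;

    % Discretize into 15 x 15 alternatives and flip the sign to obtain a maximization problem.
    [X, Y] = meshgrid(linspace(-5, 10, 15), linspace(0, 15, 15));
    truth = -branin(X, Y);

    % Additive Gaussian noise: std equal to 20 percent of the range of the function values.
    noiseStd = 0.2 * (max(truth(:)) - min(truth(:)));
    measure = @(i) truth(i) + noiseStd * randn;   % one noisy observation of alternative i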

3.4 Predefined Policy Classes (20 in total)

We have pre-coded a variety of state-of-the-art policies π, which differ in how they choose the alternative to measure at time n given the state S^n.

  1. Knowledge gradient (KG) (Frazier et al. 2008; Frazier, Powell, and Dayanik 2009): this policy is designed for the offline objective (maximizing the terminal value). Define the knowledge gradient of alternative x at time n as ν_x^{KG,n} = E[ max_{x'} μ_{x'}^{n+1} − max_{x'} μ_{x'}^{n} | S^n, x^n = x ], where μ^n denotes the posterior means; the KG policy measures X^{KG}(S^n) = argmax_x ν_x^{KG,n}.
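For independent normal beliefs the knowledge gradient above has a closed form (Frazier et al. 2008). The sketch below computes it; mu and sigma are the current posterior means and standard deviations, sigmaW is the measurement noise standard deviation, and the code is illustrative rather than MOLTE's implementation (normcdf/normpdf require the Statistics Toolbox).

    % Knowledge-gradient factors for independent normal beliefs (illustrative sketch).
    % Assumes at least two alternatives (M >= 2).
    function nu = kgFactors(mu, sigma, sigmaW)
        M = numel(mu);
        nu = zeros(M, 1);
        % Standard deviation of the change in the posterior mean after one more measurement.
        sigmaTilde = sigma.^2 ./ sqrt(sigma.^2 + sigmaW^2);
        for x = 1:M
            others = mu([1:x-1, x+1:M]);
            zeta = -abs(mu(x) - max(others)) / sigmaTilde(x);
            nu(x) = sigmaTilde(x) * (zeta * normcdf(zeta) + normpdf(zeta));
        end
    end
    % The KG policy then measures the alternative with the largest nu(x).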

3.5 Defining a Problem Class

TODO:

3.6 Defining a Policy Class

TODO:

4. Problem Output

  1. objectiveFunction.mat: the objective function value of every trial.
  2. choice.mat: the choices made at each step by each policy.
  3. FinalFit.mat: the final estimates, and the corresponding state, when the measurement budget is exhausted.
  4. alpha.txt: the tunable parameters of the policies that require tuning.
  5. offline_hist.pdf: histogram of the terminal values.
  6. online_hist.pdf: histogram of the cumulative values.
  7. histChoice.m: reads choice.mat and generates the distribution of the chosen alternatives for each policy and each trial, e.g. the frequency of choosing each of 100 alternatives under a measurement budget of 300 (shown as a figure in the original post, not reproduced here).
  8. genProb.m: reads objectiveFunction.mat and plots the mean opportunity cost with error bars indicating the standard deviation for each policy (saved as OC_hist.tif), together with the probability of each policy being optimal and being the best (also shown as figures in the original post).
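The snippet below shows one way the outputs listed above could be inspected in MATLAB; the variable names stored inside the .mat files are assumptions and may differ from what MOLTE actually writes.

    % Load the saved objective values and plot the distribution for the first policy.
    S = load('objectiveFunction.mat');
    vars = fieldnames(S);
    vals = S.(vars{1});          % assumed layout: trials x policies matrix of objective values
    histogram(vals(:, 1));       % distribution of objective values for policy 1
    xlabel('objective value'); ylabel('count');

    C = load('choice.mat');      % decisions made at each step, as described in item 2 above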

Example Problems

  1. Bubeck1 to Bubeck7
  2. Asymmetric unimodular functions with the parameter chosen to be 0.2, 0.5, and 0.8, with high or medium noise level, respectively
  3. Rosenbrock function with additive noise
  4. Pinter’s function with additive noise
  5. Goldstein function with additive noise
  6. Griewank function with additive noise
  7. Branin’s function with additive noise
  8. Axis parallel hyper-ellipsoid function with additive noise
  9. Rastrigin’s function with additive noise
  10. Ackley’s function with additive noise
  11. Six-hump camel back function with additive noise
  12. Easom function with additive noise

Example Policies

  1. Interval Estimation (IE): works with correlated belief models (see the sketch after this list)
  2. Kriging: works with correlated belief models
  3. UCB (and a modified version, UCBcb, incorporating correlated beliefs)
  4. UCBNormal
  5. UCB-E (and a modified version, UCBEcb, incorporating correlated beliefs)
  6. UCB-V (and a modified version, UCBVcb, incorporating correlated beliefs)
  7. Bayes-UCB: works with correlated belief models
  8. KL-UCB
  9. Knowledge gradient policy for offline learning: works with correlated belief models
  10. Knowledge gradient for online learning: works with correlated belief models
  11. Successive rejects
  12. Thompson sampling: works with correlated belief models
  13. Pure exploration: works with correlated belief models
  14. Pure exploitation: works with correlated belief models
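As a concrete example of item 1 above, interval estimation picks the alternative with the largest upper bound mu + z_alpha * sigma, where z_alpha is the tunable parameter shown in the spreadsheet, e.g. IE(1.7). A minimal sketch for independent normal beliefs, not MOLTE's code:

    % Interval estimation (IE) decision rule for independent normal beliefs (sketch).
    mu     = [1.0; 1.2; 0.8];            % current posterior means
    sigma  = [0.5; 0.1; 0.9];            % current posterior standard deviations
    zAlpha = 1.7;                        % tunable parameter, as in the spreadsheet entry IE(1.7)
    [~, x] = max(mu + zAlpha * sigma);   % measures alternative 3 here (0.8 + 1.7*0.9 = 2.33)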

Notes

  1. The multi-armed bandit problem, also known as the sequential resource allocation problem: imagine a row of slot machines in front of us. We number them, and in each round we choose one machine to play and record the reward it returns. Since the machines are not identical, after many rounds we accumulate partial statistical information about each machine and can then favor the one that appears to pay the most. In the bandit literature, each machine is called an arm. Two questions arise. First, how are the rewards generated? Common models include: (a) stochastic bandits, where each arm's reward follows a fixed probability distribution; (b) adversarial bandits, where an adversary adjusts the arms' rewards dynamically, e.g. lowering the reward of the arm you choose while raising the others (the adversary cannot set all rewards to zero, because then no reward can be obtained and every policy performs the same); (c) Markovian bandits, where each arm's reward is governed by a Markov chain. Second, how do we measure the quality of a policy? Simply using the total reward is impractical, so we define the regret, the difference between the best achievable total reward and the total reward actually obtained, as the performance measure.

  2. In a Latin hypercube, each factor has as many levels as there are runs in the design, and these levels are equally spaced between the factor's lower and upper bounds. Like the sphere-packing method, the Latin hypercube method chooses points so as to maximize the minimum distance between design points, but subject to a constraint: the constraint keeps the factor levels evenly spaced (see the example after these notes).

  3. Audibert, J.-Y., & Bubeck, S. (2010). Best arm identification in multi-armed bandits. In COLT 2010: The 23rd Conference on Learning Theory (pp. 13-p).
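For the maximin Latin hypercube design described in note 2, MATLAB's Statistics and Machine Learning Toolbox provides lhsdesign; whether MOLTE uses this particular call is an assumption, and the bounds below are placeholders.

    % 20-point maximin Latin hypercube design in 3 factors on [0,1]^3, then rescaled.
    n = 20; p = 3;
    U = lhsdesign(n, p, 'criterion', 'maximin', 'iterations', 50);
    lb = [0 0 0];  ub = [1 60 120];              % placeholder factor bounds
    X = lb + U .* (ub - lb);                     % rescale (implicit expansion, R2016b+)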

