Python Batch.advantages Examples

Programming language: Python

Namespace/package name: tianshou.data

Class/type: Batch

Method/function: advantages

Code examples on hotexamples.com: 1

Python Batch.advantages - 1 code example found. These are top-rated real-world Python examples of tianshou.data.Batch.advantages extracted from open source projects. Rating the examples helps surface higher-quality ones.

Frequently used methods

Batch(30)

split(30)

weight(28)

pop(23)

returns(17)

stack(14)

update(11)

cat(9)

rew(9)

obs(8)

get(7)

act(7)

to_torch(6)

logp_old(6)

done(6)

cat_(6)

append(5)

adv(5)

is_empty(5)

keys(3)

to_numpy(3)

items(3)

obs_next(2)

update_weight(2)

empty_(2)

empty(2)

cat_list(2)

v_s(2)

v(2)

b(2)

values(1)

value_targets(1)

advantages(1)

loss(1)

policy(1)

stack_(1)

__repr__(1)

info(1)

indice(1)

Example #1

File: marvil_policy.py Project: hebowei2000/marvil_tianshou

    # Assumes module-level imports: `import numpy as np` and `from tianshou.data import Batch`.
    def compute_advantage(self, batch: Batch, last_r: float,
                          gamma: float = 0.9, lamda: float = 1.0,
                          use_gae: bool = True, use_critic: bool = True) -> Batch:
        """
        Given a rollout, compute its value targets and advantages.

        Args:
            batch (Batch): batch of a single trajectory
            last_r (float): value estimate for the last observation
            gamma (float): discount factor
            lamda (float): lambda parameter for GAE
            use_gae (bool): whether to use Generalized Advantage Estimation
            use_critic (bool): whether to use a critic (value estimate) as the
                baseline; if False, 0 is used as the baseline

        Returns:
            batch (Batch): the input batch with `advantages` and `value_targets` set
        """

        # Value predictions are required whenever the critic is used.
        assert "vf_preds" in batch or not use_critic
        assert use_critic or not use_gae

        if use_gae:
            vpred_t = np.concatenate([batch.vf_preds, np.array([last_r])])
            delta_t = (batch.rew + gamma * vpred_t[1:] - vpred_t[:-1])
            # This formula for the advantage comes from "Generalized Advantage Estimation": https://arxiv.org/abs/1506.02438
            batch.advantages = self.discount_cumsum(delta_t, gamma * lamda)
            batch.value_targets = (batch.advantages + batch.vf_preds).astype(np.float32)

        else:
            rewards_plus_v = np.concatenate([batch.rew, np.array([last_r])])
            discounted_returns = self.discount_cumsum(rewards_plus_v, gamma)[:-1].astype(np.float32)

            if use_critic:
                batch.advantages = discounted_returns - batch.vf_preds
                batch.value_targets = discounted_returns
            else:
                batch.advantages = discounted_returns
                batch.value_targets = np.zeros_like(batch.advantages)

        batch.advantages = batch.advantages.astype(np.float32)

        return batch
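
The snippet above calls a `discount_cumsum` helper that is not shown on this page. The following is a minimal sketch of what such a helper might look like, together with the GAE branch rewritten as a standalone function. It is an assumption, not the project's actual code: it uses plain Python lists instead of NumPy arrays so the numbers can be checked by hand, and the name `gae_advantages` is hypothetical.

```python
from typing import List, Sequence


def discount_cumsum(x: Sequence[float], gamma: float) -> List[float]:
    """Return y with y[t] = x[t] + gamma * x[t+1] + gamma**2 * x[t+2] + ...

    Assumed behaviour of the helper used above; the original is not shown.
    """
    out = [0.0] * len(x)
    running = 0.0
    # Walk backwards so each step reuses the already-discounted tail.
    for t in range(len(x) - 1, -1, -1):
        running = x[t] + gamma * running
        out[t] = running
    return out


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   last_value: float, gamma: float, lam: float) -> List[float]:
    """Hypothetical standalone version of the `use_gae` branch."""
    # Append the bootstrap value for the observation after the last step.
    vpred = list(values) + [last_value]
    # One-step TD residuals: r_t + gamma * V(s_{t+1}) - V(s_t).
    deltas = [r + gamma * vpred[t + 1] - vpred[t] for t, r in enumerate(rewards)]
    # GAE is the (gamma * lambda)-discounted sum of the residuals.
    return discount_cumsum(deltas, gamma * lam)


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    values = [0.5, 0.5]
    adv = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
    # Value targets are advantages plus value predictions, as in the snippet above.
    targets = [a + v for a, v in zip(adv, values)]
    print(adv, targets)
```

With `lam=1.0` the value targets reduce to the discounted returns, and with `lam=0.0` each advantage is just the one-step TD residual, which is the usual bias/variance trade-off that the `lamda` parameter controls.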