Example #1
'''
Note: here the states in full_traj are continuous 2-D points, e.g.

    array([[-0.25, -0.75],
           [-0.25, -0.75],
           [-0.25, -0.75],
           [ 0.75, -0.75],
           [-0.25,  0.25],
           [ 0.75, -0.75]])

whereas it is a discrete one-hot array([1., 0., 0., 0.]) for the other
experiment1.py code.
'''
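# For contrast, the two state encodings mentioned above, side by side. The
# values come from the docstring; the one_hot helper itself is just an
# illustrative sketch, not part of either experiment's code.
import numpy as np

continuous_state = np.array([-0.25, -0.75])  # this experiment: a raw 2-D point

def one_hot(index, size=4):
    """Discrete encoding of a state, as used by the other experiment1.py code."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

print(one_hot(0))  # -> [1. 0. 0. 0.]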
# Debug: inspect the shape and contents of full_traj.
# print(np.shape(full_traj), full_traj)

demonstrations = 10
super_iterations = 3000  #10000
sub_iterations = 0
learning_rate = 10

# k = 4 in this case: the number of primitive options.
m = GridWorldModel(4, statedim=(12, 2))
m.sess.run(tf.global_variables_initializer())  # initialize_all_variables() is deprecated

with tf.variable_scope("optimizer"):
    opt = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    # Define the optimizer, then fit on the full trajectory for
    # super_iterations outer steps and sub_iterations inner steps.
    m.train(opt, full_traj, super_iterations, sub_iterations)
'''So how do we generate the visualised options?
We can look at a state and then apply the respective option policy from that state.
How is this done for the gridworld data? It just computes the max of the action
probabilities over the entire gridworld. Instead of doing that, we need to follow
the same option policy over continuous states until it actually terminates.
How do we do this?

1. Find a few good start states in the state space.
2. Iterate over the number of options and do the same thing as before.
3. Until the termination policy fires, repeatedly evaluate evalpi for the same
   state (see the sketch below).
'''
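# A minimal sketch of the rollout described above. Assumptions (not from the
# code above): m.evalpi(option, [(state, action)]) returns the probability of
# `action` in `state` under that option's policy; a termination model
# m.evalpsi(option, [(state, action)]) returns the probability of terminating;
# env_step(state, action) is a user-supplied transition function.
import numpy as np

def rollout_option(m, env_step, start_state, option,
                   num_actions=4, max_steps=50, term_thresh=0.5):
    """Follow one option's policy from start_state until it terminates."""
    state = start_state
    visited = [state]
    for _ in range(max_steps):
        # Greedy action: max of the action probabilities, as in the gridworld case.
        probs = [m.evalpi(option, [(state, a)]) for a in range(num_actions)]
        action = int(np.argmax(probs))
        state = env_step(state, action)
        visited.append(state)
        # Stop once the (assumed) termination model says the option has ended.
        if m.evalpsi(option, [(state, action)]) > term_thresh:
            break
    return visited

# Usage: visualise each of the k = 4 options from a few hand-picked start states.
# for option in range(4):
#     for s0 in good_start_states:
#         print(option, rollout_option(m, env_step, s0, option))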
Example #2
'''
Note: here the states in full_traj are continuous 2-D points, e.g.

    array([[-0.25, -0.75],
           [-0.25, -0.75],
           [ 0.75, -0.75],
           [-0.25,  0.25],
           [ 0.75, -0.75]])

whereas it is a discrete one-hot array([1., 0., 0., 0.]) for the other
experiment1.py code.
'''
# Debug: inspect the shape and contents of full_traj.
# print(np.shape(full_traj), full_traj)

demonstrations = 10
super_iterations = 1000  # 3000 # 10000
sub_iterations = 0
learning_rate = 10


# k = 4 in this case: the number of primitive options.
m = GridWorldModel(4, statedim=(12, 2))
m.sess.run(tf.global_variables_initializer())  # initialize_all_variables() is deprecated

with tf.variable_scope("optimizer"):
    opt = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    # Define the optimizer, then fit on the full trajectory for
    # super_iterations outer steps and sub_iterations inner steps.
    m.train(opt, full_traj, super_iterations, sub_iterations)

'''So how do we generate the visualised options?
We can look at a state and then apply the respective option policy from that state.
How is this done for the gridworld data? It just computes the max of the action
probabilities over the entire gridworld. Instead of doing that, we need to follow
the same option policy over continuous states until it actually terminates.
How do we do this?

1. Find a few good start states in the state space.
2. Iterate over the number of options and do the same thing as before.