OpenRLHF Study Notes
RLHF
NaiveExperienceMaker
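micro_rollout_batch_size puzzled me for a long time. It turns out to be the number of sequences (prompts, after each one is repeated n_samples_per_prompt times) passed to a single actor.generate call.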
def generate_samples(self, all_prompts: List[str], **generate_kwargs) -> List[Samples]:
    """
    Generate samples and return in batches.
    """
    assert not getattr(self, "packing_samples", False)
    args = self.strategy.args
    self.actor.eval()
    # Sample multiple responses: repeat each prompt n_samples_per_prompt times.
    all_prompts = sum([[prompt] * args.n_samples_per_prompt for prompt in all_prompts], [])
    samples_list = []
    # Each iteration generates one micro-batch of micro_rollout_batch_size prompts.
    for i in range(0, len(all_prompts), args.micro_rollout_batch_size):
        prompts = all_prompts[i : i + args.micro_rollout_batch_size]
        inputs = self.tokenize_fn(prompts, self.prompt_max_len, device="cuda")
        sequences, attention_mask, action_mask = self.actor.generate(**inputs, **generate_kwargs)
        samples = Samples(
            sequences=sequences,
            attention_mask=attention_mask,
            action_mask=action_mask,
            num_actions=action_mask.size(1),
            packed_seq_lens=None,
            response_length=action_mask.float().sum(dim=-1),
            total_length=attention_mask.float().sum(dim=-1),
        )
        samples_list.append(samples)
    return samples_list
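To check my understanding of the batching in generate_samples, here is a minimal standalone sketch of just the repeat-then-chunk step. The helper name expand_and_chunk is hypothetical (it is not part of OpenRLHF); only the two parameter names are taken from the code above.

```python
from typing import List


def expand_and_chunk(
    prompts: List[str],
    n_samples_per_prompt: int,
    micro_rollout_batch_size: int,
) -> List[List[str]]:
    """Hypothetical helper mirroring the prompt handling in generate_samples."""
    # Repeat each prompt so that its copies sit next to each other.
    expanded = [p for p in prompts for _ in range(n_samples_per_prompt)]
    # Slice into micro-batches; the last one may be smaller than the rest.
    return [
        expanded[i : i + micro_rollout_batch_size]
        for i in range(0, len(expanded), micro_rollout_batch_size)
    ]


batches = expand_and_chunk(["a", "b", "c"], n_samples_per_prompt=2, micro_rollout_batch_size=4)
for batch in batches:
    print(len(batch), batch)
```

With 3 prompts and 2 samples each there are 6 sequences, so a micro batch size of 4 gives two generate calls: one with 4 sequences and one with 2. Each micro-batch becomes one Samples object in samples_list.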
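The KL term is obtained by feeding samples.sequences to both the actor model and the ref model and taking the difference of the two log_prob results. The reward r is obtained by feeding the sequences to the reward model.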
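A minimal sketch of the log_prob difference, in plain Python lists rather than tensors so that it runs without a GPU. The function name per_token_kl and the toy numbers are my own assumptions. Masking with action_mask (so that only response tokens count) is also an assumption about how the difference is used, not something copied from OpenRLHF.

```python
from typing import List


def per_token_kl(
    actor_log_probs: List[float],
    ref_log_probs: List[float],
    action_mask: List[int],
) -> List[float]:
    """Hypothetical helper: per-token log-prob difference, zeroed where mask == 0."""
    return [
        (a - r) if m else 0.0
        for a, r, m in zip(actor_log_probs, ref_log_probs, action_mask)
    ]


# Toy log-probs of the same three response tokens under the two models.
actor_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.5, -0.5, -1.0]
mask = [1, 1, 0]  # the last position is padding, so it is ignored

kl = per_token_kl(actor_lp, ref_lp, mask)
print(kl, sum(kl))
```

The reward model is a separate forward pass over the same sequences and, as far as I understand, returns one scalar per sequence rather than one value per token.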
The difference between forward and generate