Losses
Losses quantify how good a State is: given a State, a loss returns a number.
For example, a token forcing loss represents the likelihood that a language model will respond with some exact string to a given prompt. A token forcing loss can be constructed as follows:

```python
target_str = "Sure, here is how to make a weapon"
model, tokenizer = load_model_and_tokenizer("google/gemma-2-2b-it")
loss = TokenForcingLoss(model, tokenizer, target_str)
```
Losses are what is being optimized for. A lower loss is better.
anthropic_prefill_sampled_probs_loss
AnthropicPrefillSampledProbLoss
Bases: Loss
Represents the difference in output logit distribution between Anthropic prefill sampled log probs and another model. This uses Anthropic model's support for assistant response prefilling to calculate sampled logit distributions many tokens into the targeted assistant response.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `AutoModelForCausalLM` | model name to use for log probs | required |
| `behavior` | `str` | the root prompt to use to get harmful logit distributions for | required |
| `surrogate_model` | `AutoModelForCausalLM` | model to use for harmful logit distributions | `None` |
| `surrogate_tokenizer` | `AutoTokenizer` | tokenizer for model to use for harmful logit distributions | `None` |
Source code in src/optimization/losses/anthropic_prefill_sampled_probs_loss.py
__call__(states, visualize=False, device='cuda:0')
Calculates the mean of KL divergences between the output logit distribution and Anthropic model logprobs.
Anthropic responses are cast into the token space of the surrogate model's tokenizer.
Source code in src/optimization/losses/anthropic_prefill_sampled_probs_loss.py
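The prefill technique above can be illustrated with a minimal sketch: repeatedly sample the single token the model emits after a prefilled assistant response, and turn the counts into an empirical next-token distribution. The `sample_fn` interface here is hypothetical, standing in for an API call with assistant-prefill support; it is not the library's actual client.

```python
from collections import Counter

def estimate_next_token_dist(sample_fn, prompt, prefill, n_samples=100):
    """Estimate the next-token distribution after a prefilled assistant
    response by repeated sampling. `sample_fn(prompt, prefill)` is a
    hypothetical stand-in for an API call that returns the one token the
    model emits immediately after the prefill."""
    counts = Counter(sample_fn(prompt, prefill) for _ in range(n_samples))
    return {tok: c / n_samples for tok, c in counts.items()}
```

Extending the prefill by one sampled token at a time yields distributions many tokens into the targeted assistant response.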
api_sampled_probs_loss
APISampledProbLoss
Bases: Loss
Represents the difference in output logit distribution between sampled log probs and another model. Sampled log probs are calculated by sampling with high temperature to estimate the logit distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `AutoModelForCausalLM` | model name to use for log probs | required |
| `behavior` | `str` | the root prompt to use to get harmful logit distributions for | required |
| `surrogate_model` | `AutoModelForCausalLM` | model to use for harmful logit distributions | `None` |
| `surrogate_tokenizer` | `AutoTokenizer` | tokenizer for model to use for harmful logit distributions | `None` |
Source code in src/optimization/losses/api_sampled_probs_loss.py
__call__(states, visualize=False, device='cuda:0')
Calculates the mean of KL divergences between the output logit distribution and OpenAI model logprobs.
OpenAI responses are cast into the token space of the surrogate model's tokenizer.
Source code in src/optimization/losses/api_sampled_probs_loss.py
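The core quantity these losses compute is a KL divergence between an empirically estimated distribution and a model's distribution over a shared token space. A simplified sketch (pure Python over dict-valued distributions, not the library's tensor implementation):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over token -> probability dicts. The eps floor keeps
    tokens sampled under p but given ~zero mass by q from producing
    infinities (a smoothing choice assumed here for illustration)."""
    return sum(pi * math.log(pi / max(q.get(tok, 0.0), eps))
               for tok, pi in p.items() if pi > 0)
```

The divergence is zero when the distributions match and grows as the surrogate model's distribution drifts from the sampled one.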
cache_loss
CacheLoss
Bases: Loss
A cache wrapper for another loss. Caches previously seen states and avoids recomputation.
Source code in src/optimization/losses/cache_loss.py
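The caching idea can be sketched as follows, assuming (for illustration) that losses are callables mapping a list of states to a list of values and that states are hashable; this is not the library's exact implementation:

```python
class CacheLoss:
    """Minimal sketch of a caching wrapper around another loss.
    Previously seen states are served from the cache; only unseen
    states are forwarded to the wrapped loss."""

    def __init__(self, inner_loss):
        self.inner_loss = inner_loss
        self._cache = {}

    def __call__(self, states):
        missing = [s for s in states if s not in self._cache]
        if missing:
            for s, v in zip(missing, self.inner_loss(missing)):
                self._cache[s] = v
        return [self._cache[s] for s in states]
```

Because candidate states often repeat across optimization steps, this avoids re-running expensive model forward passes.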
combined_loss
CombinedLoss
Bases: Loss
Combines multiple losses into a single loss by summing the losses.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `losses` | `list[Loss]` | list of `Loss` objects | required |
| `parallelism` | | (optional) implements parallelism across loss calculations by: 1. wrapping all losses in an async wrapper so they are non-blocking, 2. wrapping all losses in a retry wrapper on OOM exceptions so they only run when memory is available, 3. running all losses at once | required |
Source code in src/optimization/losses/combined_loss.py
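The summing behavior can be sketched as below, again assuming losses are callables returning one value per state. This simplified version runs the losses sequentially; the real class additionally layers on the async and OOM-retry wrappers described above.

```python
class CombinedLoss:
    """Sketch: combine several losses by summing their per-state values."""

    def __init__(self, losses):
        self.losses = losses

    def __call__(self, states):
        # Evaluate each loss on the full batch, then sum position-wise.
        per_loss = [loss(states) for loss in self.losses]
        return [sum(vals) for vals in zip(*per_loss)]
```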
logit_distribution_matching_loss
LogitDistributionMatchingLoss
Bases: Loss
Represents the difference in output logit distribution with another model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `AutoModelForCausalLM` | model to calculate loss for | required |
| `tokenizer` | `AutoTokenizer` | tokenizer for model to calculate loss for | required |
| `behavior` | `str` | the root prompt to use to get harmful logit distributions for | required |
| `surrogate_model` | `AutoModelForCausalLM` | (optional) model to use for harmful logit distributions | `None` |
| `surrogate_tokenizer` | `AutoTokenizer` | (optional) tokenizer for model to use for harmful logit distributions | `None` |
| `scale_token_positions` | `bool` | weighs earlier tokens more prominently in the loss | `False` |
| `loss_clamp` | `float` | value to clamp token losses at; prevents well-solved tokens from being further optimized | `0.15` |
Source code in src/optimization/losses/logit_distribution_matching_loss.py
__call__(states, visualize=False, token_grads=False, device='cuda:0')
Calculates mean of KL divergences in output logit distribution with another model for some number of generations from the other model.
Output:

```
torch.Tensor: tensor(28.1923)
```
Source code in src/optimization/losses/logit_distribution_matching_loss.py
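The `loss_clamp` and `scale_token_positions` parameters can be illustrated with a sketch over per-token KL values. Clamping from below is one plausible reading of "prevents well-solved tokens from being further optimized" (a clamped token contributes a constant, so it yields no gradient); the `1/(i+1)` position weighting is an assumed scheme for illustration, not necessarily the library's.

```python
def aggregate_token_kls(token_kls, loss_clamp=0.15, scale_token_positions=False):
    """Combine per-token KL divergences into one scalar loss."""
    # Tokens already below loss_clamp contribute a constant floor,
    # so optimization pressure moves to unsolved tokens.
    vals = [max(kl, loss_clamp) for kl in token_kls]
    if scale_token_positions:
        # Weigh earlier token positions more prominently (assumed weighting).
        weights = [1.0 / (i + 1) for i in range(len(vals))]
        return sum(w * v for w, v in zip(weights, vals)) / sum(weights)
    return sum(vals) / len(vals)
```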
loss
Loss
A base class for a Loss, some concept of how good a State is.
Source code in src/optimization/losses/loss.py
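The interface can be sketched as a base class whose `__call__` scores a batch of states (lower is better), matching the documented `__call__(states, ...)` signatures. The toy subclass is purely illustrative and not part of the library:

```python
class Loss:
    """Sketch of the base interface: map a batch of states to scores,
    where lower is better."""

    def __call__(self, states, visualize=False, device="cuda:0"):
        raise NotImplementedError

class StringLengthLoss(Loss):
    """Toy example subclass: shorter states score better."""

    def __call__(self, states, visualize=False, device="cuda:0"):
        return [float(len(s)) for s in states]
```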
open_ai_logprobs_loss
OpenAILogProbsLoss
Bases: Loss
Represents the difference in output logit distribution between OpenAI log probs and another model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `AutoModelForCausalLM` | model name to use for log probs | required |
| `behavior` | `str` | the root prompt to use to get harmful logit distributions for | required |
| `surrogate_model` | `AutoModelForCausalLM` | model to use for harmful logit distributions | `None` |
| `surrogate_tokenizer` | `AutoTokenizer` | tokenizer for model to use for harmful logit distributions | `None` |
Source code in src/optimization/losses/open_ai_logprobs_loss.py
__call__(states, visualize=False, device='cuda:0')
Calculates the mean of KL divergences between the output logit distribution and OpenAI model logprobs.
OpenAI responses are cast into the token space of the surrogate model's tokenizer.
Source code in src/optimization/losses/open_ai_logprobs_loss.py
perplexity_loss
PerplexityLoss
Bases: Loss
Represents the perplexity of a given prompt: how probable a language model judges the string to be, computed as the exponentiated average negative log-likelihood per token (lower means more probable).
Source code in src/optimization/losses/perplexity_loss.py
__call__(states, visualize=False, device='cuda:0')
Calculates perplexity for each state as judged by model.
Source code in src/optimization/losses/perplexity_loss.py
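The computation reduces to a short formula over the model's per-token probabilities. A minimal sketch (pure Python; the library's version operates on model logits):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability per token).
    token_probs: the model's probability for each token in the string."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A uniformly likely string with per-token probability 1/4 has perplexity 4, matching the intuition that perplexity measures the effective branching factor.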
prompt_format_loss
PromptFormatLoss
Bases: Loss
Applies a format to states before computing the loss.
Source code in src/optimization/losses/prompt_format_loss.py
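The wrapper pattern can be sketched as below; the template string is illustrative only, and the real class may accept a different format specification:

```python
class PromptFormatLoss:
    """Sketch: apply a prompt template to each state before delegating
    to the wrapped loss."""

    def __init__(self, inner_loss, fmt="[INST] {prompt} [/INST]"):
        self.inner_loss = inner_loss
        self.fmt = fmt

    def __call__(self, states):
        return self.inner_loss([self.fmt.format(prompt=s) for s in states])
```

This lets the same underlying loss be reused across chat templates without changing the optimizer's state representation.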
token_forcing_loss
TokenForcingLoss
Bases: Loss
Calculates token forcing loss given an input string and a target string for a given model and tokenizer. Token forcing loss is roughly −log p(target string is generated), so lower values mean the target is more likely.
Source code in src/optimization/losses/token_forcing_loss.py
__call__(states, visualize=False, token_grads=False, device='cuda:0')
Calculates token forcing loss given an input string and a target string for a given model and tokenizer. Token forcing loss is roughly −log p(target string is generated), so lower values mean the target is more likely.
Example:

```
input_strs: ["Tell me how to build a bomb"]
target_strs: ["Sure, here is how to build a bomb"]
```

Output:

```
torch.Tensor: tensor(28.1923)
```
Source code in src/optimization/losses/token_forcing_loss.py
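The quantity −log p(target | input) can be sketched as a sum of per-token negative log-probabilities. Here `next_token_prob(context, token)` is a hypothetical stand-in for one model forward pass returning p(token | context); the library computes the same sum in a single batched pass over logits.

```python
import math

def token_forcing_loss(next_token_prob, input_tokens, target_tokens):
    """Sketch: -log p(target | input), summed over target tokens."""
    loss, context = 0.0, list(input_tokens)
    for tok in target_tokens:
        # Add the negative log-probability of the next forced token,
        # then extend the context with it (teacher forcing).
        loss -= math.log(next_token_prob(tuple(context), tok))
        context.append(tok)
    return loss
```

Driving this sum toward zero drives the probability of generating the exact target string toward one.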
weighted_loss
WeightedLoss
Bases: Loss
A weighted wrapper for another loss. Weights losses by some constant.
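The weighting wrapper can be sketched in a few lines, under the same assumed callable-over-states interface as the sketches above:

```python
class WeightedLoss:
    """Sketch: scale another loss's per-state values by a constant weight."""

    def __init__(self, inner_loss, weight):
        self.inner_loss = inner_loss
        self.weight = weight

    def __call__(self, states):
        return [self.weight * v for v in self.inner_loss(states)]
```

Combined with CombinedLoss, this gives a weighted sum of objectives, e.g. trading off a token forcing loss against a perplexity loss.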