RQVAE Training¶
This guide provides detailed instructions on how to train the RQVAE model.
Training Preparation¶
1. Data Preparation¶
Ensure the dataset is downloaded and placed in the correct location:
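With the default P5 Amazon configuration, the item data is read from dataset/amazon (the same root used in the evaluation example later in this guide). A quick sanity check, assuming that default layout:

ls dataset/amazon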
2. Check Configuration File¶
View the default configuration:
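For the P5 Amazon setup used throughout this guide, the default configuration file is config/rqvae/p5_amazon.gin:

cat config/rqvae/p5_amazon.gin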
Key configuration parameters:
# Training parameters
train.iterations=400000 # Number of training iterations
train.learning_rate=0.0005 # Learning rate
train.batch_size=64 # Batch size
train.weight_decay=0.01 # Weight decay
# Model parameters
train.vae_input_dim=768 # Input dimension
train.vae_embed_dim=32 # Embedding dimension
train.vae_hidden_dims=[512, 256, 128] # Hidden layer dimensions
train.vae_codebook_size=256 # Codebook size
train.vae_n_layers=3 # Number of quantization layers
# Quantization settings
train.vae_codebook_mode=%genrec.models.rqvae.QuantizeForwardMode.ROTATION_TRICK
train.commitment_weight=0.25 # Commitment loss weight
Start Training¶
Basic Training Command¶
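Run the trainer script directly with a gin configuration; this is the same single-process invocation used in the experiment-management example later in this guide:

python genrec/trainers/rqvae_trainer.py config/rqvae/p5_amazon.gin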
Monitoring with Weights & Biases¶
If Weights & Biases logging is enabled, training metrics are reported to your wandb project:
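Logging is controlled by the same gin parameters shown in the custom-configuration example below; the project name here is only a placeholder:

train.wandb_logging=True
train.wandb_project="my_rqvae_experiment"  # placeholder project name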
GPU Training¶
For multi-GPU training, use Hugging Face Accelerate:
accelerate config # Configure on first run
accelerate launch genrec/trainers/rqvae_trainer.py config/rqvae/p5_amazon.gin
Custom Configuration¶
Creating Custom Configuration File¶
# my_rqvae_config.gin
import genrec.data.p5_amazon
import genrec.models.rqvae
# Custom training parameters
train.iterations=200000
train.batch_size=32
train.learning_rate=0.001
# Custom model architecture
train.vae_embed_dim=64
train.vae_hidden_dims=[512, 256, 128, 64]
train.vae_codebook_size=512
# Data paths
train.dataset_folder="path/to/my/dataset"
train.save_dir_root="path/to/my/output"
# Experiment tracking
train.wandb_logging=True
train.wandb_project="custom_rqvae_experiment"
Run training with the custom configuration file (same invocation pattern as the default config):
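python genrec/trainers/rqvae_trainer.py my_rqvae_config.gin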
Training Monitoring¶
Key Metrics¶
Monitor these metrics during training:
- Total Loss: Overall training loss
- Reconstruction Loss: Reconstruction quality
- Quantization Loss: Quantization effectiveness
- Commitment Loss: Encoder commitment
Sample Log Output¶
Iteration 1000: Loss=2.3456, Recon=2.1234, Quant=0.1234, Commit=0.0988
Iteration 2000: Loss=1.9876, Recon=1.8234, Quant=0.0987, Commit=0.0655
...
Model Evaluation¶
Reconstruction Quality Assessment¶
import torch

from genrec.models.rqvae import RqVae
from genrec.data.p5_amazon import P5AmazonItemDataset

# Load trained model
model = RqVae.load_from_checkpoint("out/rqvae/checkpoint_299999.pt")

# Evaluation dataset
eval_dataset = P5AmazonItemDataset(
    root="dataset/amazon",
    train_test_split="eval"
)

# Calculate reconstruction loss
model.eval()
with torch.no_grad():
    eval_loss = model.evaluate(eval_dataset)
print(f"Evaluation loss: {eval_loss:.4f}")
Codebook Utilization Analysis¶
import torch

def analyze_codebook_usage(model, dataloader):
    """Count how many distinct codebook entries the model actually emits."""
    used_codes = set()
    with torch.no_grad():
        for batch in dataloader:
            outputs = model(batch)
            semantic_ids = outputs.sem_ids
            used_codes.update(semantic_ids.flatten().tolist())
    usage_rate = len(used_codes) / model.codebook_size
    print(f"Codebook usage: {usage_rate:.2%}")
    print(f"Used codes: {len(used_codes)}/{model.codebook_size}")
    return used_codes
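For example, reusing the evaluation dataset from the previous section. This assumes the dataset yields batches in the format the model's forward pass expects; the batch size is arbitrary:

from torch.utils.data import DataLoader

# Any batch size that fits in memory works here.
eval_loader = DataLoader(eval_dataset, batch_size=64)
analyze_codebook_usage(model, eval_loader)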
Troubleshooting¶
Common Issues¶
Q: Training loss doesn't converge?
A: Try these solutions:
- Lower learning rate: train.learning_rate=0.0001
- Adjust commitment weight: train.commitment_weight=0.1
- Check if data preprocessing is correct
Q: Codebook collapse (all samples use the same code)?
A:
- Use ROTATION_TRICK mode
- Increase the commitment weight
- Reduce the learning rate
Q: GPU out of memory?
A:
- Reduce batch size: train.batch_size=32
- Reduce model size: train.vae_hidden_dims=[256, 128]
- Enable mixed precision training
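For the mixed-precision option, Accelerate's launcher exposes a --mixed_precision flag (check that it matches your installed Accelerate version):

accelerate launch --mixed_precision=fp16 genrec/trainers/rqvae_trainer.py config/rqvae/p5_amazon.gin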
Debugging Tips¶
- Gradient checking: after each backward pass, verify that gradients are finite and neither vanishing nor exploding (see the sketch below).
- Loss analysis: log the reconstruction, quantization, and commitment terms separately to see which one dominates the total loss.
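A minimal gradient-checking sketch in plain PyTorch. The idea is to call it right after loss.backward(); where exactly it hooks into this repo's training loop is up to you:

import torch

def check_gradients(model: torch.nn.Module) -> None:
    # Inspect every parameter's gradient after loss.backward().
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: no gradient")
        elif not torch.isfinite(param.grad).all():
            print(f"{name}: non-finite gradient")
        else:
            norm = param.grad.norm().item()
            # Thresholds are heuristic; tune them for your model scale.
            if norm < 1e-8 or norm > 1e3:
                print(f"{name}: suspicious gradient norm {norm:.3e}")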
Best Practices¶
Hyperparameter Tuning Recommendations¶
- Learning rate scheduling: decay the learning rate over training rather than keeping it fixed; a sketch follows this list.
- Early stopping strategy: monitor the evaluation loss and stop once it plateaus instead of always running the full iteration budget.
- Model saving frequency: checkpoint at a regular interval so a crash or a late-training regression does not cost you the best model.
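As a concrete sketch of learning rate scheduling, a cosine decay with PyTorch's built-in scheduler. The optimizer setup below is illustrative (matching the default gin values), not the trainer's actual code:

import torch

model = torch.nn.Linear(768, 32)  # stand-in for the RQ-VAE
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400_000)

# In the training loop, step the scheduler after each optimizer update:
# optimizer.step()
# scheduler.step()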
Experiment Management¶
Use version control together with experiment tracking to keep runs reproducible:
# Create experiment branch
git checkout -b experiment/rqvae-large-codebook
# Modify configuration
vim config/rqvae/large_codebook.gin
# Run experiment
python genrec/trainers/rqvae_trainer.py config/rqvae/large_codebook.gin
# Record results
git add .
git commit -m "Experiment: large codebook (size=1024)"
Next Steps¶
After training completes, you can:
- Use the trained RQVAE for TIGER training
- Analyze model performance
- Try different datasets