The Principle of LoRA
What is LoRA used for?
LoRA allows fine-tuning of image generation while keeping the weights of the Checkpoint unchanged. Only the LoRA needs to be adjusted to generate specific images, without modifying the entire Checkpoint. For subjects the AI has never encountered before, a LoRA supplies the missing knowledge through fine-tuning. This gives AI art a certain degree of "controllability."
Currently, trained LoRAs are all "refinements" built on top of officially trained base models (SD1.5, SDXL). Of course, refinements can also be built on models created by others.
LoRA Training: the AI first generates images from the prompts, then compares them with the images in the training dataset. Guided by the measured differences, the AI continuously fine-tunes the LoRA's weights so that the generated results gradually approach the dataset. Eventually the fine-tuned model produces results closely matching the dataset, forming an association between the images the AI generates and the dataset.
*Compared to the Checkpoint, LoRA has a smaller file size, which saves time and resources. Moreover, it can adjust weights on top of the Checkpoint, achieving different effects.
LoRA Training Process
Five steps: Prepare dataset - Image preprocessing - Set parameters - Monitor Lora training process - Training completion
*Taking the training of a facial Lora with SeaArt as an example.
Prepare the dataset
When uploading the dataset, it's essential to maintain the principle of "diversified samples." This means the dataset should include images from different angles, poses, lighting conditions, etc., and ensure that the images are of high resolution. This step is primarily aimed at helping AI understand the images.
Image preprocessing
I. Cropping images
II. Tagging
III. Trigger words
I. Cropping images
To enable the AI to better discern objects through images, it's generally best to maintain consistent image dimensions. You can choose from 512*512 (1:1), 512*768 (2:3), or 768*512 (3:2) based on the desired output.
Crop Mode: Center Crop / Focus Crop / No Crop
Center Crop: Crops the central region of the image.
Focus Crop: Automatically identifies the main subject of the image.
*Compared to center cropping, focus cropping is more likely to preserve the main subject of the dataset, so it is generally recommended to use focus crop.
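As a concrete sketch, the helper below (a hypothetical illustration, not SeaArt's actual implementation) computes the region that center cropping would keep for a given target aspect ratio:

```python
def center_crop_box(src_w, src_h, dst_w, dst_h):
    """Return (left, top, right, bottom) of the largest centered region
    matching the target aspect ratio dst_w:dst_h."""
    target_ratio = dst_w / dst_h
    if src_w / src_h > target_ratio:
        # Source is wider than the target: trim left and right.
        crop_w = round(src_h * target_ratio)
        left = (src_w - crop_w) // 2
        return (left, 0, left + crop_w, src_h)
    else:
        # Source is taller than the target: trim top and bottom.
        crop_h = round(src_w / target_ratio)
        top = (src_h - crop_h) // 2
        return (0, top, src_w, top + crop_h)

# A 1024x768 photo cropped for a 512*512 (1:1) dataset image:
print(center_crop_box(1024, 768, 512, 512))  # (128, 0, 896, 768)
```

Focus crop would instead center this box on the detected subject, which is why it preserves the main subject more reliably.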
II. Tagging
Tagging provides textual descriptions for the images in the dataset, allowing the AI to link the text to the image features it learns.
Tagging Algorithm: BLIP/Deepbooru
● BLIP: Natural language tagger, for example, "a girl with black hair."
● Deepbooru: Phrase-style tagger, for example, "a girl, black hair."
● Tagging Threshold: The smaller the value, the finer the description, recommended to be 0.6.
Tagging process: Remove tags for fixed features (such as inherent physical traits...) so the AI learns those features into the LoRA itself. Conversely, keep or add tags for features you may want to adjust later (clothing, accessories, actions, background...).
*For example, if you want all the generated images to have black hair and black eyes, you can delete these two tags.
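The pruning step in that example can be sketched as follows; the `prune_tags` helper and the sample caption are hypothetical, showing how deleting "black hair" and "black eyes" bakes those traits into the LoRA:

```python
# Tags for fixed features we want the LoRA itself to absorb.
FIXED_FEATURES = {"black hair", "black eyes"}

def prune_tags(caption, drop=FIXED_FEATURES):
    """Remove fixed-feature tags from a Deepbooru-style caption."""
    tags = [t.strip() for t in caption.split(",")]
    return ", ".join(t for t in tags if t not in drop)

print(prune_tags("1girl, black hair, black eyes, white dress, smile"))
# 1girl, white dress, smile
```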
III. Trigger words
Words that trigger the activation of Lora, effectively consolidating the character features into a single word.
Parameter Settings
Base Model: It is recommended to choose a high-quality, stable base model that closely matches the style of Lora, as this makes it easier for AI to match features and record differences.
Recommended Base Models:
Realistic: SD1.5, ChilloutMix, MajicMIX Realistic, Realistic Vision
Anime: AnyLoRA, Anything | 万象熔炉, ReV Animated
Advanced Config
Training Parameters:
Repeat (Single Image Repetitions): The number of times a single image is learned. The more repetitions, the better the learning effect, but excessive repetitions may lead to image rigidity. Suggestion: Anime: 8; Realistic: 15.
Epoch (Cycles): One epoch equals the number of images in the dataset multiplied by Repeat, i.e., the number of steps the model trains on the training set per cycle. For example, if there are 20 images in the training set and Repeat is set to 10, the model learns 20 * 10 = 200 steps per epoch. If Epoch is set to 10, the LoRA training totals 2000 steps. Suggestion: Anime: 20; Realistic: 10.
Batch size: It refers to the number of images the AI learns simultaneously. For example, when set to 2, the AI learns 2 images at a time, which shortens the overall training duration. However, learning multiple images simultaneously may lead to a relative decrease in the precision for each image.
Mixed precision: fp16 is recommended.
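The step arithmetic implied by Repeat, Epoch, and Batch size can be written out as a small helper (an illustration, not SeaArt's code):

```python
def total_training_steps(num_images, repeat, epochs, batch_size=1):
    """Steps per epoch = images * repeat / batch_size (rounded up);
    total steps = steps per epoch * epochs."""
    steps_per_epoch = -(-(num_images * repeat) // batch_size)  # ceil division
    return steps_per_epoch * epochs

# The example from the text: 20 images, Repeat 10, Epoch 10.
print(total_training_steps(20, 10, 10))     # 2000
# A batch size of 2 halves the number of steps (each step covers 2 images):
print(total_training_steps(20, 10, 10, 2))  # 1000
```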
Sample Settings:
Resolution: Determines the size of the preview image for the final model effect.
SD1.5: 512*512
Seed: Controls the random generation of images. Using the same seed with the same prompts will likely generate the same or similar images.
Sampler / Prompts / Negative Prompts: Mainly control the preview images showing the effect of the final model.
Save Settings:
Determines the final number of LoRAs. If set to save every 2 epochs and Epoch is 10, then 5 LoRAs will be saved in the end.
Save precision: Recommended fp16.
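The save-count rule above can be sketched in one line; `save_every_n_epochs` is a hypothetical name for the "save every N epochs" setting:

```python
def saved_lora_count(epochs, save_every_n_epochs):
    """Number of LoRA files written when saving every N epochs."""
    return epochs // save_every_n_epochs

print(saved_lora_count(10, 2))  # 5, matching the example above
```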
Learning Rate & Optimizer:
Learning Rate: It denotes the intensity with which the AI learns the dataset. A higher learning rate lets the AI learn more aggressively, but may produce output images that no longer resemble the dataset. As the dataset grows, try reducing the learning rate. Start with the default value, adjust based on training results, and increase gradually from a low value; 0.0001 is recommended.
unet lr: When the unet lr is set, the Learning Rate will not take effect. Recommended at 0.0001.
text encoder lr: It determines the sensitivity to tags. Usually, the text encoder lr is set to 1/2 or 1/10 of the unet lr.
Lr scheduler: It primarily governs the decay of the learning rate. Different schedulers have minimal impact on the final results. Generally the default "cosine" scheduler is used, but an upgraded version, "cosine with restarts," is also available. It decays and restarts multiple times to learn the dataset more fully, helping training avoid getting stuck in "local optima." If using "cosine with restarts," set the restart count to 3-5.
Optimizer: It determines how AI grasps the learning process during training, directly impacting the learning results. It's recommended to use AdamW8bit.
Lion: A newly introduced optimizer, typically with a learning rate about 10 times smaller than AdamW.
Prodigy: If all learning rates are set to 1, Prodigy will automatically adjust the learning rate to achieve the best results, suitable for beginners.
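A minimal sketch of the "cosine with restarts" decay described above, assuming equal-length cycles (real trainers implement the schedule with more options, such as warmup):

```python
import math

def cosine_with_restarts_lr(step, total_steps, base_lr, restarts=3, min_lr=0.0):
    """Learning rate at `step` for a cosine schedule that decays toward
    min_lr and restarts `restarts` times over the whole run."""
    cycle_len = total_steps / restarts
    progress = (step % cycle_len) / cycle_len  # 0.0 .. 1.0 within a cycle
    return min_lr + (base_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# With base_lr=1e-4 and 3 restarts over 300 steps, the LR decays within each
# 100-step cycle and jumps back to 1e-4 at steps 100 and 200.
print(cosine_with_restarts_lr(0, 300, 1e-4))    # 1e-4 at the start of a cycle
print(cosine_with_restarts_lr(50, 300, 1e-4))   # 5e-5 at mid-cycle
print(cosine_with_restarts_lr(100, 300, 1e-4))  # 1e-4 again after the restart
```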
Network:
Used to build a suitable Lora model base for AI input data.
Network Rank Dim: It directly affects the size of Lora. The larger the Rank, the more data needs to be fine-tuned during training. 128=140MB+; 64=70MB+; 32=40MB+.
Recommended:
Realistic: 64/128
Anime: 8/16/32
Setting the value too high will make the AI learn too deeply, capturing many irrelevant details, similar to "overfitting."
Network Alpha: It scales the strength of the LoRA update applied to the original model weights; the effective multiplier is Alpha / Rank. When Alpha is close to Rank, the update is applied at nearly full strength; the smaller Alpha is relative to Rank, the more the update is damped, which acts as a form of regularization. Alpha generally does not exceed Rank and is typically set to half of Rank; setting it to 1 damps the update the most.
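In the standard LoRA formulation the update is scaled by Alpha / Rank (W' = W + (Alpha / Rank) * B A); a one-line sketch of that multiplier:

```python
def lora_scale(alpha, rank):
    """Effective multiplier applied to the low-rank update B @ A:
    W' = W + (alpha / rank) * (B @ A)."""
    return alpha / rank

print(lora_scale(64, 128))  # 0.5: Alpha at half of Rank
print(lora_scale(1, 128))   # ~0.008: Alpha=1 damps the update strongly
```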
Tagging Settings:
In general, the closer a tag is to the front of the caption, the greater its weight. Therefore, it is usually recommended to enable Shuffle Caption.
LoRA Training Issues
Overfitting / Underfitting
Overfitting: When the dataset is limited or the AI matches the dataset too precisely, the LoRA generates images that largely replicate the dataset, resulting in poor generalization ability.
The image on the top right closely resembles the dataset on the left, both in appearance and posture.
Reasons for Overfitting:
1. The dataset is lacking.
2. Incorrect parameter settings (tags, learning rate, steps, optimizer, etc.).
Preventing Overfitting:
1. Decrease learning rate appropriately.
2. Shorten the Epoch.
3. Reduce Rank and increase Alpha.
4. Decrease Repeat.
5. Utilize regularization training.
6. Increase dataset.
Underfitting: The model fails to adequately learn the features of the dataset during training, resulting in generated images that do not match the dataset well.
You can see that Lora's generated images fail to adequately preserve the features of the dataset — they are dissimilar.
Reasons for Underfitting:
1. Low model complexity
2. Insufficient feature quantity
Preventing Underfitting:
1. Increase learning rate appropriately
2. Increase Epoch
3. Raise Rank, reduce Alpha
4. Increase Repeat
5. Reduce regularization constraints
6. Add more features to the dataset (high quality)
Regular Dataset
A way to avoid overfitting by adding additional images that enhance the model's generalization ability. The regular dataset should not be too large; otherwise, the AI will over-learn from it and drift away from the original target. 10-20 images are recommended.
For example, in a portrait dataset where most images feature long hair, you can add images with short hair to the regular dataset. Similarly, if the dataset consists entirely of images in the same artistic style, you can add images in different styles to the regular dataset to diversify the model. The regular dataset does not need to be tagged.
*In layman's terms, training Lora in this way is somewhat like a combination of the dataset and a regular dataset.
Loss
Loss is the deviation between what the AI has learned and the training data; it guides the direction of the AI's learning. When the loss is low, this deviation is small, and the AI's learning is at its most accurate. As long as the loss decreases gradually, there are usually no major issues.
The loss value for Realistic images generally ranges from 0.1 to 0.12, while for anime, it can be lowered appropriately.
Use the loss value to assess model training issues.
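As a toy illustration of loss-guided learning (a single scalar parameter fitted by gradient descent, not actual Stable Diffusion training), a steadily falling loss looks like this:

```python
target = 1.0  # stands in for the dataset
w = 0.0       # stands in for the model weight being tuned
lr = 0.1      # learning rate

losses = []
for step in range(50):
    loss = (w - target) ** 2  # squared error: the "deviation from reality"
    grad = 2 * (w - target)   # gradient of the loss w.r.t. w
    w -= lr * grad            # fine-tune toward the dataset
    losses.append(loss)

# The loss shrinks toward 0 as w approaches the target.
print(losses[0], losses[-1])
```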
Summary
Currently, the "fine-tuning models" can be roughly divided into three types: the Checkpoint output by Dreambooth, the Lora, and the Embeddings output by Textual Inversion. Considering factors such as model size, training duration, and training dataset requirements, Lora offers the best "cost-effectiveness". Whether it's adjusting the art style, characters, or various poses, Lora can perform effectively.