There were two basic kinds of images and captions used in training, as two separate concepts:
141 frontal images, with detailed captions like "a full shot of a naked nude brunette woman with long hair lying on her right side on a white bed, leaning on her right arm, her left arm crossed in front, both legs bent at the knee, left leg and knee vertical, pubic hair, looking at the camera, white background, sidelyerfront"
128 back images, with detailed captions like "a medium full shot of a nude naked blond woman lying on her right side on a wooden dock by a lake with a reflection of trees in the water behind her, legs straight, left leg in front of her, back side, butt, ass, pussy, wavy hair, head raised, facing away from camera, sidelyerback, soft diffuse lighting"
Training Parameters
Dreambooth training (Shivam)
269 images (141 front, 128 back), manually edited for clarity
512x512 resolution
BLIP initialized captions, then manually edited to add detail
DDIM scheduler
f222 base model
fp16 precision
1,000 class/regularization images, 512x512 (generated with "a color photo of a nude naked woman")
28,545 steps (training text encoder only at start and finish, ~5%)
LR 5e-06
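The step count above is easier to judge against the dataset size. The following is a rough sketch in Python that derives two numbers from the listed parameters: how many times each training image was seen on average, and roughly how many steps the "~5%" of text encoder training covers. The batch size is not stated in the parameters, so `ASSUMED_BATCH_SIZE = 1` is an assumption, and the function name `training_summary` is made up for illustration.

```python
# Figures copied from the parameter list above.
FRONT_IMAGES = 141
BACK_IMAGES = 128
NUM_IMAGES = 269
TOTAL_STEPS = 28_545
TEXT_ENCODER_FRACTION = 0.05  # listed as "~5%", so this is approximate

# Assumption (not stated in the source): batch size 1, no gradient accumulation.
ASSUMED_BATCH_SIZE = 1


def training_summary(num_images, total_steps, text_encoder_fraction, batch_size):
    """Derive rough per-image and text-encoder step counts."""
    samples_seen = total_steps * batch_size
    return {
        # Average number of times each training image was shown to the model.
        "steps_per_image": samples_seen / num_images,
        # Approximate number of steps during which the text encoder was trained.
        "text_encoder_steps": round(total_steps * text_encoder_fraction),
    }


# Sanity check: the two concepts add up to the full dataset.
assert FRONT_IMAGES + BACK_IMAGES == NUM_IMAGES

summary = training_summary(
    NUM_IMAGES, TOTAL_STEPS, TEXT_ENCODER_FRACTION, ASSUMED_BATCH_SIZE
)
print(f"steps per image: {summary['steps_per_image']:.1f}")
print(f"text encoder steps: {summary['text_encoder_steps']}")
```

Under these assumptions each image was seen about 106 times, and the text encoder was trained for roughly 1,427 steps. How those steps were split between the start and the finish of training is not specified.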
Limitations
No model is perfect, and this one has known weaknesses. Feet are still often mangled, and bodies sometimes are too. Tattoos in the dataset look like burn marks. Repetition sometimes appears in the upper and lower regions of the image, because some of the dataset images were expanded vertically to fit the 512x512 resolution. Training used only images of women, no men, and could be extended to more ethnicities. Full precision could be used instead of fp16. A lower learning rate would likely produce a higher-quality model. These are some of the areas for improvement in future versions of the model.
This is the first version of the model.
hash: fd907af7
md5: fbd0d2a711685bf8b9dd06a4b8472976
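The md5 value listed above can be used to check that a downloaded checkpoint is intact. Here is a minimal sketch using only Python's standard `hashlib` module. The helper names `file_md5` and `matches_published_md5` and the file name in the usage note are hypothetical. The shorter `hash` value is not checked here, because the source does not say how it is computed.

```python
import hashlib

# md5 value copied from the listing above.
EXPECTED_MD5 = "fbd0d2a711685bf8b9dd06a4b8472976"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest equals the published value."""
    return file_md5(path) == expected.lower()
```

For example, `matches_published_md5("model.ckpt")` would return `True` only if the file at that (hypothetical) path has the same md5 digest as the one listed. A mismatch usually means an incomplete download or a different version of the file.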