Allow for >1 batch size in Splatfacto #3582
Conversation
Hey Alex! This is super cool, especially in MCMC, which doesn't require gradient thresholds at all. #3216 might have mildly broken parts of this PR since it merged in parallel dataloading, but it shouldn't be too bad; let us know if you want any help fixing conflicts!
@akristoffersen I think you might want to modify
Force-pushed from 2d95d1e to cbb5ceb
Works with masks now. As expected, I noticed an almost 2x increase in rays/s with a batch size of two, and a very slight performance drop with a batch size of 1 compared to baseline (50.1 M rays/sec -> 48 M rays/sec).
@hardikdava do you mean that the tuning might be different for the thresholds? Yeah, I don't know exactly what to do there; maybe someone else has an opinion?

Some quick stats on the poster dataset: the splitting / densification outcomes are affected by batch size. Similarly, train rays/sec starts higher due to the larger batch size, but goes down as you'd expect with the higher number of gaussians. Some good news: with a higher batch I do see the training loss hitting better values quicker as the batch size increases.
@akristoffersen currently, densification, splitting, and culling are implemented inside the strategy, and the logic is based on the step count. In simple words, suppose the batch size is 2 and opacity reset needs to be applied at every 3000th step. According to the batch size, it should then happen at every 1500th step. But according to your current implementation it will be applied at every 3000th step, which is actually the 6000th step (batch size * step).
@hardikdava I think dividing those parameters by the batch size assumes that every image produces gradients for a unique set of gaussians. If there's any overlap, then the gaussians seen by 2 images would just be getting a single gradient descent update applied to them (albeit of possibly better quality because of the signal from both images), while if it were a single-image batch those gaussians would have gotten 2 gradient descent updates applied to them. I think that dividing those params by the batch size could still be a good approximation; I'll try it and see how the losses look.
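To make that approximation concrete, here is a minimal sketch of dividing the step-based strategy intervals by the batch size. The parameter names follow gsplat's DefaultStrategy but should be treated as placeholders; this is an illustration, not the PR's implementation.

```python
def scale_strategy_intervals(strategy_kwargs: dict, batch_size: int) -> dict:
    """Divide step-count hyperparameters by the batch size so roughly the same number
    of *images* elapse between refine/reset events. Only an approximation: it assumes
    the images in a batch touch mostly disjoint sets of gaussians."""
    scaled = dict(strategy_kwargs)
    for key in ("refine_every", "reset_every", "refine_start_iter", "refine_stop_iter"):
        if key in scaled:
            scaled[key] = max(1, scaled[key] // batch_size)
    return scaled

# e.g. with batch_size=2, an opacity reset every 3000 steps becomes every 1500 steps,
# i.e. roughly the same ~3000 images seen between resets.
print(scale_strategy_intervals({"reset_every": 3000, "refine_every": 100}, batch_size=2))
```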
I tested these changes on 2 of my datasets with the following commands:
- ns-train
- ns-render
- ns-eval
- ns-export
These all worked well! @jeffreyhuparallel do you have any comments about the hyperparameter strategy stuff?
How would batch size work with a dataset of several thousand images? With opacity reset in mind etc... :) I suspect that scenes with lots of images would inherently benefit from a larger batch size because more images are part of the training per step. How is the memory use? Are we able to train with, say, 20, 50 or 100 images per step? How would that affect training speed and memory use? Is it linear?
All good questions @abrahamezzeddine , unfortunately I haven't had cycles as of late to finish up the implementation and do the necessary benchmarking. @AntonioMacaronio has been helping on that front. On the large dataset question, I think you're right. Something that has always bothered me about gs is that a batch isn't representative of the full objective-- with NeRFs the batch is a random collection of rays from all images so that helps. I do suspect that at the moment, splatfacto is memory bound, so increasing the batch size may not improve things as you'd expect. But I think it should make the scene converge better / faster.
Thanks for the quick response! Just food for thought: once we reach a satisfactory loss at the initial stage, an upsampling session starts and splatfacto progressively reduces the batch size while increasing the image resolution, iteratively refining until we achieve the desired quality at full resolution. Essentially, we start with the maximum batch size on lower-quality images and gradually trade batch size for higher resolution during training (down to the desired batch size). Do you think this would help complex scenes converge initially as we upsample and reduce the batch size?
We sort of already do this-- we initially train on downsampled images and then increase the resolution as training continues. But inversely scaling the batch size as training goes on also sounds like a good idea. Something I've also wanted to try (probably in a separate PR) is to load in a large batch of patches, so the training acts more like NeRFs. You could do this by keeping the focal lengths the same, but augmenting the principal points for each patch.
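For reference, a minimal sketch of that patch idea under the assumption of a simple pinhole model: cropping a window out of an image leaves the focal lengths unchanged and only shifts the principal point by the crop origin. This is just an illustration, not part of the PR.

```python
import torch

def sample_patch(image: torch.Tensor, cx: float, cy: float, patch: int):
    """Crop a random square patch from an HxWx3 image and return the adjusted
    principal point. fx/fy stay the same; cx/cy shift by the crop origin."""
    H, W, _ = image.shape
    y0 = int(torch.randint(0, H - patch + 1, (1,)))
    x0 = int(torch.randint(0, W - patch + 1, (1,)))
    return image[y0 : y0 + patch, x0 : x0 + patch], cx - x0, cy - y0

# A batch could then hold many such patches from different cameras, so one optimizer
# step sees rays from many viewpoints, closer to NeRF-style ray batching.
```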
Thanks. Dynamic batching is indeed an interesting possibility. Two thoughts: 💭

How would batching process images, randomly or sequentially? One option is to simply process images in the order they were captured—for example, taking the first n images as one batch, then the next n, and so on. The idea here is that sequential ordering might naturally preserve temporal continuity, so adjacent frames (which are likely to have similar viewpoints) get processed together. But that can be difficult to know, depending on how the user matched the images: exhaustive, sequential, or vocabulary tree.

Another approach is to order the sparse point cloud along a Hilbert curve. Since a Hilbert curve is a space-filling curve that preserves locality, it essentially divides your scene into "patches" of points that are spatially close together. For instance, if you select 10,000 consecutive points from this 1D Hilbert index, you're effectively picking a coherent patch of the scene. If you divide the points into groups of n, you essentially create patches of local regions. You can then choose the images that see these points for your batch, based on the COLMAP input data. Since the images are already ordered according to the Hilbert curve, it's easy to keep track of which images belong to which patch. This strategy explicitly enforces spatial coherence, ensuring that each batch is focused on a local region of the scene as it is training.

Would love to hear your thoughts about this.
You might want to check out https://arxiv.org/abs/2501.13975; they have a similar "locality" heuristic that they use to pull in multiple images seeing the same region of the scene. They say that this helps prevent overshoot/overfitting to a single image, which could happen as they are using a second-order optimization algorithm. My take is that with a suitably large and diverse batch, this might not be a problem? But I agree that with smaller batches, a local neighborhood might work out better. I imagine the batch-building heuristic doesn't have to be super complicated to get the behavior you'd want.
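As one concrete flavor of such a heuristic (not from this PR or the linked paper), a batch could be built from images that share many SfM points with an anchor image. The `image_points` mapping below is a hypothetical input you would derive from the COLMAP reconstruction.

```python
from collections import Counter

def build_local_batch(anchor_id: int, image_points: dict, batch_size: int) -> list:
    """Rank the other images by how many 3D points they share with the anchor
    (covisibility) and fill the batch with the most covisible ones."""
    anchor_pts = image_points[anchor_id]
    covis = Counter(
        {img: len(anchor_pts & pts) for img, pts in image_points.items() if img != anchor_id}
    )
    neighbors = [img for img, _ in covis.most_common(batch_size - 1)]
    return [anchor_id] + neighbors

# Toy example: images 1 and 2 see mostly the same points, image 3 sees a different region.
pts = {1: {10, 11, 12, 13}, 2: {11, 12, 13, 14}, 3: {20, 21}}
print(build_local_batch(anchor_id=1, image_points=pts, batch_size=2))  # [1, 2]
```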
Trying this out now myself, and it seems to converge much faster initially with a batch size of 50. Using 2K resolution images and around 2500 images; 18 GB of 48 GB VRAM is used. Step 750 (2.50%), train iter 699.992 ms, ETA 5 h 41 m 14 s, 494.09 M train rays/sec. Not the fastest, but as long as it produces a high quality output, it's fine I guess. =)
I am not seeing the linear increase in rays/s with larger batch sizes. Is there a diminishing effect after a certain batch?
Yes, please see the initial wandb results in an earlier comment. Initially the ray throughput scales, but I think because the splitting behavior currently assumes a single image per batch, we are getting many more gaussians with higher batch sizes.
Ok, thanks. Regarding the learning rates, should one consider square-root batch scaling due to the larger batch size? Bilagrid was also not working, but I made these changes to have it work again with batched training. Not sure, however, if this is "compatible" with bilagrid.
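For reference, the square-root rule mentioned above would look something like the sketch below. Whether it is the right rule for Adam-style optimizers in splatfacto is exactly the open question here, so the numbers are illustrative only.

```python
import math

def scale_lr(base_lr: float, batch_size: int, base_batch_size: int = 1) -> float:
    """Square-root batch scaling: growing the batch by a factor k multiplies the
    learning rate by sqrt(k), rather than by k as in linear scaling."""
    return base_lr * math.sqrt(batch_size / base_batch_size)

print(scale_lr(1.6e-4, batch_size=8))  # ~4.5e-4, using an illustrative base LR
```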
Recently I tested this. Why are the metrics (PSNR, SSIM, LPIPS) fluctuating up and down for the blue line (Nerfstudio commit "fix alpha compositing"), while the pink line (Nerfstudio commit 194b5d4) is more stable?
Found another bug: batch-size > 1 is not working with a multi-camera setup (for example, when the image resolutions are not the same: some images are landscape and others are portrait). You can test my dataset here: https://drive.google.com/file/d/1NWZSDU9tEmrAtpKxntTw6YBge_AZ66mf/view?usp=sharing

And yeah, this code works, but the TV_loss is still zero when I activated bilagrid.
@ichsan2895 thank you for the beautiful testing! The batching with cameras of different resolutions is concerning, and I suspect this is something that just can't be supported until PyTorch supports jagged tensors. Afaik, this is called NestedTensor and it's currently in beta, but it will likely be some time before it is fully supported. Perhaps the current best solution is to just not allow batching when images of varying resolutions are given.
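For context, a minimal example of the beta nested-tensor API being referred to; op coverage is still limited, so this is a possible future direction rather than something the datamanager can rely on today.

```python
import torch

# Two images with different resolutions held in one "jagged" batch object.
landscape = torch.rand(1080, 1920, 3)
portrait = torch.rand(1920, 1080, 3)
batch = torch.nested.nested_tensor([landscape, portrait])

for image in batch.unbind():  # unbind() recovers the individual, differently-sized tensors
    print(image.shape)
```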
I can also mention that camera optim does not work with batch size over 1. With a few modifications, I made it work again.
How do you activate the train validation loss in the console log output? Maybe I can check and see what I find.
Preliminary Benchmark

Just benchmarking the Mip360 dataset with various values of batch-size. This time, for each scene, I only ran 1000 steps; maybe when I have more free time, I will set it to 30k steps. FYI, mip-360's downscaled images are not compatible with nerfstudio, since nerfstudio downscales with floor rounding. So I resized them manually; see #1438 for discussion.

ns-train splatfacto --pipeline.datamanager.batch-size 1, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
ns-train splatfacto --pipeline.datamanager.batch-size 2, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
ns-train splatfacto --pipeline.datamanager.batch-size 3, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
SPLATFACTO-BIG

ns-train splatfacto-big --pipeline.datamanager.batch-size 1, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
ns-train splatfacto-big --pipeline.datamanager.batch-size 2, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
ns-train splatfacto-big --pipeline.datamanager.batch-size 3, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
MCMC

ns-train splatfacto-mcmc --pipeline.datamanager.batch-size 1, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
ns-train splatfacto-mcmc --pipeline.datamanager.batch-size 2, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
ns-train splatfacto-mcmc --pipeline.datamanager.batch-size 2 --pipeline.datamanager.train-cameras-sampling-strategy fps, Nerfstudio commit d5bdd45, Python 3.10, RTX4090
@abrahamezzeddine I use wandb for logging.
Sorry for the neglect of this PR, all. @ichsan2895, thank you so much for the benchmarking, it's a real help. I will try to crush the bugs re: bilagrid this weekend. Sorry again for the delay here. Regarding multi-res camera support, I think I'm okay limiting that ability at the moment, though if I go through with the "large patch sampling" technique described above, that limit can go away.
Force-pushed from d5bdd45 to 7c7b859
Bilagrid + RGBA dataset does not work:

>> ns-train splatfacto --vis viewer+wandb \
--pipeline.model.use-bilateral-grid True --pipeline.model.color-corrected-metrics True \
--pipeline.datamanager.batch-size 2 \
nerfstudio-data \
--data path/to/scene --downscale-factor 1 .
.
.
Step (% Done) Train Iter (time) ETA (time)
--------------------------------------------------------------
0 (0.00%) 1 m, 7 s 23 d, 6 h, 36 m, 54 s
----------------------------------------------------------------------------------------------------
Viewer running locally at: http://localhost:7007/ (listening on 0.0.0.0)
[06:48:33] Caching / undistorting eval images                      full_images_datamanager.py:241
Caching / undistorting eval images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:06
Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_eval_image_metrics_and_images: 7.3795
Trainer.train_iteration: 0.7054
VanillaPipeline.get_train_loss_dict: 0.6988
Trainer.eval_iteration: 0.0731
Traceback (most recent call last):
File "/usr/local/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 272, in entrypoint
main(
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 257, in main
launch(
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 190, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 101, in train_loop
trainer.train()
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/engine/trainer.py", line 304, in train
self.eval_iteration(step)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/utils/decorators.py", line 71, in wrapper
ret = func(self, *args, **kwargs)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/utils/profiler.py", line 111, in inner
out = func(*args, **kwargs)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/engine/trainer.py", line 551, in eval_iteration
metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/utils/profiler.py", line 111, in inner
out = func(*args, **kwargs)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 339, in get_eval_image_metrics_and_images
metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/models/splatfacto.py", line 763, in get_image_metrics_and_images
combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)
RuntimeError: Tensors must have same number of dimensions: got 4 and 3
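The traceback suggests the eval path still receives a batched (4-D) ground-truth image while the rendered output is a single 3-D image. A hedged sketch of a guard that would avoid the crash is shown below; the names mirror the traceback, and this is not the PR's actual fix.

```python
import torch

def match_eval_dims(gt_rgb: torch.Tensor, predicted_rgb: torch.Tensor) -> torch.Tensor:
    """Drop a leftover batch dimension from the ground truth before building the
    side-by-side eval image, since eval renders a single camera at a time."""
    if gt_rgb.dim() == 4 and predicted_rgb.dim() == 3:
        gt_rgb = gt_rgb[0]  # or gt_rgb.squeeze(0) when the leading dim is 1
    return torch.cat([gt_rgb, predicted_rgb], dim=1)
```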
Another error: images and masks do not work well.

>> ns-train splatfacto --vis viewer+wandb \
--pipeline.model.use-bilateral-grid True --pipeline.model.color-corrected-metrics True \
--pipeline.datamanager.batch-size 2 \
colmap \
--data path/to/scene --downscale-factor 1 --colmap-path "sparse/0" \
--images-path "images" --masks-path "masks"

[08:31:21] Caching / undistorting train images                     full_images_datamanager.py:241
Caching / undistorting train images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:34
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:135: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 47.2053
VanillaPipeline.get_train_loss_dict: 47.2035
Traceback (most recent call last):
File "/usr/local/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 272, in entrypoint
main(
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 257, in main
launch(
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 190, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/scripts/train.py", line 101, in train_loop
trainer.train()
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/engine/trainer.py", line 266, in train
loss, loss_dict, metrics_dict = self.train_iteration(step)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/utils/profiler.py", line 111, in inner
out = func(*args, **kwargs)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/engine/trainer.py", line 502, in train_iteration
_, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/utils/profiler.py", line 111, in inner
out = func(*args, **kwargs)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 301, in get_train_loss_dict
loss_dict = self.model.get_loss_dict(model_outputs, batch, metrics_dict)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/models/splatfacto.py", line 687, in get_loss_dict
mask = self._downscale_if_required(batch["mask"])
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/models/splatfacto.py", line 452, in _downscale_if_required
return resize_image(image, d)
File "/workspace/NERFSTUDIO_v115a2/nerfstudio/nerfstudio/models/splatfacto.py", line 66, in resize_image
downscaled = tf.conv2d(image, weight, stride=d)
RuntimeError: Input type (CUDABoolType) and weight type (torch.cuda.FloatTensor) should be the same
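This one looks like the boolean mask being fed to the float conv2d used for average-pool downscaling. A hedged workaround sketch, assuming an HxWx1 boolean mask and not the PR's actual fix, would cast to float before the convolution and re-binarize afterwards.

```python
import torch
import torch.nn.functional as F

def resize_bool_mask(mask: torch.Tensor, d: int) -> torch.Tensor:
    """Downscale an HxWx1 boolean mask by a factor d via average pooling, then
    threshold back to bool. conv2d rejects bool inputs, hence the float round-trip."""
    weight = torch.ones(1, 1, d, d, device=mask.device) / (d * d)
    m = mask.float().permute(2, 0, 1).unsqueeze(0)           # HWC -> NCHW
    downscaled = F.conv2d(m, weight, stride=d)
    return downscaled.squeeze(0).permute(1, 2, 0) > 0.5      # NCHW -> HWC bool
```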
I have tried padding the images that do not have the same resolution as the first image in the batch with an alpha channel. It worked, but the result is not good.

Now I have an idea, @akristoffersen, for using multi-cameras with a batch: only stack the images that have the same resolution as the first image of the batch. If there are no other images with a matching resolution in the batch, return the first image itself. To preserve the batch shape, I create dummy clones of the first image in the batch. Add this code:

import torch

def stacked_batches(batch, dim=0, out=None):
    if not batch:
        raise ValueError("Batch cannot be empty")
    # Reference size from the first tensor
    ref_h, ref_w, ref_c = batch[0].shape
    # Collect tensors that match the reference size
    matching = [tensor for tensor in batch if tensor.shape == (ref_h, ref_w, ref_c)]
    if not matching:
        raise ValueError("No tensors with matching resolution found")
    # Create output list, starting with matching tensors
    result = matching.copy()
    # Fill remaining slots with duplicates of the first tensor to match the original batch length
    while len(result) < len(batch):
        result.append(batch[0])
    # Stack all tensors
    return torch.stack(result, dim=dim, out=out)

Then, inside the tensor branch of the collate function (where elem is batch[0]), use stacked_batches in place of the plain torch.stack:

if isinstance(elem, torch.Tensor):
    out = None
    if torch.utils.data.get_worker_info() is not None:
        # If in a background process, use shared memory
        numel = sum(x.numel() for x in batch)
        storage = elem.untyped_storage()._new_shared(numel, device=str(elem.device))
        out = elem.new(storage).resize_(len(batch), *list(elem.size()))
    return stacked_batches(batch, 0, out=out)
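For what it's worth, a quick sanity check of the helper above (shapes are arbitrary): the portrait image is dropped and replaced by a duplicate of the first landscape image, so the stacked batch keeps its expected shape.

```python
import torch

images = [torch.rand(1080, 1920, 3), torch.rand(1920, 1080, 3)]  # landscape + portrait
stacked = stacked_batches(images)
print(stacked.shape)  # torch.Size([2, 1080, 1920, 3])
```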
WIP, preliminary testing makes it look like it's working but I would want to make sure.