Firefox Generates img alt text

Mozilla announced a public experiment in which Firefox automatically generates missing alt text for images - a big unblocker for blind users and other text-based readers.

Web technology, used correctly, is very accessible. How does a blind or text-only reader experience content that contains images? HTML requires every image to carry an alternative text.

In practice this requirement is often ignored, or the alternative texts are simplistic and unhelpful, sometimes outright useless generic filler (like “alt text”, “text”, “image”, or similar noise).
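For illustration, the contrast between generic filler and a useful alternative text (the file name and the chart description here are invented for the example):

```html
<!-- Useless: generic filler that tells a screen reader nothing -->
<img src="chart.png" alt="image">

<!-- Useful: conveys what a sighted reader would take from the image -->
<img src="chart.png"
     alt="Bar chart showing support requests dropping by half after the redesign">
```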

With Firefox 130, Mozilla is starting to experiment with an AI model that generates image alt text automatically. The feature is planned to roll out in three phases:

  1. PDF editor when adding an image in Firefox 130
  2. PDF reading
  3. [hopefully] general web browsing

Sounds like a good plan.

Some more technical notes and comments on the article:

> Once quantized, these models can be under 200MB on disk, and run in a couple of seconds on a laptop – a big reduction compared to the gigabytes and resources an LLM requires.

While that is a reasonable size for laptops and desktops, the couple of seconds of latency could still be a hindrance. Nevertheless, it is a significant unblocker for blind/text users.

I wonder what this means for mobile. As an optional accessibility feature, and given the storage space of today’s smartphones, I think it could work well there too.

> Running inference locally with small models offers many advantages:

They list five upsides of using local models. On a blog targeting developers, I would wish, if not expect, them to also list the downsides and weigh the two sides. As it stands, this reads as promotional material rather than an honest, open, fully informative description.

While they go into detail about the architecture and implementation, the negatives are noteworthy too, and weighing them could be insightful for readers.

> So every time an image is added, we get an array of pixels we pass to the ML engine

A flat array of pixels doesn’t make sense to me on its own. Images have different widths, so linear data with varying row boundaries would be awful to train on.

I have to assume this was a simplification or an unintended wording choice in the article.
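For what it’s worth, in the browser a “pixel array” typically arrives as a canvas `ImageData`: a flat RGBA byte buffer accompanied by explicit width and height, so the row boundaries are unambiguous and the linear data can be re-sectioned deterministically before inference. A minimal sketch (the 2x2 image here is made up; in Firefox the buffer would come from `ctx.getImageData(0, 0, w, h)` on a canvas the image was drawn to):

```javascript
// A flat RGBA buffer plus width/height - the shape ImageData.data has.
const width = 2, height = 2;
const flat = new Uint8ClampedArray([
  255, 0, 0, 255,     0, 255, 0, 255,      // row 0: red, green
  0, 0, 255, 255,     255, 255, 255, 255,  // row 1: blue, white
]);

// Re-section the linear data into rows of [R, G, B, A] pixels.
const rows = [];
for (let y = 0; y < height; y++) {
  const row = [];
  for (let x = 0; x < width; x++) {
    const i = (y * width + x) * 4; // 4 bytes per pixel
    row.push(Array.from(flat.slice(i, i + 4)));
  }
  rows.push(row);
}

console.log(rows.length, rows[0].length); // 2 2
```

So “an array of pixels” is presumably shorthand for this buffer-plus-dimensions pair, not widthless linear data.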