diff --git a/README.md b/README.md index ac8269d..d008fc6 100644 --- a/README.md +++ b/README.md @@ -73,9 +73,11 @@ for (const result of results) { } ``` -Here `results` will be an array of `{ detectedLanguage, confidence }` objects, with the `detectedLanguage` field being a BCP 47 language tag and `confidence` beeing a number between 0 and 1. The array will be sorted by descending confidence, and the confidences will be normalized so that all confidences that the underlying model produces sum to 1, but confidences below `0.1` will be omitted. (Thus, the total sum of `confidence` values seen by the developer will sometimes sum to less than 1.) +Here `results` will be an array of `{ detectedLanguage, confidence }` objects, with the `detectedLanguage` field being a BCP 47 language tag and `confidence` being a number between 0 and 1. The array will be sorted by descending confidence, and the confidences will be normalized so that all confidences that the underlying model produces sum to 1, but very low confidences will be lumped together into an [`"und"`](https://www.rfc-editor.org/rfc/rfc5646.html#:~:text=*%20%20The%20'und'%20(Undetermined)%20primary,certain%20situations.) language. -The language being unknown is represented by `detectedLanguage` being null. The array will always contain at least 1 entry, although it could be for the unknown (`null`) language. +The array will always contain at least 1 entry, although it could be for the undetermined (`"und"`) language. + +For more details on the ways low-confidence results are excluded, see [the specification](https://webmachinelearning.github.io/translation-api/#note-language-detection-post-processing) and the discussion in [issue #39](https://github.com/webmachinelearning/translation-api/issues/39). 
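As a rough illustration of consuming the array described above, the sketch below treats an `"und"` top result, or a low-confidence one, as "no usable answer". The helper name `pickLanguage` and the `0.4` threshold are assumptions made for this example and are not part of the API; the sample data is invented.

```javascript
// Hypothetical helper (not part of the API): returns the best language tag,
// or null if the top result is undetermined or too uncertain.
function pickLanguage(results, minConfidence = 0.4) {
  // The array is sorted by descending confidence and always has at least one entry.
  const [best] = results;
  if (best.detectedLanguage === "und" || best.confidence < minConfidence) {
    return null;
  }
  return best.detectedLanguage;
}

// Illustrative data only; real values would come from detector.detect().
const sampleResults = [
  { detectedLanguage: "ja", confidence: 0.9 },
  { detectedLanguage: "und", confidence: 0.1 },
];
console.log(pickLanguage(sampleResults)); // "ja"
```

Because the function only looks at the first entry, it relies on the descending sort order described above.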
### Language detection with expected input languages @@ -115,7 +117,7 @@ async function translateUnknownCustomerInput(textToTranslate, targetLanguage) { const detector = await ai.languageDetector.create(); const [bestResult] = await detector.detect(textToTranslate); - if (bestResult.detectedLangauge ==== null || bestResult.confidence < 0.4) { + if (bestResult.detectedLanguage === "und" || bestResult.confidence < 0.4) { // We'll just return the input text without translating. It's probably mostly punctuation // or something. return textToTranslate; @@ -199,112 +201,17 @@ In all cases, the exception used for rejecting promises or erroring `ReadableStr ## Detailed design -### Full API surface in Web IDL - -```webidl -// Shared self.ai APIs - -partial interface WindowOrWorkerGlobalScope { - [Replaceable, SecureContext] readonly attribute AI ai; -}; - -[Exposed=(Window,Worker), SecureContext] -interface AI { - readonly attribute AITranslatorFactory translator; - readonly attribute AILanguageDetectorFactory languageDetector; -}; - -[Exposed=(Window,Worker), SecureContext] -interface AICreateMonitor : EventTarget { - attribute EventHandler ondownloadprogress; - - // Might get more stuff in the future, e.g. 
for - // https://github.com/webmachinelearning/prompt-api/issues/4 -}; - -callback AICreateMonitorCallback = undefined (AICreateMonitor monitor); - -enum AIAvailability { "unavailable", "downloadable", "downloading", "available" }; -``` - -```webidl -// Translator - -[Exposed=(Window,Worker), SecureContext] -interface AITranslatorFactory { - Promise<AITranslator> create(AITranslatorCreateOptions options); - Promise<AIAvailability> availability(AITranslatorCreateCoreOptions options); -}; - -[Exposed=(Window,Worker), SecureContext] -interface AITranslator { - Promise<DOMString> translate(DOMString input, optional AITranslatorTranslateOptions options = {}); - ReadableStream translateStreaming(DOMString input, optional AITranslatorTranslateOptions options = {}); - - readonly attribute DOMString sourceLanguage; - readonly attribute DOMString targetLanguage; - - undefined destroy(); -}; - -dictionary AITranslatorCreateCoreOptions { - required DOMString sourceLanguage; - required DOMString targetLanguage; -}; - -dictionary AITranslatorCreateOptions : AITranslatorCreateCoreOptions { - AbortSignal signal; - AICreateMonitorCallback monitor; -}; - -dictionary AITranslatorTranslateOptions { - AbortSignal signal; -}; -``` - -```webidl -// Language detector - -[Exposed=(Window,Worker), SecureContext] -interface AILanguageDetectorFactory { - Promise<AILanguageDetector> create(optional AILanguageDetectorCreateOptions options = {}); - Promise<AIAvailability> availability(optional AILanguageDetectorCreateCoreOptions = {}); -}; - -[Exposed=(Window,Worker), SecureContext] -interface AILanguageDetector { - Promise<sequence<LanguageDetectionResult>> detect(DOMString input, - optional AILanguageDetectorDetectOptions options = {}); - - readonly attribute FrozenArray<DOMString>? 
expectedInputLanguages; - - undefined destroy(); -}; - -dictionary AILanguageDetectorCreateCoreOptions { - sequence<DOMString> expectedInputLanguages; -}; +### Language tag handling -dictionary AILanguageDetectorCreateOptions : AILanguageDetectorCreateCoreOptions { - AbortSignal signal; - AICreateMonitorCallback monitor; -}; +If a browser supports translating from `ja` to `en`, does it also support translating from `ja` to `en-US`? What about `en-GB`? What about the (discouraged, but valid) `en-Latn`, i.e. English written in the usual Latin script? But translation to `en-Brai`, English written in the Braille script, is different entirely. -dictionary AILanguageDetectorDetectOptions { - AbortSignal signal; -}; +We're proposing that the API use the same model as JavaScript's `Intl` APIs, which tries to do [best-fit matching](https://tc39.es/ecma402/#sec-lookupmatchinglocalebybestfit) of the requested language tag to the available language tags. The specification contains [a more detailed example](https://webmachinelearning.github.io/translation-api/#example-language-arc-support). -dictionary LanguageDetectionResult { - DOMString? detectedLanguage; // null represents unknown language - double confidence; -}; -``` +### Multilingual text -### Language tag handling - -If a browser supports translating from `ja` to `en`, does it also support translating from `ja` to `en-US`? What about `en-GB`? What about the (discouraged, but valid) `en-Latn`, i.e. English written in the usual Latin script? But translation to `en-Brai`, English written in the Braille script, is different entirely. +For language detection of multilingual text, we return detected language confidences in proportion to the languages detected. The specification gives [an example](https://webmachinelearning.github.io/translation-api#example-multilingual-input) of how this works. See also the discussion in [issue #13](https://github.com/webmachinelearning/translation-api/issues/13). 
-We're not clear on what the right model is here, and are discussing it in [issue #11](https://github.com/webmachinelearning/translation-api/issues/11). +A future option might be to have the API instead return the text split into different-language segments. There is [some precedent](https://github.com/pemistahl/lingua-py?tab=readme-ov-file#116-detection-of-multiple-languages-in-mixed-language-texts) for this, but it does not seem to be common yet. This could be added without backward-compatibility problems by making it a non-default mode. ### Downloading diff --git a/index.bs b/index.bs index 2909c0f..02bbf18 100644 --- a/index.bs +++ b/index.bs @@ -7,7 +7,7 @@ Group: webml Repository: webmachinelearning/translation-api URL: https://webmachinelearning.github.io/translation-api Editor: Domenic Denicola, Google https://google.com, d@domenic.me, https://domenic.me/ -Abstract: The translator and langauge detector APIs gives web pages the ability to translate text between languages, and detect the language of such text. +Abstract: The translator and language detector APIs give web pages the ability to translate text between languages, and detect the language of such text. Markup Shorthands: markdown yes, css no Complain About: accidental-2119 yes, missing-example-ids yes Assume Explicit For: yes @@ -106,7 +106,9 @@ The translator getter steps are to return [=this=] 1. [=Assert=]: these steps are running [=in parallel=]. - 1. Initiate the download process for everything the user agent needs to translate text from |options|["{{AITranslatorCreateCoreOptions/sourceLanguage}}"] to |options|["{{AITranslatorCreateCoreOptions/targetLanguage}}"]. This could include both a base translation model and specific language arc material, or perhaps material for multiple language arcs if an intermediate language is used. + 1. 
Initiate the download process for everything the user agent needs to translate text from |options|["{{AITranslatorCreateCoreOptions/sourceLanguage}}"] to |options|["{{AITranslatorCreateCoreOptions/targetLanguage}}"]. + + This could include both a base translation model and specific language arc material, or perhaps material for multiple language arcs if an intermediate language is used. 1. If the download process cannot be started for any reason, then return false. @@ -145,6 +147,7 @@ The translator getter steps are to return [=this=]

Availability

+
The availability(|options|) method steps are: @@ -397,3 +400,315 @@ When translation fails, the following possible reasons may be surfaced to the we

This table does not give the complete list of exceptions that can be surfaced by {{AITranslator/translate()|translator.translate()}} and {{AITranslator/translateStreaming()|translator.translateStreaming()}}. It only contains those which can come from the [=implementation-defined=] [=translate=] algorithm. + +

The language detector API

+ + +partial interface AI { + readonly attribute AILanguageDetectorFactory languageDetector; +}; + +[Exposed=(Window,Worker), SecureContext] +interface AILanguageDetectorFactory { + Promise<AILanguageDetector> create( + optional AILanguageDetectorCreateOptions options = {} + ); + Promise<AIAvailability> availability( + optional AILanguageDetectorCreateCoreOptions options = {} + ); +}; + +[Exposed=(Window,Worker), SecureContext] +interface AILanguageDetector { + Promise<sequence<LanguageDetectionResult>> detect( + DOMString input, + optional AILanguageDetectorDetectOptions options = {} + ); + + readonly attribute FrozenArray<DOMString>? expectedInputLanguages; + + undefined destroy(); +}; + +dictionary AILanguageDetectorCreateCoreOptions { + sequence<DOMString> expectedInputLanguages; +}; + +dictionary AILanguageDetectorCreateOptions : AILanguageDetectorCreateCoreOptions { + AbortSignal signal; + AICreateMonitorCallback monitor; +}; + +dictionary AILanguageDetectorDetectOptions { + AbortSignal signal; +}; + +dictionary LanguageDetectionResult { + DOMString detectedLanguage; + double confidence; +}; + + +Every {{AI}} has a language detector factory, an {{AILanguageDetectorFactory}} object. Upon creation of the {{AI}} object, its [=AI/language detector factory=] must be set to a [=new=] {{AILanguageDetectorFactory}} object created in the {{AI}} object's [=relevant realm=]. + +The languageDetector getter steps are to return [=this=]'s [=AI/language detector factory=]. + +

Creation

+ +
+ The create(|options|) method steps are: + + 1. If [=this=]'s [=relevant global object=] is a {{Window}} whose [=associated Document=] is not [=Document/fully active=], then return [=a promise rejected with=] an "{{InvalidStateError}}" {{DOMException}}. + + 1. If |options|["{{AILanguageDetectorCreateOptions/signal}}"] [=map/exists=] and is [=AbortSignal/aborted=], then return [=a promise rejected with=] |options|["{{AILanguageDetectorCreateOptions/signal}}"]'s [=AbortSignal/abort reason=]. + + 1. [=Validate and canonicalize language detector options=] given |options|. + +

This can mutate |options|. + + 1. Return the result of [=creating an AI model object=] given [=this=]'s [=relevant realm=], |options|, [=compute language detector options availability=], [=download the language detector model=], [=initialize the language detector model=], and [=create the language detector object=]. +

+ +
+ To validate and canonicalize language detector options given an {{AILanguageDetectorCreateCoreOptions}} |options|, perform the following steps. They mutate |options| in place to canonicalize language tags, and throw a {{TypeError}} if any are invalid. + + 1. [=Validate and canonicalize language tags=] given |options| and "{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}". +
+ +
+ To download the language detector model, given an {{AILanguageDetectorCreateCoreOptions}} |options|: + + 1. [=Assert=]: these steps are running [=in parallel=]. + + 1. Initiate the download process for everything the user agent needs to detect the languages of input text, including all the languages in |options|["{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}"]. + + This could include both a base language detection model, and specific fine-tunings or other material to help with the languages identified in |options|["{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}"]. + + 1. If the download process cannot be started for any reason, then return false. + + 1. Return true. +
+ +
+ To initialize the language detector model, given an {{AILanguageDetectorCreateCoreOptions}} |options|: + + 1. [=Assert=]: these steps are running [=in parallel=]. + + 1. Perform any necessary initialization operations for the AI model backing the user agent's capabilities for detecting the languages of input text. + + This could include loading the model into memory, or loading any fine-tunings necessary to support the languages identified in |options|["{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}"]. + + 1. If initialization failed for any reason, then return false. + + 1. Return true. +
+ +
+ To create the language detector object, given a [=ECMAScript/realm=] |realm| and an {{AILanguageDetectorCreateCoreOptions}} |options|: + + 1. [=Assert=]: these steps are running on |realm|'s [=ECMAScript/surrounding agent=]'s [=agent/event loop=]. + + 1. Return a new {{AILanguageDetector}} object, created in |realm|, with + +
+ : [=AILanguageDetector/expected input languages=] + :: the result of [=creating a frozen array=] given |options|["{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}"] if it [=set/is empty|is not empty=]; otherwise null +
+
+ +

Availability

+ + +
+ The availability(|options|) method steps are: + + 1. If [=this=]'s [=relevant global object=] is a {{Window}} whose [=associated Document=] is not [=Document/fully active=], then return [=a promise rejected with=] an "{{InvalidStateError}}" {{DOMException}}. + + 1. [=Validate and canonicalize language detector options=] given |options|. + + 1. Let |promise| be [=a new promise=] created in [=this=]'s [=relevant realm=]. + + 1. [=In parallel=]: + + 1. Let |availability| be the result of [=computing language detector options availability=] given |options|. + + 1. [=Queue a global task=] on the [=AI task source=] given [=this=]'s [=relevant global object=] to perform the following steps: + + 1. If |availability| is null, then [=reject=] |promise| with an "{{UnknownError}}" {{DOMException}}. + + 1. Otherwise, [=resolve=] |promise| with |availability|. +
+ + +
+ To compute language detector options availability given an {{AILanguageDetectorCreateCoreOptions}} |options|, perform the following steps. They return either an {{AIAvailability}} value or null, and they mutate |options| in place to update language tags to their best-fit matches. + + 1. [=Assert=]: this algorithm is running [=in parallel=]. + + 1. If there is some error attempting to determine what languages the user agent supports detecting, which the user agent believes to be transient (such that re-querying could stop producing such an error), then return null. + + 1. Let |availabilities| be the result of [=getting language availabilities=] given the purpose of detecting text written in that language. + + 1. Let |availability| be "{{AIAvailability/available}}". + + 1. [=set/For each=] |language| in |options|["{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}"]: + + 1. [=list/For each=] |availabilityToCheck| in « "{{AIAvailability/available}}", "{{AIAvailability/downloading}}", "{{AIAvailability/downloadable}}" »: + + 1. Let |languagesWithThisAvailability| be |availabilities|[|availabilityToCheck|]. + + 1. Let |bestMatch| be [$LookupMatchingLocaleByBestFit$](|languagesWithThisAvailability|, « |language| »). + + 1. If |bestMatch| is not undefined, then: + + 1. [=list/Replace=] |language| with |bestMatch|.\[[locale]] in |options|["{{AILanguageDetectorCreateCoreOptions/expectedInputLanguages}}"]. + + 1. Set |availability| to the [=AIAvailability/minimum availability=] given |availability| and |availabilityToCheck|. + + 1. [=iteration/Break=]. + + 1. Return "{{AIAvailability/unavailable}}". + + 1. Return |availability|. +
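The availability computation above can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not the normative algorithm: `lookupMatch` is a simplified stand-in for `LookupMatchingLocaleByBestFit` (it only strips trailing subtags), the ordering in `ORDER` is an assumed definition of "minimum availability", and the sketch reads the "Return unavailable" step as applying only when no availability level produced a match for a language. All function names are hypothetical.

```javascript
// Assumed ordering for "minimum availability", from least to most available.
const ORDER = ["unavailable", "downloadable", "downloading", "available"];

function minimumAvailability(a, b) {
  return ORDER.indexOf(a) <= ORDER.indexOf(b) ? a : b;
}

// Simplified stand-in for LookupMatchingLocaleByBestFit: tries the exact tag,
// then progressively shorter prefixes ("en-US" -> "en"). Real best-fit
// matching is implementation-defined and can be richer than this.
function lookupMatch(available, requested) {
  let candidate = requested;
  while (candidate) {
    if (available.includes(candidate)) return candidate;
    const cut = candidate.lastIndexOf("-");
    candidate = cut === -1 ? "" : candidate.slice(0, cut);
  }
  return undefined;
}

// Mutates expectedInputLanguages in place, replacing each tag with its match.
function computeAvailability(availabilities, expectedInputLanguages) {
  let availability = "available";
  for (let i = 0; i < expectedInputLanguages.length; i++) {
    let matched = false;
    for (const level of ["available", "downloading", "downloadable"]) {
      const best = lookupMatch(availabilities[level] ?? [], expectedInputLanguages[i]);
      if (best !== undefined) {
        expectedInputLanguages[i] = best;
        availability = minimumAvailability(availability, level);
        matched = true;
        break;
      }
    }
    // Assumed reading: a language with no match at any level is unsupported.
    if (!matched) return "unavailable";
  }
  return availability;
}

// Illustrative data only.
const exampleAvailabilities = { available: ["en", "ja"], downloading: [], downloadable: ["fr"] };
const requestedLanguages = ["en-US", "fr"];
console.log(computeAvailability(exampleAvailabilities, requestedLanguages)); // "downloadable"
console.log(requestedLanguages); // ["en", "fr"]
```

The overall result is the least-available level needed by any requested language, which is why one downloadable language makes the whole request "downloadable".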
+ +

The {{AILanguageDetector}} class

+ +Every {{AILanguageDetector}} has an expected input languages, a {{FrozenArray}}<{{DOMString}}> or null, set during creation. + +
+ +The expectedInputLanguages getter steps are to return [=this=]'s [=AILanguageDetector/expected input languages=]. + +
+ + +
+ The detect(|input|, |options|) method steps are: + + 1. If [=this=]'s [=relevant global object=] is a {{Window}} whose [=associated Document=] is not [=Document/fully active=], then return [=a promise rejected with=] an "{{InvalidStateError}}" {{DOMException}}. + + 1. Let |signals| be « [=this=]'s [=AIDestroyable/destruction abort controller=]'s [=AbortController/signal=] ». + + 1. If |options|["`signal`"] [=map/exists=], then [=set/append=] it to |signals|. + + 1. Let |compositeSignal| be the result of [=creating a dependent abort signal=] given |signals| using {{AbortSignal}} and [=this=]'s [=relevant realm=]. + + 1. If |compositeSignal| is [=AbortSignal/aborted=], then return [=a promise rejected with=] |compositeSignal|'s [=AbortSignal/abort reason=]. + + 1. Let |abortedDuringOperation| be false. + +

This variable will be written to from the [=event loop=], but read from [=in parallel=]. + + 1. [=AbortSignal/add|Add the following abort steps=] to |compositeSignal|: + + 1. Set |abortedDuringOperation| to true. + + 1. Let |promise| be [=a new promise=] created in [=this=]'s [=relevant realm=]. + + 1. [=In parallel=]: + + 1. Let |stopProducing| be the following steps: + + 1. Return |abortedDuringOperation|. + + 1. Let |result| be the result of [=detecting languages=] given |input| and |stopProducing|. + + 1. [=Queue a global task=] on the [=AI task source=] given [=this=]'s [=relevant global object=] to perform the following steps: + + 1. If |abortedDuringOperation| is true, then [=reject=] |promise| with |compositeSignal|'s [=AbortSignal/abort reason=]. + + 1. Otherwise, if |result| is an [=error information=], then [=reject=] |promise| with the result of [=exception/creating=] a {{DOMException}} with name given by |result|'s [=error information/error name=], using |result|'s [=error information/error information=] to populate the message appropriately. + + 1. Otherwise: + + 1. [=Assert=]: |result| is a [=list=] of {{LanguageDetectionResult}} dictionaries. (It is not null, since in that case |abortedDuringOperation| would have been true.) + + 1. [=Resolve=] |promise| with |result|. +

+ +

The algorithm

+ +
+ To detect languages given a [=string=] |input| and an algorithm |stopProducing| that takes no arguments and returns a boolean, perform the following steps. They will return either null, an [=error information=], or a [=list=] of {{LanguageDetectionResult}} dictionaries. + + 1. [=Assert=]: this algorithm is running [=in parallel=]. + + 1. Let |availabilities| be the result of [=getting language availabilities=] given the purpose of detecting text written in that language. + + 1. Let |currentlyAvailableLanguages| be |availabilities|["{{AIAvailability/available}}"]. + + 1. In an [=implementation-defined=] manner, subject to the following guidelines, let |rawResult| and |unknown| be the result of detecting the languages of |input|. + + |rawResult| must be a [=map=] which has a [=map/key=] for each language in |currentlyAvailableLanguages|. The [=map/value=] for each such key must be a number between 0 and 1. This value must represent the implementation's confidence that |input| is written in that language. + + |unknown| must be a number between 0 and 1 that represents the implementation's confidence that |input| is not written in any of the languages in |currentlyAvailableLanguages|. + + The [=map/values=] of |rawResult|, plus |unknown|, must sum to 1. Each such value, or |unknown|, may be 0. + + If the implementation believes |input| to be written in multiple languages, then it should attempt to apportion the values of |rawResult| and |unknown| such that they are proportionate to the amount of |input| written in each detected language. The exact scheme for apportioning |input| is [=implementation-defined=]. + +
+

If |input| is "`tacosを食べる`", the implementation might split this into "`tacos`" and "`を食べる`", and then detect the languages of each separately. The first part might be detected as English with confidence 0.5 and Spanish with confidence 0.5, and the second part as Japanese with confidence 1. The resulting |rawResult| then might be «[ "`en`" → 0.25, "`es`" → 0.25, "`ja`" → 0.5 ]» (with |unknown| set to 0). + +

The decision to split this into two parts, instead of e.g. the three parts "`tacos`", "`を`", and "`食べる`", was an [=implementation-defined=] choice. Similarly, the decision to treat each part as contributing to "half" of the result, instead of e.g. weighting by number of [=code points=], was [=implementation-defined=]. + +

(Realistically, we expect that implementations will split on larger chunks than this, as generally more than 4-5 [=code points=] are necessary for most language detection models.) +

+ + If |stopProducing| returns true at any point during this process, then return null. + + If an error occurred during language detection, then return an [=error information=] according to the guidance in [[#language-detector-errors]]. + + 1. [=map/Sort in descending order=] |rawResult| with a less than algorithm which given [=map/entries=] |a| and |b|, returns true if |a|'s [=map/value=] is less than |b|'s [=map/value=]. + + 1. Let |results| be an empty [=list=]. + + 1. Let |cumulativeConfidence| be 0. + + 1. [=map/For each=] |key| → |value| of |rawResult|: + + 1. If |value| is 0, then [=iteration/break=]. + + 1. If |value| is less than |unknown|, then [=iteration/break=]. + + 1. [=list/Append=] «[ "{{LanguageDetectionResult/detectedLanguage}}" → |key|, "{{LanguageDetectionResult/confidence}}" → |value| ]» to |results|. + + 1. Set |cumulativeConfidence| to |cumulativeConfidence| + |value|. + + 1. If |cumulativeConfidence| is greater than or equal to 0.99, then [=iteration/break=]. + + 1. [=Assert=]: 1 − |cumulativeConfidence| is greater than or equal to |unknown|. + + 1. [=list/Append=] «[ "{{LanguageDetectionResult/detectedLanguage}}" → "`und`", "{{LanguageDetectionResult/confidence}}" → 1 − |cumulativeConfidence| ]» to |results|. + + 1. Return |results|. + +

The post-processing of |rawResult| and |unknown| essentially consolidates all languages below a certain threshold into the "`und`" language. Languages which are less than 1% likely, or contribute to less than 1% of the text, are considered more likely to be noise than to be worth detecting. Similarly, if the implementation is less sure about a language than it is about the text not being in any of the languages it knows, that language is probably not worth returning to the web developer. +
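The apportioning example and the post-processing steps above can be sketched together in JavaScript. This is a non-normative illustration: the function names `apportion` and `postProcess` are hypothetical, and equal weighting of segments is just the one implementation-defined scheme used in the "tacos" example.

```javascript
// Combine per-segment detections into one map, weighting each segment equally
// (one possible implementation-defined scheme, as in the tacos example).
function apportion(segments) {
  const raw = new Map();
  for (const segment of segments) {
    for (const [language, confidence] of Object.entries(segment)) {
      raw.set(language, (raw.get(language) ?? 0) + confidence / segments.length);
    }
  }
  return raw;
}

// Mirror of the post-processing steps: sort descending, keep languages until
// one is zero, falls below `unknown`, or 0.99 cumulative confidence is
// reached, then append "und" with whatever confidence remains.
function postProcess(rawResult, unknown) {
  const sorted = [...rawResult.entries()].sort((a, b) => b[1] - a[1]);
  const results = [];
  let cumulativeConfidence = 0;
  for (const [language, value] of sorted) {
    if (value === 0) break;
    if (value < unknown) break;
    results.push({ detectedLanguage: language, confidence: value });
    cumulativeConfidence += value;
    if (cumulativeConfidence >= 0.99) break;
  }
  results.push({ detectedLanguage: "und", confidence: 1 - cumulativeConfidence });
  return results;
}

// "tacos" detected as en/es at 0.5 each, "を食べる" detected as ja at 1.
const rawExample = apportion([{ en: 0.5, es: 0.5 }, { ja: 1 }]);
console.log(postProcess(rawExample, 0));
```

With these inputs the map holds `ja` at 0.5 and `en` and `es` at 0.25 each, so all three languages are kept and the trailing `"und"` entry carries a confidence of 0.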

+ +

Errors

+
+When language detection fails, the following possible reasons may be surfaced to the web developer. This table lists the possible {{DOMException}} [=DOMException/names=] and the cases in which an implementation should use them:
+
+<table>
+  <thead>
+    <tr>
+      <th>{{DOMException}} [=DOMException/name=]
+      <th>Scenarios
+  <tbody>
+    <tr>
+      <td>"{{NotAllowedError}}"
+      <td>Language detection is disabled by user choice or user agent policy.
+    <tr>
+      <td>"{{QuotaExceededError}}"
+      <td>The input to be detected was too large for the user agent to handle.
+    <tr>
+      <td>"{{UnknownError}}"
+      <td>All other scenarios, or if the user agent would prefer not to disclose the failure reason.
+</table>
+

This table does not give the complete list of exceptions that can be surfaced by {{AILanguageDetector/detect()|detector.detect()}}. It only contains those which can come from the [=implementation-defined=] [=detect languages=] algorithm.