Smodin announces the release of its new Language Detection API, supporting 176 languages
We needed a language detector to improve our applications, so we set out to find a solution.
At first we thought it would be easy, since Google makes it look easy. It was not: language detection has always been a difficult problem.
We looked for the best way to predict the language of a text without a large machine learning model. The best option we found was a pre-trained language identification model that takes less than 1 MB of memory and can classify thousands of documents per second.
After many tweaks and improvements, we have developed a tool that delivers good accuracy for each language, quickly and reliably.
Here is the accuracy by language.
99% Accurate Languages*: French (fr), English (en), German (de), Portuguese (pt), Turkish (tr), Dutch (nl), Italian (it), Spanish (es), Hungarian (hu), Esperanto (eo), Polish (pl), Finnish (fi), Russian (ru), Macedonian (mk), Ukrainian (uk), Lithuanian (lt), Vietnamese (vi), Greek (el), Marathi (mr), Arabic (ar), Hebrew (he), Hindi (hi), Uyghur (ug), Japanese (ja), Georgian (ka), Bengali (bn), Urdu (ur), Thai (th), Chinese (zh), Armenian (hy), Malayalam (ml), Korean (ko), Khmer (km), Burmese (my), Tamil (ta), Kannada (kn), Telugu (te), Panjabi (pa), Lao (lo), Gujarati (gu), Tibetan Standard (bo), Divehi (dv), Sinhala (si), Amharic (am).
90% Accurate Languages*: Danish (da), Romanian (ro), Swedish (sv), Latin (la), Bulgarian (bg), Czech (cs), Tagalog (tl), Indonesian (id), Tatar (tt), Icelandic (is), Belarusian (be), Basque (eu), Breton (br), Kazakh (kk), Latvian (lv), Estonian (et), Irish (ga), Chuvash (cv), Bashkir (ba), Ossetian (os), Tajik (tg).
*Languages are listed in order of the amount of test data, most first. The test data were sentences 30 to 250 characters long. Only the 100 most popular languages were tested. Testing showed close to 99% accuracy for the majority of sentences of 300 characters or more.
No detector gives perfect results, but the best accuracy (99%+ for many languages, even the lesser-known ones) is seen at 300 characters or more. In general, the longer the text, the better.
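To make the testing notes above concrete, here is a minimal sketch of how per-language accuracy could be computed from labelled sentences. It is an illustration, not our actual test harness: the function name `accuracy_by_language`, the `detect` callable, and the sample format are all assumptions made for this example. The length bounds default to the 30 to 250 character range mentioned above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def accuracy_by_language(
    samples: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
    min_chars: int = 30,
    max_chars: int = 250,
) -> Dict[str, float]:
    """Return the share of correctly identified sentences per language.

    samples: (text, expected_language_code) pairs.
    detect:  any function mapping text to a language code (hypothetical;
             plug in whichever detector you use).
    Sentences outside [min_chars, max_chars] are skipped, not scored.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, expected in samples:
        if not (min_chars <= len(text) <= max_chars):
            continue  # outside the tested length range
        total[expected] += 1
        if detect(text) == expected:
            correct[expected] += 1
    # Only languages with at least one scored sentence appear in the result.
    return {lang: correct[lang] / total[lang] for lang in total}
```

Passing the detector in as a parameter keeps the scoring logic independent of any particular model or API, so the same function can compare several detectors on the same sentences.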
As Wikipedia puts it, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.
Language detection services can be used in various ways. For example, they can identify the language of business texts such as chat and email.
The service can identify the language of the text and the parts of the text where the language has changed, down to the word level.
Using language detection services, Surveillance Insights can highlight and annotate the language used in text and help identify potentially suspicious activities.
Business texts such as email or chat can be in different languages. A key part of the natural language processing pipeline is to determine the primary language of each text so that it can be processed through the relevant language-specific steps.
In some cases, people may change the language used in a chat to avoid monitoring or to hide illegal activities. Finding the point at which the chat language switches is very useful for deciding whether suspicious activity has occurred.
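The two ideas above, finding the primary language of a conversation and the points where the language changes, can be sketched in a few lines. This is a hedged illustration only: `primary_language`, `find_language_switches`, and `toy_detect` are hypothetical names invented for this example, the toy detector is a word-list stand-in rather than a real model, and the sketch works per message rather than down to the word level.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def primary_language(messages: List[str], detect: Callable[[str], str]) -> Optional[str]:
    """Return the most frequently detected language, or None for no messages."""
    if not messages:
        return None
    counts = Counter(detect(message) for message in messages)
    return counts.most_common(1)[0][0]


def find_language_switches(
    messages: List[str], detect: Callable[[str], str]
) -> List[Tuple[int, str, str]]:
    """Return (message_index, previous_language, new_language) for each switch."""
    switches = []
    previous = None
    for index, message in enumerate(messages):
        language = detect(message)
        if previous is not None and language != previous:
            switches.append((index, previous, language))
        previous = language
    return switches


def toy_detect(text: str) -> str:
    """Stand-in detector for the demo only; NOT a real language model."""
    spanish_words = {"hola", "gracias", "mañana", "nos", "vemos"}
    words = set(text.lower().split())
    return "es" if words & spanish_words else "en"


chat = [
    "Hi, are we still on for today?",
    "Yes, see you at noon.",
    "nos vemos mañana",
    "gracias",
    "ok, talk later",
]
print(primary_language(chat, toy_detect))        # en
print(find_language_switches(chat, toy_detect))  # [(2, 'en', 'es'), (4, 'es', 'en')]
```

In practice `toy_detect` would be replaced by a call to a real detector. Very short messages such as a single word are the hardest case, so a real pipeline might want to require a minimum length or confidence before reporting a switch.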
If you would like to use our API, you can get more information about it and its pricing by clicking HERE.
Besides providing an API service, we have also decided to release the detector as open source.
This is our first open-source release! The open-source language detector is available HERE.