JS/Jquery: string to words text-segmentation script using dictionary and longest match?

Given a string such :

var str = "thisisinsane";

assisted by a list of words from a dictionary such:

var dic = [ "insane", "i", "is", "sin", "in", "this", "totally" ];

How to split str into words?

For this string, there are 3 words to identify. But we need avoid the pitfalls. To avoid them most of the time, I know we can attack the sentence by the left, and try to find the longest word prossible. When found, we could attack the rest of the string, etc.

Below : the input, the possible pitfalls, and the wanted output in bottom right.

thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /                   <BEST IS>
                               /                    <THIS ONE>
                       (this,is,in,sane)

At the end, we want to get :

var splited = ["this", "is", "insane"];

Problem courtesy of: Hugolpz

Solution

This is a quick implementation that will do the search from left to right and match longest words in the dictionary first ( jsfiddle ). However, I'm not so sure it's very clever to implement this on your own as it sounds like a complicated field and even without any knowledge on the subject I can tell that this algorithm is flawed to begin with. You might be better off looking out for existing libraries if there are any.

Needless to say, this was just typed together quickly. It is not optimized for performance in any way (it uses recursion, which is really not at all necessary), and also not extensively tested. It works on your example data, though, and on some variations I tested. I like to leave some of the work to the OP in case I give full code examples, so feel free to improve this if you want to use it.

var splitByDictionary = function (input, dictionary) {
    "use strict";

    // make sure we're going to look for longest-possible matches first
    dictionary.sort( function (a, b) {
        return b.length - a.length;
    } );

    var foundWords = [],
        remaining = input;

    var result = (function match () {
        if( remaining.length === 0 ) {
            return true;
        }

        for( var i = 0; i < dictionary.length; i++ ) {
            if( remaining.substr( 0, dictionary[i].length ) === dictionary[i] ) {
                foundWords.push( dictionary[i] );
                remaining = remaining.substr( dictionary[i].length );

                return match();
            }
        }

        return false;
    })();

    return result ? foundWords : null;
};

var splitted = splitByDictionary( "thisisinsane", ["insane", "i", "is", "sin", "in", "this", "totally"] );
console.log( splitted ); // ["this", "is", "insane"]

Solution courtesy of: Ingo Bürk

Discussion

There is currently no discussion for this recipe.

This recipe can be found in it's original form on Stack Over Flow .

我来评几句
登录后评论

已发表评论数()

相关站点

+订阅
热门文章