score:4

Accepted answer

This works if every SKU ID has the same length.

// ...
string regexStr = Calculate(skus);
// ...

// Groups the strings by their first character and recurses on the remainders.
public static string Calculate(IEnumerable<string> rest) {
    if (rest.First().Length > 0) {
        string[] groups = rest.GroupBy(r => r[0])
            .Select(g => g.Key + Calculate(g.Select(e => e.Substring(1))))
            .ToArray();
        // Parentheses are only needed when there is more than one alternative.
        return groups.Length > 1 ? "(" + string.Join("|", groups) + ")" : groups[0];
    } else {
        return string.Empty;
    }
}
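
The equal-length restriction comes from testing only the first string for emptiness. As a rough sketch (not part of the original answer), the same grouping idea can be extended to SKUs of different lengths by treating an exhausted string as an optional tail. The names `SkuRegex` and `Build` are made up for illustration, and the use of non-capturing groups and `Regex.Escape` is my own assumption about what is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SkuRegex
{
    // Hypothetical helper: builds a prefix-grouped pattern for strings
    // that may have different lengths.
    public static string Build(IEnumerable<string> words)
    {
        var list = words.Distinct().ToList();

        // An empty string means some word ends here, so whatever follows is optional.
        bool endsHere = list.Contains(string.Empty);

        string[] groups = list
            .Where(w => w.Length > 0)
            .GroupBy(w => w[0])
            .Select(g => Regex.Escape(g.Key.ToString())
                         + Build(g.Select(w => w.Substring(1))))
            .ToArray();

        if (groups.Length == 0)
        {
            return string.Empty;
        }

        string body = string.Join("|", groups);
        if (endsHere)
        {
            return "(?:" + body + ")?";
        }
        return groups.Length > 1 ? "(?:" + body + ")" : groups[0];
    }
}
```

For example, `SkuRegex.Build(new[] { "ab", "abc", "b" })` returns `(?:ab(?:c)?|b)`. Anchor the pattern (for instance with `^` and `$`) when matching whole SKUs.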

score:0

Take the entire list of your SKUs and build a single ternary-tree regex.
Whenever you add or delete SKUs, regenerate the regex. Maybe your database
generates it on a weekly basis.

This utility makes a regex of 10,000 strings in less than half a second,
and size is not a concern: it could be 300,000 strings.

For example, here is a regex for a 175,000-word dictionary:

[image not available]

score:1

This is what I finally worked out:

var skus = new[] { "batpag003", "battwlp03", "battwlp04", "battwsp04", "spifatb01" };

// Declared up front so that generate and regexify can call each other.
Func<IEnumerable<IGrouping<string, string>>, IEnumerable<string>> regexify = null;

// Tries every prefix length n and keeps the splits that actually merge some SKUs.
Func<IEnumerable<string>, IEnumerable<string>> generate =
    xs =>
        from n in Enumerable.Range(2, 20)
        let g = xs.GroupBy(x => new string(x.Take(n).ToArray()), x => new string(x.Skip(n).ToArray()))
        where g.Count() != xs.Count()
        from r in regexify(g)
        select r;

regexify = gxs =>
{
    if (!gxs.Any())
    {
        return new [] { "" };
    }
    else
    {
        var rs = regexify(gxs.Skip(1)).ToArray();
        return
            from f in gxs.Take(1)
            from z in new [] { string.Join("|", f) }.Concat(f.Count() > 1 ? generate(f) : Enumerable.Empty<string>())
            from r in rs
            select f.Key + (f.Count() == 1 ? z : $"({z})") + (r != "" ? "|" + r : "");
    }
};

Then, using this query:

generate(skus).OrderBy(x => x).OrderBy(x => x.Length);

...I got this result:

bat(pag003|tw(lp0(3|4)|sp04))|spifatb01 
bat(pag003|twlp0(3|4)|twsp04)|spifatb01 
ba(tpag003|ttw(lp0(3|4)|sp04))|spifatb01 
bat(pag003|tw(lp(03|04)|sp04))|spifatb01 
bat(pag003|tw(lp03|lp04|sp04))|spifatb01 
bat(pag003|twlp(03|04)|twsp04)|spifatb01 
batpag003|battw(lp0(3|4)|sp04)|spifatb01 
ba(tpag003|tt(wlp0(3|4)|wsp04))|spifatb01 
ba(tpag003|ttw(lp(03|04)|sp04))|spifatb01 
ba(tpag003|ttw(lp03|lp04|sp04))|spifatb01 
ba(tpag003|ttwlp0(3|4)|ttwsp04)|spifatb01 
bat(pag003|twl(p0(3|4))|twsp04)|spifatb01 
bat(pag003|twl(p03|p04)|twsp04)|spifatb01 
batpag003|batt(wlp0(3|4)|wsp04)|spifatb01 
batpag003|battw(lp(03|04)|sp04)|spifatb01 
batpag003|battw(lp03|lp04|sp04)|spifatb01 
ba(tpag003|tt(wlp(03|04)|wsp04))|spifatb01 
ba(tpag003|ttwlp(03|04)|ttwsp04)|spifatb01 
bat(pag003|twlp03|twlp04|twsp04)|spifatb01 
batpag003|batt(wlp(03|04)|wsp04)|spifatb01 
ba(tpag003|tt(wl(p0(3|4))|wsp04))|spifatb01 
ba(tpag003|tt(wl(p03|p04)|wsp04))|spifatb01 
ba(tpag003|tt(wlp03|wlp04|wsp04))|spifatb01 
ba(tpag003|ttwl(p0(3|4))|ttwsp04)|spifatb01 
ba(tpag003|ttwl(p03|p04)|ttwsp04)|spifatb01 
batpag003|batt(wl(p0(3|4))|wsp04)|spifatb01 
batpag003|batt(wl(p03|p04)|wsp04)|spifatb01 
batpag003|batt(wlp03|wlp04|wsp04)|spifatb01 
batpag003|battwlp0(3|4)|battwsp04|spifatb01 
batpag003|battwlp(03|04)|battwsp04|spifatb01 
ba(tpag003|ttwlp03|ttwlp04|ttwsp04)|spifatb01 
batpag003|battwl(p0(3|4))|battwsp04|spifatb01 
batpag003|battwl(p03|p04)|battwsp04|spifatb01 
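
With this many candidates it can be worth checking that a given pattern still accepts every SKU in full. A minimal sketch of such a check follows; `PatternCheck` and `MatchesAll` are hypothetical names, and the pattern is wrapped in a non-capturing group so that the anchors apply to the whole alternation rather than only its first and last branches.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Hypothetical helper: true when every SKU matches the pattern from start to end.
    public static bool MatchesAll(string pattern, IEnumerable<string> skus)
    {
        var regex = new Regex(@"\A(?:" + pattern + @")\z");
        return skus.All(s => regex.IsMatch(s));
    }
}
```

Calling `PatternCheck.MatchesAll("bat(pag003|tw(lp0(3|4)|sp04))|spifatb01", skus)` with the five SKUs above should return `true`. This only confirms that no SKU is rejected; it does not prove the pattern rejects everything else.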

The only problem with my approach was computation time. Some of my source lists have nearly 100 SKUs. Some of the runs took longer than I cared to wait, so I had to break the lists down into smaller chunks and then manually concatenate the results.
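
The chunk-and-concatenate step could probably be automated. The sketch below is only one possible way to do it: it assumes the chunks are formed by a shared prefix (the answer does not say how the lists were split), it takes the candidate generator as a parameter, and `ChunkedRegex`, `BuildInChunks` and `prefixLength` are names invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkedRegex
{
    // Hypothetical helper: splits the SKUs into chunks by a fixed-length prefix,
    // asks 'generate' for candidates per chunk, keeps the shortest candidate,
    // and joins the chunk patterns with '|'.
    public static string BuildInChunks(
        IEnumerable<string> skus,
        Func<IEnumerable<string>, IEnumerable<string>> generate,
        int prefixLength = 3)
    {
        IEnumerable<string> parts = skus
            .GroupBy(s => s.Substring(0, Math.Min(prefixLength, s.Length)))
            .Select(g => g.Count() == 1
                ? g.First()
                // Fall back to a plain alternation if no candidate is produced.
                : generate(g).OrderBy(r => r.Length).FirstOrDefault()
                  ?? string.Join("|", g));

        return string.Join("|", parts);
    }
}
```

Usage would look like `ChunkedRegex.BuildInChunks(skus, generate)`, where `generate` is the function defined above. Each chunk is searched independently, so the result is not guaranteed to be as short as a search over the whole list.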

