Fig. D.1.1 illustrates the pattern for the case r = 4 and s = 2. For each 6-gram of residues, NP(4,2) represents an ensemble of 10 different patterns. Each pattern retains r = 4 of the 6 residues, always including the leading position, and replaces the other s = 2 positions with wild cards (indicated by red asterisks in the figure). Because the wild cards can fall on any 2 of the 5 non-leading positions, there are 5!/(3! 2!) = 10 patterns.
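The enumeration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the studies: the function names `np_patterns` and `sequence_patterns`, the use of `*` as the wild-card character, and the example 6-gram are all assumptions introduced here.

```python
from itertools import combinations


def np_patterns(window, r=4, s=2, wildcard="*"):
    """Return the NP(r, s) patterns for one (r + s)-gram (hypothetical helper).

    The leading position is always retained; s of the remaining
    r + s - 1 positions are replaced by wild cards.
    """
    n = r + s
    if len(window) != n:
        raise ValueError(f"expected a {n}-gram, got {len(window)} residues")
    patterns = []
    # Position 0 is never a wild card, so choose wild-card positions from 1..n-1.
    for wild in combinations(range(1, n), s):
        pattern = "".join(
            wildcard if i in wild else residue for i, residue in enumerate(window)
        )
        patterns.append(pattern)
    return patterns


def sequence_patterns(sequence, r=4, s=2):
    """Yield (start_index, patterns) for every (r + s)-gram in a sequence."""
    n = r + s
    for start in range(len(sequence) - n + 1):
        yield start, np_patterns(sequence[start:start + n], r, s)


if __name__ == "__main__":
    # Arbitrary example 6-gram, chosen only for illustration.
    for p in np_patterns("ACDEFG"):
        print(p)
```

For r = 4 and s = 2 the loop runs over the C(5,2) = 10 ways of placing two wild cards among the five non-leading positions, which matches the count given in the text.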
We conducted a series of studies and concluded that NP(4,2) is the optimal pattern for further study because: (1) the probability of occurrence of NP(4,2) patterns is low enough that a pattern is unlikely to occur more than once by random chance in an average-length sequence; (2) the occurrence of two non-overlapping NP(4,2) patterns in a sequence is sufficient to establish family membership in most cases; (3) NP(4,2) patterns inherently capture all combinations of 1-, 2-, 3- and 4-grams; (4) the NP(4,2) patterns are computationally tractable; and (5) the window associated with NP(4,2) patterns is large enough to capture the periodicities associated with secondary structures.