Efficient Translation From Edit Distance To Hamming Distance, Allowing Faster And More Precise Analysis, Search And Indexing Of Data

Efficient Translation from Edit Distance to Hamming Distance, Allowing Faster and More Precise Analysis, Search and Indexing of Data

UC Case No. 2005-431

SUMMARY:

UCLA researchers have developed a new approach to more efficiently search and compare strings that approximately match, thus allowing for much more efficient search for applications.

BACKGROUND:

Many data-intensive applications require computationally intensive algorithms for approximate string matching. Examples include text editors, database archiving, internet search-engines, and bioinformatics applications. For example, sequences of DNA or proteins are routinely searched against one another to determine biological similarity. The edit distance between two strings, the minimum number of character changes, insertions and deletes to map from one string to another, is usually hailed as the one of the best measures for accuracy. Unfortunately, calculating edit distances for hundreds of sequences, which is often the case, is extremely inefficient.

Many heuristic algorithms such as BLAST and FASTA have been developed to overcome this inefficiency. However, the innovation disclosed here provides a faster way to handle edit distance (by transforming into a much simpler form) thus potentially speeding up a host of applications that need approximate matching using the edit distance.

INNOVATION:

UCLA researchers have developed a new approach to more efficiently search and compare strings that approximately match, thus allowing for much more efficient search for applications. They created a novel method for more efficient searching and comparing of letter sequences. The innovation lies in a method for mapping the sequences to a new set of strings, whose similarity can be compared using hamming distances instead of edit distances. The hamming distance similarity between these new strings is proportional the similarity of the original sequences, and more importantly, there are already existing far more efficient algorithms that can search and index strings using hamming distances.

POTENTIAL APPLICATIONS:

• More efficient data and string searching

• Potentially faster music or video search engines

• Improved data archiving routines

• Shorter algorithm running times for proteomic and genomic searches

• Potentially faster and more accurate ways to match internet viewable pages to appropriate advertisement pages that appear on the same page

• Other biocomputing applications that require fast approximate matching

• National security applications, where approximate patterns need to be discovered

ADVANTAGES:

• More efficient than edit distance

• More efficiently streamlined with other search algorithms