Data structure for grouping strings in a collection when they share common substrings [closed]

I am looking for a data structure and an algorithm to manage a dynamic collection of strings, but grouping strings that have a substring in common. I try to describe it through an example. @Christophe: let's take "sample", "oo","foo","food", the three last being in a group. What if I add "good" which also contains "oo", will it be in a separate group? And what if I then add "od" ? Shall strings belong to several groups? If yes, would you iterate over them several times? My current idea is, I will use fuzzy search to process it. By using fuzzy search algorithm, "oo", "foo", "food", "good", and "oo" are all in the same group. It is better suited for checking files with potential similar names, and do not need check every element when I get a new element, just care about the lognest element(s) in each group. There is no need for precise results here. Bitap algorithm looks good. It can be extended to 256-bit SIMD version, very suitable for file name comparisons. And for matching of exactly patterns, It seems that I should check all elements by online algorithm... Do I get something wrong? Another consideration or discussion is, NTFS file system can can index file names in an efficient way, so is it necessary or able to find an offline solution for optimization of the query? If the answer is yes, I have no idea for processing of file names. It may contains 256 unciode characters. The offline algorithms what I know, have no advantage in terms of memory.

Feb 11, 2025 - 11:45
 0
Data structure for grouping strings in a collection when they share common substrings [closed]

I am looking for a data structure and an algorithm to manage a dynamic collection of strings, but grouping strings that have a substring in common. I try to describe it through an example.

@Christophe:

let's take "sample", "oo","foo","food", the three last being in a group. What if I add "good" which also contains "oo", will it be in a separate group? And what if I then add "od" ? Shall strings belong to several groups? If yes, would you iterate over them several times?

My current idea is, I will use fuzzy search to process it. By using fuzzy search algorithm, "oo", "foo", "food", "good", and "oo" are all in the same group. It is better suited for checking files with potential similar names, and do not need check every element when I get a new element, just care about the lognest element(s) in each group. There is no need for precise results here.

Bitap algorithm looks good. It can be extended to 256-bit SIMD version, very suitable for file name comparisons.

And for matching of exactly patterns, It seems that I should check all elements by online algorithm...

Do I get something wrong?

Another consideration or discussion is, NTFS file system can can index file names in an efficient way, so is it necessary or able to find an offline solution for optimization of the query?

If the answer is yes, I have no idea for processing of file names. It may contains 256 unciode characters. The offline algorithms what I know, have no advantage in terms of memory.