English | Japanese

libsary Reference Manual

Last Modified: 2002-10-22 (Since: 2002-10-22)


Table of Contents

Sample Programs

Construction of Suffix Array

Construct the suffix array for the file_name with index points assigned to character by character in UTF-8 encoding and store the resulting suffix array to the file named file_name.ary. Consult mksary.c for a detailed example.

#include <stdlib.h>
#include <errno.h>
#include <sary.h>
  
int
main (int argc, char **argv)
{
    char *file_name;
    SaryInt ipoints;
    gboolean status;
    SaryBuilder *builder;

    if (argc != 2) exit(2);
    file_name = argv[1];

    builder = sary_builder_new(file_name);
    sary_builder_set_ipoint_func(builder, sary_ipoint_char_utf8);

    ipoints = sary_builder_index(builder);
    if (ipoints == -1) {
        g_print("error: %s(.ary): %s\n", file_name, g_strerror(errno));
        exit(2);
    }

    status = sary_builder_sort(builder);
    if (status == FALSE) {
        g_print("error: %s(.ary): %s\n", file_name, g_strerror(errno));
        exit(2);
    }
    sary_builder_destroy(builder);
    return 0;
}

Search with Suffix Array

Search the file_name for the pattern and sort the results in occurrence order and print them line by line. Suffix array made for file_name MUST exist. Consult sary.c for a detailed example.

#include <stdlib.h>
#include <errno.h>
#include <sary.h>

int
main (int argc, char **argv)
{
    SarySearcher *searcher;
    char *pattern;
    char *file_name;

    if (argc != 3) exit(2);
    pattern = argv[1];
    file_name = argv[2];

    searcher = sary_searcher_new(file_name);
    if (searcher == NULL) {
        g_print("error: %s(.ary): %s\n", file_name, g_strerror(errno));
        exit(2);
    }

    if (sary_searcher_search(searcher, pattern)) {
        gchar *line;

        sary_searcher_sort_occurrences(searcher);

        while ((line = sary_searcher_get_next_line(searcher))) {
            g_print("%s", line);
            g_free(line);
        }
    }
    sary_searcher_destroy(searcher);
    return 0;
}

How to Compile

Compiling program.c which uses libsary can be done as the following. Use of autoconf, automake, and libtool is recommended for the real development.

  % gcc program.c -o program `pkg-config sary --libs` `pkg-config sary --cflags`

Construction of Suffix Array

SaryBuilder* sary_builder_new (const gchar *file_name)
Create SaryBuilder object for file_name. It is for assining index points and sorting a suffix array. Return the created object. Return NULL if faled.
SaryBuilder* sary_builder_new2 (const gchar *file_name, const gchar *array_name);
This function is identical to sary_builder_new except the array_name parameter specifying the file name of the suffix array.
void sary_builder_destroy (SaryBuilder *builder);
Destruct SaryBuilder object.
void sary_builder_set_ipoint_func (SaryBuilder *builder, SaryIpointFunc ipoint_func);
Set the function for assigning index points. There are built-in functions for next_ipoint but user-defined functions following SaryIpointFunc API can be used.
SaryInt sary_builder_index (SaryBuilder *builder);
Assign index points. Return the number of index points assigned. Return -1 if failed.
gboolean sary_builder_sort (SaryBuilder *builder);
Sort a suffix array. Return TRUE if success. Return FALSE if failed.
gboolean sary_builder_block_sort (SaryBuilder *builder);
Sort a suffix array by memory-saving block sorting. Return TRUE if success. Return FALSE if failed.
void sary_builder_set_block_size (SaryBuilder *builder, SaryInt block_size);
Set the block size for sary_builder_block_sort.
void sary_builder_set_nthreads (SaryBuilder *builder, SaryInt nthreads);
Set the number of threads for doing sary_builder_block_sort. Performance will improve if your machine has two or more CPUs.
void sary_builder_connect_progress (SaryBuilder *builder, SaryProgressFunc progress_func, gpointer progress_func_data);
Connect a function and data for displaying a progress bar to SaryBuilder object.

Assigning Index Points

gchar* sary_ipoint_char_ascii (SaryText *text);
Assign index points character by character in ASCII encoding (1 octet). Return the pointer of the assigned index point. Return NULL if the end of file reached.
gchar* sary_ipoint_char_eucjp (SaryText *text);
Assign index points character by character in EUC-JP encoding. Return the pointer of the assigned index point. Return NULL if the end of file reached.
gchar* sary_ipoint_char_sjis (SaryText *text);
Assign index points character by character in Shift_JIS encoding. Return the pointer of the assigned index point. Return NULL if the end of file reached.
gchar* sary_ipoint_char_utf8 (SaryText *text);
Assign index points character by character in UTF-8 encoding. Return the pointer of the assigned index point. Return NULL if the end of file reached.
gchar* sary_ipoint_line (SaryText *text);
Assign index points line by line. Return the pointer of the assigned index point. Return NULL if the end of file reached.
gchar* sary_ipoint_word (SaryText *text);
Assign index points word by word. The word is a string delimited by whitespaces. Return the pointer of the assigned index point. Return NULL if the end of file reached.

Search with Suffix Array

NOTE: sary_searcher_search2, sary_searcher_get_next_line2, sary_searcher_get_next_context_lines2, and sary_searcher_get_next_tagged_region2 are especially prepared for ones who want to write bindings for scripting languages or performance enthusiasts. Since each string in these functions is handled with its length, the string can contain '\0' in it. No strings are newly created.

SarySearcher* sary_searcher_new (const gchar *file_name);
Create SarySearcher object for file_name. It handles search and its results. SarySearcher stands for Suffix Array Searcher. Return the created object. Return NULL if faled.
SarySearcher* sary_searcher_new2 (const gchar *file_name, const gchar *array_name)
This function is identical to sary_array_sort except the array_name parameter specifying the file name of the suffix array.
void sary_searcher_destroy (SarySearcher *searcher);
Destruct SarySearcher object.
void sary_searcher_enable_cache (SarySearcher *searcher);
Enable the cache engine. Cache the search results and reuse them for the same pattern later.
gboolean sary_searcher_search (SarySearcher *searcher, const gchar *pattern);
Search for the pattern. Return TRUE if success. Return FALSE if failed.
gboolean sary_searcher_search2 (SarySearcher *searcher, const gchar *pattern, SaryInt len);
Search for the pattern whose length is specified with len. Return TRUE if success. Return FALSE if failed.
gboolean sary_searcher_isearch (SarySearcher *searcher, const gchar *pattern, SaryInt len);
Similar to sary_searcher_search2 but this function does efficient incremental searchs. Each search is performed for the range of the previous search results (the first time is identical to sary_searcher_search2). Call the function continuously with the same pattern and increase len incrementally to the length of the pattern to do incremental searchs. sary_searcher_sort_occurrences MUST not be used together. Return TRUE if success. Return FALSE if failed.
gboolean sary_searcher_isearch_reset (SarySearcher *searcher)
Reset internal states stored for sary_searcher_isearch. To use sary_searcher_isearch with another pattern again, you should call this function beforehand.
gboolean sary_searcher_icase_search (SarySearcher *searcher, const gchar *pattern);
Do case-insensitive search for the pattern. Return TRUE if success. Return FALSE if failed.
gboolean sary_searcher_icase_search2 (SarySearcher *searcher, const gchar *pattern, SaryInt len);
Do case-insensitive search for the pattern whose length is specified with len. Return TRUE if success. Return FALSE if failed.
SaryText* sary_searcher_get_text (SarySearcher *searcher);
Return the SaryText object of the searcher.
SaryMmap* sary_searcher_get_array (SarySearcher *searcher);
Return the Mmaped suffix array of the searcher.
gchar* sary_searcher_get_next_line (SarySearcher *searcher);
Get the next search result line by line as newly created string. The all results can be retrieved by calling the functions continuously. Return NULL if no more results. The returned value MUST be `g_free'ed by caller.
gchar* sary_searcher_get_next_line2 (SarySearcher *searcher, SaryInt *len);
Get the next search result line by line as a pointer. Store the length of the string into len. The all results can be retrieved by calling the functions continuously. Return NULL if no more results.
gchar* sary_searcher_get_next_context_lines (SarySearcher *searcher, gint backward, gint forward);
Get the next search result as context lines as newly created string. The string includes the line feed character '\n' at the end. The all results can be retrieved by calling the functions continuously. Return NULL if no more results. The returned value MUST be `g_free'ed by caller.
gchar* sary_searcher_get_next_context_lines2 (SarySearcher *searcher, SaryInt backward, SaryInt forward, SaryInt *len);
Get the next search result as context lines as a pointer. Store the length of the string into len. The string includes the line feed character '\n' at the end. The all results can be retrieved by calling the functions continuously. Return NULL if no more results.
gchar* sary_searcher_get_next_tagged_region (SarySearcher *searcher, const gchar *start_tag, const gchar *end_tag);
Get the next search result as tagged regions between start_tad and end_tag (including start_tag and end_tag) as newly created string. The all results can be retrieved by calling the functions continuously. Return NULL if no more results. The returned value MUST be `g_free'ed by caller.
gchar* sary_searcher_get_next_tagged_region2 (SarySearcher *searcher, const gchar *start_tag, SaryInt start_tag_len, const gchar *end_tag, SaryInt end_tag_len, SaryInt *len); Get the next search result as tagged regions between start_tad and end_tag (including start_tag and end_tag) as a pointer. Store the length of the string into len. The all results can be retrieved by calling the functions continuously. Return NULL if no more results.
SaryText* sary_searcher_get_next_occurrence (SarySearcher *searcher);
Get the next search result as SaryText object. The function is useful for low-level text processing. The all results can be retrieved by calling the functions continuously. Return NULL if no more results.
SaryInt sary_searcher_get_next_position (SarySearcher *searcher);
Get the next search result as SaryInt. The all results can be retrieved by calling the functions continuously. Return -1 if no more results.
SaryInt sary_searcher_count_occurrences (SarySearcher *searcher);
Return the number of hits of the search.
void sary_searcher_sort_occurrences (SarySearcher *searcher);
Sort the occurrences (search results) in occurrence order.

Text Processing

SaryText object is used for text processing. The object has the state called cursor. Operations for the object are performed with the cursor state.

SaryText* sary_text_new (const gchar *file_name);
Create SaryText object for the file_name. Return the created object. Return NULL if failed.
void sary_text_destroy (SaryText *text);
Destruct SaryText object.
SaryInt sary_text_get_lineno (SaryText *text);
Return the line number of the cursor position. (No useful for libsary)
void sary_text_set_lineno (SaryText *text, SaryInt lineno);
Set the line number of the cursor position. (No useful for libsary)
SaryInt sary_text_get_linelen (SaryText *text);
Return the line length of the cursor position. The length includes the line feed character '\n'. Return 0 if the cursor is on the end of file.
gchar* sary_text_get_line (SaryText *text);
Return the newly creatd string of the line of the cursor position. The string includes the line feed character '\n'. Return NULL if cursor is on the end of file. The returned value MUST be `g_free'ed by caller.
gchar* sary_text_get_region (SaryText *cursor, SaryInt len);
Return the newly creatd string of the region starting at the the cursor position through len. The returned value MUST be `g_free'ed by caller.
gboolean sary_text_is_eof (SaryText *text);
Return TRUE if the cursor is on the end of file. Return FALSE otherwise.
gchar* sary_text_get_cursor (SaryText *text);
Return the cursor position.
void sary_text_set_cursor (SaryText *text, gchar *cursor);
Set the cursor position.
gchar* sary_text_get_bof (SaryText *text);
Return the beginning of file.
gchar* sary_text_get_eof (SaryText *text);
Return the end of file.
gchar* sary_text_goto_next_line (SaryText *text);
Forward the cursor position to the next line. Return the moved cursor position.
gchar* sary_text_goto_next_word (SaryText *text);
Forward the cursor position to the next word. The word is a string delimited by whitespaces. Return the moved cursor position.
gchar* sary_text_goto_bol (SaryText *text);
Forward the cursor position to the beginning of line. Return the moved cursor position.
gchar* sary_text_goto_eol (SaryText *text);
Forward the cursor position to the end of line. Return the moved cursor position.
gchar* sary_text_forward_cursor (SaryText *text, SaryInt step);
Forward the cursor position step bytes. Return the moved cursor position.
gchar* sary_text_backward_cursor (SaryText *text, SaryInt step);
Back the cursor position step bytes. Return the moved cursor position.

Displaying a Progress Bar

Please consult progress_bar function in mksary.c

Appendix: Application of Scripting Languages

Assigning of index points can be performed easily with scripting languages such as Perl. Application of the languages help if advanced text processing is needed. The following example shows the way to assign index points line by line.

  % cat line-indexer.pl
  $offset = 0;
  while (<>) {
      print pack 'N', $offset;
      $offset += length;
  }

  % perl line-indexer.pl foobar.txt > foobar.txt.ary

Then, sort the resulting ary file to construct the suffix array. This work can be done with mkary -s command.

  % mksary -s foobar.txt

Satoru Takabayashi