Added SpaCy-based semantic and string search flags to the initial search, with a fallback to rapidfuzz. Split the code into two modules, core and main, for easier handling. Added threading and a background scheduler that refreshes the cache every 8 hours; on startup the initial stale cache is loaded and then refreshed in the background.
Search Explanation for IggyAPI
This document explains the two search modes available in IggyAPI's /igs/search endpoint: semantic and string. It details how each mode works, their strengths, and the types of queries each is best suited for, helping users choose the appropriate mode for their search needs.
Overview of the /igs/search Endpoint
The /igs/search endpoint allows users to search for FHIR Implementation Guides (IGs) by providing a query string. The endpoint supports two search modes, specified via the search_type query parameter:
- semantic (default): Matches based on the meaning of the query and package metadata, with a fallback to string-based matching.
- string: Matches based on token similarity and exact/near-exact string matching.
Both modes operate on a pre-filtered list of packages, where the query words must be present in the package name or author field. The search then applies the specified similarity matching to rank and return results.
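As a rough illustration of how a client might select a mode, the sketch below builds a request URL for the endpoint. Only the /igs/search path and the search_type parameter come from this document; the parameter name `query`, the helper `build_search_url`, and the host `http://localhost:8000` are assumptions made for the example.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, search_type: str = "semantic") -> str:
    """Build a GET URL for /igs/search (hypothetical helper)."""
    if search_type not in ("semantic", "string"):
        raise ValueError("search_type must be 'semantic' or 'string'")
    # Assumption: the query text is passed in a parameter named "query".
    params = urlencode({"query": query, "search_type": search_type})
    return f"{base_url.rstrip('/')}/igs/search?{params}"


# Assumed local host and port, for illustration only.
print(build_search_url("http://localhost:8000", "au core", "string"))
```

Omitting search_type in this sketch falls back to semantic, mirroring the default described above.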
Pre-filtering Step (Common to Both Modes)
Before applying either search mode, IggyAPI filters the list of available packages to reduce the search space:
- How It Works:
- The query is split into individual words (e.g., "au core" becomes ["au", "core"]).
- Packages are included in the filtered list only if all query words are present in either the package_name or author field (case-insensitive).
- For example, the query "au core" will include packages like "hl7.fhir.au.core" because both "au" and "core" are substrings of the package name.
- Purpose:
- This step ensures that only potentially relevant packages are passed to the similarity matching phase, improving performance by reducing the number of comparisons.
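The pre-filter described above can be sketched in a few lines. This is a minimal illustration, not IggyAPI's actual code: the function name `prefilter`, the dict-based package records, and the sample authors are assumptions, and it reads "present in either field" as a per-word check.

```python
def prefilter(packages: list[dict], query: str) -> list[dict]:
    """Keep packages where every query word is a substring of the
    package_name or the author field (case-insensitive)."""
    words = query.lower().split()
    kept = []
    for pkg in packages:
        name = (pkg.get("package_name") or "").lower()
        author = (pkg.get("author") or "").lower()
        # Each word may be found in either field (assumed interpretation).
        if all(word in name or word in author for word in words):
            kept.append(pkg)
    return kept


# Made-up sample records for illustration.
sample = [
    {"package_name": "hl7.fhir.au.core", "author": "Org A"},
    {"package_name": "hl7.fhir.us.core", "author": "Org B"},
]
print([p["package_name"] for p in prefilter(sample, "au core")])
```

Because this is a substring test rather than a whole-word test, short words such as "au" can also match inside longer words in either field.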
Semantic Search Mode (search_type=semantic)
How It Works
- Primary Matching (SpaCy Semantic Similarity):
- Uses SpaCy's en_core_web_md model to compute the semantic similarity between the query and a combined text of the package's name, description, and author.
- SpaCy processes both the query and the combined text into Doc objects, then uses word embeddings to calculate a similarity score (typically between 0 and 1) based on the meaning of the texts.
- A package is included in the results if the similarity score exceeds a threshold of 0.3.
- Fallback (Rapidfuzz String Matching):
- If the semantic similarity score is below the threshold, the search falls back to rapidfuzz's partial_ratio method for string-based matching.
- partial_ratio computes a score (0 to 100) based on how closely the query matches substrings in the name, description, or author fields.
- A package is included if the rapidfuzz score exceeds 70.
- Result Ranking:
- Results are ranked by their similarity scores (semantic or rapidfuzz), with an adjustment factor applied based on the source of the match (e.g., matches in the name field are weighted higher).
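The primary-then-fallback flow above might look roughly like the following. This is a sketch under stated assumptions: the helper names (`combined_text`, `match_semantic`, `load_real_scorers`), the dict keys, and the choice to take the best rapidfuzz score across the three fields are illustrative, not IggyAPI's confirmed implementation. The scorers are passed in as functions so the flow can be read and run without the SpaCy model installed.

```python
from typing import Callable, Optional, Tuple

Scorer = Callable[[str, str], float]

SEMANTIC_THRESHOLD = 0.3  # threshold quoted in this document
FUZZY_THRESHOLD = 70      # threshold quoted in this document
FIELDS = ("name", "description", "author")


def combined_text(pkg: dict) -> str:
    """Join name, description, and author into one text for SpaCy."""
    return " ".join(str(pkg.get(f) or "") for f in FIELDS).strip()


def match_semantic(query: str, pkg: dict,
                   semantic_score: Scorer,
                   fuzzy_score: Scorer) -> Optional[Tuple[str, float]]:
    """Return (method, score) if the package matches, else None."""
    score = semantic_score(query, combined_text(pkg))
    if score > SEMANTIC_THRESHOLD:
        return ("semantic", score)
    # Fallback. Assumption: score each field and keep the best one.
    fuzzy = max(fuzzy_score(query, str(pkg.get(f) or "")) for f in FIELDS)
    if fuzzy > FUZZY_THRESHOLD:
        return ("rapidfuzz", fuzzy)
    return None


def load_real_scorers() -> Tuple[Scorer, Scorer]:
    """Build scorers backed by SpaCy and rapidfuzz (both must be installed)."""
    import spacy
    from rapidfuzz import fuzz

    nlp = spacy.load("en_core_web_md")

    def semantic(query: str, text: str) -> float:
        return nlp(query).similarity(nlp(text))

    return semantic, fuzz.partial_ratio


# Demo with stand-in scorers so the flow is visible without any model.
pkg = {"name": "hl7.fhir.au.core", "description": "", "author": ""}
print(match_semantic("au core", pkg, lambda q, t: 0.1, lambda q, t: 85.0))
```

With both libraries installed, `match_semantic(query, pkg, *load_real_scorers())` would run the same flow against real scores.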
Strengths
- Meaning-Based Matching:
- Excels at finding packages that are conceptually similar to the query, even if the exact words differ. For example, a query like "healthcare standard" might match "hl7.fhir.core" because SpaCy understands the semantic relationship between "healthcare" and "fhir".
- Context Awareness:
- Takes into account the description and author fields, providing a broader context for matching. This can help when package names alone are not descriptive enough.
- Robust Fallback:
- The rapidfuzz fallback ensures that technical queries (e.g., "au core") that might fail semantic matching still return relevant results based on string similarity.
Best Suited For
- Conceptual Queries:
- Queries where the user is looking for packages related to a concept or topic, rather than an exact name (e.g., "patient data" or "clinical standards").
- Natural Language Queries:
- Longer or more descriptive queries where semantic understanding is beneficial (e.g., "Australian healthcare profiles").
- General Exploration:
- When the user is exploring and might not know the exact package name but has a general idea of what they’re looking for.
Limitations
- Technical Queries:
- May struggle with short, technical queries (e.g., "au core") if the semantic similarity score is too low, although the rapidfuzz fallback mitigates this.
- Tokenization Issues:
- SpaCy’s tokenization of package names (e.g., splitting "hl7.fhir.au.core" into "hl7", "fhir", "au", "core") can dilute the semantic match for queries that rely on specific terms.
- Threshold Sensitivity:
- The semantic similarity threshold (0.3) might still exclude some relevant matches if the query and package metadata are semantically distant, even with the fallback.
String Search Mode (search_type=string)
How It Works
- Primary Matching (SpaCy Token Similarity):
- Uses SpaCy to compute a token-based similarity score between the query and the combined text of the package’s name, description, and author.
- Unlike semantic mode, this focuses more on token overlap rather than deep semantic meaning, but still uses SpaCy’s similarity method.
- A package is included if the token similarity score exceeds a threshold of 0.7.
- Fallback (Rapidfuzz String Matching):
- If the token similarity score is below the threshold, the search falls back to rapidfuzz’s partial_ratio method.
- partial_ratio computes a score (0 to 100) based on how closely the query matches substrings in the name, description, or author fields.
- A package is included if the rapidfuzz score exceeds 70.
- Result Ranking:
- Results are ranked by their similarity scores (token similarity or rapidfuzz), with an adjustment factor applied based on the source of the match (e.g., matches in the name field are weighted higher).
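The ranking step is only described in outline, so the following is a hypothetical sketch of one way an adjustment factor could be applied. The weight values, the rescaling of rapidfuzz scores from 0 to 100 down to 0 to 1, and the record layout are all assumptions chosen for illustration; the document does not state the actual factors.

```python
# Hypothetical weights: the document only says name matches are weighted higher.
FIELD_WEIGHTS = {"name": 1.2, "description": 1.0, "author": 1.0}


def adjusted_score(match: dict) -> float:
    """Put both score types on a 0-1 scale, then apply the field weight."""
    base = match["score"]
    if match["method"] == "rapidfuzz":
        base = base / 100.0  # rapidfuzz scores run from 0 to 100
    return base * FIELD_WEIGHTS.get(match["field"], 1.0)


def rank(matches: list[dict]) -> list[dict]:
    """Sort matches from highest to lowest adjusted score."""
    return sorted(matches, key=adjusted_score, reverse=True)


# Made-up match records for illustration.
matches = [
    {"package": "a", "score": 0.75, "method": "spacy", "field": "description"},
    {"package": "b", "score": 80, "method": "rapidfuzz", "field": "name"},
    {"package": "c", "score": 0.5, "method": "spacy", "field": "name"},
]
print([m["package"] for m in rank(matches)])
```

Rescaling before weighting keeps SpaCy and rapidfuzz results comparable in one list; without it, any rapidfuzz match would outrank every SpaCy match.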
Strengths
- Exact and Near-Exact Matching:
- Excels at finding packages where the query closely matches the package name or author, even with minor variations (e.g., "au core" matches "hl7.fhir.au.core").
- Technical Queries:
- Performs well with short, technical queries that are likely to appear as substrings in package names (e.g., "au core", "fhir r4").
- Reliable Fallback:
- The rapidfuzz fallback ensures that even if SpaCy’s token similarity fails, string-based matching will catch relevant results.
Best Suited For
- Exact Name Searches:
- Queries where the user knows part of the package name or author and wants an exact or near-exact match (e.g., "au core", "hl7 fhir").
- Technical Queries:
- Short queries that correspond to specific terms or abbreviations in package names (e.g., "r4", "us core").
- Precise Matching:
- When the user prioritizes string similarity over conceptual similarity, ensuring that results closely match the query text.
Limitations
- Lack of Semantic Understanding:
- Does not consider the meaning of the query, so it might miss conceptually related packages if the exact words differ (e.g., "healthcare standard" might not match "hl7.fhir.core" as well as in semantic mode).
- Token Overlap Dependency:
- The initial SpaCy token similarity might still fail for queries with low overlap, relying heavily on the rapidfuzz fallback.
- Less Contextual:
- While it considers description and author, it’s less effective at leveraging these fields for broader context compared to semantic mode.
Choosing the Right Search Mode
- Use semantic Mode When:
- You’re searching for packages related to a concept or topic (e.g., "patient data", "clinical standards").
- Your query is descriptive or in natural language (e.g., "Australian healthcare profiles").
- You’re exploring and want to find packages that are conceptually similar, even if the exact words differ.
- Example: Searching for "healthcare standard" to find "hl7.fhir.core".
- Use string Mode When:
- You know part of the package name or author and want an exact or near-exact match (e.g., "au core", "hl7 fhir").
- Your query is short and technical, likely matching specific terms in package names (e.g., "r4", "us core").
- You prioritize precise string matching over conceptual similarity.
- Example: Searching for "au core" to find "hl7.fhir.au.core".
Example Scenarios
Scenario 1: Searching for "au core"
- Semantic Mode:
- SpaCy might compute a low semantic similarity score between "au core" and "hl7.fhir.au.core Martijn Harthoorn" due to tokenization and semantic distance.
- However, the rapidfuzz fallback will match "au core" to "hl7.fhir.au.core" with a high score (e.g., 85), ensuring the package is included in the results.
- String Mode:
- SpaCy’s token similarity might also be low, but rapidfuzz will match "au core" to "hl7.fhir.au.core" with a high score, returning the package.
- Best Mode: string, as this is a technical query aiming for an exact match. However, semantic mode will now also work due to the rapidfuzz fallback.
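To see why a substring-style score comes out high for this query, here is a simplified stand-in for a partial-ratio score built on Python's standard difflib. It is not rapidfuzz's algorithm, and the function name `approx_partial_ratio` is made up for this example; it only slides the shorter string across equal-length windows of the longer one and keeps the best ratio.

```python
from difflib import SequenceMatcher


def approx_partial_ratio(query: str, text: str) -> float:
    """Best difflib ratio (scaled to 0-100) between the shorter string
    and every equal-length window of the longer string."""
    short, long_ = sorted((query, text), key=len)
    if not short:
        return 0.0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return best * 100.0


score = approx_partial_ratio("au core", "hl7.fhir.au.core")
print(round(score, 1), score > 70)
```

The best window is "au.core", which differs from "au core" in one character out of seven, so this stand-in clears the threshold of 70 comfortably. Scores from rapidfuzz itself may differ in detail.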
Scenario 2: Searching for "healthcare standard"
- Semantic Mode:
- SpaCy will compute a higher semantic similarity score between "healthcare standard" and "hl7.fhir.core Martijn Harthoorn" because of the conceptual alignment between "healthcare standard" and "fhir".
- The package is likely to exceed the 0.3 threshold and be included in the results.
- String Mode:
- SpaCy’s token similarity might be low because "healthcare standard" doesn’t directly overlap with "hl7.fhir.core".
- Rapidfuzz might also fail if the string match isn’t close enough, potentially excluding the package.
- Best Mode: semantic, as this query is conceptual and benefits from meaning-based matching.
Conclusion
IggyAPI’s search functionality provides two complementary modes to cater to different user needs:
- semantic Mode: Best for conceptual, descriptive, or exploratory searches where understanding the meaning of the query is key. It now includes a string-based fallback to handle technical queries better.
- string Mode: Best for precise, technical searches where the user knows part of the package name or author and wants an exact or near-exact match.
By understanding the strengths of each mode, users can choose the most appropriate search_type for their query, ensuring optimal search results.