addition of Spacey, sematic and string flags in initial search, with fallback to rapid fuzz. split into two modules, for easier handling, core and main added threading, and background schedular to refresh cache on 8 hour windows. also to load initial stale cache on startup and refresh in the background.
7.7 KiB
IggyAPI Search Test Cases Summary
This document summarizes the search test cases executed on IggyAPI, as captured in the logs from 2025-05-13. It analyzes the behavior of the /igs/search
endpoint in semantic
mode, focusing on the frequent fallback to RapidFuzz and the reasons behind SpaCy’s semantic similarity failures.
Test Case Overview
The logs capture three search requests using search_type=semantic
, with the following queries: "beda", "au core", and "core". Each search follows a two-step process:
- Pre-filtering: Filters packages based on the presence of query words in the
package_name
orauthor
fields. - Similarity Matching:
- First attempts SpaCy semantic similarity (threshold: 0.3).
- If SpaCy fails or the similarity score is below the threshold, falls back to RapidFuzz string matching (threshold: 70).
Test Case 1: Query "beda"
- Timestamp: 2025-05-13 08:01:43
- Pre-filtering:
- Total packages: 1023
- Filtered packages: 4 (where "beda" appears in
package_name
orauthor
) - Matching packages:
de.abda.eRezeptAbgabedaten
,de.abda.eRezeptAbgabedatenBasis
,de.abda.eRezeptAbgabedatenPKV
,de.biv-ot.everordnungabgabedaten
- SpaCy Semantic Similarity:
- Failed with warning:
[W008] Evaluating Doc.similarity based on empty vectors
. - Likely cause: "beda" lacks a meaningful embedding in the
en_core_web_md
model, and the combined text (package_name
,description
,author
) for these packages also lacks meaningful embeddings (e.g., emptydescription
, technicalpackage_name
tokens like "abda").
- Failed with warning:
- Fallback to RapidFuzz:
- All 4 packages matched via RapidFuzz with a score of 100.0 (e.g., "beda" is a substring of "abda").
- Results:
- Total matches: 4
- SpaCy matches: 0 (0%)
- RapidFuzz matches: 4 (100%)
Test Case 2: Query "au core"
- Timestamp: 2025-05-13 08:02:37
- Pre-filtering:
- Total packages: 1023
- Filtered packages: 1 (where both "au" and "core" appear in
package_name
orauthor
) - Matching package:
hl7.fhir.au.core
- SpaCy Semantic Similarity:
- No semantic match logged, indicating the similarity score was below 0.3.
- Likely cause: "au" has a weak embedding, "core" is a common word, and the combined text ("hl7.fhir.au.core Martijn Harthoorn") includes noise from the
author
field. Thedescription
field is empty, providing no additional semantic context.
- Fallback to RapidFuzz:
- Matched via RapidFuzz with a score of 85.71 ("au core" matches "au.core" in the package name).
- Results:
- Total matches: 1
- SpaCy matches: 0 (0%)
- RapidFuzz matches: 1 (100%)
Test Case 3: Query "core"
- Timestamp: 2025-05-13 08:02:58
- Pre-filtering:
- Total packages: 1023
- Filtered packages: 77 (where "core" appears in
package_name
orauthor
) - Examples:
hl7.fhir.core
,hl7.fhir.au.core
,ch.fhir.ig.ch-core
, etc.
- SpaCy Semantic Similarity:
- Succeeded for 14 packages with similarity scores above 0.3:
- Examples:
ch.fhir.ig.ch-core
(0.6715787550273024),tiga.health.clinical
(0.6912820366040198). - Likely reason: These packages have non-empty
description
fields orauthor
names providing semantic context (e.g., healthcare-related terms).
- Examples:
- Failed for 63 packages (similarity below 0.3).
- Likely cause: Empty
description
fields, technicalpackage_name
tokens (e.g., "hl7", "fhir"), and noise fromauthor
fields dilute similarity scores.
- Succeeded for 14 packages with similarity scores above 0.3:
- Fallback to RapidFuzz:
- 63 packages matched via RapidFuzz with scores of 100.0 (e.g.,
hl7.fhir.core
,hl7.fhir.us.core
). - "core" is a direct substring of these package names, making RapidFuzz highly effective.
- 63 packages matched via RapidFuzz with scores of 100.0 (e.g.,
- Results:
- Total matches: 77
- SpaCy matches: 14 (18%)
- RapidFuzz matches: 63 (82%)
Overall Fallback Statistics
- Total Matches Across All Test Cases: 82 (4 + 1 + 77)
- SpaCy Matches: 14 (all from "core" query)
- RapidFuzz Matches: 68 (4 from "beda", 1 from "au core", 63 from "core")
- Fallback Rate: 68/82 = 83%
The app falls back to RapidFuzz for 83% of matches, indicating that SpaCy’s semantic similarity is not effective for most packages in these test cases.
Why the Frequent Fallback to RapidFuzz?
The high fallback rate to RapidFuzz is due to several issues with SpaCy’s semantic similarity in the context of these searches:
1. Weak Embeddings for Technical Terms
- Issue: Queries like "beda", "au", and "core" have weak or no embeddings in the
en_core_web_md
model:- "beda" is likely an out-of-vocabulary token.
- "au" is a short token, possibly treated as a stop word.
- "core" is a common English word but doesn’t align strongly with tokenized package names (e.g., "hl7", "fhir", "core").
- Impact: SpaCy cannot compute meaningful similarity scores, leading to fallbacks.
2. Empty or Sparse Package Descriptions
- Issue: Many FHIR packages have empty
description
fields (e.g.,hl7.fhir.au.core
has no description). - Impact: The combined text (
package_name
,description
,author
) lacks semantic content, reducing SpaCy’s ability to find meaningful matches. For example, "hl7.fhir.au.core Martijn Harthoorn" provides little semantic context beyond the package name.
3. Noise from Author Field
- Issue: The
author
field (e.g., "Martijn Harthoorn") is included in the combined text for SpaCy similarity. - Impact: Author names often add noise, diluting the semantic similarity between the query and the package’s core content (e.g., "au core" vs. "hl7.fhir.au.core Martijn Harthoorn").
4. Tokenization of Package Names
- Issue: SpaCy tokenizes package names like "hl7.fhir.au.core" into "hl7", "fhir", "au", "core".
- Impact: This splits the package name into parts, reducing the overall semantic alignment with the query (e.g., "au core" doesn’t match strongly with the tokenized form).
5. SpaCy Similarity Threshold (0.3)
- Issue: The threshold for SpaCy semantic similarity is set to 0.3, which is relatively strict.
- Impact: Many packages have similarity scores just below 0.3 (e.g., possibly 0.25 for "au core" vs. "hl7.fhir.au.core"), forcing a fallback to RapidFuzz.
6. Empty Vectors Warning
- Issue: For the query "beda", SpaCy raises a warning:
[W008] Evaluating Doc.similarity based on empty vectors
. - Impact: SpaCy fails entirely when either the query or the target text lacks meaningful embeddings, leading to an immediate fallback to RapidFuzz.
Why RapidFuzz Succeeds
- Mechanism: RapidFuzz uses the
partial_ratio
method, which computes a string similarity score (0 to 100) based on substring matches. - Effectiveness:
- "beda" matches "abda" in package names with a score of 100.0.
- "au core" matches "au.core" with a score of 85.71.
- "core" matches package names like "hl7.fhir.core" with a score of 100.0.
- Threshold: The RapidFuzz threshold is 70, which is lenient and captures most substring matches.
- No Dependency on Semantics: RapidFuzz doesn’t rely on word embeddings or descriptive content, making it effective for technical queries and packages with sparse metadata.
Conclusion
The IggyAPI search functionality in semantic
mode falls back to RapidFuzz for 83% of matches due to SpaCy’s limitations with technical queries, sparse package metadata, and a strict similarity threshold. SpaCy struggles with weak embeddings for terms like "beda", "au", and "core", empty package descriptions, noise from author fields, tokenized package names, and a 0.3 similarity threshold. RapidFuzz succeeds where SpaCy fails by focusing on substring matches, ensuring that relevant results are returned despite SpaCy’s shortcomings. This analysis highlights the challenges of applying semantic similarity to technical FHIR data with limited descriptive content.