You search for a popular topic and five of the ten results come from the same site. The engine thinks they're all relevant; you want broader coverage. How do you keep one per domain?
Extract the domain from each URL, then keep only the first result whose domain hasn't appeared yet? A set to track what I've seen, a list for what to keep.
Exactly. The domain is the middle piece of the URL — url.split("/")[2] for https://example.com/path gives example.com. Then a seen set tracks what you've already kept:
```python
seen = set()
unique = []
for r in results:
    domain = r["url"].split("/")[2]
    if domain not in seen:
        seen.add(domain)
        unique.append(r)
print(len(unique))
```

Five results from example.com collapse to one — the first one the engine returned (which is the highest-ranked on that domain).
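One caveat worth noting: `url.split("/")[2]` assumes every URL carries a scheme like `https://`. A minimal sketch of a more defensive extraction using the standard library's `urllib.parse` — the helper name `extract_domain` is my own, not part of the lesson's code:

```python
from urllib.parse import urlparse

def extract_domain(url: str) -> str:
    """Return the host part of a URL, tolerating scheme-less inputs."""
    netloc = urlparse(url).netloc
    # urlparse returns an empty netloc for scheme-less URLs like "example.com/path",
    # so fall back to taking everything before the first slash
    return netloc or url.split("/")[0]

print(extract_domain("https://example.com/path"))  # example.com
print(extract_domain("example.com/path"))          # example.com
```

For well-formed engine results the simple `split("/")[2]` is fine; the fallback only matters if your result URLs are inconsistent.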
So the order of unique is the original engine ranking, just with duplicates filtered out? Not a new ranking?
Exactly. The engine's ordering is preserved; you're pruning, not reshuffling. The first per-domain wins because you add to unique on first encounter, and subsequent same-domain results fail the if domain not in seen check. Full function:
```python
def deduplicate_results_by_domain(query: str, count: int) -> list:
    results = search(query, count=count)
    seen = set()
    unique = []
    for r in results:
        domain = r["url"].split("/")[2]
        if domain not in seen:
            seen.add(domain)
            unique.append(r)
    return unique
```

Why set() instead of another list? Would if domain not in unique_domains work just as well?
Functionally yes. Performance-wise, in on a set is O(1); on a list it's O(n). For ten results the difference doesn't matter; for a thousand, the set version finishes instantly and the list version crawls. Using the right data structure for the job is a habit — sets for uniqueness, lists for order.
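The gap is easy to measure with `timeit`. A micro-benchmark sketch, not part of the lesson's code; the exact timings will vary by machine, but the ordering won't:

```python
import timeit

# 1,000 fake domains, stored both ways
domains = [f"site{i}.com" for i in range(1000)]
as_set = set(domains)
as_list = list(domains)

# Membership test for an element near the end, where the list scan is worst
set_time = timeit.timeit(lambda: "site999.com" in as_set, number=10_000)
list_time = timeit.timeit(lambda: "site999.com" in as_list, number=10_000)
print(f"set: {set_time:.4f}s  list: {list_time:.4f}s")
```

The set lookup stays flat as the collection grows; the list scan grows linearly with it.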
And the output is still a list of result dicts — exactly the same shape as search() returns, just with fewer items.
Exactly. Dedup is a pure filter — same shape in, same shape out, fewer items. Combine this with yesterday's cache and semantic ranking, and retrieval starts producing genuinely high-quality inputs for downstream agent work.
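That end-to-end shape can be sketched with a stubbed search() standing in for the real engine (the stub's results are invented for illustration):

```python
def search(query: str, count: int) -> list:
    # Stub: pretend the engine returned ranked results, two from the same site
    return [
        {"title": "A", "url": "https://example.com/a"},
        {"title": "B", "url": "https://example.com/b"},
        {"title": "C", "url": "https://other.org/c"},
    ][:count]

def deduplicate_results_by_domain(query: str, count: int) -> list:
    results = search(query, count=count)
    seen = set()
    unique = []
    for r in results:
        domain = r["url"].split("/")[2]
        if domain not in seen:
            seen.add(domain)
            unique.append(r)
    return unique

deduped = deduplicate_results_by_domain("popular topic", count=3)
print([r["url"] for r in deduped])
# ['https://example.com/a', 'https://other.org/c']
```

Same list-of-dicts shape in and out, engine order preserved, and the lower-ranked example.com/b result is the only thing dropped.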
TL;DR: split the URL, seed a seen set, keep the first result per domain.
- url.split("/")[2] — middle part is the domain
- set() for uniqueness — O(1) membership test

| Structure | `in` cost | Order |
|---|---|---|
| set | O(1) | unordered |
| list | O(n) | ordered |
Use a set for membership checks; use a list when you need to preserve order.