A model doesn't see strings — it sees numbers. An embedding is a fixed-length list of floats that represents the meaning of a piece of text. "cat" and "dog" sit close together in vector space; "cat" and "car" sit far apart.
The math you need is one formula — cosine similarity:
```python
def cosine(a, b):
    dot = sum(x*y for x, y in zip(a, b))
    mag_a = sum(x*x for x in a) ** 0.5
    mag_b = sum(y*y for y in b) ** 0.5
    return dot / (mag_a * mag_b)
```

Returns a number between -1 and 1. Higher = more similar. For real embeddings the values are usually 0 to 1.
And we get the vectors how?
Today we'll work with hand-built stub vectors so the concept stays clean. The dimensions are pretend (4 floats instead of 1536), but the math is real. Tomorrow we wire up the embedding API.
Why stubs first?
Because the idea — vectors and distance — is the lesson, and a real embedding call burns quota and adds noise. You need to feel that two vectors-that-mean-similar-things produce a high cosine before you trust the API to do the work.
An embedding takes a string and returns a fixed-length list of floats:
```python
embed("cat") → [0.12, -0.04, 0.91, ...]  # 1536 floats for OpenAI ada
embed("dog") → [0.10, -0.03, 0.88, ...]  # close to cat
embed("car") → [0.71, 0.45, 0.02, ...]   # far from cat
```

The magic: words with similar meanings produce vectors that are close together in this high-dimensional space.
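The only hard part of the contract is the fixed length: every input, short or long, comes back as the same number of floats. A minimal stub that honors that contract might look like the following (the hashing trick and the `embed_stub` name are my own invention; it mimics only the shape of a real embedding, not the semantics, so hash-stubs of "cat" and "dog" will not be close):

```python
import hashlib

def embed_stub(text, dim=4):
    # Deterministic fake embedding: hash the text into `dim` floats in [0, 1].
    # Real APIs return learned vectors; this only reproduces the fixed-length shape.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

print(len(embed_stub("cat")))                    # 4
print(len(embed_stub("a much longer sentence"))) # 4 — length never varies
```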
The standard distance metric for embeddings:
```python
def cosine(a, b):
    dot = sum(x*y for x, y in zip(a, b))
    mag_a = sum(x*x for x in a) ** 0.5
    mag_b = sum(y*y for y in b) ** 0.5
    return dot / (mag_a * mag_b)

print(cosine([1, 0, 0], [1, 0, 0]))   # 1.0 — identical
print(cosine([1, 0, 0], [0, 1, 0]))   # 0.0 — orthogonal
print(cosine([1, 0, 0], [-1, 0, 0])) # -1.0 — opposite
```

For real embedding vectors, you'll see values in roughly the 0.3 to 0.95 range — texts are rarely identical or orthogonal.
Cosine measures direction, not magnitude. Two embeddings about the same topic point the same way regardless of how long the source text was. That's the property you want for similarity.
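You can verify the direction-not-magnitude property directly: scale a vector by any positive constant and its cosine against the original stays at 1. The vectors here are made up for illustration.

```python
def cosine(a, b):
    dot = sum(x*y for x, y in zip(a, b))
    mag_a = sum(x*x for x in a) ** 0.5
    mag_b = sum(y*y for y in b) ** 0.5
    return dot / (mag_a * mag_b)

short = [0.2, 0.1, 0.9]
long = [2.0, 1.0, 9.0]  # same direction, 10x the magnitude

print(cosine(short, long))  # ≈ 1.0 — magnitude is ignored entirely
```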
Tomorrow's lesson calls a real embedding API — values change run-to-run, vectors are 1536 floats, the assertion gets fuzzy. To establish the concept (close meanings → high cosine), we use four hand-built 4-dim vectors where the answer is unambiguous: cat and dog cluster, car is the outlier.
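Putting the pieces together, here is one sketch of what those hand-built stubs could look like. The specific float values, and the fourth word "truck", are invented for illustration; the lesson only fixes cat, dog, and car.

```python
def cosine(a, b):
    dot = sum(x*y for x, y in zip(a, b))
    mag_a = sum(x*x for x in a) ** 0.5
    mag_b = sum(y*y for y in b) ** 0.5
    return dot / (mag_a * mag_b)

# Hand-built 4-dim stubs: the first two dims loosely encode "animal",
# the last two "vehicle". Values are invented, but the clustering is real math.
STUBS = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.1],
    "car":   [0.1, 0.0, 0.9, 0.8],
    "truck": [0.0, 0.1, 0.8, 0.9],
}

query = STUBS["cat"]
ranked = sorted(STUBS, key=lambda w: cosine(STUBS[w], query), reverse=True)
print(ranked)  # ['cat', 'dog', 'car', 'truck'] — cat and dog cluster, vehicles trail
```

Because the vectors are fixed, the assertion is unambiguous: dog always ranks above car for a cat query, which is exactly the intuition tomorrow's API call should reproduce.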