Yesterday you wrote a response-time statistics function. How are you going to test it? You need realistic input data — not just [100, 200, 300], but something that looks like actual server response times.
I could write test cases by hand... but that's fifty numbers typed manually and it won't catch distribution edge cases. I need synthetic data that looks like real server traffic — mostly fast responses with occasional spikes.
random module. Python's built-in controlled randomness. "Controlled" is the key word — seeded random is reproducible:
import random
# Seed for reproducibility — same seed, same sequence every time
rng = random.Random(42)
# Uniform distribution: any value between low and high with equal probability
response_time = rng.uniform(50, 400)
print(round(response_time, 1)) # 273.8 — same every time with seed=42
# Integer
status_code = rng.randint(200, 503) # includes both endpoints
The seed makes the randomness reproducible. Every test run with random.Random(42) generates the exact same sequence. So my test assertions can check exact values, not just ranges.
Exactly. In test code, always seed. In simulation code where you want genuinely different output each run, don't seed, or call random.seed(None) to reseed from system entropy. The instance approach — rng = random.Random(42) — is better than random.seed(42) at the module level because it doesn't affect other code that uses random.
How do I generate data that looks like real server traffic? Most requests are fast — under 300ms — but there are occasional spikes to 2000ms or more. Uniform distribution would give me too many slow requests.
random.gauss() gives you normal (bell curve) distribution — most values cluster around the mean, fewer values appear at the extremes:
import random
rng = random.Random(42)
def generate_response_times(n: int, mean_ms: float = 200, spike_rate: float = 0.05) -> list[float]:
    times = []
    for _ in range(n):
        if rng.random() < spike_rate:
            # Spike: slow request
            times.append(abs(rng.gauss(1500, 300)))
        else:
            # Normal request
            times.append(abs(rng.gauss(mean_ms, 50)))
    return times
sample = generate_response_times(1000)
print(f"Mean: {sum(sample)/len(sample):.1f}ms")
print(f"Max: {max(sample):.1f}ms")
5% of requests are slow spikes around 1500ms, 95% are normal requests around 200ms. That's what real server traffic looks like — mostly nominal with occasional slow outliers. I can use this to test whether my percentile and standard deviation functions actually capture the spikes.
And random.choice() for selecting from a list, random.choices() for weighted selection — useful for generating error codes with realistic frequency:
import random
rng = random.Random(42)
log_levels = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
weights = [5, 60, 25, 8, 2] # relative frequencies
sample_levels = rng.choices(log_levels, weights=weights, k=100)
from collections import Counter
print(Counter(sample_levels).most_common())
# [('INFO', ~60), ('WARNING', ~25), ('ERROR', ~8), ('DEBUG', ~5), ('CRITICAL', ~2)]
random.choices() with weights — INFO appears 60 times out of 100 on average. That models the actual distribution of log levels in the ops team's data. Most entries are INFO, some are WARNING, a few are ERROR, almost none are CRITICAL.
And for shuffling a list — randomizing the order of log entries for testing that your analysis doesn't depend on ordering:
import random
rng = random.Random(42)
entries = list(range(10))
rng.shuffle(entries) # in-place
print(entries) # same permutation every time with seed=42
random.sample(population, k) for picking k items without replacement — that's for sampling a large log file to inspect a representative subset. random.choices() is with replacement. random.sample() is without. Now I can generate realistic test data for every function I've written this week.
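That sampling workflow can be sketched like this, using a made-up list of log entry IDs as the population:

```python
import random

rng = random.Random(42)

# Hypothetical population: 10,000 log entry IDs
log_ids = list(range(10_000))

# Pick 5 distinct entries to inspect; duplicates are impossible
subset = rng.sample(log_ids, k=5)
assert len(subset) == len(set(subset))  # sample() never repeats items

# choices() draws with replacement, so repeats can occur,
# which matters when k is large relative to the population
with_replacement = rng.choices(log_ids, k=5)
```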
That's the week summary: statistics functions need realistic test inputs; random generates them. Tomorrow: collections — the highlight of Week 3. Counter.most_common(5) replaces twenty lines of manual dict accumulation. You're going to have a strong reaction.
The random module generates pseudo-random numbers using the Mersenne Twister algorithm — fast, well-distributed, reproducible with a seed. "Pseudo-random" is the key qualifier: the sequence looks random but is fully determined by the initial seed. This is a feature for testing: seeded randomness produces the same data every run, so test assertions can check exact values.
The random module exposes two interfaces. The module-level functions (random.uniform(), random.gauss(), etc.) share a global state. random.seed(42) sets the state, affecting all subsequent calls anywhere in the program — including in libraries you import. The instance API — rng = random.Random(42) — creates an independent generator with its own state. For test code, always use the instance API to avoid contaminating the module-level state and to make the randomness explicit.
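The isolation point can be sketched with a hypothetical library_helper() standing in for third-party code that also draws from the module-level generator:

```python
import random

def library_helper() -> float:
    # Stand-in for third-party code that uses the module-level generator
    return random.random()

# Module-level seeding: shared state, perturbed by any other caller
random.seed(42)
library_helper()              # consumes a value from the shared stream
module_val = random.random()  # NOT the first value of the seed-42 sequence

# Instance API: isolated state, unaffected by library_helper()
rng = random.Random(42)
library_helper()
instance_val = rng.random()   # always the first value of the seed-42 sequence
assert instance_val != module_val
```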
random.uniform(a, b) — uniform distribution between a and b. random.gauss(mu, sigma) — normal (Gaussian) distribution with mean mu and standard deviation sigma. random.expovariate(lambd) — exponential distribution with rate parameter lambd, useful for modeling inter-arrival times in Poisson processes (like web server requests). random.lognormvariate(mu, sigma) — log-normal distribution, appropriate for response time modeling where most values are fast but the tail is long.
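A sketch of those two long-tail distributions; the rate and the mu/sigma values here are illustrative choices, not from the lesson:

```python
import random

rng = random.Random(42)

# Exponential: inter-arrival times for a Poisson process.
# lambd is the rate: 10 requests/second gives a mean gap of 1/10 = 0.1s
gaps = [rng.expovariate(10) for _ in range(1000)]
print(f"mean gap: {sum(gaps) / len(gaps):.3f}s")  # should land near 0.1

# Log-normal: mu and sigma describe the underlying normal distribution,
# so the result is skewed with a long right tail, like response times
times = sorted(rng.lognormvariate(5.3, 0.4) for _ in range(1000))
print(f"median: {times[500]:.0f}ms, rough p99: {times[989]:.0f}ms")
```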
random.choice(seq) — uniform selection of one element. random.choices(population, weights, k) — selection with replacement and optional weights. random.sample(population, k) — selection without replacement (no duplicates). random.shuffle(lst) — in-place shuffling of a list. The distinction between choices (with replacement) and sample (without) matters when the population is small relative to k.
The random module is not cryptographically secure. For security-sensitive operations — generating tokens, session IDs, API keys — use secrets.token_hex(), secrets.token_urlsafe(), or secrets.choice(). The secrets module uses the OS's cryptographic random source. For statistical simulation and test data generation, random is appropriate and significantly faster.
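For contrast, a short sketch of the secrets equivalents; the token and password sizes here are arbitrary:

```python
import secrets
import string

# Cryptographically secure tokens, backed by the OS random source
print(secrets.token_hex(16))      # 32 hex characters
print(secrets.token_urlsafe(16))  # URL-safe base64 text

# secrets.choice() is the secure counterpart of random.choice()
alphabet = string.ascii_letters + string.digits
password = "".join(secrets.choice(alphabet) for _ in range(12))
print(password)
```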
random.Random(seed) with an integer seed produces the same sequence every run. random.Random() with no seed draws its seed from operating-system entropy (falling back to the clock), so each run differs. For unit tests: seed every generator. For simulation runs: no seed, or document the seed in the output for reproducibility. For production: never use random for security; always use secrets.