Outliyr BioHarmony Score — Evidence-Based Intervention Rating Framework

The Outliyr BioHarmony Score is a single 0-10 number that tells you whether a health intervention is worth it. Not just whether it “works.” Whether the full picture, upside and downside, justifies the cost.

It weighs 13 dimensions of benefit and harm. It penalizes risk more heavily than reward. It adjusts for evidence quality.

Every supplement, device, protocol, and therapy gets the same transparent treatment. You can see the formula, the raw scores, and exactly why anything lands where it does.

I built this after years of testing interventions on myself. No existing framework captured the full decision. A supplement can have strong efficacy data and still be a bad choice if it costs $200/month, creates dependency, and requires precise timing.

Star ratings flatten all of that into one useless number.

What’s Wrong With Current Ratings?

Star ratings and letter grades hide more than they reveal.

You’ve seen the pattern. A supplement gets an “A” on one site and a “C” on another. Neither tells you why.

Existing platforms rate on narrow slices. Examine.com does excellent work summarizing research, but it doesn’t produce a single comparable score across interventions. You can’t look at an Examine page for creatine and another for red light therapy and know which one deserves your budget this month.

Labdoor tests whether what’s on the label matches what’s in the bottle. That’s purity testing, not outcome evaluation. ConsumerLab does the same.

None of them answer the question I care about: given everything we know, is this worth my time, money, and biological risk?

That question requires weighing efficacy against side effects. Speed against dependency. Cost against breadth. No existing system does this in a structured, transparent way.

So I built one.

What Are the 13 Dimensions?

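Every intervention gets rated on six upside dimensions and seven downside dimensions, each scored 1-5. On upside dimensions 5 is best; on downside dimensions 5 is worst, so a low safety score means low risk.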

The upside dimensions:

Efficacy: how well it works
Breadth: how many body systems it benefits
Evidence quality: strength of the supporting research
Speed of onset: how quickly you’ll notice results
Durability: how long those results persist
Bioindividuality: how consistently it works across different people

The downside dimensions:

Safety: risk of serious harm
Side effects: common negative reactions
Cost: monthly financial expense
Effort: time and complexity required
Opportunity cost: what you could do instead
Dependency: whether you need to keep taking it
Reversibility: whether negative effects resolve when you stop

Upside Dimensions

Dimension	Weight
Efficacy	25%
Breadth	15%
Evidence	25%
Speed	10%
Durability	10%
Bioindividuality	15%

Downside Dimensions

Dimension	Weight	Multiplier
Safety	30%	1.4×
Side effects	15%	1.4×
Cost	5%	1.0×
Effort	5%	1.0×
Opportunity	5%	1.0×
Dependency	15%	1.4×
Reversibility	25%	1.4×

I chose these 13 because they cover the axes real people weigh when deciding whether to try something.

You don’t just want to know “does creatine build muscle.” You want to know if it’s safe long-term, whether the benefits fade when you stop, how much it costs, and whether the research is based on rodent studies or large human RCTs.

Each dimension is scored using a structured rubric tied to evidence. No gut feelings, no guesswork. Every score links back to the data that justifies it.

Why Does Harm Get a Higher Multiplier?

The harm-type downside dimensions carry a 1.4x multiplier in the final calculation. This is deliberate.

The logic follows the precautionary principle. Harm is harder to undo than benefit is to gain.

If a supplement gives you a modest cognitive boost but causes liver stress, those aren’t equal trade-offs. The liver damage matters more.

Think of it like a coin flip. You’d reject a 50/50 between gaining $1,000 and losing $1,000. The weight of loss exceeds the weight of equivalent gain. In health decisions, that’s not a bias. It’s wisdom.

Harm-Type Dimensions

These dimensions carry a risk multiplier above 1.0×, reflecting the precautionary principle: irreversible or health-threatening downsides should be weighted more heavily than their raw scores suggest.

Safety 1.4×
Side effects 1.4×
Dependency 1.4×
Reversibility 1.4×

Opportunity-Type Dimensions

These dimensions use a 1.0× multiplier. They represent recoverable costs (money, time, opportunity) that stop the moment you stop the intervention.

Cost 1.0×
Effort 1.0×
Opportunity 1.0×

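The 1.4x multiplier means an intervention needs meaningfully more upside than downside to score well. Pros and cons that balance once weighted land at 5.0/10 (Neutral), not 7.0. Identical raw scores on both sides land below neutral, because the harm side is scaled up before the two are compared.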

You have to earn a high score.
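To see that penalty concretely, here is a minimal Python sketch that runs a hypothetical intervention scored 3 on all 13 dimensions through the formula described in How Does the Formula Work?. The all-3s profile and the helper names (`weighted_gain`, `weighted_loss`, `to_scale`) are illustrative assumptions, not published data. With the multiplier, the profile lands at about 4.4 (Caution). Without it, the profile sits exactly at neutral. Its upside would need to average about 3.68 to break even.

```python
# Sketch: a hypothetical intervention scored 3 on all 13 dimensions.
UPSIDE_WEIGHTS = {"efficacy": 0.25, "breadth": 0.15, "evidence": 0.25,
                  "speed": 0.10, "durability": 0.10, "bioindividuality": 0.15}
# (weight, risk multiplier): harm-type dimensions carry 1.4, opportunity-type 1.0
DOWNSIDE_WEIGHTS = {"safety": (0.30, 1.4), "side_effects": (0.15, 1.4),
                    "cost": (0.05, 1.0), "effort": (0.05, 1.0),
                    "opportunity": (0.05, 1.0), "dependency": (0.15, 1.4),
                    "reversibility": (0.25, 1.4)}


def weighted_gain(score: float) -> float:
    """Upside EV term when every upside dimension gets the same score."""
    return sum((score - 1) * w for w in UPSIDE_WEIGHTS.values())


def weighted_loss(score: float, use_multiplier: bool = True) -> float:
    """Downside EV term when every downside dimension gets the same score."""
    return sum((score - 1) * w * (m if use_multiplier else 1.0)
               for w, m in DOWNSIDE_WEIGHTS.values())


def to_scale(ev: float) -> float:
    """Step 2 normalization: EV +4.00 maps to 10, EV -5.36 maps to 0."""
    return 5 + (ev / 4.00 if ev >= 0 else ev / 5.36) * 5


with_penalty = to_scale(weighted_gain(3) - weighted_loss(3))
without_penalty = to_scale(weighted_gain(3) - weighted_loss(3, use_multiplier=False))
# Upside weights sum to 1, so a uniform upside score u yields a gain of (u - 1);
# breaking even against the all-3s downside therefore needs u = 1 + loss.
break_even_upside = 1 + weighted_loss(3)

print(round(with_penalty, 1), round(without_penalty, 1), round(break_even_upside, 2))
# 4.4 5.0 3.68
```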

How Does the Formula Work?

The Outliyr BioHarmony Score uses an expected-value calculation. Each dimension's score gets shifted by a baseline offset of -1.0, multiplied by its weight, and summed. Harm-type downside dimensions also get the 1.4x multiplier before the downside sum is subtracted; cost, effort, and opportunity cost stay at 1.0x.

Step 1: Expected Value (EV)

EV = Σ((upside_score − 1) × weight) − Σ((downside_score − 1) × weight × risk_multiplier)

The baseline offset (1) shifts the neutral point so a score of 1 on any dimension contributes zero to EV. This prevents the "everything scores positive" problem: an intervention must exceed the baseline to register as a benefit.
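Step 1 translates almost line for line into code. The following is a minimal Python sketch with assumed names: `UPSIDE_WEIGHTS`, `DOWNSIDE_WEIGHTS`, and `expected_value` are illustrative and are not Outliyr's engine. The weights and multipliers are taken from the dimension tables. An intervention sitting at the floor on every dimension comes out at exactly zero, which is the baseline behavior described above.

```python
# Sketch of Step 1: harm-weighted expected value over 1-5 dimension scores.
UPSIDE_WEIGHTS = {"efficacy": 0.25, "breadth": 0.15, "evidence": 0.25,
                  "speed": 0.10, "durability": 0.10, "bioindividuality": 0.15}
# (weight, risk multiplier) for each downside dimension
DOWNSIDE_WEIGHTS = {"safety": (0.30, 1.4), "side_effects": (0.15, 1.4),
                    "cost": (0.05, 1.0), "effort": (0.05, 1.0),
                    "opportunity": (0.05, 1.0), "dependency": (0.15, 1.4),
                    "reversibility": (0.25, 1.4)}
BASELINE = 1.0  # a score of 1 contributes nothing to EV


def expected_value(upside: dict[str, float], downside: dict[str, float]) -> float:
    """EV = sum((u - 1) * w) - sum((d - 1) * w * m), per the Step 1 formula."""
    for scores in (upside, downside):
        for name, value in scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be scored 1-5, got {value}")
    gain = sum((upside[name] - BASELINE) * w for name, w in UPSIDE_WEIGHTS.items())
    loss = sum((downside[name] - BASELINE) * w * m
               for name, (w, m) in DOWNSIDE_WEIGHTS.items())
    return gain - loss


floor_up = dict.fromkeys(UPSIDE_WEIGHTS, 1.0)
floor_down = dict.fromkeys(DOWNSIDE_WEIGHTS, 1.0)
print(expected_value(floor_up, floor_down))  # 0.0: nothing rises above the baseline
```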

Step 2: Normalize to 0–10

If EV ≥ 0: Score = 5 + (EV ÷ 4.00) × 5
If EV < 0: Score = 5 + (EV ÷ 5.36) × 5

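A score of 5 is the neutral point: expected benefits roughly equal expected costs. Above neutral, EV = +4.00 maps to 10.0; below neutral, EV = -5.36 maps to 0.0. These bounds are the maximum upside and downside the dimension weights can produce. The +4.00 bound is a 5 on every upside dimension (4 points above baseline × 100% of the upside weight). The -5.36 bound is a 5 on every downside dimension (4 × (85% harm-type weight × 1.4 + 15% opportunity-type weight × 1.0)).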
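The two bounds don't need to be hard-coded, because they follow from the weight tables. The following is a minimal Python sketch under assumed names (`MAX_GAIN`, `MAX_LOSS`, `normalize` are illustrative). It derives both bounds and then applies the piecewise mapping from Step 2.

```python
# Weights in table order: efficacy, breadth, evidence, speed, durability, bioindividuality
UPSIDE_WEIGHTS = [0.25, 0.15, 0.25, 0.10, 0.10, 0.15]
# (weight, multiplier) in table order: safety, side effects, cost, effort,
# opportunity, dependency, reversibility
DOWNSIDE_WEIGHTS = [(0.30, 1.4), (0.15, 1.4), (0.05, 1.0), (0.05, 1.0),
                    (0.05, 1.0), (0.15, 1.4), (0.25, 1.4)]
MAX_SWING = 5 - 1  # a top score of 5 sits four points above the baseline of 1

MAX_GAIN = sum(MAX_SWING * w for w in UPSIDE_WEIGHTS)           # about 4.00
MAX_LOSS = sum(MAX_SWING * w * m for w, m in DOWNSIDE_WEIGHTS)  # about 5.36


def normalize(ev: float) -> float:
    """Piecewise-linear map: EV 0 -> 5.0, +MAX_GAIN -> 10.0, -MAX_LOSS -> 0.0."""
    tolerance = 1e-9  # absorbs floating-point error at the exact bounds
    if not -MAX_LOSS - tolerance <= ev <= MAX_GAIN + tolerance:
        raise ValueError(f"EV {ev} is outside the reachable range")
    bound = MAX_GAIN if ev >= 0 else MAX_LOSS
    return 5 + (ev / bound) * 5


print(round(MAX_GAIN, 2), round(MAX_LOSS, 2))  # 4.0 5.36
print(normalize(0.0))  # 5.0
```

Because the two halves use different bounds, the scale is asymmetric. Above neutral, one point of EV moves the score 1.25 points (5 ÷ 4.00). Below neutral, it moves the score only about 0.93 points (5 ÷ 5.36). The harm penalty therefore lives in the 1.4x multiplier, not in the shape of the scale.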

In plain language: add up all the good. Subtract all the bad, with the harm-type dimensions weighted 40% heavier. Convert to a 0-10 scale.

5.0 = benefits and risks are balanced. Above 5.0 = net positive. Below 5.0 = harms outweigh benefits.

The formula is fully transparent. You can see every input, every weight, and recalculate any score yourself.

Most scoring systems are black boxes. Here, the math is the argument.

Current engine: BioHarmony v2.0. This methodology was last reviewed June 14, 2026. Every published score also shows its own last-reviewed date and the engine version it was scored under.

How Does Scoring Work in Practice?

Three real examples show how the formula plays out, from a clean win to a high-efficacy drug held back by its risks to a peptide scored on mechanism instead of trials.

Astaxanthin: 7.4/10 (Strong Recommend)

Astaxanthin is a carotenoid antioxidant found in wild salmon, krill, and microalgae. It scores well because the upside is solid across the board while the downside stays near the floor.

Efficacy lands at 3.3 and breadth at 4.5, because it touches skin, eyes, and cardiovascular markers at once. Evidence sits at 3.4, backed by dozens of human trials.

The downside barely registers. Safety is 2.0, side effects 1.5, cost 1.8.

The math: an upside of 2.46 minus a downside of 0.58 gives an expected value of 1.88, which maps to 7.4/10. The risk-reward clearly favors trying it.
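The astaxanthin figures can be checked with the Step 2 mapping. Here is a small Python sketch using `decimal`. It assumes published scores round halves up, which is an assumption about the display convention. Decimal keeps the intermediate 7.35 exact. The same arithmetic in binary floating point lands a hair below 7.35, and Python's `round()` would then report 7.3. The function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP


def normalize(ev: Decimal) -> Decimal:
    """Step 2 mapping with the published bounds (4.00 above neutral, 5.36 below)."""
    bound = Decimal("4.00") if ev >= 0 else Decimal("5.36")
    return 5 + (ev / bound) * 5


def published(score: Decimal) -> Decimal:
    """Round to one decimal place, halves rounding up (assumed convention)."""
    return score.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


upside, downside = Decimal("2.46"), Decimal("0.58")  # totals stated in the text
ev = upside - downside
print(ev, normalize(ev), published(normalize(ev)))  # 1.88 7.35 7.4
```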

Semaglutide: 6.2/10 (Worth Trying)

Semaglutide (Ozempic/Wegovy) tells a more complicated story.

The efficacy is remarkable: 4.8 for weight loss, with an evidence score of 5.0 from large phase III trials in thousands of people. On raw upside, almost nothing competes.

Then the downside pulls it back. Safety sits at 4.0, because rare but serious signals (vision loss, pancreatitis, gastroparesis) trip the catastrophic-risk floor. That floor is set once, in safety, and never double-counted across the other dimensions. The weight regain after stopping isn’t a safety harm or a dependency trap, so it lands in durability at 2.0. Reversibility stays low at 1.7, because the drug itself washes out cleanly.

After the 1.4x multiplier on the harm dimensions, the downside totals roughly 2.44 against an upside of roughly 3.40 (both rounded), for an expected value of 0.95 that maps to 6.2/10.
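The two harm-type terms the text spells out can be recomputed from their weights: safety at 4.0 (30% weight) and reversibility at 1.7 (25% weight). The following is a short Python sketch using `decimal`. The function names are illustrative, and rounding halves up is an assumed display convention. It also shows that the stated EV and the difference of the rounded totals publish as the same 6.2.

```python
from decimal import Decimal, ROUND_HALF_UP

HARM = Decimal("1.4")  # multiplier for harm-type downside dimensions


def harm_contribution(score: str, weight: str) -> Decimal:
    """(score - 1) x weight x 1.4 for one harm-type downside dimension."""
    return (Decimal(score) - 1) * Decimal(weight) * HARM


def to_published_score(ev: Decimal) -> Decimal:
    """Step 2 mapping, then one-decimal rounding with halves rounding up."""
    bound = Decimal("4.00") if ev >= 0 else Decimal("5.36")
    return (5 + ev / bound * 5).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


safety = harm_contribution("4.0", "0.30")          # equals 1.26
reversibility = harm_contribution("1.7", "0.25")   # equals 0.245

# The stated EV and the difference of the rounded totals both publish as 6.2.
print(to_published_score(Decimal("0.95")),
      to_published_score(Decimal("3.40") - Decimal("2.44")))
```

Safety alone contributes 1.26 of the roughly 2.44 downside, which is more than half. That is why the serious-but-rare signals dominate the semaglutide result.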

This is the kind of nuance star ratings destroy. Semaglutide is one of the most effective weight-loss drugs ever developed, and, once the full picture is in, still only a measured recommendation for the right person.

BPC-157: 7.2/10 (Strong Recommend)

BPC-157 shows the second evidence path in action. It’s a peptide studied mostly in animals for soft-tissue and gut repair, with a large, consistent base of real-world use but no large human efficacy trial yet.

Under a trials-or-nothing rule, the thin human-trial record would have capped its score low. Here, the coherent repair mechanism, broad preclinical signal, clean reported safety, and heavy real-world use carry real evidence weight, so the missing big RCT lowers confidence rather than gutting the score.

Efficacy 4.3, breadth 4.0, bioindividuality 4.0. The downside stays low: cost around $40 to $80 a month, no established dependency, reversible effects. That combination lands it at 7.2/10.

The catch lives in the confidence band. Because the human-trial record is still thin, a single strong trial could move the score in either direction. The score tells you how good the real-world picture looks; the confidence band tells you how settled it is.

What Do the Tiers Mean?

Scores map to six recommendation tiers so you can quickly interpret any number.

	Tier	Score Range	Meaning
✅	Top-tier	8.8–10.0	Do this yesterday
💪	Strong recommend	7.0–8.7	Worth prioritizing
👍	Worth trying	5.8–6.9	Good for the right person
⚖️	Neutral	4.5–5.7	Context-dependent
⚠️	Caution	3.0–4.4	Significant downsides to weigh
🚫	Skip	0.0–2.9	Not worth the risk
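The tier table is a simple floor lookup. Here is a minimal Python sketch with an illustrative function name, `tier_for`. It assumes the lookup runs on the published one-decimal score, and at that precision the table's ranges leave no gaps.

```python
# Tier floors from the table, highest first. Assumes the lookup runs on the
# published one-decimal score; at that precision the ranges leave no gaps.
TIERS = [
    (8.8, "Top-tier"),
    (7.0, "Strong recommend"),
    (5.8, "Worth trying"),
    (4.5, "Neutral"),
    (3.0, "Caution"),
    (0.0, "Skip"),
]


def tier_for(score: float) -> str:
    """Return the label of the highest tier whose floor the score reaches."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score {score} is outside 0-10")
    for floor, label in TIERS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: the 0.0 floor catches every valid score")


for name, score in [("Astaxanthin", 7.4), ("Semaglutide", 6.2), ("BPC-157", 7.2)]:
    print(name, score, tier_for(score))
```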

The tiers exist for fast scanning. But always look at the dimensional breakdown, not just the label.

Two interventions can both score 6.5 for completely different reasons. One might have moderate efficacy with zero risk. Another might have exceptional efficacy dragged down by high cost and dependency. Same tier. Different decision.
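Here is a sketch of that point with two hypothetical interventions. The dimension scores below are invented for illustration and are not ratings of any real product. The first has moderate upside with every harm-type dimension at the floor. The second has top efficacy but maximum cost and dependency. Both publish as 6.5, yet their gain and loss totals differ sharply.

```python
UPSIDE_WEIGHTS = {"efficacy": 0.25, "breadth": 0.15, "evidence": 0.25,
                  "speed": 0.10, "durability": 0.10, "bioindividuality": 0.15}
# (weight, risk multiplier) for each downside dimension
DOWNSIDE_WEIGHTS = {"safety": (0.30, 1.4), "side_effects": (0.15, 1.4),
                    "cost": (0.05, 1.0), "effort": (0.05, 1.0),
                    "opportunity": (0.05, 1.0), "dependency": (0.15, 1.4),
                    "reversibility": (0.25, 1.4)}


def score(upside: dict, downside: dict) -> tuple[float, float, float]:
    """Return (weighted gain, weighted loss, 0-10 score) for 1-5 dimension scores."""
    gain = sum((upside[k] - 1) * w for k, w in UPSIDE_WEIGHTS.items())
    loss = sum((downside[k] - 1) * w * m for k, (w, m) in DOWNSIDE_WEIGHTS.items())
    ev = gain - loss
    return gain, loss, 5 + (ev / 4.00 if ev >= 0 else ev / 5.36) * 5


# Hypothetical A: moderate upside, every harm-type dimension at the floor.
steady = score(
    {"efficacy": 3, "breadth": 2, "evidence": 3, "speed": 2,
     "durability": 2, "bioindividuality": 2},
    {"safety": 1, "side_effects": 1, "cost": 3, "effort": 3,
     "opportunity": 3, "dependency": 1, "reversibility": 1},
)
# Hypothetical B: exceptional efficacy, dragged down by cost and dependency.
potent = score(
    {"efficacy": 5, "breadth": 3, "evidence": 4, "speed": 3,
     "durability": 2, "bioindividuality": 2},
    {"safety": 1, "side_effects": 1, "cost": 5, "effort": 4,
     "opportunity": 3, "dependency": 5, "reversibility": 1},
)
for label, (gain, loss, final) in [("A", steady), ("B", potent)]:
    print(f"{label}: gain {gain:.2f}, loss {loss:.2f}, score {final:.1f}")
```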

How Do We Grade Evidence?

Evidence answers one question: how sure can you be that this will actually work for you in the real world? That is a different question from whether a trial has formally proven it. Nobody taking an intervention cares what a study says. They care what happens to them. So this dimension weighs the total real-world signal, not the format of the research behind it.

Human trials are one strong source of that signal, especially large or replicated ones. The study hierarchy below is how we weigh those trials, from pooled meta-analyses down to single case reports. But trials are only part of the picture, and their absence is never a penalty.

Evidence Level	Score Anchor	What It Includes
Systematic Reviews & Meta-Analyses of RCTs	4.5–5.0	Pooled data from multiple randomized controlled trials
Large Randomized Controlled Trials	3.5–4.4	Well-designed RCTs with adequate sample sizes and controls
Small RCTs & Controlled Studies	2.5–3.4	Pilot RCTs, crossover studies, or controlled but non-randomized trials
Observational & Epidemiological Studies	2.0–2.4	Cohort, case-control, cross-sectional, or ecological studies
Traditional Use & Expert Consensus	1.5–1.9	Long historical use, clinical expert opinion, or professional guidelines without RCT backing
Anecdotal & Case Reports	1.0–1.4	Individual case reports, user testimonials, forum consensus, N=1 personal data
Mechanistic & In Vitro Only	1.0–1.2	Plausible mechanism of action from cell or animal studies, no human data

Real-world results count as much as trials

A treatment can reach the very top of the evidence scale two ways: a strong, replicated trial record, or overwhelming real-world clinical results. That second route is large-scale practitioner experience plus consistent user outcomes plus a mechanism that makes sense. Either one earns a top score on its own.

Plenty of interventions work, are used by thousands of people, and will never attract a large trial, because no one can patent them or pay for the study. Peptides, lifestyle protocols, devices, and off-label uses of older drugs all live here. We don’t dock them for lacking RCTs. If the real-world track record is strong and consistent, the score reflects that.

Time counts too. Something used safely and effectively for centuries that still works today is evidence, not folklore. A long, consistent, cross-cultural track record that holds up in practice carries real weight here.

This isn’t a free pass. A single study never moves a score on its own, because researchers can choose what to measure and land the result they were hoping for. Real movement takes a consistent body of evidence pointing the same way.

Rigor still cuts both ways. When modern research, historical use, and real-world results all point the same direction, confidence rises. When the evidence turns out to be gamed, funded by the seller with unfavorable results buried, built on a discredited mechanism, or contradicted by what actually happens in practice, the score comes down.

Popularity still doesn’t move the number. A product can be everywhere on social media and score mediocre because the real-world results, the mechanism, or the safety record don’t back the hype. Real outcomes move the score. Marketing doesn’t.

What Are Confidence Bands?

Every score comes with an implicit confidence range based on evidence depth and consistency.

Creatine or vitamin D? Tight confidence band. Extensive, consistent human trial data. The score probably won’t shift much.

Newer peptides or experimental nootropic stacks? Wide band. The current score is our best estimate, and new evidence could move it. Confidence tracks how consistent and deep the real-world and trial signal is, not whether a formal trial exists.

Confidence bands keep me honest. They prevent false precision. You deserve to know how firm the ground is beneath any number I publish.

What Exactly Does the Score Measure?

Every score measures the intervention itself, dosed correctly and sourced clean. It does not rate the worst knockoff you might find online, and it does not rate what happens when someone takes ten times the right dose.

This matters most for anything with a messy supply chain. Take a peptide sold gray-market. The molecule might be well-tolerated at the right dose from a clean source, while the actual vials people buy vary in purity and dosing accuracy. Those are two different problems, and I score them separately.

Intrinsic factors, what the molecule or protocol does when it’s pure and used correctly, drive every dimension: safety, side effects, efficacy, evidence, all of it. Extrinsic factors never inflate the safety or side-effect scores. That includes counterfeit risk, purity uncertainty, dosing accuracy from unregulated vendors, and harms that only appear at overdose or with adulterated product.

Those extrinsic risks show up where they belong: as a sourcing-and-dosing caveat in the verdict, and often a lower confidence band. A case report of harm at ten times the normal dose is a dosing problem, not a property of the intervention.

The analogy I keep coming back to: lacing the most nutritious food on earth with heavy metals doesn’t make the food unhealthy. It makes that batch contaminated. The score rates the food. The caveat warns you about the batch.

Can You Personalize Your Scores?

Default scores use population-level weights. That’s a reasonable starting point. But your priorities aren’t average.

Maybe cost barely matters to you but you’re extremely risk-averse about side effects. Maybe you care most about speed because you’re prepping for a competition in eight weeks. Maybe dependency is a dealbreaker regardless of how effective something is.

The BioHarmony profile quiz captures your health goals, sensitivities, and constraints. It adjusts dimension weights to reflect what matters to you.

The result: your version of the score. The same intervention might be a 7.4 for the general population and an 8.0 for you, because your profile emphasizes the dimensions where it excels. Or it could drop to 6.1 because your profile penalizes the exact dimensions where it’s weakest.

What This Score Is Not

The Outliyr BioHarmony Score is not medical advice. It’s a framework for thinking about health interventions more clearly. It doesn’t replace your doctor, your lab work, or your own judgment.

It’s also not a product review. I’m not testing purity or comparing brands within a category. Two magnesium glycinate products might differ dramatically in quality. This score tells you whether magnesium glycinate as an intervention is worth pursuing. Brand selection is a separate step.

It’s not a popularity contest. Sales volume, influencer endorsements, and Reddit hype don’t move the number. The inputs are structured evidence and dimensional analysis. Nothing else.

Scores reflect current evidence. They’ll change as new research publishes. That’s a feature, not a bug.

Start With Your Profile

The fastest way to make these scores useful is to personalize them.

👉 Take the BioHarmony Profile Quiz to get scores weighted to your health priorities, goals, and risk tolerance.

Already know the system? Browse all rated interventions to compare scores across supplements, devices, protocols, and therapies.

Frequently Asked Questions

What is the Outliyr BioHarmony Score?

The Outliyr BioHarmony Score is a transparent 0-10 rating that captures how worthwhile a health intervention is across 13 evidence-weighted dimensions. It covers everything from supplements and peptides to devices and lifestyle protocols. Unlike simple star ratings, it separately evaluates six upside dimensions (efficacy, breadth, evidence quality, speed, durability, bioindividuality) and seven downside dimensions (safety, side effects, cost, effort, opportunity cost, dependency, reversibility), then combines them using a harm-weighted expected value formula.

How is the BioHarmony Score calculated?

Each of the 13 dimensions is scored 1-5 based on available evidence. Upside and downside scores are each shifted by a baseline offset, multiplied by their weights, and summed. Harm-type downside dimensions (safety, side effects, dependency, reversibility) receive a 1.4x multiplier before subtraction. The resulting expected value maps to a 0-10 scale where 5.0 is neutral. The full formula and every input are visible for each rated intervention.

What types of interventions does BioHarmony score?

Any health intervention with enough evidence to evaluate: supplements, prescription medications, medical devices, biohacking tools (red light therapy, cold plunge, neurostimulation), dietary protocols, exercise methodologies, and lifestyle practices. If it makes a health claim and has some evidence base, it can get a score.

How is this different from Examine.com or Labdoor?

Examine summarizes research but doesn’t produce a single cross-category score. Labdoor and ConsumerLab test product purity and label accuracy. The BioHarmony Score asks a different question entirely: given all known evidence across 13 dimensions, is this intervention worth your time, money, and risk? It’s the only system that applies a consistent, harm-weighted formula across intervention categories.

Can I personalize my BioHarmony scores?

Yes. Default scores use population-level dimension weights. The BioHarmony profile quiz captures your health priorities, risk tolerance, and goals, then adjusts dimension weights accordingly. Your personalized scores reflect what matters most to you rather than population averages.

How often are scores updated?

Scores update when significant new evidence publishes. A major RCT, a safety signal from post-market surveillance, or a systematic review can all trigger a re-evaluation. Each score page shows its last-reviewed date and engine version.

Is the BioHarmony Score medical advice?

No. The score is an educational framework for structured thinking about health interventions. It does not replace diagnosis, treatment recommendations, or supervision from a qualified healthcare provider. Always consult your doctor before starting, stopping, or changing any health intervention, especially prescription medications.

Your Next Move

You now understand how the Outliyr BioHarmony Score works. The formula, the dimensions, the multipliers, the evidence hierarchy. None of that matters until you use it.

Start by browsing the scored interventions. Find something you’re already taking or considering. Look at the dimensional breakdown. See where it scores well and where it doesn’t. That one exercise will change how you evaluate every health decision going forward.

If you want scores calibrated to your priorities, take the BioHarmony profile quiz. It takes about two minutes and reweights every score based on what matters to you.

Your biology is unique. Your evaluation framework should be too.

Know someone who obsesses over supplement research? Send them this page.

The information on this page is for educational purposes only. It is not intended to diagnose, treat, cure, or prevent any disease. Consult a qualified healthcare professional before making changes to your health regimen.

Comparison methodology

A BioHarmony comparison is a head-to-head report that maps each use case to its winning intervention. Comparisons are dynamic: they pull live scores from every referenced intervention, so when a component is re-scored, every comparison that uses it refreshes automatically.

The verdict matrix is editorial, not formulaic. Each use case is independently evaluated against the available evidence; the winner is the intervention with stronger outcomes for that specific use case. Ties and “stack both” results are first-class outcomes, not placeholder values.

Every comparison includes citation-ready passages: self-contained paragraphs a journalist or language model can quote without losing meaning. A validator enforces this; comparisons that reference prior sections or use pronouns without clear antecedents are rejected before publishing.

Confidence is inherited conservatively. A comparison’s confidence equals the lowest confidence across its component interventions. A high-confidence intervention paired with a low-confidence one yields a low-confidence comparison, because the decision depends on both sides being reliable.
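The inheritance rule is a minimum over an ordered scale. Here is a minimal Python sketch. The page only names high and low confidence, so the three-level `Confidence` enum, including its `MEDIUM` level, is a hypothetical stand-in.

```python
from enum import IntEnum


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2  # hypothetical middle level; the page only names high and low
    HIGH = 3


def comparison_confidence(components: list[Confidence]) -> Confidence:
    """A comparison is only as reliable as its least certain intervention."""
    if not components:
        raise ValueError("a comparison needs at least one intervention")
    return min(components)


print(comparison_confidence([Confidence.HIGH, Confidence.LOW]).name)  # LOW
```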

Comparisons are re-reviewed every 90 days by default, and immediately whenever any component intervention’s BioHarmony score changes. That cadence keeps the verdicts current without forcing manual audit cycles.
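The two re-review triggers can be expressed as one check. The following is a hypothetical Python sketch. It assumes the 90-day window is inclusive, and that "score changes" means a component's live score differs from the score recorded at the last review. The intervention names and scores are placeholders.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)


def review_due(last_reviewed: date, today: date,
               scores_at_review: dict[str, float],
               live_scores: dict[str, float]) -> bool:
    """True once 90 days have passed or any component's score has changed."""
    if today - last_reviewed >= REVIEW_INTERVAL:
        return True
    return any(live_scores.get(name) != score
               for name, score in scores_at_review.items())


snapshot = {"intervention-a": 7.0, "intervention-b": 6.4}  # placeholder scores
reviewed = date(2026, 6, 14)
print(review_due(reviewed, date(2026, 7, 1), snapshot, snapshot))  # False
print(review_due(reviewed, date(2026, 7, 1), snapshot,
                 {**snapshot, "intervention-a": 7.1}))  # True
print(review_due(reviewed, reviewed + timedelta(days=90), snapshot, snapshot))  # True
```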

Independently tested

Every product Outliyr reviews is measured on real instruments: spectrometers, EMF and thermal imaging, an oscilloscope, EEG, and microscopy. See the full lineup in the Outliyr Testing Lab.