Test results are very easy to fake. The "gold standard" is whether outside groups can run the same tests and reproduce the same results. Peer review doesn't mean much more than "this paper seems interesting and is well-written and properly formatted."
Peer review isn't (and never was meant to be) a golden seal declaring a result to be unchangeable truth. It is a process that makes sure at least one competent colleague has read the paper and caught the embarrassing Monday-morning mistakes that might have been in there. This is done before wider publication so that the 100 people in your field don't all stumble over the same stupid little mistake and can instead spend their time thinking about the actual content.
Think of it as type checking by the compiler or a code review by somebody on your team, NOT a papal pronouncement of unchangeable dogma.
"I can run your code/experiment/etc. on your population and get your results" isn't the gold standard either. "We approached it in a different way, with a different population, and we get a consistent answer" is both harder and more compelling.
Well, there's replicating the experiment and then there's challenging the conclusion. Those are two separate ideas, and they're actually fairly independent of each other.