Yandex Data Leak: The Ranking Factors & The Myths We Found (Festive Flashback) via @sejournal, @TaylorDanRW


Celebrate the Holidays with some of SEJ’s best articles of 2023.

Our Festive Flashback series runs from December 21 – January 5, featuring daily reads on significant events, fundamentals, actionable strategies, and thought leader opinions.

2023 has been quite eventful in the SEO industry and our contributors produced some outstanding articles to keep pace and reflect these changes.

Catch up on the best reads of 2023 to give you plenty to reflect on as you move into 2024.


Yandex is the search engine with the majority of market share in Russia and the fourth-largest search engine in the world.

On January 27, 2023, it suffered what is arguably one of the largest data leaks that a modern tech company has endured in many years, and it is the second leak in less than a decade.

In 2015, a former Yandex employee attempted to sell Yandex’s search engine code on the black market for around $30,000.

The initial leak in January this year revealed 1,922 ranking factors, of which more than 64% were listed as unused or deprecated (superseded and best avoided).

This leak was just the file labeled kernel, but as the SEO community and I delved deeper, more files were found that combined contain approximately 17,800 ranking factors.

When it comes to practicing SEO for Yandex, the guide I wrote two years ago, for the most part, still applies.

Yandex, like Google, has always been public with its algorithm updates and changes, and in recent years, how it has adopted machine learning.

Notable updates from the past two to three years include:

  • Vega (which doubled the size of the index).
  • Mimicry (penalizing fake websites impersonating brands).
  • Y1 update (introducing YATI).
  • Y2 update (late 2022).
  • Adoption of IndexNow.
  • A recent rollout and assumed update of the PF filter.

On a personal note, this data leak is like a second Christmas.

Since January 2020, I’ve run an SEO news website as a hobby dedicated to covering Yandex SEO and search news in Russia with 600+ articles, so this is probably the peak event of the hobby site.

I’ve also spoken twice at the Optimization conference, the largest SEO conference in Russia.

This is also a good test to see how closely Yandex’s public statements match the codebase secrets.

In 2019, working with Yandex’s PR team, I was able to interview engineers in their Search team and ask a number of questions sourced from the wider Western SEO community.

You can read the interview with the Yandex Search team here.

Whilst Yandex is primarily known for its presence in Russia, the search engine also has a presence in Turkey, Kazakhstan, and Georgia.

The data leak was believed to be politically motivated and the actions of a rogue employee, and contains a number of code fragments from Yandex’s monolithic repository, Arcadia.

Within the 44GB of leaked data, there’s information relating to a number of Yandex products including Search, Maps, Mail, Metrika, Disk, and Cloud.

What Yandex Has Had To Say

As I write this post (January 31st, 2023), Yandex has publicly stated that:

the contents of the archive (leaked code base) correspond to the outdated version of the repository – it differs from the current version used by our services

And:

It is important to note that the published code fragments also contain test algorithms that were used only within Yandex to verify the correct operation of the services.

So, how much of this code base is actively used is questionable.

Yandex has also revealed that during its investigation and audit, it found a number of errors that violate its own internal principles, so it is likely that portions of this leaked code (that are in current use) may be changing in the near future.

Factor Classification

Yandex classifies its ranking factors into three categories.

This has been outlined in Yandex’s public documentation for some time, but I feel it is worth including here, as it better helps us understand the ranking factor leak.

  • Static factors – Factors that are related directly to the website (e.g. inbound backlinks, inbound internal links, headers, and ads ratio).
  • Dynamic factors – Factors that are related to both the website and the search query (e.g. text relevance, keyword inclusions, TF*IDF).
  • User search-related factors – Factors relating to the user query (e.g. where is the user located, query language, and intent modifiers).

The ranking factors in the document are tagged to match the corresponding category, with TG_STATIC and TG_DYNAMIC, and then TG_QUERY_ONLY, TG_QUERY, TG_USER_SEARCH, and TG_USER_SEARCH_ONLY.
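To make the tagging concrete, here is a minimal Python sketch of how you might bucket factors by category when working through the leaked lists yourself. The tag names are the ones above; treating the four query and user tags as the third category is my reading of the data, and the factor names and the extra tag in the example are made up for illustration.

```python
from collections import defaultdict

# Tag names come from the leak. Folding the four query/user tags into the
# "user search-related" category is an assumption, not a documented mapping.
TAG_TO_CATEGORY = {
    "TG_STATIC": "static",
    "TG_DYNAMIC": "dynamic",
    "TG_QUERY_ONLY": "user_search",
    "TG_QUERY": "user_search",
    "TG_USER_SEARCH": "user_search",
    "TG_USER_SEARCH_ONLY": "user_search",
}


def group_by_category(factors):
    """Map each category to the names of the factors carrying a matching tag."""
    grouped = defaultdict(list)
    for name, tags in factors:
        # Unknown tags are ignored; a factor lands once per matching category.
        categories = {TAG_TO_CATEGORY[t] for t in tags if t in TAG_TO_CATEGORY}
        for category in sorted(categories):
            grouped[category].append(name)
    return dict(grouped)


# Hypothetical factor records: (name, list of tags).
example_factors = [
    ("HypotheticalAdsRatio", ["TG_STATIC"]),
    ("HypotheticalTextRelevance", ["TG_DYNAMIC"]),
    ("HypotheticalQueryLanguage", ["TG_QUERY_ONLY", "TG_HYPOTHETICAL_OTHER"]),
]
grouped = group_by_category(example_factors)
```

A factor carrying tags from more than one category would appear under each of them, which is useful when auditing how the categories overlap.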

Yandex Leak Learnings So Far

From the data thus far, below are some of the affirmations and learnings we’ve been able to make.

There is so much data in this leak, it is very likely that we will be finding new things and making new connections in the next few weeks.

These include:

  • PageRank (a form of).
  • At some point Yandex utilized TF*IDF.
  • Yandex still uses meta keywords, which are also highlighted in its documentation.
  • Yandex has specific factors for medical, legal, and financial topics (YMYL).
  • It also uses a form of page quality scoring, but this is known (ICS score).
  • Links from high-authority websites have an impact on rankings.
  • There’s nothing new to suggest Yandex can crawl JavaScript yet outside of already publicly documented processes.
  • Server errors and excessive 4xx errors can impact ranking.
  • The time of day is taken into consideration as a ranking factor.

Below, I’ve expanded on some other affirmations and learnings from the leak.

Where possible, I’ve also tied these leaked ranking factors to the algorithm updates and announcements that relate to them, or where we were told about them being impactful.

MatrixNet

MatrixNet is mentioned in a few of the ranking factors and was announced in 2009, and then superseded in 2017 by CatBoost, which was rolled out across the Yandex product sphere.

This further adds validity to comments directly from Yandex, and one of the factor authors DenPlusPlus (Den Raskovalov), that this is, in fact, an outdated code repository.

MatrixNet was originally introduced as a new, core algorithm that took into consideration thousands of ranking factors and assigned weights based on the user location, the actual search query, and perceived search intent.

It is typically seen as an early version of Google’s RankBrain, when they are indeed two very different systems. MatrixNet was launched six years before RankBrain was announced.

MatrixNet has also been built upon, which isn’t surprising, given it is now 14 years old.

In 2016, Yandex introduced the Palekh algorithm that used deep neural networks to better match documents (webpages) and queries, even if they didn’t contain the right “levels” of common keywords, but satisfied the user intents.

Palekh was capable of processing 150 pages at a time, and in 2017 was updated with the Korolyov update, which took into account more depth of page content, and could work off 200,000 pages at once.

URL & Page-Level Factors

From the leak, we have learned that Yandex takes into consideration URL construction, specifically:

  • The presence of numbers in the URL.
  • The number of trailing slashes in the URL (and if they are excessive).
  • The number of capital letters in the URL.
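If you want to audit your own URLs against these three signals, a rough Python sketch follows. It is not how Yandex computes anything; the function name, the feature names, and the choice to look only at the URL path are my own assumptions.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Return simple URL-shape signals for one URL (path component only)."""
    path = urlsplit(url).path
    # Count consecutive slashes at the very end of the path.
    trailing_slashes = len(path) - len(path.rstrip("/"))
    return {
        "has_digits": any(ch.isdigit() for ch in path),
        "trailing_slashes": trailing_slashes,
        "uppercase_letters": sum(ch.isupper() for ch in path),
    }


# Hypothetical URL used purely as an example.
features = url_features("https://example.com/Blog/Post-42//")
```

Run across a crawl export, this makes it easy to spot templates that generate mixed-case paths or doubled trailing slashes.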

Yandex leak of ranking factors. Screenshot from author, January 2023

The age of a page (document age) and the last updated date are also important, and this makes sense.

As well as document age and last update, a number of factors in the data relate to freshness, particularly for news-related queries.

Yandex formerly used timestamps, specifically not for ranking purposes but “reordering” purposes, but this is now classified as unused.

Also in the deprecated column is the use of keywords in the URL. Yandex has previously measured that three keywords from the search query in the URL would be an “optimal” result.

Internal Links & Crawl Depth

Whilst Google has gone on the record to say that for its purposes, crawl depth isn’t explicitly a ranking factor, Yandex appears to have an active piece of code that dictates that URLs that are reachable from the homepage have a “higher” level of importance.

Yandex factors. Screenshot from author, January 2023

This mirrors John Mueller’s 2018 statement that Google gives “a little more weight” to pages that are one click away from the homepage.

The ranking factors also highlight a specific token weighting for webpages that are “orphans” within the website linking structure.

Clicks & CTR

In 2011, Yandex released a blog post talking about how the search engine uses clicks as part of its rankings and also addresses the desires of the SEO pros to manipulate the metric for ranking gain.

Specific click factors in the leak look at things like:

  • The ratio of the number of clicks on the URL, relative to all clicks on the search.
  • The same as above, but broken down by region.
  • How often users click on the URL for the search.
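The first two factors are simple ratios. As an illustration only, here is a small Python sketch of a click share calculation, optionally restricted to one region. The log format, function name, queries, and URLs are all hypothetical; the leak describes the factors, not this code.

```python
def click_share(click_log, query, url, region=None):
    """Share of clicks for `query` that landed on `url`.

    `click_log` is an iterable of (query, clicked_url, region) tuples.
    If `region` is given, only clicks from that region are counted.
    """
    total = 0
    on_url = 0
    for logged_query, clicked_url, clicked_region in click_log:
        if logged_query != query:
            continue
        if region is not None and clicked_region != region:
            continue
        total += 1
        if clicked_url == url:
            on_url += 1
    # No clicks recorded for the query (or region) means no share to report.
    return on_url / total if total else 0.0


# Made-up click log for demonstration.
log = [
    ("buy tea", "https://a.example/tea", "Moscow"),
    ("buy tea", "https://b.example/tea", "Moscow"),
    ("buy tea", "https://a.example/tea", "Kazan"),
    ("buy tea", "https://a.example/tea", "Kazan"),
    ("buy coffee", "https://a.example/coffee", "Moscow"),
]
overall = click_share(log, "buy tea", "https://a.example/tea")
moscow = click_share(log, "buy tea", "https://a.example/tea", region="Moscow")
```

In this toy log the URL takes three of the four "buy tea" clicks overall but only one of two in Moscow, which shows why a regional breakdown can tell a different story from the global ratio.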

Manipulating Clicks

Manipulating user behavior, specifically “click-jacking”, is a known tactic within Yandex.

Yandex has a filter, known as the PF filter, that actively seeks out and penalizes websites that engage in this activity using scripts that monitor IP similarities and then the “user actions” of those clicks, and the impact can be significant.

The below screenshot shows the impact on organic sessions (сессии) after being penalized for imitating user clicks.

Image from Russian Search News, January 2023

User Behavior

The user behavior takeaways from the leak are some of the more interesting findings.

User behavior manipulation is a common SEO violation that Yandex has been combating for years. At the 2020 Optimization conference, then Head of Yandex Webmaster Tools Mikhail Slevinsky said the company is making good progress in detecting and penalizing this type of behavior.

Yandex penalizes user behavior manipulation with the same PF filter used to combat CTR manipulation.

Dwell Time

102 of the ranking factors contain the tag TG_USERFEAT_SEARCH_DWELL_TIME, and reference the device, user duration, and average page dwell time.

All but 39 of these factors are deprecated.

Yandex factors. Screenshot from author, January 2023

Bing first used the term dwell time in a 2011 blog, and in recent years Google has made it clear that it doesn’t use dwell time (or similar user interaction signals) as ranking factors.

YMYL

YMYL (Your Money, Your Life) is a concept well-known within Google and is not a new concept to Yandex.

Within the data leak, there are specific ranking factors for medical, legal, and financial content, but this was notably revealed in 2019 at the Yandex Webmaster conference when it announced the Proxima Search Quality Metric.

Metrika Data Usage

Six of the ranking factors relate to the usage of Metrika data for the purposes of ranking. However, one of them is tagged as deprecated:

  • The number of similar visitors from the YandexBar (YaBar/Ябар).
  • The average time spent on URLs from those same similar visitors.
  • The “core audience” of pages on which there is a Metrika counter [deprecated].
  • The average time a user spends on a host when accessed externally (from another non-search site) from a specific URL.
  • Average ‘depth’ (number of hits within the host) of a user’s session on the host when accessed externally (from another non-search site) from a particular URL.
  • Whether or not the domain has Metrika installed.

In Metrika, user data is handled differently.

Unlike Google Analytics, there are a number of reports focused on user “loyalty” combining site engagement metrics with return frequency, duration between visits, and source of the visit.

For example, I can see a report in one click to see a breakdown of individual site visitors:

Screenshot from Metrika, January 2023

Metrika also comes “out of the box” with heatmap tools and user session recording, and in recent years the Metrika team has made good progress in being able to identify and filter bot traffic.

With Google Analytics, there is an argument that Google doesn’t use UA/GA4 data for ranking purposes because of how easy it is to modify or break the tracking code, but with Metrika counters, they are a lot more linear, and a lot of the reports are unchangeable in terms of how the data is collected.

Impact Of Traffic On Rankings

Following on from looking at Metrika data as a ranking factor, these factors effectively confirm that direct traffic and paid traffic (buying ads via Yandex Direct) can impact organic search performance:

  • Share of direct visits among all incoming traffic.
  • Green traffic share (aka direct visits) – Desktop.
  • Green traffic share (aka direct visits) – Mobile.
  • Search traffic – transitions from search engines to the site.
  • Share of visits to the site not by links (set by hand or from bookmarks).
  • The number of unique visitors.
  • Share of traffic from search engines.

News Factors

There are a number of factors relating to “News”, including two that reference Yandex.News directly.

Yandex.News was an equivalent of Google News, but was sold to the Russian social network VKontakte in August 2022, along with another Yandex product “Zen”.

So, it’s not clear if these factors relate to a product no longer owned or operated by Yandex, or to how news websites are ranked in “regular” search.

Backlink Importance

Yandex has similar algorithms to combat link manipulation as Google, and has since the Nepot filter in 2005.

From reviewing the backlink ranking factors and some of the specifics in the descriptions, we can assume that the best practices for building links for Yandex SEO would be to:

  • Build links with a more natural frequency and varying amounts.
  • Build links with branded anchor texts as well as use commercial keywords.
  • If buying links, avoid buying links from websites that have mixed topics.

Below is a list of link-related factors that can be considered affirmations of best practices:

  • The age of the backlink is a factor.
  • Link relevance based on topics.
  • Backlinks built from homepages carry more weight than internal pages.
  • Links from the top 100 websites by PageRank (PR) can impact rankings.
  • Link relevance based on the quality of each link.
  • Link relevance, taking into account the quality of each link, and the topic of each link.
  • Link relevance, taking into account the non-commercial nature of each link.
  • Percentage of inbound links with query words.
  • Percentage of query words in links (up to a synonym).
  • The links contain all the words of the query (up to a synonym).
  • Dispersion of the number of query words in links.
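The last four factors in that list are all anchor-text statistics, so they are easy to approximate for your own backlink profile. The Python sketch below is illustrative only: it ignores synonyms (which the leaked descriptions do account for), treats “dispersion” as population variance, and uses a function name, query, and anchor texts that I have made up.

```python
from statistics import pvariance


def anchor_text_metrics(query, anchors):
    """Summarize how query words are distributed across inbound anchor texts."""
    if not anchors:
        raise ValueError("at least one anchor text is required")
    query_words = set(query.lower().split())
    # Number of distinct query words found in each anchor (exact match only).
    per_link = [len(query_words & set(a.lower().split())) for a in anchors]
    n = len(anchors)
    return {
        "share_links_with_query_words": sum(c > 0 for c in per_link) / n,
        "share_links_with_all_query_words": sum(
            c == len(query_words) for c in per_link
        ) / n,
        # Assumption: "dispersion" is read here as population variance.
        "dispersion_of_query_word_counts": pvariance(per_link),
    }


# Hypothetical anchor texts pointing at one page.
metrics = anchor_text_metrics(
    "yandex seo guide",
    ["Yandex SEO guide", "seo tips", "click here", "yandex"],
)
```

A profile where every anchor contains the full query would score 1.0 on both shares with zero dispersion, which is the kind of uniform pattern you would expect to look unnatural.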

However, there are some link-related factors that are additional considerations when planning, monitoring, and analyzing backlinks:

  • The ratio of “good” versus “bad” backlinks to a website.
  • The frequency of links to the site.
  • The number of incoming SEO trash links between hosts.

The data leak also revealed that the link spam calculator has around 80 active factors that are taken into consideration, with a number of deprecated factors.

This raises the question as to how well Yandex is able to recognize negative SEO attacks, given it looks at the ratio of good versus bad links, and how it determines what a bad link is.

A negative SEO attack is also likely to be a short burst (high frequency) link event in which a site will unwittingly gain a high number of poor quality, non-topical, and potentially over-optimized links.

Yandex uses machine learning models to identify Private Blog Networks (PBNs) and paid links, and it makes the same assumption between link velocity and the time period they are acquired.

Typically, paid-for links are generated over a longer period of time, and these patterns (including link source site analysis) are what the Minusinsk update (2015) was introduced to combat.

Yandex Penalties

There are two ranking factors, both deprecated, named SpamKarma and Pessimization.

Pessimization refers to reducing PageRank to zero and aligns with the expectations of severe Yandex penalties.

SpamKarma also aligns with assumptions made around Yandex penalizing hosts and individuals, as well as individual domains.

Onpage Advertising

There are a number of factors relating to advertising on the page, some of them deprecated (like the screenshot example below).

Yandex factors. Screenshot from author, January 2023

It’s not known from the description exactly what the thought process with this factor was, but it could be assumed that a high ratio of adverts to visible screen was a negative factor, much like how Google takes umbrage if adverts obfuscate the page’s main content, or are obtrusive.

Tying this back to known Yandex mechanisms, the Proxima update also took into consideration the ratio of useful and advertising content on a page.

Can We Apply Any Yandex Learnings To Google?

Yandex and Google are disparate search engines, with a number of differences, despite the tens of engineers who have worked for both companies.

Because of this fight for talent, we can infer that some of these master builders and engineers will have built things in a similar fashion (though not direct copies), and applied learnings from previous iterations of their builds with their new employers.

What Russian SEO Pros Are Saying About The Leak

Much like the Western world, SEO professionals in Russia have been having their say on the leak across the various Runet forums.

The reaction in these forums has been different to SEO Twitter and Mastodon, with a focus more on Yandex’s filters, and other Yandex products that are optimized as part of wider Yandex optimization campaigns.

It is also worth noting that a number of conclusions and findings from the data match what the Western SEO world is also finding.

Common themes in the Russian search forums:

  • Webmasters asking for insights into recent filters, such as Mimicry and the updated PF filter.
  • The age and relevance of some of the factors, due to author names no longer being at Yandex, and mentions of long-retired Yandex products.
  • The main interesting learnings are around the usage of Metrika data, and information relating to the Crawler & Indexer.
  • A number of factors outline the usage of DSSM, which in theory was superseded by the release of Palekh in 2016. This was a search algorithm utilizing machine learning, announced by Yandex in 2016.
  • A debate around ICS scoring in Yandex, and whether or not Yandex may provide more traffic to a site and influence its own factors by doing so.

The leaked factors, particularly around how Yandex evaluates site quality, have also come under scrutiny.

There is a long-standing sentiment in the Russian SEO community that Yandex oftentimes favors its own products and services in search results ahead of other websites, and webmasters are asking questions like:

Why does it bother going to all this trouble, when it just nails its services to the top of the page anyway?

In loosely translated documents, these are referred to as the Sorcerers or Yandex Sorcerers. In Google, we’d call these search engine results page (SERP) features, like Google Hotels.

In October 2022, Kassir (a Russian ticket portal) claimed ₽328m compensation from Yandex due to lost revenue, caused by the “discriminatory conditions” in which Yandex Sorcerers took the customer base away from the private company.

This is off the back of a 2020 class action in which multiple companies raised a case with the Federal Antimonopoly Service (FAS) for anticompetitive promotion of its own services.



Featured Image: FGC/Shutterstock