Hi! I’m trying to build a ranking system for popularity based on publicly available data. I have five separate datasets, and I’m struggling because I haven’t done statistical analysis like this before. I’m hoping the great minds of E4 may be able to point me in the right direction.
Bias
This poll is a top-100 poll and has the fewest results. Ideally I would like to build a ranking that includes all Pokémon; however, only Google search volume covers every Pokémon.
2. Reddit Poll, International 2019 - Top 809
Bias
This dataset is good, although the number of participants was small.
3. International Pokémon Official Poll 2020 - Top 240
At first glance, Google search volume seems like the most conclusive and up-to-date metric of popularity: just see what people are searching for, right? The truth is that search volume contains noise that skews certain data points. Flamigo, Unown, Onix, and Vaporeon appearing in the top 12 shows us that unrelated searches are conflated into the search volume totals. Someone googling "Flamingo" makes a typo and suddenly Flamigo is the 2nd most searched-for Pokémon. I think Google search volume would make a good dataset for filling the gaps, potentially to help rank Pokémon not included in the other polls, or as a guideline.
The middle three polls report total vote counts per Pokémon. The issue is that some polls were larger than others, and those with the larger voting pools dominate the results. This creates an unrealistic ranking in which Greninja is the #1 most popular Pokémon.
I tried to make a convoluted composite rank score by taking the least common multiple of the polls’ sizes and multiplying each Pokémon’s score by that number to produce a ranking score. This also turned out unfavourably, with wonky results such as Flygon ranking higher than Umbreon.
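For comparison, a simpler normalisation than the LCM trick (this is my own sketch, not a method from the thread, and all vote counts are made up) is to convert each poll’s raw counts into vote shares so that poll size cancels out, then average the shares:

```python
# Sketch: convert raw vote counts to per-poll vote shares so larger polls
# no longer dominate, then average the shares. All numbers are invented.
polls = [
    {"Greninja": 140000, "Charizard": 120000, "Umbreon": 90000},  # big poll
    {"Greninja": 300, "Charizard": 500, "Umbreon": 450},          # small poll
]

shares = {}
for poll in polls:
    total = sum(poll.values())                 # poll size
    for name, votes in poll.items():
        shares.setdefault(name, []).append(votes / total)

# Average each Pokémon's share across the polls it appears in.
composite = {name: sum(s) / len(s) for name, s in shares.items()}
ranking = sorted(composite, key=composite.get, reverse=True)
print(ranking)
```

With these invented numbers, Greninja’s raw-count dominance in the big poll is diluted once both polls are weighted equally; averaging ranks instead of shares (a Borda-style count) would be another option with similar behaviour.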
Use some algebra and a graph to aggregate a list using both methods, use quicksort to order the scores from least to most, and then use a nearest-neighbour algorithm to find and classify groups, looking for patterns and other factors across the datasets. You could do this in one pass, or run it on each dataset separately and then combine the results into a single chart.
Probably inefficient, but then again, I got my knowledge from trying to program 3D games on a 2D platform.
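One way to read the suggestion above: sort each dataset’s scores, then group adjacent scores whose gap is small, which is a crude one-dimensional nearest-neighbour grouping. A minimal sketch, with made-up scores and an arbitrary gap threshold:

```python
# Sketch: sort scores least-to-most, then merge adjacent entries whose gap
# is below a threshold into one group. All scores are made up.
scores = {"Pikachu": 120, "Eevee": 118, "Snorlax": 90, "Ditto": 88, "Magikarp": 40}

ordered = sorted(scores.items(), key=lambda kv: kv[1])  # least to most popular

groups, current = [], [ordered[0]]
for prev, nxt in zip(ordered, ordered[1:]):
    if nxt[1] - prev[1] <= 5:      # hypothetical "nearness" threshold
        current.append(nxt)
    else:
        groups.append(current)
        current = [nxt]
groups.append(current)

for g in groups:
    print([name for name, _ in g])
```

Python’s built-in `sorted` already uses an efficient algorithm (Timsort), so there is no need to hand-roll quicksort for this.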
Yes, it would absolutely be better. I came to that conclusion because it would be “kinda objective/measurable” and surely correlated with general popularity (more people sending that card to be graded, more buyers willing to pay big money for that Pokémon, and so on).
Unfortunately it is indeed biased by scarcity, coolness of the artwork, difficulty to grade, and other external factors.
Probably I’m wrong, but I have the feeling that people aren’t always serious during polls: maybe they answer “Flygon”, “Luxray”, or “Cinccino” because they remember a couple of anime episodes or a part of the games, or because they prefer giving an edgier reply, without being very invested in the Pokémon brand.
Voting with $$$ could be more reliable, since someone buying a Sylveon shirt, an Umbreon plush, or a Charizard card is surely invested in it and willing to pay.
I think you would probably need some public survey data like the Q Score to have a real data-driven approach to this. I don’t think aggregating these sets makes sense, because you’re aggregating bad data in the first place. Polls of Pokémon fans are subject to bias and manipulation; Google searches are likely better, but there is no way Clodsire is the #3 most popular Pokémon.
This is the most important point. Bad data will never be fixed by a perfect algorithm or composite score. Sampling bias will forever be your problem.
Sampling collectors may bias your results toward the popular goods made for those collectors (e.g., chase cards, merchandise).
Sampling trading card/video game players may bias your results toward powerful, meta-defining Pokemon.
Sampling the general public may bias your results toward cute and heavily advertised Pokemon.
Beyond these sampling bias issues, there are also cohort and period effects.
Cohort Effects: We all grow up with a subset of Pokemon that we are most fond of or develop relationships to over time. Each cohort (often defined by birth year and what video games/sets were available to play with/collect during their formative years) will have biases based on familiarity.
Period Effects: When a poll is taken will affect the results of that poll. Each year, new games/sets come out, pokedexes expand, different Pokemon are highlighted by TPC/TPCi in the anime and in merchandise, and new Pokemon are added to the list every few years.
Thank you for the mini bias refresher. I think you and fourthstar are right. There are obvious biases in the data; period effects are especially obvious in the polls and the Google search data. The question then becomes: how can we collect unbiased data?
I thought about TCGFish when @decoypalmette mentioned going through the market cap of each species. Is there a way to sort the top 100, top 240, or even all species this way without having to manually scroll through the market-cap data and count each new species?
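If the card-level data can be exported, for example as rows of (card, species, market value) — I’m assuming such an export exists; TCGFish itself may not offer one — then ranking species by total market value is just a group-and-sum. A sketch with made-up rows:

```python
from collections import defaultdict

# Made-up card-level rows: (card name, species, market value in USD).
cards = [
    ("Charizard Base Set", "Charizard", 400.0),
    ("Charizard VMAX", "Charizard", 60.0),
    ("Umbreon VMAX Alt Art", "Umbreon", 550.0),
    ("Pikachu Promo", "Pikachu", 35.0),
]

# Sum market value per species, then sort descending.
market_cap = defaultdict(float)
for _, species, value in cards:
    market_cap[species] += value

top = sorted(market_cap.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

This removes the manual counting, but not the market biases raised below (print runs, rarity, artwork) since every card of a species still contributes its full price.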
I’m not sure that the market reflects popularity. The market has a lot of bias. For example: how many cards of that species were printed? What’s the rarity? These factors influence the number sold.
You would have to randomly poll a representative sample of consumers, made up of the lay public, collectors, and players, with a sufficiently large sample size. Repeat that over several years and aggregate the data. That will resolve some sampling biases and period effects.
Cohort effects are more difficult to disentangle because familiarity and nostalgia permanently affect affinity toward specific Pokemon. You could stratify your results by age group (cohort), or develop sampling weights for Pokemon generation exposure.
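Stratifying by cohort could look like the following sketch (the cohort labels, vote counts, and population weights are all invented): compute each Pokémon’s vote share within each age cohort, then combine the cohorts using population weights so that no single cohort’s nostalgia dominates.

```python
# Sketch: per-cohort vote shares combined with invented population weights.
# All numbers are made up for illustration.
cohort_votes = {
    "1990s": {"Charizard": 700, "Greninja": 300},
    "2000s": {"Charizard": 400, "Greninja": 600},
}
cohort_weights = {"1990s": 0.5, "2000s": 0.5}  # hypothetical population shares

composite = {}
for cohort, votes in cohort_votes.items():
    total = sum(votes.values())                # cohort sample size
    for name, v in votes.items():
        composite[name] = composite.get(name, 0.0) + cohort_weights[cohort] * v / total

print(composite)
```

Varying the weights (e.g., by actual age-group population data) is where the real survey-design work would come in.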
Clearly, all of this is far more work than it is worth.