zaptrem 3 hours ago [-]
Love me some JSD. Here is a problem most people don't consider with generative modeling (e.g., AI text, image, music, video models): basically all standard pre-training objectives for generative models (e.g., cross entropy, basically all diffusion/flow formulations) minimize something close to a Forward KL divergence. In other words, given limited capacity the model will try to stretch itself to cover every mode. This gives you a jack of all trades (lots of knowledge and diversity), but a master of none (you get blurry images and text filled with nonsense).
The real magic in generative modeling comes from the post training process that comes after, which usually (e.g., RLHF) approximates Reverse KL (given limited capacity, try to perfectly cover what you can, but it's fine to drop the rest entirely). This gives amazing results, but is also the cause of AI oddities like the "AI Image Pixar Look", many of the verbal tics of LLMs, and all AI music using the same small set of voices. Jensen-Shannon Divergence sits right in the middle of Forward and Reverse KL and is what many GANs are claimed to approximate. Ideally, it is a better trade-off between diversity and fidelity.
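A toy illustration of the mode-covering vs. mode-dropping behavior described above, not anything from a real training pipeline: fit a single Gaussian q to a two-mode target p by brute-force grid search, once under forward KL(p || q) and once under reverse KL(q || p). The target, the grid, and every name here (`normal_pmf`, `best_fit`, etc.) are made up for the sketch.

```python
import numpy as np

# Grid on which all distributions are discretized (hypothetical toy setup).
x = np.linspace(-10.0, 10.0, 2001)

def normal_pmf(mu, sigma):
    """Gaussian evaluated on the grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

# Target p: two well-separated modes at -3 and +3.
p = 0.5 * normal_pmf(-3.0, 0.7) + 0.5 * normal_pmf(3.0, 0.7)

def kl(a, b, eps=1e-300):
    """KL(a || b) in nats; terms where a == 0 contribute 0."""
    mask = a > 0
    return float(np.sum(a[mask] * (np.log(a[mask]) - np.log(b[mask] + eps))))

# Candidate single-Gaussian models q(mu, sigma).
mus = np.linspace(-5.0, 5.0, 41)
sigmas = np.linspace(0.5, 4.0, 36)

def best_fit(divergence):
    """Grid-search the (mu, sigma) whose q minimizes divergence(q)."""
    candidates = [(divergence(normal_pmf(m, s)), m, s) for m in mus for s in sigmas]
    _, m, s = min(candidates)
    return float(m), float(s)

mu_f, sigma_f = best_fit(lambda q: kl(p, q))  # forward KL(p || q)
mu_r, sigma_r = best_fit(lambda q: kl(q, p))  # reverse KL(q || p)

print("forward KL fit: mu=%.2f sigma=%.2f" % (mu_f, sigma_f))
print("reverse KL fit: mu=%.2f sigma=%.2f" % (mu_r, sigma_r))
```

On this toy target the forward-KL fit should sit between the modes with a wide sigma (covering both, and putting mass where there is no data), while the reverse-KL fit should lock onto one mode with a narrow sigma and ignore the other.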
sansseriff 7 hours ago [-]
It has applications outside of machine learning too! I used symmetric Kullback–Leibler divergence for a project with photon-number-resolving single-photon detectors during my PhD. I used it with an adjacency matrix to split a Gaussian mixture model (modeling some data with multivariate Gaussians) into a series of clusters.
Currently piloting the use of JSD for a synthetic audience survey application, measuring how closely the synthetic response distribution matches a human panel.
Been knee-deep trying to understand this world, so seeing this on Hacker News today is kind of scary.
wilted-iris 10 hours ago [-]
This looks interesting and I'm curious if anyone has more context for why it's on the frontpage today.
acjohnson55 10 hours ago [-]
Every now and then, a random math or science concept hits front page. Usually, people chime in with interesting perspectives on it. Guess we'll see.
raddan 9 hours ago [-]
I’d like to know what the advantage is over KL divergence. It seems like the important idea is symmetry? Not clear to me why that matters; I’d love to know what application this is used for.
fumeux_fume 9 hours ago [-]
There are many applications. I mainly see it used for detecting drift in datasets for ML models. It has a nice benefit over the KL divergence in the case where the two distributions you're measuring have no overlap: KL blows up to infinity, but JS stays finite and just returns its maximum (ln 2, or 1 if you use log base 2). Also, its square root is a true distance metric rather than just a divergence (it is symmetric and obeys the triangle inequality), which makes values comparable across different pairs of distributions.
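A minimal NumPy sketch of the no-overlap case, assuming discrete distributions given as probability vectors. The helper names `kl` and `jsd` and the toy vectors are just for illustration, not any library's API.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; inf if q is 0 anywhere p is positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence in nats: average KL to the 50/50 mixture."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two distributions with completely disjoint support.
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])

print("KL(p || q) =", kl(p, q))            # infinite: q has no mass where p does
print("JSD(p, q)  =", jsd(p, q))           # finite, at its maximum
print("ln 2       =", np.log(2))
print("JS distance =", np.sqrt(jsd(p, q)))  # square root gives a metric
```

Identical distributions give a JSD of 0, and fully disjoint ones give ln 2, so for drift monitoring the value always lands in a fixed, interpretable range.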
patcon 6 hours ago [-]
> Also, when taking its square root you get a distance
Easy conversion into a distance metric is hugely valuable to making the property amenable to KNN-based dimensionality reduction algos (and I'm sure other things I don't understand, as a non-mathematician)
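To make the nearest-neighbor idea concrete, here is a brute-force sketch that builds a pairwise Jensen-Shannon distance matrix over a few histograms and picks each row's nearest neighbor. It is plain NumPy with hypothetical names (`js_distance_matrix`) and toy data, not the API of any particular library; a real KNN-based pipeline would use an approximate index instead of the full matrix.

```python
import numpy as np

def js_distance_matrix(P):
    """Pairwise sqrt(JSD) between rows of P (each row a probability vector)."""
    P = np.asarray(P, dtype=float)
    A = P[:, None, :]   # shape (n, 1, d)
    B = P[None, :, :]   # shape (1, n, d)
    M = 0.5 * (A + B)   # pairwise mixtures, shape (n, n, d)

    def kl_to_mixture(X):
        # Sum of x * log(x / m), with the convention 0 * log 0 = 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(X > 0, X * np.log(X / M), 0.0)
        return terms.sum(axis=-1)

    jsd = 0.5 * kl_to_mixture(A) + 0.5 * kl_to_mixture(B)
    return np.sqrt(np.clip(jsd, 0.0, None))  # clip guards tiny negative round-off

# Toy histograms: two peaked on the left, two peaked on the right.
P = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
    [0.0, 0.1, 0.3, 0.6],
])

D = js_distance_matrix(P)

# Nearest neighbor of each row, excluding the row itself.
D_knn = D.copy()
np.fill_diagonal(D_knn, np.inf)
nearest = D_knn.argmin(axis=1)
print(nearest)
```

Because the square root of JSD satisfies the triangle inequality, tree- and graph-based neighbor searches that assume a proper metric can in principle be used with it.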
Iirc (and I could be wrong, this is from memory) JS divergence is what is minimized in GANs (where we simultaneously train a generator and real/synthetic classifier with the goal of each trying to beat the other to converge on real looking synthetic data), at least for some training methods.
I don’t think GANs are used much now in comparison to diffusion models, but as recently as a few years ago they were the standard way to make fake data, a la “this face does not exist”
lasermatts 9 hours ago [-]
The Hacker News hive mind is real!
I was just reading about JSD the other day after reading about KL divergence...seems like a nifty measurement device for things like sim-to-real evaluations in robots (the reason I was going down this rabbit hole.)
I think the appeal over raw KL is that JSD behaves a bit nicer when the simulated and real distributions don't perfectly overlap...which is basically always true in the real world!
ernsheong 6 hours ago [-]
I thought Jensen Huang was getting a divorce :D
knollimar 5 hours ago [-]
I thought it was some anthropic nvidia breakup
mountainriver 8 hours ago [-]
Why not use this instead of KL in reinforcement learning?
anvuong 6 hours ago [-]
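JSD is a symmetrized, smoothed KL, but it is not the forward KL + reverse KL (that sum is the Jeffreys divergence). JSD averages the KL from each distribution to their 50/50 mixture M = (P + Q)/2: JSD(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M).
In reinforcement learning, what we usually want is to find the optimal action, i.e. the action that maximizes the reward. This translates to so-called "mode-seeking" optimization, which is what the reverse KL gives you.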
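For reference, a small sketch comparing two ways of symmetrizing KL on a toy pair of distributions: the plain sum of forward and reverse KL (usually called the Jeffreys divergence) and JSD, which measures each distribution against their mixture. Function names and the example vectors are illustrative assumptions; `kl` here assumes the second argument has full support.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    """Forward KL + reverse KL: symmetric, but unbounded."""
    return kl(p, q) + kl(q, p)

def jsd(p, q):
    """Average KL to the mixture m = (p + q) / 2: symmetric, bounded by ln 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two nearly (but not exactly) disjoint distributions.
p = np.array([0.98, 0.01, 0.01])
q = np.array([0.01, 0.01, 0.98])

print("Jeffreys:", jeffreys(p, q))
print("JSD:     ", jsd(p, q))
print("ln 2:    ", np.log(2))
```

As the overlap shrinks, the Jeffreys value keeps growing without limit while JSD saturates below ln 2, which is one reason the two are not interchangeable as training or evaluation objectives.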
https://snsphd.online/chapter_04/section_05_results/#photon-...
Here's a library that the creator of UMAP provides (UMAP being a workhorse of dimensional reduction algos), for doing approx nearest neighbor search: https://pynndescent.readthedocs.io/en/latest/api.html#pynnde...