TODO: Error diffusion experiments

Platforms: all

Coding time: M
Experimentation time: XL
Skill required: XL

Prerequisite reading:
doc/less-than-8-bit.txt


Overview
========

In internal/pack.h, the Requantize function takes care of requantizing 8-bit
input values to less than 8 bits. This is currently done either by
rounding-to-nearest or by probabilistic rounding.
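
For concreteness, here is a simplified sketch of what these two existing
requantization modes amount to. This is hypothetical illustration code, not
the actual internal/pack.h implementation; the function names and the
shift-based scaling are assumptions made for the example.

  #include <cstdint>
  #include <cstdlib>

  // Requantize an 8-bit value down to 'bits' bits (1 <= bits < 8).
  // 'rounding_offset' is what distinguishes the two modes.
  std::uint8_t Requantize8To(std::uint8_t value, int bits, int rounding_offset) {
    const int shift = 8 - bits;
    int q = (value + rounding_offset) >> shift;
    const int max = (1 << bits) - 1;
    if (q > max) q = max;  // clamp to the destination range
    return static_cast<std::uint8_t>(q);
  }

  // Rounding-to-nearest: a fixed half-step offset.
  std::uint8_t RequantizeRoundToNearest(std::uint8_t value, int bits) {
    return Requantize8To(value, bits, 1 << (8 - bits - 1));
  }

  // Probabilistic rounding: a uniformly random offset in [0, 2^(8-bits)).
  std::uint8_t RequantizeProbabilistic(std::uint8_t value, int bits) {
    return Requantize8To(value, bits, std::rand() % (1 << (8 - bits)));
  }

Probabilistic rounding avoids the systematic bias of round-to-nearest at the
cost of a larger per-value error; error diffusion is a candidate for getting
the best of both.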

People have suggested trying error diffusion instead:
https://en.wikipedia.org/wiki/Error_diffusion
This technique, which originates in graphics, might be adaptable to GEMM;
however, doing so is far from trivial.
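
As a point of reference before discussing the difficulties, here is a minimal
1-D error diffusion sketch (hypothetical code, not part of gemmlowp): it
quantizes real values in [0, 1] to {0, 1}, carrying each element's
quantization error over to the next element.

  #include <cstddef>
  #include <vector>

  std::vector<int> ErrorDiffuse1D(const std::vector<float>& input) {
    std::vector<int> output(input.size());
    float carried_error = 0.0f;
    for (std::size_t i = 0; i < input.size(); ++i) {
      const float target = input[i] + carried_error;  // value plus diffused error
      const int quantized = target >= 0.5f ? 1 : 0;   // quantize to {0, 1}
      carried_error = target - quantized;             // diffuse the error forward
      output[i] = quantized;
    }
    return output;
  }

The appeal is that the quantization error averages out over a run of values
instead of accumulating as a systematic bias in each individual value.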

Still, it may be worth experimenting with, as the reward of higher accuracy
could be substantial, especially if it allows exploring even smaller
bit depths.


Why getting error diffusion to work is nontrivial
=================================================

In graphics, there is only one array to apply error diffusion to, and the
criteria are mostly aesthetic. Here in GEMM, there are two arrays involved,
which allows for unwanted interactions between the error-diffusion terms added
separately on either side; and we have stringent accuracy criteria.

Here is a toy example showing how naive approaches to error diffusion may
suffer from unwanted interactions between the LHS and RHS separate error
diffusion terms:

Say that we're working on 1-dimensional data (as opposed to 2-D matrices) to
simplify the discussion.

Say that our input values are real numbers in [0, 1] and that we're quantizing
them to either 0 or 1.

Say that the left-hand-side is filled with the constant value 0.9 and that our
error-diffusion filter results in the following sequence of quantized values:
1 (repeated 9 times), 0, ... (repeat).

Say that the right-hand-side is filled with the constant value 0.1 and that
our error-diffusion filter results in the following sequence of quantized
values: 0 (repeated 9 times), 1, ... (repeat).

So if we compute the dot product (which is what we really do in a GEMM) of
these quantized vectors, we're computing
1*0 + ... (repeated 9 times) + 0*1 + ... (repeat)

So we get exactly 0, whereas the true dot product is 0.9 * 0.1 = 0.09 per
entry! This shows how a naive approach to error diffusion may suffer from
bias issues similar to those of round-to-nearest.
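
A small self-contained program, assuming exactly the periodic quantized
sequences described above, makes the failure concrete:

  #include <cstdio>
  #include <vector>

  int main() {
    const int n = 100;  // a multiple of the period 10
    std::vector<int> lhs(n), rhs(n);
    for (int i = 0; i < n; ++i) {
      lhs[i] = (i % 10 < 9) ? 1 : 0;  // quantization of the constant 0.9 LHS
      rhs[i] = (i % 10 < 9) ? 0 : 1;  // quantization of the constant 0.1 RHS
    }
    int quantized_dot = 0;
    for (int i = 0; i < n; ++i) quantized_dot += lhs[i] * rhs[i];
    const float true_dot = 0.9f * 0.1f * n;  // what the result should approximate
    std::printf("quantized: %d, true: %.2f\n", quantized_dot, true_dot);
    return 0;
  }

This prints a quantized dot product of 0 against a true value of 9.0.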


Some avenues to explore to make error diffusion work
====================================================

1. Maybe some fixed error diffusion kernels just happen to avoid that issue?

2. Maybe it's just a matter of doing error diffusion for a different vector
error metric, e.g. l^2 instead of l^1?

3. Maybe some randomization (adding some random term to the error term being
diffused) would be acceptable? It seems like it would avoid the interference
problem discussed above; see the sketch after this list.
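
As a sketch of avenue 3 (hypothetical code, not part of gemmlowp), a small
random perturbation can be added to the error being diffused, so that the LHS
and RHS quantization patterns are decorrelated and cannot lock into the
interference pattern shown in the toy example above:

  #include <cstddef>
  #include <random>
  #include <vector>

  std::vector<int> RandomizedErrorDiffuse1D(const std::vector<float>& input,
                                            float noise_amplitude,
                                            std::mt19937& rng) {
    std::uniform_real_distribution<float> noise(-noise_amplitude,
                                                noise_amplitude);
    std::vector<int> output(input.size());
    float carried_error = 0.0f;
    for (std::size_t i = 0; i < input.size(); ++i) {
      const float target = input[i] + carried_error;
      const int quantized = target >= 0.5f ? 1 : 0;
      // Diffuse the quantization error plus a small random term.
      carried_error = (target - quantized) + noise(rng);
      output[i] = quantized;
    }
    return output;
  }

Whether the added noise costs more accuracy than it recovers by breaking the
interference is exactly what would need to be measured experimentally.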


Performance considerations
==========================

Error diffusion is going to be relatively expensive compared to the current
requantization methods. It may be acceptable for large enough GEMM depth,
since it only needs to be applied once to each of the n^2 input matrix
entries, thus becoming negligible compared to the n^3 arithmetic cost of GEMM
for large enough n.

Alternatively, we may consider requantizing some matrices once and for all,
but that would likely apply only to one of the LHS or RHS; if both could be
requantized once and for all, one might as well precompute the whole GEMM.