Author: Markus Demleitner
Date: 2021-06-24
Nb of terms discussed together : 1
New Term: stat.histogram
Action: Addition
Label: A list of counts or ratios in bins
Prefix: P
Use Case: A column contains not a single value but a
distribution-like entity. For example, run a query like
select
gavo_histogram(phot_g_mean_mag, 5, 15, 10) as dist,
round(100/parallax) as bin
from tgas.main
where parallax>5
group by bin
on the TAP service and inspect the dist column.
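To make the shape of the dist column concrete, here is a minimal Python sketch of a grouped, fixed-width histogram. It is not the implementation of gavo_histogram: the function name grouped_histogram, the sample rows, and the choice to drop out-of-range values are all assumptions made for illustration.

```python
from collections import defaultdict


def grouped_histogram(rows, lower, upper, nbins):
    """Return {group_key: [count per bin]} for (group_key, value) pairs.

    Assumption: values outside [lower, upper) are dropped; the actual
    gavo_histogram function may treat them differently.
    """
    width = (upper - lower) / nbins
    result = defaultdict(lambda: [0] * nbins)
    for key, value in rows:
        if lower <= value < upper:
            # Clamp to guard against floating-point rounding at the upper edge.
            index = min(int((value - lower) // width), nbins - 1)
            result[key][index] += 1
    return dict(result)


# (distance bin, G magnitude) pairs; made-up numbers for illustration only.
rows = [(10, 5.5), (10, 5.9), (10, 14.2), (20, 9.0), (20, 16.0)]
hist = grouped_histogram(rows, 5, 15, 10)
```

Each value of hist is one array per group, which is the kind of cell content the proposed UCD is meant to annotate.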
Proposed Solution:
A primary-only atom stat.histogram would work for this. In the example in the
use case, the service could assign the UCD stat.histogram;pos.parallax to the
dist column. A footnote like "This also includes simple nonparametric
distributions" or similar would be appreciated.
The column dist from the example certainly cannot be annotated as pos.parallax,
and while there is stat.likelihood, there is no indication that this is
intended to cover collections of likelihoods. I argue the UCD system should
allow a plausible annotation of the dist column. If so, it appears we need some
new term.
This new term should, I think, cover simple binned aggregations as well as
nonparametric distributions (in the sense of \sum_{x\in\Omega} P(x)=1) in order
to have sufficient generality. Hence, I'd avoid stat.distribution, which would
imply normalisation.
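The difference between the two cases the term should cover can be sketched in a few lines of Python: raw bin counts on the one hand, and the same bins rescaled so that they sum to 1 on the other. The function name normalise and the sample counts are hypothetical.

```python
def normalise(counts):
    """Turn bin counts into ratios that sum to 1 (a discrete distribution)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalise an empty histogram")
    return [c / total for c in counts]


counts = [2, 0, 1, 1]        # a simple binned aggregation
ratios = normalise(counts)   # the same bins as a nonparametric distribution
```

Both counts and ratios would be annotated with the same term under this proposal; only the second satisfies the normalisation condition.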
Also, stat.distribution (or something like that) might, some day, be useful to
annotate a different thing that this would not cover: columns containing
distributions in the sense of some representation of "a gaussian with mu=3
and sigma=0.3" or "a poisson distribution with lambda=0.2".
The discussion started on the semantics mailing list. See the starting point
for this discussion here.
The proposed term describes an array rather than the single value that a UCD
usually tags.
Note that the semantic tag does not describe any property of the array other
than the fact that, statistically, it is a histogram.
Examine more use cases in which this UCD could be combined with others, to
settle the P or Q status before approving the addition.
Further discussion:
Being able to tell that a column contains a histogram could be useful.
However, the UCD alone is not sufficient to describe that column.
We suggest that the histogram column be connected to a <GROUP ... /> through
a ref attribute. The GROUP should define the bin size, the number of bins and
the lower and upper bounds of the histogram, with units. Example:
<GROUP ID='histogram'>
<PARAM name='lower_bound' datatype='float' value='5' unit='' ucd='phot.mag;em.opt;stat.mean'/>
<PARAM name='upper_bound' datatype='float' value='15' unit='' ucd='phot.mag;em.opt;stat.mean'/>
<PARAM name='bin_count' datatype='int' value='10' unit='' ucd='meta.number'/>
<PARAM name='bin_size' datatype='float' value='1' unit='' ucd=''/>
</GROUP>
<FIELD ID='dist' name='dist' datatype='int' ucd='stat.histogram;phot.mag;em.opt;stat.mean' ref='histogram' arraysize='*'/>
The result of the proposed query cannot be used out of context, since the
histogram properties are not described. This is beyond the scope of Semantics/UCD,
but it shows that we have to be cautious and should perhaps propose a Note on how
to deal with histograms in VOTable columns.
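As an illustration of why the bounds and bin count matter, the following Python sketch shows how a client could reconstruct bin intervals from such metadata and attach them to the counts in a cell. It assumes equal-width bins and an array holding exactly bin_count elements (no underflow or overflow slots); the function names and the sample counts are hypothetical.

```python
def bin_edges(lower_bound, upper_bound, bin_count):
    """Return the bin_count + 1 edges of equal-width bins."""
    bin_size = (upper_bound - lower_bound) / bin_count
    return [lower_bound + i * bin_size for i in range(bin_count + 1)]


def label_bins(counts, lower_bound, upper_bound):
    """Pair each count with its (low, high) interval."""
    edges = bin_edges(lower_bound, upper_bound, len(counts))
    return list(zip(zip(edges[:-1], edges[1:]), counts))


edges = bin_edges(5, 15, 10)
# Made-up cell content for one row of the dist column.
labelled = label_bins([2, 0, 0, 0, 1, 0, 0, 0, 0, 1], 5, 15)
```

Without the three metadata values, the ten integers in a cell cannot be mapped back to magnitude intervals, which is the point made above.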
Further discussion:
