Hostname: page-component-76fb5796d-zzh7m Total loading time: 0 Render date: 2024-04-26T15:40:28.348Z Has data issue: false hasContentIssue false

Optimal Sampling for Identifying a Cluster: An Application in Sorting

Published online by Cambridge University Press:  27 July 2009

Shmuel Gal
Affiliation:
IBM Israel, Science and Technology LTD.Haifa, Israel
Dafna Sheinwald
Affiliation:
IBM Israel, Science and Technology LTD.Haifa, Israel

Abstract

We consider the following problem. For a given population of m items, we have to make a decision whether or not the population includes a relatively large cluster of identical items. This decision affects the effectiveness of a subsequent computational process, depending on the actual existence of the cluster and its size. To make a good decision, we use a statistical sample which should indicate the existence of a cluster and find a representative thereof. This paper describes the optimal sampling technique to be used in such a case, given the cost of the sampling and the potential gain in speed of the subsequent process. The optimal fixed sample size is specified, as well as the optimal sequential sampling, along with characterizing the dependence of the cost function on the truncation point.

For the case that the a priori distribution of the cluster proportion is known, we present formulae by which the optimal sampling procedures can be easily calculated. For the common situation in which the a priori distribution is not known, we present, in the case of a fixed sample size, a tight upper bound for the sample size, which is independent of the a priori distribution, and for the case of the sequential sampling, we present an approximately optimal truncation point, which is also independent of the a priori distribution.

The situation described arose in connection with choosing the best sorting method, an application that will be described in full detail. The most interesting practical result is that for our application truncating the sequential procedure at 35 observations, out of a population of 25,000–30,000 items, guarantees that in our sorting application we are always within 2.1% of the optimal cost independently of the a priori distribution.

Type
Research Article
Copyright
Copyright © Cambridge University Press 1993

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

1.Berger, J.O. (1980). Statistical decision theory. New York: Springer-Verlag.CrossRefGoogle Scholar
2. IBM system/370 extended architecture principle of operation. Publication SA22-7085-1, IBM.Google Scholar
3.Knuth, D.E. (1973). The art of programming, sorting and searching. Addison-Wesley.Google Scholar