Approximation Algorithms: Clustering and Facility Location
These are lecture notes from Chandra Chekuri's CS583 course on Approximation Algorithms. Chapter 9: Clustering and Facility Location.
Chapter 9
Clustering and Facility Location are two widely studied topics with a vast literature. Facility location problems have been well studied in Operations Research and logistics. Clustering is ubiquitous, with many applications in data analysis and machine learning. We confine attention to a few central problems and provide pointers to other topics as needed. These problems have also played an important role in approximation algorithms, and their study has led to a variety of interesting techniques. Research on these topics is still quite active.
For both classes of problems a key assumption that we will make is that we are working with points in some underlying metric space. Recall that a space $(X, d)$ is a metric space if the distance function $d: X \times X \to \mathbb{R}_{\ge 0}$ satisfies the metric properties: (i) $d(x,y) = 0$ iff $x = y$ (reflexivity), (ii) $d(x,y) = d(y,x)$ for all $x, y$ (symmetry), and (iii) $d(x,z) \le d(x,y) + d(y,z)$ for all $x, y, z$ (triangle inequality). We will abuse the notation and use $d(A, B)$ for two sets $A, B \subseteq X$ to denote the quantity $\min_{a \in A, b \in B} d(a, b)$. Similarly for $x \in X$ and $A \subseteq X$, $d(x, A)$ will denote $\min_{a \in A} d(x, a)$.
Center based clustering: In center based clustering we are given points $X$ in a metric space $(X, d)$, and an integer $k$. The goal is to cluster/partition $X$ into $k$ clusters which are induced by choosing $k$ centers $S \subseteq X$. Each point is assigned to its nearest center from $S$ and this induces a clustering. The nature of the clustering is controlled by an objective function that measures the quality of the clusters. Typically we phrase the problem as choosing $S$ to minimize the clustering objective $\left(\sum_{x \in X} d(x, S)^p\right)^{1/p}$ for some $p$. The three most well-studied problems are special cases obtained by choosing an appropriate $p$.
- $k$-Center is the problem when $p = \infty$, which can be equivalently phrased as minimizing $\max_{x \in X} d(x, S)$. In other words we want to minimize the maximum distance of the input points to the cluster centers.
- $k$-Median is the problem when $p = 1$, that is, minimizing $\sum_{x \in X} d(x, S)$.
- $k$-Means is the problem when $p = 2$, that is, minimizing $\sum_{x \in X} d(x, S)^2$.
We will mainly focus on the discrete versions of the problems where $S \subseteq X$, which means that the centers are to be chosen from the input points themselves. However, in many data analysis applications the points lie in $\mathbb{R}^d$ for some $d$ and the centers can be chosen anywhere in the ambient space. In fact this makes the problems more difficult in a certain sense since the center locations now come from an infinite set. One can argue that limiting centers to the input points does not lose more than a constant factor in the approximation, and this may be reasonable from a first-cut point of view but perhaps not ideal from a practical point of view. In some settings there may be a requirement or advantage in choosing the cluster centers from the input set.
Facility Location: In facility location we typically have two finite sets $F$ and $D$, where $F$ represents a set of locations where facilities can be located and $D$ is a set of client/demand locations. We will assume that there is a metric $d$ over $F \cup D$. There are several variants, but one of the simplest is the Uncapacitated Facility Location (UCFL) problem. In UCFL we are given, as auxiliary information, an opening cost $f_i$ for each location $i \in F$. The goal is to open a subset of facilities in $F$ to minimize the sum of the cost of the opened facilities and the total distance traveled by the clients to their nearest open facility. In other words we want to solve $\min_{F' \subseteq F} \sum_{i \in F'} f_i + \sum_{j \in D} d(j, F')$. The problem has close connections to the $k$-Median problem. The term "uncapacitated" refers to the fact that we do not limit the number of clients that can be assigned to an open facility. In several OR applications that motivate facility location (opening warehouses or distribution facilities), capacity constraints are likely to be important. For this reason there are several capacitated versions.
9.1 $k$-Center
Recall that in $k$-Center we are given points $X$ in a metric space $(X, d)$ and an integer $k$, and we need to choose $k$ cluster centers $S \subseteq X$ to minimize $\max_{x \in X} d(x, S)$. An alternative view is that we wish to find the smallest radius $r$ such that there are $k$ balls of radius $r$ that together cover all the input points. Given a fixed $r$ this can be seen as a Set Cover problem. In fact there is an easy reduction from Dominating Set to $k$-Center establishing NP-Hardness. Moreover, as we saw in an earlier chapter, $k$-Center admits no $(2-\epsilon)$-approximation for any $\epsilon > 0$ unless $P = NP$, via a reduction from Dominating Set. Here we will see two 2-approximation algorithms that are quite different and have their own advantages. The key lemma for their analysis is common and is stated below.
Lemma 9.1. Suppose there are $k+1$ points $x_1, x_2, \ldots, x_{k+1}$ such that $d(x_i, x_j) > 2r$ for all $i \ne j$. Then the optimum $k$-Center radius is more than $r$.
Proof. Suppose the optimum radius is at most $r$. Then there are $k$ centers which induce $k$ clusters such that for each cluster with center $c$ and each point $x$ in that cluster we have $d(x, c) \le r$. By the pigeonhole principle some $x_i, x_j$ with $i \ne j$ are in the same cluster, but this implies that $d(x_i, x_j) \le d(x_i, c) + d(c, x_j) \le 2r$, which contradicts the assumption of the lemma.
Note that the lemma holds even if the centers can be chosen from outside the given point set $X$.
9.1.1 Gonzalez's algorithm and nets in metric spaces
The algorithm starts with an empty set of centers, and in each iteration picks a new center which is the farthest point from the set of centers chosen so far.
1. $S \leftarrow \{s_1\}$ where $s_1$ is an arbitrary point of $X$
2. For $i = 2$ to $k$: let $s_i$ be a point of $X$ farthest from $S$, that is, $s_i \in \arg\max_{x \in X} d(x, S)$; set $S \leftarrow S \cup \{s_i\}$
3. Output $S$
Note that the first center $s_1$ is chosen arbitrarily.
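The greedy procedure above is easy to state in code. Below is a minimal Python sketch; the toy point set and the Euclidean distance function are illustrative assumptions, not part of the notes:

```python
import math

def gonzalez(points, k, dist):
    """Greedy farthest-point clustering: returns k centers whose
    covering radius is at most twice the optimum k-center radius."""
    centers = [points[0]]  # the first center is chosen arbitrarily
    # d_to_S[i] = distance from points[i] to the current center set
    d_to_S = [dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        # next center: the point farthest from the chosen centers
        far = max(range(len(points)), key=lambda i: d_to_S[i])
        centers.append(points[far])
        for i, p in enumerate(points):
            d_to_S[i] = min(d_to_S[i], dist(p, points[far]))
    return centers, max(d_to_S)

# toy instance in the Euclidean plane
pts = [(0, 0), (10, 0), (0, 10), (1, 1), (9, 1)]
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
centers, radius = gonzalez(pts, 2, euclid)
```

Running the algorithm with $k = n$ and recording the order in which points are chosen yields the permutation mentioned in Remark 9.1 below.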
Theorem 9.1. Let $r = \max_{x \in X} d(x, S)$ where $S$ is the set of centers chosen by Gonzalez's algorithm. Then $r \le 2\,\mathrm{OPT}$ where $\mathrm{OPT}$ is the optimum $k$-Center radius for the instance.
Proof. Suppose not. Then there is a point $x$ such that $d(x, S) > 2\,\mathrm{OPT}$. Since the algorithm chose the farthest point in each iteration and could have chosen $x$ in each of the iterations but did not, we have the property that $d(s_i, \{s_1, \ldots, s_{i-1}\}) > 2\,\mathrm{OPT}$ for $i = 2$ to $k$. This implies that the distance between each pair of points in the set $\{s_1, \ldots, s_k, x\}$ is more than $2\,\mathrm{OPT}$. Since these are $k+1$ points, by Lemma 9.1 the optimum radius must be larger than $\mathrm{OPT}$, a contradiction.
Exercise 9.1. Construct an instance to demonstrate that the algorithm's worst-case performance ratio is arbitrarily close to 2.
Remark 9.1. Gonzalez's algorithm can be extended in a simple way to create a permutation $x_1, x_2, \ldots, x_n$ of the points; we simply run the algorithm with $k = n$. It is easy to see from the proof above that for any $k$, the prefix of the permutation consisting of the first $k$ points provides a 2-approximation for that choice of $k$. Thus, one can compute the permutation once and reuse it for all $k$.
9.1.2 Hochbaum-Shmoys bottleneck approach
A second algorithmic approach for -Center is due to Hochbaum and Shmoys and has played an influential role in variants of this problem.
For a point $x$ and radius $r$ let $B(x, r)$ denote the ball of radius $r$ around $x$.
1. Guess the optimum radius $R$
2. $S \leftarrow \emptyset$, $X' \leftarrow X$
3. While $X' \ne \emptyset$: pick an arbitrary point $x \in X'$; $S \leftarrow S \cup \{x\}$; remove all points in $B(x, 2R)$ from $X'$
4. Output $S$
Theorem 9.2. Let $S$ be the output of the HS algorithm for a guess $R$. Then $d(x, S) \le 2R$ for all $x \in X$, and moreover if $R \ge \mathrm{OPT}$ then $|S| \le k$.
Proof. The first property is easy to see since we only remove a point $x$ from $X'$ if we add a center $s$ to $S$ such that $d(x, s) \le 2R$. Let $s_1, s_2, \ldots, s_h$ be the centers chosen by the algorithm. We observe that $d(s_i, s_j) > 2R$ for $i \ne j$. Thus, if the algorithm outputs more than $k$ points, then the pairwise distance between any $k+1$ of them is more than $2R$, and by Lemma 9.1 the optimum radius is more than $R$. Hence, if the guess satisfies $R \ge \mathrm{OPT}$, the algorithm outputs at most $k$ centers.
The guessing of can be implemented by binary search in various ways. We omit these routine details.
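One routine way to implement the guessing: the optimum radius is always one of the pairwise distances, so we can search over the sorted list of candidate distances. The Python sketch below (the instance and distance function are illustrative) returns a center set whose covering radius is at most twice the smallest feasible guess found by the search:

```python
import itertools, math

def hs_cover(points, R, dist):
    """One pass of the Hochbaum-Shmoys procedure for a guessed radius R:
    repeatedly pick an uncovered point and remove everything within 2R."""
    centers, uncovered = [], list(points)
    while uncovered:
        c = uncovered[0]          # arbitrary uncovered point
        centers.append(c)
        uncovered = [p for p in uncovered if dist(p, c) > 2 * R]
    return centers

def hs_k_center(points, k, dist):
    """Search the guess R over the pairwise distances; any guess that
    yields at most k centers gives a solution of radius at most 2R."""
    cand = sorted({dist(a, b) for a, b in itertools.combinations(points, 2)})
    lo, hi = 0, len(cand) - 1
    best = hs_cover(points, cand[-1], dist)   # largest guess: always feasible
    while lo <= hi:
        mid = (lo + hi) // 2
        S = hs_cover(points, cand[mid], dist)
        if len(S) <= k:
            best, hi = S, mid - 1             # feasible: try a smaller guess
        else:
            lo = mid + 1
    return best

# toy instance: two well-separated pairs on a line
pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
S = hs_k_center(pts, 2, euclid)
```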
Exercise 9.2. Describe an example where the algorithm uses exactly $k$ centers even with a guess $R$ substantially larger than $\mathrm{OPT}$. Describe an example where the algorithm outputs fewer than $k$ centers with a guess of $R < \mathrm{OPT}$.
9.1.3 Related Problems and Discussion
The $k$-Center problem is natural in geometric settings, and one can see from the proofs that the 2-approximation for the two algorithms holds even when allowing the centers to be chosen from outside the input point set. A surprising result of Feder and Greene [1] shows that even in two dimensions (the Euclidean plane) one cannot improve the factor of 2 unless $P = NP$.
The $k$-Supplier problem is closely related to $k$-Center and is motivated by facility location considerations. Here we are given sets $F$ and $D$ which are points in a metric space. We need to choose $k$ centers $S \subseteq F$ to minimize $\max_{j \in D} d(j, S)$. Note that we don't have to cover the facilities. Hochbaum and Shmoys gave a variant of their algorithm that obtains a 3-approximation for $k$-Supplier and moreover showed that one cannot improve 3 unless $P = NP$ [2]. Interestingly, in Euclidean spaces 3 is not tight [3]. Several generalizations of $k$-Center which constrain the choice of centers have been considered - see [4] for a general framework that also considers outliers.
One can consider the weighted version of $k$-Center, relabeled as the priority version in subsequent work. We refer to the work of Plesnik [5] and a recent paper [6] on this variant, which has found applications in fair clustering.
We finally mention that $k$-Center clustering has nice connections to the notion of $r$-nets in metric spaces. Given a set of points $X$ in a metric space and a radius $r$, an $r$-net is a set of centers $Y \subseteq X$ such that (i) for all $x \in X$ we have $d(x, Y) \le r$ (that is, the points are covered by balls of radius $r$ around the centers) and (ii) for any distinct $y, y' \in Y$ the balls $B(y, r/2)$ and $B(y', r/2)$ are disjoint (the packing property, or the property that no two centers are too close). $r$-nets provide a concise summary of distances in a metric space at scale $r$. One can use $r$-nets to obtain nearest-neighbor data structures and other applications, especially in low-dimensional settings. We refer the reader to [7],[8].
LP Relaxation: The two $k$-Center algorithms we described are combinatorial. One can also consider an LP relaxation. Since it is a bottleneck problem, writing an LP relaxation involves issues similar to what we saw with unrelated machine scheduling. We can write an LP to check whether a guessed radius $R$ is feasible and then find the smallest $R$ for which it is feasible. The feasibility LP can be written as follows. Let $z_x$ be an indicator for whether we open a center at point $x$.
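For a guessed radius $R$ the feasibility LP can be written as follows (a reconstruction in standard form, writing $z_x$ for the indicator of opening a center at $x$):

```latex
\begin{align*}
\sum_{y \in B(x, R)} z_y &\ge 1 && \forall x \in X\\
\sum_{x \in X} z_x &\le k\\
z_x &\ge 0 && \forall x \in X
\end{align*}
```

The first set of constraints requires each point to be fractionally covered within radius $R$; the second limits the total number of centers to $k$.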
Exercise 9.3. Prove that if the preceding LP is feasible for a radius $R$ then one can obtain a solution with $k$ centers with max radius $2R$.
Exercise 9.4. Generalize the LP for the $k$-Supplier problem and prove that one can obtain a 3-approximation with respect to the lower bound provided by the LP approach.
9.2 Uncapacitated Facility Location
We now discuss UCFL. One can obtain a constant factor approximation for UCFL via several techniques: LP rounding, primal-dual, local search and greedy. The best known approximation bound is 1.488 due to Li [9], while it is known that one cannot obtain a ratio better than 1.463 [10]. We will not describe the more complicated algorithms and instead focus on the high-level approaches that yield some constant factor approximation.
It is common to use $i$ to denote a facility in $F$ and $j$ to denote a demand/client in $D$.
9.2.1 LP Rounding
The first constant factor approximation for UCFL was via LP rounding by Aardal, Shmoys and Tardos, using a filtering technique of Lin and Vitter. We start with the LP relaxation. We use a variable $y_i$ for each $i \in F$ to indicate whether $i$ is opened or not. We use a variable $x_{ij}$ to indicate whether $j$ is connected (or assigned) to $i$. One set of constraints is natural here: each client has to be assigned/connected to a facility. The other constraint requires that $j$ is assigned to $i$ only if $i$ is open.
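In full, the relaxation reads (a standard reconstruction with the variables $y_i$ and $x_{ij}$ just described):

```latex
\begin{align*}
\min\;\; & \sum_{i \in F} f_i\, y_i + \sum_{i \in F}\sum_{j \in D} d(i,j)\, x_{ij}\\
\text{s.t.}\;\; & \sum_{i \in F} x_{ij} \ge 1 && \forall j \in D\\
& x_{ij} \le y_i && \forall i \in F,\ j \in D\\
& x_{ij},\, y_i \ge 0 && \forall i \in F,\ j \in D
\end{align*}
```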
Given a feasible solution to the LP the question is how to round. We note that the LP relaxation does not "know" whether $d$ is a metric or not. In fact when $d$ is arbitrary (but non-negative) we obtain the non-metric facility location problem, which is as hard as the Set Cover problem but not much harder - one can obtain an $O(\log n)$-approximation. However, we can obtain a constant factor when $d$ is a metric, and the rounding will exploit this property.
Given the fractional solution $(x, y)$, for each $j$ we define $C_j = \sum_{i \in F} d(i,j)\, x_{ij}$ to be the distance cost paid for $j$ by the LP. Note that the LP cost is $\sum_{i \in F} f_i y_i + \sum_{j \in D} C_j$.
Lemma 9.2. For each $j$ there is a total fractional assignment of at least $1/2$ in the ball $B(j, 2C_j)$. That is, $\sum_{i \in B(j, 2C_j)} x_{ij} \ge 1/2$. In particular $\sum_{i \in B(j, 2C_j)} y_i \ge 1/2$.
Proof. This essentially follows from Markov's inequality or averaging. Note that $C_j = \sum_i d(i,j) x_{ij}$ and $\sum_i x_{ij} = 1$. Suppose $\sum_{i \in B(j, 2C_j)} x_{ij} < 1/2$. Since $d(i,j) > 2C_j$ for all $i \notin B(j, 2C_j)$, we would have $C_j \ge \sum_{i \notin B(j, 2C_j)} d(i,j) x_{ij} > 2C_j \cdot \frac{1}{2} = C_j$, which is impossible. The second statement follows since $x_{ij} \le y_i$.
We say that two clients $j$ and $j'$ intersect if there is some facility $i$ such that $i \in B(j, 2C_j) \cap B(j', 2C_{j'})$. The rounding algorithm is described below.
1. Solve the LP and let $(x, y)$ be an optimum fractional solution
2. For each $j$ compute $C_j$ and the ball $B(j, 2C_j)$
3. Renumber clients such that $C_1 \le C_2 \le \ldots$
4. For each client $j$ in this order: if $j$ is not yet assigned, open the cheapest facility $i$ in $B(j, 2C_j)$, assign $j$ to $i$, and assign to $i$ every unassigned client $j'$ that intersects $j$
5. Output the list of open facilities and the client assignment
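The rounding steps above (given a fractional solution) can be sketched in Python; solving the LP itself is omitted, and the input format, dictionaries indexed by facilities and clients, is an illustrative assumption:

```python
def round_ucfl(F, D, f, d, x, y):
    """Filtering-based rounding sketch. x[i][j], y[i] form a feasible
    fractional LP solution; f[i] is the opening cost of facility i and
    d[i][j] the metric distance between facility i and client j."""
    # LP distance cost C_j for each client
    C = {j: sum(d[i][j] * x[i][j] for i in F) for j in D}
    # ball of facilities within radius 2*C_j (nonempty by Lemma 9.2)
    ball = {j: [i for i in F if d[i][j] <= 2 * C[j]] for j in D}
    assign, opened = {}, set()
    for j in sorted(D, key=lambda j: C[j]):      # smallest C_j first
        if j in assign:
            continue
        i_star = min(ball[j], key=lambda i: f[i])  # cheapest facility in ball
        opened.add(i_star)
        assign[j] = i_star
        for j2 in D:                              # clients intersecting j
            if j2 not in assign and set(ball[j]) & set(ball[j2]):
                assign[j2] = i_star
    return opened, assign

# toy instance with a trivially integral fractional solution
F = [0, 1]
D = ['a', 'b']
f = {0: 3, 1: 1}
d = {0: {'a': 1, 'b': 1}, 1: {'a': 5, 'b': 5}}
x = {0: {'a': 1, 'b': 1}, 1: {'a': 0, 'b': 0}}
y = {0: 1, 1: 0}
opened, assign = round_ucfl(F, D, f, d, x, y)
```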
It is not hard to see that every client is assigned to an open facility. The main issue is to bound the total cost. Let $F\text{-cost}$ be the total facility opening cost, and let $C\text{-cost}$ be the total connection cost. We will bound these separately.
Lemma 9.3. $F\text{-cost} \le 2 \sum_{i \in F} f_i y_i$.
Proof. Note that a client opens a new facility only if it has not been assigned when it is considered by the algorithm. Let $j_1, j_2, \ldots$ be the clients that open facilities. Let $B_{j_\ell}$ be the set of facilities in the ball of radius $2C_{j_\ell}$ around $j_\ell$. From the algorithm and the definition of intersection of clients, we see that the sets $B_{j_1}, B_{j_2}, \ldots$ are pairwise disjoint. The algorithm opens the cheapest facility in $B_{j_\ell}$ for $j_\ell$. Note that $\sum_{i \in B_{j_\ell}} y_i \ge 1/2$ by Lemma 9.2. Hence the cost of the cheapest facility in $B_{j_\ell}$ is at most $2 \sum_{i \in B_{j_\ell}} f_i y_i$ (why?). By the disjointness of the sets we see that the total cost of the facilities opened is at most $2 \sum_{i \in F} f_i y_i$.
Lemma 9.4. $C\text{-cost} \le 6 \sum_{j \in D} C_j$.
Proof. Consider a client $j$ that opens a facility $i$ in $B(j, 2C_j)$. In this case $j$ is assigned to $i$ and $d(i,j) \le 2C_j$. Now consider a client $j$ that does not open a facility. This implies that there was a client $j'$ with $C_{j'} \le C_j$ that opened a facility $i'$, such that $j$ and $j'$ intersect, and $j$ was assigned to $i'$. What is $d(j, i')$? We claim that $d(j, i') \le 6C_j$. To see this we note that $j$ and $j'$ intersect because there is some facility $i \in B(j, 2C_j) \cap B(j', 2C_{j'})$. By considering the path $j \to i \to j' \to i'$, via the triangle inequality, $d(j, i') \le 2C_j + 2C_{j'} + 2C_{j'} \le 6C_j$ since $C_{j'} \le C_j$. Thus the distance traveled by each client to its assigned facility is at most $6C_j$. The lemma follows.
The two preceding lemmas give us the following which implies that the algorithm yields a 6-approximation.
Theorem 9.3. The total cost of the solution is at most $2\sum_{i} f_i y_i + 6\sum_{j} C_j \le 6\,\mathrm{OPT}_{LP}$.
It should be clear to the reader that the algorithm and analysis are not optimized for the approximation ratio. The goal here was simply to outline the basic scheme that led to the first constant factor approximation.
9.2.2 Primal-Dual
Jain and Vazirani developed an elegant and influential primal-dual algorithm for UCFL [11]. It was influential since it led to new algorithms for $k$-Median and several generalizations of UCFL in a clean way. Moreover the primal-dual algorithm is simple and efficient to implement. On the other hand we should mention that one advantage of having an LP solution is that it gives an explicit lower bound on the optimum value, while a primal-dual method yields a lower bound via a feasible dual which may not be optimal. We need some background to describe the primal-dual method in approximation.
Complementary slackness: To understand primal-dual we need some basic background on complementary slackness. Suppose we have a primal LP $P$ of the form $\min c^T x$ s.t. $Ax \ge b$, $x \ge 0$, which we intentionally wrote in this standard form as a covering LP. Its dual $D$ is a packing LP: $\max b^T y$ s.t. $A^T y \le c$, $y \ge 0$. We will assume that both primal and dual are feasible and hence the optimum values are finite, and via strong duality we know that the optimum values are the same.
Definition 9.4. A feasible solution $x$ to $P$ and a feasible solution $y$ to $D$ satisfy the primal complementary slackness condition with respect to each other if the following is true: for each $j$, either $x_j = 0$ or the corresponding dual constraint is tight, that is $\sum_i A_{ij} y_i = c_j$. They satisfy the dual complementary slackness condition if the following is true: for each $i$, either $y_i = 0$ or the corresponding primal constraint is tight, that is $\sum_j A_{ij} x_j = b_i$.
One of the consequences of the duality theory of LP is the following.
Theorem 9.5. Suppose $P$ and $D$ are a primal and dual pair of LPs that both have finite optima. A feasible solution $x$ to $P$ and a feasible solution $y$ to $D$ satisfy the primal and dual complementary slackness properties with respect to each other if and only if they are both optimum solutions to the respective LPs.
We illustrate the use of complementary slackness in the context of approximation via Vertex Cover. Recall the primal covering LP and we also write the dual.
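For reference, the covering LP for Vertex Cover and its dual, in the standard forms used here (a reconstruction):

```latex
\begin{align*}
\text{(P)}\qquad \min \sum_{v \in V} w_v x_v \quad & \text{s.t.} \quad x_u + x_v \ge 1 \;\; \forall uv \in E, \qquad x \ge 0\\
\text{(D)}\qquad \max \sum_{e \in E} y_e \quad & \text{s.t.} \quad \sum_{e \ni v} y_e \le w_v \;\; \forall v \in V, \qquad y \ge 0
\end{align*}
```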
Recall that we described a simple rounding algorithm. Given a feasible primal solution $x$, we let $S = \{v : x_v \ge 1/2\}$ and showed that (i) $S$ is a vertex cover for the given graph and (ii) $w(S) \le 2 \sum_v w_v x_v$. Now suppose we have an optimum solution to the primal rather than an arbitrary feasible solution. We can prove an interesting and stronger statement via complementary slackness.
Lemma 9.5. Let $x$ be an optimum solution to the primal covering LP. Then $S = \{v : x_v > 0\}$ is a feasible vertex cover for the graph and moreover $w(S) \le 2\,\mathrm{OPT}_{LP}$.
Proof. It is easy to see that $S$ is a vertex cover via the same argument that we have seen before. How do we bound the cost, now that we may be rounding $x_v$ to 1 even though $x_v$ may be tiny? Let $y$ be any optimum solution to the dual; one exists (why?). Via strong duality we have $\sum_e y_e = \sum_v w_v x_v$. Via primal complementary slackness we have the property that if $x_v > 0$ then $\sum_{e \ni v} y_e = w_v$. Hence
$$w(S) = \sum_{v \in S} w_v = \sum_{v \in S} \sum_{e \ni v} y_e.$$
Interchanging the order of summation we obtain that
$$w(S) = \sum_{e} |e \cap S|\, y_e \le 2 \sum_e y_e = 2\,\mathrm{OPT}_{LP},$$
where we use the fact that an edge has only two end points in the inequality.
Primal-dual for approximating Vertex Cover: We will first study a primal-dual approximation algorithm for the Vertex Cover problem - this algorithm, due to Bar-Yehuda and Even [12], can perhaps be considered the first primal-dual algorithm in approximation. Primal-dual is a classical method in optimization and is applicable in both continuous and discrete settings. The basic idea, in the context of LPs (the method applies more generally), is to obtain a solution to a primal LP and a solution to the dual LP together and certify the optimality of each solution via complementary slackness. It is beyond the scope of these notes to give a proper treatment. One could argue that understanding the method in the setting of approximation is easier than in the classical setting where one seeks exact algorithms, since our goals are more modest.
Typically one starts with one of the primal and dual solutions being feasible and the other infeasible, and evolves them over time. In discrete optimization, this method is successfully applied to LPs that are known to define integral polytopes. Examples include shortest paths, matchings, and others. In approximation, the LP relaxations are typically not integral. In such a setting the goal is to produce a primal and dual pair $(x, y)$ where $x$ is an integral feasible solution and $y$ is a fractional feasible solution. The goal is to approximately bound the value of $x$ via the dual $y$, and for this purpose we will enforce only the primal complementary slackness condition for the pair. To make the algorithm and analysis manageable, the primal-dual algorithm is typically done in a simple but clever fashion - there have been several surprisingly strong and powerful approximation results via this approach.
We illustrate this in the context of Vertex Cover first. It is useful to interpret the dual as we did already in the context of the dual-fitting technique for Set Cover. We think of the edges as agents that wish to be covered by the vertices at minimum cost. The dual variable $y_e$ can be thought of as the amount that edge $e$ is willing to pay to be covered. The dual packing constraint says that for any vertex $v$, the total payment of all edges incident to $v$ cannot exceed its weight $w_v$. This can be understood game theoretically as follows: the set of edges incident to $v$ can together pay $w_v$ and get covered, and hence we cannot charge them more. The dual objective is to maximize the total payment that can be extracted from the edges subject to these natural and simple constraints. With this interpretation in mind we wish to produce a feasible dual (payments) and a corresponding feasible integral primal (vertex cover). The basic scheme is to start with an infeasible primal and a feasible dual and increase the dual while maintaining feasibility; during the process we maintain primal complementary slackness, which means that if the dual constraint for a vertex $v$ becomes tight we set $x_v = 1$. Note that we are producing an integer primal solution in this process. How should we increase the $y_e$ values? We will do it in the naive fashion, which is to uniformly increase $y_e$ for all $e$ that are not already part of a tight constraint (and hence not covered yet).
1. $y \leftarrow 0$
2. $S \leftarrow \emptyset$
3. While there is an uncovered edge: uniformly raise $y_e$ for all uncovered edges $e$ until the dual constraint of some vertex becomes tight; add all newly tight vertices to $S$
4. Output the integer solution $S$
Remark 9.2. Note that when checking whether a vertex is tight we count the payments from all of its incident edges, even though some of them are no longer active.
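A closely related sequential variant, which raises one edge's dual as far as possible instead of raising all active duals uniformly, is especially simple to code and enjoys the same factor-2 guarantee by the same primal complementary slackness argument. A Python sketch with an illustrative weighted graph:

```python
def primal_dual_vertex_cover(w, edges):
    """Sequential variant of the Bar-Yehuda--Even primal-dual scheme:
    for each uncovered edge, raise its dual y_e until one endpoint's
    dual constraint becomes tight, then add all tight endpoints to the
    cover. (The text's version raises all active duals uniformly; both
    yield a 2-approximation.)"""
    slack = dict(w)           # remaining slack w_v minus incident payments
    y = {}
    cover = set()
    for (u, v) in edges:
        if u in cover or v in cover:
            continue          # edge already covered
        y[(u, v)] = min(slack[u], slack[v])   # raise y_e as far as possible
        slack[u] -= y[(u, v)]
        slack[v] -= y[(u, v)]
        if slack[u] == 0:
            cover.add(u)      # vertex u is tight: open it
        if slack[v] == 0:
            cover.add(v)      # vertex v is tight: open it
    return cover, y

# toy instance: a path a-b-c with vertex weights
w = {'a': 1, 'b': 2, 'c': 1}
edges = [('a', 'b'), ('b', 'c')]
cover, y = primal_dual_vertex_cover(w, edges)
```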
Lemma 9.6. At the end of the algorithm $S$ is a feasible vertex cover for the graph and $w(S) \le 2 \sum_e y_e \le 2\,\mathrm{OPT}_{LP}$.
Proof. By induction on the iterations one can prove that (i) $y$ remains dual feasible throughout, (ii) at the start of an iteration an edge is active iff it is not covered yet, and (iii) each iteration adds at least one more vertex to $S$, hence the algorithm terminates in at most $|V|$ iterations and outputs a feasible vertex cover. The main issue is the cost of $S$.
For this we note that the algorithm maintains primal complementary slackness. That is, either $x_v = 0$ or, if $v \in S$, then $\sum_{e \ni v} y_e = w_v$. Thus, we have
$$w(S) = \sum_{v \in S} \sum_{e \ni v} y_e \le 2 \sum_e y_e \le 2\,\mathrm{OPT}_{LP}.$$
We used the fact that an edge $e$ has at most two end points in the first inequality and the fact that $y$ is dual feasible in the second inequality. In terms of payment, what this says is that edge $e$'s payment of $y_e$ can be used to pay for opening both of its end points, while the dual pays $y_e$ only once.
As the reader can see, the algorithm is very simple to implement.
Exercise 9.5. Describe an example to show that the primal-dual algorithm's worst case performance is 2. Describe an example to show that the dual value constructed by the algorithm can be a constant factor smaller than the optimum dual value. Are these two parts the same?
Remark 9.3. The algorithm generalizes to give an $f$-approximation for Set Cover where $f$ is the maximum frequency of any element. There are examples showing that the performance of this algorithm in the worst case can indeed be a factor of $f$. We saw earlier that the integrality gap of the LP is at most $H_d$ where $d$ is the maximum set size. There is no contradiction here since the specific primal-dual algorithm that we developed need not achieve the tight integrality gap.
Primal-Dual for UCFL: Now we consider a primal-dual algorithm for UCFL. Recall the primal LP that we discussed previously. The dual has two types of variables. For each client $j$ there is a variable $\alpha_j$ corresponding to the primal constraint that each client needs to be connected to a facility. For each facility $i$ and client $j$ there is a variable $\beta_{ij}$ corresponding to the constraint that $x_{ij} \le y_i$.
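Written out, the dual is (a reconstruction with the variables just described):

```latex
\begin{align*}
\max\;\; & \sum_{j \in D} \alpha_j\\
\text{s.t.}\;\; & \sum_{j \in D} \beta_{ij} \le f_i && \forall i \in F\\
& \alpha_j - \beta_{ij} \le d(i,j) && \forall i \in F,\ j \in D\\
& \alpha_j,\, \beta_{ij} \ge 0 && \forall i \in F,\ j \in D
\end{align*}
```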
It is important to interpret the dual variables. There is a similarity to Set Cover since Non-Metric Facility Location is a generalization of Set Cover, and the LP formulation does not distinguish between metric and non-metric facility location - it is only in the rounding that we can take advantage of metric properties. The variable $\alpha_j$ can be interpreted as the total amount of payment client $j$ is willing to make. This comes in two parts - the payment to travel to a facility, which it cannot share with any other clients, and the payment it is willing to make to open a facility, which it can share with other clients. The variable $\beta_{ij}$ corresponds to the amount client $j$ is willing to pay to facility $i$ to open it. (In Set Cover there is no need to distinguish between $\alpha$ and $\beta$ since there are no distances, or they can be assumed to be 0 or $\infty$.) The first set of constraints in the dual say that for any facility $i$, the total payments $\sum_j \beta_{ij}$ from all the clients cannot exceed $f_i$. The second set of constraints specify that $\alpha_j - \beta_{ij}$ is at most $d(i,j)$. One way to understand this is that if $\alpha_j < d(i,j)$ then client $j$ will not even be able to reach $i$ and hence will not contribute to opening $i$. Via this interpretation it is convenient to assume that $\beta_{ij} = \max(0, \alpha_j - d(i,j))$.
The primal-dual algorithm for UCFL will have a growth stage that is similar to what we saw for Vertex Cover. We increase $\alpha_j$ for each "uncovered" client $j$ uniformly. A facility $i$ will receive payment $\beta_{ij} = \max(0, \alpha_j - d(i,j))$ from $j$. To maintain dual feasibility, as soon as the constraint for facility $i$ becomes tight we open facility $i$ (in the primal we set $y_i = 1$); note that any client $j$ such that $\alpha_j \ge d(i,j)$ cannot increase its $\alpha_j$ any further and hence will stop growing, since it is connected to an open facility. This process is very similar to that in Set Cover. The main issue is that we would get a weak approximation factor, since a client can contribute payments to a large number of facilities. Note that the process so far has not taken advantage of the fact that we have a metric facility location problem. Therefore, in the second phase we will close some facilities, which means that a client may need to get connected to a facility that it did not contribute to - however we will use the metric property to show that a client does not need to travel too far to reach a facility that remains open.
With the above discussion in place, we describe the two phase primal-dual algorithm below. The algorithm also creates a bipartite graph $G$ with vertex set $F \cup D$; initially it has no edges.
1. $\alpha \leftarrow 0$, $\beta \leftarrow 0$
2. $U \leftarrow D$ (the unconnected clients), $T \leftarrow \emptyset$ (the temporarily open facilities)
3. $E(G) \leftarrow \emptyset$
4. While $U \ne \emptyset$: raise $\alpha_j$ uniformly for all $j \in U$ (maintaining $\beta_{ij} = \max(0, \alpha_j - d(i,j))$) until one of two events happens. (a) Some facility $i \notin T$ becomes tight, that is $\sum_j \beta_{ij} = f_i$: add $i$ to $T$ and, for each $j \in U$ with $\alpha_j \ge d(i,j)$, remove $j$ from $U$ and add the edge $(i,j)$ to $G$. (b) $\alpha_j = d(i,j)$ for some $i \in T$ and $j \in U$: remove $j$ from $U$ and add the edge $(i,j)$ to $G$.
5. Create the graph $H$ on vertex set $T$ where $i, i' \in T$ are adjacent iff some client has special edges to both $i$ and $i'$
6. Compute a maximal independent set $T'$ in $H$
7. Close the facilities in $T \setminus T'$
8. Assign each client to its nearest facility in $T'$
Example: The example in Fig 9.1 illustrates the need for the pruning phase. There are two groups of clients and a set of facilities; all facilities have the same opening cost except for the first one, which has a smaller opening cost. The first group of clients, shown at the top of the figure, have a connection cost of 1 to each facility. Each client in the second group has a small connection cost to its own facility and a larger connection cost to the others; the rest of the distances are induced by these. One can verify that in the growth phase all the facilities will be opened. However, the total dual value after the growth phase is much smaller than the total cost of the opened facilities, and hence pruning is necessary to obtain a good approximation.
Figure 9.1: Example to illustrate the need for the pruning phase.
We will do a high-level analysis of the algorithm skipping a few fine details. A formal treatment with full details can be found in [13].
The algorithm maintains the property that the $\alpha$ and $\beta$ variables are dual feasible. Consider the graph $G$ created by the algorithm. We call an edge $(i,j)$ the witness edge for $j$ if it is added when $j$ is removed from $U$ (this happens exactly once for each $j$). We call an edge $(i,j)$ special if $\beta_{ij} > 0$, which means that $j$ paid to temporarily open $i$. We remark that a client $j$ may have a special edge to $i$ even though $(i,j)$ is not its witness edge - this can happen if $i$ is temporarily opened after $j$ was already removed from $U$ (due to other clients contributing to opening $i$ later). One can associate a notion of time with the progression of the algorithm, since the dual variables are monotonically increased. This can be used to order events.
A basic property maintained by the algorithm is the following.
Claim 9.2.1. If $(i,j)$ is an edge in $G$ then $\alpha_j \ge d(i,j)$.
Proof. The algorithm adds an edge $(i,j)$ only if it is the witness edge for $j$ or if $\beta_{ij} > 0$ (in which case $(i,j)$ is special). The algorithm maintains the invariant $\beta_{ij} = \max(0, \alpha_j - d(i,j))$, and hence if $\beta_{ij} > 0$ the claim is clear. If $\beta_{ij} = 0$ then $(i,j)$ is the witness edge for $j$, which is added when $j$ is removed from $U$, and in this case $\alpha_j \ge d(i,j)$.
Analysis: We upper bound the cost of opening the facilities in $T'$ and the connection cost of the clients.
We leave the proof of the following lemma as an exercise.
Lemma 9.7. For each facility $i \in T$ we have $\sum_{j \in D} \beta_{ij} = f_i$.
Since a client can pay for multiple facilities, which we cannot afford, the pruning phase removes facilities such that no client is connected to two facilities in $T'$ with special edges (otherwise those two facilities would have an edge in $H$). We say that a client $j$ is directly connected to a facility $i \in T'$ if $(i,j)$ is a special edge. We call all such clients directly connected clients and the rest of the clients are called indirectly connected. We let $D'$ be the set of all directly connected clients and let $I$ be the set of all indirectly connected clients.
For $i \in T'$ let $D_i \subseteq D'$ be the clients directly connected to $i$. By the pruning rule we have the property that a client is directly connected to at most one facility in $T'$. We show that each facility in $T'$ can be paid for by its directly connected clients.
Lemma 9.8. For each $i \in T'$, $\sum_{j \in D_i} \beta_{ij} = f_i$.
Proof. This follows from Lemma 9.7 and the fact that if $i \in T'$ then every client $j$ with a special edge $(i,j)$ must be directly connected to $i$.
From Claim 9.2.1 we see that if $j$ is directly connected to $i$ then $\alpha_j = \beta_{ij} + d(i,j)$, and hence $\alpha_j$ can pay both the connection cost of $j$ to $i$ and the contribution of $j$ to opening $i$. What about indirectly connected clients? The next lemma bounds their connection cost.
Lemma 9.9. Suppose $j \in I$, that is, $j$ is an indirectly connected client. Let $i$ be its closest facility in $T'$; then $d(j, i) \le 3\alpha_j$.
Proof. Let $(i', j)$ be the witness edge for $j$. Note that $d(i', j) \le \alpha_j$. Since $j$ is an indirectly connected client there is no facility in $T'$ to which $j$ has a special edge. If $i' \in T'$ we are done, so assume $i'$ was closed in the pruning phase; then there must be a facility $i'' \in T'$ such that $(i', i'')$ is an edge in $H$ (otherwise $T'$ would not be a maximal independent set). Therefore there is some client $j'$ such that $(i', j')$ and $(i'', j')$ are both special edges. We claim that $\alpha_{j'} \le \alpha_j$. Assuming this claim we see, via the triangle inequality and Claim 9.2.1, that
$$d(j, i'') \le d(j, i') + d(i', j') + d(j', i'') \le \alpha_j + \alpha_{j'} + \alpha_{j'} \le 3\alpha_j.$$
Since $i$ is the nearest facility in $T'$ to $j$, it is within distance $3\alpha_j$.
We now prove the claim that $\alpha_{j'} \le \alpha_j$. Let $t$ be the time when $i'$ became temporarily open. Since $(i', j')$ is a special edge, $j'$ was contributing a positive payment to $i'$ and hence $\alpha_{j'}$ stopped growing at time $t$ at the latest: $\alpha_{j'} \le t$. On the other hand, the witness edge $(i', j)$ is added when $j$ is removed from $U$, which can happen only at time $t$ or later; hence $\alpha_j \ge t \ge \alpha_{j'}$. This finishes the proof of the claim.
With the preceding two lemmas in place we can bound the total cost of opening facilities in and connecting clients to them. We will provide a refined statement that turns out to be useful in some applications.
Theorem 9.6. Let $T'$ be the facilities opened by the algorithm and let $\sigma(j)$ denote the facility in $T'$ that client $j$ is assigned to. Then
$$\sum_{j \in D} d(j, \sigma(j)) + 3\sum_{i \in T'} f_i \le 3\sum_{j \in D} \alpha_j \le 3\,\mathrm{OPT}.$$
In particular the algorithm yields a 3-approximation.
Proof. Consider the directly connected clients $D'$. For $j \in D'$ we have $\alpha_j = \beta_{\sigma(j) j} + d(j, \sigma(j))$, where $\sigma(j)$ is the facility that $j$ is directly connected to. Via Lemma 9.8 and Claim 9.2.1,
$$\sum_{i \in T'} f_i + \sum_{j \in D'} d(j, \sigma(j)) = \sum_{j \in D'} \alpha_j.$$
For indirectly connected clients, via Lemma 9.9, we have $d(j, \sigma(j)) \le 3\alpha_j$. Thus
$$\sum_{j \in D} d(j, \sigma(j)) + 3\sum_{i \in T'} f_i \le 3\sum_{j \in D'} \alpha_j + \sum_{j \in I} 3\alpha_j = 3\sum_{j \in D} \alpha_j.$$
The algorithm is easy and efficient to implement. One of the main advantages of the stronger property that we saw in the theorem is that it leads to a nice algorithm for the $k$-Median problem; we refer the reader to Chapter 25 in [13:1] for a detailed description. In addition, the flexibility of the primal-dual algorithm has led to algorithms for several other variants; see [14] for one such example.
9.2.3 Local Search
Local search has been shown to be a very effective heuristic for facility location and clustering problems and there is an extensive literature on this topic. The first paper that proved constant factor approximation bounds for UCFL via local search is by Korupolu, Plaxton and Rajaraman [15] and it provided a useful template for many future papers. We refer the reader to Chapter 9 in [16] for a local search analysis for UCFL.
9.3 $k$-Median
Rounding the LP for $k$-Median is not as simple as it is for UCFL, mainly because one needs a global argument. To understand this, consider the following example. Suppose we have $k+1$ points where the distance between each pair is 1 (the uniform metric space). Then the optimum integer solution has cost 1, since we can place medians at $k$ of the points and the remaining point has to travel a distance of 1. Now consider the fractional solution that opens a facility to the extent of $k/(k+1)$ at each point. The cost of this fractional solution is also 1; however, each client now pays a fractional cost of only $1/(k+1)$. Any rounding algorithm will make at least one of the clients pay a cost of 1, which is much larger than its fractional cost; thus the analysis cannot be based on preserving the cost of each client within a constant factor of its fractional cost.
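The arithmetic of the uniform-metric example is easy to verify with exact fractions; this small sanity check assumes the symmetric fractional solution described above:

```python
from fractions import Fraction

def uniform_kmedian_costs(k):
    """Uniform-metric example: n = k+1 points at pairwise distance 1.
    The integral optimum opens k points, so one client travels 1
    (total cost 1). The symmetric fractional solution opens every
    point to extent k/(k+1); each client routes k/(k+1) of its demand
    to itself at distance 0 and 1/(k+1) elsewhere at distance 1, so
    each pays 1/(k+1) and the fractional total is also 1."""
    n = k + 1
    integral = 1                         # one point travels distance 1
    per_client = Fraction(1, n)          # fractional connection cost per client
    fractional = n * per_client          # sums back to 1
    return integral, fractional, per_client

integral, fractional, per_client = uniform_kmedian_costs(4)
```

The gap between a client's integral cost (1 for some client) and its fractional cost ($1/(k+1)$) grows with $k$, which is exactly why a per-client charging argument cannot work.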
There are several LP rounding algorithms known. An advantage of the LP based approach is that it leads to a constant factor approximation for the Matroid Median problem, which is a nice and powerful generalization of the $k$-Median problem; here there is a matroid defined over the facilities and the constraint is that the set of facilities chosen must be independent in the given matroid. One can write a natural LP relaxation for this problem and prove that the LP has a constant integrality gap by appealing to matroid intersection! It showcases the power of classical results in combinatorial optimization. We refer the reader to [18],[19].
9.3.1 Local Search
Local search for center based clustering problems is perhaps one of the most natural heuristics. In particular we will consider the $p$-swap heuristic. Given the current set of centers $S$, the $p$-swap heuristic will consider swapping out up to $p$ centers from $S$ for $p$ new centers. It is easy to see that each iteration of this local search algorithm can be implemented in polynomial time for fixed $p$. When $p = 1$ we simply refer to the algorithm as (basic) local search. We will ignore the convergence time. As we saw for Max Cut, one can use standard tricks to make the algorithm run in polynomial time with a loss of a $(1+\epsilon)$-factor in the approximation bound guaranteed by the worst-case local optimum. Thus the main focus will be on the quality of the local optimum. The following is a very nice result.
Theorem 9.7 (Arya et al. [20]). For any fixed $p \ge 1$ the $p$-swap local search heuristic has a tight worst-case approximation ratio of $(3 + 2/p)$ for $k$-Median. In particular the basic local search algorithm ($p = 1$) yields a $5$-approximation.
Here we give a proof/analysis of the preceding theorem for $p = 1$, following the simplified analysis presented in [21]. See also [16:1] and the notes from CMU. Given any set of $k$ centers $S$ we define $\mathrm{cost}(S)$ to be the sum of the distances of the clients to their nearest centers in $S$. Let $S$ be a local optimum and let $O$ be some fixed optimum solution to the given $k$-Median instance. To compare $\mathrm{cost}(S)$ with $\mathrm{cost}(O)$ the key idea is to set up a clever set of potential swaps between the centers in $S$ and centers in $O$. That is, we consider a swap pair $(s, o)$ where $s \in S$ and $o \in O$. Since $S$ is a local optimum it must be the case that $\mathrm{cost}(S - s + o) \ge \mathrm{cost}(S)$. The analysis upper bounds the potential increase in the cost in some interesting fashion and sums up the resulting series of inequalities. This may seem magical, and indeed it is not obvious why the analysis proceeds in this fashion. The short answer is that the analysis ideas required a series of developments with the somewhat easier case of UCFL coming first.
We set up some notation. Let $\sigma: P \to S$ be the mapping of clients to facilities in $S$ based on nearest distance. Similarly let $\sigma^*: P \to O$ be the mapping to facilities in the optimum solution $O$. Thus client $j$ connects to facility $\sigma(j)$ in the local optimum and to $\sigma^*(j)$ in the optimum solution. We also let $N(s)$ denote the set of all clients assigned to a facility $s \in S$ and let $N^*(o)$ denote the set of all clients assigned to a facility $o \in O$. Let $A_j = d(j, \sigma(j))$ and $O_j = d(j, \sigma^*(j))$ be the cost paid by $j$ in the local optimum and optimum solutions respectively. To reinforce the notation we express $\mathrm{cost}(S)$ as follows:
$$\mathrm{cost}(S) = \sum_j A_j = \sum_{s \in S} \sum_{j \in N(s)} d(j, s).$$
Similarly
$$\mathrm{cost}(O) = \sum_j O_j = \sum_{o \in O} \sum_{j \in N^*(o)} d(j, o).$$
Setting up the swap pairs: We create a set of pairs $\mathcal{T} \subseteq S \times O$ as follows. We will assume without loss of generality that $S \cap O = \emptyset$. For technical convenience we also assume $|S| = |O| = k$; we can always create dummy centers that are co-located to make this assumption. Consider the mapping $\eta: O \to S$ where each $o \in O$ is mapped to its closest center in $S$; hence $\eta(o)$ for $o \in O$ is the closest center in $S$ to it. Let $S_1$ be the set of centers in $S$ that have exactly one center in $O$ mapped to them. Let $S_0$ be the set of centers in $S$ with no centers of $O$ mapped to them. This means that each $s \in S_{\ge 2} = S \setminus (S_0 \cup S_1)$ has two or more centers mapped to it; let $O_s = \eta^{-1}(s)$ be the centers mapped to $s$. See Figure 9.2. By a simple averaging argument we have the following claim.
Claim 9.3.1. $|S_0| \ge |S_{\ge 2}|$.
We create the set of pairs $\mathcal{T}$ as follows. There will be exactly $k$ pairs. For each $o \in O$ with $\eta(o) \in S_1$ we add the pair $(s, o)$ where $s = \eta(o)$. For each remaining $o \in O$ we add a pair $(s, o)$ where $s$ is an arbitrary center from $S_0$; however we make sure that a center $s \in S_0$ is in at most two pairs in $\mathcal{T}$. This is possible because of Claim 9.3.1: the number of remaining centers of $O$ is $k - |S_1| = |S_0| + |S_{\ge 2}| \le 2|S_0|$.
The pairs satisfy the following property.
Claim 9.3.2. If $(s, o) \in \mathcal{T}$ then $\eta(o') \ne s$ for any facility $o' \in O \setminus \{o\}$.
The intuition for the pairs is as follows. If $\eta(o) = s \in S_1$ then we are essentially forced to consider the pair $(s, o)$ since $s$ could be the only center near $o$, with all other centers from $S$ very far away. When considering the swap we can move the clients connecting to $s$ to $o$. On the other hand, if $s \in S_{\ge 2}$ then $s$ is close to several centers in $O$ and may be serving many clients. Thus we do not consider such an $s$ in the swap pairs.
Figure 9.2: Mapping $\eta$ between $O$ and $S$ with each $o \in O$ mapped to its closest center in $S$.
The main technical claim about the swap pairs is the following.
Lemma 9.10. Let $(s, o)$ be a pair in $\mathcal{T}$. Then
$$0 \le \mathrm{cost}(S - s + o) - \mathrm{cost}(S) \le \sum_{j \in N^*(o)} (O_j - A_j) + 2 \sum_{j \in N(s)} O_j.$$
We defer the proof of the lemma for now and use it to show that $\mathrm{cost}(S) \le 5\,\mathrm{cost}(O)$. We sum the inequality in the lemma over all pairs in $\mathcal{T}$ and note that each $o \in O$ occurs in exactly one pair and each $s \in S$ occurs in at most two pairs. Note that $O_j - A_j$ can be a negative number while $O_j$ is non-negative. We obtain
$$0 \le \sum_{(s,o) \in \mathcal{T}} \Bigl( \sum_{j \in N^*(o)} (O_j - A_j) + 2 \sum_{j \in N(s)} O_j \Bigr) \le \bigl(\mathrm{cost}(O) - \mathrm{cost}(S)\bigr) + 4\,\mathrm{cost}(O).$$
This implies the desired inequality $\mathrm{cost}(S) \le 5\,\mathrm{cost}(O)$.
Figure 9.3: Two cases in proof of Lemma 9.10 for the swap pair $(s, o)$. In the figure on the left the client $j \in N^*(o)$ is assigned to $o$. In the figure on the right the client $j \in N(s) \setminus N^*(o)$ is assigned to $\eta(\sigma^*(j))$.
Proof of Lemma 9.10. Since $S$ is a local optimum, swapping $s$ with $o$ cannot improve the cost and hence we obtain $\mathrm{cost}(S - s + o) - \mathrm{cost}(S) \ge 0$. We focus on the more interesting upper bound. To bound the increase in cost from removing $s$ and adding $o$ to $S$, we consider a specific assignment of the clients. Any client $j \in N^*(o)$ is assigned to $o$ (even if it is suboptimal to do so). See Figure 9.3. For such a client the change in cost is $d(j, o) - d(j, \sigma(j)) = O_j - A_j$. Now consider any client $j \in N(s) \setminus N^*(o)$; since $s$ is no longer available we need to assign it to another facility. Which one? Let $o' = \sigma^*(j)$ be the facility that $j$ is assigned to in the optimum solution. Note that $o' \ne o$. We assign $j$ to $s' = \eta(o')$; from Claim 9.3.2, $s' \ne s$ and hence $s' \in S - s + o$. See Figure 9.3. The change in the cost for such a client is $d(j, s') - A_j$. We bound it as follows:
$$d(j, s') \le d(j, o') + d(o', s') \le d(j, o') + d(o', \sigma(j)) \le d(j, o') + d(o', j) + d(j, \sigma(j)) = 2 O_j + A_j,$$
where the second inequality uses the fact that $s' = \eta(o')$ is the closest center in $S$ to $o'$. Hence the change in cost for such a client is at most $2 O_j$.
Every other client is assigned to its existing center in $S$. Thus the total increase in the cost is bounded as
$$\mathrm{cost}(S - s + o) - \mathrm{cost}(S) \le \sum_{j \in N^*(o)} (O_j - A_j) + \sum_{j \in N(s) \setminus N^*(o)} 2 O_j \le \sum_{j \in N^*(o)} (O_j - A_j) + 2 \sum_{j \in N(s)} O_j.$$
See [20:1] for a description of the tight example. The example in the conference version of the paper is buggy.
9.4 $k$-Means
The $k$-Means problem is very popular in practice for a variety of reasons. In terms of center-based clustering the goal is to choose $k$ centers $S$ to minimize $\sum_{p \in P} d(p, S)^2$. In the discrete setting one can obtain constant factor approximation algorithms via several techniques that follow the approach of $k$-Median. We note that the squared distance does not satisfy the triangle inequality; however, it satisfies a relaxed triangle inequality and this is sufficient to generalize certain LP rounding and local search techniques.
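The relaxed triangle inequality for squared distances is worth recording explicitly; it follows from the elementary inequality $(x+y)^2 \le 2x^2 + 2y^2$ applied to the (ordinary) triangle inequality:

```latex
d(a,c)^2 \;\le\; \bigl(d(a,b) + d(b,c)\bigr)^2 \;\le\; 2\,d(a,b)^2 + 2\,d(b,c)^2 .
```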
In practice the continuous version is popular for clustering applications. The input points are in Euclidean space $\mathbb{R}^d$ where $d$ is typically large. Let $P = \{p_1, \ldots, p_n\}$ be the set of input points where each $p_i$ is now a $d$-dimensional vector. The centers are now allowed to be anywhere in the ambient space $\mathbb{R}^d$. This is called Euclidean $k$-Means. Here the squared distance actually helps in a certain sense. For instance if $k = 1$ then we can see that the optimum center is simply obtained by $c = \frac{1}{n} \sum_i p_i$ - in other words we take the "average". One can see this by considering the problem of finding the center as an optimization problem: $\min_{c \in \mathbb{R}^d} \sum_i \|p_i - c\|^2$. It can be seen that we can optimize in each dimension separately and that the optimum in dimension $t$ can be seen to be $c_t = \frac{1}{n} \sum_i p_{i,t}$. Surprisingly, hardness results for Euclidean $k$-Means were only established in the last decade. NP-hardness even when $d = 2$ is established in [22], and APX-hardness for high dimensions was shown in [23].
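As a quick sanity check of the claim that the coordinate-wise mean minimizes the sum of squared distances, the following sketch (with our own helper `sq_cost`) compares the mean against random perturbations of it:

```python
import random

def sq_cost(points, c):
    """Sum of squared Euclidean distances from the points to center c."""
    return sum(sum((p[t] - c[t]) ** 2 for t in range(len(c))) for p in points)

random.seed(0)
pts = [[random.random() for _ in range(3)] for _ in range(20)]  # toy data

# Candidate center: the coordinate-wise mean of the points.
mean = [sum(p[t] for p in pts) / len(pts) for t in range(3)]
best = sq_cost(pts, mean)

# The mean should beat any perturbed center.
for _ in range(100):
    perturbed = [m + random.uniform(-0.5, 0.5) for m in mean]
    assert sq_cost(pts, perturbed) >= best
```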
9.5 Lloyd's algorithm, $D^2$-sampling and $k$-Means++
Lloyd's algorithm is a very well-known and widely used heuristic for the Euclidean $k$-Means problem. It can be viewed as a local search algorithm with an alternating optimization flavor. The algorithm starts with a set of $k$ centers $c_1, \ldots, c_k$ which are typically chosen randomly in some fashion. The centers define clusters based on assigning each point to its nearest center. That is, $P_i$ is the set of all points in $P$ that are closest to $c_i$ (ties broken in some fashion). Once we have the clusters, via the observation above, one can find the best center for each cluster by taking the mean of the points in the cluster. That is, for each $i$ we find a new center $c_i' = \frac{1}{|P_i|} \sum_{p \in P_i} p$ (if $P_i$ is empty we simply set $c_i' = c_i$). Thus we have a new set of centers and we repeat the process until convergence or some time limit. It is clear that the cost can only improve by recomputing the centers since we know the optimum center for $P_i$ is obtained by using the average of the points.
1. Seeding: Pick $k$ initial centers $c_1, \ldots, c_k$ in some fashion
2. repeat
   (a) Assign each point $p \in P$ to its nearest center; this yields clusters $P_1, \ldots, P_k$
   (b) For each $i$, replace $c_i$ by the mean of the points in $P_i$ (keep $c_i$ if $P_i$ is empty)
3. Until cost improvement is too small
4. Output clusters $P_1, \ldots, P_k$
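The loop can be sketched in Python as below. This is our own minimal version (the names `lloyd` and `kmeans_cost` are not from the notes): seeding is uniformly random, and it runs a fixed number of iterations rather than testing relative cost improvement:

```python
import random

def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sum((p[t] - c[t]) ** 2 for t in range(len(p))) for c in centers)
               for p in points)

def lloyd(points, k, iters=100, seed=0):
    """Sketch of Lloyd's algorithm for Euclidean k-Means."""
    rng = random.Random(seed)
    dim = len(points[0])
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((p[t] - centers[i][t]) ** 2 for t in range(dim)))
            clusters[i].append(p)
        # Update step: recenter every nonempty cluster at its mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(p[t] for p in cl) / len(cl) for t in range(dim)]
    return centers
```

On four points forming two far-apart pairs, this either recovers the two pair-means or, with an unlucky seed, gets stuck in a bad local optimum of the kind discussed next.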
There are two issues with the algorithm. The first issue is that the algorithm can, in the worst-case, run for an exponential number of iterations. This issue is common for many local search algorithms and as we discussed, it can be overcome by stopping when cost improvement is too small in a relative sense. The second issue is the more significant one. The algorithm can get stuck in a local optimum which can be arbitrarily bad when compared to the optimum solution. See figure below for a simple example.
Figure 9.4: Example demonstrating that a local optimum for Lloyd's algorithm can be arbitrarily bad compared to the optimum clustering. The green clusters are the optimum ones and the red ones are the local optimum.
1. Pick $c_1$ uniformly at random from $P$ and set $S \leftarrow \{c_1\}$
2. for $i = 2$ to $k$: pick $c_i \in P$ with probability proportional to $d(c_i, S)^2$ and set $S \leftarrow S \cup \{c_i\}$
3. Output $S$
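The seeding step can be sketched as follows (the helper name `d2_seed` is ours; each new center is drawn with probability proportional to its squared distance to the nearest chosen center):

```python
import random

def d2_seed(points, k, rng=None):
    """Sketch of D^2-sampling (k-Means++ seeding): the first center is
    uniform, each later one is sampled proportionally to the squared
    distance to the nearest center chosen so far."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of each point to its nearest current center.
        w = [min(sum((p[t] - c[t]) ** 2 for t in range(len(p))) for c in centers)
             for p in points]
        r = rng.uniform(0, sum(w))
        acc = 0.0
        chosen = points[-1]            # fallback for floating-point round-off
        for p, wp in zip(points, w):
            acc += wp
            if acc >= r:
                chosen = p
                break
        centers.append(chosen)
    return centers
```

On two very far-apart points, the second draw is (essentially certainly) the point not yet chosen, since its sampling weight dominates.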
Theorem 9.8 ([24:1]). Let $S$ be the output of Lloyd's algorithm initialized with $D^2$-sampling. Then $\mathbb{E}[\mathrm{cost}(S)] = O(\log k) \cdot \mathrm{OPT}$. Moreover there are examples showing that it is no better than $\Omega(\log k)$-competitive.
The analysis establishes that the $D^2$-sampling seeding alone already yields an $O(\log k)$-approximation in expectation, so in a sense the local search is only refining the initial approximation. [26], [27] show that if one uses $O(k)$ centers, initialized according to $D^2$-sampling, then the local optimum will yield a constant factor approximation with constant probability; note that this is a bicriteria approximation where the number of centers is a constant factor more than $k$ and the cost is being compared with respect to the optimum cost with $k$ centers. The authors also show that there is a subset of $k$ centers from the output of the algorithm that yields a constant factor approximation. One can then run a discrete optimization algorithm using these centers. Another interesting result based on $D^2$-sampling ideas yields a PTAS, but the running time is exponential in $k$ and $1/\epsilon$ [28]. See [29] for a scalable version of $k$-Means++.
- Tomás Feder and Daniel Greene. “Optimal algorithms for approximate clustering”. In: Proceedings of the twentieth annual ACM symposium on Theory of computing. 1988, pp. 434–444. ↩︎
- Dorit S Hochbaum and David B Shmoys. “A unified approach to approximation algorithms for bottleneck problems”. In: Journal of the ACM (JACM) 33.3 (1986), pp. 533–550. ↩︎
- Viswanath Nagarajan, Baruch Schieber, and Hadas Shachnai. “The Euclidean k-supplier problem”. In: Mathematics of Operations Research 45.1 (2020), pp. 1–14. ↩︎
- Deeparnab Chakrabarty and Maryam Negahbani. “Generalized center problems with outliers”. In: ACM Transactions on Algorithms (TALG) 15.3 (2019), pp. 1–14. ↩︎
- Ján Plesník. “A heuristic for the p-center problems in graphs”. In: Discrete Applied Mathematics 17.3 (1987), pp. 263–268. ↩︎
- Tanvi Bajpai, Deeparnab Chakrabarty, Chandra Chekuri, and Maryam Negahbani. “Revisiting Priority k-Center: Fairness and Outliers”. In: arXiv preprint arXiv:2103.03337 (2021). ↩︎
- Sariel Har-Peled and Manor Mendel. “Fast construction of nets in low-dimensional metrics and their applications”. In: SIAM Journal on Computing 35.5 (2006), pp. 1148–1184. ↩︎
- Sariel Har-Peled and Benjamin Raichel. “Net and prune: A linear time algorithm for euclidean distance problems”. In: Journal of the ACM (JACM) 62.6 (2015), pp. 1–35. ↩︎
- Shi Li. “A 1.488 approximation algorithm for the uncapacitated facility location problem”. In: Information and Computation 222 (2013), pp. 45–58. ↩︎
- Sudipto Guha and Samir Khuller. “Greedy strikes back: Improved facility location algorithms”. In: Journal of algorithms 31.1 (1999), pp. 228–248. ↩︎
- Kamal Jain and Vijay V Vazirani. “Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation”. In: Journal of the ACM (JACM) 48.2 (2001), pp. 274–296. ↩︎
- Reuven Bar-Yehuda and Shimon Even. “A linear-time approximation algorithm for the weighted vertex cover problem”. In: Journal of Algorithms 2.2 (1981), pp. 198–203. ↩︎
- Yair Bartal, Moses Charikar, and Danny Raz. “Approximating Min-Sum k-Clustering in Metric Spaces”. In: Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing. STOC ’01. Hersonissos, Greece: Association for Computing Machinery, 2001, pp. 11–20. isbn: 1581133499. doi: 10.1145/380752.380754. url: https://doi.org/10.1145/380752.380754. ↩︎
- Madhukar R Korupolu, C Greg Plaxton, and Rajmohan Rajaraman. “Analysis of a local search heuristic for facility location problems”. In: Journal of algorithms 37.1 (2000), pp. 146–188. ↩︎
- Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. “A constant-factor approximation algorithm for the k-median problem”. In: Journal of Computer and System Sciences 65.1 (2002), pp. 129–149. ↩︎
- Ravishankar Krishnaswamy, Amit Kumar, Viswanath Nagarajan, Yogish Sabharwal, and Barna Saha. “The matroid median problem”. In: Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms. SIAM. 2011, pp. 1117–1130. ↩︎
- Chaitanya Swamy. “Improved approximation algorithms for matroid and knapsack median problems and applications”. In: ACM Transactions on Algorithms (TALG) 12.4 (2016), pp. 1–22. ↩︎
- Anupam Gupta and Kanat Tangwongsan. Simpler Analyses of Local Search Algorithms for Facility Location. 2008. arXiv: 0809.2554 [cs.DS]. ↩︎
- Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. “The planar k-means problem is NP-hard”. In: Theoretical Computer Science 442 (2012), pp. 13–21. ↩︎
- Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. “The Hardness of Approximation of Euclidean k-Means”. In: 31st International Symposium on Computational Geometry (SoCG 2015). Ed. by Lars Arge and János Pach. Vol. 34. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2015, pp. 754–767. isbn: 978-3-939897-83-5. doi: 10.4230/LIPIcs.SOCG.2015.754. url: http://drops.dagstuhl.de/opus/volltexte/2015/5117. ↩︎
- Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. “The effectiveness of Lloyd-type methods for the k-means problem”. In: Journal of the ACM (JACM) 59.6 (2013), pp. 1–22. ↩︎
- Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. “Adaptive sampling for k-means clustering”. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 2009, pp. 15–28. ↩︎
- Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. “Streaming k-means approximation.” In: NIPS. Vol. 22. 2009, pp. 10–18. ↩︎
- Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. “A simple D²-sampling-based PTAS for k-means and other clustering problems”. In: Algorithmica 70.1 (2014), pp. 22–46. ↩︎
- Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. “Scalable K-Means++”. In: Proc. VLDB Endow. 5.7 (Mar. 2012), pp. 622–633. issn: 2150-8097. doi: 10.14778/2180912.2180915. url: https://doi.org/10.14778/2180912.2180915. ↩︎