Java Collections: Sets

A Set is a Java Collection that does not allow duplicate elements.  A Java Set is embodiment of a Mathematical Set, which comprises of zero or more distinct, possibly infinite distinct unordered elements. Like all other collections, Sets in Java comprise of an interface aptly named Set and a number of concrete implementations.

Differences Between Java Sets and Sets in Mathematics

A Java Set models as closely as possible a mathematical set. Nevertheless, there are a few noteworthy differences

  1. A mathematical set is by definition unordered. Therefore, it is not possible to identify elements of a mathematical set based on the index. A Java Set may or may not be unordered. A few implementations such as HashSet and TreeSet are unordered, others such as LinkedHashSet are not.
  2. A mathematical set may contain heterogeneous types of elements. For example,
    1{ “CodingRaptor.com is #”, 1 }
    is a perfectly valid mathematical set. On the other hand, Java being a typed language cannot allow variables of distinct types to be the member of the same set.
  3. A mathematical set may be an infinite set. For example, a set comprising all positive numbers is infinite and is a perfectly valid mathematical set. Of course, no programming language can accommodate infinite elements.
  4. A Java Set uses 1 element1.equals(element2)to determine whether two elements are distinct or not. A mathematical notion of equality is based on the operator.
  5. A mathematical set is allowed to have a maximum of one null element, denoted by ∅. A Java Set may or may not allow null elements. This behavior depends on the concrete implementation.

Java Set Implementations

We have already seen that the Set interface does not define any new methods. There are two sub-interfaces of java.util.Set interface viz. SortedSet and NavigableSet. It should be noted here that although Set interface does not introduce any new methods but imposes new contracts implicitly. For example, the add method of a Set will reject duplicates.

The Java Collections framework provides four concrete, ready-to-use implementations of the Set interface.

  1. HashSet
  2. LinkedHashSet
  3. TreeSet
  4. EnumSet

HashSet

HashSet is the most often used Set. Internally it uses a hash table (a HashMap) to ensure that duplicates are not allowed. The other salient features of  HashSet are –

  1. Θ(1) or constant time performance for add, remove, contains and size operations. This is the core value proposition of HashSet. Anytime you want a bag like data structure of unique elements with best add and remove performance, HashSet should be used.
  2. Ο(n) or linear time performance for iterating over all elements. Again, this is the most efficient asymptotic performance for iterating over all elements. Note that iteration performance is dependent on the capacity rather than size of HashSet.
  3. The overall performance of a HashSet is dependent on initial capacity and load factor.
  4. HashSet is unordered. Therefore, order of insertion into the data structure says nothing about the relative position of elements.
  5. HashSet permits null value to be added exactly once. By default the hashCode() returns 0 for null.
  6. HashSet is not synchronized.  This means that in a multi-threaded environment a HashSet may be altered by one thread while other threads are concurrently reading or writing the same HashSet.
  7. The iterators of HashSet are fail-fast. This means that in a multi-threaded environment if a HashSet is modified in a different thread while it is being used in the current thread, a ConcurrentModificationException is thrown. Throwing of this exception partly solves multiple reader writer without using a lock as it notifies to other threads that the underlying data structures has been modified and they need to begin their operations afresh. Note that HashSet does not provide hard guarantees around ConcurrentModificationException. Therefore, while using HashSet you should be prepared for this exception but you cannot depend on this exception for correctness of programs.

LinkedHashSet

A LinkedHashSet is a sub-class of LinkedHashSet and has just one purpose over and beyond what is served by HashSet. A LinkedHashSet preserves the insertion order of the elements. Therefore, it is possible to iterate over the elements of a LinkedHashSet in the order they were inserted in. Since is LinkedHashSet a HashSet, all other properties of HashSet mentioned in the previous section also apply to LinkedHashSet.

A LinkedHashSet maintains the insertion order of the elements by also keeping the elements in a doubly linked list in addition to being kept in HashMap. This extra data structure results in slight dip in performance for add, remove and contains operations as the changes need to be made at two places. The asymptotic complexity remains the same.

Iteration, on the other hand, is slightly more efficient in LinkedHashSet than in HashSet. This is because iterating a doubly linked list is proportional to the actual size (i.e. number of elements). Iterating a HashMap (and therefore HashSet) on the other hand is proportional to the capacity of the hashtable. Overall performance of LinkedHashSet pretty much like HashSet is dependent on initial capacity and load factor. But in LinkedHashSet the performance hit for having a high initial capacity is pretty low.

TreeSet

A TreeSet is an ordered Set and therefore has very different characteristic than a HashSet or LinkedHashSet. A TreeSet maintains the ordering (either natural or as defined by a Comparator) elements. In practical terms this translates into the elements being accessed in ascending order during iteration.

Salient features of TreeSet –

  1. A TreeSet keeps its element sorted. This sorting may be based on natural ordering or a Comparator instance.
  2. TreeSet uses a Red-Black Tree underneath the same way as a HashSet uses HashMap. (The underlying implementation is JDK dependent and is not part of any specification. Oracle’s JDK uses Red-Black Tree, other JDKs may use other Trees).
  3. lg(n) or logarithmic time performance for add, remove, contains and size operations. This complexity is the result of using a balanced tree as the underlying implementation.
  4. TreeSet is not synchronized.  This means that in a multi-threaded environment a TreeSet may be altered by one thread while other threads are concurrently reading or writing the same TreeSet.
  5. The iterators of TreeSet are fail-fast. This means that in a multi-threaded environment if a TreeSet is modified in a different thread while it is being used in the current thread, a ConcurrentModificationException is thrown. Throwing of this exception partly solves multiple reader writer without using a lock as it notifies to other threads that the underlying data structures has been modified and they need to begin their operations afresh. Note that TreeSet does not provide hard guarantees around ConcurrentModificationException. Therefore, while using TreeSet you should be prepared for this exception but you cannot depend on this exception for correctness of programs.
  6. A TreeSet does not allow null element.

EnumSet

An EnumSet, introduced in JDK 1.5, is a special kind of Set where all elements are of the same enum type.

The salient features of EnumSet are as follows –

  1. EnumSet is abstract and cannot be instantiated using new operator.
  2. All the elements of EnumSet must be of the same enum type.
  3. The elements of an EnumSet must be specified at the creation time. This is usually done by passing arguments to the of method of an EnumSet.
  4. All operations such as add, remove, contains are Ο(1). This is the biggest selling point of EnumSet. From a performance angle this is the most efficient Set, although it is not as versatile as others and cannot be used in every situation.
  5. EnumSet is internally implemented as Bit Vectors.
  6. EnumSet is not synchronized and therefore unsafe in a multi-threaded environment.
  7. The iterators of EnumSet are not fail-safe but weakly consistent. Therefore, a concurrent modification will never throw an exception.
  8. null values are not permitted in an EnumSet.

Leave a comment

Your email address will not be published. Required fields are marked *