The Cartesian Product Read Issue

Building up a project with a JPA persistence layer requires some design decisions from you. Although the abstraction layer is meant to relieve you from the pain working with a relational database, you have to keep in mind that everything you do with your entities has somehow to be translated to the underlying resource and is subject to its constraints. Often the abstraction layer even adds constraints, since the OR-Mapper has to be independent from the concrete implementation of your resource and therefore can only use the intersection of the different sets of functions. Furthermore there are general solutions to many common problems that might be a lot of slower than a specialized implementation in some cases. In the last posts I tried to give some hints and advices, how a persistence provider — especially hibernate — can be configured, in order to overcome many of this problems. Still, you have to decide beside plenty others, when to use lazy- and eager-fetching.

This post describes another concern using eager fetches, the cartesian product read issue. Imagine you have a bidirectional one-to-many association between a bill and the corresponding line items. Every line item has a many-to-one association referencing the corresponding bill. Since you always use the latter association in your code, you decide to fetch this association eagerly (either global defined in the @ManyToOne annotation or with a fetch join in a jpql query):

public @Table @Entity class Bill {
  //...
  @OneToMany
  @OrderBy("position ASC")
  @Fetch(FetchMode.SUBSELECT)
  public Set<LineItem> getLineItems() {
    return this.lineItems;
  }
}

public @Table @Entity class LineItem {
  // ...
  @ManyToOne(fetch=FetchType.EAGER)
  public Bill getOwningBill() {
    this.owningBill;
  }
}

This seems to be a wise decision, but let us take a more comprehensive look at it. If you try to retrieve the one-to-many association from the bill, it is retrieved in a second query to the database, the corresponding bill for each of the elements is retrieved via a LEFT OUTER JOIN. So, if the bill has 100 entries, the bill will be retrieved 100+1 times, one time when the bill is queried in the first place and 100 times, when the association is retrieved. Fortunately, the redundant data will be fetched in one database query. It is ignored by the persistence provider, since only one instance of a database row is present in a JPA transaction (first level cache). But still, the data is called unnecessarily. The problem gets more difficult when you retrieve a one-to-many association or even a many-to-many association eagerly. If you do an eager fetch on both sides of a bidirectional many-to-many association n x m rows are loaded. Because of that it is called cartesian product read issue.

So, be careful with eager fetches and if necessary use them only with fetch joins, because you cannot do an eager fetch on both sides of a bidirectional association this way. Second, use FetchMode.SUBSELECT if you apply hibernate or something similar if you are using another persistence provider, in order to have less database calls. In addition you should activate batch fetching, since there are more occasions where multiple rows of an entity are retrieved, than for calling associations:

<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://java.sun.com/xml/ns/persistence"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://java.sun.com/xml/ns/persistence
          http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"
	version="1.0">
  <persistence-unit name="MyPU">
    <!-- ... -->
    <property name="hibernate.jdbc.fetch_size" value="64" />
    <property name="hibernate.jdbc.batch_size" value="64" />
    <property name="hibernate.default_batch_fetch_size" value="64" />
  </persistence-unit>
</persistence>

Third, try not to retrieve associations in your java code in loops. Replace them by queries retrieving the necessary data in one database call instead. Last but not least employ a second level cache (can be configured in persistence.xml, e.g an EHCache) and a query cache (for hibernate add a query hint to the definition of a named query and activate query caching in your persistence unit). Refer to other documentation in order to get a second level and a query cache up and running. If you are encountering problems, write a comment. That might tempt me to author another blog entry for that purpose.

Leave a Reply Cancel reply