I’ve been working on migrating an application using MySQL to using Neo4j, a graph database for the few past months.
The new application uses Neo4j as its primary data store. The application accesses the Neo4j database through the Spring Data for Neo4j framework (SDN). SDN is built on top of Spring and Spring Data and adds a higher level API on top Neo4j’s Java API. Most importantly it allows applications to map Neo4j entities to Java types (e.g. User, Job, etc.) rather than Neo4j’s Node or Relationship types, which are more or less just collections of key/value pairs (ProperyContainer).
As part of the development effort I need to migrate data from MySQL to Neo4j. It’s roughly 500 million rows of data in 4 InnoDB tables in MySQL.
BatchInserter is Neo4j’s solution to bulk load data into a Neo4j database. It speeds up bulk loading data into neo4j by bypassing transactions among other things. However, BatchInserter is not designed to work with SDN out-of-the-box.
As mentioned above, among other things, SDN acts as a mapping layer between Java types and Neo4j entities. At the data level it can do this in a few different ways using its TypeRepresentationStrategy implementations. The default implementation is a IndexingNodeTypeRepresentationStrategy (for node entities) and IndexingRelationshipTypeRepresentationStrategy (for relationship entities), which add a __type__ property on each entity, and creates a __types__ node index and __rel_types__ relationship index. BatchInserter will not create these properties or indexes on its own.
With some help from Michael Hunger, a Neo4j engineer primarily responsible for SDN development, I implemented a simple application using BatchInserter to populate a Neo4j database with the __type__ properties and __types__ and __rel_types__ indexes created as they should.
The important bits are the pieces of code that add the “__type__” property and the nodeTypeIndex and relTypeIndex manipulation. They deal with the SDN TypeRepresentationStrategy required properties and indexes.
Here’s a simplified example on how I implemented it:
BatchInserter routine:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
Map configFull = MapUtil.stringMap("type", "fulltext", "to_lower_case", "true"); Map configExact = MapUtil.stringMap("type", "exact"); BatchInserter inserter = new BatchInserterImpl("/data/graph.db"); BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter); // the SDN node and relationship indexes BatchInserterIndex nodeTypeIndex = indexProvider.nodeIndex("__types__", configExact); BatchInserterIndex relTypeIndex = indexProvider.relationshipIndex("__rel_types__", configExact); // indexes for the node entities BatchInserterIndex nodeFullIndex = indexProvider.nodeIndex("nodefull", configFull); BatchInserterIndex nodeExactIndex = indexProvider.nodeIndex("nodeexact", configExact); // indexes for the relationship entities BatchInserterIndex relationshipFullIndex = indexProvider.relationshipIndex("relfull", configFull); BatchInserterIndex relationshipExactIndex = indexProvider.relationshipIndex("relexact", configExact); Long endNode = 0l; // for illustrative purposes only, create a relationship to the reference node try { List nodes = getNodes(); for (MyNode node : nodes) { Map allNodeProperties = getNonIndexedNodeProperties(node); Map exactNodeProperties = getExactIndexedNodeProperties(node); Map fullNodeProperties = getFullIndexedNodeProperties(node); allNodeProperties.putAll(exactNodeProperties); allNodeProperties.putAll(fullNodeProperties); Map allRelationshipProperties = getNonIndexedRelationshipProperties(node); Map exactRelationshipProperties = getExactIndexedRelationshipProperties(node); Map fullRelationshipProperties = getFullIndexedRelationshipProperties(node); allRelationshipProperties.putAll(exactNodeProperties); allRelationshipProperties.putAll(fullNodeProperties); Long node = inserter.createNode(allNodeProperties); nodeExactIndex.add(node, exactNodeProperties); nodeFullIndex.add(node, fullNodeProperties); nodeTypeIndex.add(node, MapUtil.map("className", MyNode.getClass().getName())); Long relationship = inserter.createRelationship(node, endNode, new ExampleType(), allRelationshipProperties); relationshipExactIndex.add(node, exactRelationshipProperties); relationshipFullIndex.add(node, fullRelationshipProperties); relTypeIndex.add(node, MapUtil.map("className", MyRelationship.getClass().getName())); } } finally { nodeFullIndex.flush(); nodeExactIndex.flush(); relationshipFullIndex.flush(); relationshipExactIndex.flush(); relTypeIndex.flush(); nodeTypeIndex.flush(); indexProvider.shutdown(); inserter.shutdown(); } |
RelationshipType implementation:
1 2 3 4 5 6 |
public class ExampleType implements RelationshipType { @Override public String name() { return "RELATED"; } } |
Utility functions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
private Map getNonIndexedNodeProperties(MyNode node) { Map m = Maps.newHashMap(); m.put("__type__", MyNode.getClass().getName()); m.put("omega", node.getOmega()); return m; } private Map getExactIndexedNodeProperties(MyNode node) { Map m = Maps.newHashMap(); if (node.getAlpha() != null) { m.put("alpha", node.getAlpha()); } if (node.getBeta() != null) { m.put("beta", node.getBeta()); } return m; } private Map getFullIndexedNodeProperties(MyNode node) { Map m = Maps.newHashMap(); if (node.getFullName() != null) { m.put("fullName", node.getFullName()); } return m; } private Map getNonIndexedRelationshipProperties(MyNode node) { Map m = Maps.newHashMap(); m.put("__type__", MyRelationship.getClass().getName()); m.put("foo", node.getFoo()); m.put("bar", node.getBar()); return m; } private Map getExactIndexedRelationshipProperties(MyNode node) { Map m = Maps.newHashMap(); m.put("dataSource", node.getSource()); if (node.getGamma() != null) { m.put("gamma", node.getGamma()); } return m; } private Map getFullIndexedRelationshipProperties(MyNode node) { Map m = Maps.newHashMap(); if (node.getKappa() != null) { m.put("kappa", node.getKappa()); } return m; } |
Hi Tero,
thanks a lot for the article. It would be great if you could share the code as a project for instance on github (including the mysql connection to pull the data).
Something that could reduce the size of your graph is to use @TypeAlias(“xx”) on top of your SDN classes, so that the __type__ property could be reduced to a few characters instead of the FQN.
Could you also post some information about the runtime of your insertion?
Hi Michael,
Good tip on TypeAlias.
The example I posted on this article is simplified quite a bit. I just wanted to demonstrate the SDN “integration” without all the gunk I had in the actual migration application.
The initial data migration finished earlier today, actually. It ended up being 90M nodes and 240M relationships. Database size (incl. Lucene indexes) was 197GB. I still have to do incremental migrations of data that has changed since I started the imports. It’s going to be about 10% of the data, if my estimations are accurate.
There were three primary issues complicating the migration process:
1. The MySQL database that was the source of the data is hopelessly over the capacity of the servers it’s running on (which is one of the reasons we’re moving the data out). We’re using master/slave replication with three slaves, one of which is dedicated to my data migration apps. There was one particular table that was performing so poorly a straightforward data export was going to take about two weeks or more. It was taking 2 to 5 minutes to get 50,000 records at a time from that table. I tried everything I could think of trying to optimize the queries against that table, but nothing helped. So what I did instead was parallel process the data export. I had four threads running simultaneously taking data out of MySQL and dumping into a separate Neo4j database for each batch of 10,000 primary entities (primary entity + a object graph of dependent data…roughly 50K to 100K actual rows per batch). That created almost 20,000 small neo4j databases of 5MB to 20MB each. These small neo4j databases did not have the SDN __type__ properties or Lucene indexes, but they had the proper graph structure, and all the node and relationship properties fully populated. I then ran a script that read the data from the 20,000 databases and combined them into a single database that will eventually become the production database. I saved about 50% to 60% of total runtime doing this.
2. I have a fair number of unique entities in this database, so the migration scripts needed to do quite a few index lookups to make sure I wasn’t creating duplicate entities. This slowed down the process a lot as well. I also noticed that Lucene lookups would degrade faster as the database size grew. I created a Mongo implementation of the BatchIndex and used that for the uniqueness lookups. Mongo based index lookups seemed to have a more consistent performance characteristics than Lucene based index lookups.
3. I didn’t size the server performing the migrations correctly. I only had 16GB of RAM on it. That was adequate until about half way through the import. After that I was consuming more than 16GB of memory during the BatchIndex flush and BatchInserter shutdown operations. It slowed it down a LOT. If I were to do this again from scratch, I’d get a server with 64GB of memory and faster disks.
I might post another blog post about all that at a later date.
Tero,
a blog post about the things you learned and perhaps the source code of the Mongo based index (it would also be interesting to look into redis or memcached for that) would be really great.
I’m just thinking about providing a tool like this as part of SDN. It would use the meta-information of the entities (MappinContext) to correctly control the batch-inserter, it should be possible then to stream iterators of entities (sub-graphs) to the batch inserter.
Did you ever look into accessing mysql not through sql but through those lower level KV-based APIs for the export? We had many customers whose mysql export took several orders of magnitude longer than the import into Neo4j.
Thanks
Michael
Thanks guys. What should be done for index with unique constraint and a spatial index?
Michael, Do you know if there any plan to support batch functionality in SDN Neo4j?
This is awesome – thanks a bunch!
Initially had troubles using short class names (@TypeAlias did not behave exactly as I had expected). Using
for __type__ and className values solved that. I expect that’s what Michael intended in his comment above. Interestingly it was not required for __type__ – it just looks better/takes less space.
Could also get the TypeAlias directly from the class with eg this, but that seems to circumvent the point and add unnecessary complexity to a batch operation.
Shouldn’t it be
relTypeIndex.add(relationship, MapUtil.map(“className”, MyRelationship.getClass().getName()));
instead of
relTypeIndex.add(node, MapUtil.map(“className”, MyRelationship.getClass().getName()));
Hi Michael,
following Tero’s post I implemented something similar, but with the main difference that indexes are created automatically based on the Spring Data annotations in the classes. Similarly also the properties for nodes and entities are extracted automatically based on the annotated properties in the corresponding POJOs.
I’d be very happy to contribute if you plan to have such tool as part of SDN.
Thanks,
Roberto