Skip to content

Bobain/pysql2neo4j

Repository files navigation

Export from Postgresql import in Neo4j QnD fixes from a dead repository : https://github.com/lycofron/pysql2neo4j (hence the non PEP 8, full camelCase code, whereas I hate it...) I've adapted this code to make it work with recent neo4j and py2neo version works only for postgresql

#install Neo4j:

wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add - echo 'deb http://debian.neo4j.org/repo stable/' | sudo tee /etc/apt/sources.list.d/neo4j.list sudo apt-get update sudo apt-get install neo4j

sudo neo4j start

visit http://localhost:7474

in case you need to start from scratch : sudo neo4j stop sudo rm -rf /var/lib/neo4j/data/databases/graph.db sudo neo4j start

cheatsheet : https://neo4j.com/docs/cypher-refcard/current/

To avoid the : WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual. I did : sudo vim /etc/security/limits.conf and added the following lines: root soft nofile 40000 root hard nofile 40000

neo4j soft nofile 40000 neo4j hard nofile 40000

tonigor soft nofile 40000 tonigor hard nofile 40000

performance tuning : https://neo4j.com/developer/guide-performance-tuning/ du -hs /var/lib/neo4j/data/databases/graph.db -> 4.6 Gb so I chose (sudo vim /etc/neo4j/neo4j.conf): dbms.memory.heap.initial_size=???g dbms.memory.heap.max_size=???g dbms.memory.pagecache.size=???g

set the password on a server wihtout remote browser connection: curl -H "Content-Type: application/json" -X POST -d '{"password":"Neo4j"}' -u neo4j:neo4j http://localhost:7474/user/neo4j/password

TODO : make it work on a specified schema the schema selection does not work, you have to manually move it to public: psql -d stats DROP SCHEMA public CASCADE; ALTER SCHEMA ulule_extract RENAME TO public; \q

#################################################################### OLD README ####################################################################

pysql2neo4j

A script to automatically migrate relational databases to Neo4J.

It is based on SqlAlchemy and py2neo.

Motivation / Intended audience

When I first discovered Neo4J, I thought it was great. I wanted to try it but I needed a database I was familiar with to try it out. I had a couple of relational databases at the time, but I did not like the idea of loading each table separately, then setting up relationships.

I thought there should be a tool to do the job automatically.

So, here it is. This script is meant to extract an SQL database and migrate its data to Neo4j automatically, with some simple modifications on the model. It is intended to be used by people who need some data on Neo4j to play with and get them quickly up and running, or to provide a starting point from which they can build their graph model. It is not meant to provide an optimal transformation of relational data as a graph - after all, "optimal", in this case, is a matter of personal taste.

Requirements

  • Python 2.7
    • SQLAlchemy 0.8.3 or newer
    • py2neo 2.0
    • unicodecsv 0.9.4
    • you need to have an SQLAlchemy-supported DBAPI. In development, mysql-connector v.1.0.12 was used.
  • Neo4j installed locally. Version 2.1.5 was used during development.
  • The database to be migrated
    • must declare a primary key on every table
    • must have all foreign keys declared

(Well, I understand that the aforementioned requirements leave out approximately 95% of Real World databases :) )

What it does (in brief):

  • sqlalchemy inspects the database and gets schema information
  • tables are extracted to CSV files
  • csv files are imported into Neo4j
  • Constraints or indexes are created for each primary key
  • relations are imported
  • the model of imported data is created as a graph, for reference

What it does (in detail):

Via sqlalchemy, the script gets all necessary information about the relational database's structure: tables, primary keys, foreign keys and columns with their data types.

Using this information, it extracts each table to a bunch of CSV files, encoded in utf-8. Each file contains, by default, 10K rows - but you can change this if you want.

For each CSV file that was exported, a corresponding Cypher command is issued to Neo4j in order to load it to Neo4j, using periodic commit (set to 500 for testing, but also configurable). Each row is imported as a node labeled after the table name, by default Capitalized (i.e. employees => Employees).

Then constraints and indexes are created. Single-field primary keys and unique indexes are directly converted to unique constraints in Neo4j. Composite primary keys and unique indexes, are converted to multiple single-field indexes.

In the last step, using the same CSV files, relations are created. The relationship type is all caps, in the form of REFERENCED_REFERENCING and with direction from referenced to referencing.

In addition to the above, the script will create a graph representation of the structure of imported data. The nodes will have two labels (Pysql2neo4j and SchemaInfo) with table name as property __tablename. They will be connected by the relationships that are connecting the corresponding nodes. Relationships will have their name as property __relationType. The other properties will be in the form sourceTable.sourceField : destinationTable.destinationField.

Transformations in schema

In the process, the script does the following changes (all of them being default, but optional):

  • It removes redundant fields i.e. fields that (1) are part of a foreign key and (2) do not participate in a primary key
  • Tables implementing a many-to-many relationship are not imported as nodes but as relationships. Non-redundant fields of these tables become relationship properties.

USAGE

You need to edit file settings.ini before running main.py. Parameter documentation is provided in the file.

Dry run

By default, the script will only simulate the process and will not perform any changes. To perform changes, you need to set the flag "dry_run" on settings.ini to 0.

Offline mode

In offline mode, the script will not import any data to Neo4j, not even try to communicate with it. It will only export the csv files plus a cypher script to import them. This way, you may copy these files to another machine and do the import. As the script is in Cypher, the target machine needs nothing but Neo4j installed. Note that, in that case, it is better to specify the directory where these files will be on the target machine (there is an option for that in settings.ini).

Data type support

  • LOBs are not supported by Neo4j and, thus, not supported by the script at the moment.
  • All other standard sql types are supported.
  • Vendor-specific data types (e.g. SMALLMONEY, INTERVAL etc.) are, in theory, supported, but no tests have been performed. If you encounter such types, please provide feedback.

Tests so far

So far, the script has migrated:

  • SAKILA sample database on MySQL.
  • Northwind sample database on MySQL.

Features under consideration

  • Offline mode: extract CSVs to a convenient directory. Create a cypher script to import them and create relations, in order to run it anytime, on a different machine.
  • User edit mode: output a file describing the actions that will be performed (maybe in YAML format) that can be modified by the user. That way, the user will be able to rename relationship ACTOR_FILM to PLAYS_IN
  • Some way to handle LOBs: maybe save LOBs as binary files and store the file's URI as string.
  • ... ?

LICENSE


Copyright (c) 2014, Ioannis Theodosis

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.