Getting Started

For a very basic setup, you can write a docker-compose.yml file and run BioMAJ with Docker, which is especially helpful when you are testing out BioMAJ.

Docker

version: '2'
services:
    biomaj:
        image: osallou/biomaj-docker
        links:
            - mongodb:biomaj-mongodb
        volumes:
            - ./data:/var/lib/biomaj

    mongodb:
        image: mongo

This configuration file defines two services: a simple MongoDB instance, which BioMAJ uses for backend storage, and the BioMAJ instance itself. Line 8 (the ./data:/var/lib/biomaj entry under volumes) mounts a folder named data in the current directory into the container at /var/lib/biomaj. Any files downloaded by BioMAJ will appear in this directory.
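
To make the volume mapping concrete, here is a minimal Python sketch that translates a path as seen inside the container into the matching path on the host. It assumes the volumes entry from the file above (./data mounted at /var/lib/biomaj). The helper name to_host_path and the file name example.txt are made up for this illustration and are not part of BioMAJ.

```python
from pathlib import PurePosixPath

# Values taken from the volumes entry "./data:/var/lib/biomaj".
HOST_DIR = PurePosixPath("data")
CONTAINER_DIR = PurePosixPath("/var/lib/biomaj")


def to_host_path(container_path: str) -> PurePosixPath:
    """Map a path inside the container to its location on the host.

    Hypothetical helper for illustration only. Raises ValueError when
    the path does not live under the mounted directory.
    """
    relative = PurePosixPath(container_path).relative_to(CONTAINER_DIR)
    return HOST_DIR / relative


# example.txt is a placeholder file name, not something BioMAJ creates.
print(to_host_path("/var/lib/biomaj/example.txt"))
```

Paths that BioMAJ prints, such as the location of a log file, can be translated the same way to find the file on the host.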

To confirm that the setup works, ask BioMAJ to print its help text:

$ docker-compose run --rm biomaj --help

Simple Configuration

Once you’ve reached this point, you’re ready to start configuring BioMAJ to download datasets for you. Configuration files should go inside a folder named conf inside the data folder in your current directory. As an example, we will use this simple ALU configuration file:

[GENERAL]
# Database name/description
db.fullname="alu.n : alu repeat element. alu.a : translation of alu.n repeats"
# The short name for the database
db.name=alu
# Database type. Some common values include genome, nucleic, nucleic_protein, protein, other
db.type=nucleic_protein
# Base directory to download temp files to
offline.dir.name=offline/ncbi/blast/alu_tmp
# Base directory to download to
dir.version=ncbi/blast/alu
# Update frequency
frequency.update=0
# Number of threads used during downloading
files.num.threads=1

# Protocol, common values include ftp, http
protocol=ftp
# The FQDN of the server you wish to connect to
server=ftp.ncbi.nih.gov
# And the directory on that server
remote.dir=/blast/db/FASTA/
# The files to find in that page of the remote server.
remote.files=^alu.*\.gz$

# BioMAJ can automatically extract the version number from a release
# document. This will be covered in another section.
release.file=
release.regexp=
release.file.compressed=

#Uncomment if you don't want to extract the data files.
#no.extract=true

# Regular expression for the files to keep locally after download and extraction
local.files=^alu\.(a|n).*

## Post Process  ##  The files should be located in the projectfiles/process directory
db.post.process=

### Deployment ###
keep.old.version=1

The file can be broken down into four sections:

  • Metadata (lines 1-15)
  • Remote Source (lines 17-24)
  • Release Information (lines 26-30)
  • Other (the remaining lines)

The metadata covers things like where data should be stored and how it should be named. The remote source describes where the data is fetched from. Release information is left empty here and will be covered in another example. The remaining lines are a few miscellaneous options.
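
Because the file is laid out as an INI-style section of key=value pairs, it can be inspected with Python's standard configparser module. The sketch below is only a way to look at the settings; it assumes nothing about how BioMAJ itself loads the file. The excerpt is a shortened copy of the ALU configuration, and load_general is a name chosen for this illustration.

```python
import configparser

# Shortened excerpt of the ALU configuration shown above.
ALU_EXCERPT = r"""
[GENERAL]
db.name=alu
db.type=nucleic_protein
protocol=ftp
server=ftp.ncbi.nih.gov
remote.dir=/blast/db/FASTA/
remote.files=^alu.*\.gz$
release.file=
keep.old.version=1
"""


def load_general(text: str) -> dict:
    """Parse properties text and return the [GENERAL] section as a dict."""
    # Only "=" separates keys from values, and "%" interpolation is
    # disabled so that values are returned exactly as written.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    return dict(parser["GENERAL"])


settings = load_general(ALU_EXCERPT)
for key in ("db.name", "server", "remote.dir", "remote.files"):
    print(f"{key} -> {settings[key]!r}")
```

To inspect a file on disk instead of an embedded string, parser.read("data/conf/alu.properties") can replace the read_string call.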

Save this configuration as ./data/conf/alu.properties. Once the file is in place, you are ready to download this database:

$ docker-compose run --rm biomaj --bank alu --update
2016-08-24 21:43:15,276 INFO  [root][MainThread] Log file: /var/lib/biomaj/log/alu/1472074995.28/alu.log
Log file: /var/lib/biomaj/log/alu/1472074995.28/alu.log
...

This command should complete successfully, and you will have some more files in ./data/:

$ find data
data/conf/alu.properties
data/data/ncbi/blast/alu/alu-2003-11-26/flat/alu.a
data/data/ncbi/blast/alu/alu-2003-11-26/flat/alu.n
data/cache/files_1472074995.29
data/log/alu/1472074995.28/alu.log

The data/data directories contain your downloaded files. In addition, there is a cache file and a job run log that records what occurred during the download and processing. Note that the files that appear are alu.a and alu.n, instead of alu.a.gz and alu.n.gz. Because the option no.extract=true is left commented out on line 33, BioMAJ automatically extracted the data for us.
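
The two file patterns in the configuration can be tried out on their own. The sketch below applies the remote.files and local.files expressions with Python's re module, on the assumption that BioMAJ matches each pattern from the start of the file name. The names alu.a.gz.md5 and README are hypothetical entries added to show what gets rejected, and select is a helper written for this illustration.

```python
import re

REMOTE_FILES = re.compile(r"^alu.*\.gz$")   # value of remote.files
LOCAL_FILES = re.compile(r"^alu\.(a|n).*")  # value of local.files


def select(pattern, names):
    """Return the names that the pattern matches from their first character."""
    return [name for name in names if pattern.match(name)]


# The last two entries are hypothetical and only show non-matching names.
remote_listing = ["alu.a.gz", "alu.n.gz", "alu.a.gz.md5", "README"]
extracted = ["alu.a", "alu.n"]

print(select(REMOTE_FILES, remote_listing))  # ['alu.a.gz', 'alu.n.gz']
print(select(LOCAL_FILES, extracted))        # ['alu.a', 'alu.n']
```

Testing a pattern this way before running an update is a cheap check that it selects the intended files and nothing else.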

The --status option shows the status of the databases you have downloaded:

$ docker-compose run --rm biomaj --bank alu --status
+--------+-----------------+----------------------+---------------------+
| Name   | Type(s)         | Last update status   | Published release   |
|--------+-----------------+----------------------+---------------------|
| alu    | nucleic_protein | 2016-08-24 21:58:14  | 2003-11-26          |
+--------+-----------------+----------------------+---------------------+
+---------------------+------------------+------------+----------------------------------------------------+----------+
| Session             | Remote release   | Release    | Directory                                          | Freeze   |
|---------------------+------------------+------------+----------------------------------------------------+----------|
| 2016-08-24 21:58:14 | 2003-11-26       | 2003-11-26 | /var/lib/biomaj/data/ncbi/blast/alu/alu-2003-11-26 | no       |
+---------------------+------------------+------------+----------------------------------------------------+----------+
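
All of the commands used so far share the same shape: docker-compose run --rm biomaj followed by BioMAJ's own arguments. If you want to script them, a small Python wrapper along these lines may help. It is a sketch that only uses the options shown in this guide; the function names are invented for the example.

```python
import shlex


def biomaj_command(*biomaj_args: str) -> list:
    """Build the docker-compose invocation for one BioMAJ run."""
    return ["docker-compose", "run", "--rm", "biomaj", *biomaj_args]


def update_command(bank: str) -> list:
    """Command that downloads or updates the given bank."""
    return biomaj_command("--bank", bank, "--update")


def status_command(bank: str) -> list:
    """Command that prints the status tables for the given bank."""
    return biomaj_command("--bank", bank, "--status")


# Print the commands instead of running them, so the sketch works
# even on a machine without Docker installed.
print(shlex.join(update_command("alu")))
print(shlex.join(status_command("alu")))
```

To actually run a command, pass the list to subprocess.run(cmd, check=True) from the directory that contains docker-compose.yml.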

Advanced Configuration

Once you have this sort of simple configuration working, you may wish to explore more advanced configurations. There is a public repository of BioMAJ configurations which will be interesting to the advanced user wishing to learn more about what can be done with BioMAJ.