Getting Started
For a very basic setup, you can configure a docker-compose.yml file to use with Docker, which is especially helpful when you are testing out BioMAJ.
Docker
version: '2'
services:
  biomaj:
    image: osallou/biomaj-docker
    links:
      - mongodb:biomaj-mongodb
    volumes:
      - ./data:/var/lib/biomaj
  mongodb:
    image: mongo
This configuration file defines the BioMAJ instance itself and a simple MongoDB instance, which BioMAJ uses for backend storage. Line 8, the entry under volumes, mounts a folder named data in the current directory into the container at /var/lib/biomaj. Any files downloaded by BioMAJ will appear in this directory.
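Because of this mount, any path that BioMAJ reports inside the container corresponds to a path under ./data on the host. As a minimal sketch of that translation, assuming the volume mapping shown above (the helper name host_path and the example path are illustrative and not part of BioMAJ):

```python
from pathlib import PurePosixPath

# Both roots come from the "./data:/var/lib/biomaj" volume entry.
CONTAINER_ROOT = PurePosixPath("/var/lib/biomaj")
HOST_ROOT = PurePosixPath("./data")


def host_path(container_path):
    """Translate a path inside the container to its location on the host.

    Raises ValueError if the path is not under the mounted directory.
    """
    relative = PurePosixPath(container_path).relative_to(CONTAINER_ROOT)
    return HOST_ROOT / relative


# Illustrative path; any file under /var/lib/biomaj maps the same way.
print(host_path("/var/lib/biomaj/conf/alu.properties"))
```

Paths outside /var/lib/biomaj are not shared with the host, which is why the helper rejects them instead of guessing.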
To check the setup, run the --help command:
$ docker-compose run --rm biomaj --help
Simple Configuration
Once you’ve reached this point, you’re ready to start configuring BioMAJ to download datasets for you. Configuration files should go inside a folder named conf inside the data folder in your current directory. As an example, we will use this simple ALU configuration file, alu.properties:
[GENERAL]
# Database name/description
db.fullname="alu.n : alu repeat element. alu.a : translation of alu.n repeats"
# The short name for the database
db.name=alu
# Database type. Some common values include genome, nucleic, nucleic_protein, protein, other
db.type=nucleic_protein
# Base directory to download temp files to
offline.dir.name=offline/ncbi/blast/alu_tmp
# Base directory to download to
dir.version=ncbi/blast/alu
# Update frequency
frequency.update=0
# Number of threads used during downloading
files.num.threads=1

# Protocol, common values include ftp, http
protocol=ftp
# The FQDN of the server you wish to connect to
server=ftp.ncbi.nih.gov
# And the directory on that server
remote.dir=/blast/db/FASTA/
# The files to find in that directory of the remote server.
remote.files=^alu.*\.gz$

# BioMAJ can automatically extract the version number from a release
# document. This will be covered in another section.
release.file=
release.regexp=
release.file.compressed=

#Uncomment if you don't want to extract the data files.
#no.extract=true
# ?
local.files=^alu\.(a|n).*

## Post Process ## The files should be located in the projectfiles/process directory
db.post.process=

### Deployment ###
keep.old.version=1
The file can be broken down into four sections:
- Metadata (lines 1-15)
- Remote Source (lines 17-24)
- Release Information (lines 26-30)
- Other
The metadata covers things like how to name the data and where it should be stored. The remote source describes where the data is fetched from. The release information is left empty here and will be covered in another example. The remaining lines are a few miscellaneous options.
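Since the file uses an INI-like layout with a single [GENERAL] section, its values can be inspected with Python's standard configparser module. The following is only a sketch for exploring the file: it works on an abbreviated copy of the example above, the helper name load_bank is hypothetical, and it makes no claim about how BioMAJ itself loads its configuration.

```python
import configparser

# Abbreviated copy of the alu.properties example (comments mostly omitted).
ALU_PROPERTIES = r"""
[GENERAL]
db.name=alu
db.type=nucleic_protein
dir.version=ncbi/blast/alu
protocol=ftp
server=ftp.ncbi.nih.gov
remote.dir=/blast/db/FASTA/
remote.files=^alu.*\.gz$
release.file=
#no.extract=true
local.files=^alu\.(a|n).*
keep.old.version=1
"""


def load_bank(text):
    """Parse properties text and return the [GENERAL] options as a dict."""
    # Interpolation is disabled so '%' and '$' in values are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["GENERAL"])


bank = load_bank(ALU_PROPERTIES)

# Remote source: protocol, server and directory together locate the data.
remote_url = f"{bank['protocol']}://{bank['server']}{bank['remote.dir']}"
print(remote_url)

# Commented-out options are simply absent from the parsed result.
print("no.extract" in bank)
```

Empty options such as release.file are parsed as empty strings, while a commented-out option such as no.extract does not appear at all.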
If you have copied the alu.properties file into ./data/conf/alu.properties, you are ready to download this database:
$ docker-compose run --rm biomaj --bank alu --update
2016-08-24 21:43:15,276 INFO [root][MainThread] Log file: /var/lib/biomaj/log/alu/1472074995.28/alu.log
Log file: /var/lib/biomaj/log/alu/1472074995.28/alu.log
...
This command should complete successfully, and you will have some more files in ./data/:
$ find data
data/conf/alu.properties
data/data/ncbi/blast/alu/alu-2003-11-26/flat/alu.a
data/data/ncbi/blast/alu/alu-2003-11-26/flat/alu.n
data/cache/files_1472074995.29
data/log/alu/1472074995.28/alu.log
The data/data directories contain your downloaded files. Additionally, a cache file exists, and a job run log contains data about what occurred during the download and processing. Note that the files that appear are alu.a and alu.n, instead of alu.a.gz and alu.n.gz. Because the option no.extract=true is commented out on line 33, BioMAJ automatically extracted the data for us.
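The difference between the compressed and extracted names also shows up in the two patterns from the configuration. As a rough illustration, the sketch below tests file names against remote.files and local.files using Python's re module. It assumes the patterns behave as ordinary Python regular expressions matched against bare file names; the file listings are made up for the example, and the exact role BioMAJ gives to local.files is not described in this guide.

```python
import re

# Patterns copied from alu.properties.
REMOTE_FILES = re.compile(r"^alu.*\.gz$")
LOCAL_FILES = re.compile(r"^alu\.(a|n).*")

# Hypothetical listings, for illustration only.
remote_listing = ["alu.a.gz", "alu.n.gz", "alu.a.gz.md5", "README"]
extracted_listing = ["alu.a", "alu.n", "README"]

# Names on the remote server that the remote.files pattern selects.
selected_remote = [name for name in remote_listing if REMOTE_FILES.match(name)]

# Names after extraction that the local.files pattern selects.
selected_local = [name for name in extracted_listing if LOCAL_FILES.match(name)]

print(selected_remote)
print(selected_local)
```

The trailing $ in remote.files is what excludes a name like alu.a.gz.md5, while local.files has no such anchor and would also match the compressed names.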
The --status command allows you to see the status of the databases you have downloaded.
$ docker-compose run --rm biomaj --bank alu --status
+--------+-----------------+----------------------+---------------------+
| Name | Type(s) | Last update status | Published release |
|--------+-----------------+----------------------+---------------------|
| alu | nucleic_protein | 2016-08-24 21:58:14 | 2003-11-26 |
+--------+-----------------+----------------------+---------------------+
+---------------------+------------------+------------+----------------------------------------------------+----------+
| Session | Remote release | Release | Directory | Freeze |
|---------------------+------------------+------------+----------------------------------------------------+----------|
| 2016-08-24 21:58:14 | 2003-11-26 | 2003-11-26 | /var/lib/biomaj/data/ncbi/blast/alu/alu-2003-11-26 | no |
+---------------------+------------------+------------+----------------------------------------------------+----------+
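If you run these commands often, they can be wrapped in a small script. The sketch below only rebuilds the docker-compose invocations shown in this guide; the function names biomaj_argv and run_biomaj are hypothetical, and it assumes docker-compose is on your PATH and that the script is started from the directory containing docker-compose.yml.

```python
import subprocess


def biomaj_argv(*biomaj_args):
    """Build the docker-compose command line for a one-off BioMAJ run."""
    return ["docker-compose", "run", "--rm", "biomaj", *biomaj_args]


def run_biomaj(*biomaj_args):
    """Run BioMAJ in a temporary container and raise if it exits non-zero."""
    return subprocess.run(biomaj_argv(*biomaj_args), check=True)


# Show the command line without executing it.
print(" ".join(biomaj_argv("--bank", "alu", "--status")))
```

Calling run_biomaj("--bank", "alu", "--update") followed by run_biomaj("--bank", "alu", "--status") would then repeat the update and status steps from this section.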
Advanced Configuration
Once you have this sort of simple configuration working, you may wish to explore more advanced configurations. There is a public repository of BioMAJ configurations which will be interesting to the advanced user wishing to learn more about what can be done with BioMAJ.