Welcome to BioMAJ’s documentation!¶
Getting Started¶
For a very basic setup, you can configure a docker-compose.yml file to use with Docker, which is especially helpful when you are testing out BioMAJ.
Docker¶
 1 | version: '2'
 2 | services:
 3 |   biomaj:
 4 |     image: osallou/biomaj-docker
 5 |     links:
 6 |       - mongodb:biomaj-mongodb
 7 |     volumes:
 8 |       - ./data:/var/lib/biomaj
 9 |   mongodb:
10 |     image: mongo
This configuration file defines a simple MongoDB instance, which is used for backend storage by BioMAJ, as well as the BioMAJ instance itself. Line 8 denotes that a folder named data in the current directory will be mounted into the container as storage. Any files downloaded by BioMAJ will appear in this directory.
Running the --help command can be done easily:
$ docker-compose run --rm biomaj --help
Simple Configuration¶
Once you’ve reached this point, you’re ready to start configuring BioMAJ to download datasets for you. Configuration files should go inside a folder conf inside the data folder in your current directory. As an example, we will use this simple ALU configuration file:
 1 | [GENERAL]
 2 | # Database name/description
 3 | db.fullname="alu.n : alu repeat element. alu.a : translation of alu.n repeats"
 4 | # The short name for the database
 5 | db.name=alu
 6 | # Database type. Some common values include genome, nucleic, nucleic_protein, protein, other
 7 | db.type=nucleic_protein
 8 | # Base directory to download temp files to
 9 | offline.dir.name=offline/ncbi/blast/alu_tmp
10 | # Base directory to download to
11 | dir.version=ncbi/blast/alu
12 | # Update frequency
13 | frequency.update=0
14 | # Number of threads used during downloading
15 | files.num.threads=1
16 |
17 | # Protocol, common values include ftp, http
18 | protocol=ftp
19 | # The FQDN of the server you wish to connect to
20 | server=ftp.ncbi.nih.gov
21 | # And the directory on that server
22 | remote.dir=/blast/db/FASTA/
23 | # The files to find in that page of the remote server.
24 | remote.files=^alu.*\.gz$
25 |
26 | # BioMAJ can automatically extract the version number from a release
27 | # document. This will be covered in another section.
28 | release.file=
29 | release.regexp=
30 | release.file.compressed=
31 |
32 | #Uncomment if you don't want to extract the data files.
33 | #no.extract=true
34 |
35 | # Pattern matching the files to keep locally after download/extraction
36 | local.files=^alu\.(a|n).*
37 |
38 | ## Post Process ## The files should be located in the projectfiles/process directory
39 | db.post.process=
40 |
41 | ### Deployment ###
42 | keep.old.version=1
The file can be broken down into a few sections:
- Metadata (lines 1-15)
- Remote Source (lines 17-24)
- Release Information (lines 26-30)
- Other
The metadata consists of things like where data should be stored and how to name it. The remote source describes where the data is fetched from. Release information drives automatic version detection; we will see it in another example, but a short sketch follows below. The rest are a few extra, miscellaneous options shown in this example config.
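For a flavor of the release-detection keys (lines 28-30), here is a hypothetical sketch; the README file name and the capture regexp are illustrative assumptions, not values from the ALU example:
# hypothetical: derive the release number from a remote README document
release.file=README
release.regexp=Release\s+(\d+)
release.file.compressed=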
If you have copied the alu.properties file into ./data/conf/alu.properties, you are ready to download this database:
$ docker-compose run --rm biomaj --bank alu --update
2016-08-24 21:43:15,276 INFO [root][MainThread] Log file: /var/lib/biomaj/log/alu/1472074995.28/alu.log
Log file: /var/lib/biomaj/log/alu/1472074995.28/alu.log
...
This command should complete successfully, and you will have some more files in ./data/:
$ find data
data/conf/alu.properties
data/data/ncbi/blast/alu/alu-2003-11-26/flat/alu.a
data/data/ncbi/blast/alu/alu-2003-11-26/flat/alu.n
data/cache/files_1472074995.29
data/log/alu/1472074995.28/alu.log
The data/data directories contain your downloaded files. Additionally, a cache file exists, and a job run log contains data about what occurred during the download and processing. Note that the files that appear are alu.a and alu.n, instead of alu.a.gz and alu.n.gz. Because the option no.extract=true is commented out on line 33, BioMAJ automatically extracted the data for us.
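If you would rather keep the compressed files exactly as downloaded, uncomment that option in alu.properties; a minimal sketch:
# skip extraction and keep the downloaded .gz files as-is
no.extract=true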
The --status command will allow you to see the status of the various databases you have downloaded.
$ docker-compose run --rm biomaj --bank alu --status
+--------+-----------------+----------------------+---------------------+
| Name | Type(s) | Last update status | Published release |
|--------+-----------------+----------------------+---------------------|
| alu | nucleic_protein | 2016-08-24 21:58:14 | 2003-11-26 |
+--------+-----------------+----------------------+---------------------+
+---------------------+------------------+------------+----------------------------------------------------+----------+
| Session | Remote release | Release | Directory | Freeze |
|---------------------+------------------+------------+----------------------------------------------------+----------|
| 2016-08-24 21:58:14 | 2003-11-26 | 2003-11-26 | /var/lib/biomaj/data/ncbi/blast/alu/alu-2003-11-26 | no |
+---------------------+------------------+------------+----------------------------------------------------+----------+
Advanced Configuration¶
Once you have this sort of simple configuration working, you may wish to explore more advanced configurations. There is a public repository of BioMAJ configurations that will interest advanced users wishing to learn more about what can be done with BioMAJ.
Advanced Topics¶
LDAP¶
The BioMAJ watcher provides an optional web interface to manage banks. Users can create “private” banks and manage them via the web.
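Authentication for the watcher can be backed by LDAP using keys that already appear in global.properties (see the full example below); a minimal sketch enabling it, where the host and DN are placeholder values:
use_ldap=1
ldap.host=ldap.example.org
ldap.port=389
ldap.dn=dc=example,dc=org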
ElasticSearch¶
In order to use the --search flag, you may wish to connect an ElasticSearch cluster. You will need to edit your global.properties to indicate where the ES servers are:
use_elastic=1
#Comma separated list of elasticsearch nodes host1,host2:port2
elastic_nodes=localhost
elastic_index=biomaj
# Calculate data.dir size stats
data.stats=1
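Once everything is wired up (see the compose example below) and a bank has been indexed, the --search flag queries ElasticSearch from the command line. A sketch, in which the --query option is an assumption about your BioMAJ CLI version:
# --query is assumed here; verify the exact search options with: biomaj --help
$ docker-compose run --rm biomaj --search --query "alu"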
An example docker-compose.yml would use this:
version: '2'
services:
  biomaj:
    image: osallou/biomaj-docker
    links:
      - mongodb:biomaj-mongodb
      - elasticsearch
    volumes:
      - ./data:/var/lib/biomaj
      - ./global.advanced.properties:/etc/biomaj/global.properties
  mongodb:
    image: mongo
  elasticsearch:
    image: elasticsearch:1.7
And a modified global.properties referenced in that file would enable ElasticSearch:
[GENERAL]
root.dir=/var/lib/biomaj
conf.dir=%(root.dir)s/conf
log.dir=%(root.dir)s/log
process.dir=%(root.dir)s/process
cache.dir=%(root.dir)s/cache
lock.dir=%(root.dir)s/lock
#The root directory where all databases are stored.
#If your data is not stored under one directory hierarchy
#you can override this value in the database properties file.
data.dir=%(root.dir)s/data
db.url=mongodb://biomaj-mongodb:27017
db.name=biomaj
use_ldap=0
ldap.host=localhost
ldap.port=389
ldap.dn=nodomain
use_elastic=1
#Comma separated list of elasticsearch nodes host1,host2:port2
elastic_nodes=elasticsearch
elastic_index=biomaj
# Calculate data.dir size stats
data.stats=1
celery.queue=biomaj
celery.broker=mongodb://biomaj-mongodb:27017/biomaj_celery
auto_publish=1
########################
# Global properties file
#To override these settings for a specific database go to its
#properties file and uncomment or add the specific line you want
#to override.
#----------------
# Mail Configuration
#---------------
#Uncomment these lines if you want to receive mail when the workflow is finished
mail.smtp.host=
#mail.smtp.port=25
mail.admin=
mail.from=biomaj@localhost
mail.user=
mail.password=
mail.tls=
# tail last X bytes of log in mail body, 0 = no tail
# mail.body.tail=2000000
# attach log file if size < X bytes, 0 for no attach
#mail.body.attach=4000000
# path to jinja template for subject, leave empty for defaults
#mail.template.subject=
# path to jinja template for body, leave empty for default
#mail.template.body=
#---------------------
#Proxy authentication
#---------------------
#proxyHost=
#proxyPort=
#proxyUser=
#proxyPassword=
#---------------------
# PROTOCOL
#-------------------
#possible values : ftp, http, rsync, local
port=21
username=anonymous
password=anonymous@nowhere.com
#Access mode (chmod) for production directories
production.directory.chmod=775
#Number of threads used during the download
bank.num.threads=4
#Number of threads to use for downloading and processing
files.num.threads=4
#to keep more than one release increase this value
keep.old.version=0
#Link copy property
do.link.copy=true
#The historic log file is generated in log/
#define level information for output : DEBUG,INFO,WARN,ERR
historic.logfile.level=INFO
http.parse.dir.line=<a[\\s]+href=\"([\\S]+)/\".*alt=\"\\[DIR\\]\">.*([\\d]{2}-[\\w\\d]{2,5}-[\\d]{4}\\s[\\d]{2}:[\\d]{2})
http.parse.file.line=<a[\\s]+href=\"([\\S]+)\".*([\\d]{2}-[\\w\\d]{2,5}-[\\d]{4}\\s[\\d]{2}:[\\d]{2})[\\s]+([\\d\\.]+[MKG]{0,1})
http.group.dir.name=1
http.group.dir.date=2
http.group.file.name=1
http.group.file.date=2
http.group.file.size=3
#Needed if data sources are contained in an archive
log.files=true
local.files.excluded=\\.panfs.*
#~40mn
ftp.timeout=2000000
ftp.automatic.reconnect=5
ftp.active.mode=false
# Bank default access
visibility.default=public
#proxy=http://localhost:3128
[loggers]
keys = root, biomaj
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = INFO
handlers = console
[logger_biomaj]
level = INFO
handlers = console
qualname = biomaj
propagate=0
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = DEBUG
formatter = generic
[formatter_generic]
format = %(asctime)s %(levelname)-5.5s [%(name)s][%(threadName)s] %(message)s
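With both files in place, the stack can be exercised end to end using standard docker-compose commands; a short sketch:
$ docker-compose up -d mongodb elasticsearch
$ docker-compose run --rm biomaj --bank alu --update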
API Documentation¶
metaprocess¶
MetaProcess API reference¶
- class biomaj.process.metaprocess.MetaProcess(bank, metas, meta_status=None, meta_data=None, simulate=False)[source]¶
Meta process in the BioMAJ process workflow. Meta processes are executed in parallel; each meta process defines a list of Process objects to execute sequentially.
__init__(bank, metas, meta_status=None, meta_data=None, simulate=False)[source]¶
Creates a meta process thread.
Parameters:
- bank (biomaj.bank) – Bank
- metas (list of str) – list of meta processes to execute in the thread
- meta_status (bool) – initial status of the meta processes
- simulate (bool) – does not execute processes
_get_metata_from_outputfile(proc)[source]¶
Extract metadata given by the process on stdout. Store metadata in self.metadata.
Parameters: proc – process
run()[source]¶
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
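Because MetaProcess is a thread, it is driven with the usual start()/join() calls. A minimal sketch, assuming global.properties has already been loaded and that a bank named alu defines a meta process META0 (both names are placeholders):
from biomaj.bank import Bank
from biomaj.process.metaprocess import MetaProcess

# 'alu' and 'META0' are placeholder names for a configured bank and meta process
bank = Bank('alu')
mp = MetaProcess(bank, ['META0'], simulate=True)  # simulate=True: do not actually run the processes
mp.start()  # run() executes the meta process's Process list sequentially
mp.join()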
processfactory¶
ProcessFactory API reference¶
- class biomaj.process.processfactory.PostProcessFactory(bank, blocks=None, redis_client=None, redis_prefix=None)[source]¶
Manage postprocesses.
self.blocks: dict of meta process statuses; each meta process status is a dict of process statuses.
- class biomaj.process.processfactory.PreProcessFactory(bank, metas=None, redis_client=None, redis_prefix=None)[source]¶
Manage preprocesses.
- class biomaj.process.processfactory.ProcessFactory(bank, redis_client=None, redis_prefix=None)[source]¶
Manage process execution.
__init__(bank, redis_client=None, redis_prefix=None)[source]¶
x.__init__(...) initializes x; see help(type(x)) for signature
__weakref__¶
list of weak references to the object (if defined)
- class biomaj.process.processfactory.RemoveProcessFactory(bank, metas=None, redis_client=None, redis_prefix=None)[source]¶
Manage remove processes.
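The factories follow the same driving pattern as the meta processes they launch. A hypothetical sketch, assuming a run() entry point (verify against the source) and a configured bank named alu (a placeholder):
from biomaj.bank import Bank
from biomaj.process.processfactory import PostProcessFactory

# 'alu' is a placeholder name for a configured bank
bank = Bank('alu')
ppf = PostProcessFactory(bank)  # blocks=None: run every post-process block
ppf.run()  # assumed entry point; launches the MetaProcess threads in parallel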