comparison readme.rst @ 3:d6553277b759 draft
Uploaded
author | rnateam |
---|---|
date | Tue, 21 Jan 2014 04:57:28 -0500 |
parents | ba161910b46f |
children |
2:e9b2400cc569 | 3:d6553277b759 |
---|---|
1 | |
2 | |
1 This package is a Galaxy workflow for BlockClust pipeline. | 3 This package is a Galaxy workflow for BlockClust pipeline. |
2 | 4 |
3 It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of | 5 |
4 genes to generate gene predictions on a new genome, and then calls EMBOSS | 6 ====== |
5 (Rice et al. 2000) to translate the predictions into a FASTA file of | 7 Galaxy |
6 predicted protein sequences. The workflow requires two input files: | 8 ====== |
7 | 9 |
8 * Nucleotide FASTA file of know gene sequences (training set) | 10 `Galaxy <http://galaxyproject.org/>`_ is an open, web-based platform for data intensive research. |
9 * Nucleotide FASTA file of genome sequence or assembled contigs | 11 All tools can be combined in workflows without any need of programming skills. |
10 | 12 Furthermore, the platform can be extended with more tools at any time. |
11 First an interpolated context model (ICM) is built from the set of known | 13 Each tool has its own information about what it does and what the input is supposed to look like. |
12 genes, preferably from the closest relative organism(s) available. Next this | 14 You can make data available for Galaxy by uploading local files or downloading online content. |
13 ICM model is used to predict genes on the genomic FASTA file. This produces | 15 Input files, workflow steps and results are stored in a history where you can view them or access them again later. |
14 a FASTA file of the predicted gene nucleotide sequences, which is translated | 16 It is possible to share workflows and histories with other users or make them publicly available. |
15 into protein sequences using the EMBOSS tool transeq. | 17 Saved workflows can be used with new input files or just to rerun an analysis, which ensures repeatability. |
16 | 18 |
17 Glimmer is intended for finding genes in microbial DNA, especially bacteria, | 19 |
18 archaea, and viruses. | 20 |
19 | 21 Getting Started |
20 See http://www.galaxyproject.org for information about the Galaxy Project. | 22 =============== |
23 | |
24 BlockClust can be installed on all common Unix systems. | |
25 However, it is developed on Linux and I don't have access to OS X. You are welcome to help improve this documentation; just contact_ me. | |
26 | |
27 For any additional information, especially cluster configuration or general Galaxy_ questions, | |
28 please have a look at the Galaxy Wiki. | |
29 | |
30 - http://wiki.galaxyproject.org/ | |
31 | |
32 - http://wiki.galaxyproject.org/Admin/ | |
33 | |
34 - http://galaxyproject.org/search/web/ | |
35 | |
36 .. _contact: https://github.com/bgruening | |
37 .. _Galaxy: http://galaxyproject.org/ | |
38 | |
39 Prerequisites:: | |
40 | |
41 * Python 2.6 or 2.7 | |
42 * standard C compiler, C++ and Fortran compiler | |
43 * Autotools | |
44 * CMake | |
45 * cairo development files (used for PNG depictions) | |
46 * python development files | |
47 * Java Runtime Environment (JRE, used by OPSIN and NPLS) | |
48 | |
49 To install all of the prerequisites you can run the following command, depending on your OS: | |
50 | |
51 - Debian based systems: apt-get install build-essential gfortran cmake mercurial libcairo2-dev python-dev | |
52 - Fedora: yum install make automake gcc gcc-c++ gcc-gfortran cmake mercurial libcairo2-devel python-devel | |
53 - OS X (MacPorts_): port install gcc cmake automake mercurial cairo-devel | |
54 | |
55 .. _MacPorts: http://www.macports.org/ | |
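
Before starting the installation, it can help to confirm that the build tools are actually on your ``PATH``. The following is a minimal sketch, not part of BlockClust or Galaxy. The executable names are assumptions derived from the package lists above and may differ on your system. It relies on ``shutil.which``, so it needs Python 3.3 or newer, which is separate from the Python 2.6/2.7 interpreter listed in the prerequisites.

```python
import shutil

# Executable names are assumptions based on the package lists above;
# adjust them to match your distribution.
REQUIRED_TOOLS = ["gcc", "g++", "gfortran", "cmake", "automake", "hg"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The ``which`` parameter exists only so the lookup can be replaced in tests; in normal use the default is sufficient.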
56 | |
57 | |
58 =================== | |
59 Galaxy installation | |
60 =================== | |
61 | |
62 | |
63 0. Create a sand-boxed Python using virtualenv_ (not necessary but recommended):: | |
64 | |
65 wget https://raw.github.com/pypa/virtualenv/master/virtualenv.py | |
66 python ./virtualenv.py --no-site-packages galaxy_env | |
67 . ./galaxy_env/bin/activate | |
68 | |
69 .. _virtualenv: http://www.virtualenv.org/ | |
70 | |
71 | |
72 1. Clone the latest `Galaxy platform`_:: | |
73 | |
74 hg clone https://bitbucket.org/galaxy/galaxy-central/ | |
75 | |
76 .. _Galaxy platform: http://wiki.galaxyproject.org/Admin/Get%20Galaxy | |
77 | |
78 2. Navigate to the galaxy-central folder and update it:: | |
79 | |
80 cd ~/galaxy-central | |
81 hg pull | |
82 hg update | |
83 | |
84 This step is not necessary if you have a fresh checkout. Anyway, it is good to know ;) | |
85 | |
86 3. Create folders for toolshed and dependencies:: | |
87 | |
88 mkdir ~/shed_tools | |
89 mkdir ~/galaxy-central/tool_deps | |
90 | |
91 4. Create configuration file:: | |
92 | |
93 cp ~/galaxy-central/universe_wsgi.ini.sample ~/galaxy-central/universe_wsgi.ini | |
94 | |
95 5. Open universe_wsgi.ini and change the dependencies directory:: | |
96 | |
97 LINUX: gedit ~/galaxy-central/universe_wsgi.ini | |
98 OS X: open -a TextEdit ~/galaxy-central/universe_wsgi.ini | |
99 | |
100 6. Search for ``tool_dependency_dir = None`` and change it to ``tool_dependency_dir = ./tool_deps``, remove the ``#`` if needed | |
101 | |
102 7. Remove the ``#`` in front of ``tool_config_file`` and ``tool_path`` | |
103 | |
104 8. (Re-)Start the galaxy daemon:: | |
105 | |
106 sh run.sh --reload | |
107 | |
108 In daemon mode all logs will be written to main.log in your Galaxy Home directory. You can also use:: | |
109 | |
110 run.sh | |
111 | |
112 During the first startup Galaxy will prepare your database. That can take some time. Have a look at the log file if you want to know what happens. | |
113 | |
114 After launching, Galaxy is accessible in the browser at ``http://localhost:8080/``. | |
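
The configuration edits in steps 5 to 7 can also be scripted instead of done in a text editor. The sketch below is a hypothetical helper, not something shipped with Galaxy. It assumes each option sits on its own line, optionally prefixed with ``#``, and the sample text is illustrative rather than copied from a real ``universe_wsgi.ini``.

```python
import re


def set_ini_option(text, key, value=None):
    """Uncomment `key` in INI text and, if `value` is given, set it.

    Only the first matching line is changed; all other lines are kept as-is.
    """
    pattern = re.compile(r"^\s*#?\s*" + re.escape(key) + r"\s*=\s*(.*)$")
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = pattern.match(line)
        if match:
            # Keep the existing value unless a new one was requested.
            new_value = match.group(1) if value is None else value
            lines[i] = "%s = %s" % (key, new_value)
            break
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative sample; the real file contains many more options.
    sample = (
        "#tool_config_file = tool_conf.xml,shed_tool_conf.xml\n"
        "#tool_path = tools\n"
        "#tool_dependency_dir = None\n"
    )
    updated = set_ini_option(sample, "tool_dependency_dir", "./tool_deps")
    updated = set_ini_option(updated, "tool_config_file")
    updated = set_ini_option(updated, "tool_path")
    print(updated)
```

To use it on a real installation, read the contents of ``universe_wsgi.ini``, pass them through the function and write the result back. Keeping a backup copy of the file first is advisable.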
115 | |
116 | |
117 | |
118 ======================= | |
119 Tool Shed configuration | |
120 ======================= | |
121 | |
122 - Register a new user account in your Galaxy instance: Top Panel → User → Register | |
123 - Become an admin | |
124 - open ``universe_wsgi.ini`` in your favourite text editor (gedit universe_wsgi.ini) | |
125 - search ``admin_users = None`` and change it to ``admin_users = EMAIL_ADDRESS`` (your Galaxy Username) | |
126 - remove the ``#`` if needed | |
127 - restart Galaxy | |
128 | |
129 :: | |
130 | |
131 sh run.sh --reload | |
132 | |
133 | |
134 ======================= | |
135 BlockClust installation | |
136 ======================= | |
137 | |
138 BlockClust will automatically download and compile all requirements, | |
139 like EDeN, samtools and so on. It can take up to 1-2 hours. | |
140 | |
141 | |
142 Installation via Galaxy API (recommended) | |
143 ========================================= | |
144 | |
145 - Generate an `API Key`_ | |
146 - Run the installation script:: | |
147 | |
148 python ./scripts/api/install_tool_shed_repositories.py --api YOUR_API_KEY -l http://localhost:8080 --url http://toolshed.g2.bx.psu.edu/ -o rnateam -r e9b2400cc569 --name blockclust_workflow --tool-deps --repository-deps --panel-section-name ChemicalToolBoX | |
149 | |
150 The -r argument specifies the version of ChemicalToolBoX. You can get the latest revision number from the | |
151 `test tool shed`_ or with the following command:: | |
152 | |
153 hg identify http://toolshed.g2.bx.psu.edu/repos/bgruening/chemicaltoolbox | |
154 | |
155 You can watch the installation status under: Top Panel → Admin → Manage installed tool shed repositories | |
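
If you install several revisions or repositories, building the command line programmatically avoids copy-and-paste mistakes. The sketch below only assembles the argument list using the flags shown in the command above; the function name and its parameters are hypothetical, and nothing is executed.

```python
def build_install_command(api_key, revision, panel_section,
                          galaxy_url="http://localhost:8080",
                          tool_shed_url="http://toolshed.g2.bx.psu.edu/",
                          owner="rnateam",
                          name="blockclust_workflow"):
    """Return the install command from this README as an argument list."""
    return [
        "python", "./scripts/api/install_tool_shed_repositories.py",
        "--api", api_key,
        "-l", galaxy_url,
        "--url", tool_shed_url,
        "-o", owner,
        "-r", revision,
        "--name", name,
        "--tool-deps",
        "--repository-deps",
        "--panel-section-name", panel_section,
    ]


if __name__ == "__main__":
    # YOUR_API_KEY is a placeholder, exactly as in the command above.
    command = build_install_command("YOUR_API_KEY", "e9b2400cc569", "BlockClust")
    print(" ".join(command))
```

The resulting list can be passed to ``subprocess.call`` from the Galaxy root directory, since the script path is relative to it.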
156 | |
157 | |
158 .. _API Key: http://wiki.galaxyproject.org/Admin/API#Generate_the_Admin_Account_API_Key | |
159 .. _`test tool shed`: http://testtoolshed.g2.bx.psu.edu/ | |
160 | |
161 | |
162 Installation via webbrowser | |
163 =========================== | |
164 | |
165 - go to the `admin page`_ | |
166 - select *Search and browse tool sheds* | |
167 - Galaxy test tool shed > Sequence Analysis > blockclust_workflow | |
168 - install chemicaltoolbox | |
169 | |
170 .. _admin page: http://localhost:8080/admin | |
171 | |
172 | |
173 | |
174 =============== | |
175 Troubleshooting | |
176 =============== | |
177 | |
178 If you have any trouble or the installation did not finish properly, do not hesitate to contact me. However, if the | |
179 installation fails during the Galaxy installation, you can have a look at the `Galaxy wiki`_. If the ChemicalToolBoX installation fails, | |
180 you can try to run:: | |
181 | |
182 python ./scripts/api/repair_tool_shed_repository.py --api YOUR_API_KEY -l http://localhost:8080 --url http://toolshed.g2.bx.psu.edu/ -o rnateam -r e9b2400cc569 --name blockclust_workflow | |
183 | |
184 That will rerun all failed installation routines. Alternatively, you can navigate to the ChemicalToolBoX repository in | |
185 your browser and repair manually: | |
186 Top Panel → Admin → Manage installed tool shed repositories → chemicaltoolbox → Repository Actions → Repair repository | |
187 | |
188 ------ | |
189 | |
190 | |
191 On slow computers and during the compilation of large software libraries, like R, | |
192 the Tool Shed can run into a timeout and kill the installation. | |
193 That problem is known and should be fixed in the near future. | |
194 | |
195 If you encounter a timeout or a hang during the installation, you can increase the ``threadpool_kill_thread_limit`` in your universe_wsgi.ini file. | |
196 | |
197 | |
198 ------ | |
199 | |
200 **Database locking errors** | |
201 | |
202 Please note that Galaxy uses a SQLite database by default. SQLite is not intended for production use. | |
203 With multiple users or complex components, like this workflow, you will see database locking errors. | |
204 We highly recommend using PostgreSQL for any kind of production system. | |
205 | |
206 | |
207 .. _Galaxy wiki: http://wiki.galaxyproject.org/ | |
208 | |
209 | |
210 Workflows | |
211 ========= | |
212 | |
213 An example workflow is located in the `Tool Shed`:: | |
214 | |
215 http://testtoolshed.g2.bx.psu.edu/view/rnateam/blockclust_workflow | |
216 | |
217 You can install the workflow with the API:: | |
218 | |
219 python ./scripts/api/install_tool_shed_repositories.py --api YOUR_API_KEY -l http://localhost:8080 --url http://toolshed.g2.bx.psu.edu/ -o rnateam -r e9b2400cc569 --name blockclust_workflow --tool-deps --repository-deps --panel-section-name BlockClust | |
220 | |
221 or via web browser as described above. You have now successfully installed the workflow. | |
222 To import it for all your users, go to the admin panel, choose the workflow and import it. | |
223 For more information have a look at the Galaxy wiki:: | |
224 | |
225 http://wiki.galaxyproject.org/ToolShedWorkflowSharing#Finding_workflows_in_tool_shed_repositories | |
226 | |
227 Please **note** that Galaxy uses a SQLite database by default. SQLite is not intended for production use. | |
228 With multiple users or complex components, like this workflow, you will see database locking errors. | |
229 We highly recommend using PostgreSQL for any kind of production system. | |
230 | |
21 | 231 |
22 | 232 |
23 Sample Data | 233 Sample Data |
24 =========== | 234 =========== |
25 | 235 |
26 As an example, we will use the first public assembly of the 2011 Shiga-toxin | |
27 producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the | |
28 open-source crowd-sourcing analysis described in Rohde et al. (2011) and here: | |
29 https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki | |
30 | |
31 You can upload this assembly directly into Galaxy using the "Upload File" tool | |
32 with either of these URLs - Galaxy should recognise this is a FASTA file with | |
33 3,057 sequences: | |
34 | |
35 * http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt | |
36 * https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt | |
37 | |
38 This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled | |
39 by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the | |
40 MIRA 3.2 assembler. It was initially released via his blog, | |
41 http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/ | |
42 | |
43 We will also need a training set of known *E. coli* genes, for example the | |
44 model strain *Escherichia coli* str. K-12 substr. MG1655 which is well | |
45 annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the | |
46 gene nucleotide sequences directly into Galaxy via this URL, which Galaxy | |
47 should recognise as a FASTA file with 4,321 sequences: | |
48 | |
49 * ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn | |
50 | |
51 Then run the workflow, which should produce 2,333 predicted genes for the | |
52 TY2482 assembly (two FASTA files, nucleotide and protein sequences). | |
53 | 236 |
54 | 237 |
55 Citation | 238 Citation |
56 ======== | 239 ======== |
57 | 240 |
59 wrappers for Galaxy, in work leading to a scientific publication, | 242 wrappers for Galaxy, in work leading to a scientific publication, |
60 please cite: | 243 please cite: |
61 | 244 |
62 P. Videm et al... | 245 P. Videm et al... |
63 | 246 |
64 For Glimmer3 please cite: | |
65 | |
66 Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007) | |
67 Identifying bacterial genes and endosymbiont DNA with Glimmer. | |
68 Bioinformatics 23(6), 673-679. | |
69 http://dx.doi.org/10.1093/bioinformatics/btm009 | |
70 | |
71 For EMBOSS please cite: | |
72 | |
73 Rice, P., Longden, I. and Bleasby, A. (2000) | |
74 EMBOSS: The European Molecular Biology Open Software Suite | |
75 Trends in Genetics 16(6), 276-277. | |
76 http://dx.doi.org/10.1016/S0168-9525(00)02024-2 | |
77 | 247 |
78 | 248 |
79 Additional References | 249 Additional References |
80 ===================== | 250 ===================== |
81 | 251 |
82 Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011) | |
83 Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4. | |
84 New England Journal of Medicine 365, 718-724. | |
85 http://dx.doi.org/10.1056/NEJMoa1107643 | |
86 | 252 |
87 | 253 |
88 Availability | 254 Availability |
89 ============ | 255 ============ |
90 | 256 |
91 This workflow is available on the main Galaxy Tool Shed: | 257 This workflow is available on the main Galaxy Tool Shed: |
92 | 258 |
93 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow | 259 http://testtoolshed.g2.bx.psu.edu/view/rnateam/blockclust_workflow |
94 | 260 |
95 Development is being done on github: | 261 Development is being done on github: |
96 | 262 |
97 https://github.com/bgruening/galaxytools/workflows/glimmer3/ | 263 https://github.com/bgruening/galaxytools/tree/master/workflows/blockclust |
98 | 264 |
99 | 265 |
100 Dependencies | 266 Dependencies |
101 ============ | 267 ============ |
102 | 268 |
103 These dependencies should be resolved automatically via the Galaxy Tool Shed: | 269 These dependencies should be resolved automatically via the Galaxy Tool Shed: |
104 | 270 |
105 * http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3 | 271 * http://testtoolshed.g2.bx.psu.edu/view/iuc/package_samtools_0_1_19 |
106 * http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5 | 272 * http://testtoolshed.g2.bx.psu.edu/view/iuc/package_r_3_0_1 |
273 * http://testtoolshed.g2.bx.psu.edu/view/rnateam/package_segemehl_0_1_6 | |
274 * http://testtoolshed.g2.bx.psu.edu/view/iuc/msa_datatypes | |
275 * http://testtoolshed.g2.bx.psu.edu/view/iuc/package_infernal_1_1rc4 | |
276 * http://testtoolshed.g2.bx.psu.edu/view/rnateam/blockbuster | |
277 * http://testtoolshed.g2.bx.psu.edu/view/bgruening/package_eden_1_1 | |
278 * http://testtoolshed.g2.bx.psu.edu/view/iuc/package_mcl_12_135 |