2. Create GTax database

2.1. Download genome data with NCBI Datasets

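GTax downloads data for four taxonomic superkingdoms: archaea, bacteria, viruses, and eukaryotes.

Run the following commands to download the genome data. The --dehydrated flag fetches only the metadata package for each superkingdom; the sequences themselves are downloaded during the hydration step (section 2.3).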

2.1.1. Datasets

localhost:~> wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
localhost:~> chmod a+x datasets

2.1.2. Archaea

localhost:~> ./datasets download genome taxon 2157 --assembly-source refseq --dehydrated --filename archaea_meta.zip

2.1.3. Bacteria

localhost:~> ./datasets download genome taxon 2 --assembly-source refseq --dehydrated --filename bacteria_meta.zip

2.1.4. Viruses

localhost:~> ./datasets download genome taxon 10239 --assembly-source refseq --dehydrated --filename viruses_meta.zip

2.1.5. Eukaryotes

localhost:~> ./datasets download genome taxon 2759 --assembly-source refseq --dehydrated --filename eukaryotes_meta.zip
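The four download commands above differ only in the taxon ID and the output file name, so they can be driven from a single loop. The following is a minimal sketch, not part of GTax: the `DATASETS` and `DRY_RUN` variables and the `download_cmd` helper are hypothetical names introduced for illustration, while the taxon IDs and flags are copied from the commands above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build (and optionally run) the dehydrated download command
# for each superkingdom. Taxon IDs and flags come from the steps above.
set -eu

DATASETS="${DATASETS:-./datasets}"  # assumed location of the datasets binary
DRY_RUN="${DRY_RUN:-1}"             # hypothetical switch: 1 = print only, 0 = execute

download_cmd() {
    # $1 = superkingdom name, $2 = NCBI taxonomy ID
    echo "$DATASETS download genome taxon $2 --assembly-source refseq --dehydrated --filename ${1}_meta.zip"
}

for entry in archaea:2157 bacteria:2 viruses:10239 eukaryotes:2759; do
    name="${entry%%:*}"    # text before the colon
    taxid="${entry##*:}"   # text after the colon
    cmd="$(download_cmd "$name" "$taxid")"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Unquoted on purpose: the command string contains no spaces
        # inside individual arguments, so word splitting is safe here.
        $cmd
    fi
done
```

Running the script with `DRY_RUN=0` executes the downloads instead of printing them.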

2.2. Process metadata and create the directories for hydration

The command filter_metadata_zip reads the zipped metadata file for each superkingdom and creates the folders that the datasets command will hydrate. It keeps the reference genome for each taxon when one is available; otherwise it keeps the latest assembly.

localhost:~> filter_metadata_zip
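The selection rule described above (prefer the reference genome, otherwise the most recent assembly) can be illustrated with a small sketch. This is not how filter_metadata_zip is implemented: the function name `select_assemblies`, the four-column tab-separated input (TaxID, accession, RefSeq category, release date in YYYY-MM-DD form), and the placeholder accessions are all hypothetical and serve only to make the rule concrete.

```shell
#!/bin/sh
# Sketch of the "reference genome, else latest assembly" rule.
# Input (stdin): hypothetical TSV with columns
#   1 = TaxID, 2 = accession, 3 = RefSeq category, 4 = release date (YYYY-MM-DD)
# Output: one kept row per TaxID, sorted for stable ordering.
set -eu

select_assemblies() {
    awk -F '\t' '
    {
        tax = $1
        is_ref = ($3 == "reference genome")
        if (!(tax in best)) {
            # First row seen for this taxon
            best[tax] = $0; ref[tax] = is_ref; date[tax] = $4
            next
        }
        if (ref[tax]) next   # a reference genome is already kept
        # Replace on a reference genome, or on a later ISO date
        # (ISO dates compare correctly as plain strings)
        if (is_ref || $4 > date[tax]) {
            best[tax] = $0; ref[tax] = is_ref; date[tax] = $4
        }
    }
    END { for (tax in best) print best[tax] }
    ' | sort
}

# Example with placeholder accessions: taxon 562 has a reference genome,
# taxon 1280 does not, so its most recent assembly is kept.
printf '562\tGCF_A\treference genome\t2013-09-26\n562\tGCF_B\tna\t2022-01-01\n1280\tGCF_C\tna\t2019-05-01\n1280\tGCF_D\tna\t2021-03-15\n' \
    | select_assemblies
```

In this example the kept rows are GCF_A for taxon 562 (the reference genome, even though GCF_B is newer) and GCF_D for taxon 1280 (the latest of two non-reference assemblies).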

2.3. Hydrate directories with datasets

2.3.1. Archaea

localhost:~> ./datasets rehydrate --directory archaea/

2.3.2. Bacteria

localhost:~> ./datasets rehydrate --directory bacteria/

2.3.3. Viruses

localhost:~> ./datasets rehydrate --directory viruses/

2.3.4. Eukaryotes

localhost:~> ./datasets rehydrate --directory eukaryotes/
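The four rehydration commands can likewise be run from one loop. This is a minimal sketch under the same assumptions as before: `DATASETS`, `DRY_RUN`, and `rehydrate_cmd` are hypothetical names, and the directory names are the ones used in the commands above. It prints the commands by default and skips any directory that does not exist when actually executing.

```shell
#!/bin/sh
# Sketch: rehydrate each superkingdom directory in turn.
set -eu

DATASETS="${DATASETS:-./datasets}"  # assumed location of the datasets binary
DRY_RUN="${DRY_RUN:-1}"             # hypothetical switch: 1 = print only, 0 = execute

rehydrate_cmd() {
    # $1 = superkingdom directory name (without trailing slash)
    echo "$DATASETS rehydrate --directory $1/"
}

for name in archaea bacteria viruses eukaryotes; do
    cmd="$(rehydrate_cmd "$name")"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    elif [ -d "$name" ]; then
        $cmd
    else
        echo "skipping $name: directory not found" >&2
    fi
done
```

Running the superkingdoms sequentially keeps the output easy to follow; the eukaryote step is likely to dominate the total time.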

2.4. Create GTax FASTA files

Once all the data has been downloaded, which can take a few hours, create the FASTA files, indexes, and TaxID maps for the databases:

localhost:~> gtax_database