2. Create GTax database

2.1. Download genome data with NCBI Datasets

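GTax downloads data for four taxonomic superkingdoms: archaea, bacteria, viruses and eukaryotes.

Run the following commands to fetch the datasets tool and download a dehydrated package (metadata plus download locations, without the sequence files) for each superkingdom. The genome sequences themselves are downloaded later, in the hydration step: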

2.1.1. Datasets

localhost:~> wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
localhost:~> chmod a+x datasets

2.1.2. Archaea

localhost:~> ./datasets download genome taxon 2157 --assembly-source refseq --dehydrated --filename archaea_meta.zip

2.1.3. Bacteria

localhost:~> ./datasets download genome taxon 2 --assembly-source refseq --dehydrated --filename bacteria_meta.zip

2.1.4. Viruses

localhost:~> ./datasets download genome taxon 10239 --assembly-source refseq --dehydrated --filename viruses_meta.zip

2.1.5. Eukaryotes

localhost:~> ./datasets download genome taxon 2759 --assembly-source refseq --dehydrated --filename eukaryotes_meta.zip
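The four download commands differ only in the taxon ID and the output file name, so they can be wrapped in a loop. The following is a minimal sketch, not part of GTax: the function name `download_all` and the optional `echo` argument for previewing are assumptions made for illustration, while the taxon IDs, flags and file names are copied from the commands above.

```shell
# name:taxid pairs copied from sections 2.1.2 to 2.1.5
SUPERKINGDOMS="archaea:2157 bacteria:2 viruses:10239 eukaryotes:2759"

# download_all [runner]
# Hypothetical helper: pass "echo" as the runner to print the commands
# instead of executing them.
download_all() {
  runner="${1:-}"
  for pair in $SUPERKINGDOMS; do
    name="${pair%%:*}"    # text before the colon, e.g. archaea
    taxid="${pair##*:}"   # text after the colon, e.g. 2157
    $runner ./datasets download genome taxon "$taxid" \
      --assembly-source refseq --dehydrated --filename "${name}_meta.zip"
  done
}
```

Calling `download_all echo` prints the four commands for review; calling `download_all` with no argument runs them in sequence.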

2.2. Process metadata and create the directories for hydration

The filter_metadata_zip command reads the zipped metadata file for each superkingdom and creates the folders that the datasets command will hydrate. For each taxon it keeps the reference genome if one is available; otherwise it keeps the latest assembly.

localhost:~> filter_metadata_zip
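The selection rule described above (prefer the reference genome, otherwise take the latest assembly) can be illustrated on a small table. This sketch is not the implementation of filter_metadata_zip and does not read the NCBI zip files. It assumes a hypothetical tab-separated input with the columns taxid, accession, category and release date (YYYY-MM-DD), where the category of a reference genome is the literal string "reference genome". The function name `select_assemblies` is also made up for the example.

```shell
# select_assemblies: read the hypothetical 4-column TSV on stdin and
# print one "taxid<TAB>accession" line per taxon.
select_assemblies() {
  awk -F'\t' '
    {
      is_ref = ($3 == "reference genome")
      # Replace the current pick when this taxon has no pick yet,
      # when the new row is a reference and the pick is not,
      # or when both have the same status and the new row is newer.
      if (!($1 in acc) || is_ref > ref[$1] ||
          (is_ref == ref[$1] && $4 > date[$1])) {
        acc[$1] = $2; ref[$1] = is_ref; date[$1] = $4
      }
    }
    END { for (t in acc) print t "\t" acc[t] }
  ' | sort
}
```

The dates are compared as strings, which gives the right order only because the assumed YYYY-MM-DD format sorts chronologically.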

2.3. Hydrate directories with datasets

2.3.1. Archaea

localhost:~> ./datasets rehydrate --directory archaea/

2.3.2. Bacteria

localhost:~> ./datasets rehydrate --directory bacteria/

2.3.3. Viruses

localhost:~> ./datasets rehydrate --directory viruses/

2.3.4. Eukaryotes

localhost:~> ./datasets rehydrate --directory eukaryotes/
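The four rehydration commands can likewise be run from a loop. This is a minimal sketch under the assumption that the directories are named as in the commands above; `rehydrate_all` and its optional `echo` argument are hypothetical names introduced for illustration.

```shell
# rehydrate_all [runner]
# Hypothetical helper: pass "echo" as the runner to print the commands
# instead of executing them.
rehydrate_all() {
  runner="${1:-}"
  for name in archaea bacteria viruses eukaryotes; do
    $runner ./datasets rehydrate --directory "${name}/"
  done
}
```

`rehydrate_all echo` lists the commands without downloading anything; `rehydrate_all` hydrates the directories one after another.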

2.4. Create GTax FASTA files

Downloading all the data can take a few hours. Once the download has finished, create the FASTA files, indexes and TaxID maps for the databases:

localhost:~> gtax_database
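Because hydration is long-running, it may be worth confirming that every superkingdom directory contains files before building the database. The check below is a sketch only: `check_hydrated` is a hypothetical helper, and it assumes the four directories sit side by side under one base directory, as the rehydrate commands suggest. It tests only that each directory holds at least one file, not that the download is complete.

```shell
# check_hydrated [base_dir]
# Returns 0 when all four directories exist and contain at least one file,
# 1 otherwise. Problems are reported on stderr.
check_hydrated() {
  base="${1:-.}"
  missing=0
  for name in archaea bacteria viruses eukaryotes; do
    if [ -z "$(find "$base/$name" -type f 2>/dev/null | head -n 1)" ]; then
      echo "missing or empty: $base/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is `check_hydrated . && gtax_database`, which starts the database build only when the check passes.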