Creating Maps
Creating a simple map using CCT
-
To use CCT to create a simple map with no comparisons, first create a new analysis project. In the following example, a new project directory called my_project is created:
cgview_comparison_tool.pl -p my_project
For details on the CCT project directory structure see Creating a New CCT Project.
-
Place the genome sequence you wish to analyze in the reference_genome directory, which is located in the newly created my_project directory. The sequence file can be in GenBank format (with a '.gbk' file extension) or FASTA format (with a '.fasta' extension). In this example the genome sequence should be placed in the my_project/reference_genome directory:
cp $CCT_HOME/sample_projects/sample_project_5/reference_genome/NC_001823.gbk my_project/reference_genome
-
Run CCT again:
cgview_comparison_tool.pl -p my_project
The final step will create a .png map in the my_project/maps directory.
Note: to create a comparison map directly with cgview_comparison_tool.pl, the project_settings.conf file has to be edited. See the section Editing the project_settings.conf file for an example. An easier method is to use the build_blast_atlas.sh script, as described next.
Creating a BLAST atlas using CCT
CCT can be used to build "BLAST atlases" which compare a reference genome of interest to one or more other genomes or sequence collections. To simplify the creation of BLAST atlases CCT includes a wrapper script called build_blast_atlas.sh. This script automatically creates maps for nucleotide (blastn) comparisons and translated coding sequence (blastp) comparisons. It also generates multiple maps for each comparison type, differing in terms of size and detail.
-
Run the build_blast_atlas.sh script, passing it the file describing the reference genome:
build_blast_atlas.sh -i $CCT_HOME/sample_projects/sample_project_1/reference_genome/NC_007719.gbk
This will produce a project directory called NC_007719. For details on the directory structure see Creating a New BLAST Atlas Project.
The configuration files project_settings_cds_vs_cds.conf and project_settings_dna_vs_dna.conf can be edited prior to completion of the map drawing process (see Customizing CCT maps).
-
Place the genomes you want compared to the reference genome into the NC_007719/comparison_genomes directory. These files must end with a '.gbk' extension.
cp $CCT_HOME/sample_projects/sample_project_1/comparison_genomes/*.gbk NC_007719/comparison_genomes
-
Begin the map drawing process by running the build_blast_atlas.sh script again, pointing at the project directory:
build_blast_atlas.sh -p NC_007719
The above command will generate several maps of different sizes showing nucleotide or protein-based BLAST comparisons. The resulting maps can be accessed within the following directories, once the entire process is complete:
NC_007719/maps_for_dna_vs_dna NC_007719/maps_for_cds_vs_cds
The DNA_vs_DNA maps show the results of blastn comparisons between the reference genome and each comparison genome, while the CDS_vs_CDS maps show the results of blastp comparisons between the CDS translations extracted from the GenBank files. In these maps color is used to indicate the percent identity of BLAST hits. The BLAST hit rings are sorted such that the most similar genomes are presented first (closest to the outside of the circle).
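Because the comparison files must end with a '.gbk' extension, it can be useful to check the comparison_genomes directory before starting the map drawing process. The following is a minimal sketch of such a check. It is not part of CCT, and it runs against a temporary stand-in directory filled with empty placeholder files (the file names are hypothetical) rather than a real NC_007719/comparison_genomes directory.

```shell
#!/bin/sh
# Stand-in for NC_007719/comparison_genomes, filled with empty placeholder files.
genomes_dir=$(mktemp -d)
touch "$genomes_dir/genome_1.gbk" "$genomes_dir/genome_2.gbk" "$genomes_dir/genome_3.gb"

# Report every file whose name does not end in '.gbk'.
unexpected=0
for f in "$genomes_dir"/*; do
    case "$f" in
        *.gbk) ;;
        *) echo "unexpected extension: ${f##*/}"
           unexpected=$((unexpected + 1)) ;;
    esac
done
echo "$unexpected file(s) need renaming"
```

To use the check on a real project, set genomes_dir to the project's comparison_genomes directory and remove the touch line.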
Creating an all vs. all BLAST atlas using CCT
A wrapper script called build_blast_atlas_all_vs_all.sh is included with CCT. This script generates several CCT projects automatically, and then it combines the results into a single montage map. The montage consists of a separate map for each sequence of interest. This allows each sequence in a group of sequences to be visualized as the reference sequence.
-
To generate a separate BLAST atlas for each genome in a collection of genomes, first create a new project:
build_blast_atlas_all_vs_all.sh -p montage_project
This will produce a project directory called montage_project. For details on the project directory structure see Creating a New BLAST Atlas All vs All Project.
The configuration file project_settings_multi.conf can be edited prior to completion of the map drawing process (see Customizing CCT maps).
-
Place the GenBank files for the genomes in the montage_project/comparison_genomes directory. The files must end with a '.gbk' extension. In this example we are fetching all the available Bordetella genomes:
fetch_refseq_bacterial_genomes_by_name.sh -n "Bordetella*" -o montage_project/comparison_genomes/
-
Run the build_blast_atlas_all_vs_all.sh command again:
build_blast_atlas_all_vs_all.sh -p montage_project
This will start the map creation process. Within the montage_project directory a separate directory is created for each map (one for each sequence). After all the maps are created, a single montage of the maps, called montage.png, is generated in the montage_project directory.
The maps created using the build_blast_atlas_all_vs_all.sh script show the results of blastn comparisons by default.
Editing the project_settings.conf file
To adjust which types of analyses are performed for your sequence, you can edit the project_settings.conf file in your project directory. Try the following:
-
Create a new project:
cgview_comparison_tool.pl -p my_project_2
-
Place a genome sequence in my_project_2/reference_genome:
cp $CCT_HOME/sample_projects/sample_project_2/reference_genome/Methanobacterium_thermoautotrophicum.gbk my_project_2/reference_genome
-
In this example we will compare the reference genome to a second genome, by placing a genome sequence in the my_project_2/comparison_genomes directory:
cp $CCT_HOME/sample_projects/sample_project_2/comparison_genomes/Methanosarcina_acetivorans.gbk my_project_2/comparison_genomes
-
Edit the my_project_2/project_settings.conf file. This file controls how CCT processes your project. For example, to perform a BLAST using the reference genome coding regions as queries, find the section called 'BLAST query source settings' and make the following change:
Change:
query_source = none
→To:
query_source = cds
Similarly, to specify how the comparison genomes are used in the BLAST comparison, find the section called 'BLAST database source settings' and make the following change:
Change:
database_source = none
→To:
database_source = trans
The settings in this example tell CCT to extract the coding sequence translations from the reference genome GenBank file, and BLAST them against the 6-frame translation of the genome sequences in the project's comparison_genomes directory.
-
You can also control how the results are presented using the 'Graphical map settings' section. For example, to draw feature labels, make the following change:
Change:
draw_feature_labels = F
→To:
draw_feature_labels = T
To draw a larger map (this will allow the feature labels to fit on the canvas), make this change:
Change:
map_size = medium
→To:
map_size = large
The various other settings are described in the project_settings.conf file.
-
Now that you have edited project_settings.conf, run CCT:
cgview_comparison_tool.pl -p my_project_2
This command will perform a BLAST analysis and create a map in my_project_2/maps. Whenever you make changes to the project_settings.conf file you can update the map using this command.
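The edits described in this section can also be applied from the command line. The sketch below is a minimal illustration under one assumption: that each setting occupies its own line in the `key = value` form shown above. It works on a small stand-in file in a temporary directory rather than a real CCT project, and the helper name set_conf_value is hypothetical, not something CCT provides.

```shell
#!/bin/sh
# Stand-in for my_project_2/project_settings.conf (assumed "key = value" layout).
conf_dir=$(mktemp -d)
conf="$conf_dir/project_settings.conf"
cat > "$conf" <<'EOF'
query_source = none
database_source = none
draw_feature_labels = F
map_size = medium
EOF

# Hypothetical helper: set_conf_value FILE KEY VALUE
# Rewrites the line that starts with KEY and leaves every other line untouched.
set_conf_value() {
    sed "s|^$2[[:space:]]*=.*|$2 = $3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

set_conf_value "$conf" query_source cds
set_conf_value "$conf" database_source trans
set_conf_value "$conf" draw_feature_labels T
set_conf_value "$conf" map_size large

cat "$conf"
```

With a real project, the helper would be pointed at my_project_2/project_settings.conf, followed by another run of cgview_comparison_tool.pl -p my_project_2 to update the map.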
project_settings.conf file options
Attribute | Value | Description |
---|---|---|
minimum_orf_length | Integer | The minimum ORF length used (in codons) when ORFs are extracted from genomic sequences. |
genetic_code | Integer | The genetic code to use for translated BLAST searches and for ORF translation. The default is the bacterial genetic code (genetic code 11). See https://bioinformatics.org/sms2/genetic_code.html for descriptions of the different genetic codes. |
start_codons | Codons separated by '|' | The start codons to be used when finding ORFs. The default set (ttg|ctg|att|atc|ata|atg|gtg) contains the starts for bacterial sequences. |
stop_codons | Codons separated by '|' | The stop codons to use when finding ORFs. |
query_size | Integer | The query size for BLAST searches, i.e. how much of the reference genome is used in each BLAST search. This setting only applies to 'trans' and 'nucleotide' comparisons (see the query_source option below). |
expect | Real | The BLAST expect value to use. |
score | Integer | The minimum score required for BLAST hits. |
hits | Integer | The number of BLAST hits to keep for each query. |
minimum_hit_proportion | Real | The minimum acceptable hit length for BLAST results, expressed as a proportion of the length of the query. |
query_source | nucleotide / trans / cds / orfs / none | The source of the BLAST query sequences. These sequences are extracted from the reference genome sequence, located in the reference_genome directory. Details on the different types can be found in the Blast Comparisons section. |
database_source | nucleotide / dna / trans / cds / orfs / proteins / none | The sources of the BLAST databases. The databases are built using the sequences in the comparison_genomes directory. Details on the different types can be found in the Blast Comparisons section. |
cog_source | orfs / cds / none | The proteins from the reference sequence to be assigned COG functional categories. Three options are available: 'orfs', 'cds', or 'none'. |
cog_top_hit | T / F | Whether to use only the top BLAST hit for COG functional assignment. |
T / F | ||
draw_divider | T / F | Whether a divider should be drawn between the start and end of the sequence to indicate that the sequence is linear. |
draw_orfs | T / F | Whether open reading frames (ORFs) in the reference genome should be drawn. |
draw_gc_skew | T / F | Whether GC skew in the reference genome should be drawn. |
draw_legend | T / F | Whether a feature legend should be drawn. |
draw_feature_labels | T / F | Whether features should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size). |
draw_hit_labels | T / F | Whether BLAST hits should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size). |
draw_orf_labels | T / F | Whether ORFs should be labeled. It is recommended that this option be set to 'T' only when 'large', 'x-large' or 'navigable' maps are drawn (see map_size). |
draw_condensed | T / F | Whether thin feature rings should be used. This option is useful for maps that are to be used for analysis purposes rather than as a figure for publication. |
draw_divider_rings | T / F | Whether divider rings should be drawn between feature rings. |
draw_hits_by_reading_frame | T / F | Whether each set of BLAST results should be divided into six slots, based on the reading frame and strand of the query gene or ORF that produced the hit. This option only applies to comparisons done when the 'query_source' option is set to 'orfs' or 'cds'. |
use_opacity | T / F | Whether BLAST hits should be drawn with partial opacity so that overlapping hits can be seen. |
scale_blast | T / F | Whether BLAST hits should be drawn with height proportional to percent identity of hit. |
gene_decoration | arc / arrow | Whether genes should be drawn as an arc or as an arrow. |
highlight_query | T / F | Whether the position of the queries should be faintly highlighted on the map. By showing the query positions it is easier to see if a hit was obtained for specific ORFs or features. |
map_size | small / medium / large / x-large or combination separated by commas | The size of the maps to draw. Multiple options can be separated by commas (e.g. small,large). |
BLAST comparison types and the 'query_source' and 'database_source' settings
There are several types of BLAST comparisons that can be performed by CCT. The table below shows the compatible values for 'query_source' and 'database_source', lists the required reference and comparison sequence file types and file extensions, and describes the comparisons that are performed. Note that multiple comma-separated values can be given for 'query_source' and 'database_source'--CCT will perform all the compatible comparisons. Many different files can be included in the 'comparison_genomes' directory. CCT examines file extensions when deciding which files to include in each BLAST comparison. When there are multiple files with the same extension, a separate BLAST comparison is conducted for each, and the results are shown in separate rings on the resulting map.
query_source value | database_source value | reference_genome file types and file extensions required | comparison_genomes file types and file extensions required | description of BLAST comparison in the form 'reference vs comparison (BLAST type)' |
---|---|---|---|---|
nucleotide | nucleotide | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | DNA vs DNA (blastn) |
nucleotide | dna | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more DNA sequences in FASTA format (.fna) | DNA vs DNA sequences (blastn) |
trans | trans | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | 6-frame translated DNA vs 6-frame translated DNA (tblastx) |
trans | cds | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk) | 6-frame translated DNA vs CDS protein sequences extracted from GenBank files (blastx) |
trans | orfs | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | 6-frame translated reference DNA vs translated ORFs (blastx) |
trans | proteins | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more protein sequences in FASTA format (.faa) | 6-frame translated reference DNA vs protein sequences (blastx) |
trans | dna | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more DNA sequences in FASTA format (.fna) | 6-frame translated reference DNA vs 6-frame translated DNA sequences (tblastx) |
cds | trans | GenBank (.gbk) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | CDS protein sequences extracted from GenBank file vs 6-frame translated DNA (tblastn) |
cds | cds | GenBank (.gbk) | GenBank (.gbk) | CDS protein sequences extracted from GenBank file vs CDS protein sequences extracted from GenBank files (blastp) |
cds | orfs | GenBank (.gbk) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | CDS protein sequences extracted from GenBank file vs translated ORFs (blastp) |
cds | proteins | GenBank (.gbk) | One or more protein sequences in FASTA format (.faa) | CDS protein sequences extracted from GenBank file vs protein sequences (blastp) |
cds | dna | GenBank (.gbk) | One or more DNA sequences in FASTA format (.fna) | CDS protein sequences extracted from GenBank file vs 6-frame translated DNA sequences (tblastn) |
orfs | trans | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | Translated ORFs vs 6-frame translated DNA (tblastn) |
orfs | cds | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk) | Translated ORFs vs CDS protein sequences extracted from GenBank files (blastp) |
orfs | orfs | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | Translated ORFs vs translated ORFs (blastp) |
orfs | proteins | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more protein sequences in FASTA format (.faa) | Translated ORFs vs protein sequences (blastp) |
orfs | dna | GenBank (.gbk), FASTA (.fasta), RAW (.raw) | One or more DNA sequences in FASTA format (.fna) | Translated ORFs vs 6-frame translated DNA sequences (tblastn) |
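The table can be condensed into a small lookup, which may be handy when checking a planned combination of settings. The sketch below is only an illustration: blast_program is a hypothetical helper name, not a CCT command, and it encodes nothing beyond the 'query_source' / 'database_source' pairs and BLAST types listed in the table.

```shell
#!/bin/sh
# Hypothetical helper: prints the BLAST program that the table lists
# for a given query_source / database_source pair.
blast_program() {
    case "$1:$2" in
        nucleotide:nucleotide|nucleotide:dna)   echo blastn ;;
        trans:trans|trans:dna)                  echo tblastx ;;
        trans:cds|trans:orfs|trans:proteins)    echo blastx ;;
        cds:trans|cds:dna|orfs:trans|orfs:dna)  echo tblastn ;;
        cds:cds|cds:orfs|cds:proteins)          echo blastp ;;
        orfs:cds|orfs:orfs|orfs:proteins)       echo blastp ;;
        *) echo "not a listed combination" >&2
           return 1 ;;
    esac
}

blast_program cds trans
blast_program nucleotide cds || echo "nucleotide/cds is not in the table"
```

The first call corresponds to the settings used in the my_project_2 example (query_source = cds, database_source = trans).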
Customizing CCT maps
There are many ways to modify the contents and appearance of CCT maps. See the Tutorials section for examples. The general approaches are described below.
-
Edit the project_settings.conf file or files that are created in the project directory. These files are used to control the types of BLAST searches that are performed, the types of graphs that are displayed, and a variety of other map characteristics.
-
Use the '--cct' option with the cgview_comparison_tool.pl script. This causes the BLAST results rings to take up more space, and causes them to be coloured according to the percent identity of each hit. This option is always used when the build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh wrapper scripts are used.
-
Use the '-t' option with the cgview_comparison_tool.pl script. This causes the BLAST results rings to be ordered so that the ones containing more hits of high percent identity are drawn closest to the outer edge of the figure. This option is always used when the build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh wrapper scripts are used.
-
Use the '-b' option to control the number of BLAST rings shown. This option works with the cgview_comparison_tool.pl script and with the wrapper scripts build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh. The default value for this option is 100, which means that up to 100 BLAST results rings will be shown. When the map is created using more than 100 comparison genomes or sequence collections, the top 100 rings (i.e. those producing the most high-identity hits) are shown.
-
After the maps have been drawn, edit the CGView XML files (in maps/cgview_xml) and then redraw the maps using the redraw_maps.sh script:
redraw_maps.sh -p my_project
In the example above, the redraw_maps.sh script is used to draw maps from the CGView XML files located in the my_project CCT project. More information on the CGView XML format is available on the CGView website.
-
Use the '--custom' option to fine-tune the appearance of the map. This option works with the cgview_comparison_tool.pl script and with the wrapper scripts build_blast_atlas.sh and build_blast_atlas_all_vs_all.sh. This option is used to supply key-value pairs, as in the following example:
build_blast_atlas.sh -p NC_012920 -m 2500m -b 2500 --map_size x-large --custom 'width=20000 height=20000 backboneRadius=8000 featureThickness=120 rulerFontSize=100 rulerPadding=200 tickThickness=15 tickLength=40 draw_divider_rings=F _cct_blast_thickness=2.0'
Typically the '--custom' option is used after a map has been created, to adjust the appearance of the map. To see what key-value pairs are available see the Customization keys table. To see what values were used for each key when a map was created, examine the '.log' files located in the maps/cgview_xml directory. Once you've determined the keys you would like to change, you can rerun the script using the '--custom' option with the keys and their new values. However, if a lot of BLAST searches were performed to build the first map, you can reuse these BLAST results by starting the BLAST atlas at CGView XML creation. For example, suppose that after examining maps already created with the build_blast_atlas.sh script, you decide that you prefer the x-large map, but that you want to make the backbone circle larger. You examine the x-large.log file and find a section showing the key-value attributes for the map, and you see that the 'backboneRadius' key was assigned a value of '4000'. To redraw the x-large maps for the DNA vs DNA and the CDS vs CDS comparisons, without having the BLAST searches repeated, the following command could be used:
build_blast_atlas.sh -p NC_012920 --start_at_xml --custom 'backboneRadius=4500' --map_size x-large
The '--start_at_xml' option causes the script to rebuild the CGView XML from the existing BLAST results, and the '--map_size x-large' option restricts redrawing to the x-large maps.
-
Draw zoomed maps that show regions of interest in more detail. For example, suppose that the build_blast_atlas.sh script was run as follows:
build_blast_atlas.sh -i NC_012920.gbk
If, after examining some of the maps, you find that the 400000 bp region of the reference genome looks interesting, you can generate a zoomed version of all the maps showing this region in more detail using the following command:
create_zoomed_maps.sh -p NC_012920 -c 400000 -z 10 --format svgz
This will create new maps in the same directories as the existing maps, showing the 400000 bp region expanded by a factor of 10. Instead of the default PNG format the maps will be generated in SVGZ format.
Labelling a subset of genes (labels_to_show.txt)
To label a subset of genes in the reference genome, place a file called labels_to_show.txt in the project directory. This file should be a tab-delimited or comma-delimited text file specifying which genes should be labeled. Each row must consist of a gene identifier followed by the text that is to be used for the label. When using a GenBank or EMBL file as the reference genome the gene identifier should match the value of the '/gene' qualifier (or the value of the '/locus_tag' qualifier if there isn't a '/gene' qualifier given for a particular gene). When describing genes using the 'features' directory and .gff files, the gene identifier should match the 'seqname' value. Note that providing a labels_to_show.txt file will cause the 'draw_feature_labels' setting in the 'project_settings.conf' file to be ignored.
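As an illustration of the file layout, the sketch below writes a tab-delimited labels_to_show.txt and checks that every row has exactly two fields. The gene identifiers and label texts are placeholders rather than real annotations, and the file is written to a temporary directory instead of a real CCT project directory.

```shell
#!/bin/sh
# Hypothetical project directory; replace with the path to a real CCT project.
project_dir=$(mktemp -d)
labels="$project_dir/labels_to_show.txt"

# One row per gene: identifier, a tab, then the label text to draw.
# The identifiers and labels below are placeholders.
printf '%s\t%s\n' \
    'geneA'      'Gene A (example label)' \
    'geneB'      'Gene B (example label)' \
    'LOCUS_0003' 'Example locus tag label' \
    > "$labels"

# Sanity check: every row should have exactly two tab-separated fields.
awk -F '\t' 'NF != 2 { bad = 1 } END { exit (bad + 0) }' "$labels" \
    && echo "labels_to_show.txt looks well formed"
```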
Adding additional features (feature GFF files)
To add features to the map, place one or more files with a '.gff' extension in the features directory, which is located in the CCT project directory. The files should be tab-delimited or comma-delimited and should have the following column titles, in the following order: 'seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame'. The first line in the file must be the column titles. For a given entry, 'seqname' should be the name of the gene, 'feature' should be the type of gene (CDS, rRNA, tRNA, other) or the single letter COG category (J for example). 'start' and 'end' should be integers between 1 and the length of the sequence, and the 'start' value should be less than or equal to the 'end' regardless of the 'strand' value. The 'strand' value should be '+' for the forward strand and '-' for the reverse strand. All other values can be given as '.' or left blank, since they are ignored. These column titles are based on the specification of the GFF file format. If 'start' and 'end' values are not supplied, but a 'seqname' is given, this script will attempt to get the 'start' and 'end' values from the sequence file.
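The following sketch shows one way to produce a small feature file in this layout and to check the rules described above (column order, 'start' not greater than 'end', and a '+' or '-' strand). The gene names and coordinates are placeholders, and the file is written to a temporary directory standing in for the project's features directory.

```shell
#!/bin/sh
# Hypothetical location; in a real project this would be the features directory.
features_dir=$(mktemp -d)
gff="$features_dir/example_features.gff"

# Header row: the eight column titles, in the required order.
printf 'seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\n' > "$gff"

# Placeholder entries: name, '.', type, start, end, '.', strand, '.'
printf '%s\t.\t%s\t%s\t%s\t.\t%s\t.\n' \
    'geneA' 'CDS'  '100'  '1200' '+' \
    'geneB' 'tRNA' '1500' '1575' '-' \
    'geneC' 'J'    '2000' '2900' '+' \
    >> "$gff"

# Check the rules: start <= end, and strand is '+' or '-'.
awk -F '\t' 'NR > 1 && ($4 + 0 > $5 + 0 || ($7 != "+" && $7 != "-")) {
        printf "line %d is invalid\n", NR; bad = 1
    }
    END { exit (bad + 0) }' "$gff" && echo "all feature rows are valid"
```

The third placeholder row uses a single-letter COG category ('J') in the 'feature' column, as permitted above.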
Adding additional analysis results (analysis GFF files)
To add analysis results to the map, place one or more files with a '.gff' extension in the analysis directory, which is located in the CCT project directory. The files should be tab-delimited or comma-delimited and should have the following column titles, in the following order: 'seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame'. The first line in the file must be the column titles. For a given entry, only the 'start', 'end', 'strand', and 'score' values are required. 'start' and 'end' should be integers between 1 and the length of the sequence, and the 'start' value should be less than or equal to the 'end' regardless of the 'strand' value. The 'strand' value should be '+' for the forward strand and '-' for the reverse strand. The 'score' value should be a real number, positive or negative. The other values can be given as '.' or left blank. These column titles are based on the specification of the GFF file format. If 'start' and 'end' values are not supplied, but a 'seqname' is given, this script will attempt to get the 'start' and 'end' values from the sequence file.
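Analysis results often start out as a simple table of positions and scores. The sketch below converts such a table into the eight-column layout, writing '.' for the columns that are not required. The input file name, window coordinates, and scores are all made up for illustration, and the '+' strand is an arbitrary placeholder choice.

```shell
#!/bin/sh
work_dir=$(mktemp -d)

# Hypothetical input: start, end and score for each window (tab-separated).
printf '%s\t%s\t%s\n' \
    '1'    '1000' '0.25' \
    '1001' '2000' '-1.5' \
    '2001' '3000' '3.0' \
    > "$work_dir/windows.tsv"

# Convert to the eight-column layout; unused columns are written as '.'.
gff="$work_dir/example_analysis.gff"
awk -F '\t' 'BEGIN {
        OFS = "\t"
        print "seqname", "source", "feature", "start", "end", "score", "strand", "frame"
    }
    { print ".", ".", ".", $1, $2, $3, "+", "." }' \
    "$work_dir/windows.tsv" > "$gff"

cat "$gff"
```

In a real project the resulting file would be placed in the analysis directory of the CCT project.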
COG categories and colours
Category | Colour | Description | |
---|---|---|---|
Information storage and processing [oranges/reds] | |||
A | Red | RNA processing and modification | |
B | Tomato | Chromatin structure and dynamics | |
J | Light coral | Translation, ribosomal structure and biogenesis | |
K | Dark orange | Transcription | |
L | Deep pink | Replication, recombination and repair | |
Cellular processes and signaling [greens/yellows] | |||
D | Khaki | Cell cycle control, cell division, chromosome partitioning | |
O | Dark khaki | Post-translational modification, protein turnover, and chaperones | |
M | Olive drab | Cell wall/membrane/envelope biogenesis | |
N | Forest green | Cell motility | |
P | Yellow green | Inorganic ion transport and metabolism | |
T | Lime green | Signal transduction mechanisms | |
U | Green yellow | Intracellular trafficking, secretion, and vesicular transport | |
V | Medium spring green | Defense mechanisms | |
W | Dark sea green | Extracellular structures (this doesn't appear in reference database) | |
Y | Medium sea green | Nuclear structure (this appears once in reference database) | |
Z | Yellow | Cytoskeleton | |
Metabolism [blues/purples] | |||
C | Cyan | Energy production and conversion | |
G | Dark turquoise | Carbohydrate transport and metabolism | |
E | Steel blue | Amino acid transport and metabolism | |
F | Deep sky blue | Nucleotide transport and metabolism | |
H | Blue | Coenzyme transport and metabolism | |
I | Slate blue | Lipid transport and metabolism | |
Q | Navy | Secondary metabolites biosynthesis, transport, and catabolism | |
Poorly characterized [grays] | |||
R | Gray | General function prediction only (examples include "Predicted thioesterase", "Predicted ATPase") | |
S | Dark gray | Function unknown (examples include "Uncharacterized conserved protein", "Predicted small secreted protein") | |
Unknown | White | Not assigned COG letter because protein is not similar to any COG |