ReliefSeq is a feature (attribute) selection and ranking algorithm, written in C++, designed to handle various types of genetic features, including combinations of feature data types and endpoints (phenotypes/classes).
For example, SNPs and gene expression features can be analyzed with discrete or continuous phenotypes (classes). The full list of command line options is displayed by typing ‘reliefseq’ or ‘reliefseq --help’ at the command prompt.
Installation
ReliefSeq can be installed from the Insilico ReliefSeq GitHub repository. Instructions for installation are provided there.
ReliefSeq on…
Digital Gene Expression (DGE) RNA-Seq Data Sets
ReliefSeq will rank the features (genes) in an RNA-Seq digital gene
expression data set. The features in these data sets are counts, so
ReliefSeq treats them as “numeric” attributes. ReliefSeq can handle
numeric attributes in a general way with the ‘-n’ command line option,
but that route requires a separate phenotype file, specified with the
‘-a’ (alternate phenotype file) command line option.
Furthermore, these files require ID matching (see ReliefSeq on Gene
Expression Data Sets below). To avoid this requirement, a special DGE
format CSV file that contains phenotypes and numeric attributes together can
be used. The header line of the CSV file contains the case-control phenotypes
(1/0). Each remaining line contains a gene name followed by its counts. The
following shows the first few lines of an example file:
,0,0,0,0,0,1,1,1,1
7A,24,16,57,30,17,27,36,25,31
A1BG,319,180,288,112,109,233,143,251,169
A1CF,3,1,2,2,0,2,2,0,3
A26C1A,0,1,2,0,0,0,2,0,0
A26C1B,0,0,1,2,3,2,1,2,0
A26C3,3,5,3,5,3,17,3,5,7
A2BP1,1,0,0,4,0,1,1,0,1
A2M,1865,1250,3061,2070,1164,2337,1142,3209,1914
A2ML1,3,1,1,5,0,4,0,2,2
Note that the first line begins with a leading ‘,’ to skip the gene label
column, so each phenotype label lines up with the sample in the same column
of the count rows.
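To make the layout concrete, the following Python sketch parses a file in this format. It is only an illustration of the format described above, not ReliefSeq's own reader; the function name `read_dge_csv` is hypothetical, and it assumes that counts and phenotypes are plain integers.

```python
import csv
import io


def read_dge_csv(handle):
    """Parse a DGE counts CSV: a phenotype header row, then one row per gene."""
    rows = csv.reader(handle)
    header = next(rows)
    # The first header cell is empty (the leading ','), so skip it.
    phenotypes = [int(value) for value in header[1:]]
    counts = {}
    for row in rows:
        if not row:
            continue  # tolerate blank lines
        gene = row[0]
        values = [int(value) for value in row[1:]]
        if len(values) != len(phenotypes):
            raise ValueError(
                f"{gene}: expected {len(phenotypes)} counts, got {len(values)}"
            )
        counts[gene] = values
    return phenotypes, counts


# First lines of the example file shown above.
example = io.StringIO(
    ",0,0,0,0,0,1,1,1,1\n"
    "7A,24,16,57,30,17,27,36,25,31\n"
    "A1BG,319,180,288,112,109,233,143,251,169\n"
)
phenotypes, counts = read_dge_csv(example)
print(phenotypes)
print(counts["A1BG"][:3])
```

The length check catches the most common formatting mistake: a gene row whose number of counts does not match the number of phenotype labels in the header.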
To analyze a file of this format, invoke ReliefSeq with a command line
of the form:
$ reliefseq --dge-counts-data example.csv -o example
This command produces a tab-delimited file named example.reliefseq
that contains a list of genes ranked by Relief-F score. For example:
0.280785	NDUFA
0.280724	MRPS
0.275989	UBE2G
0.250673	P4HB
0.250568	BRUNOL
0.247726	PSMA
0.245764	PHTF
0.242897	ZFP36L2
0.232065	LBH
0.213614	STAT5B
We have found that ReliefSeq for RNA-Seq data works best when the
number of nearest neighbors, k, is optimized for each attribute in the
data set. ReliefSeq provides the following command line options for
controlling this optimization:
-k [ --k-nearest-neighbors ] arg (=10)   set k nearest neighbors (0=optimize k)
--kopt-begin arg (=1)                    optimize k starting with kopt-begin
--kopt-end arg (=1)                      optimize k ending with kopt-end
--kopt-step arg (=1)                     optimize k incrementing with kopt-step
--write-best-k                           optimize k, write best k's
--write-each-k-scores                    optimize k, write best scores for each
For instance, a command line that runs the full range of nearest
neighbors for a data set having 24 cases and 24 controls (a maximum of
23 nearest neighbors) is shown below:
$ reliefseq --dge-counts-data example.csv -k 0 --kopt-begin 1 --kopt-end 23 -o example
For a data set containing over 16,000 genes, this analysis takes about
10 minutes. The resulting scores file ranks each gene by the score obtained
at its optimum k. The --write-each-k-scores option writes a scores file
for each k tried, and --write-best-k writes a file that lists the best k
found for each gene.
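The bookkeeping behind a k sweep can be sketched in Python as follows. This is a hedged illustration rather than ReliefSeq's implementation: all function names are hypothetical, the sketch assumes the "best" k for a gene is the one giving its highest score (ties go to the smaller k), it assumes the end of the range is inclusive, and the scores in the usage lines are made-up values.

```python
def max_k(num_cases, num_controls):
    """Largest usable k: one less than the size of the smaller class."""
    return min(num_cases, num_controls) - 1


def candidate_ks(kopt_begin, kopt_end, kopt_step=1):
    """k values visited by a sweep (assumes kopt_end is inclusive)."""
    return list(range(kopt_begin, kopt_end + 1, kopt_step))


def best_k_per_gene(scores_by_k):
    """Pick the highest-scoring k for each gene.

    scores_by_k maps k -> {gene: score}. Returns {gene: (best_k, best_score)}.
    Assumption: higher scores are better and ties keep the smaller k.
    """
    best = {}
    for k in sorted(scores_by_k):
        for gene, score in scores_by_k[k].items():
            if gene not in best or score > best[gene][1]:
                best[gene] = (k, score)
    return best


# 24 cases and 24 controls allow at most 23 neighbors.
print(max_k(24, 24))
print(candidate_ks(1, max_k(24, 24))[-1])

# Made-up scores for two genes at three values of k.
toy_scores = {
    1: {"geneA": 0.10, "geneB": 0.30},
    2: {"geneA": 0.25, "geneB": 0.20},
    3: {"geneA": 0.25, "geneB": 0.05},
}
print(best_k_per_gene(toy_scores))
```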
SNP Data Sets
Related Command Line Options
-s [ --snp-data ] arg                    read SNP attributes from genotype filename: txt, ARFF, plink (map/ped, binary, raw)
--snp-file-type arg                      Ignore file extension and use type: textwhitesp, wekaarff, plinkped, plinkbed, plinkraw, dge, birdseed
--snp-metric arg (=gm)                   metric for determining the difference between subjects (gm|am|nca|nca6)
-B [ --snp-metric-nn ] arg (=gm)         metric for determining the difference between subjects (gm|am|nca|nca6|km)
-W [ --snp-metric-weights ] arg (=gm)    metric for determining the difference between SNPs (gm|am|nca|nca6)
The most basic SNP analysis is to specify a SNP/GWAS data file:
$ reliefseq --snp-data data_file.ext
‘ext’ is used to determine the format of the SNP file. The following
‘ext’ values are recognized by ReliefSeq:
FILE EXTENSION | DETAILS |
---|---|
txt | tab-delimited header followed by data, class column designated “Class” in the header line (originally the only supported format) |
map/ped | PLINK map/ped file; either map or ped is recognized |
bed/bim/fam | PLINK binary encoded map/ped; any of bed, bim or fam is recognized |
raw | PLINK RAW file from --recodeA PLINK operation (similar to txt format) |
arff | Weka attribute relation file format (using nominals encoded {0,1,2}) |
Many messages are sent to the console (stdout) to keep the user informed of
the algorithm’s progress. The resulting ranked attributes are stored in the
file reliefseq_default.reliefseq. The name of the output file can be changed
with the command line option --out-files-prefix, in which case the prefix is
used to produce output filenames of the form out-files-prefix.reliefseq. The
reliefseq program will report to the console the exact name used. The format
of the output scores files is a two-column, tab-delimited text file of sorted
scores and attribute names.
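Because the scores file is plain two-column text, it is easy to post-process. The Python sketch below reads such a file and returns the top-ranked attribute names. The function names are hypothetical, and the sample rows are taken from the example listing earlier in this document.

```python
import io


def read_scores(handle):
    """Read a two-column, tab-delimited scores file into (name, score) pairs."""
    ranked = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        score_text, name = line.split("\t", 1)
        ranked.append((name.strip(), float(score_text)))
    return ranked


def top_attributes(ranked, n):
    """Return the names of the n highest-scoring attributes."""
    ordered = sorted(ranked, key=lambda item: item[1], reverse=True)
    return [name for name, _score in ordered[:n]]


example = io.StringIO(
    "0.280785\tNDUFA\n"
    "0.280724\tMRPS\n"
    "0.275989\tUBE2G\n"
)
ranked = read_scores(example)
print(top_attributes(ranked, 2))
```

The explicit sort in `top_attributes` is a defensive choice: the file is already sorted, but sorting again keeps the helper correct if rows are concatenated from several runs.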
SNP-only, continuous phenotype (discrete-continuous) Analysis
$ reliefseq --snp-data data_file.ext
If the phenotype in a SNP file is found to be continuous, the regression
ReliefF (RReliefF) algorithm is invoked. The phenotype type is determined
from the phenotypes in the file, or an alternate phenotype file can be supplied
with the --alternate-pheno-file option to override the phenotypes in data_file.
If the phenotype is “1” or “2” in the case of PLINK files, or “0” and “1” in
the case of txt and ARFF files, the phenotype is considered case-control.
Otherwise, the phenotype is assumed to be continuous. The same is true of the
alternate phenotype file. The format of the alternate phenotype file is a
three-column, tab-delimited text file. This is the same as PLINK’s phenotype
file format and has the following required columns:
FID family ID | IID individual ID | PHENOTYPE value |
NOTE: It is assumed the phenotype file has NO HEADER (in contrast to PLINK
where it is optional). ADDITIONAL NOTE: See “A Note about IDs” below for
important details about ID matching. The values in the third column of the
phenotype file replace the phenotypes read from the SNP file.
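The phenotype rules above can be sketched in Python as follows. This is an illustration of the stated rules, not ReliefSeq's code: the function names are hypothetical, values are compared as text exactly as they appear in the file, and missing phenotypes are assumed to have been removed already.

```python
import io


def read_pheno_file(handle):
    """Read a headerless, tab-delimited FID / IID / PHENOTYPE file.

    Returns a dict mapping IID to the phenotype text.
    """
    phenotypes = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(fields)}"
            )
        _fid, iid, value = fields
        phenotypes[iid] = value.strip()
    return phenotypes


def phenotype_type(values, plink_coding):
    """Classify phenotypes as 'case-control' or 'continuous'.

    PLINK files code case-control as 1/2; txt and ARFF files use 0/1.
    Anything outside the expected pair is treated as continuous.
    """
    case_control_codes = {"1", "2"} if plink_coding else {"0", "1"}
    return "case-control" if set(values) <= case_control_codes else "continuous"


example = io.StringIO("fam1\tind1\t2\nfam2\tind2\t1\nfam3\tind3\t2\n")
pheno = read_pheno_file(example)
print(phenotype_type(pheno.values(), plink_coding=True))
print(phenotype_type(["0.37", "1.92"], plink_coding=False))
```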
Gene Expression (or Other Numeric) Data Sets
Related Command Line Options
-n [ --numeric-data ] arg                     read continuous attributes from PLINK-style covar file
-N [ --numeric-metric ] arg (=manhattan)      metric for determining the difference between numeric attributes (manhattan=|euclidean)
-a [ --alternate-pheno-file ] arg             specifies an alternative phenotype/class label file; one value per line
$ reliefseq -n data_file.dat --alternate-pheno-file discrete_class.pheno
With this combination, numeric attributes such as gene expression or other
continuous genetic measurements can be used. Note that the alternate phenotype
file option is required for numeric-only attributes. The --numeric-metric
command line option specifies the metric used for the distance between
instances. Although the numeric variables are not treated as covariates, they
are supplied in the PLINK covariate file format.
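For reference, the two metric names correspond to the standard Manhattan and Euclidean distances. The Python sketch below shows the textbook definitions for two instances with the same numeric attributes. It is not taken from ReliefSeq's source, and any normalization ReliefSeq applies to the attributes is not shown.

```python
import math


def manhattan(a, b):
    """Sum of absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared per-attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


instance_1 = [0.0, 0.0]
instance_2 = [3.0, 4.0]
print(manhattan(instance_1, instance_2))
print(euclidean(instance_1, instance_2))
```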
A Note about IDs
As with the phenotype files described above, the ID fields are important
and can be used to filter the data set in various ways through
ID matching. The IID field (second column) must match the IID field in the
PLINK format SNP files or be an eight-character, zero-padded sequence
beginning with ‘00000001’ and incrementing by one for each line/instance in
the file (for txt and RAW files). (This encoding ensures a strict ordering of
instances, which determines how ties are broken in the nearest neighbor
algorithm; this matters when validating the algorithm against results from
the Weka machine learning system.) The numeric and phenotype files’ IDs are read and
intersected to find common IDs. Then if present, the SNP data set is read,
keeping only the IDs that matched with the numeric and phenotype files.
Finally, if any phenotypes are missing from an alternate phenotype file,
these instances are removed from algorithmic consideration. In this way a SNP
and/or numeric file can be used with several different phenotype files with
different individuals. One should always read the console output/log carefully
to make sure the number of instances in the final analysis meets expectations.
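The ID matching described above amounts to a set intersection. The Python sketch below illustrates the idea; the function names are hypothetical, and returning the matched IDs in sorted order is an assumption made here to keep the result deterministic.

```python
def sequential_iids(count):
    """Eight-character, zero-padded IDs starting at '00000001'."""
    return [f"{index:08d}" for index in range(1, count + 1)]


def matched_ids(numeric_ids, pheno_ids, snp_ids=None):
    """IDs present in the numeric and phenotype files, and in the SNP file if given."""
    common = set(numeric_ids) & set(pheno_ids)
    if snp_ids is not None:
        common &= set(snp_ids)
    return sorted(common)


print(sequential_iids(3))

# Four instances in the numeric file, but phenotypes for only three of them.
numeric = sequential_iids(4)
pheno = ["00000001", "00000002", "00000004"]
print(matched_ids(numeric, pheno))
```

Comparing `len(matched_ids(...))` with the instance count reported on the console is a quick way to confirm that no individuals were dropped unexpectedly.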
Integrated Data Sets
Combining both discrete and continuous attributes is referred to as
“integrated analysis”. In the case of integrated attributes, both types of
distance measures are used and can be overridden with the command line
options --snp-metric and --numeric-metric. ReliefF is used when a discrete
class is read from the SNP file. RReliefF is used if continuous phenotypes are
detected (as described above).
A Note on…
Missing Values
Missing values are handled for all types of data sets supported. Each data set
reader considers the missing encoding(s) for its particular file format.
The following table summarizes the missing genotype values recognized by each
reader.
FILE FORMAT | MISSING GENOTYPE VALUE |
---|---|
TXT | 9 or ? or empty string |
ARFF | ? |
PLINK RAW | NA |
PLINK PED | ‘0 0’ |
PLINK BED | bit string ‘10’ |
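For the text-based formats, the missing-value encodings listed above can be expressed as a small lookup in Python. This is a sketch only: the dictionary and function names are hypothetical, and PLINK BED is left out because its missing code is a bit pattern inside a binary file rather than a text token.

```python
# Missing genotype tokens per text format, as listed in the table above.
MISSING_GENOTYPE_TOKENS = {
    "TXT": {"9", "?", ""},
    "ARFF": {"?"},
    "PLINK RAW": {"NA"},
    "PLINK PED": {"0 0"},
}


def is_missing_genotype(token, file_format):
    """Return True if token is a missing-genotype code for the given format."""
    if file_format not in MISSING_GENOTYPE_TOKENS:
        raise ValueError(f"unsupported format: {file_format}")
    # Surrounding whitespace is ignored, so a blank field counts as empty.
    return token.strip() in MISSING_GENOTYPE_TOKENS[file_format]


print(is_missing_genotype("?", "TXT"))
print(is_missing_genotype("1", "TXT"))
print(is_missing_genotype("0 0", "PLINK PED"))
```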
Missing SNP values in ReliefF are handled by algorithms as described in
section “2.2. RELIEFF – EXTENSION” in the paper “Theoretical and Empirical
Analysis of ReliefF and RReliefF”, Machine Learning Journal (2003) 53:23-69.
For continuous values, the normalized difference is used, as in the Weka
machine learning system.
Missing phenotypes cause the file reader to skip the individual/instance with
a warning message, and the number of instances reported in the program output
is reduced accordingly. For TXT and ARFF files, a phenotype encoded as -9 is
treated as missing. For PLINK formats, missing phenotypes are 0 or -9 for SNPs
and -9 for continuous phenotypes.
Weighting by Distance in ReliefF
When computing nearest neighbors, the influence of distance between an
instance and its nearest neighbors is considered equal by default; that is,
the distances are used only to rank the neighbors. In both SNPs and continuous
attributes, the influence of each ranked neighbor can be taken into
consideration by applying a weighting factor to each neighbor’s distance.
This is particularly important in the case of regression ReliefF, since it
uses the distance between instances as a way of making a kind of
hits-and-misses analogy to the standard ReliefF algorithm. For more details
see section “2.3. RRELIEFF – IN REGRESSION” in the paper “Theoretical and
Empirical Analysis of ReliefF and RReliefF”, Machine Learning Journal
(2003) 53:23-69. See “Overriding ReliefF Default Algorithm Parameters” above
for the command line options for using this feature.
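As a rough illustration of what a distance weighting looks like, the Python sketch below contrasts equal neighbor weights with the rank-based exponential weighting described in the cited paper, in which the neighbor of rank r receives a weight proportional to exp(-(r/sigma)^2). Whether ReliefSeq uses exactly this form is an assumption; the function names and the value of sigma are chosen only for the example.

```python
import math


def uniform_weights(k):
    """Default behavior: every one of the k neighbors has equal influence."""
    return [1.0 / k] * k


def rank_weights(k, sigma):
    """Rank-based exponential weights, normalized to sum to 1.

    The nearest neighbor has rank 1; sigma controls how quickly the
    influence falls off with rank.
    """
    raw = [math.exp(-((rank / sigma) ** 2)) for rank in range(1, k + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]


print(uniform_weights(4))
print([round(weight, 3) for weight in rank_weights(4, sigma=2.0)])
```

With a small sigma, nearly all of the influence goes to the closest neighbors; with a large sigma, the weights approach the uniform case.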
Multiclass Phenotypes
Multiclass phenotypes are implemented in the ReliefF C++ class; however, the
PLINK data set readers restrict phenotypes to case-control, since PLINK does
not support multiclass phenotypes and this feature has not been needed.
Multiclass is supported in TXT and ARFF formats, though results are
unpredictable if the class column is not coded as integers (the first character
of whatever is read as the class column is converted to an integer, effectively
limiting classes to ten levels: 0-9).
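The consequence of the first-character conversion can be seen in a short Python sketch. This mimics the behavior described above and is not the ReliefSeq source; the function name is hypothetical, and raising an error for a non-digit label is a choice made here for clarity, whereas the text above only says results are unpredictable in that case.

```python
def class_from_label(label):
    """Convert a class label to an integer using only its first character."""
    first = label.strip()[:1]
    if first == "" or first not in "0123456789":
        raise ValueError(f"class label does not start with a digit: {label!r}")
    return int(first)


print(class_from_label("3"))
# A two-digit label collapses onto its first digit, which is why only
# the ten levels 0-9 can be distinguished.
print(class_from_label("12"))
```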