Introduction

GAIN and SNPrank help identify the single nucleotide polymorphisms (SNPs) that are relevant to a given phenotype. Combined with PLINK, a third-party tool, they form an analysis pipeline that ranks SNPs by their relevance to a specified phenotype.

Assumptions

The input is assumed to be SNP genotype data with a single phenotype column, in CSV format. Both GAIN and SNPrank assume an initial pre-processing step that filters the input data with PLINK, a free, open-source command-line tool for whole-genome association analysis.
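
Before converting the data, it can help to confirm that the CSV is well formed. The following is a minimal sketch and not part of the GAIN or SNPrank tools. It assumes Python 3 and that the first row of the file is a header; the helper name summarize_csv is hypothetical. It only checks that every row has as many fields as the header and does not inspect the genotype or phenotype values.

```python
import csv


def summarize_csv(path):
    """Return (data_rows, columns) for a CSV file.

    Assumes the first row is a header (an assumption of this sketch).
    Raises ValueError if the file is empty or a row has a different
    number of fields than the header.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError("%s is empty" % path)
        data_rows = 0
        # Line numbers start at 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                raise ValueError(
                    "line %d has %d fields, expected %d"
                    % (line_no, len(row), len(header))
                )
            data_rows += 1
    return data_rows, len(header)
```

For example, summarize_csv("sample-data.csv") returns the number of data rows and the number of columns, or raises ValueError at the first ragged row.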

Dependencies

1. PLINK binaries for Mac, Linux, and Windows can be downloaded at http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml

2. Python is required for the GAIN and SNPrank tools. Version 2.6.5 has been tested, but 2.7.x should also work. Python is available for several platforms at: http://python.org/download

3. csv2plink.py is a Python script that converts an input CSV file to corresponding PLINK .map and .ped files. This script is hosted on Github at http://github.com/insilico/converters

4. SNPrank requires the NumPy numerical computation library for linear algebra operations. SNPrank was tested with NumPy 1.3.0. Information on installing NumPy is here: http://numpy.scipy.org

5. For optional GPU support, SNPrank requires the CUDA drivers available from NVIDIA at http://developer.nvidia.com/object/cuda_download.html

6. Additionally, a Python CUDA matrix library called CUDAMat is used for the linear algebra operations on the GPU. CUDAMat is hosted on Google Code and can be downloaded at http://code.google.com/p/cudamat
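
The dependencies above can be checked from Python before running the pipeline. This is a rough sketch under a few assumptions: the PLINK executable is named plink (or plink.exe) and is on the PATH, and NumPy and CUDAMat are importable as numpy and cudamat. The function names are hypothetical and not part of any of the tools. Run it with the same interpreter that will run GAIN and SNPrank.

```python
import os
import sys


def module_available(name):
    """Return True if the named module can be imported."""
    try:
        __import__(name)
    except Exception:
        # ImportError if missing; other errors if a native library fails to load.
        return False
    return True


def on_path(program):
    """Return True if an executable called program is found on the PATH."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        for candidate in (program, program + ".exe"):
            full = os.path.join(directory, candidate)
            if os.path.isfile(full) and os.access(full, os.X_OK):
                return True
    return False


def check_dependencies():
    """Map each dependency name to True (found) or False (missing)."""
    return {
        "python": sys.version_info[:2] >= (2, 6),
        "plink": on_path("plink"),
        "numpy": module_available("numpy"),
        "cudamat": module_available("cudamat"),  # optional, GPU support only
    }


if __name__ == "__main__":
    for name, found in sorted(check_dependencies().items()):
        print("%-8s %s" % (name, "found" if found else "missing"))
```

A missing cudamat entry only matters if GPU support is wanted; the other entries are needed for the steps described in this document.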

Downloading the tools

pygain (a Python implementation of GAIN) can be downloaded from Github at http://github.com/insilico/pygain. Click the Downloads button on the right for the latest tagged release (0.1.0 as of this writing).

pysnprank (a Python implementation of SNPrank) is also available from Github at http://github.com/insilico/pysnprank. Click the Downloads button on the right for the latest tagged release (0.1.0 as of this writing).

Instructions

1. Convert the input CSV to PLINK .map and .ped files using the csv2plink.py utility

$ csv2plink.py -c 1 -p 1 sample-data.csv sample-data

2. Use PLINK to recode the .map and .ped files as a .raw file

$ plink --file sample-data --recodeA --map3 --out sample-data

(If there are errors with the PLINK recode, try adding options to exclude missing data columns):

$ plink --file sample-data --recodeA --map3 --no-sex --no-fid --no-parents --missing-genotype ? --out sample-data

3. Once the PLINK command successfully generates the .raw file, run GAIN on the PLINK .raw data

$ gain.py -i sample-data.raw -o sample-data.gain

4. Run SNPrank on the GAIN matrix data to output a ranked list of SNPs

$ snprank.py -i sample-data.gain -o sample-data-rankings.txt
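
The four steps above can also be chained from a small Python script. The sketch below simply reproduces the commands shown in this document as argument lists and runs them in order, stopping at the first failure. It assumes that csv2plink.py, plink, gain.py, and snprank.py are all executable and on the PATH; the function names and the lenient_recode parameter are hypothetical. The -c 1 -p 1 options are copied unchanged from step 1.

```python
import subprocess


def pipeline_commands(csv_path, prefix, lenient_recode=False):
    """Build the four pipeline commands as argument lists.

    With lenient_recode=True, the PLINK step uses the extra options
    suggested above for when the plain recode fails.
    """
    recode = ["plink", "--file", prefix, "--recodeA", "--map3"]
    if lenient_recode:
        recode += ["--no-sex", "--no-fid", "--no-parents",
                   "--missing-genotype", "?"]
    recode += ["--out", prefix]
    return [
        ["csv2plink.py", "-c", "1", "-p", "1", csv_path, prefix],
        recode,
        ["gain.py", "-i", prefix + ".raw", "-o", prefix + ".gain"],
        ["snprank.py", "-i", prefix + ".gain",
         "-o", prefix + "-rankings.txt"],
    ]


def run_pipeline(csv_path, prefix, lenient_recode=False):
    """Run each step in order; raises CalledProcessError if a step fails."""
    for command in pipeline_commands(csv_path, prefix, lenient_recode):
        print("$ " + " ".join(command))
        subprocess.check_call(command)


if __name__ == "__main__":
    run_pipeline("sample-data.csv", "sample-data")
```

Because the commands are passed as argument lists rather than through a shell, the ? given to --missing-genotype reaches PLINK literally and is not subject to shell wildcard expansion.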

GAIN/SNPrank analysis pipeline

There are currently two implementations of GAIN available:

  • a command-line tool written in Python, hosted on Github
  • an older Java-based GUI version of GAIN, hosted on Google Code