### General overview of FUMAGWAS

The main purpose of FUMA is to use functional, biological information to prioritize genes based on GWAS outcomes.

FUMA consists of two separate processes: SNP2GENE and GENE2FUNC.

To annotate and prioritize SNPs and genes from your GWAS summary statistics, go to SNP2GENE, which computes LD structure, annotates functions to SNPs, and prioritizes candidate genes.

You can then use the prioritized genes as input to GENE2FUNC to check expression patterns and shared molecular functions between genes. GENE2FUNC can also be used for any list of pre-selected genes (i.e. created outside of SNP2GENE).

## Quick Start

### General Information

Each page contains information where needed and brief descriptions of inputs and results to help you understand them without going through the entire tutorial.

This is information about inputs or results.

Click the question mark to display a brief description.

This is for optional inputs/parameters.

This is the message if everything is fine.

This is the message if the input/parameter is mandatory and not given or invalid input is given.

This is the warning message for the input/parameter. Please check your input settings.

### Prioritize genes based on your own GWAS summary statistics

For risk loci identified by FUMA in your summary statistics, you can obtain functional annotation of SNPs and map them to genes. By changing parameter settings, you can control which annotations or filters need to be used to prioritize genes.

Because you will upload your own GWAS summary statistics, we require you to register. All uploaded files are handled securely and can only be seen by you. Results can be queried at later times, but can also be deleted. If you delete a previously run job, your uploaded file will be deleted from the FUMA server.

If you haven't registered yet, please do so from Register.

#### 2. Submit new job at SNP2GENE

A new job starts with a GWAS summary statistics file. A variety of file formats are supported. Please refer to the Input files section for details. If your input file is an output from PLINK, SNPTEST or METAL, you can submit the file directly without specifying column names.
The input GWAS summary statistics file can be a subset of SNPs (e.g. only SNPs which are interesting in your study), but in this case, MAGMA results are no longer relevant.
Optionally, if you would like to pre-specify lead SNPs, you can upload a file with 3 columns: rsID, chromosome and position. FUMA will then use these SNPs to select LD-related SNPs for annotation and mapping, instead of using lead SNPs identified by FUMA (this requires disabling the option "identify additional lead SNPs").
In addition, if you are interested in specific genomic regions, you can also provide them by uploading a file with 3 columns: chromosome, start and end position. FUMA will then use these genomic regions to select LD-related SNPs for annotation and mapping, instead of determining the regions itself.

#### 3. Set parameters

On the same page where you specify the input files, there are a variety of optional parameters that control the prioritization of genes. Please check your parameters carefully. The default settings identify independent genome-wide significant SNPs at r2 0.6 and lead SNPs at r2 0.1, and map SNPs to genes up to 10kb apart.
To filter SNPs by specific functional annotations and to use eQTL mapping, please change the parameters (please refer to the Parameters section of this tutorial).
If all inputs are valid, the 'Submit Job' button will be activated. Once you submit a job, it will be listed in My Jobs.
Please do not navigate away from the page while your file is uploading (this may take up to a couple of minutes depending on the file size and your internet speed).

After you submit files and parameter settings, a JOB has the status NEW, which will be updated to QUEUED and then to RUNNING. Depending on the number of significant genomic regions, this may take between a couple of minutes and an hour. Once a JOB has finished running, you will receive an email. Unless an error occurred during the process, the email includes the link to the result page (this again requires login). You can also access the results page from the My Jobs page.
The result page displays 4 additional side bars.

• Genome-wide plots: Manhattan plots and Q-Q plots for GWAS summary statistics and the gene-based test by MAGMA, results of MAGMA gene-set analysis and tissue expression analysis.
• Summary of results: Summary of results such as the number of lead and LD-related SNPs, and mapped genes, overall and per identified genomic risk locus.
• Results: Tables of lead SNPs, genomic risk loci, candidate SNPs with annotations, eQTLs (only when eQTL mapping is performed), mapped genes and GWAS-catalog reported SNPs matched with candidate SNPs. You can also create interactive regional plots with functional annotations from this tab.

Details of all FUMA outputs are provided in the SNP2GENE Outputs section of this tutorial.

### Gene functions: Tissue specific gene expression and shared biological functions of a list of genes

GENE2FUNC can take the list of prioritized genes from SNP2GENE or alternatively you can provide another list of pre-specified genes. Note that the genes prioritized in SNP2GENE are based on the functional and/or eQTL mapping, but not on MAGMA based gene output.

For every input gene, GENE2FUNC provides information on expression in different tissue types, tissue specificity and enrichment of publicly available gene sets.

#### 1. Submit a list of genes

Both a list of genes of interest and a list of background genes (for the hypergeometric test) are mandatory inputs.
You can use mapped genes from SNP2GENE by clicking the "Submit" button on the result page (Results tab).

#### 2. Results

Once genes are submitted, four extra side bars are shown.

• Gene Expression: An interactive heatmap of gene expression for user-selected data sets.
• Tissue Specificity: Bar plots for the enrichment test of differentially expressed genes in a certain label compared to all other samples, for user-selected data sets. See the GENE2FUNC Outputs section for details.
• Gene Sets: Plots and tables of enriched gene sets.
• Gene Table: Table of input genes with links to external databases: OMIM, Drugbank and GeneCards.

Further details are provided in the GENE2FUNC Outputs section of this tutorial.

## SNP2GENE

### Prepare Input Files

#### 1. GWAS summary statistics

GWAS summary statistics are a mandatory input of the SNP2GENE process. FUMA accepts various file formats. For example, PLINK, SNPTEST and METAL output formats can be used as they are. For other formats, column names can be provided. Input files should be prepared as plain ASCII text or, preferably, gzipped or zipped. Every row should contain information on one SNP. An input GWAS summary statistics file can contain only a subset of SNPs (e.g. SNPs of interest for your study that you want to annotate), but in this case, the MAGMA results are no longer relevant. Please note that variants which do not exist in the selected reference panel will not be included in any analyses.
For indels, both alleles need to match the reference panel exactly to be included in the analysis. For example, the indel rs144029872 needs to be encoded as AG/A (the order of alleles does not matter); anything else, such as G/- or I2/D, will not match the selected reference panel.

Mandatory columns
The input file must include a P-value and either an rsID or chromosome index + genomic position on the hg19 reference genome. When either chromosome or position is missing, they are extracted from dbSNP build 146 based on rsID. In this case, the input rsID is updated to dbSNP build 146. When rsID is missing, it is extracted from dbSNP build 146 based on chromosome and position. The chromosome column can be a string such as "chr1" or just an integer such as 1. When "chr" is attached, it will be removed in output files. Chromosome X will be encoded as chromosome 23; the input file itself can contain "X".
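
The chromosome conventions above can be illustrated with a short sketch. `normalize_chromosome` is a hypothetical helper written for this tutorial, not FUMA's own code; it only mirrors the two rules stated here (the "chr" prefix is dropped, and X becomes 23).

```python
def normalize_chromosome(value):
    """Return an integer chromosome code following the conventions described above.

    Hypothetical helper for illustration; not part of FUMA.
    """
    text = str(value).strip()
    # A leading "chr" prefix is removed.
    if text.lower().startswith("chr"):
        text = text[3:]
    # Chromosome X is encoded as 23.
    if text.upper() == "X":
        return 23
    return int(text)
```

With this helper, `"chr1"` and `1` both map to `1`, and `"X"` or `"chrX"` map to `23`.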

Allele columns
Alleles are not mandatory but if only one allele is provided, that is considered to be the effect allele. When two alleles are provided, the effect allele will be defined depending on column name. If alleles are not provided, they will be extracted from the dbSNP build 146 and minor alleles will be assumed to be the effect alleles. Effect and non-effect alleles are not distinguished during annotations, but used for alignment with eQTLs. Whenever alleles are provided, they are matched with dbSNP build 146 if extraction of rsID, chromosome or position is necessary.
Alleles are case insensitive.

Column names are automatically detected based on the following headers (case insensitive).

• SNP | snpid | markername | rsID: rsID
• CHR | chromosome | chrom: chromosome
• BP | pos | position: genomic position (hg19)
• A1 | effect_allele | allele1 | alleleB: effect allele
• A2 | non_effect_allele | allele2 | alleleA: non-effect allele
• P | pvalue | p-value | p_value | frequentist_add_pvalue | pval: P-value (Mandatory)
• OR: Odds Ratio
• Beta | be: Beta
• SE: Standard error
If your input file has alternative names, these can be entered in the respective input boxes when specifying the input file. Note that columns with one of the names listed above but with different content need to be avoided. For example, a column named "SNP" whose content is an id such as "chr:position" rather than an rsID will cause an error.
Extra columns will be ignored.
Column "N" is described in the Parameters section.
Be careful with the allele headers, since A1 is defined as the effect allele by default. Please specify both the effect and non-effect allele columns to avoid mislabeling.
If wrong labels are provided for alleles, this does not affect any annotation and prioritization results. It does, however, affect eQTL results (alignment of the risk increasing allele of the GWAS with the tested allele of eQTLs). Be aware of this when you interpret results.

Delimiter
The delimiter can be any white space, including a single space, multiple spaces and tabs. Because of this, no element, including column names, may contain a space.
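
As an illustration of the header rules above, the following sketch detects columns in a whitespace-delimited header line. The alias sets are copied from the list in this section; the function name `detect_columns`, the field labels and the example header are made up for this sketch and are not FUMA's implementation.

```python
# Header aliases taken from the list above; matching is case insensitive.
HEADER_ALIASES = {
    "rsID": {"snp", "snpid", "markername", "rsid"},
    "chromosome": {"chr", "chromosome", "chrom"},
    "position": {"bp", "pos", "position"},
    "effect_allele": {"a1", "effect_allele", "allele1", "alleleb"},
    "non_effect_allele": {"a2", "non_effect_allele", "allele2", "allelea"},
    "p_value": {"p", "pvalue", "p-value", "p_value", "frequentist_add_pvalue", "pval"},
    "odds_ratio": {"or"},
    "beta": {"beta", "be"},
    "se": {"se"},
}


def detect_columns(header_line):
    """Map recognised fields to their column index in a whitespace-delimited header."""
    detected = {}
    # str.split() without arguments splits on any run of spaces or tabs.
    for index, name in enumerate(header_line.split()):
        for field, aliases in HEADER_ALIASES.items():
            if name.lower() in aliases and field not in detected:
                detected[field] = index
    return detected


columns = detect_columns("CHR  BP\tSNP A1 A2 P N")
if "p_value" not in columns:
    raise ValueError("a P-value column is mandatory")
```

Running a check like this on your own file before uploading can catch a missing P-value column or a misspelled header early. Unrecognised headers such as `N` are simply left out of the result.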

#### Note and Tips

When the input file has all of the following columns; rsID, chromosome, position, allele1 and allele2, the process will be much quicker than extracting information.

The pipeline currently supports human genome hg19. If your input file is not based on hg19, please update the genomic positions using liftOver from UCSC. Alternatively, when you provide only rsID without chromosome and genomic position, FUMA will extract them from dbSNP build 146 based on hg19. To do this, remove the chromosome and genomic position columns, or rename their headers so that those columns are ignored. Note that extracting chromosome and genomic position takes extra time.

#### 2. Pre-defined lead SNPs

This is an optional input file.
This option is useful when
1. You have lead SNPs of interest but they do not reach the significant P-value threshold.
2. You are only interested in specific lead SNPs and do not want to identify additional independent lead SNPs. In this case, you also have to UNCHECK the option Identify additional independent lead SNPs.
If you want to specify lead SNPs, the input file should have the following 3 columns:

• rsID : rsID of the lead SNPs
• chr : chromosome
• pos : genomic position (hg19)

The order of columns has to be exactly the same as shown above, but the header can be anything (the first row is ignored). Extra columns will be ignored.

#### 3. Pre-defined genomic region

This is an optional input file. This option is useful when you have already done some follow-up analyses of your GWAS and are interested in specific genomic regions. When pre-defined genomic regions are provided, regardless of parameters, only lead SNPs and SNPs in LD with them within the provided regions will be reported in outputs.
If you want to analyze only specific genomic regions, the input file should have the following 3 columns:

• chr : chromosome
• start : start position of the genomic region of interest (hg19)
• end : end position of the genomic region of interest (hg19)

The order of columns has to be exactly the same as shown above, but the header can be anything (the first row is ignored). Extra columns will be ignored.
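
A quick sanity check of a region file before upload could look like the sketch below. It assumes whitespace-delimited rows (as for the summary statistics file) and integer positions; `read_regions` and the example rows are hypothetical and only reflect the rules above (three leading columns, first row skipped, extra columns ignored).

```python
def read_regions(lines):
    """Parse (chromosome, start, end) from whitespace-delimited lines.

    The first row is treated as a header and skipped; extra columns are ignored.
    """
    regions = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start > end:
            raise ValueError(f"start is greater than end: {line!r}")
        regions.append((chrom, start, end))
    return regions


example = ["chr start end note", "1 1000000 2000000 locusA", "6 25000000 34000000"]
regions = read_regions(example)
```

The same pattern applies to the pre-defined lead SNP file, with rsID, chromosome and position as the three leading columns.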

### Parameters

Annotation and prioritization depend on several settings, which can be adjusted if desired. The default settings perform naive positional mapping, which maps all independent lead SNPs and SNPs in LD with them to genes up to 10kb apart. They do not include eQTL mapping, and they do not filter on specific functional consequences of SNPs. If, for example, you are interested in prioritizing genes only when they are indicated by an eQTL that is in LD with a significant lead SNP, or by exonic SNPs, then you need to adjust the parameter settings.

Each user input and parameter has a status as described below. Please make sure all inputs have a non-red status, otherwise the submit button will not be activated.

This is for optional inputs/parameters.

This is the message if everything is fine.

This is the message if the input/parameter is mandatory and not given or invalid input is given.

This is the warning message for the input/parameter. Please check your input settings.

In this section, every parameter that can be adjusted will be described in detail.

#### 1. Input files

| Parameter | Mandatory | Description | Type | Default |
| --- | --- | --- | --- | --- |
| GWAS summary statistics | Mandatory | Input file of GWAS summary statistics. Plain text, zipped or gzipped files are acceptable. The maximum file size that can be uploaded is 600Mb. A subset of results can be used instead of the full GWAS summary statistics, e.g. if you would like to look up specific SNPs, you can filter out the other SNPs. Please refer to the Input files section for the specific file format. | File upload | none |
| Pre-defined lead SNPs | Optional | Optional pre-defined lead SNPs. The file should have 3 columns: rsID, chromosome and position. | File upload | none |
| Identify additional lead SNPs | Optional, only when pre-defined lead SNPs are provided | If this option is CHECKED, FUMA will identify additional independent lead SNPs after defining the LD block for pre-defined lead SNPs. Otherwise, only the given lead SNPs and SNPs in LD with them will be used for further annotations. | Check | Checked |
| Pre-defined genetic region | Optional | Optional pre-defined genomic regions. FUMA only looks at the provided regions to identify lead SNPs and SNPs in LD with them. If you are only interested in specific regions, this option will speed up the process. | | |

#### 2. Parameters for lead SNPs and candidate SNPs identification

| Parameter | Mandatory | Description | Type | Default | Direction |
| --- | --- | --- | --- | --- | --- |
| Sample size (N) | Mandatory | The total number of individuals in the GWAS or the number of individuals per SNP. This is only used by MAGMA to compute the gene-based P-values. For the total sample size, the input should be an integer. When the input file of GWAS summary statistics contains a column of sample size per SNP, the column name can be provided in the second text box. When a column name is provided, please make sure that the column only contains integers (no float or scientific notation). If there are any float values, they will be rounded up by FUMA. | Integer or text | none | Does not affect any candidates |
| Maximum lead SNP P-value (≤) | Mandatory | FUMA identifies lead SNPs with a P-value less than or equal to this threshold that are independent from each other. | numeric | 5e-8 | lower: decrease #lead SNPs. |
| Minimum r2 (≥) | Mandatory | The minimum r2 for determining LD with independent genome-wide significant SNPs, which is used to determine the borders of the genomic risk loci. SNPs with r2 ≥ the user defined threshold with any of the detected independent significant SNPs will be included for further annotations and are used for gene prioritization. Note that the identification of independent lead SNPs is independent from this and is based on a fixed r2 of 0.1. | numeric | 0.6 | higher: decrease #candidate SNPs and increase #independent significant SNPs. lower: increase #candidate SNPs and decrease #independent significant SNPs. |
| Maximum GWAS P-value (≤) | Mandatory | This is the P-value threshold for candidate SNPs in LD with independent significant SNPs. It is applied only to GWAS-tagged SNPs, because SNPs which do not exist in the GWAS input but are extracted from the 1000 genomes reference do not have a P-value. | numeric | 0.05 | higher: increase #candidate SNPs. lower: decrease #candidate SNPs. |
| Reference panel | Mandatory | The reference panel used to compute r2 and MAF. Five populations from 1000 genomes Phase 3 and 3 different versions of UK Biobank are available. See here for details. | Select | 1000G Phase 3 EUR | - |
| Include variants from reference panel | Mandatory | If Yes, all SNPs in strong LD with any of the independent significant SNPs, including non-GWAS-tagged SNPs, will be included and used for gene mapping. | Yes/No | Yes | - |
| Minimum MAF (≥) | Mandatory | The minimum Minor Allele Frequency to be included in annotation and prioritization. MAF is computed based on the 1000 genomes reference panel (Phase 3). This filter also applies to lead SNPs. If there are any pre-defined lead SNPs with MAF less than this threshold, those SNPs will be skipped. | numeric | 0.01 | higher: decrease #candidate SNPs. lower: increase #candidate SNPs. |
| Maximum distance of LD blocks to merge (≤) | Mandatory | This is the maximum distance between LD blocks of independent significant SNPs to merge into a single genomic locus. When this is set at 0, only physically overlapping LD blocks are merged. Defining genomic loci does not affect which SNPs fulfil the selection criteria to be used for annotation and prioritization. It only results in a different number of reported risk loci, which can be desired when certain loci are partly overlapping or physically very close. | numeric | 250kb | higher: decrease #genomic loci. lower: increase #genomic loci. |
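
To make the effect of "Maximum distance of LD blocks to merge" concrete, here is a minimal interval-merging sketch. It assumes all blocks lie on one chromosome and that distance is measured between block boundaries; `merge_ld_blocks` and the coordinates are illustrative, not FUMA's code.

```python
def merge_ld_blocks(blocks, max_distance=250_000):
    """Merge (start, end) blocks on one chromosome whose gap is <= max_distance."""
    loci = []
    for start, end in sorted(blocks):
        if loci and start - loci[-1][1] <= max_distance:
            # Close enough to the previous locus (or overlapping): extend it.
            loci[-1][1] = max(loci[-1][1], end)
        else:
            loci.append([start, end])
    return [tuple(locus) for locus in loci]


blocks = [(100_000, 200_000), (400_000, 500_000), (1_000_000, 1_100_000)]
```

With the default of 250kb, the first two blocks (200kb apart) form one locus and the third stays separate. With `max_distance=0`, only touching or overlapping blocks are merged, so all three remain separate loci.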

#### 3. Parameters for gene mapping

There are two options for gene mapping: positional and eQTL mapping. By default, positional mapping with a maximum distance of 10kb is performed. Since the parameters in this section largely affect the resulting mapped genes, please set them carefully.

#### 3.1 Positional mapping

| Parameter | Mandatory | Description | Type | Default | Direction |
| --- | --- | --- | --- | --- | --- |
| Positional mapping | Optional | Check this option to perform positional mapping. Positional mapping is based on ANNOVAR annotations, by specifying either the maximum distance between SNPs and genes or functional consequences of SNPs on genes. These parameters can be specified in the option below. | Check | Checked | - |
| Distance to genes or functional consequences of SNPs on genes to map | Mandatory if positional mapping is activated | The positional mapping criterion maps SNPs to genes based either on physical distance or on functional consequences of SNPs on genes. When a maximum distance is provided, SNPs are mapped to genes within the user defined maximum distance. Alternatively, specific functional consequences of SNPs on genes can be selected, which filters the SNPs that are mapped to genes. Note that when functional consequences are selected, all SNPs are located on the gene body (distance 0), except upstream and downstream SNPs, which are up to 1kb away from the TSS or TSE. When the maximum distance is set at > 0kb and < 1kb, all upstream and downstream SNPs are included since the actual distance is not provided by ANNOVAR. Therefore, a maximum distance > 0kb and < 1kb is the same as a maximum distance of 1kb. For SNPs located in a genomic region where multiple genes overlap, ANNOVAR has its own prioritization criteria to report the most deleterious function. For those SNPs, only prioritized annotations are used. | Integer / Multiple selection | Maximum distance 10 kb | - |

#### 3.2 eQTL mapping

| Parameter | Mandatory | Description | Type | Default | Direction |
| --- | --- | --- | --- | --- | --- |
| eQTL mapping | Optional | Check this option to perform eQTL mapping. eQTL mapping maps SNPs to genes whose expression they likely affect, up to 1 Mb apart (cis-eQTL). eQTLs are highly tissue specific, and tissue types can be selected in the following option. eQTL mapping can be used together with positional mapping. | Check | Unchecked | - |
| Tissue types | Mandatory if eQTL mapping is CHECKED | All available tissue types with data sources are shown in the select boxes. From FUMA v1.3.0, GTEx v7 became available, but GTEx v6 is kept available. Therefore, when "all" is selected, both GTEx v6 and v7 are used for mapping. For details of the eQTL data resources, please refer to the eQTL section in this tutorial. | Multiple selection | none | - |
| eQTL maximum P-value (≤) | Optional | The P-value threshold of eQTLs. Two options are available: Use only significant snp-gene pairs, or a nominal P-value threshold. When Use only significant snp-gene pairs is checked, only eQTLs with FDR ≤ 0.05 will be used. Otherwise, the defined nominal P-value is used to filter eQTLs. Some eQTL data sources only contain eQTLs below a certain FDR threshold. Please refer to the eQTLs section for details of each data source. | Check / Numeric | Checked / 1e-3 | lower: decrease #eQTLs and #mapped genes. higher: increase #eQTLs and #mapped genes. |

#### 3.3 Chromatin interaction mapping

| Parameter | Mandatory | Description | Type | Default | Direction |
| --- | --- | --- | --- | --- | --- |
| Chromatin interaction mapping | Optional | Check this option to perform chromatin interaction mapping. | Check | Unchecked | - |
| Built-in chromatin interaction data | Optional | Built-in chromatin interaction data can be selected in this option. Details of the available built-in data are given in the Chromatin interactions section in this tutorial. | Multiple selection | none | - |
| Custom chromatin interaction matrices | Optional | In addition to the built-in chromatin interaction data, users can upload custom data. The data should be pre-computed chromatin loops with significance (ideally FDR, but another score can be used, see the Chromatin interactions section for details). The file should be gzipped and named as "(name-of-data).txt.gz". Multiple files can be uploaded. For each data set, users can also provide the data type, such as Hi-C, ChIA-PET or 5C, which is not mandatory but will be used in the result table and regional plot. The file format is described in the Chromatin interactions section in this tutorial. Please avoid uploading more than one file with the same file name. In that case, the files are overwritten by the last uploaded one. | | | |
| FDR threshold (≤) | Mandatory if chromatin interaction mapping is CHECKED | FDR threshold for significant loops. The default value is set at 1e-6, as suggested by Schmitt et al. (2016). This threshold is applied to both built-in and user uploaded chromatin loops. | Numeric | 1e-6 | lower: decrease #chromatin interactions and #mapped genes. higher: increase #chromatin interactions and #mapped genes. |
| Promoter region window | Mandatory if chromatin interaction mapping is CHECKED | Promoter regions of genes to map in significantly interacting regions. The input format should be "(upstream bp)-(downstream bp)" from the transcription start site (TSS). For example, the default "250-500" means that promoter regions are defined as 250bp upstream and 500bp downstream of the TSS. In chromatin interaction mapping, genes whose user defined promoter regions overlap with the significantly interacting regions will be mapped. Please refer to the Chromatin interactions section in this tutorial for details. | text | 250-500 | larger: increase #mapped genes. smaller: decrease #mapped genes. |
| Annotate enhancer/promoter regions (Roadmap 111 epigenomes) | Optional | Predicted enhancer and promoter regions from the Roadmap Epigenomics project for 111 epigenomes can be annotated to significantly interacting regions. If no epigenome is selected, enhancer and promoter regions are not annotated. Annotated enhancer/promoter regions can be used to filter SNPs and mapped genes in the next two options. | Multiple selection | none | - |
| Filter SNPs by enhancers | Optional | This option is only available when at least one epigenome is selected in the previous option to annotate enhancer/promoter regions. When this option is checked, only SNPs that overlap with one of the annotated enhancer regions are used for chromatin interaction mapping. Please refer to the Chromatin interactions section in this tutorial for details. | Check | Unchecked | - |
| Filter genes by promoters | Optional | This option is only available when at least one epigenome is selected in the previous option to annotate enhancer/promoter regions. When this option is checked, chromatin interaction mapping is only performed for genes whose promoter regions overlap with one of the annotated promoter regions. Please refer to the Chromatin interactions section in this tutorial for details. | Check | Unchecked | - |
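
The "Promoter region window" format can be read as in the sketch below, which turns a window string into coordinates around a TSS and tests overlap with an interacting region. The strand handling is an assumption of this sketch (the text above only defines upstream and downstream relative to the TSS), and both function names are hypothetical.

```python
def promoter_region(tss, window="250-500", strand="+"):
    """Return (start, end) of the promoter region around a transcription start site.

    `window` uses the "(upstream bp)-(downstream bp)" format described above.
    Strand handling is an assumption made for this sketch.
    """
    upstream, downstream = (int(part) for part in window.split("-"))
    if strand == "+":
        return tss - upstream, tss + downstream
    # On the reverse strand, upstream lies at higher coordinates.
    return tss - downstream, tss + upstream


def overlaps(region, other):
    """True when two closed intervals share at least one position."""
    return region[0] <= other[1] and other[0] <= region[1]
```

For a forward-strand gene with its TSS at position 10,000, the default window gives the region 9,750 to 10,500; a gene is a mapping candidate when this region overlaps a significantly interacting region.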

#### 3.4 Functional annotation filtering

Positional, eQTL and chromatin interaction mappings have the following options separately, for the filtering of SNPs based on functional annotation. All filters below apply to selected SNPs in LD with independent significant SNPs that are used to prioritize genes and influence the number of SNPs that are mapped to genes, and consequently influence the number of prioritized genes.

| Parameter | Mandatory | Description | Type | Default | Direction |
| --- | --- | --- | --- | --- | --- |
| CADD score | Optional | Check this if you want to filter SNPs by CADD score. This applies to selected SNPs in LD with independent significant SNPs that are used to prioritize genes. The CADD score is a score of deleteriousness of SNPs predicted from 63 functional annotations. 12.37 is the threshold for deleteriousness suggested by Kircher et al. (2014). Please refer to the original publication for details from links. | Check | Unchecked | - |
| Minimum CADD score (≥) | Mandatory if CADD score is checked | The higher the CADD score, the more deleterious. | numeric | 12.37 | higher: fewer SNPs will be mapped to genes. lower: more SNPs will be mapped to genes. |
| RegulomeDB score | Optional | Check this if you want to filter SNPs by RegulomeDB score. This applies to selected SNPs in LD with independent significant SNPs that are used to prioritize genes. The RegulomeDB score is a categorical score representing regulatory functionality of SNPs based on eQTLs and chromatin marks. Please refer to the original publication for details from links. | Check | Unchecked | - |
| Minimum RegulomeDB score (≥) | Mandatory if RegulomeDB score is checked | The RegulomeDB score is a categorical score from 1a to 7. Score 1a means that those SNPs are most likely affecting regulatory elements, and 7 means that those SNPs do not have any annotations. SNPs are recorded as NA if they are not present in the database. SNPs with NA will not be included for filtering on RegulomeDB score. | string | 7 | higher: more SNPs will be mapped to genes. lower: fewer SNPs will be mapped to genes. |
| 15-core chromatin state | Optional | Check this if you want to filter SNPs by chromatin state. This applies to selected SNPs in LD with independent significant SNPs that are used to prioritize genes. The chromatin state represents accessibility of genomic regions (every 200bp) with 15 categorical states predicted by ChromHMM based on 5 chromatin marks for 127 epigenomes. | Check | Unchecked | - |
| 15-core chromatin state tissue/cell types | Mandatory if 15-core chromatin state is checked | Multiple tissue/cell types can be selected from the list. | Multiple selection | none | - |
| Maximum state of chromatin (≤) | Mandatory if 15-core chromatin state is checked | The maximum state to filter SNPs. Between 1 and 15. Generally, between 1 and 7 is an open state. | numeric | 7 | higher: more SNPs will be mapped to genes. lower: fewer SNPs will be mapped to genes. |
| Method for 15-core chromatin state filtering | Mandatory if 15-core chromatin state is checked | When multiple tissue/cell types are selected: any (keeps SNPs whose state meets the threshold in any of the selected tissue/cell types), majority (keeps SNPs whose state meets the threshold in the majority (≥50%) of the selected tissue/cell types), or all (keeps SNPs whose state meets the threshold in all of the selected tissue/cell types). | Selection | any | - |
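
The three filtering methods for the 15-core chromatin state can be sketched as follows. This assumes that a SNP "meets the threshold" in a tissue/cell type when its state is at most the chosen maximum state, in line with the (≤) on that parameter; `passes_chromatin_filter` is a hypothetical name and the states are made-up values.

```python
def passes_chromatin_filter(states, max_state=7, method="any"):
    """Decide whether a SNP passes, given its state in each selected tissue/cell type."""
    # Assumption: a state "meets the threshold" when it is <= max_state.
    hits = sum(1 for state in states if state <= max_state)
    if method == "any":
        return hits >= 1
    if method == "majority":
        return hits >= 0.5 * len(states)
    if method == "all":
        return hits == len(states)
    raise ValueError(f"unknown method: {method}")


states = [2, 9, 15, 5]  # one 15-core state per selected epigenome
```

For these four epigenomes, two states are at or below 7, so the SNP passes with `any` and `majority` (exactly 50%) but not with `all`.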

#### 4. Gene types

Biotype of genes to map can be selected. Please refer to Ensembl for details of biotypes.

| Parameter | Mandatory | Description | Type | Default |
| --- | --- | --- | --- | --- |
| Gene type | Mandatory | Gene type to map. This is based on gene_biotype obtained from BioMart of Ensembl build 85. Please see here for details. | Multiple selection | Protein coding genes |

#### 5. MHC region

The MHC region is often excluded due to its complicated LD structure. Therefore, this option is checked by default. Please uncheck to include MHC region. Note that it doesn't change any results if there is no significant hit in the MHC region.

| Parameter | Mandatory | Description | Type | Default |
| --- | --- | --- | --- | --- |
| Exclude MHC region | Optional | Check if you want to exclude the MHC region. The default region is defined as between the "MOG" and "COL11A2" genes. | Check | Checked |
| Options for excluding MHC region | Optional | The MHC region can be excluded from annotations only, from MAGMA gene analysis only, or from both, by selecting this option. | Select | Only from annotations |
| Extended MHC region | Optional | User specified MHC region to exclude (for an extended or shorter region). The input format should be like "25000000-34000000" on hg19. | Text | Null |

#### 6. MAGMA analysis

MAGMA gene and gene-set analyses are performed for the input summary statistics by default, but users can choose to omit the MAGMA process, which reduces the run time of the SNP2GENE process. Gene expression data sets for MAGMA gene expression analysis can also be selected here.

| Parameter | Mandatory | Description | Type | Default |
| --- | --- | --- | --- | --- |
| Perform MAGMA | Optional | UNCHECK to SKIP MAGMA analyses. | Check | Checked |
| MAGMA gene annotation window | Mandatory when MAGMA is active | The window around genes used to assign SNPs (symmetric), e.g. when 5kb is selected, SNPs within a 5kb window of a gene (both sides) will be assigned to that gene. The available options are 0, 5, 10, 15 and 20kb. | Select | 0kb from both sides of the genes |
| MAGMA gene expression analysis | Mandatory when MAGMA is active | Gene expression data sets used for MAGMA gene-property analysis to test for a positive association between genetic associations and gene expression in a given label. | Select | GTEx v6 |

#### 7. Title of job submission

A title for the job submission can be provided above the "Submit Job" button. This is not mandatory, but it is useful for keeping track of your jobs.

### Outputs of SNP2GENE

Once your job is completed, you will receive an email. Unless an error occurred during the process, the email includes the link to the results page (this again requires login). You can also access the results page from the My Jobs list.

#### 1. Genome-wide plots

This panel displays Manhattan plots and Q-Q plots for both the GWAS summary statistics (input file) and the gene-based association test.
In addition, MAGMA-based gene-set P-values are provided. Note that MAGMA gene-set analysis uses the full distribution of SNP P-values and is different from a pathway enrichment test, which only tests for enrichment of low P-values or enrichment of prioritized genes.

Plots for GWAS summary statistics
For plotting purposes, overlapping data points are filtered out based on the following criteria.

• Manhattan plot: Overlapping data points (SNPs) were filtered out such that there is only one data point per pixel, but only when the average number of data points per pixel (x-axis) across the y-axis is above 1. For each pixel, the plotted data point was randomly selected. This filtering was only performed for SNPs with P-value ≥ 1e-5.
• Q-Q plot: Overlapping data points (SNPs) were filtered such that only one data point per pixel was kept. For each pixel, the plotted data point was randomly selected. This filtering was only performed for SNPs with P-value ≥ 1e-5.
The MHC region is shown in this Manhattan plot even if the option is set to exclude the MHC region from annotations.

Plots for gene-based test (MAGMA)
Gene analysis was performed using MAGMA (v1.6) with default settings. SNPs were assigned to the genes obtained from Ensembl build 85 (only protein-coding genes).
Genome-wide significance (red dashed line) was set at 0.05 / (the number of tested genes). Genes whose P-value reached genome-wide significance can be labeled in the Manhattan plot. The number of genes to label can be controlled by typing the number at the upper left side of the plot. MAGMA results are available from the download panel.
When the option to exclude the MHC region from MAGMA gene analysis is selected, the MAGMA results do not include the MHC region, and therefore the Manhattan plot does not display genes in the MHC region either.
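
As a worked example of the threshold described above (0.05 divided by the number of tested genes), here is a minimal sketch. The function names and the gene labels in the usage note are hypothetical, and treating "reached" as strictly below the threshold is an assumption.

```python
def gene_significance_threshold(n_tested_genes, alpha=0.05):
    """Bonferroni-style threshold: alpha / number of tested genes."""
    if n_tested_genes <= 0:
        raise ValueError("n_tested_genes must be positive")
    return alpha / n_tested_genes


def significant_genes(gene_pvalues, alpha=0.05):
    """Return (gene, P) pairs below the threshold, most significant first.

    gene_pvalues: dict mapping gene name to gene-based P-value.
    Assumption: a gene is counted as significant when P < threshold.
    """
    threshold = gene_significance_threshold(len(gene_pvalues), alpha)
    hits = [(gene, p) for gene, p in gene_pvalues.items() if p < threshold]
    return sorted(hits, key=lambda item: item[1])
```

For instance, with 20,000 tested genes the threshold is 0.05 / 20,000 = 2.5e-6.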

MAGMA Gene-Set Analysis
Using the result of the gene analysis (gene-level P-value), (competitive) gene-set analysis is performed with default parameters with MAGMA v1.6. Gene sets were obtained from MSigDB v5.2 for "Curated gene sets" and "GO terms".

MAGMA Tissue Expression Analysis (FUMA v1.1.0)
To test the (positive) relationship between highly expressed genes in a specific tissue and genetic associations, gene-property analysis is performed using average expression of genes per tissue type as a gene covariate. Gene expression values are log2 transformed average RPKM per tissue type after winsorization at 50, based on GTEx RNA-seq data. Tissue expression analysis is performed for 30 general tissue types and 53 specific tissue types separately. The analysis uses the result of the gene analysis (gene-based P-value) and is one-sided (greater), conditioning on average expression across all tissue types.

#### 2. Summary of results

This panel shows a general summary of the results based on your GWAS input. Images are downloadable in several formats.

• Summary of SNPs and mapped genes
• #Genomic risk loci: The number of genomic risk loci defined from independent significant SNPs by merging LD blocks if they are closer than the user defined distance. A genomic risk locus can contain multiple lead SNPs and/or independent significant SNPs.
• #lead SNPs: The number of lead SNPs identified from independent significant SNPs, which are independent of each other at r2 0.1.
• #independent significant SNPs: The number of independent significant SNPs which reached the user defined genome-wide significant P-value and are independent of each other at the user defined r2.
• #candidate SNPs: The number of candidate SNPs which are in LD (at the given r2) with one of the independent significant SNPs. This includes non-GWAS tagged SNPs which are extracted from the 1000 Genomes reference panel. When SNPs are filtered based on functional annotation for gene mapping, this number refers to the number of SNPs before the functional filtering.
• #candidate GWAS tagged SNPs: The number of candidate SNPs (described above) which are tagged in GWAS (i.e. exist in your input file).
• #mapped genes: The number of genes mapped based on the user-defined parameters.
• Positional annotation of candidate SNPs
This is a histogram of the number of SNPs per functional consequence on genes. When SNPs have more than one (different) annotation, they are counted for each annotation. Some SNPs may be assigned NA; this may be because their alleles do not match the fasta files of ANNOVAR Ensembl genes.
• Summary per genomic locus
This histogram displays the size of genomic risk loci, the number of candidate SNPs, the number of prioritized genes and the number of genes physically located within the genomic locus.

#### 3. Result tables

This panel contains multiple tables of the results. Each column is described in Table columns.

By clicking one of the rows of tables of genomic risk loci, lead SNPs or independent significant SNPs, FUMA will create regional plots of candidate SNPs (GWAS P-value). Optionally, regional plots with genes and functional annotations can be created from the panel at the bottom of the page.

Regional plots can be created with the following optional annotations:
• GWAS association statistics: input P-value
• RegulomeDB score
• 15-core chromatin state: tissue/cell types have to be selected.
• eQTLs: This option is only available when eQTL mapping is performed. eQTLs are plotted per gene and colored per tissue types.
• chromatin interactions: This option is only available when chromatin mapping is performed. Interactions are plotted per data set.

When chromatin interaction mapping is performed, circos plots are created for each chromosome that contains at least one risk locus. The circos plots are displayed in the panel where the chromatin interaction tables are displayed. Each plot is clickable and opens in a new tab showing a larger plot. PNG, SVG and circos config files are downloadable.
All chromatin interactions overlapping with any of risk loci (including interactions that do not map to genes) will be shown in the circos plot.

The specific layers and color-coding of the circos plot are described below.

• Manhattan plot: The outermost layer. Only SNPs with P < 0.05 are displayed. SNPs in genomic risk loci are color-coded as a function of their maximum r2 to any of the independent significant SNPs in the locus, as follows: red (r2 > 0.8), orange (r2 > 0.6), green (r2 > 0.4) and blue (r2 > 0.2). SNPs that are not in LD with any of the independent significant SNPs (with r2 ≤ 0.2) are grey.
The rsIDs of the top SNPs in each risk locus are displayed in the outermost layer. The y-axis ranges from 0 to the maximum -log10(P-value) of the SNPs.
• Chromosome ring: The second layer. Genomic risk loci are highlighted in blue.
• Mapped genes by chromatin interactions or eQTLs: Only mapped genes by either chromatin interaction and/or eQTLs (conditional on user defined parameters) are displayed. If the gene is mapped only by chromatin interactions or only by eQTLs, it is colored orange or green, respectively. When the gene is mapped by both, it is colored red.
• Chromosome ring: The third layer. This is the same as second layer but without coordinates to make it easy to align position of genes with genomic coordinate.
• Chromatin interaction links: Links colored orange are chromatin interactions. Since v1.2.7, only the interactions used for mapping based on user defined parameters are displayed.
• eQTL links: Links colored green are eQTLs. Since v1.2.7, only the eQTLs used for mapping based on user defined parameters are displayed.
Since creating a circos plot might take a long time with a large number of points and links, the maximum numbers of points and links are limited to 50,000 and 10,000 per plot (chromosome), respectively, in the default plot. Therefore, if there are more than 50,000 SNPs with P-value < 0.05 in a chromosome, the top 50,000 SNPs (sorted by P-value) are displayed in the plot. The same applies to eQTLs and chromatin interactions, e.g. if there are more than 10,000 eQTLs in a chromosome, the top 10,000 eQTLs (sorted by P-value for eQTLs, FDR for chromatin interactions) are displayed in the plot. These limits can be adjusted by downloading the config file and re-creating the input text files for SNPs and links. Please refer to the GitHub repository FUMA circos plot for details.
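
The color coding of the Manhattan layer and the cap on displayed points can be illustrated with a short sketch. The function names and the `(rsID, P-value)` input format are assumptions made for this example; the thresholds are the ones listed above.

```python
def r2_color(max_r2):
    """Color of a SNP in a genomic risk locus, given its maximum r2
    to the independent significant SNPs of that locus."""
    if max_r2 > 0.8:
        return "red"
    if max_r2 > 0.6:
        return "orange"
    if max_r2 > 0.4:
        return "green"
    if max_r2 > 0.2:
        return "blue"
    return "grey"  # r2 <= 0.2: not in LD with any independent significant SNP


def top_points(snps, limit=50000, p_max=0.05):
    """Keep SNPs with P < p_max, sorted by P-value, capped at `limit`.

    snps: list of (rsid, p_value) tuples for one chromosome.
    """
    shown = sorted((s for s in snps if s[1] < p_max), key=lambda s: s[1])
    return shown[:limit]
```

Note that the thresholds are strict, so a SNP with r2 of exactly 0.8 falls in the orange class.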

All results are downloadable as text files. Columns are described in Table columns. A README file is also included in the zip file.
When the SNP table is selected for download, ld.txt will also be included in the zip file. This file contains the r2 values computed from the selected reference panel for all pairs of an independent significant SNP and each of the SNPs in LD with it.

### Table Columns

• Genomic risk loci

• Genomic locus : Index of genomic risk loci.
• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• rsID : rsID of the top lead SNP based on dbSNP build 146.
• chr : chromosome of top lead SNP
• pos : position of top lead SNP on hg19
• P-value : P-value of top lead SNP (from the input file).
• start : Start position of the locus
• end : End position of the locus
• nSNPs : The number of unique candidate SNPs in the genomic locus, including non-GWAS-tagged SNPs (which are available in the user selected reference panel). Candidate SNPs are all SNPs that are in LD (given the user-defined r2) with any of the independent significant SNPs and either have a P-value below the user defined threshold or are only available in 1000G.
• nGWASSNPs : The number of unique GWAS-tagged candidate SNPs in the genomic locus which is available in the GWAS summary statistics input file. This is a subset of "nSNPs".
• nIndSigSNPs : The number of the independent (at user defined r2) significant SNPs in the genomic locus.
• IndSigSNPs : rsID of the independent significant SNPs in the genomic locus.
• nLeadSNPs : The number of lead SNPs in the genomic locus. Lead SNPs are a subset of independent significant SNPs at r2 0.1.
• lead SNPs

• No : Index of lead SNPs
• Genomic Locus : Index of assigned genomic locus matched with "Genomic risk loci" table. Multiple lead SNPs can be assigned to the same genomic locus.
• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• rsID : rsID based on dbSNP build 146.
• chr : chromosome
• pos : position on hg19
• P-value : P-value (from the input file).
• nIndSigSNPs : Number of independent significant SNPs which are in LD with the lead SNP at r2 0.1.
• IndSigSNPs : rsID of independent significant SNPs which are in LD with the lead SNP at r2 0.1.
• independent significant SNPs (Independent significant SNPs)

All independent significant SNPs identified by FUMA.

• No : Index of independent significant SNPs
• Genomic Locus : Index of assigned genomic locus matched with "Genomic risk loci" table. Multiple independent significant SNPs can be assigned to the same genomic locus.
• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• rsID : rsID based on dbSNP build 146.
• chr : chromosome
• pos : position on hg19
• P-value : P-value (from the input file).
• nSNPs : The number of SNPs in LD with the independent significant SNP given r2, including non-GWAS-tagged SNPs (which are extracted from 1000G).
• nGWASSNPs : The number of GWAS-tagged SNPs in LD with the independent significant SNP given r2. This is a subset of "nSNPs".
• SNPs

All candidate SNPs (SNPs which are in LD with any of the independent significant SNPs) with annotations. Note that, depending on your mapping criteria, not all candidate SNPs displayed in this table are mapped to genes.

• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• rsID : rsID based on dbSNP build 146.
• chr : chromosome
• pos : position on hg19
• effect_allele : Effect/risk allele if it is provided in the input GWAS summary statistics file. If not, this is the alternative (minor) allele in 1000G.
• non_effect_allele : Non-effect/non-risk allele if it is provided in the input GWAS summary statistics file. If not, this is the reference (major) allele in 1000G.
• MAF : Minor allele frequency computed based on 1000G.
• gwasP : P-value provided in the input GWAS summary statistics file. Non-GWAS tagged SNPs (which do not exist in input file but are extracted from the reference panel) have "NA" instead.
• or : Odds ratio provided in the input GWAS summary statistics file if available. Non-GWAS tagged SNPs (which do not exist in input file but are extracted from the reference panel) have "NA" instead.
• beta : Beta provided in the input GWAS summary statistics file if available. Non-GWAS tagged SNPs (which do not exist in input file but are extracted from the reference panel) have "NA" instead.
• se : Standard error provided in the input GWAS summary statistics file if available. Non-GWAS tagged SNPs (which do not exist in input file but are extracted from the reference panel) have "NA" instead.
• r2 : The maximum r2 of the SNP with one of the independent significant SNPs.
• IndSigSNP : rsID of the independent significant SNP which has the maximum r2 with the SNP.
• Genomic locus : Index of the genomic risk loci matching with "Genomic risk loci" table.
• nearestGene : The nearest gene of the SNP based on ANNOVAR annotations. Note that ANNOVAR annotates the "consequence" function by prioritizing the most deleterious annotation for SNPs which are located in a genomic region where multiple genes overlap. Genes are encoded as gene symbols if available, otherwise as Ensembl IDs. Genes include all transcripts from Ensembl gene build 85, including non-protein coding genes and RNAs.
• dist : Distance to the nearest gene. SNPs which are located in the gene body or within 1kb up- or down-stream of the TSS or TES have 0.
• func : Functional consequence of the SNP on the gene obtained from ANNOVAR. For exonic SNPs, detailed annotation (e.g. non-synonymous, stop gain and so on) is available in the ANNOVAR table (annov.txt).
• CADD : CADD score which is computed based on 63 annotations. The higher the score, the more deleterious the SNP is. 12.37 is the threshold suggested by Kircher et al. (2014).
• RDB : RegulomeDB score which is a categorical score (from 1a to 7). 1a is the highest score for SNPs with the most biological evidence to be a regulatory element.
• minChrState : The minimum 15-core chromatin state across 127 tissue/cell types.
• commonChrState : The most common 15-core chromatin state across 127 tissue/cell types.
• posMapFilt : Whether the SNP was used for positional mapping or not. 1 is used, otherwise 0. When positional mapping is not performed, all SNPs have 0.
• Complete annotations of 15-core chromatin state (for every 127 epigenomes) are available in the "annot.txt" from download.
• ANNOVAR

Since one SNP can be annotated to multiple positional information, the table of ANNOVAR output is separated from SNPs table. This table contains unique SNP-annotation combinations.

• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• chr : chromosome
• pos : position on hg19
• Gene : ENSG ID
• Symbol : Gene Symbol
• Distance : Distance to the gene
• Function : Functional consequence on the gene
• Exonic function : Functional annotation of exonic SNPs
• Exon : Index of exon
• Mapped genes

The genes which are mapped by SNPs in the SNPs table based on user-defined mapping parameters. Columns with posMap, eqtlMap or ciMap in the parentheses are only available when positional, eQTL or chromatin interaction mapping is performed, respectively.

• Gene : ENSG ID
• Symbol : Gene Symbol
• entrezID : entrez ID
• Genomic locus : Index of genomic loci where mapped SNPs are from. This could contain more than one interval in the case that eQTLs are mapped to genes from distinct genomic risk loci.
• chr : chromosome
• start : Starting position of the gene
• end : Ending position of the gene
• strand : Strand of gene
• status : Status of gene from Ensembl
• type : Gene biotype from Ensembl
• HUGO : HUGO (HGNC) gene symbol
• pLI : pLI score from ExAC database. The probability of being loss-of-function intolerant. The higher the score is, the more intolerant to loss-of-function mutations the gene is.
• ncRVIS : Non-coding residual variation intolerance score. The higher the score is, the more intolerant to noncoding variants the gene is.
• posMapSNPs (posMap): The number of SNPs mapped to gene based on positional mapping (after functional filtering if parameters are given).
• posMapMaxCADD (posMap): The maximum CADD score of mapped SNPs by positional mapping.
• eqtlMapSNPs (eqtlMap): The number of SNPs mapped to the gene based on eQTL mapping.
• eqtlMapminP (eqtlMap): The minimum eQTL P-value of mapped SNPs.
• eqtlMapminQ (eqtlMap): The minimum eQTL FDR of mapped SNPs.
• eqtlMapts (eqtlMap): Tissue types of mapped eQTL SNPs.
• eqtlDirection (eqtlMap): Consecutive direction of mapped eQTL SNPs after aligning risk increasing alleles in GWAS and tested alleles in eQTL data source.
• ciMap (ciMap): "Yes" if the gene is mapped by chromatin interaction mapping.
• ciMapts (ciMap): Tissue/cell types of mapped chromatin interactions.
• minGwasP : The minimum P-value of mapped SNPs.
• IndSigSNPs : rsIDs of all the independent significant SNPs of mapped SNPs.
• eQTL

This table is only shown when eQTL mapping is performed. The table contains unique pairs of SNP-gene-tissue, therefore, a SNP could appear multiple times.

• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• chr : chromosome
• pos : position on hg19
• DB : Data source of eQTLs. Currently GTEx, BloodeQTL, BIOS and BRAINEAC are available. Please refer to the eQTL section for details.
• tissue : Tissue type
• Gene : ENSG ID
• Symbol : Gene symbol
• P-value : P-value of eQTLs
• FDR : FDR of eQTLs. Note that the method to compute FDR differs between data sources. Please refer to the eQTL section for details.
• signed_stats : Signed statistics, the actual value depends on the data source. Please refer to the eQTL section for details.
• RiskIncAllele : Risk increasing allele obtained from the input GWAS summary statistics.
"NA" if signed effect is not provided in the input file. SNPs which are not in the input GWAS but included from reference panel are also encoded as "NA".
• alignedDirection : The direction of effect to gene expression after aligning risk increasing allele of GWAS and tested allele of eQTLs.
• Chromatin interaction (Chromatin interactions tab)

This file is only available when chromatin interaction mapping is performed. The file contains the significant interactions from the user selected or user uploaded data sets, after filtering.

• GenomicLocus : Index of genomic loci where the significant interaction is overlapped.
• region1 : One end of a significant chromatin interaction, which overlaps with at least one candidate SNP in one of the genomic risk loci.
• region2 : The other end of significant chromatin interaction. This region could be located outside the risk loci.
• FDR : FDR of interaction.
• type : Type of chromatin interaction data, e.g. Hi-C or ChIA-PET
• DB : The name of data source.
• tissue/cell : Tissue or cell type of the interaction.
• intra/inter : Intra- or Inter-chromosomal interaction.
• SNPs : rsID of candidate SNPs which are overlapping with the region 1.
• genes : ENSG ID of genes whose promoter regions are overlapped with region 2.
• SNPs and overlapped regulatory elements in region 1 (Chromatin interaction tab)

This file is only available when chromatin interaction mapping is performed. The file contains candidate SNPs which overlap with one end (region 1) of significant chromatin interactions and with enhancer regions of the user selected epigenomes. If no epigenome was selected, this file is empty.

• uniqID : Unique ID of SNPs consisting of chr:position:allele1:allele2 where alleles are alphabetically ordered.
• rsID : rsID based on dbSNP build 146
• chr : chromosome
• pos : position on hg19
• reg_region : Predicted enhancer or dyadic regions
• type : enh for enhancer and dyadic for dyadic enhancer/promoter regions
• tissue/cell : EID of 111 Roadmap epigenomes
• Regulatory elements and genes in region 2 (Chromatin interaction tab)

This file is only available when chromatin interaction mapping is performed. The file contains promoter regions of the user selected epigenomes (if any were selected) and the genes whose promoter regions overlap with them. If no epigenome was selected, this file is empty.

• region2 : region 2 in the chromatin interaction table
• reg_region : Predicted promoter or dyadic regions
• type : prom for promoter and dyadic for dyadic enhancer/promoter regions
• tissue/cell : EID of 111 Roadmap epigenomes
• genes : genes whose promoter regions are overlapped with region2
• GWAScatalog

List of SNPs reported in GWAScatalog which are candidate SNPs of your GWAS summary statistics.

• Genomic locus : Index of genomic risk loci.
• IndSigSNP : One of the independent significant SNPs of the SNP in GWAScatalog.
• chr : chromosome
• bp : position on hg19
• snp : rsID of reported SNP in GWAS catalog
• PMID : PubMed ID
• Trait : The trait reported in GWAScatalog
• FirthAuth : First author reported in GWAScatalog
• Date : Date added in GWAScatalog
• P-value : Reported P-value
• Parameters

The table of input parameters. The downloadable file is a config file with INI format.

• [jobinfo]
• created_at : Date of job created
• title : Job title
• [inputfiles]
• gwasfile : File name of GWAS summary statistics
• addleadSNPs : 1 if option is checked, 0 otherwise. If pre-defined lead SNPs are not provided, it is always 1.
• regionsfile : File name of pre-defined genetic regions if provided.
• **col : The column names of input GWAS summary statistics file if provided.
• [params]
• N : Sample size of GWAS
• exMHC : 1 to exclude MHC region, 0 otherwise
• extMHC : user defined MHC region if provided, NA otherwise
• genetype : All selected gene type.
• leadP : the maximum threshold of P-value to be lead SNP
• r2 : the minimum threshold for SNPs to be in LD of the lead SNPs
• gwasP : the maximum threshold of P-value to be candidate SNP
• pop : The population of reference panel
• MAF : the minimum minor allele frequency based on 1000 genome reference of given population
• Incl1KGSNPs : 1 to include non-GWAS-tagged SNPs from reference panel, 0 otherwise
• mergeDist : The maximum distance between LD blocks to merge into interval
• [posMap]
• posMap : 1 to perform positional mapping, 0 otherwise
• posMapWindowSize : If provided, this distance is used as the maximum distance between SNPs to genes. Otherwise "NA".
• posMapAnnot : Functional consequences of SNPs on genes to map.
• posMapRDBth : The minimum RegulomeDB score for SNP filtering
• posMapChr15 : Select tissue/cell types, NA otherwise
• posMapChr15Max : The maximum 15-core chromatin state
• posMapChr15Meth : The method of chromatin state filtering
• [eqtlMap]
• eqtlMap : 1 to perform eQTL mapping, 0 otherwise
• eqtlMaptss : Selected tissue types for eQTL mapping
• eqtlMapSig : 1 to use only significant snp-gene pairs, 0 otherwise
• eqtlMapP : The P-value threshold for eQTLs if the eqtlMap "significant only" option is not selected.
• eqtlMapRDBth : The minimum RegulomeDB score for SNP filtering
• eqtlMapChr15 : Select tissue/cell types, NA otherwise
• eqtlMapChr15Max : The maximum 15-core chromatin state
• eqtlMapChr15Meth : The method of chromatin state filtering
• [ciMap]
• ciMap : 1 to perform chromatin interaction mapping, 0 otherwise
• ciMapBuiltin : Selected builtin chromatin interaction data
• ciMapFileN : The number of uploaded chromatin interaction matrices
• ciMapFiles: File names of uploaded chromatin interaction matrices
• ciMapFDR : The FDR threshold of chromatin interactions
• ciMapPromWindow : Window of the promoter regions from TSS. 250-500 means, 250bp up- and 500bp down-stream of TSS region is defined as promoter.
• ciMapRoadmap : Select epigenome ID of roadmap epigenomes for annotation of promoter/enhancer regions
• ciMapEnhFilt : 1 to keep only SNPs that overlap with annotated enhancer regions of the selected epigenomes, 0 otherwise
• ciMapPromFilt: 1 to keep only mapped genes whose promoter regions overlap with annotated promoter regions of the selected epigenomes, 0 otherwise
• ciMapRDBth : The minimum RegulomeDB score for SNP filtering
• ciMapChr15 : Select tissue/cell types, NA otherwise
• ciMapChr15Max : The maximum 15-core chromatin state
• ciMapChr15Meth : The method of chromatin state filtering
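
Because the downloadable parameter file is in INI format, it can be read with Python's standard `configparser` module. The sketch below parses an illustrative fragment; the section and key names follow the list above, but the example values, the exact spelling of values in a real file and the helper names `load_params` and `promoter_window` are assumptions.

```python
import configparser

# Illustrative fragment only; a real file contains many more keys.
EXAMPLE = """
[jobinfo]
title = example job

[params]
leadP = 5e-8
r2 = 0.6
gwasP = 0.05

[ciMap]
ciMap = 1
ciMapPromWindow = 250-500
"""


def load_params(text):
    """Parse the INI text while keeping the mixed-case key names."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # default behaviour would lower-case keys
    parser.read_string(text)
    return parser


def promoter_window(value):
    """Split e.g. '250-500' into (bp upstream, bp downstream) of the TSS."""
    upstream, downstream = value.split("-")
    return int(upstream), int(downstream)
```

With this fragment, `load_params(EXAMPLE)["params"].getfloat("leadP")` gives the lead SNP P-value threshold, and `promoter_window("250-500")` gives the promoter window as a pair of integers.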

### Redo gene mapping for existing jobs

From FUMA v1.3.0, gene mapping can be re-performed for an existing job with different parameter settings. This allows users to tune gene mapping parameters without running the entire process again, by duplicating the selected job, which saves a large amount of time.

#### 1. Select a jobID to duplicate

At the top of the page, users can select the jobID of an existing job on their account. Note that only jobs which have succeeded are selectable. This is only available for users who already have SNP2GENE jobs.

#### 2. Modify parameters

Once a jobID is selected, the previous parameters are automatically loaded. Modify parameters before submitting, otherwise the results will be the same as for the selected job. For chromatin interaction mapping, user custom files need to be re-uploaded.
Users can provide a new title; the suffix "_copied_(jobID)" will be automatically added to the title.
Users are only allowed to modify gene mapping parameters. Other parameters, such as the P-value or r2 threshold for defining independent significant SNPs, cannot be changed.

#### 3. Submit

Users can submit the job by clicking the button at the bottom of the page. After submission, the process is the same as submitting a new SNP2GENE job: you will get an email once the process is done, and the results are accessible from your job list table.

### Reference panel

To define independent significant SNPs, lead SNPs and genomic risk loci, FUMA uses reference panels. In this section, each reference panel is described in detail.

#### 1. 1000 Genome Phase3

Downloaded vcf files were split into 5 (super) populations based on panel file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel) using PLINK. Number of individuals per population is AFR: 661, AMR: 347, EAS: 504, EUR: 503 and SAS: 489.
Only bi-allelic SNPs with "PASS" were extracted per population using PLINK (--biallelic-only), and SNPs with duplicated positions were manually excluded. SNPs with MAF=0 were excluded.
MAF and pairwise LD were computed by PLINK (--r2 --ld-window 99999 --ld-window-r2 0.05).

#### 2. UK Biobank release 1

Genotype data was obtained under application ID 16406. 10,000 White British individuals were randomly selected to create the reference panel.
Chromosome X is not available for this reference panel.
Only bi-allelic SNPs with imputation INFO score > 0.9 were extracted by PLINK (--biallelic-only), and SNPs with duplicated positions were manually excluded. SNPs with MAF=0 were excluded.
MAF and pairwise LD were computed by PLINK (--r2 --ld-window 99999 --ld-window-r2 0.05).

#### 3. UK Biobank release 2

Genotype data was obtained under application ID 16406. SNPs imputed based on UK10K/1000G were excluded (only SNPs imputed based on HRC were included). For release 2, two reference panels were created: White British individuals, and European individuals projected onto the 1000G populations. For White British, 10,000 unrelated individuals were randomly selected. For European, each individual was first assigned to one of the 5 1000G populations based on the minimum Mahalanobis distance. Then 10,000 randomly selected unrelated EUR individuals were used.
Chromosome X is not available for this reference panel.
Only bi-allelic SNPs with imputation INFO score > 0.9 were extracted by PLINK (--biallelic-only), and SNPs with duplicated positions were manually excluded. SNPs with MAF=0 were excluded.
MAF and pairwise LD were computed by PLINK (--r2 --ld-window 99999 --ld-window-r2 0.05).

### MAGMA analyses

FUMA performs MAGMA gene analysis, gene-set analysis and gene-property analysis. In this section, reference panels and gene expression data sets are described in detail.

#### Reference panel

To avoid mismatch of rsID, unique ID (chr:pos:allele1:allele2) is used for MAGMA.
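
A unique ID of this form (chr:pos:allele1:allele2, with alleles in alphabetical order) can be built as in the sketch below. The function name `make_uniq_id` and the upper-casing of alleles are assumptions for illustration.

```python
def make_uniq_id(chrom, pos, allele_a, allele_b):
    """Build chr:pos:allele1:allele2 with alphabetically ordered alleles.

    Sorting the alleles makes the ID independent of which allele is
    listed as effect or reference allele in a given file.
    """
    a1, a2 = sorted([allele_a.upper(), allele_b.upper()])
    return f"{chrom}:{pos}:{a1}:{a2}"
```

The same SNP therefore receives the same ID regardless of allele order, e.g. `make_uniq_id(1, 12345, "T", "C")` and `make_uniq_id(1, 12345, "C", "T")` are identical (the position here is a made-up value).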

#### 1. 1000 Genome Phase3

Same as described in Reference panel section.

#### 2. UK Biobank release1

Same as described in Reference panel section, except that a further 1,000 individuals were randomly selected to reduce the runtime of MAGMA (it takes >3 hours with 10,000 individuals).

#### 3. UK Biobank release2

Same as described in Reference panel section, except that a further 1,000 individuals were randomly selected for each of the White British and European reference panels to reduce the runtime of MAGMA (it takes >3 hours with 10,000 individuals).

#### Gene analysis

Gene analysis is performed with default parameters (SNP-wise mean model) with the user selected reference panel.
The command FUMA uses is the following.

    magma --bfile [path to the selected reference panel] \
        --pval [magma input file] ncol=3 (or N=[total sample size]) \
        --gene-annot [path to the annotation file with user selected window size] \
        --out [output file]

#### Gene set analyses

In FUMA, curated gene sets (c2.all) and GO terms (c5.bp, c5.cc and c5.mf) from MSigDB are tested. For FUMA ≤ v1.3.0, 10894 gene sets (curated gene sets: 4728, GO terms: 6166) from MSigDB v5.2 are used. For FUMA ≥ v1.3.1 and < v1.3.4, 10655 gene sets (curated gene sets: 4738, GO terms: 5917) from MSigDB v6.1 are used. For FUMA ≥ v1.3.4, 10678 gene sets (curated gene sets: 4761, GO terms: 5917) from MSigDB v6.2 are used. Bonferroni correction is performed across all tested gene sets. To customise, you can download the output file and select specific gene sets.
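
To illustrate the customisation mentioned above, the sketch below applies a Bonferroni correction first across all gene sets and then across a user-chosen subset only. It operates on an in-memory dictionary rather than on the actual MAGMA output file, whose layout is not described here; the function names and gene-set labels are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted P-values: P * number of tests, capped at 1."""
    n_tests = len(pvalues)
    return {name: min(1.0, p * n_tests) for name, p in pvalues.items()}


def bonferroni_subset(pvalues, selected):
    """Re-apply the correction to a selected subset of gene sets only.

    Assumption: the subset was chosen before looking at the results,
    otherwise the smaller correction factor is not justified.
    """
    subset = {name: pvalues[name] for name in selected if name in pvalues}
    return bonferroni(subset)
```

Restricting the analysis to fewer gene sets lowers the correction factor, so the adjusted P-values of the selected sets become smaller than in the full analysis.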

Gene set analysis is performed by the following command.

    magma --gene-results [path to]/magma.genes.raw \
        --set-annot [path to gene set file] \
        --out [output file]

#### Gene property analysis for tissue specificity

To identify tissue specificity of the phenotype, FUMA performs MAGMA gene-property analyses to test relationships between tissue specific gene expression profiles and disease-gene associations. The gene-property analysis is based on the regression model, $$Z \sim \beta_0 + E_t\beta_E + A\beta_A + B\beta_B + \epsilon$$ where $$Z$$ is a gene-based Z-score converted from the gene-based P-value and $$B$$ is a matrix of several technical confounders included by default. $$E_t$$ is the gene expression value of the tested tissue type $$t$$ and $$A$$ is the average expression across tissue types in a data set, defined as follows: $$E_t = \frac{1}{n}\sum_{i=1}^{n} \log_2(e_i + 1)$$ $$A = \frac{1}{N}\sum_{j \in T} E_j$$ where $$n$$ is the number of samples in tissue type $$t$$, $$e_i$$ is the expression value of sample $$i$$ in tissue type $$t$$ (e.g. RPKM count or TPM), $$N$$ is the number of tissue types in a data set and $$T = \{tissue\ type\ 1, tissue\ type\ 2, ..., tissue\ type\ N\}$$. We performed a one-sided test ($$\beta_E>0$$), which essentially tests for a positive relationship between tissue specificity and genetic association of genes.
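
The conversion from a gene-based P-value to a Z-score is not spelled out here. A common convention, assumed in this sketch, is the probit transform Z = Φ⁻¹(1 − P), so that smaller P-values give larger positive Z-scores; the function name `p_to_z` is hypothetical.

```python
from statistics import NormalDist


def p_to_z(p):
    """Convert a P-value to a Z-score via the probit transform.

    Uses -inv_cdf(p), which equals inv_cdf(1 - p) by symmetry but
    stays accurate for very small P-values where 1 - p rounds to 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return -NormalDist().inv_cdf(p)
```

Under this convention P = 0.5 maps to Z = 0 and P = 0.025 maps to Z of about 1.96.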

MAGMA gene-property analysis is run with the following command,

    magma --gene-results [input file name].genes.raw \
        --gene-covar [file name of selected RNA-seq data set] onesided=greater condition=Average \
        --out [output file name]

#### 1. GTEx v6

Data source
RNAseq data set was downloaded from http://www.gtexportal.org/home/datasets. Gene level RPKM was used (GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz).

Pre-process
Primary gene ID was Ensembl ID. In total, 8,555 samples were available. From 56,318 annotated genes, genes were kept only if the average RPKM per tissue was >1 in at least one of the 53 tissues. This resulted in 28,577 genes. RPKM was winsorized at 50 (RPKM>50 was replaced with 50). Then the average of log transformed RPKM with pseudocount 1 (log2(RPKM+1)) per tissue (for either 53 detailed or 30 general tissues) was used as the covariate, conditioning on the average across all the tissues.
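
The pre-processing steps above (gene filter, winsorization at 50, log2 transform with pseudocount 1, per-tissue average and overall average) can be sketched as follows. This is an illustration under assumptions, not FUMA's actual pipeline: the nested-dictionary input format and the function name `tissue_covariates` are hypothetical, and the filter is assumed to use the raw (un-winsorized) per-tissue mean.

```python
import math


def tissue_covariates(expr, cap=50.0, min_mean=1.0):
    """Compute per-tissue gene covariates.

    expr: {gene: {tissue: [expression value per sample]}}
    Returns {gene: {tissue: E_t, ..., "Average": A}} for genes whose
    mean expression exceeds `min_mean` in at least one tissue.
    """
    result = {}
    for gene, by_tissue in expr.items():
        raw_means = [sum(v) / len(v) for v in by_tissue.values()]
        if not any(mean > min_mean for mean in raw_means):
            continue  # gene filtered out: not expressed in any tissue
        row = {}
        for tissue, values in by_tissue.items():
            # Winsorize at `cap`, then log2 with pseudocount 1.
            logged = [math.log2(min(x, cap) + 1.0) for x in values]
            row[tissue] = sum(logged) / len(logged)
        # Average across tissues, used as the conditioning covariate.
        row["Average"] = sum(row.values()) / len(row)
        result[gene] = row
    return result
```

The "Average" entry corresponds to the covariate named in `condition=Average` in the gene-property command.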

#### 2. GTEx v7

Data source
RNAseq data set was downloaded from http://www.gtexportal.org/home/datasets. Gene level TPM was used (GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz).

Pre-process
Primary gene ID was Ensembl ID. In total, 11,688 samples were available. From 56,203 annotated genes, genes were kept only if the average TPM per tissue was >1 in at least one of the 53 tissues. This resulted in 32,335 genes. TPM was winsorized at 50 (TPM>50 was replaced with 50). Then the average of log transformed TPM with pseudocount 1 (log2(TPM+1)) per tissue (for either 53 detailed or 30 general tissues) was used as the covariate, conditioning on the average across all the tissues.

#### 3. BrainSpan

Data source

Pre-process
Primary gene ID was Ensembl ID. In total, 524 samples were available. General developmental stages were annotated for each sample based on age. We used 11 developmental stages and 29 ages as the labels. For the age label, we excluded age groups with <3 samples (25 pcw and 35 pcw). From 52,376 annotated genes, genes were kept only if the average RPKM per label was >1 in at least one of the developmental stages or ages. This resulted in 19,601 and 21,001 genes for developmental stages and age groups, respectively. RPKM was winsorized at 50 (RPKM>50 was replaced with 50). Then the average of log transformed RPKM with pseudocount 1 (log2(RPKM+1)) per label (for either 11 developmental stages or 29 age groups) was used as the covariate, conditioning on the average across all the labels.

### Risk loci and lead SNPs

In this section, "Genomic risk loci", "lead SNPs" and "Independent significant SNPs (Ind. sig. SNPs)" are explained in more detail.

#### 1. Independent significant SNPs (Ind. sig. SNPs)

Ind. sig. SNPs are defined as SNPs that have a P-value ≤ the user defined threshold for genome-wide significance (5e-8 by default) and are independent from each other at the user defined r2 (0.6 by default). Ind. sig. SNPs are therefore essentially the same as the SNPs that are retained after clumping GWAS tagged SNPs at the same P-value and r2. Ind. sig. SNPs are used to select candidate SNPs that are in LD with the ind. sig. SNPs.
The candidate SNPs (and ind. sig. SNPs) are used for gene prioritization.
Relaxing the threshold for the genome-wide significant P-value results in an increased number of ind. sig. SNPs. When you would like to identify ind. sig. SNPs in genomic loci which do not reach the commonly adopted genome-wide significance level of 5e-8, a less stringent P-value threshold can be used. Alternatively, by providing pre-defined lead SNPs in a separate file, these provided SNPs will be defined as ind. sig. SNPs regardless of their P-value.
The higher the threshold for r2, the more SNPs are defined as ind. sig. SNPs. At the same time, the number of SNPs in LD with the ind. sig. SNPs (the candidate SNPs, which are the SNPs annotated in FUMA and used for gene prioritization) decreases.

#### 2. Lead SNPs

Lead SNPs are defined as ind. sig. SNPs that are independent from each other at r2 < 0.1 (currently not adjustable). Lead SNPs are therefore the same as the SNPs obtained by clumping the ind. sig. SNPs with PLINK at the user defined P-value and r2 = 0.1.
When r2 is set at 0.1, lead SNPs are exactly the same as ind. sig. SNPs. However, this will also result in selecting candidate SNPs that have r2 above 0.1 with any of the ind. sig. SNPs. We thus advise setting r2 at 0.6 or higher.
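
The relationship between ind. sig. SNPs and lead SNPs can be illustrated with a greedy, P-value ordered clumping sketch. This is a simplified stand-in for PLINK clumping, not the exact procedure FUMA runs; the function `clump`, the toy LD table and the SNP IDs are hypothetical.

```python
def clump(pvalues, r2, p_threshold, r2_threshold):
    """Greedy clumping: walk SNPs from most to least significant and keep
    a SNP only if it is independent (r2 < r2_threshold) of all kept SNPs.

    pvalues -- dict: SNP ID -> P-value
    r2      -- function (snp_a, snp_b) -> pairwise r2
    """
    candidates = sorted(
        (snp for snp, p in pvalues.items() if p <= p_threshold),
        key=lambda snp: pvalues[snp],
    )
    selected = []
    for snp in candidates:
        if all(r2(snp, kept) < r2_threshold for kept in selected):
            selected.append(snp)
    return selected


# Toy pairwise LD (hypothetical); unlisted pairs are treated as r2 = 0.
LD = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs1", "rs3")): 0.3,
    frozenset(("rs2", "rs3")): 0.3,
}


def r2(snp_a, snp_b):
    return LD.get(frozenset((snp_a, snp_b)), 0.0)


pvalues = {"rs1": 1e-12, "rs2": 3e-10, "rs3": 2e-9, "rs4": 4e-8, "rs5": 1e-3}

# Ind. sig. SNPs: genome-wide significant and independent at r2 = 0.6.
ind_sig = clump(pvalues, r2, p_threshold=5e-8, r2_threshold=0.6)

# Lead SNPs: ind. sig. SNPs that are independent at r2 = 0.1.
lead = clump({s: pvalues[s] for s in ind_sig}, r2,
             p_threshold=5e-8, r2_threshold=0.1)
```

Here `rs3` is an ind. sig. SNP (r2 = 0.3 with `rs1`, below 0.6) but not a lead SNP, because its r2 with `rs1` is above 0.1.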

#### 3. Genomic risk loci

On top of lead SNPs, FUMA defines genomic risk loci, including all independent signals that are physically close or overlapping in a single locus. First, ind. sig. SNPs which are dependent on each other at r2 ≥ 0.1 are assigned to the same genomic risk locus. Then, ind. sig. SNPs which are closer than the user defined distance (250 kb by default) are merged into one genomic risk locus. The distance between two LD blocks of two ind. sig. SNPs is the distance between the closest SNPs (which are in LD with the ind. sig. SNPs at the user defined r2) from each LD block.
Each locus is represented by the top lead SNP which has the minimum P-value in the locus.
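
The distance-based merging step can be sketched as an interval merge. This is only an illustration of the second step (merging by distance), under the assumption that each LD block is already summarised by its chromosome, boundaries and most significant SNP; the function `merge_loci` and the toy blocks are hypothetical.

```python
def merge_loci(blocks, max_distance=250_000):
    """Merge LD blocks on the same chromosome that are closer than
    max_distance (in bp).

    blocks -- list of (chrom, start, end, snp, p_value) tuples
    Returns a list of loci; each locus keeps the SNP with the minimum
    P-value among its blocks as its representative.
    """
    loci = []
    for chrom, start, end, snp, p in sorted(blocks, key=lambda b: (b[0], b[1])):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start - last["end"] < max_distance:
            # Close enough: extend the current locus.
            last["end"] = max(last["end"], end)
            if p < last["p"]:
                last["top_snp"], last["p"] = snp, p
        else:
            loci.append({"chrom": chrom, "start": start, "end": end,
                         "top_snp": snp, "p": p})
    return loci


# Toy LD blocks (hypothetical coordinates and SNP IDs).
blocks = [
    (1, 1_000_000, 1_050_000, "rsA", 1e-9),
    (1, 1_200_000, 1_260_000, "rsB", 1e-12),  # 150 kb from the first block
    (1, 2_000_000, 2_010_000, "rsC", 3e-8),   # 740 kb away: separate locus
    (2, 1_100_000, 1_150_000, "rsD", 2e-8),   # different chromosome
]
loci = merge_loci(blocks)
```

The first two blocks are merged into one locus represented by `rsB`, which has the minimum P-value in that locus.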

#### 4. Candidate SNPs (SNPs in LD of ind. sig. SNPs)

Candidate SNPs are SNPs that are in LD with any of the ind. sig. SNPs at the user defined r2. Candidate SNPs, together with the ind. sig. SNPs, are the SNPs that are used to prioritize genes. The leftmost and rightmost SNPs which are in LD with an ind. sig. SNP define an LD block, and those SNPs are used to compute the distance between LD blocks.
Note that not all candidate SNPs are necessarily in LD with lead SNPs, although they must be in LD with ind. sig. SNPs at the user defined r2.
All candidate SNPs are annotated with their functions and listed in the "SNPs" table.
The higher the r2 threshold, the fewer candidate SNPs are identified. The number of candidate SNPs can also be controlled by the parameter of the maximum P-value for GWAS tagged SNPs (0.05 by default). For example, when r2 is set at less than 0.6, the P-value threshold for GWAS tagged SNPs might need to be set at a more stringent value, since SNPs in low r2 with the ind. sig. SNPs often have very high P-values.

Effect of r2 parameter

### eQTLs

FUMA contains several data sources of eQTLs and each data source is described in this section.

#### 1. GTEx v6

Data source
eQTL data was downloaded from http://www.gtexportal.org/home/datasets. Under the section of GTEx V6, from single tissue eQTL data, both GTEx_analysis_V6_eQTLs.tar.gz for significant SNP-gene associations based on permutation, and GTEx_Analysis_V6_all-snp-gene-associations.tar for every SNP-gene association test (including non-significant pairs) were downloaded.
GTEx eQTL v6 contains 44 different tissue types across 30 general tissue types.

Description
FUMA contains all SNP-gene pairs of cis-eQTL including non-significant association. Significant eQTLs are defined as FDR (gene q-value) ≤ 0.05. The gene FDR is pre-calculated by GTEx and every gene-tissue pair has a defined P-value threshold for eQTLs based on permutation.
Signed statistics is t-statistics.

Samples

The table contains the list of tissue types available in GTEx v6 for cis-eQTL (only tissues with genotyped sample size ≥ 70).

| General tissue type | Tissue type | Genotyped sample size |
|---|---|---|
| Blood | Cells EBV-transformed lymphocytes | 114 |
| Blood Vessel | Artery Aorta | 197 |
| Blood Vessel | Artery Coronary | 118 |
| Blood Vessel | Artery Tibial | 285 |
| Blood | Whole Blood | 338 |
| Brain | Brain Anterior cingulate cortex BA24 | 72 |
| Brain | Brain Caudate basal ganglia | 100 |
| Brain | Brain Cerebellar Hemisphere | 89 |
| Brain | Brain Cerebellum | 103 |
| Brain | Brain Cortex | 96 |
| Brain | Brain Frontal Cortex BA9 | 92 |
| Brain | Brain Hippocampus | 81 |
| Brain | Brain Hypothalamus | 81 |
| Brain | Brain Nucleus accumbens basal ganglia | 93 |
| Brain | Brain Putamen basal ganglia | 82 |
| Breast | Breast Mammary Tissue | 183 |
| Colon | Colon Sigmoid | 124 |
| Colon | Colon Transverse | 169 |
| Esophagus | Esophagus Gastroesophageal Junction | 127 |
| Esophagus | Esophagus Mucosa | 241 |
| Esophagus | Esophagus Muscularis | 218 |
| Heart | Heart Atrial Appendage | 159 |
| Heart | Heart Left Ventricle | 190 |
| Liver | Liver | 97 |
| Lung | Lung | 278 |
| Muscle | Muscle Skeletal | 361 |
| Nerve | Nerve Tibial | 256 |
| Ovary | Ovary | 85 |
| Pancreas | Pancreas | 149 |
| Pituitary | Pituitary | 87 |
| Prostate | Prostate | 87 |
| Skin | Cells Transformed fibroblasts | 272 |
| Skin | Skin Not Sun Exposed Suprapubic | 196 |
| Skin | Skin Sun Exposed Lower leg | 302 |
| Small Intestine | Small Intestine Terminal Ileum | 77 |
| Spleen | Spleen | 89 |
| Stomach | Stomach | 170 |
| Testis | Testis | 157 |
| Thyroid | Thyroid | 278 |
| Uterus | Uterus | 70 |
| Vagina | Vagina | 79 |

#### 2. Blood eQTL browser (Westra et al. 2013)

Data source

Description
The data only include eQTLs with FDR ≤ 0.5. Genes in the original files were mapped to Ensembl ID, and genes that could not be mapped to an Ensembl ID were removed.
Signed statistics is Z-score.

Samples
5,311 peripheral blood samples from 7 studies (Westra et al. 2013).

#### 3. BIOS QTL browser (Zhernakova et al. 2017)

Data source

Description
The data only include eQTLs with FDR ≤ 0.05.
Signed statistics is beta.

Samples
2,116 whole peripheral blood samples of healthy adults from 4 Dutch cohorts (Zhernakova et al. 2017).

#### 4. BRAINEAC

Data source
eQTL was obtained from http://www.braineac.org/.

Description
The data include all eQTLs with nominal P-value < 0.05. Since tested allele was not provided in the original data source, minor alleles in 1000 genome phase 3 are assigned as tested alleles.
Signed statistics is t-statistics.
eQTLs were identified for each of the following 10 brain regions and based on averaged expression across all of them.
Alignment of risk increasing allele and eQTL tested allele was not performed for this data source, since tested allele is not available in the original data source (replaced with "NA" in the result table).

• Cerebellar cortex
• Frontal cortex
• Hippocampus
• Inferior olivary nucleus (sub-dissected from the medulla)
• Occipital cortex
• Putamen (at the level of the anterior commissure)
• Substantia nigra
• Temporal cortex
• Thalamus (at the level of the lateral geniculate nucleus)
• Intralobular white matter

Samples
134 neuropathologically confirmed control individuals of European descent from UK Brain Expression Consortium (Ramasamy et al. 2014).

#### 5. GTEx v7

Data source
eQTL data was downloaded from http://www.gtexportal.org/home/datasets. Under the section of GTEx V7, from single tissue eQTL data both GTEx_analysis_v7_eQTLs.tar.gz for significant SNP-gene association based on permutation, and GTEx_Analysis_v7_all_associations.tar.gz for every SNP-gene association test (including non-significant pairs) were downloaded.
GTEx eQTL v7 contains 48 different tissue types across 30 general tissue types.

Description
FUMA contains all SNP-gene pairs of cis-eQTL including non-significant association. Significant eQTLs are defined as FDR (gene q-value) ≤ 0.05. The gene FDR is pre-calculated by GTEx and every gene-tissue pair has a defined P-value threshold for eQTLs based on permutation.
Signed statistics is slope.

Samples

The table contains the list of tissue types available in GTEx v7 for cis-eQTL (only tissues with genotyped sample size ≥ 70).

| General tissue type | Tissue type | Genotyped sample size |
|---|---|---|
| Blood | Cells EBV-transformed lymphocytes | 117 |
| Blood | Whole Blood | 369 |
| Blood Vessel | Artery Aorta | 267 |
| Blood Vessel | Artery Coronary | 152 |
| Blood Vessel | Artery Tibial | 388 |
| Brain | Brain Amygdala | 88 |
| Brain | Brain Anterior cingulate cortex BA24 | 109 |
| Brain | Brain Caudate basal ganglia | 144 |
| Brain | Brain Cerebellar Hemisphere | 125 |
| Brain | Brain Cerebellum | 154 |
| Brain | Brain Cortex | 136 |
| Brain | Brain Frontal Cortex BA9 | 118 |
| Brain | Brain Hippocampus | 111 |
| Brain | Brain Hypothalamus | 108 |
| Brain | Brain Nucleus accumbens basal ganglia | 130 |
| Brain | Brain Putamen basal ganglia | 111 |
| Brain | Brain Spinal cord cervical c-1 | 83 |
| Brain | Brain Substantia nigra | 80 |
| Breast | Breast Mammary Tissue | 251 |
| Colon | Colon Sigmoid | 203 |
| Colon | Colon Transverse | 246 |
| Esophagus | Esophagus Gastroesophageal Junction | 213 |
| Esophagus | Esophagus Mucosa | 358 |
| Esophagus | Esophagus Muscularis | 335 |
| Heart | Heart Atrial Appendage | 264 |
| Heart | Heart Left Ventricle | 272 |
| Liver | Liver | 153 |
| Lung | Lung | 383 |
| Muscle | Muscle Skeletal | 491 |
| Nerve | Nerve Tibial | 361 |
| Ovary | Ovary | 122 |
| Pancreas | Pancreas | 220 |
| Pituitary | Pituitary | 157 |
| Prostate | Prostate | 132 |
| Salivary Gland | Minor Salivary Gland | 85 |
| Skin | Cells Transformed fibroblasts | 300 |
| Skin | Skin Not Sun Exposed Suprapubic | 335 |
| Skin | Skin Sun Exposed Lower leg | 414 |
| Small Intestine | Small Intestine Terminal Ileum | 122 |
| Spleen | Spleen | 146 |
| Stomach | Stomach | 237 |
| Testis | Testis | 225 |
| Thyroid | Thyroid | 399 |
| Uterus | Uterus | 101 |
| Vagina | Vagina | 106 |

#### 6. MuTHER (Grundberg et al. 2012)

Data source

Description
Chromosome coordinates were lifted over from hg18 to hg19 using the liftOver software. Gene names are mapped to Ensembl ID (genes which are not mapped to an ENSG ID are excluded). Since only the tested allele was provided, the other allele was extracted from the 1000G EUR population. FDR (or any corrected P-value) was not available in the original data (in FUMA, the FDR column was replaced with -9).
Signed statistics is beta.
Since FDR is not available, MuTHER eQTLs can only be used when a P-value threshold is provided by the user, not with the "only significant snp-gene pairs" option.

Samples
856 female individuals of European descent recruited from the TwinsUK Adult twin registry (Grundberg et al. 2012).

• Skin (N=847)
• LCL (N=837)

#### 7. xQTLServer (Ng et al. 2017)

Data source

Description
Gene names are mapped to Ensembl ID (genes which are not mapped to an ENSG ID are excluded). Since alleles were not available in the original data, they were extracted from the 1000G EUR population based on chromosome coordinate. FDR was not provided in the original data source, so the FDR column was replaced with the Bonferroni corrected P-value, as used in the original study (corrected for all 60,456,556 tested SNP-gene pairs).
Signed statistics is not available.
Alignment of risk increasing allele and eQTL tested allele was not performed for this data source, since tested allele and signed statistics are not available in the original data source (replaced with "NA" in the result table).

Samples
494 dorsolateral prefrontal cortex samples (Ng et al. 2017).

#### 8. CommonMind Consortium (Fromer et al. 2016)

Data source
eQTL data was downloaded from https://www.synapse.org//#!Synapse:syn5585484. Both eQTLs with and without SVA are included.

Description
Publicly available eQTLs from CMC (available without application) are binned by FDR. Therefore, the nominal P-value is not available (replaced with -9). FDR was binned into the following four groups: <0.2, <0.1, <0.05 and <0.01. As a numeric value is required for filtering during the SNP2GENE process, those categorical values are replaced with 0.199, 0.099, 0.049 and 0.009, respectively.
Signed statistics is not available, but since the expression increasing allele was provided, the signed_stats column is replaced with 1.
Trans eQTLs are also available for CMC data set (as a separated option from cis-eQTLs).

Samples
Post-mortem brain samples from 467 Caucasian individuals (209 with SCZ, 206 controls and 52 AFF cases; Fromer et al. 2016).

#### 9. eQTLGen (Vosa et al. 2018)

Data source
eQTL data was downloaded from http://www.eqtlgen.org/index.html. For cis-eQTLs, cis-eQTLs_full_20180905.txt.gz, for trans-eQTLs, trans-eQTL_significant_20181017.txt.gz was used.

Description
For cis-eQTLs, full summary statistics were downloaded. In the dataset, every SNP-gene pair with a distance <1Mb from the center of the gene and tested in at least 2 cohorts was included. For trans-eQTLs, only significant eQTLs were included in FUMA since the cross-mapping effects were not filtered in the downloadable full summary statistics. In the original study, every SNP-gene pair with a distance >5Mb and tested in at least 2 cohorts was included. FDR was estimated based on permutations. Please refer to the original study for more details (Vosa et al. 2018). Ensembl gene ID is used as provided in the original file.
Signed statistics is z-statistics.

Samples
Meta-analysis of cis-/trans-eQTLs from 37 datasets with a total of 31,684 individuals.

#### Risk increasing allele in GWAS

When "beta" or "OR" column is provided in the input GWAS file, risk increasing alleles are defined as follows: if beta > 0 or OR > 1, effect/risk allele is defined as the risk increasing allele, if beta < 0 or OR < 1, non-effect/non-risk allele is defined as the risk increasing allele.
If a signed effect is not provided in the input GWAS file, the risk increasing allele is not defined ("NA"). SNPs which are not in the input GWAS file but obtained from the reference panel due to high LD are also encoded as "NA". When both effect and non-effect alleles are not provided in the input GWAS file, this alignment is not relevant. Please interpret the results with care.

#### Aligned direction of eQTLs

The sign of the t-statistic or z-score in the original eQTL data sources represents the direction of effect of the tested allele. To obtain the direction of effect for the risk increasing allele of the GWAS, the risk increasing allele and the tested allele of the eQTL are aligned as follows: if the risk increasing allele is the same allele as the tested allele of the eQTL, the direction is the same as the sign of the original t-statistic/z-score; if the risk increasing allele is not the same allele as the tested allele of the eQTL, the direction of the t-statistic/z-score is flipped.
Direction is either "+" (risk increasing allele increases the expression of the gene) or "-" (risk increasing allele decreases the expression of the gene).

#### Examples

Here are some examples of how the alleles are aligned.

| uniqID | effect allele | non-effect allele | beta | risk increasing allele | tested allele of eQTL | t-statistics of eQTL | aligned direction |
|---|---|---|---|---|---|---|---|
| 1:201885026:C:T | T | C | 0.22 | T | T | -7.98 | - |
| 11:43843579:C:G | C | G | 0.004 | C | G | 17.23 | - |
| 16:28537971:C:T | T | C | -0.028 | C | C | 5.04 | + |
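
The alignment rules can be expressed as two small Python functions. This is a sketch of the rules as described in this section, not FUMA's source code; the function names are hypothetical, and returning "NA" for an effect or statistic of exactly zero is an assumption, since that case is not described.

```python
def risk_increasing_allele(effect_allele, non_effect_allele,
                           beta=None, odds_ratio=None):
    """Return the risk increasing allele, or "NA" if it cannot be defined."""
    if beta is not None:
        signed = beta
    elif odds_ratio is not None:
        signed = odds_ratio - 1.0  # OR > 1 corresponds to a positive effect
    else:
        return "NA"  # no signed effect in the GWAS input
    if signed > 0:
        return effect_allele
    if signed < 0:
        return non_effect_allele
    return "NA"  # assumption: a zero effect leaves the allele undefined


def aligned_direction(risk_allele, tested_allele, signed_stat):
    """Return "+" or "-" for the risk increasing allele, or "NA"."""
    if "NA" in (risk_allele, tested_allele) or not signed_stat:
        return "NA"
    # Flip the sign when the tested allele is not the risk increasing allele.
    stat = signed_stat if risk_allele == tested_allele else -signed_stat
    return "+" if stat > 0 else "-"


# The three example rows from the table above:
# (effect allele, non-effect allele, beta, tested allele, t-statistic)
examples = [
    ("T", "C", 0.22, "T", -7.98),
    ("C", "G", 0.004, "G", 17.23),
    ("T", "C", -0.028, "C", 5.04),
]
directions = []
for effect, other, beta, tested, t_stat in examples:
    risk = risk_increasing_allele(effect, other, beta=beta)
    directions.append((risk, aligned_direction(risk, tested, t_stat)))
```

Applied to the three example rows, the sketch reproduces the risk increasing alleles (T, C, C) and the aligned directions (-, -, +) listed in the table.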

### Chromatin interaction data and mapping

In this section, the built-in chromatin interaction data, the file format of custom chromatin interaction matrices and the details of chromatin interaction mapping are described. Since chromatin interaction mapping is more complicated than the other two mappings (positional and eQTL), please read this section carefully.

#### Terminology

Region 1
One end of a significant interaction, which overlaps with one of the candidate SNPs (independent significant SNPs and SNPs which are in LD with them). This region always overlaps with one of the genomic risk loci identified by FUMA.
Region 2
The other end of the significant interaction. This region is used to map to genes. Region 2 can also overlap with one of the genomic risk loci.

#### Built-in chromatin interaction data

1. Hi-C of 21 tissue/cell types from GSE87112.
Pre-processed significant loops computed by Fit-Hi-C were obtained from GSE87112. Loops were filtered at FDR 0.05. For mapping, loops can be further filtered by the user defined FDR threshold. Available tissue/cell types are listed below.

• Aorta
• Dorsolateral Prefrontal Cortex
• Hippocampus
• Left Ventricle
• Liver
• Lung
• Ovary
• Pancreas
• Psoas
• Right Ventricle
• Small Bowel
• Spleen
• GM12878
• IMR90
• Mesenchymal Stem Cell
• Mesendoderm
• Neural Progenitor Cell
• Trophoblast-like Cell
• hESC

#### Custom chromatin interaction matrices file format

1. Input file format
The chromatin interaction matrix should have the following 7 columns in the same order as listed below. Header line is mandatory but the column names do not need to be the same as the below as long as the order is the same. Delimiter should be tab or white space(s). The input file should be gzipped and named as "(name_of_data).txt.gz" in which "(name_of_data)" will be used in the result table and regional plot.

Columns:

1. chromosome of region 1
2. start position of region 1
3. end position of region 1
4. chromosome of region 2
5. start position of region 2
6. end position of region 2
7. FDR
Example:
chr1 start1 end1 chr2 start2 end2 FDR
1 2920001 2960000 1 3160001 3200000 0.03186403
1 4160001 4200000 1 5880001 5920000 5.3e-8
1 4520001 4560000 3 83200001 83240000 0.03920674

Chromosome can be coded as string like "chr1" and "chrX" which will be converted into integer.
Order of region 1 and region 2 does not matter. Inter-chromosomal interactions can be encoded in the same file by specifying chromosome of region 1 and region 2.
The column of FDR will be used to filter interaction by the user defined threshold.
The maximum size of each file is 600Mb. If the file is larger than this, please filter interactions or split them into multiple files.

2. Data types
When uploading custom chromatin interaction matrices, users can specify the type of data such as Hi-C or ChIA-PET. Specifying the data type is not mandatory since it is only used to specify in the result table and regional plot for convenience.

3. Filtering of chromatin interactions
The 7th column (FDR) will be used to filter interactions. To disable this filtering, either set the filtering threshold to 1 or assign 0 to the FDR column. Technically, the 7th column does not have to be FDR but can be any other score. When one prefers to use a different score or a nominal P-value, that is also possible by setting a proper filtering threshold. Note that only interactions with a score less than or equal to the threshold are retained.
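
As an illustration of the file format and the filtering rule, the following Python sketch parses a custom chromatin interaction matrix and keeps interactions whose 7th column is less than or equal to a threshold. It is not FUMA's own parser: the function names are hypothetical, and encoding chromosome X as 23 is an assumption (the text only states that chromosomes are converted into integers).

```python
import gzip


def parse_chrom(value):
    """Convert "chr1" or "1" to 1. Coding X as 23 is an assumption."""
    name = value[3:] if value.lower().startswith("chr") else value
    return 23 if name.upper() == "X" else int(name)


def read_interactions(lines, max_score=0.05):
    """Parse whitespace-delimited lines and filter on the 7th column.

    lines -- iterable of text lines; the first line is the mandatory header
    Returns a list of (chr1, start1, end1, chr2, start2, end2, score).
    """
    kept = []
    for index, line in enumerate(lines):
        if index == 0 or not line.strip():
            continue  # skip the header line and blank lines
        chr1, start1, end1, chr2, start2, end2, score = line.split()[:7]
        if float(score) <= max_score:
            kept.append((parse_chrom(chr1), int(start1), int(end1),
                         parse_chrom(chr2), int(start2), int(end2),
                         float(score)))
    return kept


def read_interactions_gz(path, max_score=0.05):
    """Same as read_interactions, for a gzipped file on disk."""
    with gzip.open(path, "rt") as handle:
        return read_interactions(handle, max_score)


# The example rows from this section.
example = [
    "chr1 start1 end1 chr2 start2 end2 FDR",
    "1 2920001 2960000 1 3160001 3200000 0.03186403",
    "1 4160001 4200000 1 5880001 5920000 5.3e-8",
    "1 4520001 4560000 3 83200001 83240000 0.03920674",
]
all_rows = read_interactions(example, max_score=0.05)
strict_rows = read_interactions(example, max_score=1e-6)
```

With a threshold of 0.05 all three example interactions are kept; with 1e-6 only the second one remains.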

#### Enhancer and promoter regions

Enhancer and promoter regions were obtained from Roadmap Epigenomics Projects for 111 epigenomes. Those regions were predicted using DNase peaks and core 15-state chromatin state model. Please refer here for details.
For selected epigenomes, enhancer regions are annotated to region 1 and promoter regions are annotated to region 2. Dyadic enhancer/promoter regions are annotated for both.
Annotated enhancer and promoter regions can be used to filter SNPs or mapped genes which is described in the next section.

#### Chromatin interaction mapping

1. Basic mapping (without filtering)
Chromatin interaction mapping is performed with significant chromatin interactions at the user defined threshold. Region 2 is mapped to genes whose promoter regions (250bp up- and 500bp down-stream of the TSS by default) overlap with region 2. Those genes are considered as mapped by the candidate SNPs which overlap with region 1.
If there are no genes in region 2, the interaction is not mapped to any gene.
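
The basic mapping step amounts to an interval overlap between region 2 and gene promoter windows. The following Python sketch illustrates this under stated assumptions: the function names and toy genes are hypothetical, the window is assumed to be strand-aware, and the window sizes are passed explicitly rather than taken from FUMA.

```python
def promoter_window(tss, strand, upstream, downstream):
    """Return (start, end) of the promoter window around a TSS.

    Assumption: "upstream" and "downstream" follow the gene's strand.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def map_region2_to_genes(region2, genes, upstream, downstream):
    """Return IDs of genes whose promoter window overlaps region 2.

    region2 -- (chrom, start, end)
    genes   -- dict: gene ID -> (chrom, tss, strand)
    """
    chrom, start, end = region2
    mapped = []
    for gene_id, (gene_chrom, tss, strand) in genes.items():
        p_start, p_end = promoter_window(tss, strand, upstream, downstream)
        if gene_chrom == chrom and overlaps(start, end, p_start, p_end):
            mapped.append(gene_id)
    return mapped


# Toy genes (hypothetical IDs and coordinates).
genes = {
    "G1": (1, 3_170_000, "+"),  # promoter inside region 2
    "G2": (1, 3_200_300, "-"),  # window reaches back into region 2
    "G3": (1, 3_200_300, "+"),  # window starts after region 2 ends
    "G4": (3, 3_170_000, "+"),  # different chromosome
}
region2 = (1, 3_160_001, 3_200_000)
mapped = map_region2_to_genes(region2, genes, upstream=250, downstream=500)
```

An empty result corresponds to the case where no gene promoter lies in region 2, in which the interaction is not mapped to any gene.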

2. Enhancer filtering
When enhancers are annotated to region 1, the user can select the option to restrict candidate SNPs to those that overlap with enhancer regions of the selected epigenomes. Note that, in the result table, all significant interactions are included but not all are necessarily used for mapping.

3. Promoter filtering
When promoters are annotated to region 2, the user can select the option to limit the chromatin interaction mapping to only genes whose promoter regions overlap with annotated promoter regions of the selected epigenomes. Note that, in the result table, all significant interactions are included but not all are necessarily mapped to genes.

In very rare cases, when the promoter filtering is activated, genes whose promoter regions (250bp up- and 500bp down-stream of TSS) do not overlap with region 2 but do overlap with promoters from Roadmap that are overlapping with region 2 are mapped. In this case, these genes are not in the "ci.txt" file but in the "ciProm.txt" file, which can be linked to "ci.txt" by region 2.

## GENE2FUNC

The main goal of GENE2FUNC is to provide information on expression of prioritized genes and test for enrichment of the set of genes in pre-defined pathways. You can use the genes prioritized with SNP2GENE or use a separate list of genes.

### Submit genes

#### Option 1. Use mapped genes from SNP2GENE

If you want to use mapped genes from SNP2GENE, just click the button in the Mapped genes panel of the result page. It will open a new tab and automatically start the analyses. This will take all mapped genes and use background genes with the gene types selected for gene mapping (such as "protein-coding" or "ncRNA"). The method for multiple testing correction (FDR BH), adjusted P-value cutoff (0.05) and minimum number of overlapped genes (2) are set at default values. These options can be adjusted by resubmitting your query (click the "Submit" button in the New Query tab).

#### Option 2. Use a list of genes of interest

To analyze a custom list of genes, you have to prepare a list of genes as either ENSG ID, entrez ID or gene symbol. Genes can be provided in the text area (one gene per line) or by uploading a file in the left panel. When you upload a file, genes have to be in the first column with a header. The header can be anything (even just a new line is fine) but FUMA will start reading your genes from the second row.

To analyze your genes, you need to specify background genes, which are used in the 2x2 enrichment tests. You can choose from the provided gene types. Alternatively, you can provide a custom list of background genes. Please provide this list either in the text area or by uploading a file of the right panel. File format should be the same as described for genes of interest.

### Outputs of GENE2FUNC

1) Summary of input genes
The table summarises the input genes and background genes. Input genes which are not used in the GENE2FUNC analyses due to lack of a matching gene ID are also listed. Since the primary gene ID of FUMA is Ensembl ID and not all Ensembl IDs are mapped to a unique entrez ID (NCBI gene ID), the number of unique entrez IDs can be smaller than the number of input genes with Ensembl ID. Ensembl ID is used for the expression heatmap and tissue specificity analyses, and entrez ID is used for gene set enrichment analysis.
3) Parameters
The table contains input parameters. This can be also downloaded from the option above.

#### 2. Gene Expression Heatmap

The heatmap displays two expression values.
1) Average expression per label
This is an averaged expression value per label (e.g. tissue types or developmental stage) per gene following winsorization at 50 and log 2 transformation with pseudocount 1. The expression value depends on the data set: RPKM (Reads Per Kilobase per Million) for GTEx v6 and BrainSpan, TPM (Transcripts Per Million) for GTEx v7. This allows for comparison across labels and genes. Hence, cells filled in red represent higher expression compared to cells filled in blue across genes and labels.
2) Average of normalized expression per label
This is the average of normalized expression (zero mean across samples) following winsorization at 50 and log 2 transformation of the expression value with pseudocount 1. This allows comparison of gene expression across labels (horizontal comparison) within a gene. Thus expression values of different genes within a label (vertical comparison) are not comparable. Hence, cells filled in red represent higher expression of the gene in the corresponding label compared to other labels, but this DOES NOT represent higher expression compared to other genes.

Labels (columns) and genes (rows) can be ordered alphabetically or by cluster (hierarchical clustering). Hierarchical clustering is performed using the python scipy package (using the "average" method).
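
The difference between the two heatmap values can be made concrete with a short Python sketch for a single gene. This is an illustration only: the function name and toy data are hypothetical, "normalized" is taken to mean centering on the gene's mean across samples as stated above, and the clustering step is not covered.

```python
import math


def heatmap_values(values, labels, cap=50.0):
    """Compute both heatmap values for one gene.

    values -- expression values (RPKM or TPM), one per sample
    labels -- label (e.g. tissue) of each sample, aligned with values
    Returns (average per label, average of normalized expression per label).
    """
    # Winsorize at the cap, then log2-transform with pseudocount 1.
    transformed = [math.log2(min(x, cap) + 1) for x in values]
    gene_mean = sum(transformed) / len(transformed)

    average, normalized = {}, {}
    for label in sorted(set(labels)):
        in_label = [t for t, l in zip(transformed, labels) if l == label]
        average[label] = sum(in_label) / len(in_label)
        # Centering each sample on the gene mean and then averaging per
        # label equals the label average minus the gene mean.
        normalized[label] = average[label] - gene_mean
    return average, normalized


# Toy gene (hypothetical): low in tissue A, higher in tissue B.
average, normalized = heatmap_values([1, 1, 7, 7], ["A", "A", "B", "B"])
```

For this toy gene the averages are 1 and 3 (comparable across genes), whereas the normalized values are -1 and +1 (comparable only across labels within the gene).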

#### 3. Tissue specificity

Tissue specificity is tested using the differentially expressed genes defined for each label of each expression data set.

Differentially Expressed Gene (DEG) Sets
DEG sets were pre-calculated by performing a two-sided t-test for any one of the labels against all others. For this, expression values were normalized (zero-mean) following a log 2 transformation of the expression value (RPKM or TPM). Genes with P-value ≤ 0.05 after Bonferroni correction and absolute log fold change ≥ 0.58 were defined as differentially expressed genes in a given label compared to others. On top of DEG, up-regulated DEG and down-regulated DEG were also pre-calculated by taking the sign of the t-statistic into account.

Input genes were tested against each of the DEG sets using the hypergeometric test. The background genes are genes that have average expression value > 1 in at least one of the labels and exist in the user selected background genes. Significant enrichment at Bonferroni corrected P-value ≤ 0.05 are coloured in red.
Note that for DEG sets, Bonferroni correction is performed for each of up-regulated, down-regulated and both-sided DEG sets separately.
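
The enrichment test can be sketched with the Python standard library as follows. This is a minimal illustration rather than FUMA's implementation: the function names and toy gene sets are hypothetical, and the Bonferroni correction is applied over the gene sets passed in one call (i.e. one call per DEG category).

```python
from math import comb


def hypergeom_pvalue(n_background, n_set, n_input, n_overlap):
    """P(overlap >= n_overlap) when n_input genes are drawn without
    replacement from n_background genes, n_set of which are in the set."""
    upper = min(n_set, n_input)
    favourable = sum(
        comb(n_set, k) * comb(n_background - n_set, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_input)


def enrichment(input_genes, gene_sets, background):
    """Return Bonferroni corrected P-values per gene set.

    Input genes and gene sets are first restricted to the background.
    """
    background = set(background)
    genes = set(input_genes) & background
    corrected = {}
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = len(genes & members)
        p = hypergeom_pvalue(len(background), len(members), len(genes), overlap)
        corrected[name] = min(1.0, p * len(gene_sets))
    return corrected


# Toy example (hypothetical gene IDs): 10 background genes, 2 DEG sets.
background = [f"g{i}" for i in range(10)]
deg_sets = {"TissueA_up": ["g0", "g1", "g2"], "TissueB_up": ["g7", "g8", "g9"]}
result = enrichment(["g0", "g1", "g2"], deg_sets, background)
```

In the toy example all three input genes fall in `TissueA_up`, giving an uncorrected P-value of 1/120 and a corrected P-value of 1/60 for two tested sets.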

Results and images are downloadable as text files and in several image file formats.

#### 4. Gene Sets

Hypergeometric tests are performed to test if genes of interest are overrepresented in any of the pre-defined gene sets. Multiple test correction is performed per category (i.e. canonical pathways, GO biological processes and so on, separately). Gene sets were obtained from MSigDB, WikiPathways and reported genes from the GWAS-catalog.

The full results are downloadable as a text file at the top of the page.
In each category, plot view and table view are selectable. In the plot view, images are downloadable in several file formats.

#### 5. Gene Table

Input genes are mapped to OMIM ID, UniProt ID, Drug ID of DrugBank and links to GeneCards. Drug IDs are assigned if the UniProt ID of the gene is one of the targets of the drug.
Each link to OMIM, Drugbank and GeneCards will open in a new tab.

## Cell type specificity analyses with scRNA-seq

### How to perform cell type specificity analysis on FUMA

We use MAGMA gene-property analysis to test cell type specificity of a phenotype with GWAS summary statistics. As an input, it requires the XXX.genes.raw file from the MAGMA gene analysis. You can either select your existing SNP2GENE job or upload a MAGMA output file that you generated yourself.

FUMA uses Ensembl gene ID for all scRNA-seq data. If the input file contains different gene ID or gene symbols, FUMA will map to Ensembl gene ID. To do so, please UNCHECK the option right below the file selection, "Ensembl gene ID is used in the provided file". Otherwise MAGMA will result in an error due to mismatch of gene ID.

From FUMA v1.3.4, a 3-step workflow is implemented for the cell type analysis. This workflow consists of
• Step 1: per dataset analysis (same as implemented in v1.3.4)
After multiple testing correction across selected datasets, significant cell types were retained for Step 2.
• Step 2: within dataset conditional analysis
Identify independent signals per dataset by performing forward-selection.
• Step 3: cross datasets conditional analysis
Cell types retained from Step 2 are further conditioned on each other across datasets to disentangle the relationship between associations of cell types from different datasets.
Please refer to 3-step workflow for details.
Note that step 2 and step 3 are not activated by default. To perform entire workflow, please CHECK the options.

### MAGMA gene-property analysis with scRNA-seq

The gene-property analysis aims to test relationships between cell specific gene expression profiles and disease-gene associations. The gene-property analysis is based on the regression model,

$$Z = \beta_0 + E_c\beta_E + A\beta_A + B\beta_B + \epsilon$$

where $$Z$$ is a gene-based Z-score converted from the gene-based P-value and $$B$$ is a matrix of several technical confounders included by default. $$E_c$$ is the gene expression value of a testing cell type $$c$$ and $$A$$ is the average expression across cell types in a data set, defined as follows:

$$E_c = \frac{1}{n}\sum_{i=1}^{n} \log_2(e_i + 1)$$

$$A = \frac{1}{N}\sum_{j \in C} E_j$$

where $$n$$ is the number of cells in the cell type $$c$$, $$e_i$$ is the expression value of cell $$i$$ in the cell type $$c$$ (e.g. UMI count or CPM), $$N$$ is the number of cell types in a data set and $$C = \{cell\ type\ 1, cell\ type\ 2, ..., cell\ type\ N\}$$. Note that log transformation was omitted when available data was already log transformed.
We performed a one-sided test ($$\beta_E>0$$) which is essentially testing the positive relationship between cell specificity and genetic association of genes.
In principle, this model is same as tissue specificity analyses with MAGMA on SNP2GENE process where tissue specific expression was used instead of cell specific expression.
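
The two covariates of the model, the per cell type average and the average across cell types, can be computed for one gene as in the following Python sketch. The function name, the input layout and the toy values are hypothetical; the `already_log` switch reflects the note that the log transformation is omitted for data that is already log transformed.

```python
import math


def cell_type_covariates(cells, already_log=False):
    """Compute E_c per cell type and the average A for one gene.

    cells -- dict: cell type -> list of per-cell expression values
             (e.g. UMI counts or CPM)
    Returns (dict of E_c per cell type, A).
    """
    e = {}
    for cell_type, values in cells.items():
        logged = values if already_log else [math.log2(v + 1) for v in values]
        e[cell_type] = sum(logged) / len(logged)  # E_c: mean over n cells
    a = sum(e.values()) / len(e)  # A: mean of E_j over the N cell types
    return e, a


# Toy gene (hypothetical values): two cells per cell type.
cells = {"neuron": [3, 3], "astrocyte": [0, 0], "microglia": [1, 1]}
e_c, average = cell_type_covariates(cells)
```

For this gene, $$E_c$$ is 2, 0 and 1 for the three cell types and $$A$$ is 1. Note that $$A$$ averages the per cell type values, so cell types with many cells do not dominate it.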

The file format of scRNA-seq data set is, Ensembl gene ID in the first column with column name "GENE", N columns for per cell type average expression and average expression across cell types with column name "Average". MAGMA gene-property analysis is run with the following command.

From FUMA v1.3.4 (MAGMA v1.07)

    magma --gene-results [input file name].genes.raw \
        --gene-covar [file name of selected scRNA-seq data set] \
        --model condition-hide=Average direction=greater \
        --out [output file name]

Until FUMA v1.3.3d (MAGMA v1.06)

    magma --gene-results [input file name].genes.raw \
        --gene-covar [file name of selected scRNA-seq data set] condition=Average onesided=greater \
        --out [output file name]

The extension of output files from gene-property analysis using MAGMA v1.06 and v1.07 are different. Please refer to Outputs for details.

### 3-step workflow with multiple scRNA-seq data sets

Since integration of scRNA-seq data across datasets is highly challenging due to complex batch effects, the 3-step workflow aims to bypass this problem by systematically comparing cell type associations across datasets using conditional analyses.

#### Step 1. per dataset cell type analysis

In the first step, MAGMA cell specificity analyses are performed for each of the user selected datasets separately using the regression model described in the previous section.
Multiple testing correction is applied to the results for all tested cell type across datasets and significant cell types are retained for the next step. For example, when dataset A with 5 cell types and B with 10 cell types are selected, then multiple test correction is performed for 15 tested cell types.
Note that outputs (both plots in result page and output files) also include adjusted P-value per dataset.

#### Step 2. within dataset conditional analysis

The second step is a within dataset conditional analysis. It is often the case that there are multiple similar cell types defined in a scRNA-seq dataset, especially when the resolution of cell types is high. The gene expression profiles of those cell types tend to strongly correlate with each other, and when a cell type is strongly associated with a trait it is therefore not clear whether that reflects a genuine involvement of that cell type or whether there is confounding due to expression in another cell type correlated with it.
In step 2, a systematic step-wise conditional analysis per dataset is performed, by setting thresholds for the proportional significance ($$PS$$) of the conditional P-value of a cell type relative to the marginal P-value as described in the table.
$$PS$$ is defined as

$$PS_{a,b} = \frac{-\log_{10}(p_{a,b})}{-\log_{10}(p_a)}$$

where $$p_a$$ is the marginal P-value for the cell type $$a$$ using the base-model

$$Z = \beta_0 + E_a\beta_{E_a} + A\beta_A + B\beta_B + \epsilon$$

and $$p_{a,b}$$ is the conditional P-value of the cell type $$a$$ conditioning on the cell type $$b$$ using the following model,

$$Z = \beta_0 + E_a\beta_{E_a} + E_b\beta_{E_b} + A\beta_A + B\beta_B + \epsilon$$

In summary, forward selection (retaining the cell type with the lowest marginal P-value) was performed for a pair of cell types which were jointly explained ($$PS_{a,b}$$<0.2 and $$PS_{b,a}$$<0.2) or for which one association was mainly driving the other's ($$PS_{a,b}$$≥0.5 and $$p_{b,a}$$≥0.05, or $$PS_{a,b}$$≥0.8 and $$PS_{b,a}$$<0.5). In the case of partially joint associations ($$PS_{a,b}$$≥0.5 and $$PS_{b,a}$$≥0.5) or independent associations ($$PS_{a,b}$$≥0.8 and $$PS_{b,a}$$≥0.8), both cell types were retained.
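
As a small worked example of the proportional significance, the following Python sketch computes $$PS$$ from a marginal and a conditional P-value. The function name and the P-values are hypothetical and chosen only to illustrate the arithmetic.

```python
import math


def proportional_significance(p_conditional, p_marginal):
    """PS = -log10(conditional P) / -log10(marginal P).

    Assumes p_marginal < 1; a marginal P-value of exactly 1 would give a
    zero denominator.
    """
    return math.log10(p_conditional) / math.log10(p_marginal)


# Hypothetical pair of cell types a and b.
p_a, p_b = 1e-10, 1e-6         # marginal P-values
p_a_given_b = 1e-8             # P-value of a when conditioning on b
p_b_given_a = 1e-1             # P-value of b when conditioning on a

ps_ab = proportional_significance(p_a_given_b, p_a)  # 8 / 10
ps_ba = proportional_significance(p_b_given_a, p_b)  # 1 / 6
```

Here cell type $$a$$ keeps 80% of its marginal significance after conditioning on $$b$$, while $$b$$ keeps only about 17% after conditioning on $$a$$ and is no longer nominally significant ($$p_{b,a}$$ ≥ 0.05).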

Forward selection criteria when cell type $$a$$ shows a lower marginal P-value than cell type $$b$$ (scenarios are ordered by priority)

| Scenario | Cell type a | Cell type b | Cell type a state | Cell type b state | Description |
| --- | --- | --- | --- | --- | --- |
| 1 | $$PS_{a,b} \geq 0.8$$ | $$PS_{b,a} \geq 0.8$$ | indep | indep | The associations of cell types $$a$$ and $$b$$ are independent. |
| 2 | $$p_{a,b} \geq 0.05$$ | $$p_{b,a} \geq 0.05$$ | joint | joint-drop | The associations of cell types $$a$$ and $$b$$ depend on each other, and the model cannot distinguish between them. Cell type $$a$$ is retained and $$b$$ is dropped because $$a$$ has the more significant marginal P-value, but this does not mean that the association of $$a$$ is true and that of $$b$$ is not. |
| 3 | $$PS_{a,b} < 0.2$$ | $$PS_{b,a} < 0.2$$ | joint | joint-drop | Similar to scenario 2, but the associations of cell types $$a$$ and $$b$$ are not completely explained by each other. Only cell type $$a$$ is retained, as the significance of cell type $$b$$ drops to less than 20% of its marginal association. The output (state of cell types) is exactly the same as in scenario 2, but there might still be some signal specific to each of $$a$$ and $$b$$. |
| 4 | $$PS_{a,b} \geq 0.5$$ | $$p_{b,a} \geq 0.05$$ | main | drop | The association of cell type $$b$$ depends completely on the association of cell type $$a$$. Only cell type $$a$$ is retained. |
| 5 | $$PS_{a,b} \geq 0.8$$ | $$PS_{b,a} < 0.2$$ | main | partial-drop | The association of cell type $$b$$ depends mostly on the association of cell type $$a$$, but $$a$$ cannot completely explain the association of $$b$$. Only cell type $$a$$ is retained, as the significance of cell type $$b$$ drops to less than 20% of its marginal association; however, some signal remains (since the conditional P-value is still less than 0.05). |
| 6 | $$PS_{a,b} \geq 0.5$$ | $$PS_{b,a} \geq 0.5$$ | partial-joint | partial-joint | The associations of cell types $$a$$ and $$b$$ are only partially explained by each other, and the majority of the signal comes from independent associations. Both cell types $$a$$ and $$b$$ are retained. |
| 7 | $$PS_{a,b} \geq 0.2$$ | $$PS_{b,a} \geq 0.2$$ | partial-joint | partial-joint-drop | Similar to scenario 6, but a larger proportion of the signal is explained by each other. Only cell type $$a$$ is retained, as cell type $$b$$ remains at less than 20% of its marginal significance; however, there might still be a specific underlying signal for cell type $$b$$. |
| 8 | $$PS_{a,b} \geq 0.5$$ | $$PS_{b,a} < 0.2$$ | partial-joint | joint-drop | The association of cell type $$b$$ is mostly explained by cell type $$a$$, but part of the association depends on both cell types. Only cell type $$a$$ is retained. |
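The priority ordering of the scenarios can be read as a first-match rule list. The Python sketch below transcribes the thresholds of the table literally and returns the first scenario whose conditions hold. It is not FUMA's actual implementation. The function names are hypothetical, and threshold combinations that the table does not list simply return `None`.

```python
import math

ALPHA = 0.05  # threshold applied to conditional P-values in the table


def ps(p_conditional, p_marginal):
    """Proportional significance: -log10(conditional) / -log10(marginal)."""
    return math.log10(p_conditional) / math.log10(p_marginal)


def classify_pair(p_a, p_b, p_ab, p_ba):
    """Return (scenario, state of a, state of b) for one pair of cell types.

    Assumes cell type a has the lower marginal P-value (p_a <= p_b).
    Rules are checked in table order, so the first match wins.
    Returns None when no row of the table applies.
    """
    ps_ab = ps(p_ab, p_a)
    ps_ba = ps(p_ba, p_b)
    rules = [
        (1, ps_ab >= 0.8 and ps_ba >= 0.8, "indep", "indep"),
        (2, p_ab >= ALPHA and p_ba >= ALPHA, "joint", "joint-drop"),
        (3, ps_ab < 0.2 and ps_ba < 0.2, "joint", "joint-drop"),
        (4, ps_ab >= 0.5 and p_ba >= ALPHA, "main", "drop"),
        (5, ps_ab >= 0.8 and ps_ba < 0.2, "main", "partial-drop"),
        (6, ps_ab >= 0.5 and ps_ba >= 0.5, "partial-joint", "partial-joint"),
        (7, ps_ab >= 0.2 and ps_ba >= 0.2, "partial-joint", "partial-joint-drop"),
        (8, ps_ab >= 0.5 and ps_ba < 0.2, "partial-joint", "joint-drop"),
    ]
    for scenario, matched, state_a, state_b in rules:
        if matched:
            return scenario, state_a, state_b
    return None


# Hypothetical P-values for three pairs of cell types.
independent = classify_pair(p_a=1e-10, p_b=1e-8, p_ab=1e-9, p_ba=1e-7)
driven = classify_pair(p_a=1e-10, p_b=1e-6, p_ab=1e-8, p_ba=0.3)
mostly_driven = classify_pair(p_a=1e-20, p_b=1e-10, p_ab=1e-18, p_ba=0.03)
print(independent, driven, mostly_driven)
```

Because the rules are evaluated in order, a pair that satisfies several rows (for example rows 1 and 6) is assigned to the row with the higher priority.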

Note that when the associations of two cell types are jointly explained, only the cell type with the lowest marginal P-value is retained for the third step. This does not mean that the discarded cell type is less important than the retained one; rather, the result suggests that the associations of these two cell types cannot be distinguished.
Although conditional P-values are often proportional to marginal P-values, the cell type with the higher marginal P-value can end up with the lower conditional P-value within a pair (i.e. $$p_{b,a}$$<$$p_{a,b}$$). Therefore, when $$PS_{a,b}$$<0.2 and $$PS_{b,a}$$≥0.2, the order of the cell types is flipped for forward selection.
Although only retained cell types are used for the third step, the results of the within-dataset conditional analyses for every pair of cell types are further broken down into the 8 categories described in the table. This provides a better understanding of the relationship between two significantly associated cell types. For example, in both scenarios 4 and 5, cell type $$b$$ is dropped and cell type $$a$$ is considered the main driver of the association. However, in scenario 5, the association of cell type $$b$$ cannot be completely explained by cell type $$a$$, as the conditional P-value of cell type $$b$$ is still <0.05. There might therefore still be a signal unique to cell type $$b$$, but because a large amount of its significance is lost, cell type $$b$$ is not retained for the next step.
Note that step 2 is only performed for datasets in which more than one cell type reached significance after multiple testing correction across datasets.

#### Step 3. cross datasets conditional analysis

The last step is to unravel relationships between significantly associated cell types across datasets. Although the absolute gene expression values in different datasets are not directly comparable, cross-datasets conditional analysis allows us to test the extent to which the significant gene expression profiles found in different data sets reflect the same or similar association signals. The analysis is performed for all possible cross-dataset pairs of significant cell types retained from the second step. Then the $$PS$$ of the cross-datasets (CD) conditional P-value of a cell type relative to the CD marginal P-value is computed for each cell type of all possible pairs.
For each pair of cell types from different datasets, the following three regression models are tested to incorporate the effect of the average expression from the other dataset:

$$Z=\beta_0 + E_{c1}\beta_{E_{c1}} + A_1\beta_{A_1} + A_2\beta_{A_2} + B\beta_B + \epsilon$$

$$Z=\beta_0 + E_{c2}\beta_{E_{c2}} + A_1\beta_{A_1} + A_2\beta_{A_2} + B\beta_B + \epsilon$$

$$Z=\beta_0 + E_{c1}\beta_{E_{c1}} + E_{c2}\beta_{E_{c2}} + A_1\beta_{A_1} + A_2\beta_{A_2} + B\beta_B + \epsilon$$

where $$E_{cx}$$ is the average log-transformed expression of cell type $$c$$ from dataset $$x$$, and $$A_x$$ is the average expression across cell types in dataset $$x$$. In this step, the P-value for testing the alternative hypothesis $$\beta_{E_{cx}}>0$$ in the 1st and 2nd models is defined as the CD marginal P-value, and the P-value for testing $$\beta_{E_{cx}}>0$$ in the 3rd model is defined as the CD conditional P-value for cell type $$c$$ from dataset $$x$$.
Note that when the associations of two cell types from different datasets with a trait largely disappear after conditioning on each other, this suggests that the associations of those cell types are driven by similar genetic signals. It does not measure the similarity of the two cell types (i.e. it cannot be concluded that the cell types from the different datasets are the same).
Note that step 3 is only performed when there are significant cell types from more than one dataset.
Please be aware that, for some scRNA-seq data resources, multiple datasets are available from a single resource. For example, for the Tabula Muris FACS data, one dataset contains all cell types from all 20 tissues, and there are 20 further datasets, one for each tissue separately. When both the TabulaMuris_FACS_all and TabulaMuris_FACS_Aorta datasets are selected, for instance, exactly the same cell types in the Aorta dataset also exist in the dataset with all tissues. Testing both datasets is still relevant, as the average expression across cell types differs per dataset; however, step 3 is not relevant in this case, as they are exactly the same cell types. When step 3 is activated, FUMA will still test all possible pairs of significant cell types across datasets. However, a pair consisting of exactly the same cell type will be collinear (MAGMA outputs NA for such pairs).
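The set of pairs tested in step 3 can be enumerated as follows. This is a minimal Python sketch of the pairing rule only (every retained cell type against every retained cell type from a different dataset). The function name and the dataset and cell type labels are hypothetical, and the regression itself is left to MAGMA.

```python
from itertools import combinations


def cross_dataset_pairs(retained):
    """List all cross-dataset pairs of cell types retained after step 2.

    retained: {dataset: [cell types retained after step 2]}
    Cell types from the same dataset are never paired with each other.
    Returns an empty list when fewer than two datasets have retained
    cell types, in which case step 3 is not performed.
    """
    pairs = []
    for ds1, ds2 in combinations(sorted(retained), 2):
        for ct1 in retained[ds1]:
            for ct2 in retained[ds2]:
                pairs.append(((ds1, ct1), (ds2, ct2)))
    return pairs


# Hypothetical example: three datasets with retained cell types.
retained = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"]}
pairs = cross_dataset_pairs(retained)
print(len(pairs))  # 5
```

For each returned pair, the CD marginal and CD conditional P-values of both cell types would then be used to compute the $$PS$$ in the same way as in step 2.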

### Data sets

Each data set available on FUMA is described in detail below. Scripts for pre-processing are available in the GitHub repository at https://github.com/Kyoko-wtnb/FUMA_scRNA_data. Processed data can also be downloaded from this repository (so you can run the analysis yourself!).
Pre-processing was performed in the following steps. Please see each script for more details.
1. When the obtained value was the read count, the count was converted into counts per million (CPM) to correct for the total number of reads per cell.
2. QC of cells was performed as described in the original study, unless the obtained data was already QCed.
3. Cells with uninformative cell type labels (e.g. ‘unclassified’ or ‘unknown’) were excluded, unless specified otherwise.
4. The expression value (UMI count, CPM, RPKM or TPM) was log2 transformed with a pseudo-count of 1 (unless this was already done), and the average per gene per cell type was computed. When there were multiple levels of cell type labels, the average expression was computed for each level separately.
5. Genes provided in the processed datasets were mapped to human Ensembl gene IDs (v92 GRCh37).
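Steps 1, 3 and 4 above can be sketched in Python as follows. This is not the actual pre-processing code (see the scripts in the repository for that). It is a simplified illustration with hypothetical function names and toy data, and it omits cell QC (step 2) and the mapping to human Ensembl gene IDs (step 5).

```python
import math
from collections import defaultdict


def to_cpm(counts):
    """Convert the read counts of one cell to counts per million (step 1)."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def average_log_expression(cells, labels, excluded=("unclassified", "unknown")):
    """Average log2(CPM + 1) per gene per cell type (steps 3 and 4).

    cells:  {cell_id: {gene: read count}}
    labels: {cell_id: cell type label}
    Cells with an uninformative label are skipped.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_cells = defaultdict(int)
    for cell_id, counts in cells.items():
        label = labels[cell_id]
        if label.lower() in excluded:
            continue
        n_cells[label] += 1
        for gene, cpm in to_cpm(counts).items():
            sums[label][gene] += math.log2(cpm + 1)
    return {
        cell_type: {gene: s / n_cells[cell_type] for gene, s in genes.items()}
        for cell_type, genes in sums.items()
    }


# Toy data: two neurons and one cell with an uninformative label.
cells = {
    "c1": {"g1": 1, "g2": 3},
    "c2": {"g1": 2, "g2": 2},
    "c3": {"g1": 5, "g2": 5},
}
labels = {"c1": "neuron", "c2": "neuron", "c3": "unknown"}
avg = average_log_expression(cells, labels)
print(avg)
```

When a dataset provides several levels of cell type labels, the same function would be called once per level with the corresponding `labels` mapping.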

Data name Link Description Reference Last update
Tabula Muris FACS: https://figshare.com/articles/Single-cell_RNA-seq_data_from_Smart-seq2_sequencing_of_FACS_sorted_cells_v2_/5829687 , droplet: https://figshare.com/articles/Single-cell_RNA-seq_data_from_microfluidic_emulsion/5715025 Multiple tissues/organs of mouse samples.
FACS: From 53,760 cells in the raw read count matrix, 44,949 cells exist in the annotation file. Cells with label "unknown" were included, as it was stated in the original study that they are potential novel cell types. From 23,433 genes, 15,131 genes were mapped to hs ENSG ID. In total, 22 data sets were created (1 for cell types from all tissues/organs together, 20 for each tissue/organ separately and 1 for all Brain cell types (including both Brain_Myeloid and Brain_Non-Myeloid)).
droplet: From 2,990,808 cells in the raw read count matrix, 54,837 cells exist in the annotation file. Cells with label "unknown" were included as it was stated in the original study that they are potential novel cell types. From 23,433 genes, 15,131 genes were mapped to hs ENSG ID. In total, 13 data sets were created (1 for cell types from all tissues/organs together and 12 for each tissue/organ separately).
The Tabula Muris Consortium et al. 2018. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 562, 367-372.
PMID: 30283141
4 Feb 2019
Mouse Cell Atlas (GSE108097) Website: http://bis.zju.edu.cn/MCA/, Data: https://figshare.com/s/865e694ad06d5857db4b Multiple tissues/organs of mouse samples.
The file "Figure2-batch-removed.txt.gz" was used, in which batch effects were removed and cells were already QCed. 61,637 cells were available and no additional filtering was performed. From 25,133 genes, 15,640 genes were mapped to hs ENSG ID. In total, 37 data sets were created as follows: 1) all tissues/developmental stages together (731 unique cell types), 2) only adult mouse samples (437 cell types from 18 tissues), 3) only embryo samples (including fetal tissues, 137 cell types), 4) only neonatal samples (108 cell types), 5-37) per tissue per sample type (adult, embryo, neonatal and cell line, 33 combinations in total).
Han et al. 2018. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell. 172, 1091-1107.
PMID: 29474909
17 July 2018
Allen Brain Atlas Cell Type Human
Mouse new version
Mouse old version
Human and mouse brain samples.
For each data, level 1, 2 and 3 cell types were processed separately.
For Human MTG and LGN and mouse ALM2, LGd2 and VISp2, the sum of read counts for exons and introns was computed for each gene to obtain gene-level read counts.
Human MTG: From 15,928 cells, 325 with "no class" were excluded, resulting in 15,603 cells. From 50,281 genes, 29,115 genes were mapped to unique ENSG ID.
Human LGN: From 1,576 cells, 23 with "no class" were excluded, resulting in 1,553 cells. From 50,281 genes, 29,115 genes were mapped to unique ENSG ID.
Mouse ALM2: From 10,068 cells, 99 with "Low Quality" and 396 cells with "no class" were excluded, resulting in 9,573 cells. From 45,768 genes, 16,093 genes were mapped to unique hs ENSG ID.
Mouse LGd2: From 1,996 cells, 122 with "Outlier" were excluded, resulting in 1,874 cells. From 45,768 genes, 16,093 genes were mapped to unique hs ENSG ID.
Mouse VISp2: From 15,413 cells, 490 with "Low Quality" and 674 cells with "no class" were excluded, resulting in 14,249 cells. From 45,768 genes, 16,093 genes were mapped to unique hs ENSG ID.
Mouse ALM: All 1,301 cells were used. From 45,764 genes, 16,068 genes were mapped to unique hs ENSG ID.
Mouse LGd: From 1,827 cells, 17 cells with label "Outlier" were excluded, resulting in 1,810 cells. From 45,761 genes, 15,837 genes were mapped to unique hs ENSG ID.
Mouse VISp: All 1,679 cells were used. From 24,057 genes, 15,097 genes were mapped to unique hs ENSG ID.
(For Human MTG) Hodge et al. 2018. Conserved cell types with divergent features between human and mouse cortex. bioRxiv. doi: https://doi.org/10.1101/384826
(For Mouse VISp2 and ALM2) Tasic et al. 2018. Shared and distinct transcriptomic cell types across neocortical areas. Nature. 563, 72-78. PMID: 30382198
(For Mouse VISp data set) Tasic et al. 2016. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335-346. PMID: 26727548
4 Feb 2019
DropViz http://dropviz.org/ Mouse brain samples.
"Metacells" data downloaded from the DropViz website was used, which is aggregated per sub-cluster (565 sub-clusters) rather than individual cell-level UMI counts. The UMI count was the sum over all cells in a sub-cluster, so it was converted to CPM. In the annotation of each sub-cluster, the "class" column was used as the level 1 cell type and the sub-cluster was used as level 2. From 32,307 genes, 16,097 genes were mapped to hs ENSG ID.
Saunders et al. 2018. Molecular diversity and specializations among the cells of the adult mouse brain. Cell. 174, 1015-1030. PMID: 30096299 4 Feb 2019
DroNc Human: https://www.gtexportal.org/ / https://portals.broadinstitute.org/single_cell#study-dronc-seq-single-nucleus-rna-seq-on-human-archived-brain, Mouse: https://portals.broadinstitute.org/single_cell#study-dronc-seq-single-nucleus-rna-seq-on-mouse-archived-brain Human and mouse brain samples.
Human: Expression data was downloaded from GTEx website (also available from Broadinstitute Single Cell Portal). Cells with cluster 1-14 or 16 were used since those clusters were assigned in the original study. The cell type label was manually assigned to the cluster index based on the figure 2a in the original paper. From 14,963 cells, 14,137 cells were used. From 32,111 genes, 31,852 genes were mapped to ENSG ID.
Mouse: Cells with label "Unclassified", "Doublets" or "ChP" were excluded as they are not assigned in the original study. From 13,313 cells, 11,148 cells were used. From 17,3080 genes, 13,335 genes were mapped to hs ENSG ID.
Habib et al. 2017. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat. Methods. 14, 955-958.
PMID: 28846088
17 July 2018
Mouse Brain Atlas (Linnarsson's lab) http://mousebrain.org/ Mouse brain samples.
Five expression matrices were obtained, for level 5 and for level 6 rank 1-4. Note that the expression values were already aggregated per cell type and individual cell-level expression data was not used. Each of the 5 data sets was processed separately. From 27,997 genes, 16,420 genes were mapped to hs ENSG ID.
Zeisel et al. 2018. Molecular architecture of the mouse nervous system. Cell. 174, 999-1014. PMID: 30096314 4 Feb 2019
GSE59739 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59739 Mouse brain samples (dorsal root ganglion L4-L6 from 6-8 weeks old mice).
Expression data was obtained from GEO and annotation of each cell was extracted from family soft file. Cells with label NF, NP, PEP or TH in Level 1 cell types were used to be consistent with the original study. From 865 cells in the expression data, 622 cells were used. From 25,333 genes, 15,084 genes were mapped to hs ENSG ID. Per cell type average expression was computed for level 1, 2 and 3 separately.
Usoskin et al. 2015. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 18, 145-153.
PMID: 25420068
17 July 2018
3,005 cells were available. From 19,972 genes, 15,161 genes were mapped to hs ENSG ID. Per cell type average expression was computed for level 1 and level 2 separately. For level 2, 189 cells with label "none" were excluded.
Zeisel et al. 2015. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 347, 1138-1142.
PMID: 25700174
17 July 2018
GSE75330 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75330 Mouse brain samples (oligodendrocytes from day21-90 mice).
5,069 cells were available. From 23,556 genes, 15,816 genes were mapped to hs ENSG.
Marques et al. 2016. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system. Science. 352, 1326-1329.
PMID: 27784195
17 July 2018
GSE78845 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78845 Mouse brain samples (stellate and thoracic sympathetic ganglia from postnatal day 27-33 mice).
Cells with label "unclassified" were excluded. From 298 cells, 213 cells were used. From 16,892 genes, 13,804 genes were mapped to hs ENSG ID.
Furlan et al. 2016. Visceral motor neuron diversity delineates a cellular basis for nipple- and pilo-erection muscle control. Nat. Neurosci. 19, 1331-1340.
https://www.ncbi.nlm.nih.gov/pubmed/27571008
17 July 2018
GSE76381 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76381 Human brain samples (ventral midbrain from 6-11 weeks embryos) and mouse brain samples (ventral midbrain from E11.5-E18.5 embryos).
Only the human embryo (1,977 cells) and mouse embryo (1,907 cells) data sets were used. Cells with label "Unk" (unknown) were excluded. For human, from 1,977 cells, 1,695 cells were used. From 19,531 genes, 16,885 genes were mapped to ENSG ID. For mouse, from 1,907 cells, 1,518 cells were used. From 24,378 genes, 15,826 genes were mapped to hs ENSG ID.
La Manno et al. 2016. Molecular diversity of midbrain development in mouse, human, and stem cells. Cell. 167, 556-580.
PMID: 27716510
17 July 2018
GSE95752, GSE95315 and GSE104323 (Linnarsson's lab) GSE95752: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95752, GSE95315: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95315, GSE104323: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104323 Mouse brain samples (dentate gyrus from P5-P26 and P50-P65 for GSE95752, P12-P35 for GSE95315 and E16.5 and P0-P132 for GSE104323).
GSE95752: 2,303 cells were available. From 16,131 genes, 143,470 genes were mapped to hs ENSG ID.
GSE95315: Cell types obtained from the family soft file (41 types) were merged into 22 clusters as presented in the original study. 5,454 cells were available. From 14,545 genes, 12,640 genes were mapped to hs ENSG ID.
GSE104323: From 24,216 cells, 24,185 cells with valid cell labels were used (cells with blank in the cell type column were excluded). From 27,933 genes, 16,146 genes were mapped to hs ENSG ID.
Hochgerner et al. 2018. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat. Neurosci. 21, 290-299.
PMID: 29335606
17 July 2018
GSE101601 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101601 Human brain samples (Temporal cortex from post-mortem samples) and mouse brain samples (Somatosensory cortex from postnatal days 21-37 mice).
Human (2,028 cells) and mouse (2,192 cells) data sets were processed separately. For human, from 28,274 genes, 21,459 genes were mapped to ENSG ID. For mouse, from 24,339 genes, 15,826 genes were mapped to hs ENSG ID.
Hochgerner et al. 2017. STRT-seq-2i: dual-index 5' single cell and nucleus RNA-seq on an addressable microwell array. Sci. Rep. 7: 16327.
PMID: 29180631
17 July 2018
GSE74672 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74672 Mouse brain samples (hypothalamus from postnatal days 14-28 mice).
Only 2,881 cells were available in the expression file, although 3,131 cells were mentioned in the original paper. From 24,341 genes, 15,826 genes were mapped to hs ENSG ID. Per cell type average expression was computed for level 1 and 2 separately. The level 2 label was only available for neurons. From 898 neurons, 126 cells with level 2 label "uc" (unclassified) were excluded.
Romanov et al. 2017. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nat. Neurosci. 20, 176-188.
PMID: 27991900
17 July 2018
GSE67602 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67602 Mouse epidermis from dorsal skin (~8 weeks).
1,422 cells were available. From 25,932 genes, 15,802 genes were mapped to hs ENSG.
Joost et al. 2016. Single-cell transcriptomics reveals that differentiation and spatial signatures shape epidermal and hair follicle heterogeneity. Cell Syst. 3, 221-237.
PMID: 27641957
17 July 2018
GSE103840 (Linnarsson's lab) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103840 Mouse brain samples (dorsal horn from 3-4 weeks old mice).
1,545 cells were available. From 24,378 genes, 15,826 genes were mapped to hs ENSG ID.
Haring et al. 2018. Neuronal atlas of the dorsal horn defines its architecture and links sensory input to transcriptional cell types. Nat. Neurosci. 21, 869-880.
PMID: 29686262
17 July 2018
GSE87544 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87544 Mouse brain samples (hypothalamus from 8-10 weeks old mice).
From 14,437 cells, 6,507 cells with condition "Normal" were extracted. Cells with label "zothers" were further excluded, resulting in 5,350 cells. To be consistent with the original study, cells with <=2000 genes expressed (0 expression) were excluded. In total, 1,039 cells were used. From 23,284 genes, 15,116 genes were mapped to hs ENSG ID. In the original study there are 45 cell types, but the downloadable data contained no NFO and instead contained IMO and SCO. According to the authors, IMO (immature oligodendrocyte) = NFO and SCO (subcommissural organ) is an extra cell type.
Chen et al. 2017. Single-cell RNA-seq reveals hypothalamic cell diversity. Cell Rep. 18, 3227-3241.
PMID: 28355573
17 July 2018
GSE98816 and GSE92235 GSE98816: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98816, GSE92235: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92235 Mouse brain vascular cells (GSE98816) and lung vascular cells (GSE92235) from 10-19 weeks old mice.
Cell type label was obtained directly from the authors by requesting.
GSE98816: 3,186 cells were available. From 19,937 genes, 15,302 genes were mapped to hs ENSG ID.
GSE92235: 1,504 cells were available. From 21,948 genes, 15,801 genes were mapped to hs ENSG ID.
Vanlandewijck et al. 2018. A molecular atlas of cell types and zonation in the brain vasculature. Nature. 554, 475-480.
PMID: 29443965
17 July 2018
GSE81547 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81547 Human pancreas samples (healthy donors between 1 month to 54 years old).
2,544 cells were available. Cells with label "unsure" were defined as "PP" in the original study. From 23,465 genes, 20,706 genes were mapped to hs ENSG ID.
Enge et al. 2017. Single-cell analysis of human pancreas reveals transcriptional signatures of ageing and somatic mutation patterns. Cell. 171, 321-330.
PMID: 28965763
17 July 2018
GSE104276 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104276 Human brain samples (prefrontal cortex from 8-26 weeks after gestation).
2,309 cells were available. From 24,153 genes, 21,177 genes were mapped to ENSG ID. Two data sets were created; 1) per cell type average across different ages, 2) per cell type per age average expression.
Zhong et al. 2018. A single-cell RNA-seq survey of the developmental landscape of the human prefrontal cortex. Nature. 555, 524-528.
PMID: 29539641
17 July 2018
GSE82187 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE82187 Mouse brain samples (striatum from 5-7 weeks old mice).
Only microfluid data was used since FACS data was limited to neurons. From 1,208 cells, 705 cells from microfluid were used. From 18,840 genes, 14,189 genes were mapped to hs ENSG ID.
Gokce et al. 2016. Cellular taxonomy of the mouse striatum as revealed by single-cell RNA-seq. Cell Rep. 16, 1126-1137.
PMID: 27425622
17 July 2018
GSE89232 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89232 Human blood samples.
957 cells were available. From 20,689 genes, 17,035 genes were mapped to hs ENSG ID.
Breton et al. 2016. Human dendritic cells (DCs) are derived from distinct circulating precursors that are precommitted to become CD1c+ or CD141+ DCs. J. Exp. Med. 213, 2861-2870.
PMID: 27864467
17 July 2018
GSE100597 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100597 Mouse embryos (E3.5, E4.5, E5.5 and E6.5).
Developmental stage was used as cell label. 721 cells were available. From 24,83 genes, 14,513 genes were mapped to hs ENSG ID.
Mohammed et al. 2017. Single-cell landscape of transcriptional heterogeneity and cell fate decisions during mouse early gastrulation. Cell Rep. 20, 1215-1228.
PMID: 28768204
17 July 2018
GSE93374 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93374 Mouse brain samples (hypothalamic arcuate-median eminence complex from 4-12 weeks old mice).
Cells with label "miss" in the column "clust_all" were excluded. From 21,086 cells, 20,921 cells were used. Level 1, level 2 and clusters for neurons were processed separately resulted in three data sets. For clusters for neurons, non-neuronal cells were excluded (with label "miss" in "clust_neurons" column; 13,079 neuronal cells in total). From 19,743 genes, 14,366 genes were mapped to hs ENSG ID.
Campbell et al. 2017. A molecular census of arcuate hypothalamus and median eminence cell types. Nat. Neurosci. 20, 484-496.
PMID: 28166221
17 July 2018
GSE92332 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92332 Mouse small intestine epithelium samples (7-10 weeks old mice).
Expression data was obtained for the SMARTseq (1,522 cells) and droplet (7,216 cells) data sets. Each data set was processed separately. For SMARTseq, from 20,108 genes, 14,714 genes were mapped to hs ENSG ID. For droplet, from 15,971 genes, 12,865 genes were mapped to hs ENSG ID.
Haber et al. 2017. A single-cell survey of the small intestinal epithelium. Nature. 551, 333-339.
PMID: 29144463
17 July 2018
GSE89164 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89164 Mouse brain samples (hindbrain from P0 mice).
Two count matrices for mouse replicates were combined, and the 4,366 cells that exist in the cluster information were extracted. The cell label was manually assigned to the cluster index based on the original study. From 20,648 genes, 13,176 genes were mapped to hs ENSG ID.
Alles et al. 2017. Cell fixation and preservation for droplet-based single-cell transcriptomics. BMC Biol. 15: 44.
PMID: 28526029
17 July 2018
GSE67835 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835 Human brain samples (cortex from adult and fetal samples).
Two data sets with and without fetal sample were created. 466 cells were available (of which 135 cells were fetal samples). From 22,088 genes, 19,749 genes were mapped to ENSG ID.
Darmanis et al. 2015. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. USA. 112, 7285-90.
PMID: 26060301
17 July 2018
GSE106678 https://portals.broadinstitute.org/single_cell/study/snucdrop-seq-dissecting-cell-type-composition-and-activity-dependent-transcriptional-state-in-mammalian-brains-by-massively-parallel-single-nucleus-rna-seq Mouse brain samples (cortex from 6-10 weeks old mice).
Expression data was obtained from Broadinstitute Single Cell Portal. 18,194 cells were available. From 30,341 genes, 15,782 genes were mapped to hs ENSG ID.
Hu et al. 2017. Dissecting cell-type composition and activity-dependent transcriptional state in mammalian brains by massively parallel single-nucleus RNA-seq. Mol. Cell. 68, 1006-1015.
PMID: 29220646
17 July 2018
GSE84133 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133 Human and mouse pancreas samples.
Human: Human sample 4 was excluded as its sample status is T2D. All 7,266 cells were used, and from 20,125 genes, 19,546 genes were mapped to unique ENSG ID.
Mouse: All 1,886 cells were used, and from 14,878 genes, 12,741 genes were mapped to unique hs ENSG ID.
Baron, et al. 2016. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems 3, 346-360.
PMID: 27667365
4 Feb 2019
10x PBMC https://community.10xgenomics.com/t5/Data-Sharing/10x-Single-Cell-3-Paper-Zheng-et-al-2016-Datasets/td-p/231 Human peripheral blood mononuclear cells (PBMCs).
Cell labels were downloaded from the GitHub repository https://github.com/10XGenomics/single-cell-3prime-paper. The data set was downloaded directly from the 10X website. All 68,579 cells were used. Genes were annotated to ENSG ID in the original data (32,738 genes).
Zheng et al. 2017. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049.
PMID: 28091601
4 Feb 2019

## Publish results

### How to publish FUMA results

#### 1. Prepare jobs in SNP2GENE and GENE2FUNC

You can publish any existing SNP2GENE job in your account, provided that it finished without errors. MAGMA results are optional, but it is highly recommended to include them. You can also publish GENE2FUNC results together with SNP2GENE results if the GENE2FUNC job was performed for the mapped genes from the corresponding SNP2GENE job.

#### 2. Publish results

You can publish your results from your job list on the SNP2GENE page. There is a "publish" button for each SNP2GENE job.

When you click the "publish" button, a popup opens where you can specify some features of the job. Please fill in the features in the table below as completely as possible before submitting.

| Feature | Description |
| --- | --- |
| Selected SNP2GENE jobID | Auto-filled when you click the "publish" button. This value cannot be changed. |
| Corresponding GENE2FUNC jobID | Auto-filled when there is a recognized GENE2FUNC job. FUMA recognizes a matched GENE2FUNC job only when the GENE2FUNC job was submitted using the "GENE2FUNC" button (internal submission). If you manually submitted GENE2FUNC for the corresponding SNP2GENE job, you can specify the ID manually here. |
| Title | The title of the published job should be self-descriptive, although it is auto-filled with the title of the selected SNP2GENE job. If the title is not clear enough, the developer might contact you to ask for more information. |
| Author | Auto-filled with the user name, but please provide your full name. |
| Email | Auto-filled with your registered email address. Please provide an email address at which you can be reached. Any future modification/deletion of the published job will only be processed when it is requested from the matching email address. |
| Phenotype | Please provide the phenotype of the GWAS if applicable. |
| Publication | This is the publication in which the selected SNP2GENE job is described (not the reference for the summary statistics). This can be in any format, as long as users are able to find the publication. Please provide a PubMed ID if possible (e.g. PMID: 29184056). If you don't have a publication yet, please let the developer know once the publication becomes available. You can also provide a preprint DOI. |
| Link to summary statistics | If the summary statistics used in this job are publicly available, please provide the original link. |
| Reference of summary statistics | This should be the original publication of the summary statistics. This can be the same as the publication above when a new GWAS result is presented in that publication. |
| Notes | You can provide any additional information here (max 300 characters). For example, when there are multiple summary statistics available from the same study, you should specify which result this job refers to. |

#### 3. Check your published results

Published jobs are listed on the "Browse Public Results" page. Please have a look at your published job to check whether there is any problem.

### Modify/delete published result

Published jobs can be modified and deleted in a similar way to publishing a job. In the SNP2GENE job list, published jobs have an "edit" button instead of the "publish" button. By clicking this, you can update the features listed above or delete the published job. Note that deleting the published job does not delete the job from your account. Likewise, deleting the original job from your account does not delete the corresponding published job. If you want to modify/delete a published job whose original job has been deleted from your account, please contact the FUMA developer (Kyoko Watanabe: k.watanabe@vu.nl) and provide the ID of the published job.
Modification/deletion is only possible when the user is logged in with the same email address as the entry.

### Users' responsibility

We do not take any responsibility for your published results. Any question from other users that is specific to a published result must be answered by the user who published the result, not by the FUMA developer.

### What other users can do?

Any FUMA user is able to browse your published results and download any text files and images. They are also able to create regional plots with annotations.
The "Browse Public Results" page does not require users to log in.

## Data Security

### GWAS summary statistics

A user-uploaded input file (GWAS summary statistics) is stored on the server until the SNP2GENE process is done. Once the process is done, the input file is completely deleted. There is no backup of user-uploaded files.

### FUMA results

FUMA results are stored on the server; however, only logged-in users can access their own data. When users delete jobs, all results, including the job submission records, are completely removed from the server. It is therefore not possible to recover jobs that were removed accidentally. We recommend downloading results (text files) and images before deleting a job. Even though FUMA results are only accessible to the logged-in user, we strongly recommend removing jobs whose results need to be highly secured.