Last updated: 2022-12-01

Checks: 7 0

Knit directory: virilisProteomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20201210)

The command set.seed(20201210) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: f18ceb3

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f18ceb3. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  .Rapp.history
    Untracked:  TrialDataFirstLook.R
    Untracked:  alignment_dotplot_dark_AN.png
    Untracked:  code/.ipynb_checkpoints/
    Untracked:  code/find_seq.py
    Untracked:  comparisons/
    Untracked:  data/ABetal.2020/
    Untracked:  data/Big_table_virilis_group_with_divergence_geneflow_Leeban.csv
    Untracked:  data/Dvir_stringtie_transcripts_lengths.txt
    Untracked:  data/FlyBase_GO_STEP.txt
    Untracked:  data/Garlovsky_etal_Dmon_MaxQuant_ParkerID.csv
    Untracked:  data/Orthogroups.tsv
    Untracked:  data/PEAKS/
    Untracked:  data/ProteomeDiscoverer/
    Untracked:  data/RNASeq_data/
    Untracked:  data/SPN_ids.csv
    Untracked:  data/mRNA_abundances/
    Untracked:  data/melanogaster/
    Untracked:  data/orthology_matching/
    Untracked:  data/reciprocal_orths.csv
    Untracked:  data/signal_peptides/
    Untracked:  data/vir_annotation.rds
    Untracked:  data/virilis_Accession2FBgn_uniprot.csv
    Untracked:  data/virilis_male_AG_SFP_EB_genes.txt
    Untracked:  data/virilis_molecular_evolution_results/
    Untracked:  output/ClueGOlists/
    Untracked:  output/all_ejac_IDs.csv
    Untracked:  output/combined_database.csv
    Untracked:  output/combined_database_tpp.csv
    Untracked:  output/duplicated_proteins.csv
    Untracked:  output/ejac_IDs_no-orths.csv
    Untracked:  plots/

Unstaged changes:
    Deleted:    analysis/data-exploration.Rmd
    Deleted:    analysis/differential-abundance.Rmd
    Modified:   analysis/multi-database.Rmd
    Modified:   code/mRNA_integration_analysis.ipynb

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/TPP-analysis.Rmd) and HTML (docs/TPP-analysis.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	f18ceb3	MartinGarlovsky	2022-12-01	wflow_publish("analysis/TPP-analysis.Rmd")

Introduction

The following analyses are the results of a 16-plex TMT labelling proteomics experiment. Flies were flash frozen in liquid nitrogen within ~30 seconds after copulation ended and stored at -80ºC. Later, female flies were thawed and the lower reproductive tract dissected with fine forceps in a drop of 1X PBS, rinsed in a second drop, and then added to a pool (n = ) of XXX kept on ice. Samples were shipped on dry ice to the Cambridge Core Proteomics Facility, UK, for LC-MS/MS. Resulting .raw files were processed in Proteome Discoverer using each species’ proteome.

Load packages

library(tidyverse)
library(tidybayes)
library(ggrepel)
library(ComplexHeatmap)
library(ggalluvial)
library(cluster)
library(factoextra)
library(grid)
library(UpSetR)
library(RColorBrewer)
library(eulerr)
library(edgeR)
library(kableExtra)
#library(DT)

# colour palettes

# colourblind friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#CC79A7", "#D55E00", "#0072B2", "#CC79A7")

# viridis palettes
v.pal <- viridis::viridis(n = 3, direction = -1)
m.pal <- viridis::magma(n = 5, direction = -1)
c.pal <- viridis::inferno(n = 7)

# met brewer egypt palette
egypt.pal <- MetBrewer::met.brewer('Egypt')

# nice tables
my_data_table <- function(df){
  datatable(
    df, rownames = FALSE,
    autoHideNavigation = TRUE,
    extensions = c("Scroller",  "Buttons"),
    options = list(
      dom = 'Bfrtip',
      deferRender = TRUE,
      scrollX = TRUE, scrollY = 400,
      scrollCollapse = TRUE,
      buttons = 
        list('csv', list(
          extend = 'pdf',
          pageSize = 'A4',
          orientation = 'landscape',
          filename = 'Dpseudo_respiration')),
      pageLength = 50
    )
  )
}

Load data

Peptide abundance data exported from TransProteomicPipeline using each species search
Trinotate gene annotations for D. virilis and .gff
virilis group accessory gland, ejaculatory bulb and seminal fluid protein genes
List of D. melanogaster sperm protein orthologs from Wasborough et al. (2009), comprising 1108 proteins, combined with the DmSP1
List of putative D. melanogaster Sfps identified by Wigby et al. (2020). Phil. trans. B
Orthology table between D. melanogaster and D. virilis FBgns
Male accessory gland or ejaculatory bulb biased genes and putative SFPs from Ahmed-Braimah et al. (2017) G3
Female reproductive tract biased genes and postmating response genes from Ahmed-Braimah et al. (2021) MBE
Pairwise dN/dS results also from Ahmed-Braimah et al. (2021) MBE
Signal peptide predictions for each species from SignalP-5.0
Gene ontology (GO) information for all D. virilis genes from FlyBase.org
Orthofinder orthogroups results for reciprocal orthology between species for all proteins

Load peptide intensities and calculate protein abudnances

We load results from the TPP and filter peptides and proteins before calculating protein abundance across all peptides for a given protein.

# correct column names for intensities
new.names <- c('AM1', 'AM2', 'AM3', 'AV1', 'AV2', 'NM1', 'NM2', 'NM3', 'NV1', 'NV2', 'VM1', 'VM2', 'VM3', 'VV1', 'VV2', 'VV3')

# load TPP data and give proper names to abundance columns, filter out decoy peptides and those without sufficient evidence
tmt_ame <- read.delim('../TPP_aalysis/DameSB_db/OUTPUT/combined/TMT_quant.tsv', 
                      na.strings = "") %>% rename(protein = X.protein) %>% 
  drop_na(peptide) %>% 
  filter(!str_detect(protein, 'DECOY'), kept. == 'Yes') %>% 
  select(protein, peptide, contains("inten"))
colnames(tmt_ame)[3:18] <- new.names

tmt_nov <- read.delim('../TPP_aalysis/Dnov14_db/OUTPUT/combined/TMT_quant.tsv', 
                      na.strings = "") %>% rename(protein = X.protein) %>% 
  drop_na(peptide) %>% 
  filter(!str_detect(protein, 'DECOY'), kept. == 'Yes') %>% 
  select(protein, peptide, contains("inten"))
colnames(tmt_nov)[3:18] <- new.names

tmt_vir <- read.delim('../TPP_aalysis/Dvir87_db/OUTPUT/combined/TMT_quant.tsv', 
                      na.strings = "") %>% rename(protein = X.protein) %>% 
  drop_na(peptide) %>% 
  filter(!str_detect(protein, 'DECOY'), kept. == 'Yes') %>% 
  select(protein, peptide, contains("inten"))
colnames(tmt_vir)[3:18] <- new.names

# # extract peptide stats
# tmt_ame_stats <- read.delim('../TPP_aalysis/DameSB_db/OUTPUT/combined/TMT_quant.tsv', 
#                       na.strings = "") %>% 
#   distinct(protein, .keep_all = TRUE) %>% 
#   select(protein, protratio0:proterr15)
# 
# tmt_nov_stats <- read.delim('../TPP_aalysis/Dnov14_db/OUTPUT/combined/TMT_quant.tsv', 
#                       na.strings = "") %>% 
#   distinct(protein, .keep_all = TRUE) %>% 
#   select(protein, protratio0:proterr15)
# 
# tmt_vir_stats <- read.delim('../TPP_aalysis/Dvir87_db/OUTPUT/combined/TMT_quant.tsv', 
#                       na.strings = "") %>% 
#   distinct(protein, .keep_all = TRUE) %>% 
#   select(protein, protratio0:proterr15)

# load iProphet results
ame_ip <- read.delim('../TPP_aalysis/DameSB_db/OUTPUT/combined/iProph/xinteract.iProph_comb.prot.xls') %>% 
  mutate(peptide = gsub("\\[|\\]", "", 
                        gsub("[[:digit:]]+", "", 
                             gsub("n", "", peptide.sequence)))) %>% 
  filter(!str_detect(protein, 'DECOY'))

nov_ip <- read.delim('../TPP_aalysis/Dnov14_db/OUTPUT/combined/iProph/xinteract.iProph_comb.prot.xls') %>% 
  mutate(peptide = gsub("\\[|\\]", "", 
                        gsub("[[:digit:]]+", "", 
                             gsub("n", "", peptide.sequence)))) %>% 
  filter(!str_detect(protein, 'DECOY'))

vir_ip <- read.delim('../TPP_aalysis/Dvir87_db/OUTPUT/combined/iProph/xinteract.iProph_comb.prot.xls') %>% 
  mutate(peptide = gsub("\\[|\\]", "", 
                        gsub("[[:digit:]]+", "", 
                             gsub("n", "", peptide.sequence)))) %>% 
  filter(!str_detect(protein, 'DECOY'))


# add peptide stats and filter based on:
# adj peptide probability >= 0.95
# protein group.probability >= 0.95

tmt_ame_prot <- tmt_ame %>% 
  left_join(ame_ip, by = c("protein", "peptide"), na_matches = "never") %>% 
  filter(nsp.adjusted.probability >= 0.95, 
         group.probability >= 0.99,
         is.contributing.evidence == 'Y') %>% 
  select(protein, num.unique.peps, AM1:VV3) %>% 
  # calculate protein abundance as sum of associated peptide abundances
  group_by(protein) %>% summarise(UP = unique(num.unique.peps),
                                  across(2:17, sum, na.rm = TRUE))

tmt_nov_prot <- tmt_nov %>% 
  left_join(nov_ip, by = c("protein", "peptide"), na_matches = "never") %>% 
  filter(nsp.adjusted.probability >= 0.95, 
         group.probability >= 0.99,
         is.contributing.evidence == 'Y') %>% 
  select(protein, num.unique.peps, AM1:VV3) %>% 
  # calculate protein abundance as sum of associated peptide abundances
  group_by(protein) %>% summarise(UP = unique(num.unique.peps),
                                  across(2:17, sum, na.rm = TRUE))

tmt_vir_prot <- tmt_vir %>% 
  left_join(vir_ip, by = c("protein", "peptide"), na_matches = "never") %>% 
  filter(nsp.adjusted.probability >= 0.95, 
         group.probability >= 0.99,
         is.contributing.evidence == 'Y') %>% 
  select(protein, num.unique.peps, AM1:VV3) %>% 
  # calculate protein abundance as sum of associated peptide abundances
  group_by(protein) %>% summarise(UP = unique(num.unique.peps),
                                  across(2:17, sum, na.rm = TRUE))

# remove rows with all 0's
tmt_ame_prot <- tmt_ame_prot[-which(apply(tmt_ame_prot %>% select(3:18), 1, var) == 0), ]
tmt_nov_prot <- tmt_nov_prot[-which(apply(tmt_nov_prot %>% select(3:18), 1, var) == 0), ]
tmt_vir_prot <- tmt_vir_prot[-which(apply(tmt_vir_prot %>% select(3:18), 1, var) == 0), ]

# # numbers of proteins identified by 2 or more unique peptides
# tmt_ame_prot %>% filter(UP >= 2) %>% nrow()/ tmt_ame_prot %>% nrow()
# tmt_nov_prot %>% filter(UP >= 2) %>% nrow()/ tmt_nov_prot %>% nrow()
# tmt_vir_prot %>% filter(UP >= 2) %>% nrow()/ tmt_vir_prot %>% nrow()

# compare abundances between runs
bind_rows(
  tmt_ame_prot %>% 
    select(-protein, -UP) %>% 
    pivot_longer(1:16) %>% 
    mutate(species = str_sub(name, 1, 1),
           mating = str_sub(name, 2, 2),
           condition = str_sub(name, 1, 2),
           mapping = 'ame'),
  tmt_nov_prot %>% 
    select(-protein, -UP) %>% 
    pivot_longer(1:16) %>% 
    mutate(species = str_sub(name, 1, 1),
           mating = str_sub(name, 2, 2),
           condition = str_sub(name, 1, 2),
           mapping = 'nov'),
  tmt_vir_prot %>% 
    select(-protein, -UP) %>% 
    pivot_longer(1:16) %>% 
    mutate(species = str_sub(name, 1, 1),
           mating = str_sub(name, 2, 2),
           condition = str_sub(name, 1, 2),
           mapping = 'vir')) %>% 
  ggplot(aes(x = name, y = log2(value + 1))) +
  geom_boxplot(aes(fill = mapping), notch = TRUE, outlier.shape = 1) +
  scale_fill_manual(values = viridis::viridis(n = 3),
                    name = "Database:",
                    labels = c(expression(italic('D. ame')),
                               expression(italic('D. nov')),
                               expression(italic('D. vir')))) +
  labs(y = 'log2(Abundance + 1)') +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.title = element_blank(),
        axis.title.x = element_blank()) +
  NULL

Load Orthofinder results

# OrthoFinder orthogroups output
orthogroups <- read_tsv('../TPP_aalysis/orthofinder/Results_Sep13/Orthogroups/Orthogroups.tsv')

## omit NAs as we are only interseted in reciprocal 1:1:1 between all species
orthogroup_long <- na.omit(orthogroups) %>% 
  pivot_longer(cols = 2:4) %>% 
  separate_rows(value, sep = ", ") %>% 
  rename(protein = value, species = name)

# percentage of all proteins assigned to orthogroups
orthogroup_long %>%
  group_by(species) %>%
  count() %>%
  # add total numbers of proteins in each proteome
  #bind_cols(N = c(11958, 12729, 12792)) %>% 
  # or n_distinct orthogroups
  bind_cols(N = n_distinct(orthogroups$Orthogroup)) %>% 
  mutate(prop.orth = n/N)

# A tibble: 3 × 4
# Groups:   species [3]
  species     n     N prop.orth
  <chr>   <int> <int>     <dbl>
1 Dame     8281 10777     0.768
2 Dnov     7878 10777     0.731
3 Dvir     7956 10777     0.738

og_counts <- orthogroup_long %>% 
  group_by(species, Orthogroup) %>% 
  count(name = "N") %>% 
  mutate(orth_numb = as.numeric(str_sub(Orthogroup, 3))) %>% 
  ungroup()

# count numbers of proteins by orthogroup 
og_counts %>% 
  group_by(species, N) %>% 
  count() %>% #filter(N > 1) %>% 
  mutate(N2 = ifelse(N > 4, "5+", as.character(N)),
         n2 = ifelse(N == 1, n/7, n)) %>% 
  ggplot(aes(x = N2, y = n2, fill = species)) + 
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d()

Load signal peptides predictions

We submitted each species proteome to Phobius and ran SignalP-5.0 locally to identify predicted signal peptide sequences and then combined the results.

# Phobius results
phob_ame <- read.delim('../TPP_aalysis/DameSB_db/phobius.txt') %>% 
  dplyr::rename(protein = SEQENCE.ID)
phob_nov <- read.delim('../TPP_aalysis/Dnov14_db/phobius.txt') %>% 
  dplyr::rename(protein = SEQENCE.ID)
phob_vir <- read.delim('../TPP_aalysis/Dvir87_db/phobius.txt') %>% 
  dplyr::rename(protein = SEQENCE.ID)

# SigP/Trinotate results
ame_sigp <- read.csv('../TPP_aalysis/DameSB_db/Trinotate_ame.csv') %>% filter(SignalP != '.') %>% 
  rename(gene_id = X.gene_id) %>% distinct(gene_id) %>% pull(gene_id)

nov_sigp <- read.csv('../TPP_aalysis/Dnov14_db/Trinotate_nov.csv') %>% filter(SignalP != '.') %>% 
  rename(gene_id = X.gene_id) %>% distinct(gene_id) %>% pull(gene_id)

vir_sigp <- read.csv('../TPP_aalysis/Dvir87_db/Trinotate_vir.csv') %>% filter(SignalP != '.') %>% 
  rename(gene_id = X.gene_id) %>% distinct(gene_id) %>% pull(gene_id)

# #overlap between methods
# upset(fromList(list(
#   Phobius = phob_vir %>%
#     filter(SP == 'Y') %>% pull(protein),
#   SignalP = vir_sigp)))

# combine results
signal_peps_ame <- combine(phob_ame %>% filter(SP == 'Y') %>% pull(protein),
                           ame_sigp) %>% unique()

signal_peps_nov <- combine(phob_nov %>% filter(SP == 'Y') %>% pull(protein),
                           nov_sigp) %>% unique()

signal_peps_vir <- combine(phob_vir %>% filter(SP == 'Y') %>% pull(protein),
                           vir_sigp) %>% unique()

# add orthogroup ID
ame_sig <- data.frame(protein = signal_peps_ame) %>% 
  left_join(orthogroup_long %>% filter(species == "Dame"), by = "protein", na_matches = "never")

nov_sig <- data.frame(protein = signal_peps_nov) %>% 
  left_join(orthogroup_long %>% filter(species == "Dnov"), by = "protein", na_matches = "never")

vir_sig <- data.frame(protein = signal_peps_vir) %>% 
  left_join(orthogroup_long %>% filter(species == "Dvir"), by = "protein", na_matches = "never")

Load other data

vir_ids <- read.delim('data/virilis_molecular_evolution_results/gffread_annotated_transcripts.gene_trans_map', 
                      header = FALSE) %>% 
  dplyr::rename(FBgn = V1,
                FBtr = V2)

# virilis group AG/SFP/EB biased genes
Dvir_SFPs <- read.delim('data/virilis_male_AG_SFP_EB_genes.txt', header = FALSE) %>% 
  dplyr::rename(bias = V1,
                FBgn = V2)

# Dmel sperm proteome orthologs
sperm_mel <- inner_join(
  # Dmel Sperm proteome II (Wasbrough et al. 2010 J. Prot.) 
  read.csv('data/melanogaster/DmSPii_Supp.Table3.csv'), 
  # melanogaster orthologs
  read.table(file = "data/melanogaster/mel_orths.txt", header = T),
  by = c('Gene.Symbol' = 'mel_GeneSymbol')) %>% 
  select(FBgn_v = FBgn_ID, mel_FBgn_ID, Gene.Name, Gene.Symbol, mel_Arm) %>% 
  left_join(vir_ids, by = c('FBgn_v' = 'FBgn'), na_matches = "never") %>% 
  distinct(FBgn_v, .keep_all = TRUE)

# Dmel SFP orthologs
wigbySFP <- inner_join(
  # List of SFPs (Wigby et al. 2020 Phil. Trans. B.) 
  read.csv('data/melanogaster/dmel_SFPs_wigby_etal2020.csv') %>% 
  filter(category == 'highconf'), 
  # melanogaster orthologs
  read.table(file = "data/melanogaster/mel_orths.txt", header = T),
  by = c('FBgn' = 'mel_FBgn_ID'), na_matches = "never") %>% 
  select(FBgn = FBgn_ID, FBgn_mel = FBgn, Symbol) %>% 
  left_join(vir_ids, by = c('FBgn'), na_matches = "never") %>% 
  distinct(FBgn, .keep_all = TRUE)

# Ahmed-Braimah et al. 2020 MBE data
# virilis gene FBgns FRT biased genes
FRTbiased <- readxl::read_excel('data/ABetal.2020/File_S1.xlsx')
# virilis gene FBgns changing in expression after mating 
virilisPMDE <- readxl::read_excel('data/ABetal.2020/File_S3.xlsx')

# pairwise Ka.Ks values
kaks_results <- read.delim('data/virilis_molecular_evolution_results/KaKs.ALL.results.txt') %>%
  left_join(vir_ids, by = c('FBtr_ID' = 'FBtr'), na_matches = "never")

# FlyBase GO terms
flybase_GO <- read.delim('data/FlyBase_GO_STEP.txt') %>% 
  dplyr::rename(FBgn = X.SUBMITTED.ID) %>% 
  # merge Dmel orthologs and IDs
  left_join(read.table(file = "data/melanogaster/mel_orths.txt", header = T),
            by = c('FBgn' = 'FBgn_ID'), na_matches = "never") %>% 
  distinct(FBgn, .keep_all = TRUE)

# All genes with annotation including 'serin' from FlyBase
STEP <- flybase_GO %>% 
  filter(stringr::str_detect(GO_BIOLOGICAL_PROCESS, 'serin') | 
           stringr::str_detect(GO_MOLECULAR_FUNCTION, 'serin'))

get gene ids and orthologs

# Add D. virilis FBgns Ids
ame_fbgn <- read.csv('../TPP_aalysis/DameSB_db/Trinotate_ame.csv') %>% 
  mutate(FBtr = gsub('\\^.*', '', x = dvir.proteins.fasta_BLASTX)) %>% 
  select(gene_id = X.gene_id, FBtr) %>% 
  left_join(vir_ids %>% select(FBtr, FBgn), na_matches = "never") %>% 
  distinct(gene_id, FBgn, FBtr) %>% filter(FBtr != '.') %>% 
  left_join(orthogroup_long %>% filter(species == 'Dame') %>% 
              select(Orthogroup, protein),
            by = c('gene_id' = 'protein'), na_matches = "never")

nov_fbgn <- read.csv('../TPP_aalysis/Dnov14_db/Trinotate_nov.csv') %>% 
  mutate(FBtr = gsub('\\^.*', '', x = dvir.proteins.fasta_BLASTX)) %>% 
  select(gene_id = X.gene_id, FBtr) %>% 
  left_join(vir_ids %>% select(FBtr, FBgn), na_matches = "never") %>% 
  distinct(gene_id, FBgn, FBtr) %>% filter(FBtr != '.') %>% 
  left_join(orthogroup_long %>% filter(species == 'Dnov') %>% 
              select(Orthogroup, protein),
            by = c('gene_id' = 'protein'), na_matches = "never")

vir_fbgn <- read.csv('../TPP_aalysis/Dvir87_db/Trinotate_vir.csv') %>% 
  mutate(FBtr = gsub('\\^.*', '', x = dvir.proteins.fasta_BLASTX)) %>% 
  select(gene_id = X.gene_id, FBtr) %>% 
  left_join(vir_ids %>% select(FBtr, FBgn), na_matches = "never") %>% 
  distinct(gene_id, FBgn, FBtr) %>% filter(FBtr != '.') %>% 
  left_join(orthogroup_long %>% filter(species == 'Dvir') %>% 
              select(Orthogroup, protein),
            by = c('gene_id' = 'protein'), na_matches = "never")

Differential abundance analysis

Single species database analysis

Identifying ejaculate candidates

We performed differential abundance analysis between mated and virgin samples for each species, and between species, using each species database. We considered proteins as differentially abundant based on an adjusted p-value < 0.05 and logFC > |1|.

# filter 2 unique peptides and replace blanks (NA's) with 0's
ame_abund <- tmt_ame_prot %>% filter(UP >= 2) %>% mutate(across(3:18, ~replace_na(.x, 0)))
nov_abund <- tmt_nov_prot %>% filter(UP >= 2) %>% mutate(across(3:18, ~replace_na(.x, 0)))
vir_abund <- tmt_vir_prot %>% filter(UP >= 2) %>% mutate(across(3:18, ~replace_na(.x, 0)))

# get sample info - same for all db's
sampInfo = data.frame(condition = str_sub(colnames(ame_abund[-c(1, 2)]), 1, 2),
                      Replicate = str_sub(colnames(ame_abund[-c(1, 2)]), 3, 3))

# make design matrix to test diffs between groups
design <- model.matrix(~0 + sampInfo$condition)
colnames(design) <- unique(sampInfo$condition)
rownames(design) <- sampInfo$Replicate

# make contrasts - higher values = higher in mated
cont.matrix <- makeContrasts(M.a.V = AM - AV,
                             M.n.V = NM - NV,
                             M.v.V = VM - VV,
                             levels = design)

# create DGElist
dge_ame <- DGEList(counts = ame_abund[, -c(1, 2)], genes = ame_abund$protein, group = sampInfo$condition)
dge_nov <- DGEList(counts = nov_abund[, -c(1, 2)], genes = nov_abund$protein, group = sampInfo$condition)
dge_vir <- DGEList(counts = vir_abund[, -c(1, 2)], genes = vir_abund$protein, group = sampInfo$condition)

# ## Remove rows consistently have zero or very low counts
# keep_ame <- filterByExpr(dge_ame)
# keep_nov <- filterByExpr(dge_nov)
# keep_vir <- filterByExpr(dge_vir)

keep_ame <- rowSums(cpm(dge_ame) > 1) >= 8
keep_nov <- rowSums(cpm(dge_nov) > 1) >= 8
keep_vir <- rowSums(cpm(dge_vir) > 1) >= 8

dge_ame <- dge_ame[keep_ame, keep.lib.sizes = FALSE]
dge_nov <- dge_nov[keep_nov, keep.lib.sizes = FALSE]
dge_vir <- dge_vir[keep_vir, keep.lib.sizes = FALSE]

dge_ame <- calcNormFactors(dge_ame, method = 'TMM')
dge_nov <- calcNormFactors(dge_nov, method = 'TMM')
dge_vir <- calcNormFactors(dge_vir, method = 'TMM')

dge_ame <- estimateCommonDisp(dge_ame)
dge_nov <- estimateCommonDisp(dge_nov)
dge_vir <- estimateCommonDisp(dge_vir)

dge_ame <- estimateTagwiseDisp(dge_ame)
dge_nov <- estimateTagwiseDisp(dge_nov)
dge_vir <- estimateTagwiseDisp(dge_vir)

# voom normalisation
dge_ameV <- voom(dge_ame, design, plot = FALSE)
dge_novV <- voom(dge_nov, design, plot = FALSE)
dge_virV <- voom(dge_vir, design, plot = FALSE)

# fit linear model
lm_ame <- lmFit(dge_ameV, design = design)
lm_nov <- lmFit(dge_novV, design = design)
lm_vir <- lmFit(dge_virV, design = design)

Diagnostic plots

diag_plot <- function(dgelist = NA, design = NA) {
  
  par(mfrow = c(2,2))
  # Biological coefficient of variation
  plotBCV(dgelist)
  # mean-variance trend
  voomed = voom(dgelist, design, plot = TRUE)
  # QQ-plot
  g <- gof(glmFit(dgelist, design))
  z <- zscoreGamma(g$gof.statistics,shape = g$df/2,scale = 2)
  qqnorm(z); qqline(z, col = 4,lwd = 1,lty = 1)
  # log2 transformed and normalize boxplot of counts across samples
  boxplot(voomed$E, xlab = "", ylab = "log2(Abundance)", las = 2, main = "Voom transformed logCPM",
          col = c(rep(2:7, c(3, 2, 3, 2, 3, 3))))
  abline(h = median(voomed$E), col = "blue")
  par(mfrow = c(1,1))
  
}

ame.db

diag_plot(dge_ame, design)

nov.db

diag_plot(dge_nov, design)

vir.db

diag_plot(dge_vir, design)

compare DA between mated/virgin

# ame using each database
lm_ame2ame <- contrasts.fit(lm_ame, cont.matrix[,"M.a.V"])
lm_ame2nov <- contrasts.fit(lm_nov, cont.matrix[,"M.a.V"])
lm_ame2vir <- contrasts.fit(lm_vir, cont.matrix[,"M.a.V"])

lm_ame2ame <- eBayes(lm_ame2ame)
lm_ame2nov <- eBayes(lm_ame2nov)
lm_ame2vir <- eBayes(lm_ame2vir)

lm_ame2ame.tTags.table <- topTable(lm_ame2ame, adjust.method = "BH", number = Inf)
lm_ame2nov.tTags.table <- topTable(lm_ame2nov, adjust.method = "BH", number = Inf)
lm_ame2vir.tTags.table <- topTable(lm_ame2vir, adjust.method = "BH", number = Inf)

# nov using each database
lm_nov2ame <- contrasts.fit(lm_ame, cont.matrix[,"M.n.V"])
lm_nov2nov <- contrasts.fit(lm_nov, cont.matrix[,"M.n.V"])
lm_nov2vir <- contrasts.fit(lm_vir, cont.matrix[,"M.n.V"])

lm_nov2ame <- eBayes(lm_nov2ame)
lm_nov2nov <- eBayes(lm_nov2nov)
lm_nov2vir <- eBayes(lm_nov2vir)

lm_nov2ame.tTags.table <- topTable(lm_nov2ame, adjust.method = "BH", number = Inf)
lm_nov2nov.tTags.table <- topTable(lm_nov2nov, adjust.method = "BH", number = Inf)
lm_nov2vir.tTags.table <- topTable(lm_nov2vir, adjust.method = "BH", number = Inf)

# vir using each database
lm_vir2ame <- contrasts.fit(lm_ame, cont.matrix[,"M.v.V"])
lm_vir2nov <- contrasts.fit(lm_nov, cont.matrix[,"M.v.V"])
lm_vir2vir <- contrasts.fit(lm_vir, cont.matrix[,"M.v.V"])

lm_vir2ame <- eBayes(lm_vir2ame)
lm_vir2nov <- eBayes(lm_vir2nov)
lm_vir2vir <- eBayes(lm_vir2vir)

lm_vir2ame.tTags.table <- topTable(lm_vir2ame, adjust.method = "BH", number = Inf)
lm_vir2nov.tTags.table <- topTable(lm_vir2nov, adjust.method = "BH", number = Inf)
lm_vir2vir.tTags.table <- topTable(lm_vir2vir, adjust.method = "BH", number = Inf)

# combine results
ame_DATABLES <- rbind(lm_ame2ame.tTags.table %>% mutate(species = 'ame'),
                      lm_nov2ame.tTags.table %>% mutate(species = 'nov'), 
                      lm_vir2ame.tTags.table %>% mutate(species = 'vir')) %>% 
  mutate(DB = 'ame.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides reaching significance threshold
         sigP = case_when(genes %in% ame_sig$protein & threshold == 'DA' ~ 'sigP',
                          threshold == 'DA' ~ 'DA',
                          TRUE ~ 'NS'),
         # add variable splitting by bias to virgin vs. mated and signal peptide
         DA = case_when(genes %in% ame_sig$protein & adj.P.Val < 0.05 & logFC > 1 ~ "MBsec",
                        genes %in% ame_sig$protein & adj.P.Val < 0.05 & logFC < -1 ~ "FMsec",
                        adj.P.Val < 0.05 & logFC > 1 ~ "MB",
                        adj.P.Val < 0.05 & logFC < -1 ~ "FM", 
                        TRUE ~ 'NS')) %>% 
  # add orthologs
  left_join(orthogroup_long %>% filter(species == "Dame") %>% select(-species), by = c("genes" = "protein"), na_matches = "never")

nov_DATABLES <- rbind(lm_ame2nov.tTags.table %>% mutate(species = 'ame'),
                      lm_nov2nov.tTags.table %>% mutate(species = 'nov'), 
                      lm_vir2nov.tTags.table %>% mutate(species = 'vir')) %>% 
  mutate(DB = 'nov.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides reaching significance threshold
         sigP = case_when(genes %in% nov_sig$protein & threshold == 'DA' ~ 'sigP',
                          threshold == 'DA' ~ 'DA',
                          TRUE ~ 'NS'),
         # add variable splitting by bias to virgin vs. mated and signal peptide
         DA = case_when(genes %in% nov_sig$protein & adj.P.Val < 0.05 & logFC > 1 ~ "MBsec",
                        genes %in% nov_sig$protein & adj.P.Val < 0.05 & logFC < -1 ~ "FMsec",
                        adj.P.Val < 0.05 & logFC > 1 ~ "MB",
                        adj.P.Val < 0.05 & logFC < -1 ~ "FM", 
                        TRUE ~ 'NS')) %>% 
  # add orthologs
  left_join(orthogroup_long %>% filter(species == "Dnov") %>% select(-species), by = c("genes" = "protein"), na_matches = "never")

vir_DATABLES <- rbind(lm_ame2vir.tTags.table %>% mutate(species = 'ame'),
                      lm_nov2vir.tTags.table %>% mutate(species = 'nov'), 
                      lm_vir2vir.tTags.table %>% mutate(species = 'vir')) %>% 
  mutate(DB = 'vir.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides reaching significance threshold
         sigP = case_when(genes %in% vir_sig$protein & threshold == 'DA' ~ 'sigP',
                          threshold == 'DA' ~ 'DA',
                          TRUE ~ 'NS'),
         # add variable splitting by bias to virgin vs. mated and signal peptide
         DA = case_when(genes %in% vir_sig$protein & adj.P.Val < 0.05 & logFC > 1 ~ "MBsec",
                        genes %in% vir_sig$protein & adj.P.Val < 0.05 & logFC < -1 ~ "FMsec",
                        adj.P.Val < 0.05 & logFC > 1 ~ "MB",
                        adj.P.Val < 0.05 & logFC < -1 ~ "FM", 
                        TRUE ~ 'NS')) %>% 
  # add orthologs
  left_join(orthogroup_long %>% filter(species == "Dvir") %>% select(-species), by = c("genes" = "protein"), na_matches = "never")

Volcano plots

# plot all vs. all
bind_rows(ame_DATABLES,
          nov_DATABLES,
          vir_DATABLES) %>% 
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = DA)) +
  geom_point(alpha = .6) +
  scale_colour_manual(values = c('hotpink', 'red', 'blue', 'purple', 'grey40')) +
  labs(y = '-log10(p-value)') +
  facet_grid(species ~ DB) +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA)) +
  #ggsave('plots/database_comps/volcano_mated-virgin_all.vs.all.pdf', height = 17.8, width = 17.8, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

Plot species specific database

facet_names <- c(ame = "D. americana", 
                 nov = 'D. novamexicana', 
                 vir = "D. virilis")

lab_text <- data.frame(species = c('ame', 'nov', 'vir'),
                       P.Value = min(nov_DATABLES$P.Value), 
                       logFC = 20,
                       lab = c("D. americana", 'D. novamexicana', "D. virilis"),
                       DA = NA, 
                       threshold = NA)

# plot
bind_rows(ame_DATABLES %>% filter(species == 'ame'),
          nov_DATABLES %>% filter(species == 'nov'),
          vir_DATABLES %>% filter(species == 'vir')) %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value), colour = DA, shape = threshold)) +
  geom_point(alpha = .6) +
  scale_shape_manual(values = c(16, 1), guide = 'none') +
  scale_colour_manual(values = c(egypt.pal[4:1], "grey60")) +
  labs(y = '-log10(p-value)') +
  facet_wrap(~species, nrow = 1, labeller = as_labeller(facet_names)) +
  theme_bw() +
  theme(legend.position = '',#c(0.04, 0.8),
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA),
        strip.text = element_text(size = 15, face = "italic"),
        strip.background = element_blank(),
        strip.text.x = element_blank()) +
  geom_text(data = lab_text, colour = 'black', hjust = 1,
            aes(y = -log10(P.Value), label = paste0(lab)), size = 4, fontface = "italic") +
  #ggsave('plots/TPP_volcano_mated-virgin.pdf', height = 7, width = 19, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

Overlap in ‘Sfps’ between species

We compare the numbers of ejaculate candidates detected using each species database.

ame db

# subset only higher in mated
ame_cands <- ame_DATABLES %>% filter(str_detect(DA, pattern = 'MB'))

# # overlap for ejaculate candidates with orthologs using each database
# upset(fromList(list(ame = split(ame_cands, ame_cands$species)[[1]][[1]],
#                     nov = split(ame_cands, ame_cands$species)[[2]][[1]],
#                     vir = split(ame_cands, ame_cands$species)[[3]][[1]])))

plot(euler(c('ame' = 50, "nov" = 22, "vir" = 26, 
             'ame&nov' = 51, 'ame&vir' = 16, 'nov&vir' = 0, 
             'ame&nov&vir' = 92)),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

nov db

# subset only higher in mated
nov_cands <- nov_DATABLES %>% filter(str_detect(DA, pattern = 'MB'))

# # overlap for ejaculate candidates with orthologs using each database
# upset(fromList(list(ame = split(nov_cands, nov_cands$species)[[1]][[1]],
#                     nov = split(nov_cands, nov_cands$species)[[2]][[1]],
#                     vir = split(nov_cands, nov_cands$species)[[3]][[1]])))

plot(euler(c('ame' = 40, "nov" = 22, "vir" = 27, 
             'ame&nov' = 56, 'ame&vir' = 10, 'nov&vir' = 3, 
             'ame&nov&vir' = 104)),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

vir db

# subset only higher in mated
vir_cands <- vir_DATABLES %>% filter(str_detect(DA, pattern = 'MB'))

# # overlap for ejaculate candidates with orthologs using each database
# upset(fromList(list(ame = split(vir_cands, vir_cands$species)[[1]][[1]],
#                     nov = split(vir_cands, vir_cands$species)[[2]][[1]],
#                     vir = split(vir_cands, vir_cands$species)[[3]][[1]])))

plot(euler(c('ame' = 34, "nov" = 25, "vir" = 19, 
             'ame&nov' = 41, 'ame&vir' = 18, 'nov&vir' = 2, 
             'ame&nov&vir' = 91)),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

# combine ejaculate candidates 
ejac_cands <- bind_rows(ame_cands %>% left_join(ame_fbgn, by = c("genes" = "gene_id", "Orthogroup")), 
                        nov_cands %>% left_join(nov_fbgn, by = c("genes" = "gene_id", "Orthogroup")), 
                        vir_cands %>% left_join(vir_fbgn, by = c("genes" = "gene_id", "Orthogroup"))) %>% 
  distinct(genes, species, DB, Orthogroup, .keep_all = TRUE)

# # overlap for ejaculate candidates with orthologs using each database
# upset(fromList(list(ame.db = split(ejac_cands, ejac_cands$DB)[[1]][[13]],
#                     nov.db = split(ejac_cands, ejac_cands$DB)[[2]][[13]],
#                     vir.db = split(ejac_cands, ejac_cands$DB)[[3]][[13]])))

plot(venn(c('D. ame' = 39, 'D. nov' = 34, 'D. vir' = 26, 
            'D. ame&D. nov' = 20, 'D. ame&D. vir' = 12, 'D. nov&D. vir' = 36,
            'D. ame&D. nov&D. vir' = 118)),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

# overlap in ejaculate candidates between species
ejac_cands %>% 
  mutate(orth = if_else(is.na(Orthogroup) == TRUE, 'no', 'yes')) %>% 
  group_by(DB, orth) %>%
  summarise(N = n_distinct(genes)) %>% 
  pivot_wider(id_cols = DB, names_from = orth, values_from = N) %>% 
  mutate(total = yes + no,
         prop.orths = yes/total) %>% 
  kable(digits = 3, 
        caption = 'Numbers and proportions of ejaculate candidates in each species with or without orthologs') %>% 
  kable_styling(full_width = FALSE)

Numbers and proportions of ejaculate candidates in each species with or without orthologs
DB	no	yes	total	prop.orths
ame.db	60	197	257	0.767
nov.db	51	211	262	0.805
vir.db	35	195	230	0.848

# # write file of all distinct proteins with/out 1:1:1 orthologs
# ejac_cands %>% 
#   mutate(orth = if_else(is.na(Orthogroup) == TRUE, 'no', 'yes')) %>% 
#   #write_csv('output/ClueGOlists/all_ejac_IDs.csv')
#   filter(orth == "no") %>% 
#   distinct(genes, DB) %>% 
#   write_csv('output/ClueGOlists/ejac_IDs_no-orths.csv')

#number of ejaculate candidates using each species database including without orthologs
ejac_ids <- bind_rows(ame_DATABLES %>% filter(str_detect(DA, pattern = 'MB')),
                      nov_DATABLES %>% filter(str_detect(DA, pattern = 'MB')),
                      vir_DATABLES %>% filter(str_detect(DA, pattern = 'MB'))) %>% 
  mutate(OrthID = if_else(is.na(Orthogroup) == TRUE, paste0(genes, "-", species), Orthogroup)) %>% 
  distinct(genes, DB, .keep_all = TRUE) #%>% drop_na(Orthogroup)

# total number of unique ejaculate candidates identified
#n_distinct(ejac_ids$OrthID)

# upset(fromList(list(ame = split(ejac_ids, ejac_ids$DB)[[1]][[14]],
#                     nov = split(ejac_ids, ejac_ids$DB)[[2]][[14]],
#                     vir = split(ejac_ids, ejac_ids$DB)[[3]][[14]])))

#pdf('plots/sfp_overlap_all_db_noOrths.pdf', height = 4, width = 4)
plot(venn(c('D. ame' = 99, 'D. nov' = 85, 'D. vir' = 61, 
            'D. ame&D. nov' = 20, 'D. ame&D. vir' = 12, 'D. nov&D. vir' = 36,
            'D. ame&D. nov&D. vir' = 117)
          ),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

#dev.off()

# number and proportion secreted
bind_rows(ame_DATABLES %>% filter(str_detect(DA, pattern = 'MB')),
          nov_DATABLES %>% filter(str_detect(DA, pattern = 'MB')),
          vir_DATABLES %>% filter(str_detect(DA, pattern = 'MB'))) %>% 
  distinct(genes, DB, .keep_all = TRUE) %>% 
  group_by(DB, DA) %>% dplyr::count() %>% 
  pivot_wider(id_cols = DB, names_from = DA, values_from = n) %>% 
  mutate(prop.sec = MBsec/(MBsec + MB)) %>% 
  kable(digits = 3, 
        caption = 'The number of proteins higher in abundance in mated comapred to virgin samples with or without a secretion signal using each species database') %>% 
  kable_styling(full_width = FALSE)

The number of proteins higher in abundance in mated comapred to virgin samples with or without a secretion signal using each species database
DB	MB	MBsec	prop.sec
ame.db	161	96	0.374
nov.db	126	136	0.519
vir.db	103	127	0.552

# #write FBgns for ClueGO - ejaculate candidates
# ejac_cands %>% distinct(FBgn, .keep_all = TRUE) %>%
#   write_csv('output/ClueGOlists/TPP_all_ejac_FBgn.csv')

# # proteins higher abundance in FRT
# bind_rows(ame_DATABLES %>% filter(str_detect(DA, pattern = 'FM')) %>% 
#             left_join(ame_fbgn, by = c("genes" = "gene_id", "Orthogroup")),
#           nov_DATABLES %>% filter(str_detect(DA, pattern = 'FM')) %>% 
#             left_join(nov_fbgn, by = c("genes" = "gene_id", "Orthogroup")),
#           vir_DATABLES %>% filter(str_detect(DA, pattern = 'FM')) %>% 
#             left_join(vir_fbgn, by = c("genes" = "gene_id", "Orthogroup"))) %>% 
#   distinct(genes, species, DB, Orthogroup, .keep_all = TRUE) %>%
#   write_csv('output/ClueGOlists/TPP_virgin_biased_FBgn.csv')

Divergence between ejaculate candidates

Here we perform pairwise differential abundance analysis between species for mated samples only to identify differences in the male ejaculate between species. We subset the data to include only ‘ejaculate candidates’ - i.e. proteins significantly higher abundance in mated compared to virgin samples. We performed analyses using each species database.

# subset data to only include ejaculate candidates
ame_mated <- ame_abund %>% dplyr::select(protein, contains('M')) %>% 
  filter(protein %in% ejac_cands$genes[ejac_cands$DB == 'ame.db'])

nov_mated <- nov_abund %>% dplyr::select(protein, contains('M')) %>% 
  filter(protein %in% ejac_cands$genes[ejac_cands$DB == 'nov.db'])

vir_mated <- vir_abund %>% dplyr::select(protein, contains('M')) %>% 
  filter(protein %in% ejac_cands$genes[ejac_cands$DB == 'vir.db'])

# get sample info - same for all db's
sampInfo.m <- data.frame(condition = str_sub(colnames(ame_mated[2:10]), 1, 2),
                         Replicate = str_sub(colnames(ame_mated[2:10]), 3, 3))

# make design matrix to test diffs between groups
design.m <- model.matrix(~0 + sampInfo.m$condition)
colnames(design.m) <- unique(sampInfo.m$condition)
rownames(design.m) <- sampInfo.m$Replicate

# make contrasts - higher values = higher in mated
cont.mated <- makeContrasts(a.m.n = AM - NM,
                            a.m.v = AM - VM,
                            n.m.v = NM - VM,
                            levels = design.m)

# create DGElist and fit model 
dge_ame.m <- DGEList(counts = ame_mated[, 2:10], genes = ame_mated$protein, group = sampInfo.m$condition)
dge_nov.m <- DGEList(counts = nov_mated[, 2:10], genes = nov_mated$protein, group = sampInfo.m$condition)
dge_vir.m <- DGEList(counts = vir_mated[, 2:10], genes = vir_mated$protein, group = sampInfo.m$condition)

dge_ame.m <- calcNormFactors(dge_ame.m, method = 'TMM')
dge_nov.m <- calcNormFactors(dge_nov.m, method = 'TMM')
dge_vir.m <- calcNormFactors(dge_vir.m, method = 'TMM')

dge_ame.m <- estimateCommonDisp(dge_ame.m)
dge_nov.m <- estimateCommonDisp(dge_nov.m)
dge_vir.m <- estimateCommonDisp(dge_vir.m)

dge_ame.m <- estimateTagwiseDisp(dge_ame.m)
dge_nov.m <- estimateTagwiseDisp(dge_nov.m)
dge_vir.m <- estimateTagwiseDisp(dge_vir.m)

# voom normalisation
dge_ame.m <- voom(dge_ame.m, design.m, plot = FALSE)
dge_nov.m <- voom(dge_nov.m, design.m, plot = FALSE)
dge_vir.m <- voom(dge_vir.m, design.m, plot = FALSE)

# fit linear model
lm_ame.m <- lmFit(dge_ame.m, design = design.m)
lm_nov.m <- lmFit(dge_nov.m, design = design.m)
lm_vir.m <- lmFit(dge_vir.m, design = design.m)

# compare DA between mated samples

# ame using each database
mated_an2ame <- contrasts.fit(lm_ame.m, cont.mated[,"a.m.n"])
mated_an2nov <- contrasts.fit(lm_nov.m, cont.mated[,"a.m.n"])
mated_an2vir <- contrasts.fit(lm_vir.m, cont.mated[,"a.m.n"])

mated_an2ame <- eBayes(mated_an2ame)
mated_an2nov <- eBayes(mated_an2nov)
mated_an2vir <- eBayes(mated_an2vir)

mated_an2ame.tTags.table <- topTable(mated_an2ame, adjust.method = "BH", number = Inf)
mated_an2nov.tTags.table <- topTable(mated_an2nov, adjust.method = "BH", number = Inf)
mated_an2vir.tTags.table <- topTable(mated_an2vir, adjust.method = "BH", number = Inf)

# nov using each database
mated_av2ame <- contrasts.fit(lm_ame.m, cont.mated[,"a.m.v"])
mated_av2nov <- contrasts.fit(lm_nov.m, cont.mated[,"a.m.v"])
mated_av2vir <- contrasts.fit(lm_vir.m, cont.mated[,"a.m.v"])

mated_av2ame <- eBayes(mated_av2ame)
mated_av2nov <- eBayes(mated_av2nov)
mated_av2vir <- eBayes(mated_av2vir)

mated_av2ame.tTags.table <- topTable(mated_av2ame, adjust.method = "BH", number = Inf)
mated_av2nov.tTags.table <- topTable(mated_av2nov, adjust.method = "BH", number = Inf)
mated_av2vir.tTags.table <- topTable(mated_av2vir, adjust.method = "BH", number = Inf)

# vir using each database
mated_nv2ame <- contrasts.fit(lm_ame.m, cont.mated[,"n.m.v"])
mated_nv2nov <- contrasts.fit(lm_nov.m, cont.mated[,"n.m.v"])
mated_nv2vir <- contrasts.fit(lm_vir.m, cont.mated[,"n.m.v"])

mated_nv2ame <- eBayes(mated_nv2ame)
mated_nv2nov <- eBayes(mated_nv2nov)
mated_nv2vir <- eBayes(mated_nv2vir)

mated_nv2ame.tTags.table <- topTable(mated_nv2ame, adjust.method = "BH", number = Inf)
mated_nv2nov.tTags.table <- topTable(mated_nv2nov, adjust.method = "BH", number = Inf)
mated_nv2vir.tTags.table <- topTable(mated_nv2vir, adjust.method = "BH", number = Inf)

# combine results
ame_MATED <- rbind(mated_an2ame.tTags.table %>% mutate(comparison = 'ame.v.nov'),
                   mated_av2ame.tTags.table %>% mutate(comparison = 'ame.v.vir'), 
                   mated_nv2ame.tTags.table %>% mutate(comparison = 'nov.v.vir')) %>% 
  mutate(DB = 'ame.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides
         sigP = if_else(genes %in% ame_sig$protein, 'sigP', 'not'),
         DA = case_when(sigP == 'sigP' & threshold == 'DA' ~ 'sigP',
                        threshold == 'DA' ~ 'DA',
                        TRUE ~ 'NS'))

nov_MATED <- rbind(mated_an2nov.tTags.table %>% mutate(comparison = 'ame.v.nov'),
                   mated_av2nov.tTags.table %>% mutate(comparison = 'ame.v.vir'), 
                   mated_nv2nov.tTags.table %>% mutate(comparison = 'nov.v.vir')) %>% 
  mutate(DB = 'nov.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides
         sigP = if_else(genes %in% nov_sig$protein, 'sigP', 'not'),
         DA = case_when(sigP == 'sigP' & threshold == 'DA' ~ 'sigP',
                        threshold == 'DA' ~ 'DA',
                        TRUE ~ 'NS'))

vir_MATED <- rbind(mated_an2vir.tTags.table %>% mutate(comparison = 'ame.v.nov'),
                   mated_av2vir.tTags.table %>% mutate(comparison = 'ame.v.vir'), 
                   mated_nv2vir.tTags.table %>% mutate(comparison = 'nov.v.vir')) %>% 
  mutate(DB = 'vir.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides
         sigP = if_else(genes %in% vir_sig$protein, 'sigP', 'not'),
         DA = case_when(sigP == 'sigP' & threshold == 'DA' ~ 'sigP',
                        threshold == 'DA' ~ 'DA',
                        TRUE ~ 'NS'))

Compare results using each species database

Volcano plots all vs. all

bind_rows(ame_MATED,
          nov_MATED,
          vir_MATED) %>% 
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = DA)) +
  geom_point(alpha = .6) +
  scale_colour_manual(values = c('hotpink', 'grey40', 'red')) +
  labs(y = '-log10(p-value)') +
  facet_grid(comparison ~ DB) +
  theme_bw() +
  theme(#legend.position = '',
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA)) +
  NULL

Numbers of differentially abundant proteins more highly abundant in one species vs. the other

bind_rows(ame_MATED,
          nov_MATED,
          vir_MATED) %>% filter(threshold != 'NS') %>% 
  mutate(up = ifelse(logFC > 1, 'a', 'b')) %>% 
  group_by(comparison, DB, up) %>% dplyr::count() %>% 
  ggplot(aes(x = up, y = n, fill = up)) +
  geom_col() +
  facet_grid(comparison ~ DB)

We tested whether signal peptides are more likely to be differentially abundant between species than other proteins

bind_rows(ame_MATED,
          nov_MATED,
          vir_MATED) %>% 
  group_by(comparison, DB, threshold, sigP) %>% dplyr::count() %>% 
  ggplot(aes(x = sigP, y = n, fill = threshold)) +
  geom_col(position = 'fill') +
  facet_grid(comparison ~ DB)

# do Fisher's exact tests
fish_sig <- bind_rows(
  ame_MATED %>% 
    group_by(comparison) %>%
    do(fit = broom::tidy(fisher.test(as.matrix(xtabs(~ threshold + sigP, ., sparse = TRUE))))) %>%
    unnest(fit) %>% 
    mutate(db = 'ame.db'),
  nov_MATED %>% 
    group_by(comparison) %>%
    do(fit = broom::tidy(fisher.test(as.matrix(xtabs(~ threshold + sigP, ., sparse = TRUE))))) %>% 
    unnest(fit) %>% 
    mutate(db = 'nov.db'),
  vir_MATED %>% 
    group_by(comparison) %>%
    do(fit = broom::tidy(fisher.test(as.matrix(xtabs(~ threshold + sigP, ., sparse = TRUE))))) %>% 
    unnest(fit) %>% 
         mutate(db = 'vir.db')) 

fish_sig$FDR <- p.adjust(fish_sig$p.value, method = 'fdr') 

fish_sig %>% 
  mutate(p.val = ifelse(FDR < 0.001, '< 0.001', round(FDR, 3)),
         comparison = recode(comparison, 
                      ame.v.nov = "D. ame vs. D. nov", 
                      ame.v.vir = "D. ame vs. D. vir", 
                      nov.v.vir = 'D. nov vs. D. vir'),
         across(2:5, ~ round(.x, 2)),
         Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')')) %>% 
  dplyr::select(Comparison = comparison, Estimate, p.val) %>% 
  kable(digits = 3, 
        caption = 'Fisher\'s exact tests comparing the representation of signal peptides in differentially abundant proteins between species using each species database. P-values are corrected for multiple testing') %>% 
  kable_styling(full_width = FALSE) %>%
  group_rows("D. americana", 1, 3) %>%
  group_rows("D. novamexicana", 4, 6) %>%
  group_rows("D. virilis", 7, 9)

Fisher’s exact tests comparing the representation of signal peptides in differentially abundant proteins between species using each species database. P-values are corrected for multiple testing
Comparison	Estimate	p.val
D. americana
D. ame vs. D. nov	0.51 (0.24-1.07)	0.161
D. ame vs. D. vir	0.57 (0.33-0.98)	0.156
D. nov vs. D. vir	0.62 (0.36-1.06)	0.161
D. novamexicana
D. ame vs. D. nov	0.78 (0.38-1.59)	0.64
D. ame vs. D. vir	0.35 (0.2-0.61)	< 0.001
D. nov vs. D. vir	0.72 (0.43-1.21)	0.386
D. virilis
D. ame vs. D. nov	1.28 (0.6-2.74)	0.64
D. ame vs. D. vir	0.83 (0.45-1.52)	0.64
D. nov vs. D. vir	0.93 (0.52-1.67)	0.889

Divergence between (virgin) female reproductive tracts

Here we compare the abundance of proteins in virgin samples between species to investigate divergence in female reproductive tract proteins between species. First we filter out “ejaculate candidates” and only use virgin samples for making comparisons.

# already filtered 2 unique peptides
ame_virgin <- ame_abund %>% filter(!protein %in% ejac_cands$genes[ejac_cands$DB == 'ame.db']) %>%
  dplyr::select(protein, !contains('M'), -UP)
nov_virgin <- nov_abund %>% filter(!protein %in% ejac_cands$genes[ejac_cands$DB == 'nov.db']) %>%
  dplyr::select(protein, !contains('M'), -UP)
vir_virgin <- vir_abund %>% filter(!protein %in% ejac_cands$genes[ejac_cands$DB == 'vir.db']) %>%
  dplyr::select(protein, !contains('M'), -UP)

# get sample info - same for all db's
sampInfo.v = data.frame(condition = str_sub(colnames(ame_virgin[-1]), 1, 2),
                        Replicate = str_sub(colnames(ame_virgin[-1]), 3, 3))

# make design matrix to test diffs between groups
design.v <- model.matrix(~0 + sampInfo.v$condition)
colnames(design.v) <- unique(sampInfo.v$condition)
rownames(design.v) <- sampInfo.v$Replicate

# make contrasts - higher values = higher in mated
cont.virgin <- makeContrasts(a.v.n = AV - NV,
                             a.v.v = AV - VV,
                             n.v.v = NV - VV,
                             levels = design.v)

# create DGElist and fit model 
dge_ame.v <- DGEList(counts = ame_virgin[, -1], genes = ame_virgin$protein, group = sampInfo.v$condition)
dge_nov.v <- DGEList(counts = nov_virgin[, -1], genes = nov_virgin$protein, group = sampInfo.v$condition)
dge_vir.v <- DGEList(counts = vir_virgin[, -1], genes = vir_virgin$protein, group = sampInfo.v$condition)

dge_ame.v <- calcNormFactors(dge_ame.v, method = 'TMM')
dge_nov.v <- calcNormFactors(dge_nov.v, method = 'TMM')
dge_vir.v <- calcNormFactors(dge_vir.v, method = 'TMM')

dge_ame.v <- estimateCommonDisp(dge_ame.v)
dge_nov.v <- estimateCommonDisp(dge_nov.v)
dge_vir.v <- estimateCommonDisp(dge_vir.v)

dge_ame.v <- estimateTagwiseDisp(dge_ame.v)
dge_nov.v <- estimateTagwiseDisp(dge_nov.v)
dge_vir.v <- estimateTagwiseDisp(dge_vir.v)

# voom normalisation
dge_ame.v <- voom(dge_ame.v, design.v, plot = FALSE)
dge_nov.v <- voom(dge_nov.v, design.v, plot = FALSE)
dge_vir.v <- voom(dge_vir.v, design.v, plot = FALSE)

# fit linear model
lm_ame.v <- lmFit(dge_ame.v, design = design.v)
lm_nov.v <- lmFit(dge_nov.v, design = design.v)
lm_vir.v <- lmFit(dge_vir.v, design = design.v)

# compare DA between virgin samples

# ame using each database
virgin_an2ame <- contrasts.fit(lm_ame.v, cont.virgin[,"a.v.n"])
virgin_an2nov <- contrasts.fit(lm_nov.v, cont.virgin[,"a.v.n"])
virgin_an2vir <- contrasts.fit(lm_vir.v, cont.virgin[,"a.v.n"])

virgin_an2ame <- eBayes(virgin_an2ame)
virgin_an2nov <- eBayes(virgin_an2nov)
virgin_an2vir <- eBayes(virgin_an2vir)

virgin_an2ame.tTags.table <- topTable(virgin_an2ame, adjust.method = "BH", number = Inf)
virgin_an2nov.tTags.table <- topTable(virgin_an2nov, adjust.method = "BH", number = Inf)
virgin_an2vir.tTags.table <- topTable(virgin_an2vir, adjust.method = "BH", number = Inf)

# nov using each database
virgin_av2ame <- contrasts.fit(lm_ame.v, cont.virgin[,"a.v.v"])
virgin_av2nov <- contrasts.fit(lm_nov.v, cont.virgin[,"a.v.v"])
virgin_av2vir <- contrasts.fit(lm_vir.v, cont.virgin[,"a.v.v"])

virgin_av2ame <- eBayes(virgin_av2ame)
virgin_av2nov <- eBayes(virgin_av2nov)
virgin_av2vir <- eBayes(virgin_av2vir)

virgin_av2ame.tTags.table <- topTable(virgin_av2ame, adjust.method = "BH", number = Inf)
virgin_av2nov.tTags.table <- topTable(virgin_av2nov, adjust.method = "BH", number = Inf)
virgin_av2vir.tTags.table <- topTable(virgin_av2vir, adjust.method = "BH", number = Inf)

# vir using each database
virgin_nv2ame <- contrasts.fit(lm_ame.v, cont.virgin[,"n.v.v"])
virgin_nv2nov <- contrasts.fit(lm_nov.v, cont.virgin[,"n.v.v"])
virgin_nv2vir <- contrasts.fit(lm_vir.v, cont.virgin[,"n.v.v"])

virgin_nv2ame <- eBayes(virgin_nv2ame)
virgin_nv2nov <- eBayes(virgin_nv2nov)
virgin_nv2vir <- eBayes(virgin_nv2vir)

virgin_nv2ame.tTags.table <- topTable(virgin_nv2ame, adjust.method = "BH", number = Inf)
virgin_nv2nov.tTags.table <- topTable(virgin_nv2nov, adjust.method = "BH", number = Inf)
virgin_nv2vir.tTags.table <- topTable(virgin_nv2vir, adjust.method = "BH", number = Inf)

# combine results
ame_VIRGIN <- rbind(virgin_an2ame.tTags.table %>% mutate(comparison = 'ame.v.nov'),
                    virgin_av2ame.tTags.table %>% mutate(comparison = 'ame.v.vir'), 
                    virgin_nv2ame.tTags.table %>% mutate(comparison = 'nov.v.vir')) %>% 
  mutate(DB = 'ame.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides reaching significance threshold
         sigP = case_when(genes %in% ame_sig$protein & threshold == 'DA' ~ 'sigP',
                          threshold == 'DA' ~ 'DA',
                          TRUE ~ 'NS'))

nov_VIRGIN <- rbind(virgin_an2nov.tTags.table %>% mutate(comparison = 'ame.v.nov'),
                    virgin_av2nov.tTags.table %>% mutate(comparison = 'ame.v.vir'), 
                    virgin_nv2nov.tTags.table %>% mutate(comparison = 'nov.v.vir')) %>% 
  mutate(DB = 'nov.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides reaching significance threshold
         sigP = case_when(genes %in% nov_sig$protein & threshold == 'DA' ~ 'sigP',
                          threshold == 'DA' ~ 'DA',
                          TRUE ~ 'NS'))

vir_VIRGIN <- rbind(virgin_an2vir.tTags.table %>% mutate(comparison = 'ame.v.nov'),
                    virgin_av2vir.tTags.table %>% mutate(comparison = 'ame.v.vir'), 
                    virgin_nv2vir.tTags.table %>% mutate(comparison = 'nov.v.vir')) %>% 
  mutate(DB = 'vir.db',
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, 'DA', 'NS'),
         # add variable for signal peptides reaching significance threshold
         sigP = case_when(genes %in% vir_sig$protein & threshold == 'DA' ~ 'sigP',
                          threshold == 'DA' ~ 'DA',
                          TRUE ~ 'NS'))

# plot all vs. all
bind_rows(ame_VIRGIN,
          nov_VIRGIN,
          vir_VIRGIN) %>% 
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = sigP)) +
  geom_point(alpha = .6) +
  scale_colour_manual(values = c('hotpink', 'grey40', 'red')) +
  labs(y = '-log10(p-value)') +
  facet_grid(comparison ~ DB) +
  theme_bw() +
  theme(#legend.position = '',
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA)) +
  NULL

Multi-database analysis

Compiling the species specific database

First we add Orthogroup IDs to proteins identified using each species database. Then we combine the results in to a single dataframe. Finally, we add the list of manually curated ejaculate candidate orthologs which were not assigned to orthogroups. See here for details.

# add orthogroup ID to abundance data
ortho_ame <- tmt_ame_prot %>% 
  left_join(orthogroup_long %>% filter(species == 'Dame') %>% select(Orthogroup, protein), 
            by = "protein", na_matches = "never") %>% 
  rowid_to_column("ID") %>% 
  mutate(Orthogroup = if_else(is.na(Orthogroup) == TRUE, paste("AME", ID, sep = "_"),
                              Orthogroup))

ortho_nov <- tmt_nov_prot %>% 
  left_join(orthogroup_long %>% filter(species == 'Dnov') %>% select(Orthogroup, protein), 
            by = "protein", na_matches = "never") %>% 
  rowid_to_column("ID") %>% 
  mutate(Orthogroup = if_else(is.na(Orthogroup) == TRUE, paste("NOV", ID, sep = "_"),
                              Orthogroup))

ortho_vir <- tmt_vir_prot %>% 
  left_join(orthogroup_long %>% filter(species == 'Dvir') %>% select(Orthogroup, protein), 
            by = "protein", na_matches = "never") %>% 
  rowid_to_column("ID") %>% 
  mutate(Orthogroup = if_else(is.na(Orthogroup) == TRUE, paste("VIR", ID, sep = "_"),
                              Orthogroup))

# total numbers of proteins with orthologs compared to the proteome
bind_cols(N_orth = c(sum(!is.na(orthogroups$Dame)), 
                     sum(!is.na(orthogroups$Dnov)), 
                     sum(!is.na(orthogroups$Dvir))),
          N_prot = c(11958, 12729, 12792)) %>% 
  mutate(prop.orth = N_orth/N_prot)

# A tibble: 3 × 3
  N_orth N_prot prop.orth
   <int>  <dbl>     <dbl>
1   8651  11958     0.723
2   9922  12729     0.779
3   9813  12792     0.767

# total number of proteins with orthologs in our dataset compared to proteins detected with each database
bind_cols(N_orth = c(n_distinct(ortho_ame %>% filter(str_detect(Orthogroup, "OG")) %>% pull(Orthogroup)), 
                     n_distinct(ortho_nov %>% filter(str_detect(Orthogroup, "OG")) %>% pull(Orthogroup)), 
                     n_distinct(ortho_vir %>% filter(str_detect(Orthogroup, "OG")) %>% pull(Orthogroup))),
          N_prot = c(n_distinct(ortho_ame$protein),
                     n_distinct(ortho_nov$protein), 
                     n_distinct(ortho_vir$protein))) %>% 
  mutate(prop.orth = N_orth/N_prot)

# A tibble: 3 × 3
  N_orth N_prot prop.orth
   <int>  <int>     <dbl>
1   2329   3014     0.773
2   2604   3156     0.825
3   2583   3149     0.820

# # plot overlap
# upset(fromList(list(
#   Dame = na.omit(ortho_ame$Orthogroup),
#   Dnov = na.omit(ortho_nov$Orthogroup),
#   Dvir = na.omit(ortho_vir$Orthogroup))), order.by = "freq")

plot(euler(c('Dame' = 86, 'Dnov' = 86, 'Dvir' = 90, 
             'Dame&Dnov' = 98, 'Dame&Dvir' = 73, 'Dnov&Dvir' = 348,
             'Dame&Dnov&Dvir' = 2072)),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

# combine all orthologous data after removing proteins without orthologs
mdb <- ortho_ame %>% 
  inner_join(ortho_nov, by = 'Orthogroup', suffix = c('_ame', '_nov'), na_matches = "never") %>% 
  inner_join(ortho_vir, by = 'Orthogroup', na_matches = "never") %>% 
  relocate(Orthogroup, starts_with("protein"), starts_with("UP")) %>% 
  select(-starts_with("ID"))

colnames(mdb)[c(4, 7, 40:55)] <- paste0(colnames(mdb)[c(4, 7, 40:55)], '_vir')

mdb %>% dim

[1] 2621   55

mdb %>% map_dbl(n_distinct)

 Orthogroup protein_ame protein_nov protein_vir      UP_ame      UP_nov 
       2072        2160        2145        2141          79         105 
     UP_vir     AM1_ame     AM2_ame     AM3_ame     AV1_ame     AV2_ame 
        103        2160        2159        2159        2160        2159 
    NM1_ame     NM2_ame     NM3_ame     NV1_ame     NV2_ame     VM1_ame 
       2160        2160        2159        2160        2158        2159 
    VM2_ame     VM3_ame     VV1_ame     VV2_ame     VV3_ame     AM1_nov 
       2159        2153        2158        2154        2153        2145 
    AM2_nov     AM3_nov     AV1_nov     AV2_nov     NM1_nov     NM2_nov 
       2143        2144        2143        2144        2145        2144 
    NM3_nov     NV1_nov     NV2_nov     VM1_nov     VM2_nov     VM3_nov 
       2144        2145        2144        2144        2145        2142 
    VV1_nov     VV2_nov     VV3_nov     AM1_vir     AM2_vir     AM3_vir 
       2143        2138        2140        2138        2136        2136 
    AV1_vir     AV2_vir     NM1_vir     NM2_vir     NM3_vir     NV1_vir 
       2138        2138        2137        2138        2137        2141 
    NV2_vir     VM1_vir     VM2_vir     VM3_vir     VV1_vir     VV2_vir 
       2140        2141        2141        2139        2138        2137 
    VV3_vir 
       2137

#mdb %>% filter(duplicated(Orthogroup))
#mdb %>% filter(duplicated(protein_ame))

# some proteins are represented more than once due to more than 1:1:1 orthology as they belong to gene families, i.e. orthogroups span multiple genes
#mdb %>% filter(Orthogroup == "OG0000593")
#mdb %>% filter(Orthogroup == "OG0000593") %>% map_dbl(n_distinct)

# which proteins are duplicate members of Orthogroups
# Some proteins are duplicated when merging all databases as they belong to Orthgroups which contain more than one protein. Next we identify proteins which belong to gene families with each database
#ortho_dups <- unique(mdb[which(duplicated(mdb$Orthogroup) == TRUE), "Orthogroup"])
#n_distinct(ortho_dups$Orthogroup)


# # remove proteins with more than 1 to 1 orthology for seperate analysis
# recip_tpp <- mdb %>% 
#   filter(!Orthogroup %in% ortho_dups$Orthogroup)

# # is equivalent to:
# eqd <- mdb %>% 
#   filter(!protein_ame %in% mdb$protein_ame[which(duplicated(mdb$protein_ame) == TRUE)],
#          !protein_nov %in% mdb$protein_nov[which(duplicated(mdb$protein_nov) == TRUE)],
#          !protein_vir %in% mdb$protein_vir[which(duplicated(mdb$protein_vir) == TRUE)])

# dim(recip_tpp)
# colnames(recip_tpp)
# 
# recip_tpp %>% map_dbl(n_distinct)
# 
# # # write file for dryad
# # write_csv(recip_tpp, 'output/combined_database_tpp.csv')
# 
# recip_tpp %>% nrow / tmt_ame_prot %>% nrow()
# recip_tpp %>% nrow / tmt_nov_prot %>% nrow()
# recip_tpp %>% nrow / tmt_vir_prot %>% nrow()
 

# match the BLAST synteny results between species

a2n_orths <- read.delim("data/orthology_matching/Dame__v__Dnov_blastSyn.txt", header = FALSE) %>% 
  select(protein_ame = V1, protein_novID = V2) %>% 
  left_join(read.delim("data/orthology_matching/nov_gene_to_protein.txt", header = FALSE) %>% 
              select(protein_novID = V1, protein_nov = V2), by = "protein_novID", na_matches = "never") %>% 
  drop_na(protein_nov)

a2v_orths <- read.delim("data/orthology_matching/Dame__v__Dvir_blastSyn.txt", header = FALSE) %>% 
  select(protein_ame = V1, protein_virID = V2) %>% 
  left_join(read.delim("data/orthology_matching/vir_gene_to_protein.txt", header = FALSE) %>% 
              select(protein_virID = V1, protein_vir = V2), by = "protein_virID", na_matches = "never") %>% 
  drop_na(protein_vir)

n2a_orths <- read.delim("data/orthology_matching/Dnov__v__Dame_blastSyn.txt", header = FALSE) %>% 
  select(protein_nov = V1, protein_ameID = V2) %>% 
  mutate(protein_ame = str_replace(protein_ameID, "\\..\\.p.$", "")) %>% 
  filter(protein_ame != "")

n2v_orths <- read.delim("data/orthology_matching/Dnov__v__Dvir_blastSyn.txt", header = FALSE) %>% 
  select(protein_nov = V1, protein_virID = V2) %>% 
  left_join(read.delim("data/orthology_matching/vir_gene_to_protein.txt", header = FALSE) %>% 
              select(protein_virID = V1, protein_vir = V2), by = "protein_virID", na_matches = "never") %>% 
  drop_na(protein_vir)

v2a_orths <- read.delim("data/orthology_matching/Dvir__v__Dame_blastSyn.txt", header = FALSE) %>% 
  select(protein_vir = V1, protein_ameID = V2) %>% 
  mutate(protein_ame = str_replace(protein_ameID, "\\..\\.p.$", "")) %>% 
  filter(protein_ame != "")

v2n_orths <- read.delim("data/orthology_matching/Dvir__v__Dnov_blastSyn.txt", header = FALSE) %>% 
  select(protein_vir = V1, protein_novID = V2) %>% 
  left_join(read.delim("data/orthology_matching/nov_gene_to_protein.txt", header = FALSE) %>% 
              select(protein_novID = V1, protein_nov = V2), by = "protein_novID", na_matches = "never") %>% 
  drop_na(protein_nov)


# now stick the new nov ID on to the americana data, new ame on to virilis, etc.
sticky_an <- ortho_ame %>% rename(protein_ame = protein) %>% 
  left_join(a2n_orths, by = "protein_ame", na_matches = "never") %>% drop_na(protein_nov) %>% 
  left_join(ortho_nov, by = c("protein_nov" = "protein"), suffix = c('_ame', '_nov'), na_matches = "never")

sticky_av <- ortho_ame %>% rename(protein_ame = protein) %>% 
  left_join(a2v_orths, by = "protein_ame", na_matches = "never") %>% drop_na(protein_vir) %>% 
  left_join(ortho_vir, by = c("protein_vir" = "protein"), suffix = c('_ame', '_vir'), na_matches = "never")

sticky_na <- ortho_nov %>% rename(protein_nov = protein) %>% 
  left_join(n2a_orths, by = "protein_nov", na_matches = "never") %>% drop_na(protein_ame) %>% 
  left_join(ortho_ame, by = c("protein_ame" = "protein"), suffix = c('_nov', '_ame'), na_matches = "never")

sticky_nv <- ortho_nov %>% rename(protein_nov = protein) %>% 
  left_join(n2v_orths, by = "protein_nov", na_matches = "never") %>% drop_na(protein_vir) %>% 
  left_join(ortho_vir, by = c("protein_vir" = "protein"), suffix = c('_nov', '_vir'), na_matches = "never")

sticky_va <- ortho_vir %>% rename(protein_vir = protein) %>% 
  left_join(v2a_orths, by = "protein_vir", na_matches = "never") %>% drop_na(protein_ame) %>% 
  left_join(ortho_ame, by = c("protein_ame" = "protein"), suffix = c('_vir', '_ame'), na_matches = "never")

sticky_vn <- ortho_vir %>% rename(protein_vir = protein) %>% 
  left_join(v2n_orths, by = "protein_vir", na_matches = "never") %>% drop_na(protein_nov) %>% 
  left_join(ortho_nov, by = c("protein_nov" = "protein"), suffix = c('_vir', '_nov'), na_matches = "never")

# combine all the results and merge missing values across rows using 'fill'
ortho_dat <- bind_rows(sticky_an, sticky_av,
                       sticky_na, sticky_nv,
                       sticky_va, sticky_vn) %>% 
  relocate(starts_with("Orthogroup"), starts_with("protein"), starts_with("UP")) %>% 
  select(-contains("ID")) %>%
  group_by(protein_ame) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1) %>%
  group_by(protein_nov) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1) %>%
  group_by(protein_vir) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1) 

# combine orthogroup IDs where possible or make a new ID
ortho_dat <- ortho_dat %>% 
  bind_cols(Orthogroup = coalesce(coalesce(ortho_dat$Orthogroup_ame, 
                                           ortho_dat$Orthogroup_nov), ortho_dat$Orthogroup_vir)) %>% 
  rowid_to_column("ID") %>% 
  mutate(Orthogroup = if_else(grepl("OG", Orthogroup) == FALSE, paste("NEWG", ID, sep = "_"),
                              Orthogroup))

dim(ortho_dat)

[1] 63 59

ortho_dat %>% map_dbl(n_distinct)

            ID Orthogroup_ame Orthogroup_nov Orthogroup_vir    protein_ame 
            63             58             48             33             63 
   protein_nov    protein_vir         UP_ame         UP_nov         UP_vir 
            63             63             20             22             19 
       AM1_ame        AM2_ame        AM3_ame        AV1_ame        AV2_ame 
            58             58             58             58             58 
       NM1_ame        NM2_ame        NM3_ame        NV1_ame        NV2_ame 
            58             58             58             58             58 
       VM1_ame        VM2_ame        VM3_ame        VV1_ame        VV2_ame 
            58             58             58             58             58 
       VV3_ame        AM1_nov        AM2_nov        AM3_nov        AV1_nov 
            56             48             48             48             48 
       AV2_nov        NM1_nov        NM2_nov        NM3_nov        NV1_nov 
            48             48             48             48             48 
       NV2_nov        VM1_nov        VM2_nov        VM3_nov        VV1_nov 
            48             48             48             48             47 
       VV2_nov        VV3_nov        AM1_vir        AM2_vir        AM3_vir 
            47             47             33             33             33 
       AV1_vir        AV2_vir        NM1_vir        NM2_vir        NM3_vir 
            33             33             33             33             33 
       NV1_vir        NV2_vir        VM1_vir        VM2_vir        VM3_vir 
            33             33             33             33             33 
       VV1_vir        VV2_vir        VV3_vir     Orthogroup 
            33             33             33             63

# add the new data with the combined database
mdb2 <- bind_rows(mdb, ortho_dat %>% select(-starts_with("Orthogroup_"), -ID))
dim(mdb2)

[1] 2684   55

mdb2 %>% map_dbl(n_distinct)

 Orthogroup protein_ame protein_nov protein_vir      UP_ame      UP_nov 
       2132        2221        2197        2195          80         106 
     UP_vir     AM1_ame     AM2_ame     AM3_ame     AV1_ame     AV2_ame 
        105        2215        2214        2214        2215        2214 
    NM1_ame     NM2_ame     NM3_ame     NV1_ame     NV2_ame     VM1_ame 
       2215        2215        2214        2215        2213        2213 
    VM2_ame     VM3_ame     VV1_ame     VV2_ame     VV3_ame     AM1_nov 
       2213        2207        2212        2208        2206        2181 
    AM2_nov     AM3_nov     AV1_nov     AV2_nov     NM1_nov     NM2_nov 
       2179        2180        2179        2180        2181        2180 
    NM3_nov     NV1_nov     NV2_nov     VM1_nov     VM2_nov     VM3_nov 
       2180        2181        2180        2180        2181        2178 
    VV1_nov     VV2_nov     VV3_nov     AM1_vir     AM2_vir     AM3_vir 
       2177        2173        2175        2161        2159        2159 
    AV1_vir     AV2_vir     NM1_vir     NM2_vir     NM3_vir     NV1_vir 
       2161        2161        2160        2161        2160        2164 
    NV2_vir     VM1_vir     VM2_vir     VM3_vir     VV1_vir     VV2_vir 
       2163        2164        2164        2162        2161        2160 
    VV3_vir 
       2160

# proportion of each database represented by Orthogrous in the combined database
n_distinct(mdb2$Orthogroup) / c(n_distinct(ortho_ame$protein),
                                n_distinct(ortho_nov$protein), 
                                n_distinct(ortho_vir$protein))

[1] 0.7073656 0.6755387 0.6770403

# duplicated orthogroups
ortho_dups <- mdb2 %>% 
  filter(Orthogroup %in% unique(mdb2[which(duplicated(mdb2$Orthogroup) == TRUE), "Orthogroup"])$Orthogroup)

dim(ortho_dups)

[1] 669  55

n_distinct(ortho_dups$Orthogroup)

[1] 117

# number of single copy orthogroups
mdb2 %>% filter(!Orthogroup %in% ortho_dups$Orthogroup) %>% dim()

[1] 2015   55

mdb2 %>% filter(Orthogroup %in% ortho_dups$Orthogroup) %>% dim()

[1] 669  55

# count the numbers of proteins and numbers of proteins belonging to each orthogroup
mdb_counts <- mdb2 %>% 
  group_by(Orthogroup) %>% 
  summarise(ame = n_distinct(protein_ame),
            nov = n_distinct(protein_nov),
            vir = n_distinct(protein_vir)) %>% 
  pivot_longer(cols = 2:4)

# number of proteins belonging to an orthogroup - most proteins are single copy
mdb_counts %>% 
  group_by(name, value) %>% 
  count() %>% 
  mutate(N2 = ifelse(value > 6, "7+", as.character(value)),
         n2 = ifelse(value == 1, n/10, n)) %>% 
  #ggplot(aes(x = N2, y = n2, fill = species)) + 
  ggplot(aes(x = value, y = n2, fill = name)) + 
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d(name = "Species: ",
                       labels = c(expression(italic('D. ame')),
                                  expression(italic('D. nov')),
                                  expression(italic('D. vir')))) +
  labs(x = "Copy number", y = "Count") +
  theme_bw() +
  theme(legend.position = "bottom") +
  #ggsave('plots/TPP_results/orthogroup_counts.pdf', height = 4, width = 4, dpi = 600, useDingbats = FALSE) +
  NULL

sco_mdb <- mdb_counts %>% 
  group_by(Orthogroup, value) %>% 
  count(name = "N") %>% filter(N == 3)

mdb_counts %>% 
  #filter(!Orthogroup %in% sco_mdb$Orthogroup) %>% 
  filter(Orthogroup %in% ortho_dups$Orthogroup) %>% 
  group_by(Orthogroup, name, value) %>% 
  count(name = "N") %>% ungroup() %>% 
  mutate(value = factor(value)) %>% 
  ggplot(aes(x = name, stratum = value, alluvium = Orthogroup,
             y = N, label = value)) +
  #geom_flow(aes(fill = value)) +
  geom_alluvium(aes(fill = value)) +
  geom_stratum(aes(fill = value)) + 
  geom_text(stat = "stratum") +
  scale_fill_viridis_d(name = "Copy no.") +
  labs(x = "Species", y = "No. proteins") +
  theme_bw() +
  #ggsave('plots/TPP_results/alluvial_duplicates.pdf', height = 6.4, width = 4.8, dpi = 600, useDingbats = FALSE) +
  NULL

# ortho_dups %>% 
#   group_by(Orthogroup) %>% 
#   mutate(unique_ame = n_distinct(protein_ame),
#          unique_nov = n_distinct(protein_nov),
#          unique_vir = n_distinct(protein_vir)) %>% 
#   #select(Orthogroup, starts_with("unique")) %>% 
#   distinct(Orthogroup, unique_ame, unique_nov, unique_vir) %>% 
#   pivot_longer(cols = 2:4) %>% 
#   ggplot(aes(x = name, y = Orthogroup, fill = value)) + 
#   geom_tile() +
#   scale_fill_viridis_c()

Differential abundance analysis

We performed differential abundance analyses using the combined database with species-specific abundance data collated using each iterative run from the TPP.

# filter must have two unique peptides in at least 1 dataset
multiDB <- mdb2 %>% 
  filter(UP_ame >= 2 | UP_nov >= 2 | UP_vir >= 2) %>% 
  dplyr::select(protein_vir, Orthogroup,
                AM1_ame, AM2_ame, AM3_ame, AV1_ame, AV2_ame,
                NM1_nov, NM2_nov, NM3_nov, NV1_nov, NV2_nov,
                VM1_vir, VM2_vir, VM3_vir, VV1_vir, VV2_vir, VV3_vir) %>% 
  mutate(across(3:18, ~replace_na(.x, 0)))

colnames(multiDB)[3:18] <- gsub('_.*', '', x = colnames(multiDB)[3:18])

# make object for protein abundance data and replace NA's with 0's
expr_data <- multiDB[, -c(1:2)]

# get sample info
sampInfo = data.frame(species = str_sub(colnames(expr_data), 1, 1),
                      mating = str_sub(colnames(expr_data), 2, 2),
                      condition = str_sub(colnames(expr_data), 1, 2),
                      Replicate = str_sub(colnames(expr_data), -1))

# make design matrix to test diffs between groups
design <- model.matrix(~0 + sampInfo$condition)
colnames(design) <- unique(sampInfo$condition)
rownames(design) <- sampInfo$Replicate

# create DGElist and fit model 
dgeList <- DGEList(counts = expr_data, genes = multiDB$protein_vir, group = sampInfo$condition)
dgeList <- calcNormFactors(dgeList, method = 'TMM')
dgeList <- estimateCommonDisp(dgeList)
dgeList <- estimateTagwiseDisp(dgeList)

# make contrasts - higher values = higher in mated
cont.matrix <- makeContrasts(M.a.V = AM - AV,
                             M.n.V = NM - NV,
                             M.v.V = VM - VV,
                             levels = design)

# voom normalisation
dgeListV <- voom(dgeList, design, plot = FALSE)

# fit linear model
lm_fit <- lmFit(dgeListV, design = design)

Heatmap

All replicates

# make DF
sample_voomed <- data.frame(genes = dgeListV$genes,
                            dgeListV$E) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id'), na_matches = "never") %>% 
  # add mel Sfp ortholog
  left_join(wigbySFP %>% dplyr::select(FBgn, mel_Sfp = Symbol), na_matches = "never") %>% 
  # add annotations
  mutate(Ejac = ifelse(genes %in% vir_cands$genes, 'Ejac', NA),
         sperm = ifelse(FBgn %in% sperm_mel$FBgn_v, 'sperm', NA),
         sigP = ifelse(genes %in% vir_sig$protein, 'sig', NA),
         #mel_Sfp = ifelse(FBgn %in% wigbySFP$FBgn, 'Sfp', NA),
         category = case_when(FBgn %in% sperm_mel$FBgn_v ~ 'sperm',
                              FBgn %in% wigbySFP$FBgn ~ 'Sfp',
                              genes %in% vir_sig$protein ~ 'sigP',
                              TRUE ~ 'other')) %>% 
  distinct(genes, Orthogroup, .keep_all = TRUE) %>% 
  mutate(dupl = ifelse(Orthogroup %in% ortho_dups$Orthogroup, "Dup", "not"))

samp_vm <- sample_voomed %>% dplyr::select(genes, FBgn, 2:17, mel_Sfp, category)

#row.names(samp_vm) <- sample_voomed$genes

vm_scaled <- as.matrix(pheatmap:::scale_rows(sample_voomed[, 2:17]))
colnames(vm_scaled) <- colnames(sample_voomed[, 2:17])

# heatmap annotations
labs1 <- rowAnnotation(cat_lab = sample_voomed$category,
                       col = list(cat_lab = c(sperm = v.pal[2],
                                              Sfp = v.pal[3], 
                                              sigP = v.pal[1],
                                              other = NA)),
                       Sfp_labs = anno_mark(at = c(grep('Sfp', x = sample_voomed$category)), 
                                            labels = sample_voomed$mel_Sfp[sample_voomed$category == 'Sfp'],
                                            labels_gp = gpar(fontsize = 5)),
                       title = NULL,
                       show_annotation_name = FALSE)

dupl_lab <- rowAnnotation(Dupl = sample_voomed$dupl,
                          col = list(Dupl = c(Dup = "red", not = NA)),
                          title = NULL,
                          show_annotation_name = FALSE)

vmha <- HeatmapAnnotation(species = str_sub(colnames(samp_vm[, 3:18]), 1, 1), 
                          mating = str_sub(colnames(samp_vm[, 3:18]), 2, 2),
                          col = list(species = c('V' = v.pal[1], 
                                                 'N' = v.pal[2], 
                                                 'A' = v.pal[3]),
                                     mating = c('M' = viridis::magma(n = 4)[3], 
                                                'V' = viridis::magma(n = 4)[2])))

#pdf('plots/TPP_results/TPP_all_compheatmap.pdf', height = 8, width = 5)
Heatmap(vm_scaled, 
        #col = viridis::inferno(100),
        heatmap_legend_param = list(title = "log2(intensities)",
                                    title_position = "leftcenter-rot"),
        left_annotation = dupl_lab,
        right_annotation = labs1,
        top_annotation = vmha, 
        show_row_names = FALSE, 
        show_column_names = FALSE, 
        column_split = 3,
        column_gap = unit(0, "mm"),
        row_title = NULL,
        column_title = NULL)

#dev.off()

Sample medians

# make DF
sample_medians <- data.frame(genes = dgeListV$genes,
                             dgeListV$E) %>% 
  rowwise() %>% 
  mutate(AM = median(c(!!! rlang::syms(grep('AM', names(.), value = TRUE)))),
         NM = median(c(!!! rlang::syms(grep('NM', names(.), value = TRUE)))),
         VM = median(c(!!! rlang::syms(grep('VM', names(.), value = TRUE)))),
         AV = median(c(!!! rlang::syms(grep('AV', names(.), value = TRUE)))),
         NV = median(c(!!! rlang::syms(grep('NV', names(.), value = TRUE)))),
         VV = median(c(!!! rlang::syms(grep('VV', names(.), value = TRUE))))) %>% 
  dplyr::select(genes, AM, NM, VM, AV, NV, VV) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  # add mel Sfp ortholog
  left_join(wigbySFP %>% dplyr::select(FBgn, mel_Sfp = Symbol)) %>% 
  # add annotations
  mutate(Ejac = ifelse(genes %in% vir_cands$genes, 'Ejac', NA),
         sperm = ifelse(FBgn %in% sperm_mel$FBgn_v, 'sperm', NA),
         sigP = ifelse(genes %in% vir_sig$protein, 'sig', NA),
         #mel_Sfp = ifelse(FBgn %in% wigbySFP$FBgn, 'Sfp', NA),
         category = case_when(FBgn %in% sperm_mel$FBgn_v ~ 'sperm',
                              FBgn %in% wigbySFP$FBgn ~ 'Sfp',
                              genes %in% vir_sig$protein ~ 'sigP',
                              TRUE ~ 'other')) %>% 
  distinct(genes, Orthogroup, .keep_all = TRUE)

samp_hm <- sample_medians %>% dplyr::select(genes, FBgn, 2:7, mel_Sfp, category)

#row.names(samp_hm) <- sample_medians$genes

mat_scaled <- as.matrix(pheatmap:::scale_rows(sample_medians[, 2:7]))
colnames(mat_scaled) <- colnames(sample_medians[, 2:7])

labs2 <- rowAnnotation(cat_lab = sample_medians$category,
                       col = list(cat_lab = c(sperm = v.pal[2],
                                              Sfp = v.pal[3], 
                                              sigP = v.pal[1],
                                              other = NA)),
                       Sfp_labs = anno_mark(at = c(grep('Sfp', x = sample_medians$category)), 
                                            labels = sample_medians$mel_Sfp[sample_medians$category == 'Sfp'],
                                            labels_gp = gpar(fontsize = 5)),
                       title = NULL,
                       show_annotation_name = FALSE)

mdha <- HeatmapAnnotation(species = str_sub(colnames(samp_hm[, 3:8]), 1, 1), 
                          mating = str_sub(colnames(samp_hm[, 3:8]), 2, 2),
                          col = list(species = c('V' = v.pal[1], 
                                                 'N' = v.pal[2], 
                                                 'A' = v.pal[3]),
                                     mating = c('M' = viridis::magma(n = 4)[3], 
                                                'V' = viridis::magma(n = 4)[2])))

#pdf('plots/TPP_results/TPP_all_hm-md.pdf', height = 8, width = 5)
Heatmap(mat_scaled, 
        col = viridis::inferno(50),
        heatmap_legend_param = list(title = "log2(intensities)",
                                    title_position = "leftcenter-rot"),
        right_annotation = labs2,
        top_annotation = mdha, 
        show_row_names = FALSE, 
        show_column_names = FALSE, 
        column_split = 3,
        column_gap = unit(0, "mm"),
        row_title = NULL,
        column_title = NULL)

#dev.off()

Diagnostic plots

par(mfrow = c(2,2))
# Biological coefficient of variation
plotBCV(dgeList)
# mean-variance trend
voomed = voom(dgeList, design, plot = TRUE)
# QQ-plot
g <- gof(glmFit(dgeList, design))
z <- zscoreGamma(g$gof.statistics,shape = g$df/2,scale = 2)
qqnorm(z); qqline(z, col = 4,lwd = 1, lty = 1)
# log2 transformed and normalize boxplot of counts across samples
boxplot(voomed$E, xlab = "", ylab = "log2(Abundance)", las = 2 , main = "Voom transformed logCPM",
        col = c(rep(brewer.pal(n = 6, name = 'Spectral'), each = 3)[-c(6,12)]))
abline(h = median(voomed$E), col = "blue")

par(mfrow = c(1,1))

Principal component analysis

# pca <- prcomp(t(dgeListV$E %>% as_tibble()), center = TRUE, scale. = TRUE)
# # broom::tidy(pca, 'pcs') %>% 
# #   mutate(percent = percent * 100, 
# #          cumulative = cumulative * 100)
# 
# PCA_dat <- as.data.frame(pca$x)[, 1:3] %>%
#   rownames_to_column() %>%
#   mutate(species = str_sub(rowname, 1, 1),
#          mating = str_sub(rowname, 2, 2),
#          condition = str_sub(rowname, 1, 2))

# only most variable proteins
most_var <- dgeListV$E %>% as_tibble() %>% 
  # exclude duplicated orthogroups
  slice(-which(duplicated(multiDB$Orthogroup) == TRUE)) %>% 
  mutate(vars = apply(., 1, var)) %>% 
  slice_max(vars, n = 500) %>% 
  select(-vars)

pca.mostvar <- prcomp(t(as.matrix(most_var)), center = TRUE, scale. = TRUE)
# broom::tidy(pca.mostvar, 'pcs') %>% 
#   mutate(percent = percent * 100, 
#          cumulative = cumulative * 100)

PCA_mostvar <- as.data.frame(pca.mostvar$x)[, 1:3] %>%
  rownames_to_column() %>%
  mutate(species = str_sub(rowname, 1, 1),
         mating = str_sub(rowname, 2, 2),
         condition = str_sub(rowname, 1, 2))

# # Plot for figure
# PCA_mostvar %>%
#   ggplot(aes(x = PC1, y = PC2, colour = species, shape = mating)) +
#   geom_point(size = 8, alpha = .7) +
#   labs(x = paste0('PC1 (', (100*round(summary(pca.mostvar)$importance[2, 1], 3)), '%)'),
#        y = paste0('PC2 (', (100*round(summary(pca.mostvar)$importance[2, 2], 3)), '%)')) +
#   scale_colour_viridis_d(labels = c(expression(italic('D. ame')),
#                                     expression(italic('D. nov')),
#                                     expression(italic('D. vir')))) +
#   scale_shape_manual(values = c(16, 18), labels = c('Mated', 'Virgin')) +
#   theme_bw() +
#   theme(legend.title = element_blank(),
#         legend.text.align = 0,
#         legend.text = element_text(size = 12),
#         legend.background = element_blank(),
#         axis.text = element_text(size = 10),
#         axis.title = element_text(size = 12)) +
#   #ggsave('plots/TPP_results/TPP_PCA_12.pdf', height = 3.4, width = 4.8, dpi = 600, useDingbats = FALSE) +
#   NULL

rbind(as.matrix(PCA_mostvar[, c(2, 3)]),
      as.matrix(PCA_mostvar[, c(2, 4)]),
      as.matrix(PCA_mostvar[, c(3, 4)])) %>% 
  bind_cols(species = rep(PCA_mostvar$species, 3),
            mating = rep(PCA_mostvar$mating, 3),
            pc = rep(c(paste0('PC1 (',100*round(summary(pca.mostvar)$importance[2, 1], 3), '%) vs. PC2 (',
                              100*round(summary(pca.mostvar)$importance[2, 2], 3), '%)'), 
                       paste0('PC1 (',100*round(summary(pca.mostvar)$importance[2, 1], 3), '%) vs. PC3 (',
                              100*round(summary(pca.mostvar)$importance[2, 3], 3), '%)'), 
                       paste0('PC2 (',100*round(summary(pca.mostvar)$importance[2, 2], 3), '%) vs. PC3 (',
                              100*round(summary(pca.mostvar)$importance[2, 3], 3), '%)')), 
                     each = 16)) %>% 
  ggplot(aes(x = PC1, y = PC2, colour = species, shape = mating, alpha = .5)) +
  geom_point(size = 8, alpha = .7) +
  scale_colour_viridis_d(labels = c(expression(italic('D. ame')),
                                    expression(italic('D. nov')),
                                    expression(italic('D. vir')))) +
  scale_shape_manual(values = c(16, 18), labels = c('Mated', 'Virgin')) + 
  facet_wrap(~pc) +
  theme_bw() +
  theme(legend.title = element_blank(),
        legend.text.align = 0,
        legend.text = element_text(size = 12),
        legend.background = element_blank(),
        axis.text = element_text(size = 10),
        axis.title = element_blank(),
        strip.text = element_text(size = 15)) +
  NULL

Identifying ejaculate candidates

We compared abundances between mated and virgin samples for each species.

# perform lmFit tests
# ame 
lm_ame <- contrasts.fit(lm_fit, cont.matrix[,"M.a.V"])
lm_ame <- eBayes(lm_ame)
lm_ame.tTags.table <- topTable(lm_ame, adjust.method = "BH", number = Inf) %>% 
  mutate(species = "ame",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

# nov
lm_nov <- contrasts.fit(lm_fit, cont.matrix[,"M.n.V"])
lm_nov <- eBayes(lm_nov)
lm_nov.tTags.table <- topTable(lm_nov, adjust.method = "BH", number = Inf) %>% 
  mutate(species = "nov",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

# vir
lm_vir <- contrasts.fit(lm_fit, cont.matrix[,"M.v.V"])
lm_vir <- eBayes(lm_vir)
lm_vir.tTags.table <- topTable(lm_vir, adjust.method = "BH", number = Inf) %>% 
  mutate(species = "vir",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

comb_TABLES <- rbind(lm_ame.tTags.table,
                     lm_nov.tTags.table, 
                     lm_vir.tTags.table) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id'), na_matches = "never") %>% 
  mutate(
    # add variable for signal peptides reaching significance threshold
    sigP = if_else(genes %in% vir_sig$protein, 'sigP', 'not'),
    # add variable splitting by bias to virgin vs. mated and signal peptide
    sperm = if_else(FBgn %in% sperm_mel$FBgn_v, 'Sperm', 'not'),
    # add variable splitting by bias to virgin vs. mated and signal peptide
    DA = case_when(threshold == 'SD' & logFC > 1 & sigP == 'sigP' ~ "MBsec",
                   threshold == 'SD' & logFC < -1 & sigP == 'sigP' ~ "FMsec",
                   threshold == 'SD' & logFC > 1 ~ "MB",
                   threshold == 'SD' & logFC < -1 ~ "FM",
                   TRUE ~ 'NS')) %>% 
  distinct(genes, species, .keep_all = TRUE)

# remove duplicated genes
comb_TABLES_nd <- comb_TABLES %>% filter(!Orthogroup %in% ortho_dups$Orthogroup)

# save differentially abundant proteins to table for ClueGO
ejac_cands_multiDB <- comb_TABLES %>% filter(adj.P.Val < 0.05 & logFC > 1) %>%
  distinct(genes, .keep_all = TRUE)

#ejac_cands_multiDB %>% map_dbl(n_distinct)

# ejac_cands_multiDB %>% write_csv('output/ClueGOlists/TPP_ejac_cands_multiDB_mel.csv')

# number of Dmel Sperm proteins id'd
ejac_cands_multiDB %>%
  left_join(sperm_mel, by = c("FBgn" = "FBgn_v")) %>%
  distinct(FBgn, .keep_all = TRUE) %>% drop_na(mel_FBgn_ID) %>% 
  select(FBgn, mel_FBgn_ID, Gene.Symbol) %>% 
  kable(caption = 'D. melanogaster sperm protein orthologs') %>% 
  kable_styling(full_width = FALSE)

D. melanogaster sperm protein orthologs
FBgn	mel_FBgn_ID	Gene.Symbol
FBgn0210936	FBgn0038395	CG10407
FBgn0203892	FBgn0086907	Cyt-c-d
FBgn0210972	FBgn0051872	CG31872
FBgn0204803	FBgn0260453	CG17140
FBgn0211422	FBgn0039257	tnc
FBgn0205934	FBgn0031171	CG1801
FBgn0204735	FBgn0031869	CG18304
FBgn0205696	FBgn0030362	regucalcin
FBgn0198514	FBgn0015278	Pi3K68D
FBgn0197452	FBgn0013812	Dhc93AB
FBgn0202027	FBgn0030482	CG1673
FBgn0203308	FBgn0027560	Tps1
FBgn0201010	FBgn0266136	Gyc76C
FBgn0199271	FBgn0051644	CG31644
FBgn0199705	FBgn0046704	Liprin-alpha
FBgn0204802	FBgn0069354	Porin2
FBgn0210522	FBgn0037891	CG5214
FBgn0204328	FBgn0004507	GlyP
FBgn0202308	FBgn0001187	Hex-C
FBgn0016474	FBgn0004045	Yp1

# number of Dmel SFPs id'd
ejac_cands_multiDB %>% left_join(wigbySFP, by = "FBgn") %>% drop_na(FBgn_mel) %>% 
  distinct(FBgn, FBgn_mel, Symbol) %>% 
  kable(caption = 'D. melanogaster SFP orthologs') %>% 
  kable_styling(full_width = FALSE)

D. melanogaster SFP orthologs
FBgn	FBgn_mel	Symbol
FBgn0210521	FBgn0037889	scpr-A
FBgn0211573	FBgn0039568	CG4815
FBgn0211755	FBgn0037889	scpr-A
FBgn0206811	FBgn0034417	CG15117
FBgn0208251	FBgn0054002	CG34002
FBgn0211681	FBgn0260766	CG42564
FBgn0197362	FBgn0000592	Est-6
FBgn0211506	FBgn0051418	CG31418
FBgn0199249	FBgn0031741	CG11034
FBgn0208385	FBgn0034753	CG2852
FBgn0211069	FBgn0037889	scpr-A
FBgn0210455	FBgn0083965	CG34129
FBgn0198354	FBgn0038014	CG10041
FBgn0211265	FBgn0039597	CG9997
FBgn0211705	FBgn0038198	Npc2b
FBgn0204528	FBgn0250847	CG14034
FBgn0207587	FBgn0011695	EbpIII
FBgn0013080	FBgn0001112	Gld
FBgn0210972	FBgn0265264	CG17097
FBgn0207308	FBgn0033777	CG17574
FBgn0206035	FBgn0041629	Hexo2
FBgn0205696	FBgn0030362	regucalcin
FBgn0204606	FBgn0037650	CG11977
FBgn0211264	FBgn0039598	aqrs
FBgn0198486	FBgn0052190	NUCB1
FBgn0211457	FBgn0058160	CG40160
FBgn0207519	FBgn0050104	NT5E-2

# SEX PEPTIDE NETWORK
#SPN <- read.csv("data/SPN_ids.csv")

Volcano plot

comb_TABLES_nd %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(data = . %>% filter(threshold == 'NS'), alpha = .25, colour = 'grey') +
  geom_point(data = . %>% filter(threshold != 'NS'), aes(colour = DA), alpha = .5) + 
  #geom_point(alpha = .6) +
  #scale_colour_manual(values = cbPalette[5:1]) +
  scale_colour_manual(values = c(egypt.pal[4:1])) +
  labs(y = '-log10(p-value)') +
  facet_wrap(~species) +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA),
        strip.text = element_text(size = 15, face = "italic"),
        strip.background = element_blank(),
        strip.text.x = element_blank()) +
  # geom_point(
  #   data = comb_TABLES_nd %>% filter(threshold != 'NS') %>% inner_join(wigbySFP, by = 'FBgn'),
  #   colour = 'green') +
  #too many to plot all Sfps
  # geom_label_repel(
  #   data = comb_TABLES_nd %>% filter(threshold != 'NS') %>% left_join(wigbySFP, by = 'FBgn'),
  #   aes(label = Symbol),
  #   size = 4,
  #   colour = 'black',
  #   box.padding = unit(0.35, "lines"),
  #   point.padding = unit(0.3, "lines")
  # ) +
  geom_text(data = lab_text, colour = 'black', hjust = 1,
            aes(x = 20, y = 0.5, label = paste0(lab)), size = 4, fontface = "italic") +
  #ggsave('plots/TPP_results/TPP_volcano_mated-virgin_comb.pdf', height = 8, width = 18, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

Overlap of ejaculate candidates between species

# #comb db
# upset(fromList(list(ame = comb_TABLES_nd$Orthogroup[comb_TABLES_nd$species == 'ame' & comb_TABLES_nd$logFC > 1 & comb_TABLES_nd$adj.P.Val < 0.05],
#                     nov = comb_TABLES_nd$Orthogroup[comb_TABLES_nd$species == 'nov' & comb_TABLES_nd$logFC > 1 & comb_TABLES_nd$adj.P.Val < 0.05],
#                     vir = comb_TABLES_nd$Orthogroup[comb_TABLES_nd$species == 'vir' & comb_TABLES_nd$logFC > 1 & comb_TABLES_nd$adj.P.Val < 0.05])))

#pdf('plots/TPP_results/TPP_sfp_overlap.pdf', height = 4, width = 4)
plot(venn(c('D. ame' = 46, "D. nov" = 18, "D. vir" = 13, 
            'D. ame&D. nov' = 31, 'D. ame&D. vir' = 5, 'D. nov&D. vir' = 1, 
            'D. ame&D. nov&D. vir' = 55)),
     quantities = TRUE,
     fills = list(fill = viridis::viridis(n = 3), alpha = .5))

#dev.off()

# # percent overlap
# comb_TABLES_nd %>%
#   filter(logFC > 1, adj.P.Val < 0.05) %>%
#   group_by(species) %>%
#   count() %>%
#   mutate(p = 55/n * 100)

# proportion of ejaculate candidates with signal peptide
comb_TABLES_nd %>% 
  filter(str_detect(DA, pattern = 'MB')) %>% 
  distinct(genes, species, .keep_all = TRUE) %>% 
  group_by(species, DA) %>% dplyr::count() %>% 
  pivot_wider(id_cols = species, names_from = DA, values_from = n) %>% 
  mutate(prop.sec = MBsec/(MBsec + MB)) %>% 
  kable(digits = 3, 
        caption = 'The number of proteins higher in abundance in mated comapred to virgin samples with or without a secretion signal for each species') %>% 
  kable_styling(full_width = FALSE)

The number of proteins higher in abundance in mated comapred to virgin samples with or without a secretion signal for each species
species	MB	MBsec	prop.sec
ame	77	84	0.522
nov	55	66	0.545
vir	38	46	0.548

Sfp overlap with previous studies

We compare the number of putative seminal fluid proteins identified in previous studies in the virilis group from Ahmed-Braimah et al. 2017 and Ahmed-Braimah et al. 2021.

overlap with Ahmed-Braimah et al. 2017

# upset(fromList(list(proteomic = ejac_cands_multiDB$FBgn,
#                     AG_biased = Dvir_SFPs$FBgn[Dvir_SFPs$bias == 'AG-biased'],
#                     EB_biased = Dvir_SFPs$FBgn[Dvir_SFPs$bias == 'EB-biased'],
#                     SFP       = Dvir_SFPs$FBgn[Dvir_SFPs$bias == 'SFP'])))

#pdf('plots/TPP_sfp_orth_overlap.pdf', height = 4, width = 4)
plot(euler(c('AG' = 107, "EB" = 85, "Prot" = 196, 'SFP' = 49,
             'SFP&Prot' = 22, 'AG&Prot' = 12, 'EB&Prot' = 7)),
     quantities = TRUE,
     fills = list(fill = c("turquoise", "orange", "red", 'purple'), alpha = .5))

#dev.off()

Overlap with female reproductive tract genes

# upset(fromList(list(proteomic = unique(comb_TABLES$FBgn),
#                     FRT_biased = FRTbiased$FBgn_ID,
#                     PMR_biased = virilisPMDE$FBgn_ID)))

plot(euler(c('FRT' = 84, "PMR" = 334, "PRO" = 2031, 
             'FRT&PMR' = 14, 'FRT&PRO' = 48, 'PRO&PMR' = 93,
             'FRT&PMR&PRO' = 2)
              ),
     quantities = TRUE,
     fills = list(fill = c("orange", "red", 'purple'), alpha = .5))

Detecting artifactual differences

We look at potential differences in abundance between groups of proteins which may indicate flaws in the experimental design. For instance, ejaculate proteins may be ‘swamped out’ by the more highly abundant/speciose female reproductive tract proteins. Furthermore, serine-type endopeptidases (STEP) are down-regulated after mating - a signature of the postmating female response. If our method is effective at capturing the female reproductive tract in a ‘virgin-like’ state, then we expect no difference in the abundance of STEPs between mated and virgin samples.

Abundance of ejaculate proteins compared to remaining FRT proteins

wilc_abund <- comb_TABLES_nd %>% 
  mutate(up = if_else(adj.P.Val < 0.05 & logFC > 1, "ejac", 'frt')) %>% 
  group_by(species) %>%
  do(fit = broom::tidy(wilcox.test(AveExpr ~ as.factor(up), data = .))) %>% 
  unnest(fit) %>% 
  mutate(up = NA,
         p.val = ifelse(p.value < 0.001, "p < 0.001", paste0('p = ', round(p.value, 3))))

comb_TABLES_nd %>% 
  mutate(up = if_else(adj.P.Val < 0.05 & logFC > 1, "Ejaculate", 'FRT')) %>% 
  ggplot(aes(x = up, y = AveExpr)) +
  geom_violin(aes(fill = up), alpha = .5) +
  geom_boxplot(aes(fill = up), width = .3, notch = TRUE) + 
  geom_jitter(data = comb_TABLES_nd %>% 
                mutate(up = if_else(adj.P.Val < 0.05 & logFC > 1, "Ejaculate", 'FRT')) %>% 
                filter(up == 'Ejaculate'), 
              colour = cbPalette[8], width = .3, alpha = .5) + 
  # gghalves::geom_half_boxplot(aes(fill = up), outlier.shape = NA) +
  # gghalves::geom_half_point(aes(colour = up), alpha = .3) +
  scale_fill_manual(values = cbPalette[8:9]) +
  scale_colour_manual(values = cbPalette[8:9]) +
  facet_wrap(~species, nrow = 1, labeller = as_labeller(facet_names)) +
  labs(y = 'Average abundance') +
  theme_bw() +
  theme(legend.position = '',
        axis.title.x = element_blank(),
        strip.text = element_text(size = 15, face = "italic")) +
  geom_text(data = wilc_abund,
            aes(x = 1.5, y = 12, 
                label = #paste0('Chi^2(1) = ', round(statistic, 2), 
                        #       ', ', 
                  p.val)) +
  # ggsignif::geom_signif(comparisons = list(c("ejac", "frt")),
  #                       map_signif_level = TRUE) +
  #ggsave('plots/TPP_results/TPP_ejac_vs_frt.pdf', height = 4, width = 12, dpi = 600, useDingbats = FALSE) +
  NULL

Abundance of serine-type endopeptidases

We downloaded gene ontology (GO) information for all D. virilis genes from FlyBase.org and identified all genes with serine-type endopeptidase (STEP) annotation. We tested whether these STEPs differed in abundance between samples using Kruskal-Wallis rank sum tests.

wilc_STEP <- comb_TABLES_nd %>% 
  mutate(up = if_else(adj.P.Val < 0.05 & logFC > 1, "Ejaculate", 'FRT')) %>% 
  filter(FBgn %in% STEP$FBgn) %>% 
  group_by(species) %>%
  do(fit = broom::tidy(wilcox.test(AveExpr ~ as.factor(up), data = .))) %>% 
  unnest(fit) %>% 
  mutate(up = NA,
         p.val = ifelse(p.value < 0.001, "p < 0.001", paste0('p = ', round(p.value, 3))))

comb_TABLES_nd %>% 
  mutate(up = if_else(adj.P.Val < 0.05 & logFC > 1, "Ejaculate", 'FRT')) %>% 
  filter(FBgn %in% STEP$FBgn) %>% 
  ggplot(aes(x = up, y = AveExpr, fill = up)) +
  gghalves::geom_half_boxplot(aes(fill = up), outlier.shape = NA) +
  gghalves::geom_half_point(aes(colour = up), alpha = .3) +
  scale_fill_manual(values = cbPalette[8:9]) +
  scale_colour_manual(values = cbPalette[8:9]) +
  labs(y = 'Average abundance') +
  facet_wrap(~species, nrow = 1, labeller = as_labeller(facet_names)) +
  theme_bw() +
  theme(legend.position = '',
        axis.title.x = element_blank(),
        strip.text = element_text(size = 15, face = "italic")) +
  geom_text(data = wilc_STEP,
            aes(x = 1.5, y = 11, 
                label = p.val)) +
  # ggsignif::geom_signif(comparisons = list(c("Ejaculate", "FRT")),
  #                       map_signif_level = TRUE) +
  #ggsave('plots/TPP_results/TPP_STEP_abundance.pdf', height = 3, width = 9, dpi = 600, useDingbats = FALSE) +
  NULL

We replot the volcano plots to highlight serine-type endopeptidases, postmating female response genes and female reproductive tract genes perturbed after a heterospecific mating.

comb_TABLES %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(alpha = .1) +
  scale_colour_manual(values = c('red', 'hotpink', 'blue', 'purple', 'grey90')) +
  geom_point(data = comb_TABLES %>% 
    filter(threshold == 'SD' & FBgn %in% STEP$FBgn), colour = 'yellow', alpha = .8, size = 3) + 
  geom_point(data = comb_TABLES %>% 
    filter(threshold == 'SD' & FBgn %in% FRTbiased$FBgn_ID), colour = 'blue', alpha = .8, size = 3) + 
  geom_point(data = comb_TABLES %>% 
    filter(threshold == 'SD' & FBgn %in% virilisPMDE$FBgn_ID), colour = 'green', alpha = .8, size = 3) + 
  facet_wrap(~species) +
  theme_bw() +
  theme(strip.text = element_text(size = 15, face = "italic")) +
  NULL

# #### Abundance of female reproductive tract biased genes and postmating female response genes
# upset(fromList(list(proteomic = comb_TABLES$FBgn,
#                     FRT_biased = FRTbiased$FBgn_ID,
#                     PMR_biased = virilisPMDE$FBgn_ID)))
# 
# comb_TABLES %>% filter(FBgn %in% FRTbiased$FBgn_ID & threshold == 'SD')
# comb_TABLES %>% filter(FBgn %in% virilisPMDE$FBgn_ID & threshold == 'SD') %>% distinct(genes, .keep_all = TRUE)
# 
# intersect(comb_TABLES_nd %>% filter(FBgn %in% FRTbiased$FBgn_ID & threshold == 'SD') %>% pull(genes),
#           comb_TABLES_nd %>% filter(FBgn %in% virilisPMDE$FBgn_ID & threshold == 'SD') %>% pull(genes))
# 
# # STEP showing differences in abundance
# comb_TABLES %>% 
#     filter(threshold == 'SD' & FBgn %in% STEP$FBgn) %>% distinct(genes, .keep_all = TRUE)

Divergence between ejaculate candidates

We tested for differential abundance of proteins found in mated samples between each species. We restrict the data to those proteins which showed differential abundance between mated and virgin samples (adjusted p-value < 0.05 and logFC > |1|) in any species, i.e. putative male derived/Sfps/sperm proteins.

sfp_dat <- multiDB %>% 
  filter(protein_vir %in% unique(comb_TABLES %>% filter(adj.P.Val < 0.05  & logFC > 1) %>% pull(genes))) %>% 
  select(Orthogroup, protein_vir, contains('M')) %>% 
  mutate(across(3:11, ~replace_na(.x, 0)))

# make design matrix to test diffs between groups
matedSample <- sampInfo$condition[grep('M', x = sampInfo$condition)]
matedDesign <- model.matrix(~0 + matedSample)
colnames(matedDesign) <- unique(matedSample)
rownames(matedDesign) <- rep(1:3, 3)

# create DGElist and fit model 
dgeSFP <- DGEList(counts = sfp_dat[, -c(1:2)], genes = sfp_dat$protein_vir, group = matedSample)
dgeSFP <- calcNormFactors(dgeSFP, method = 'TMM')
dgeSFP <- estimateCommonDisp(dgeSFP)
dgeSFP <- estimateTagwiseDisp(dgeSFP)
# voom normalisation
dgeSFPvoom <- voom(dgeSFP, matedDesign, plot = FALSE)
# fit linear model
lmSFP <- lmFit(dgeSFPvoom, design = matedDesign)

# make contrasts - higher values = higher in first alphabetically 
cont.mated <- makeContrasts(ame.MTD.nov = AM - NM,
                            ame.MTD.vir = AM - VM,
                            nov.MTD.vir = NM - VM,
                            levels = matedDesign)

# perform lmFit tests
# ame vs. nov
lm_anM <- contrasts.fit(lmSFP, cont.mated[,"ame.MTD.nov"])
lm_anM <- eBayes(lm_anM)
lm_anM.tTags.table <- topTable(lm_anM, adjust.method = "BH", number = Inf) %>% 
  mutate(comparison = "aMn",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

# ame vs. vir
lm_avM <- contrasts.fit(lmSFP, cont.mated[,"ame.MTD.vir"])
lm_avM <- eBayes(lm_avM)
lm_avM.tTags.table <- topTable(lm_avM, adjust.method = "BH", number = Inf) %>% 
  mutate(comparison = "aMv",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

# nov vs. vir
lm_nvM <- contrasts.fit(lmSFP, cont.mated[,"nov.MTD.vir"])
lm_nvM <- eBayes(lm_nvM)
lm_nvM.tTags.table <- topTable(lm_nvM, adjust.method = "BH", number = Inf) %>% 
  mutate(comparison = "nMv",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

mated_TABLES <- rbind(lm_anM.tTags.table,
                      lm_avM.tTags.table, 
                      lm_nvM.tTags.table) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  mutate(
    # add variable for signal peptides reaching significance threshold
    sigP = if_else(genes %in% vir_sig$protein, 'sigP', 'not'),
    # add variable splitting by bias to virgin vs. mated and signal peptide
    sperm = if_else(FBgn %in% sperm_mel$FBgn_v, 'Sperm', 'not'),
    Sfp = if_else(FBgn %in% wigbySFP$FBgn, 'Sfp', 'not'),
    DA = case_when(is.na(genes) == TRUE & threshold == 'SD' ~ "uni",
                   sigP == 'sigP' & threshold == 'SD' ~ "sigP",
                   threshold == 'SD' ~ "DA", 
                   TRUE ~ 'NS')) %>% 
  distinct(genes, comparison, .keep_all = TRUE)

# remove duplicated genes
mated_TABLES_nd <- mated_TABLES %>% filter(!Orthogroup %in% ortho_dups$Orthogroup)

# #save differentially abundant proteins to table for ClueGO
# mated_TABLES %>% filter(threshold == 'SD') %>%
#   distinct(FBgn, .keep_all = TRUE) %>% write_csv('output/ClueGOlists/TPP_Sfp_DA_multiDB.csv')

MA plots

mated_TABLES_nd %>% filter(DA != 'NS') %>% 
  ggplot(aes(x = AveExpr, y = logFC, colour = DA)) +
  geom_point(data = mated_TABLES_nd %>% filter(DA == 'NS'), colour = 'grey') +
  geom_point(alpha = .6) + 
  geom_point(data = mated_TABLES %>% 
               filter(Orthogroup %in% ortho_dups$Orthogroup, threshold == "SD"),
             colour = "red") + 
  scale_colour_manual(values = egypt.pal) +
  labs(y = '-log10(p-value)') +
  facet_wrap(~comparison, nrow = 1, 
             labeller = as_labeller(c(aMn = "D. americana vs. D. novamexicana", 
                                      aMv = 'D. americana vs. D. virilis', 
                                      nMv = "D. novamexicana vs. D. virilis"))) +
  theme_bw() +
  theme(legend.position = c(0.04, 0.2),
        legend.background = element_rect(fill = NA),
        legend.title = element_blank(),
        strip.text = element_text(size = 15, face = "italic")) +
  # geom_label_repel(
  #   data = mated_TABLES_nd %>% filter(DA != 'NS') %>% left_join(wigbySFP, by = 'FBgn'),
  #   aes(label = Symbol),
  #   size = 4,
  #   colour = 'black',
  #   box.padding = unit(0.35, "lines"),
  #   point.padding = unit(0.3, "lines")
  # ) +
  #ggsave('plots/TPP_results/MA_sfp.pdf', height = 4, width = 12, dpi = 600, useDingbats = FALSE) +
  NULL

# total number of SFPs differentially abundant
mated_TABLES %>% filter(threshold == "SD") %>%
  left_join(wigbySFP, by = 'FBgn') %>% drop_na(Symbol) %>% pull(Symbol) %>% unique()

 [1] "Est-6"   "CG15117" "NT5E-2"  "CG2852"  "scpr-A"  "CG40160" "CG11977"
 [8] "aqrs"    "CG34002" "CG17574" "Hexo2"   "CG34129" "CG42564" "Npc2b"  
[15] "CG14034" "CG4815"

# number of copies differentially abundant in each comparison
mated_TABLES %>% filter(threshold == "SD") %>%
  left_join(wigbySFP, by = 'FBgn') %>% drop_na(Symbol) %>% 
  group_by(comparison, Symbol) %>% 
  count() %>% pivot_wider(names_from = comparison, values_from = n)

# A tibble: 16 × 4
# Groups:   Symbol [16]
   Symbol    aMn   aMv   nMv
   <chr>   <int> <int> <int>
 1 aqrs        1     1    NA
 2 CG11977     1     1    NA
 3 CG15117     1     1     1
 4 CG17574     1     1     1
 5 CG2852      1    NA     1
 6 CG34002     1     1     1
 7 CG34129     1    NA     1
 8 CG40160     1     1    NA
 9 CG42564     1     1     1
10 Est-6       1     1    NA
11 Hexo2       1     1    NA
12 NT5E-2      1     1     1
13 scpr-A      3     3     3
14 CG14034    NA     1     1
15 CG4815     NA     1     1
16 Npc2b      NA     1     1

# table of all differentially abundant SFPs or sperm proteins
bind_rows(mated_TABLES %>% filter(threshold == "SD") %>%
            left_join(wigbySFP, by = 'FBgn') %>%
            distinct(FBgn_mel, .keep_all = TRUE) %>% drop_na(Symbol) %>%
            select(FBgn, FBgn_mel, Symbol),
          mated_TABLES %>% filter(threshold == "SD") %>%
            left_join(sperm_mel, by = c("FBgn" = "FBgn_v")) %>%
            distinct(mel_FBgn_ID, .keep_all = TRUE) %>% drop_na(mel_FBgn_ID) %>%
            select(FBgn, FBgn_mel = mel_FBgn_ID, Symbol = Gene.Symbol)) %>% 
  #write_csv()
  kable(caption = 'D. melanogaster SFPs and sperm protein orthologs showing differential abundannce between ejaculate candidates') %>% 
  kable_styling(full_width = FALSE) %>%
  group_rows("SFPs", 1, 13) %>%
  group_rows("Sperm proteins", 14, 26)

D. melanogaster SFPs and sperm protein orthologs showing differential abundannce between ejaculate candidates
FBgn	FBgn_mel	Symbol
SFPs
FBgn0197362	FBgn0000592	Est-6
FBgn0206811	FBgn0034417	CG15117
FBgn0207519	FBgn0050104	NT5E-2
FBgn0208385	FBgn0034753	CG2852
FBgn0211755	FBgn0037889	scpr-A
FBgn0211457	FBgn0058160	CG40160
FBgn0204606	FBgn0037650	CG11977
FBgn0211264	FBgn0039598	aqrs
FBgn0208251	FBgn0054002	CG34002
FBgn0207308	FBgn0033777	CG17574
FBgn0206035	FBgn0041629	Hexo2
FBgn0210455	FBgn0083965	CG34129
FBgn0211681	FBgn0260766	CG42564
Sperm proteins
FBgn0211705	FBgn0038198	Npc2b
FBgn0204528	FBgn0250847	CG14034
FBgn0211573	FBgn0039568	CG4815
FBgn0202308	FBgn0001187	Hex-C
FBgn0210522	FBgn0037891	CG5214
FBgn0204802	FBgn0069354	Porin2
FBgn0016474	FBgn0004045	Yp1
FBgn0199705	FBgn0046704	Liprin-alpha
FBgn0197452	FBgn0013812	Dhc93AB
FBgn0204735	FBgn0031869	CG18304
FBgn0198514	FBgn0015278	Pi3K68D
FBgn0203892	FBgn0086907	Cyt-c-d
FBgn0202027	FBgn0030482	CG1673
FBgn0205934	FBgn0031171	CG1801
FBgn0199271	FBgn0051644	CG31644
FBgn0201010	FBgn0266136	Gyc76C
FBgn0210936	FBgn0038395	CG10407
FBgn0204803	FBgn0260453	CG17140
FBgn0211422	FBgn0039257	tnc

Overlap between differentially abundant proteins

# upset(fromList(list(
#   a.vs.n = lm_anM.tTags.table$genes[lm_anM.tTags.table$threshold == 'SD'],
#   a.vs.v = lm_avM.tTags.table$genes[lm_avM.tTags.table$threshold == 'SD'],
#   n.vs.v = lm_nvM.tTags.table$genes[lm_nvM.tTags.table$threshold == 'SD'])))

#pdf('plots/TPP_results/TPP_divergence_overlap_comb.pdf', height = 4, width = 4)
plot(venn(c('a.vs.n' = 6, 'a.vs.v' = 8, 'n.vs.v' = 6, 
             'a.vs.n&a.vs.v' = 46, 'a.vs.n&n.vs.v' = 39, 'a.vs.v&n.vs.v' = 29,
             'a.vs.n&a.vs.v&n.vs.v' = 64)),
     quantities = TRUE,
     fills = list(fill = cbPalette[2:4], alpha = .5))

#dev.off()

Are signal peps more likely to be differentially abundant?

mated_TABLES %>% 
  group_by(comparison, threshold, sigP) %>% dplyr::count() %>% 
  ggplot(aes(x = threshold, y = n, fill = sigP)) +
  geom_col(position = 'fill') +
  facet_wrap(~comparison, nrow = 1, 
             labeller = as_labeller(c(aMn = "D. americana vs. D. novamexicana", 
                                      aMv = 'D. americana vs. D. virilis', 
                                      nMv = "D. novamexicana vs. D. virilis"))) +
  theme_bw() +
  NULL

# Fisher's exact tests
mated_TABLES %>% 
  group_by(comparison) %>% 
  do(fit = broom::tidy(fisher.test(as.matrix(xtabs(~ threshold + sigP, ., sparse = TRUE))))) %>% 
  unnest(fit) %>% 
  mutate(comparison = recode(comparison, 
                             aMn = "D. ame vs. D. nov", 
                             aMv = "D. ame vs. D. vir", 
                             nMv = 'D. nov vs. D. vir'),
         across(2:5, ~ round(.x, 2)),
         p.value = ifelse(p.value < 0.001, "< 0.001", p.value),
         Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')')) %>% 
  dplyr::select(Comparison = comparison, Estimate, p.value) %>% 
  kable(caption = 'Fisher\'s exact tests for signal peptides that are differentially abundant or not') %>% 
  kable_styling(full_width = FALSE)

Fisher’s exact tests for signal peptides that are differentially abundant or not
Comparison	Estimate	p.value
D. ame vs. D. nov	1.06 (0.62-1.84)	0.90
D. ame vs. D. vir	0.59 (0.34-1.05)	0.06
D. nov vs. D. vir	0.9 (0.52-1.54)	0.70

# are sperm genes more likely to be differentially abundant?
mated_TABLES %>% 
  #filter(sperm == 'Sperm') %>% 
  group_by(comparison, threshold, sperm) %>% dplyr::count() %>% 
  ggplot(aes(x = sperm, y = n, fill = threshold)) +
  geom_col(position = 'fill') +
  facet_wrap(~comparison, nrow = 1, 
             labeller = as_labeller(c(aMn = "D. americana vs. D. novamexicana", 
                                      aMv = 'D. americana vs. D. virilis', 
                                      nMv = "D. novamexicana vs. D. virilis"))) +
  theme_bw() +
  theme(strip.text = element_text(size = 15, face = "italic")) +
  NULL

# Fisher's exact tests
mated_TABLES %>% 
  group_by(comparison) %>% 
  do(fit = broom::tidy(fisher.test(as.matrix(xtabs(~ threshold + sperm, ., sparse = TRUE))))) %>% 
  unnest(fit) %>% 
  mutate(across(2:5, ~ round(.x, 2)),
         p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
         Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')')) %>% 
  dplyr::select(comparison, Estimate, p.value) %>% 
  kable(digits = 3, 
        caption = 'Fisher\'s exact tests for sperm proteins that are differentially abundant or not') %>% 
  kable_styling(full_width = FALSE)

Fisher’s exact tests for sperm proteins that are differentially abundant or not
comparison	Estimate	p.value
aMn	1 (0.36-2.95)	1.00
aMv	0.64 (0.23-1.82)	0.34
nMv	0.74 (0.26-2.06)	0.64

Heatmap

# make DF
sfp_md <- data.frame(dgeSFPvoom$genes,
                     dgeSFPvoom$E) %>% 
  rowwise() %>% 
  mutate(`D. ame` = median(c(!!! rlang::syms(grep('AM', names(.), value = TRUE)))),
         `D. nov` = median(c(!!! rlang::syms(grep('NM', names(.), value = TRUE)))),
         `D. vir` = median(c(!!! rlang::syms(grep('VM', names(.), value = TRUE))))) %>% 
  #dplyr::select(genes, `D. ame`, `D. nov`, `D. vir`) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  # add mel Sfp ortholog
  left_join(wigbySFP %>% dplyr::select(FBgn, mel_Sfp = Symbol)) %>% 
  mutate(Sfp = if_else(FBgn %in% wigbySFP$FBgn, 'Sfp', 'other'),
         category = case_when(FBgn %in% sperm_mel$FBgn_v ~ 'sperm',
                              FBgn %in% wigbySFP$FBgn ~ 'Sfp',
                              genes %in% vir_sig$protein ~ 'sigP',
                              TRUE ~ as.character(NA)),
         dupl = ifelse(Orthogroup %in% ortho_dups$Orthogroup, "Dupl", NA),
         orthdup = ifelse(dupl == "Dupl", Orthogroup, NA)) %>% 
  distinct(genes, .keep_all = TRUE) #%>% 
  # remove duplicated genes 
  #filter(!Orthogroup %in% ortho_dups$Orthogroup)

sfp_hm <- sfp_md %>% select(`D. ame`, `D. nov`, `D. vir`,
                            FBgn, mel_Sfp, category)

#row.names(sfp_hm) <- sfp_md$genes

mat_scaled <- as.matrix(pheatmap:::scale_rows(sfp_hm[, 1:3]))

# # Dissimilarity matrix
# d <- dist(mat_scaled, method = "euclidean")
# # # Hierarchical clustering using Complete Linkage
# hc1 <- hclust(d, method = "complete" )
# # Plot the obtained dendrogram
# #plot(hc1, cex = 0.6, hang = -1)

fviz_nbclust(mat_scaled, FUN = hcut, method = "silhouette")

sub_grp <- kmeans(mat_scaled, centers = 6)$cluster

# # Cut tree into groups
# sub_grp <- cutree(hc1, k = 6)

# heatmap annotations
cat_lab <- rowAnnotation(Cluster = sub_grp,
                         col = list(Cluster = setNames(cbPalette[2:(max(sub_grp) + 1)], unique(sub_grp))),
                         title = NULL,
                         show_annotation_name = FALSE)

Sfp_labs <- rowAnnotation(#Dup_labs = sfp_md$orthdup,
                          Category = sfp_md$category,
                          col = list(Category = c(sperm = "blue",
                                                  Sfp = "purple", 
                                                  sigP = "yellow"),
                                     Dup_labs = setNames(rep(viridis::turbo(
                                       n_distinct(sfp_md$orthdup[is.na(sfp_md$dupl) == FALSE])), 
                                       rle(sfp_md$Orthogroup[is.na(sfp_md$dupl) == FALSE])$lengths), 
                                       sfp_md$Orthogroup[is.na(sfp_md$dupl) == FALSE])),
                          na_col = NA,
                          Sfp_labs = anno_mark(at = c(grep('Sfp', x = sfp_md$Sfp)), 
                                               labels = sfp_md$mel_Sfp[sfp_md$Sfp == 'Sfp'],
                                               labels_gp = gpar(fontsize = 8)),
                          title = NULL,
                          #show_legend = c("Dup_labs" = FALSE),
                          show_annotation_name = FALSE)

sfpha <- HeatmapAnnotation(species = anno_block(gp = gpar(fill = c(v.pal[c(3, 2, 1)])),
                                                labels = c(expression(italic('D. ame')), 
                                                           expression(italic('D. nov')),
                                                           expression(italic('D. vir'))), 
                                                labels_gp = gpar(col = c('white', 'white', 'black'), 
                                                                 fontsize = 10)))

sfp_chm <- Heatmap(mat_scaled, 
                   col = viridis::inferno(50),
                   heatmap_legend_param = list(title = "log2(intensities)",
                                               #title_position = "leftcenter-rot",
                                               direction = "horizontal",
                                               at = c(-1.5, 0, 1.5)),
                   right_annotation = Sfp_labs,
                   left_annotation = cat_lab,
                   top_annotation = sfpha, 
                   show_row_names = FALSE, 
                   show_column_names = FALSE, 
                   row_split = max(sub_grp),
                   #row_km = max(sub_grp),
                   cluster_rows = cluster_within_group(t(mat_scaled), sub_grp),
                   column_split = 3,
                   column_gap = unit(0, "mm"),
                   row_title = NULL,
                   column_title = NULL)

#pdf('plots/TPP_results/TPP_sfp_heatmap.pdf', height = 8, width = 5)
sfp_chm

#dev.off()

We tested how protein abundance differed between species based on k-means cluster

sfp_cluster <- data.frame(genes = sfp_md$genes, sfp_hm[, 1:3], cluster = sub_grp) %>%
  pivot_longer(cols = 2:4) %>% 
  # #exclude proteins absent from each species (edgeR gives v. low estimate of abundance)
  #filter(value > -10) %>% 
  mutate(species = str_sub(name, 4, 4))

sfp_cluster %>% 
  mutate(cluster = paste('Cluster', cluster, sep = ' ')) %>% 
  #ggplot(aes(x = factor(name, levels = c('D..vir', 'D..ame', 'D..nov')), y = value, fill = name)) +
  ggplot(aes(x = name, y = value, fill = name)) +
  gghalves::geom_half_point(aes(colour = name)) +
  gghalves::geom_half_violin(aes(fill = name)) +
  stat_pointinterval(pch = 21, fill = 'white') +
  labs(x = 'Species', y = 'normalised abundance') +
  scale_fill_viridis_d() +
  scale_colour_viridis_d() +
  facet_wrap(~cluster, scales = 'free_y') +
  theme_bw() +
  theme(legend.position = '',
        axis.text.x = element_text(face = 'italic')) +
  ggpubr::stat_compare_means(aes(label = ..p.signif..),
                             comparisons = list(c('D..ame', 'D..nov'),
                                                c('D..ame', 'D..vir'),
                                                c('D..nov', 'D..vir')),
                             method = 't.test') +
  #ggsave('plots/TPP_results/sfp_pointinterval.pdf', height = 6, width = 6, dpi = 600, useDingbats = FALSE) +
  NULL

Model fits

sfp_cluster %>% group_by(cluster) %>%
  do(fit = broom::glance(lm(value ~ name, data = .))) %>% 
  unnest(fit) %>% 
  mutate(p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3))) %>% 
  dplyr::select(-adj.r.squared, -sigma, -c(df:nobs)) %>% 
  kable(digits = 3, 
        caption = 'Model fits') %>% 
  kable_styling(full_width = FALSE)

Model fits
cluster	r.squared	statistic	p.value
1	0.140	10.724	< 0.001
2	0.112	6.249	0.003
3	0.035	1.356	0.264
4	0.196	21.195	< 0.001
5	0.293	39.701	< 0.001
6	0.045	0.851	0.436

sfp_cluster %>% group_by(cluster) %>%
  do(fit = broom::tidy(lm(value ~ name, data = .))) %>% 
  unnest(fit) %>% 
  mutate(p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
         sigLabel = case_when(p.value < 0.001 ~ "***", 
                              p.value < 0.01 ~ "**",
                              p.value < 0.05 ~ "*",
                              TRUE ~ '')) %>% 
  kable(digits = 3, 
        caption = 'Model estimates') %>% 
  kable_styling(full_width = FALSE)

Model estimates
cluster	term	estimate	std.error	statistic	p.value	sigLabel
1	(Intercept)	8.526	0.719	11.861	< 0.001	***
1	nameD..nov	-3.915	1.017	-3.852	< 0.001	***
1	nameD..vir	0.306	1.017	0.301	0.764
2	(Intercept)	9.052	1.300	6.962	< 0.001	***
2	nameD..nov	-5.341	1.839	-2.904	0.005	**
2	nameD..vir	-5.881	1.839	-3.198	0.002	**
3	(Intercept)	7.540	0.729	10.335	< 0.001	***
3	nameD..nov	-0.214	1.032	-0.207	0.836
3	nameD..vir	1.353	1.032	1.311	0.194
4	(Intercept)	4.747	0.586	8.103	< 0.001	***
4	nameD..nov	4.751	0.829	5.734	< 0.001	***
4	nameD..vir	4.589	0.829	5.538	< 0.001	***
5	(Intercept)	9.321	0.761	12.249	< 0.001	***
5	nameD..nov	0.578	1.076	0.537	0.592
5	nameD..vir	-8.001	1.076	-7.434	< 0.001	***
6	(Intercept)	10.730	0.722	14.861	< 0.001	***
6	nameD..nov	-1.289	1.021	-1.262	0.215
6	nameD..vir	-0.936	1.021	-0.917	0.365

sfp_cluster %>% group_by(cluster) %>%
  do(fit = broom::tidy(TukeyHSD(aov(value ~ name, data = .)))) %>% 
  unnest(fit) %>% 
  mutate(contrast = recode(contrast, 
                      `D. nov-D. ame` = "D. ame vs. D. nov", 
                      `D. vir-D. ame` = "D. ame vs. D. vir", 
                      `D. vir-D. nov` = 'D. nov vs. D. vir'),
         p.value = ifelse(adj.p.value < 0.001, '< 0.001', round(adj.p.value, 3)),
         across(4:8, ~ round(.x, 2)),
         Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')'),
         sigLabel = case_when(p.value < 0.001 ~ "***", 
                              p.value < 0.01 ~ "**",
                              p.value < 0.05 ~ "*",
                              TRUE ~ '')) %>% 
  dplyr::select(-term, -null.value, -c(estimate:adj.p.value)) %>% 
  kable(digits = 3, 
        caption = 'Post hoc contrasts') %>% 
  kable_styling(full_width = FALSE)

Post hoc contrasts
cluster	contrast	p.value	Estimate	sigLabel
1	D..nov-D..ame	< 0.001	-3.92 (-6.33–1.51)	***
1	D..vir-D..ame	0.951	0.31 (-2.1-2.72)
1	D..vir-D..nov	< 0.001	4.22 (1.81-6.63)	***
2	D..nov-D..ame	0.013	-5.34 (-9.72–0.97)
2	D..vir-D..ame	0.005	-5.88 (-10.26–1.5)	**
2	D..vir-D..nov	0.954	-0.54 (-4.92-3.84)
3	D..nov-D..ame	0.977	-0.21 (-2.68-2.25)
3	D..vir-D..ame	0.393	1.35 (-1.11-3.82)
3	D..vir-D..nov	0.288	1.57 (-0.9-4.03)
4	D..nov-D..ame	< 0.001	4.75 (2.79-6.71)	***
4	D..vir-D..ame	< 0.001	4.59 (2.63-6.55)	***
4	D..vir-D..nov	0.979	-0.16 (-2.12-1.8)
5	D..nov-D..ame	0.853	0.58 (-1.96-3.12)
5	D..vir-D..ame	< 0.001	-8 (-10.54–5.46)	***
5	D..vir-D..nov	< 0.001	-8.58 (-11.12–6.04)	***
6	D..nov-D..ame	0.425	-1.29 (-3.78-1.21)
6	D..vir-D..ame	0.633	-0.94 (-3.43-1.56)
6	D..vir-D..nov	0.936	0.35 (-2.14-2.85)

Divergence between (virgin) reproductive tracts

fem_dat <- multiDB %>% 
  filter(!protein_vir %in% c(comb_TABLES %>% filter(adj.P.Val < 0.05  & logFC > 1) %>% pull(genes))) %>% 
  #filter(!Orthogroup %in% c(ame_cands$Orthogroup, nov_cands$Orthogroup, vir_cands$Orthogroup)) %>% 
  dplyr::select(-contains('M')) %>% 
  mutate(across(3:9, ~replace_na(.x, 0)))

# # most identified sperm proteins remain in virgin samples after excluding DA proteins between mated/viring samples
# fem_dat %>%
#   left_join(vir_fbgn, by = c("protein_vir" = "gene_id", "Orthogroup")) %>%
#   distinct(protein_vir, .keep_all = TRUE) %>%
#   filter(FBgn %in% sperm_mel$FBgn_v)
# 
# multiDB %>%
#   left_join(vir_fbgn, by = c("protein_vir" = "gene_id", "Orthogroup")) %>%
#   distinct(protein_vir, .keep_all = TRUE) %>%
#   filter(FBgn %in% sperm_mel$FBgn_v)

# # number of FRT proteins with...
# fem_dat %>% map_dbl(n_distinct)
# fem_dat %>% distinct(protein_vir) %>% filter(protein_vir %in% vir_sig$protein) %>% dim()

# make design matrix to test diffs between groups
femSample <- sampInfo$condition[grep('V$', x = sampInfo$condition)]
femdesign <- model.matrix(~0 + femSample)
colnames(femdesign) <- unique(femSample)
rownames(femdesign) <- c(rep(1:2, 2), 1:3)

# create DGElist and fit model 
dgeFEM <- DGEList(counts = fem_dat[, -c(1:2)], genes = fem_dat$protein_vir, group = femSample)
dgeFEM <- calcNormFactors(dgeFEM, method = 'TMM')
dgeFEM <- estimateCommonDisp(dgeFEM)
dgeFEM <- estimateTagwiseDisp(dgeFEM)
# voom normalisation
dgeFEMvoom <- voom(dgeFEM, femdesign, plot = FALSE)
# fit linear model
lmFEM <- lmFit(dgeFEMvoom, design = femdesign)

# make contrasts - higher values = higher in first alphabetically 
cont.fem <- makeContrasts(ame.FEM.nov = AV - NV,
                          ame.FEM.vir = AV - VV,
                          nov.FEM.vir = NV - VV,
                          levels = femdesign)

# perform lmFit tests
# ame vs. nov
lm_anF <- contrasts.fit(lmFEM, cont.fem[,"ame.FEM.nov"])
lm_anF <- eBayes(lm_anF)
lm_anF.tTags.table <- topTable(lm_anF, adjust.method = "BH", number = Inf) %>% 
  mutate(comparison = "aVn",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

# ame vs. vir
lm_avF <- contrasts.fit(lmFEM, cont.fem[,"ame.FEM.vir"])
lm_avF <- eBayes(lm_avF)
lm_avF.tTags.table <- topTable(lm_avF, adjust.method = "BH", number = Inf) %>% 
  mutate(comparison = "aVv",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

# nov vs. vir
lm_nvF <- contrasts.fit(lmFEM, cont.fem[,"nov.FEM.vir"])
lm_nvF <- eBayes(lm_nvF)
lm_nvF.tTags.table <- topTable(lm_nvF, adjust.method = "BH", number = Inf) %>% 
  mutate(comparison = "nVv",
         threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

fem_TABLES <- rbind(lm_anF.tTags.table,
                    lm_avF.tTags.table, 
                    lm_nvF.tTags.table) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  mutate(
    # add variable for signal peptides reaching significance threshold
    sigP = if_else(genes %in% vir_sig$protein, 'sigP', 'not'),
    spec = case_when(FBgn %in% virilisPMDE$FBgn_ID ~ 'PMR', 
                     FBgn %in% FRTbiased$FBgn_ID ~ 'FRT',
                     genes %in% vir_sig$protein & threshold == 'SD' ~ 'sigP',
                     TRUE ~ 'other')) %>% 
  distinct(genes, comparison, .keep_all = TRUE)

# remove duplicated genes
fem_TABLES_nd <- fem_TABLES %>% filter(!Orthogroup %in% ortho_dups$Orthogroup)

# # numbers of differentially abundant...
# # PMR proteins
# fem_TABLES %>% filter(FBgn %in% virilisPMDE$FBgn_ID & threshold == 'SD') %>%
#   group_by(comparison) %>% count()
# 
# # FRT proteins
# fem_TABLES %>% filter(FBgn %in% FRTbiased$FBgn_ID & threshold == 'SD') %>%
#   group_by(comparison) %>% count()
# 
# # signal peptides
# fem_TABLES %>% filter(genes %in% vir_sig$protein & threshold == 'SD') %>%
#   group_by(comparison) %>% count()

Volcano plots

fem_TABLES_nd %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(data = . %>% filter(threshold == 'NS'), alpha = .1, colour = 'grey') +
  geom_point(data = . %>% filter(threshold != 'NS'), aes(colour = spec), alpha = .5) + 
  scale_colour_brewer(palette = 'Dark2') +
  #scale_colour_viridis_d(option = 'C') +
  labs(y = '-log10(p-value)') +
  facet_wrap(~comparison, nrow = 1, 
             labeller = as_labeller(c(aVn = "D. americana vs. D. novamexicana", 
                                      aVv = 'D. americana vs. D. virilis', 
                                      nVv = "D. novamexicana vs. D. virilis"))) +
  theme_bw() +
  theme(legend.position = c(0.04, 0.2),
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA),
        strip.text = element_text(size = 15, face = "italic")) +
  #ggsave('plots/TPP_results/volcano_FRTbtSp.pdf', height = 4, width = 12, dpi = 600, useDingbats = FALSE) +
  NULL

Are signal peps more likely to be DA?

# are signal peps more likely to be DA?
fem_TABLES %>% 
  group_by(comparison, threshold, sigP) %>% dplyr::count() %>% 
  ggplot(aes(x = threshold, y = n, fill = sigP)) +
  geom_col(position = 'fill') +
  facet_wrap(~comparison, nrow = 1, 
             labeller = as_labeller(c(aVn = "D. americana vs. D. novamexicana", 
                                      aVv = 'D. americana vs. D. virilis', 
                                      nVv = "D. novamexicana vs. D. virilis"))) +
  theme_bw() +
  NULL

# Fisher's exact tests
fem_TABLES %>% group_by(comparison) %>% 
  do(fit = broom::glance(fisher.test(as.matrix(xtabs(~ threshold + sigP, ., sparse = TRUE))))) %>% 
  unnest(fit) %>% 
  mutate(across(2:5, ~ round(.x, 2)),
         p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
         comparison = recode(comparison, 
                             aVn = "D. ame vs. D. nov", aVv = "D. ame vs. D. vir", nVv = 'D. nov vs. D. vir'),
         Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')')) %>% 
  dplyr::select(comparison, Estimate, p.value) %>% 
  kable(caption = 'Proportion of signal peptides showing differential abundance between species. Fisher\'s exact tests') %>% 
  kable_styling(full_width = FALSE)

Proportion of signal peptides showing differential abundance between species. Fisher’s exact tests
comparison	Estimate	p.value
D. ame vs. D. nov	1.17 (0.96-1.43)	0.10
D. ame vs. D. vir	1.12 (0.92-1.36)	0.24
D. nov vs. D. vir	1.08 (0.86-1.36)	0.49

# #save proteins minus ejaculate candidates i.e. remaining proteins as 'female'
# fem_TABLES %>%
#   filter(!genes %in% c(comb_TABLES %>% filter(adj.P.Val < 0.05  & logFC > 1) %>% pull(genes))) %>% 
#   distinct(FBgn, .keep_all = TRUE) %>%
#   write_csv('output/ClueGOlists/TPP_all_NOTejac_FBgn.csv')
# 
# #save proteins minus ejaculate candidates i.e. remaining proteins as 'female' & signal pep
# fem_TABLES %>%
#   filter(!genes %in% c(comb_TABLES %>% filter(adj.P.Val < 0.05  & logFC > 1) %>% pull(genes))) %>% 
#   filter(sigP == 'sigP') %>%
#   distinct(FBgn, .keep_all = TRUE) %>%
#   write_csv('output/ClueGOlists/TPP_all_frtSigP_FBgn.csv')
# 
# #save differentially abundant proteins to table for ClueGO
# fem_TABLES %>% 
#   filter(!genes %in% c(comb_TABLES %>% filter(adj.P.Val < 0.05  & logFC > 1) %>% pull(genes)), 
#          threshold == 'SD') %>%
#   distinct(FBgn, .keep_all = TRUE) %>% 
#   write_csv('output/ClueGOlists/TPP_virgin_DA_multiDB.csv')

Heatmap

All proteins

fem_md <- data.frame(dgeFEMvoom$genes,
                     dgeFEMvoom$E) %>% 
  rowwise() %>% 
  mutate(`D. ame` = median(c(!!! rlang::syms(grep('A', names(.), value = TRUE)))),
         `D. nov` = median(c(!!! rlang::syms(grep('NV', names(.), value = TRUE)))),
         `D. vir` = median(c(!!! rlang::syms(grep('VV', names(.), value = TRUE))))) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  mutate(category = case_when(FBgn %in% sperm_mel$FBgn_v ~ 'sperm',
                              FBgn %in% wigbySFP$FBgn ~ 'Sfp',
                              TRUE ~ as.character(NA)),
         fem_cat = case_when(FBgn %in% STEP$FBgn ~ 'STEP',
                             FBgn %in% FRTbiased$FBgn_ID ~ 'FRT',
                             FBgn %in% virilisPMDE$FBgn_ID ~ 'pmr',
                             genes %in% vir_sig$protein ~ 'sig',
                             TRUE ~ as.character(NA))) %>% 
  left_join(flybase_GO %>% dplyr::select(FBgn, Symbol = mel_GeneSymbol), by = 'FBgn') %>% 
  mutate(Symbol = if_else(is.na(Symbol), FBgn, Symbol)) %>% 
  distinct(genes, .keep_all = TRUE)

# scale data
mat_scaled_frt <- as.matrix(pheatmap:::scale_rows(fem_md[, 9:11]))

# # Dissimilarity matrix
df <- dist(mat_scaled_frt, method = "euclidean")
# # Hierarchical clustering using Complete Linkage
hc1f <- hclust(df, method = "complete" )

# Plot the dendrogram
#plot(hc1, cex = 0.6, hang = -1)

fviz_nbclust(mat_scaled_frt, FUN = hcut, method = "silhouette")

# Cut tree into 4 groups
sub_grpf <- cutree(hc1f, k = 5)

# Number of members in each cluster
#table(sub_grpf)

cat_labf <- rowAnnotation(clust_lab = sub_grpf,
                          col = list(clust_lab = setNames(cbPalette[2:(max(sub_grpf) + 1)], unique(sub_grpf))),
                          title = NULL,
                          show_annotation_name = FALSE)

frt_lab <- rowAnnotation(Category = fem_md$fem_cat,
                         col = list(Category = setNames(c('yellow', 'cyan', 'hotpink', 'red'),
                                                        unique(na.omit(fem_md$fem_cat)))),
                         frt_labs = anno_mark(at = c(grep('STEP', x = fem_md$fem_cat)),
                                              labels = na.omit(fem_md$Symbol[fem_md$fem_cat == 'STEP']),
                                              labels_gp = gpar(fontsize = 8)),
                         title = NULL,
                         na_col = NA,
                         show_annotation_name = FALSE)

frtha <- HeatmapAnnotation(species = anno_block(gp = gpar(fill = c(v.pal[c(3, 2, 1)])),
                                                labels = c(expression(italic('D. ame')), 
                                                           expression(italic('D. nov')), 
                                                           expression(italic('D. vir'))), 
                                                labels_gp = gpar(col = c('white', 'white', 'black'), 
                                                                 fontsize = 10)))

#pdf('plots/TPP_results/TPP_frt_compheatmap.pdf', height = 8, width = 5)
Heatmap(mat_scaled_frt, 
        col = viridis::inferno(50),
        heatmap_legend_param = list(title = "log2(intensities)",
                                    #title_position = "leftcenter-rot",
                                    direction = "horizontal",
                                    at = c(-1.5, 0, 1.5)),
        right_annotation = frt_lab,
        left_annotation = cat_labf,
        top_annotation = frtha, 
        show_row_names = FALSE, 
        show_column_names = FALSE, 
        row_split = max(sub_grpf),
        column_split = 3,
        column_gap = unit(0, "mm"),
        row_title = NULL,
        column_title = NULL)

#dev.off()

y2 <- fem_md %>% 
  bind_cols(cluster = sub_grpf) %>% 
  dplyr::select(genes, FBgn, 9:11, cluster) %>% 
  pivot_longer(cols = 3:5) %>% 
  mutate(species = str_sub(name, 4, 4))

y2 %>% 
  ggplot(aes(x = name, y = value, fill = name)) +
  geom_boxplot() +
  labs(x = 'Species') +
  scale_fill_viridis_d() +
  facet_wrap(~cluster) +
  ggpubr::stat_compare_means(aes(label = ..p.signif..),
                             comparisons = list(c('D. ame', 'D. nov'),
                                                c('D. ame', 'D. vir'),
                                                c('D. nov', 'D. vir')),
                             method = 't.test') +
  theme_bw() +
  theme(legend.position = '',
        axis.text.x = element_text(face = 'italic')) +
  NULL

# Model fits
y2 %>% group_by(cluster) %>%
  do(fit = broom::glance(lm(value ~ name, data = .))) %>% 
  unnest(fit) %>% 
  mutate(p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3))) %>% 
  dplyr::select(-adj.r.squared, -sigma, -c(df:df.residual)) %>% 
  kable(digits = 3, 
        caption = 'Model fits') %>% 
  kable_styling(full_width = FALSE)

Model fits
cluster	r.squared	statistic	p.value	nobs
1	0.018	14.383	< 0.001	1554
2	0.031	13.262	< 0.001	831
3	0.089	65.843	< 0.001	1344
4	0.028	17.877	< 0.001	1245
5	0.044	19.960	< 0.001	873

# y2 %>% group_by(cluster) %>%
#   do(fit = broom::tidy(lm(value ~ name, data = .))) %>% 
#   unnest(fit) %>% 
#   mutate(p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
#          sigLabel = case_when(p.value < 0.001 ~ "***", 
#                               p.value < 0.01 ~ "**",
#                               p.value < 0.05 ~ "*",
#                               TRUE ~ '')) %>% 
#   kable(digits = 3, 
#         caption = 'Model estimates') %>% 
#   kable_styling(full_width = FALSE)

plyr::ldply(lapply(split(y2, y2$cluster, drop = TRUE), function(x) 
  {broom::tidy(TukeyHSD(aov(value ~ name, data = x)))})) %>% 
  mutate(cluster = recode(.id, 
                          `D. nov-D. ame` = "D. ame vs. D. nov", 
                          `D. vir-D. ame` = "D. ame vs. D. vir", 
                          `D. vir-D. nov` = 'D. nov vs. D. vir'),
         across(4:7, ~ round(.x, 2)),
         Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')'),
         p.val = ifelse(adj.p.value < 0.001, "< 0.001", round(adj.p.value, 3)),
         sigLabel = case_when(adj.p.value < 0.001 ~ "***", 
                              adj.p.value < 0.01 ~ "**",
                              adj.p.value < 0.05 ~ "*",
                              TRUE ~ '')) %>% 
  dplyr::select(cluster, contrast, estimate, p.val, ` ` = sigLabel) %>% 
  kable(digits = 3, 
        caption = 'Post-hoc Tukey tests corrected for multiple testing') %>% 
  kable_styling(full_width = FALSE) %>%
  group_rows("Cluster 1", 1, 3) %>%
  group_rows("Cluster 2 ", 4, 6) %>%
  group_rows("Cluster 3", 7, 9) %>%
  group_rows("Cluster 4", 10, 12)

Post-hoc Tukey tests corrected for multiple testing
cluster	contrast	estimate	p.val
Cluster 1
1	D. nov-D. ame	-0.95	< 0.001	***
1	D. vir-D. ame	-0.97	< 0.001	***
1	D. vir-D. nov	-0.02	0.993
Cluster 2
2	D. nov-D. ame	1.20	< 0.001	***
2	D. vir-D. ame	0.40	0.21
2	D. vir-D. nov	-0.80	0.002	**
Cluster 3
3	D. nov-D. ame	1.64	< 0.001	***
3	D. vir-D. ame	1.81	< 0.001	***
3	D. vir-D. nov	0.17	0.609
Cluster 4
4	D. nov-D. ame	-0.72	0.001	**
4	D. vir-D. ame	0.49	0.044
4	D. vir-D. nov	1.21	< 0.001	***
5	D. nov-D. ame	0.08	0.946
5	D. vir-D. ame	-1.32	< 0.001	***
5	D. vir-D. nov	-1.39	< 0.001	***

#### Differentially abundant proteins only
# fem_da <- fem_md %>% filter(genes %in% unique(fem_TABLES %>% filter(threshold != 'NS') %>% pull(genes)))
# 
# # scale data
# mat_scaled_frt2 <- pheatmap:::scale_rows(fem_da[, 9:11])
# 
# cat_labf2 <- rowAnnotation(Cluster = sub_grpf,
#                            col = list(Cluster = setNames(cbPalette[2:(max(sub_grpf) + 1)], unique(sub_grpf2))),
#                            title = NULL,
#                            show_annotation_name = FALSE)
# 
# frt_lab <- rowAnnotation(Category = fem_da$fem_cat,
#                          col = list(Category = setNames(c('chocolate', 'orange', 'hotpink', 'cyan'),
#                                                         na.omit(unique(fem_da$fem_cat)))),
#                          frt_labs = anno_mark(at = c(grep('STEP', x = fem_da$fem_cat)),
#                                               labels = na.omit(fem_da$Symbol[fem_da$fem_cat == 'STEP']),
#                                               labels_gp = gpar(fontsize = 8)),
#                          title = NULL,
#                          na_col = NA,
#                          show_annotation_name = FALSE)
# 
# #pdf('plots/frt_compheatmap_da.pdf', height = 8, width = 5)
# Heatmap(mat_scaled_frt2, 
#         col = viridis::inferno(100),
#         heatmap_legend_param = list(title = "log2(intensities)",
#                                     title_position = "leftcenter-rot"),
#         right_annotation = frt_lab,
#         left_annotation = cat_labf2,
#         top_annotation = haha, 
#         show_row_names = FALSE, 
#         show_column_names = FALSE, 
#         row_split = max(sub_grpf),
#         column_split = 3,
#         column_gap = unit(0, "mm"),
#         row_title = NULL,
#         column_title = NULL)
# #dev.off()

# y3 <- fem_da %>% 
#   bind_cols(cluster = sub_grpf2) %>% 
#   dplyr::select(genes, FBgn, 9:11, cluster) %>% 
#   pivot_longer(cols = 3:5) %>% 
#   mutate(species = str_sub(name, 4, 4))
# 
# y3 %>% 
#   ggplot(aes(x = factor(name, levels = c('D. vir', 'D. ame', 'D. nov')), y = value, fill = name)) +
#   geom_boxplot() +
#   labs(x = 'Species') +
#   scale_fill_viridis_d() +
#   facet_wrap(~cluster) +
#   ggsignif::geom_signif(comparisons = list(c("D. ame", "D. nov"),
#                                            c("D. ame", "D. vir"),
#                                            c("D. nov", "D. vir")),
#                         map_signif_level = TRUE) +
#   #coord_cartesian(ylim = c(2, 20)) +
#   theme_bw() +
#   theme(legend.position = '',
#         axis.text.x = element_text(face = 'italic')) +
#   NULL
# 
# y3 %>% 
#   mutate(cluster = paste('Cluster', cluster, sep = ' ')) %>% 
#   ggplot(aes(x = factor(name, levels = c('D. vir', 'D. ame', 'D. nov')), y = value, fill = name)) +
#   gghalves::geom_half_point(aes(colour = name)) +
#   gghalves::geom_half_violin(aes(fill = name)) +
#   stat_pointinterval(pch = 21, fill = 'white') +
#   labs(x = 'Species', y = 'normalised abundance') +
#   scale_fill_viridis_d() +
#   scale_colour_viridis_d() +
#   facet_wrap(~cluster, scales = 'free_y') +
#   theme_bw() +
#   theme(legend.position = '',
#         axis.text.x = element_text(face = 'italic')) +
#   ggpubr::stat_compare_means(aes(label = ..p.signif..),
#                              comparisons = list(c('D. ame', 'D. nov'),
#                                                 c('D. ame', 'D. vir'),
#                                                 c('D. nov', 'D. vir')),
#                              method = 't.test') +
#   #ggsave('plots/frt_pointinterval.pdf', height = 6, width = 6, dpi = 600, useDingbats = FALSE) +
#   NULL
# 
# # model fits
# y3 %>% group_by(cluster) %>%
#   do(fit = broom::tidy(lm(value ~ name, data = .))) %>% 
#   unnest(fit) %>% 
#   mutate(p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
#          sigLabel = case_when(p.value < 0.001 ~ "***", 
#                               p.value < 0.01 ~ "**",
#                               p.value < 0.05 ~ "*",
#                               TRUE ~ '')) %>% 
#   kable(digits = 3, 
#         caption = 'Model estimates') %>% 
#   kable_styling(full_width = FALSE)
# 
# # posthoc tests
# plyr::ldply(lapply(split(y3, y3$cluster, drop = T), function(x) 
#   {broom::tidy(TukeyHSD(aov(value ~ name, data = x)))})) %>% 
#   mutate(cluster = recode(.id, 
#                       `D. nov-D. ame` = "D. ame vs. D. nov", 
#                       `D. vir-D. ame` = "D. ame vs. D. vir", 
#                       `D. vir-D. nov` = 'D. nov vs. D. vir'),
#          across(4:7, ~ round(.x, 2)),
#          Estimate = paste0(estimate, ' (', conf.low, '-', conf.high, ')'),
#          p.val = ifelse(adj.p.value < 0.001, "< 0.001", round(adj.p.value, 3)),
#          sigLabel = case_when(adj.p.value < 0.001 ~ "***", 
#                               adj.p.value < 0.01 ~ "**",
#                               adj.p.value < 0.05 ~ "*",
#                               TRUE ~ '')) %>% 
#   dplyr::select(contrast, Estimate, p.val, ` ` = sigLabel) %>% 
#   kable(digits = 3, 
#         caption = 'Post-hoc Tukey tests corrected for multiple testing') %>% 
#   kable_styling(full_width = FALSE) %>%
#   group_rows("Cluster 1", 1, 3) %>%
#   group_rows("Cluster 2 ", 4, 6) %>%
#   group_rows("Cluster 3", 7, 9) %>%
#   group_rows("Cluster 4", 10, 12)

Duplicated genes

All duplicate data

multiDB %>% 
  filter(Orthogroup %in% unique(mdb2[which(duplicated(mdb2$Orthogroup) == TRUE), "Orthogroup"])$Orthogroup)

# A tibble: 669 × 18
   protein_vir Orthogroup      AM1      AM2      AM3    AV1    AV2    NM1    NM2
   <chr>       <chr>         <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 MSTRG.7265  OG0001115   130615.  148100.  105415. 1.54e5 1.44e5 2.13e6 2.61e6
 2 MSTRG.11558 OG0000593   181828.  208559.  165115. 2.20e5 2.29e5 2.13e5 1.90e5
 3 MSTRG.3484  OG0000593   181828.  208559.  165115. 2.20e5 2.29e5 2.13e5 1.90e5
 4 MSTRG.11574 OG0000594   570837.  418525.  508847. 6.20e5 7.62e5 3.69e6 3.86e6
 5 MSTRG.11574 OG0000594   206081.  147123.  158241. 2.35e5 2.35e5 3.69e6 3.86e6
 6 MSTRG.11618 OG0000312  1176062.  834103  1065069. 1.35e6 1.50e6 2.13e6 2.43e6
 7 MSTRG.11629 OG0000312  1176062.  834103  1065069. 1.35e6 1.50e6 2.13e6 2.43e6
 8 MSTRG.11618 OG0000312  2143772. 1408941. 2207962. 2.10e6 2.88e6 2.13e6 2.43e6
 9 MSTRG.11629 OG0000312  2143772. 1408941. 2207962. 2.10e6 2.88e6 2.13e6 2.43e6
10 MSTRG.11709 OG0000174   175350.  110755.  167556. 1.91e5 2.08e5 1.25e5 1.47e5
# … with 659 more rows, and 9 more variables: NM3 <dbl>, NV1 <dbl>, NV2 <dbl>,
#   VM1 <dbl>, VM2 <dbl>, VM3 <dbl>, VV1 <dbl>, VV2 <dbl>, VV3 <dbl>

Mated vs. virgin

# duplicated genes
comb_TABLES_dup <- comb_TABLES %>% filter(Orthogroup %in% ortho_dups$Orthogroup)

comb_TABLES_dup %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value), colour = DA)) +
  geom_point(alpha = .6) +
  scale_colour_manual(values = cbPalette[4:1]) +
  labs(y = '-log10(p-value)') +
  facet_wrap(~species) +
  theme_bw()

n_distinct(comb_TABLES_dup$genes)

[1] 185

Divergence between species - ejaculate candidates

# duplicated genes
mated_TABLES_dup <- mated_TABLES %>% filter(Orthogroup %in% ortho_dups$Orthogroup)

mated_TABLES_dup %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(data = . %>% filter(threshold == 'NS'), alpha = .75, colour = 'grey') +
  geom_point(data = . %>% filter(threshold != 'NS'), aes(colour = DA), alpha = .5) + 
  scale_colour_manual(values = cbPalette[4:1]) +
  labs(y = '-log10(p-value)') +
  facet_wrap(~comparison) +
  theme_bw()

mated_TABLES_dup %>% 
  group_by(Orthogroup, comparison) %>% 
  summarise(N = n(),
            mnlfc = mean(logFC),
            mnexp = mean(AveExpr),
            sdlfc = sd(logFC)) %>% 
  dplyr::mutate(sdlfc = replace_na(sdlfc, 1)) %>% 
  ggplot(aes(x = mnexp, y = mnlfc)) +
  geom_point(aes(colour = N,  size = sdlfc)) +
  facet_wrap(~comparison) + 
  scale_colour_viridis_c() +
  theme_bw() + 
  theme() + 
  NULL

# differentially abundant duplicates
da_dup <- mated_TABLES_dup %>% filter(threshold == 'SD')

upset(fromList(list(aVn = split(da_dup, da_dup$comparison)[[1]][[1]],
                    aVv = split(da_dup, da_dup$comparison)[[2]][[1]],
                    nVv = split(da_dup, da_dup$comparison)[[3]][[1]])))

# 
sfp_dup_dat <- data.frame(dgeSFPvoom$genes,
                          dgeSFPvoom$E) %>% 
  rowwise() %>% 
  mutate(`D. ame` = median(c(!!! rlang::syms(grep('AM', names(.), value = TRUE)))),
         `D. nov` = median(c(!!! rlang::syms(grep('NM', names(.), value = TRUE)))),
         `D. vir` = median(c(!!! rlang::syms(grep('VM', names(.), value = TRUE))))) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  filter(Orthogroup %in% ortho_dups$Orthogroup) %>% 
  arrange(Orthogroup) %>% 
  mutate(da = if_else(Orthogroup %in% da_dup$Orthogroup, "sd", "ns"))

sfp_dup_md <- sfp_dup_dat %>% select(`D. ame`, `D. nov`, `D. vir`,
                                     Orthogroup, da)

sf_dup_scaled <- as.matrix(pheatmap:::scale_rows(sfp_dup_md[, 1:3]))

# heatmap annotations
left_lab <- rowAnnotation(Orthogroup = sfp_dup_dat$Orthogroup,
                          col = list(Orthogroup = setNames(rainbow(n_distinct(sfp_dup_dat$Orthogroup)),
                                                           unique(sfp_dup_dat$Orthogroup))),
                          title = NULL,
                          show_annotation_name = FALSE,
                          show_legend = FALSE)

right_lab <- rowAnnotation(Category = sfp_dup_md$da,
                           col = list(Category = c(sd = "red",
                                                   ns = NA)),
                           title = NULL,
                           show_annotation_name = FALSE)

Heatmap(sf_dup_scaled, 
        col = viridis::inferno(25),
        heatmap_legend_param = list(title = "log2(intensities)",
                                    #title_position = "leftcenter-rot",
                                    direction = "horizontal",
                                    at = c(-1.5, 0, 1.5)),
        #right_annotation = right_lab,
        left_annotation = left_lab,
        top_annotation = sfpha, 
        show_row_names = FALSE, 
        cluster_rows = FALSE,
        #show_column_names = FALSE, 
        #row_split = max(sub_grp),
        column_split = 3,
        column_gap = unit(0, "mm"),
        row_title = NULL,
        column_title = NULL)

sfp_dup_dat %>% 
  pivot_longer(cols = 11:13) %>% 
  distinct(Orthogroup, genes, name, value) %>% 
  #mutate(species = str_sub(name, 1, 1)) %>% 
  ggplot(aes(name, y = value, colour = genes)) +
  geom_line(aes(group = Orthogroup), position = position_dodge(width = .25)) +
  geom_jitter(position = position_jitterdodge(dodge.width = .5, jitter.width = .25), alpha = .5) +
  facet_wrap(~ Orthogroup, scales = "free_y") +
  scale_colour_viridis_d() +
  theme_bw() +
  theme(legend.position = '') + 
  NULL

sfp_dup_dat %>% filter(Orthogroup %in% da_dup$Orthogroup) %>% 
  mutate(da_pattern = case_when(`D. ame` > `D. nov` & `D. ame` > `D. vir` ~ "ame_up",
                                `D. ame` < `D. nov` & `D. nov` > `D. vir` ~ "nov_up",
                                `D. ame` < `D. vir` & `D. nov` < `D. vir` ~ "vir_up")) %>% 
  pivot_longer(cols = 11:13) %>% 
  distinct(Orthogroup, genes, name, value, da_pattern) %>% 
  group_by(Orthogroup, genes, name, da_pattern) %>% 
  summarise(N = n(),
            #mnlfc = mean(logFC),
            mnexp = mean(value),
            sdexp = sd(value)) %>% 
  ggplot(aes(name, y = mnexp)) +
  geom_line(aes(group = Orthogroup), alpha = .5) +
  geom_point(aes(size = N, fill = sdexp), shape = 21) +
  facet_wrap(~ da_pattern) +
  scale_colour_viridis_d(guide = "none") +
  scale_fill_viridis_c() +
  theme_bw() +
  #theme(legend.position = '') + 
  NULL

sfp_orths <- orthogroup_long %>% 
  filter(Orthogroup %in% da_dup$Orthogroup)

sfp_orths %>% map_dbl(n_distinct)

Orthogroup    species    protein 
        31          3        255

da_dup %>% map_dbl(n_distinct)

     genes      logFC    AveExpr          t    P.Value  adj.P.Val          B 
        43        106         76        117        117        113        117 
comparison  threshold       FBgn Orthogroup       sigP      sperm        Sfp 
         3          1         43         31          2          2          2 
        DA 
         2

sfp_orths <- mdb_counts %>% 
  filter(Orthogroup %in% da_dup$Orthogroup)

# number of proteins belonging to an orthogroup - most proteins are single copy
sfp_orths %>% 
  group_by(name, value) %>% 
  count() %>% 
  ggplot(aes(x = factor(value), y = n, fill = name)) + 
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d()

# proteins belonging to orthogroup/families
ame_dups_sfp <- ortho_ame %>% filter(Orthogroup %in% da_dup$Orthogroup) %>% drop_na(Orthogroup)
nov_dups_sfp <- ortho_nov %>% filter(Orthogroup %in% da_dup$Orthogroup) %>% drop_na(Orthogroup)
vir_dups_sfp <- ortho_vir %>% filter(Orthogroup %in% da_dup$Orthogroup) %>% drop_na(Orthogroup)

library(ggalluvial)
bind_rows(ame_dups_sfp %>% mutate(species = "Dame"), 
          nov_dups_sfp %>% mutate(species = "Dnov"), 
          vir_dups_sfp %>% mutate(species = "Dvir")) %>% 
  group_by(species, Orthogroup) %>% 
  count(name = "N") %>% 
  group_by(species, N, Orthogroup) %>% 
  count() %>% 
  mutate(N = fct_reorder(as.character(N), as.integer(N))) %>% 
  ggplot(aes(x = species, stratum = N, alluvium = Orthogroup,
           y = n, label = N)) +
  #geom_flow(aes(fill = N)) +
  geom_alluvium(aes(fill = N)) +
  geom_stratum(aes(fill = N), alpha = .05) + 
  geom_text(stat = "stratum") +
  scale_fill_viridis_d() +
  theme_bw() + 
  theme() +
  NULL

da_dup %>% #filter(Orthogroup == "OG0000006") %>% 
  left_join(wigbySFP, by = 'FBgn') %>%
  drop_na(Symbol)

         genes     logFC   AveExpr         t      P.Value    adj.P.Val
1  MSTRG.11262  6.675042 10.257687  44.14013 3.807778e-13 7.635597e-12
2   MSTRG.9917 -6.205753  3.619289 -34.08147 5.472505e-12 7.446516e-11
3    MSTRG.382  3.771075  9.852041  26.04709 8.633707e-11 7.649866e-10
4   MSTRG.1594  3.771075 10.865206  25.65312 1.008983e-10 8.542719e-10
5   MSTRG.1928  3.771075  9.663222  24.22290 1.812961e-10 1.469656e-09
6   MSTRG.1928  4.153463 10.709003  26.37464 7.597581e-11 5.462109e-10
7    MSTRG.382  3.584724 10.897822  24.94417 1.343601e-10 8.676473e-10
8   MSTRG.1594 -3.033735 10.716738 -20.92741 8.034032e-10 4.193104e-09
9   MSTRG.9917  4.150120  5.892643  16.23410 1.040782e-08 3.776553e-08
10 MSTRG.11262 -1.556136 10.257687 -12.04629 1.987268e-07 5.221717e-07
11 MSTRG.11262 -8.231178 10.257687 -56.73319 2.847237e-14 4.172297e-13
12  MSTRG.1928  3.520604  9.514754  22.72051 3.482615e-10 2.948614e-09
13  MSTRG.1594 -3.223834  9.670957 -22.29797 4.216032e-10 3.250621e-09
14   MSTRG.382  2.951865  9.703573  20.94004 7.984970e-10 5.156396e-09
15  MSTRG.9917  3.540888  3.619289  19.44471 1.691895e-09 8.710971e-09
           B comparison threshold        FBgn Orthogroup sigP sperm Sfp   DA
1  19.989191        aMn        SD FBgn0206811  OG0000550 sigP   not Sfp sigP
2  17.739527        aMn        SD FBgn0207519  OG0000061  not   not Sfp   DA
3  15.445811        aMn        SD FBgn0211755  OG0000070 sigP   not Sfp sigP
4  15.295985        aMn        SD FBgn0211069  OG0000070 sigP   not Sfp sigP
5  14.727926        aMn        SD FBgn0210521  OG0000070 sigP   not Sfp sigP
6  15.560827        aMv        SD FBgn0210521  OG0000070 sigP   not Sfp sigP
7  15.019514        aMv        SD FBgn0211755  OG0000070 sigP   not Sfp sigP
8  13.255783        aMv        SD FBgn0211069  OG0000070 sigP   not Sfp sigP
9  10.695023        aMv        SD FBgn0207519  OG0000061  not   not Sfp   DA
10  7.511216        aMv        SD FBgn0206811  OG0000550 sigP   not Sfp sigP
11 21.901959        nMv        SD FBgn0206811  OG0000550 sigP   not Sfp sigP
12 14.084740        nMv        SD FBgn0210521  OG0000070 sigP   not Sfp sigP
13 13.897513        nMv        SD FBgn0211069  OG0000070 sigP   not Sfp sigP
14 13.261334        nMv        SD FBgn0211755  OG0000070 sigP   not Sfp sigP
15 12.511509        nMv        SD FBgn0207519  OG0000061  not   not Sfp   DA
      FBgn_mel  Symbol        FBtr
1  FBgn0034417 CG15117 FBtr0434777
2  FBgn0050104  NT5E-2 FBtr0236304
3  FBgn0037889  scpr-A FBtr0240603
4  FBgn0037889  scpr-A FBtr0239907
5  FBgn0037889  scpr-A FBtr0239345
6  FBgn0037889  scpr-A FBtr0239345
7  FBgn0037889  scpr-A FBtr0240603
8  FBgn0037889  scpr-A FBtr0239907
9  FBgn0050104  NT5E-2 FBtr0236304
10 FBgn0034417 CG15117 FBtr0434777
11 FBgn0034417 CG15117 FBtr0434777
12 FBgn0037889  scpr-A FBtr0239345
13 FBgn0037889  scpr-A FBtr0239907
14 FBgn0037889  scpr-A FBtr0240603
15 FBgn0050104  NT5E-2 FBtr0236304

da_dup %>% filter(Orthogroup == "OG0000070")

       genes     logFC   AveExpr         t      P.Value    adj.P.Val        B
1  MSTRG.382  3.771075  9.852041  26.04709 8.633707e-11 7.649866e-10 15.44581
2 MSTRG.1594  3.771075 10.865206  25.65312 1.008983e-10 8.542719e-10 15.29599
3 MSTRG.1928  3.771075  9.663222  24.22290 1.812961e-10 1.469656e-09 14.72793
4 MSTRG.1928  4.153463 10.709003  26.37464 7.597581e-11 5.462109e-10 15.56083
5  MSTRG.382  3.584724 10.897822  24.94417 1.343601e-10 8.676473e-10 15.01951
6 MSTRG.1594 -3.033735 10.716738 -20.92741 8.034032e-10 4.193104e-09 13.25578
7 MSTRG.1928  3.520604  9.514754  22.72051 3.482615e-10 2.948614e-09 14.08474
8 MSTRG.1594 -3.223834  9.670957 -22.29797 4.216032e-10 3.250621e-09 13.89751
9  MSTRG.382  2.951865  9.703573  20.94004 7.984970e-10 5.156396e-09 13.26133
  comparison threshold        FBgn Orthogroup sigP sperm Sfp   DA
1        aMn        SD FBgn0211755  OG0000070 sigP   not Sfp sigP
2        aMn        SD FBgn0211069  OG0000070 sigP   not Sfp sigP
3        aMn        SD FBgn0210521  OG0000070 sigP   not Sfp sigP
4        aMv        SD FBgn0210521  OG0000070 sigP   not Sfp sigP
5        aMv        SD FBgn0211755  OG0000070 sigP   not Sfp sigP
6        aMv        SD FBgn0211069  OG0000070 sigP   not Sfp sigP
7        nMv        SD FBgn0210521  OG0000070 sigP   not Sfp sigP
8        nMv        SD FBgn0211069  OG0000070 sigP   not Sfp sigP
9        nMv        SD FBgn0211755  OG0000070 sigP   not Sfp sigP

circos

# library(circlize)
# 
# sfp_dup_dat %>% 
#   group_by(Orthogroup) %>% count()
# 
# 
# og_wide <- og_counts %>% filter(Orthogroup %in% multiDB$Orthogroup) %>% 
#   filter(Orthogroup %in% sfp_dup_dat$Orthogroup) %>%
#   #pivot_wider(names_from = "species", values_from = orth_numb) %>% 
#   pivot_wider(names_from = "species", values_from = Orthogroup)
# 
# 
# og_counts %>% filter(Orthogroup %in% sfp_dup_dat$Orthogroup) %>% 
#   group_by(species, Orthogroup) %>% count() %>% 
#   pivot_wider(names_from = "species", values_from = n)
# 
# ass_counts <- og_wide %>%
#   mutate(#means = rowMeans(.[elected_cols], na.rm = T),
#          an_count_na = 2 - apply(.[c("Dame", "Dnov")], 1, function(x) sum(is.na(x))),
#          av_count_na = 2 - apply(.[c("Dame", "Dvir")], 1, function(x) sum(is.na(x))),
#          nv_count_na = 2 - apply(.[c("Dnov", "Dvir")], 1, function(x) sum(is.na(x))))
# 
# df <- data.frame(from = rep(c("Dame", "Dame", "Dnov"), each = 66), 
#                  to = rep(c("Dnov", "Dvir", "Dvir"), each = 66),
#                  value = c(ass_counts$an_count_na, ass_counts$av_count_na, ass_counts$nv_count_na))
# 
# chordDiagram(df)
# 
# df <- data.frame(from = c("Dame", "Dame", "Dnov"), 
#                  to = c("Dnov", "Dvir", "Dvir"),
#                  value = c(sum(ass_counts$an_count_na == 2, na.rm = TRUE), 
#                            sum(ass_counts$av_count_na == 2, na.rm = TRUE), 
#                            sum(ass_counts$nv_count_na == 2, na.rm = TRUE)))
# 
# chordDiagram(df)
# 
# mat = matrix(1:9, 3)
# rownames(mat) = letters[1:3]
# colnames(mat) = LETTERS[1:3]
# mat
# chordDiagram(mat)
# 
# 
# 
# # Initialize the plot.
# par(mar = c(1, 1, 1, 1) ) 
# circos.initialize(factors = data$factor, x = data$x )
#  
# # Build the regions of track #1
# circos.trackPlotRegion(factors = data$factor, y=data$y , bg.col = rgb(0.1,0.1,seq(0,1,0.1),0.4) , bg.border = NA)
#  
# # Add a link between a point and another
# circos.link("a", 0, "b", 0, h = 0.4)
#  
# # Add a link between a point and a zone
# circos.link("e", 0, "g", c(-1,1), col = "green", lwd = 2, lty = 2, border="black" )
#  
# # Add a link between a zone and another
# circos.link("c", c(-0.5, 0.5), "d", c(-0.5,0.5), col = "red", border = "blue", h = 0.2)
# 
# 
# 
# 
# 
# circos.par("track.height" = 0.1)
# circos.initialize(og_counts$species, x = og_counts$orth_numb)
# 
# circos.track(og_counts$species, y = og_counts$N,
#     panel.fun = function(x, y) {
#         circos.text(CELL_META$xcenter, 
#             CELL_META$cell.ylim[2] + mm_y(5), 
#             CELL_META$sector.index)
#         circos.axis(labels.cex = 0.6)
# })
# 
# col = rep(c("#FF0000", "#00FF00"), 4)
# circos.trackPoints(og_counts$species, og_counts$orth_numb, og_counts$N, col = col, pch = 16, cex = 0.5)
# circos.text(-1, 0.5, "text", sector.index = "a", track.index = 1)
# 
# bgcol = rep(c("#EFEFEF", "#CCCCCC"), 4)
# circos.trackHist(og_counts$species, og_counts$N, bin.size = 0.2, bg.col = bgcol, col = NA)
# 
# circos.track(og_counts$species, og_counts$orth_numb, og_counts$N,
#     panel.fun = function(x, y) {
#         ind = sample(length(x), 10)
#         x2 = x[ind]
#         y2 = y[ind]
#         od = order(x2)
#         circos.lines(x2[od], y2[od])
# })
#

Divergence between species - FRT

# duplicated genes
fem_TABLES_dup <- fem_TABLES %>% filter(Orthogroup %in% ortho_dups$Orthogroup)

fem_TABLES_dup %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(data = . %>% filter(threshold == 'NS'), alpha = .5, colour = 'grey') +
  geom_point(data = . %>% filter(threshold != 'NS'), aes(colour = spec), alpha = .5) + 
  scale_colour_manual(values = cbPalette[5:2]) +
  labs(y = '-log10(p-value)') +
  facet_wrap(~comparison) +
  theme_bw()

# differentially abundant duplicates
da_dup_fem <- fem_TABLES %>% filter(Orthogroup %in% ortho_dups$Orthogroup, threshold == 'SD')

upset(fromList(list(aVn = split(da_dup_fem, da_dup_fem$comparison)[[1]][[1]],
                    aVv = split(da_dup_fem, da_dup_fem$comparison)[[2]][[1]],
                    nVv = split(da_dup_fem, da_dup_fem$comparison)[[3]][[1]])))

frt_dup_dat <- data.frame(dgeFEMvoom$genes,
                          dgeFEMvoom$E) %>% 
  rowwise() %>% 
  mutate(`D. ame` = median(c(!!! rlang::syms(grep('AV', names(.), value = TRUE)))),
         `D. nov` = median(c(!!! rlang::syms(grep('NV', names(.), value = TRUE)))),
         `D. vir` = median(c(!!! rlang::syms(grep('VV', names(.), value = TRUE))))) %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('genes' = 'gene_id')) %>% 
  filter(Orthogroup %in% ortho_dups$Orthogroup) %>% 
  arrange(Orthogroup) %>% 
  mutate(da = if_else(Orthogroup %in% da_dup_fem$Orthogroup, "sd", "ns"))

frt_dup_md <- frt_dup_dat %>% select(`D. ame`, `D. nov`, `D. vir`, Orthogroup, da)

frt_dup_scaled <- as.matrix(pheatmap:::scale_rows(frt_dup_md[, 1:3]))

# heatmap annotations
left_labf <- rowAnnotation(Orthogroup = frt_dup_dat$Orthogroup,
                          col = list(Orthogroup = setNames(rainbow(n_distinct(frt_dup_dat$Orthogroup)),
                                                           unique(frt_dup_dat$Orthogroup))),
                          title = NULL,
                          show_annotation_name = FALSE,
                          show_legend = FALSE)

right_labf <- rowAnnotation(Category = frt_dup_md$da,
                           col = list(Category = c(sd = "red",
                                                   ns = NA)),
                           title = NULL,
                           show_annotation_name = FALSE)

Heatmap(frt_dup_scaled, 
        col = viridis::inferno(50),
        heatmap_legend_param = list(title = "log2(intensities)",
                                    #title_position = "leftcenter-rot",
                                    direction = "horizontal",
                                    at = c(-1.5, 0, 1.5)),
        right_annotation = right_labf,
        left_annotation = left_labf,
        top_annotation = frtha, 
        show_row_names = FALSE, 
        cluster_rows = FALSE,
        #show_column_names = FALSE, 
        #row_split = max(sub_grp),
        column_split = 3,
        column_gap = unit(0, "mm"),
        row_title = NULL,
        column_title = NULL)

frt_dup_dat %>% 
  #filter(Orthogroup %in% da_dup_fem$Orthogroup) %>% 
  pivot_longer(cols = 9:11) %>% 
  distinct(genes, name, value, .keep_all = TRUE) %>% 
  #mutate(species = str_sub(name, 1, 1)) %>% 
  ggplot(aes(name, y = value, colour = genes)) +
  geom_line(aes(group = genes), position = position_dodge(width = .25)) +
  geom_jitter(position = position_jitterdodge(dodge.width = .5, jitter.width = .25), alpha = .5) +
  facet_wrap(~ Orthogroup, scales = "free_y") +
  scale_colour_viridis_d() +
  theme_bw() +
  theme(legend.position = '') + 
  NULL

frt_dup_dat %>% filter(Orthogroup %in% da_dup_fem$Orthogroup) %>% 
  pivot_longer(cols = 9:11) %>% 
  distinct(Orthogroup, genes, name, value) %>% 
  group_by(Orthogroup, genes, name) %>% 
  summarise(N = n(),
            #mnlfc = mean(logFC),
            mnexp = mean(value),
            sdexp = sd(value)) %>% 
  ggplot(aes(name, y = mnexp)) +
  geom_line(aes(group = Orthogroup, colour = Orthogroup)) +
  geom_point(aes(size = N, fill = sdexp), shape = 21) +
  facet_wrap(~ Orthogroup, scales = "free_y") +
  scale_colour_viridis_d(guide = "none") +
  scale_fill_viridis_c() +
  theme_bw() +
  #theme(legend.position = '') + 
  NULL

Evolutionary rates

mdb_id <- multiDB %>% 
  left_join(vir_fbgn %>% select(-FBtr), by = c('protein_vir' = 'gene_id', "Orthogroup")) %>% 
  distinct(protein_vir, .keep_all = TRUE) %>% 
  mutate(sigP = if_else(protein_vir %in% vir_sig$protein, "sigP", "not"))

evo_rates <- kaks_results %>% 
  filter(Ka.Ks < 10 & !str_detect(COMPARISON, 'lum')) %>% 
  distinct(FBgn, COMPARISON, .keep_all = TRUE) %>% 
  mutate(signal_cat = case_when(FBgn %in% ejac_cands_multiDB$FBgn[ejac_cands_multiDB$sigP == "sigP"] ~ 'Ejac.signal',
                                FBgn %in% ejac_cands_multiDB$FBgn ~ 'Ejac',
                                FBgn %in% mdb_id$FBgn[mdb_id$sigP == "sigP"] ~ 'FRT.signal',
                                FBgn %in% mdb_id$FBgn ~ 'FRT',
                                TRUE ~ as.character(NA)),
         STEP = if_else(FBgn %in% STEP$FBgn & signal_cat == 'FRT.signal',
                        'STEP', 'other'),
         dupl = if_else(FBgn %in% da_dup$FBgn, "dupl", 'other'))

Signal peptide rates

evo_rates %>% 
  drop_na(signal_cat) %>% 
  # calculate means and standard errors
  group_by(signal_cat, COMPARISON) %>% 
  summarise(N = n(),
            mn = mean(Ka.Ks),
            se = sd(Ka.Ks)/sqrt(N)) %>% 
  
  # add mean for all genes
  bind_rows(
    data.frame(kaks_results %>% 
                 dplyr::select(COMPARISON, contains('K')) %>% 
                 filter(Ka.Ks < 10 & !str_detect(COMPARISON, 'lum'))) %>% 
      group_by(COMPARISON) %>% 
      summarise(N = n(),
                mn = mean(Ka.Ks),
                se = sd(Ka.Ks)/sqrt(N)) %>% 
      mutate(signal_cat = 'All')
            ) %>% 
  
  mutate(cat2 = str_sub(signal_cat, 1, 3),
         cat3 = if_else(grepl('sig', x = signal_cat) == TRUE, 'Signal pep.', 'Other')) %>% 
  
  # plot
  ggplot(aes(x = factor(cat2, level = c('All', 'FRT', 'Eja')), 
             y = mn, colour = COMPARISON, shape = cat3)) +
  geom_point(size = 8, position = position_dodge(width = .8)) +
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = .5, position = position_dodge(width = .8)) +
  scale_colour_manual(values = cbPalette[-1],
                      labels = c(D.amr_vs_D.nov = 'D. ame vs. D. nov', 
                                 D.amr_vs_D.vir = 'D. ame vs. D. vir', 
                                 D.nov_vs_D.vir = 'D. nov vs. D. vir')) + 
  scale_x_discrete(labels = c('All', 'Female reproductive\ntract', 'Ejaculate')) +
  scale_shape_manual(values = c(16, 18)) +
  labs(x = 'category', y = 'dN/dS') +
  theme_bw() +
  theme(text = element_text(size = 20),
        legend.position = c(0.25, 0.75),
        legend.title = element_blank(),
        legend.text = element_text(face = "italic"),
        legend.background = element_rect(fill = NA),
        axis.title.x = element_blank()) +
  #ggsave('plots/TPP_results/all_KaKs.pdf', height = 9, width = 8.5, dpi = 600, useDingbats = FALSE) +
  NULL

STEP rates

mean_rates <- evo_rates %>% 
  group_by(COMPARISON) %>% 
  summarise(N = n(),
            mn = mean(Ka.Ks),
            se = sd(Ka.Ks)/sqrt(N)) %>% 
  mutate(signal_cat = NA,
         STEP = NA,
         COMPARISON = recode(COMPARISON, 
                             D.amr_vs_D.nov = 'D. ame vs. D. nov', 
                             D.amr_vs_D.vir = 'D. ame vs. D. vir', 
                             D.nov_vs_D.vir = 'D. nov vs. D. vir'))

#plot
evo_rates %>% 
  drop_na(signal_cat) %>% 
  group_by(signal_cat, STEP, COMPARISON) %>% 
  summarise(N = n(),
            mn = mean(Ka.Ks),
            se = sd(Ka.Ks)/sqrt(N)) %>% 
  
  mutate(COMPARISON = recode(COMPARISON, 
                             D.amr_vs_D.nov = 'D. ame vs. D. nov', 
                             D.amr_vs_D.vir = 'D. ame vs. D. vir', 
                             D.nov_vs_D.vir = 'D. nov vs. D. vir')) %>% 
  
  ggplot(aes(x = signal_cat, y = mn, fill = STEP)) +
  geom_hline(data = mean_rates, aes(yintercept = mn), lty = 2, lwd = 1.5) +
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), position = position_dodge(width = .85),
                width = .5) +
  geom_point(position = position_dodge(width = .85), pch = 21, size = 5) +
  scale_x_discrete(labels = c('Ejac.', 'Ejac.\nsig. pep.', 
                              'FRT', 'FRT\nsig. pep.')) +
  scale_fill_manual(values = c('black', 'red')) +
  labs(y = expression(omega)) +
  facet_wrap(~COMPARISON) +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.title = element_blank(),
        strip.text = element_text(size = 15, face = "italic"),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 25),
        axis.text = element_text(size = 15)) +
  geom_text(aes(y = 0.05, label = N), size = 3, position = position_dodge(width = .85)) +
  #ggsave('plots/TPP_results/TPP_STEP_KaKs.pdf', height = 4, width = 9.5, dpi = 600, useDingbats = FALSE) +
  NULL

KW tests

bind_rows(

# Ejac (not signal peptides) vs. genome
evo_rates %>% 
  mutate(ejac = if_else(grepl('Ejac$', x = signal_cat), 'ejac', 'other')) %>% 
  group_by(COMPARISON) %>% 
  # ggplot(aes(x = ejac, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ ejac, data = .))) %>% 
  unnest(fit),

# ejac.signal vs. genome
evo_rates %>% 
  mutate(ejac = if_else(grepl('Ejac.signal', x = signal_cat), 'ejac', 'other')) %>% 
  group_by(COMPARISON) %>% 
  # ggplot(aes(x = ejac, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ ejac, data = .))) %>% 
  unnest(fit),

# ejac.signal vs. ejac.signal + STEP
evo_rates %>% 
  filter(str_detect(signal_cat, 'Ejac.signal')) %>% 
  dplyr::select(-STEP) %>% 
  mutate(STEP = if_else(FBgn %in% STEP$FBgn, 'STEP', 'other')) %>% 
  group_by(COMPARISON) %>% 
  # ggplot(aes(x = STEP, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ STEP, data = .))) %>% 
  unnest(fit),

# ejac/frt w/o signal peptide
evo_rates %>% 
  mutate(sex = case_when(grepl('Ejac', x = signal_cat) ~ 'xEjac',
                         grepl('FRT', x = signal_cat) ~ 'xFRT',
                         TRUE ~ as.character(NA)),
         sigp = if_else(grepl('signal', x = signal_cat), 'sigp', 'other')) %>% 
  filter(str_detect(sex, 'x')) %>% 
  group_by(sex, COMPARISON) %>% 
  # ggplot(aes(x = sex, y = Ka.Ks, colour = sigp)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ sigp, data = .))) %>% 
  unnest(fit),

# frt (not signal peptides) vs. genome
evo_rates %>% 
  mutate(frt = if_else(grepl('FRT$', x = signal_cat), 'frt', 'other')) %>% 
  group_by(COMPARISON) %>% 
  # ggplot(aes(x = frt, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ frt, data = .))) %>% 
  unnest(fit),

# FRT.signal vs. genome
evo_rates %>% 
  mutate(frt = if_else(grepl('FRT.signal', x = signal_cat), 'frt', 'other')) %>% 
  group_by(COMPARISON) %>% 
  # ggplot(aes(x = frt, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ frt, data = .))) %>% 
  unnest(fit),

# FRT STEP + signal vs. FRT + signal
evo_rates %>% 
  filter(str_detect(signal_cat, 'FRT')) %>% 
  # ggplot(aes(x = STEP, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  group_by(COMPARISON) %>% 
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ STEP, data = .))) %>% 
  unnest(fit),

# FRT STEP + signal vs. genome
evo_rates %>% 
  mutate(STEP = case_when(STEP == 'STEP' ~ 'frt', 
                          TRUE ~ 'other')) %>% 
  # ggplot(aes(x = STEP, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  group_by(COMPARISON) %>% 
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ STEP, data = .))) %>% 
  unnest(fit),

# FRT STEP vs. Ejac.signal
evo_rates %>% 
  mutate(sex = case_when(grepl('Ejac.signal', x = signal_cat) ~ 'Ejac.sig',
                         STEP == 'STEP' ~ 'STEP',
                         TRUE ~ as.character(NA))) %>%
  drop_na(sex) %>% 
  group_by(COMPARISON) %>% 
  # ggplot(aes(x = sex, y = Ka.Ks)) +
  # stat_summary() + facet_wrap(~COMPARISON)
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ sex, data = .))) %>% 
  unnest(fit)

) %>% 
  mutate(pval = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
         sigLabel = case_when(p.value < 0.001 ~ "***", 
                              p.value < 0.01 ~ "**",
                              p.value < 0.05 ~ "\\*",
                              TRUE ~ '')) %>% 
  dplyr::select(-c(statistic:sex)) %>% 
  kable(caption = 'Mann-Whitney U tests comparing evolutionary rates between groups') %>% 
  kable_styling(full_width = FALSE) %>%
  group_rows("Ejaculate (excl. sig. peps.) vs. genome", 1, 3) %>%
  group_rows("Ejaculate sig. peps. vs. genome", 4, 6) %>%
  group_rows("Ejaculate sig. peps. vs. Ejaculate sig. peps. + STEP", 7, 9) %>%
  group_rows("Ejaculate with vs. without sig. peps.", 10, 12) %>% 
  group_rows("FRT with vs. without sig. peps.", 13, 15) %>% 
  group_rows("FRT (excl. sig. peps.) vs. genome", 16, 18) %>% 
  group_rows("FRT sig. peps. vs. genome", 19, 21) %>% 
  group_rows("FRT STEP w/ sig. peps. vs. FRT w/ sig. peps.", 22, 24) %>% 
  group_rows("FRT STEP w/ sig. peps. vs. genome", 25, 27) %>% 
  group_rows("FRT STEP vs. Ejac.signal", 28, 30)

Mann-Whitney U tests comparing evolutionary rates between groups
COMPARISON	pval	sigLabel
Ejaculate (excl. sig. peps.) vs. genome
D.amr_vs_D.nov	0.218
D.amr_vs_D.vir	0.266
D.nov_vs_D.vir	0.08
Ejaculate sig. peps. vs. genome
D.amr_vs_D.nov	0.654
D.amr_vs_D.vir	0.037	*
D.nov_vs_D.vir	0.114
Ejaculate sig. peps. vs. Ejaculate sig. peps. + STEP
D.amr_vs_D.nov	0.499
D.amr_vs_D.vir	0.06
D.nov_vs_D.vir	0.139
Ejaculate with vs. without sig. peps.
D.amr_vs_D.nov	0.202
D.amr_vs_D.vir	0.023	*
D.nov_vs_D.vir	0.012	*
FRT with vs. without sig. peps.
D.amr_vs_D.nov	< 0.001	***
D.amr_vs_D.vir	< 0.001	***
D.nov_vs_D.vir	< 0.001	***
FRT (excl. sig. peps.) vs. genome
D.amr_vs_D.nov	< 0.001	***
D.amr_vs_D.vir	< 0.001	***
D.nov_vs_D.vir	< 0.001	***
FRT sig. peps. vs. genome
D.amr_vs_D.nov	< 0.001	***
D.amr_vs_D.vir	< 0.001	***
D.nov_vs_D.vir	< 0.001	***
FRT STEP w/ sig. peps. vs. FRT w/ sig. peps.
D.amr_vs_D.nov	0.008	**
D.amr_vs_D.vir	0.001	**
D.nov_vs_D.vir	0.003	**
FRT STEP w/ sig. peps. vs. genome
D.amr_vs_D.nov	0.926
D.amr_vs_D.vir	0.546
D.nov_vs_D.vir	0.86
FRT STEP vs. Ejac.signal
D.amr_vs_D.nov	0.81
D.amr_vs_D.vir	0.677
D.nov_vs_D.vir	0.558

Database comparisons

Numbers of differentially abundant proteins detected

Here we compare the numbers of differentially abundant proteins detected when using each species databases and the combined database.

Number of ejaculate candidates

Number of proteins more abundant in mated vs. virgin samples

# total number of ejaculate candidates per database
bind_rows(
  ame_DATABLES %>% filter(threshold == 'DA' & str_detect(DA, 'MB')) %>% 
    distinct(genes, .keep_all = TRUE),
  nov_DATABLES %>% filter(threshold == 'DA' & str_detect(DA, 'MB')) %>% 
    distinct(genes, .keep_all = TRUE),
  vir_DATABLES %>% filter(threshold == 'DA' & str_detect(DA, 'MB')) %>% 
    distinct(genes, .keep_all = TRUE),
  comb_TABLES_nd %>% filter(threshold == 'SD' & str_detect(DA, 'MB')) %>% 
    distinct(genes, .keep_all = TRUE) %>% 
    mutate(threshold = recode(threshold, SD = "DA"), DB = 'aa.comb') %>% 
    dplyr::select(species, DB, threshold)) %>% 
  group_by(DB) %>% 
  dplyr::count() %>% 
  left_join(data.frame(DB = c('aa.comb', 'ame.db', 'nov.db', 'vir.db'),
                       total = c(n_distinct(comb_TABLES_nd$genes),
                                 n_distinct(ame_DATABLES$genes),
                                 n_distinct(nov_DATABLES$genes),
                                 n_distinct(vir_DATABLES$genes)))) %>% 
  mutate(prop.ejac = n/total)

# A tibble: 4 × 4
# Groups:   DB [4]
  DB          n total prop.ejac
  <chr>   <int> <int>     <dbl>
1 aa.comb   199  2006    0.0992
2 ame.db    257  2838    0.0906
3 nov.db    262  2966    0.0883
4 vir.db    230  2939    0.0783

ejac_counts <- bind_rows(
  ame_DATABLES %>% filter(threshold == 'DA' & str_detect(DA, 'MB')),
  nov_DATABLES %>% filter(threshold == 'DA' & str_detect(DA, 'MB')),
  vir_DATABLES %>% filter(threshold == 'DA' & str_detect(DA, 'MB')),
  comb_TABLES_nd %>% filter(threshold == 'SD' & str_detect(DA, 'MB')) %>% 
    mutate(threshold = recode(threshold, SD = "DA"), DB = 'aa.comb') %>% 
    dplyr::select(species, DB, threshold)) %>% 
  group_by(species, DB) %>% 
  dplyr::count()

ejac_counts %>% 
  ggplot(aes(x = species, y = n, fill = DB)) +
  geom_col(position = 'dodge') +
  scale_x_discrete(labels = c(expression(italic('D. ame')),
                              expression(italic('D. nov')),
                              expression(italic('D. vir')))) +
  scale_fill_manual(values = c(viridis::magma(n = 4)[3], viridis::viridis(n = 3)),
                    name = "Database:",
                    labels = c('Combined',
                               expression(italic('D. ame')),
                               expression(italic('D. nov')),
                               expression(italic('D. vir')))) +
  labs(y = 'No. proteins') + 
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.background = element_rect(fill = NA),
        axis.title.x = element_blank()) +
  #ggsave('plots/database_comps/TPP_comps/TPP_n_ejac_comb.pdf', height = 3, width = 4, dpi = 600, useDingbats = FALSE) +
  NULL

Differentially abundant ejaculate candidates

Number of proteins detected as differentially abundant in pairwise comparisons between species

ejac_da <- bind_rows(ame_MATED %>% filter(threshold == 'DA') %>% 
                       dplyr::select(comparison, DB, threshold),
                     nov_MATED %>% filter(threshold == 'DA') %>% 
                       dplyr::select(comparison, DB, threshold),
                     vir_MATED %>% filter(threshold == 'DA') %>% 
                       dplyr::select(comparison, DB, threshold),
                     mated_TABLES %>% filter(threshold == 'SD') %>% 
                       mutate(comparison = recode(comparison, 
                                                  aMn = "ame.v.nov", 
                                                  aMv = "ame.v.vir", 
                                                  nMv = 'nov.v.vir'),
                              threshold = recode(threshold, SD = "DA"),
                              DB = 'aa.comb') %>% 
                       dplyr::select(comparison, DB, threshold))

# number of DA ejaculate proteins for each species by database
ejac_da %>% 
  group_by(comparison, DB) %>% dplyr::count() %>% 
  ggplot(aes(x = comparison, y = n, fill = DB)) +
  geom_col(position = 'dodge') +
  scale_x_discrete(labels = c(expression(italic('D. ame vs. D. nov')),
                              expression(italic('D. ame vs. D. vir')),
                              expression(italic('D. nov vs. D. vir')))) +
  scale_fill_manual(values = c(viridis::magma(n = 4)[3], viridis::viridis(n = 3)),
                    name = "Database:",
                    labels = c('Combined',
                               expression(italic('D. ame')),
                               expression(italic('D. nov')),
                               expression(italic('D. vir')))) +
  labs(y = 'No. proteins') + 
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.background = element_rect(fill = NA),
        axis.title.x = element_blank()) +
  #ggsave('plots/database_comps/TPP_comps/TPP_n_DA_comb.pdf', height = 3, width = 4, dpi = 600, useDingbats = FALSE) +
  NULL

Differentially abundant female reproductive tract proteins

Number of proteins detected as differentially abundant in pairwise comparisons between species

frt_da <- bind_rows(ame_VIRGIN %>% filter(threshold == 'DA') %>% 
                      dplyr::select(comparison, DB, threshold),
                    nov_VIRGIN %>% filter(threshold == 'DA') %>% 
                      dplyr::select(comparison, DB, threshold),
                    vir_VIRGIN %>% filter(threshold == 'DA') %>%
                      dplyr::select(comparison, DB, threshold),
                    fem_TABLES %>% filter(threshold == 'SD') %>% 
                      mutate(comparison = recode(comparison, 
                                                 aVn = "ame.v.nov", 
                                                 aVv = "ame.v.vir", 
                                                 nVv = 'nov.v.vir'),
                             threshold = recode(threshold, SD = "DA"),
                             DB = 'aa.comb') %>% 
                      dplyr::select(comparison, DB, threshold))

frt_da %>% 
  group_by(comparison, DB) %>% dplyr::count() %>% 
  ggplot(aes(x = comparison, y = n, fill = DB)) +
  geom_col(position = 'dodge') +
  scale_x_discrete(labels = c(expression(italic('D. ame vs. D. nov')),
                              expression(italic('D. ame vs. D. vir')),
                              expression(italic('D. nov vs. D. vir')))) +
  scale_fill_manual(values = c(viridis::magma(n = 4)[3], viridis::viridis(n = 3)),
                    name = "Database:",
                    labels = c('Combined',
                               expression(italic('D. ame')),
                               expression(italic('D. nov')),
                               expression(italic('D. vir')))) +
  labs(y = 'No. proteins') + 
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.background = element_rect(fill = NA),
        axis.title.x = element_blank()) +
  #ggsave('plots/database_comps/TPP_comps/TPP_n_DAfem_comb.pdf', height = 3, width = 4, dpi = 600, useDingbats = FALSE) +
  NULL

Proportions detected

bind_rows(ejac_counts %>% 
            left_join(data.frame(DB = c('aa.comb', 'ame.db', 'nov.db', 'vir.db'),
                                 total = c(n_distinct(comb_TABLES_nd$genes),
                                           n_distinct(ame_DATABLES$genes),
                                           n_distinct(nov_DATABLES$genes),
                                           n_distinct(vir_DATABLES$genes)))) %>% 
            mutate(set = 'Ejaculate candidates', 
                   comparison = species),
          ejac_da %>% 
            group_by(comparison, DB) %>% dplyr::count() %>% 
            left_join(data.frame(DB = c('aa.comb', 'ame.db', 'nov.db', 'vir.db'),
                                 total = c(n_distinct(mated_TABLES$genes),
                                           n_distinct(ame_MATED$genes),
                                           n_distinct(nov_MATED$genes),
                                           n_distinct(vir_MATED$genes)))) %>% 
            mutate(set = 'Ejaculate divergence'),
          frt_da %>% 
            group_by(comparison, DB) %>% dplyr::count() %>% 
            left_join(data.frame(DB = c('aa.comb', 'ame.db', 'nov.db', 'vir.db'),
                                 total = c(n_distinct(fem_TABLES$genes),
                                           n_distinct(ame_VIRGIN$genes),
                                           n_distinct(nov_VIRGIN$genes),
                                           n_distinct(vir_VIRGIN$genes)))) %>% 
            mutate(set = 'FRT divergence')) %>% 
  mutate(prop.da = n/total) %>% 
  ggplot(aes(x = comparison, y = prop.da, colour = DB)) +
  geom_line(aes(group = DB), lwd = 2) +
  scale_x_discrete(labels = c(expression(italic('D. ame vs. D. nov')),
                              expression(italic('D. ame vs. D. vir')),
                              expression(italic('D. nov vs. D. vir')))) +
  scale_y_continuous(labels = scales::percent) + 
  scale_colour_manual(values = c(viridis::magma(n = 4)[3], viridis::viridis(n = 3)),
                      name = "Database",
                      labels = c('Combined',
                                 expression(italic('D. ame')),
                                 expression(italic('D. nov')),
                                 expression(italic('D. vir')))) +
  labs(y = '% differentially abundant') + 
  facet_wrap(~set, scales = 'free', nrow = 1) +
  theme_bw() +
  theme(#legend.position = 'bottom',
        legend.text.align = 0,
        legend.background = element_rect(fill = NA),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 8),
        strip.text = element_text(size = 15)) +
  #ggsave('plots/database_comps/TPP_comps/TPP_prop_DA.pdf', height = 3.8, width = 13.5, dpi = 600, useDingbats = FALSE) +
  NULL

Abundance correlations

Here we compare the protein abundances when searching against each species proteome. First, we average the abundances for each search by taking the mean across all three replicates of AM (1-3).

all_abund <- mdb2 %>% 
  select(protein_vir, #FBgn, 
         starts_with('AV'), starts_with('AM'), 
         starts_with('NV'), starts_with('NM'), 
         starts_with('VV'), starts_with('VM'))

abund_cor <- bind_rows(
  # correlations between replicates within treatment using species specific DB
  all_abund %>% dplyr::select(-protein_vir) %>% +1 %>% log2 %>% cor(use = "complete.obs") %>% 
  reshape2::melt() %>% 
  mutate(condition1 = str_sub(Var1, 1, 2),
         condition2 = str_sub(Var2, 1, 2),
         sample1 = str_sub(Var1, 1, 3),
         sample2 = str_sub(Var2, 1, 3),
         db1 = str_sub(Var1, 5, 7),
         db2 = str_sub(Var2, 5, 7)) %>% 
  filter(condition1 == condition2) %>% 
  filter(Var1 != Var2) %>% 
  filter(db1 == db2) %>% 
  distinct(value, .keep_all = TRUE) %>% 
  
  # calculate means and standard errors
  group_by(condition1, db1, db2) %>% 
  summarise(N = n(),
            mn = mean(value),
            se = sd(value)/sqrt(N)) %>% 
  filter(str_sub(tolower(condition1), 1, 1) == str_sub(db1, 1, 1)),
  
  # correlations between replicates using alt. species DB
  all_abund %>% dplyr::select(-protein_vir) %>% +1 %>% log2 %>% cor(use = "complete.obs") %>% 
  reshape2::melt() %>% 
  mutate(condition1 = str_sub(Var1, 1, 2),
         condition2 = str_sub(Var2, 1, 2),
         sample1 = str_sub(Var1, 1, 3),
         sample2 = str_sub(Var2, 1, 3),
         db1 = str_sub(Var1, 5, 7),
         db2 = str_sub(Var2, 5, 7)) %>% 
  filter(sample1 == sample2) %>% 
  filter(db1 != db2) %>% 
  distinct(Var1, Var2, .keep_all = TRUE) %>% 
  
  # calculate means and standard errors
  group_by(condition1, db1, db2) %>% 
  summarise(N = n(),
            mn = mean(value),
            se = sd(value)/sqrt(N)) %>% 
  filter(str_sub(tolower(condition1), 1, 1) == str_sub(db1, 1, 1))
  )

abund_cor$NAMEX <- c('AMA', 'AVA', 'NMN', 'NVN', 'VMV', 'VVV', 
                     'AMN', 'AMV', 'AVN', 'AVV', 'NMA', 'NMV', 
                     'NVA', 'NVV', 'VMA', 'VMN', 'VVA', 'VVN')

# plot
abund_cor %>% 
  ggplot(aes(x = condition1, y = mn, colour = db2)) +
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = .5) +
  geom_point(size = 5)  +
  labs(y = 'Pearson\'s correlation') +
  scale_colour_viridis_d(name = "Database:",
                         labels = c(expression(italic('D. ame')),
                                    expression(italic('D. nov')),
                                    expression(italic('D. vir')))) +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.text.align = 0,
        axis.title.x = element_blank()) +
  #ggsave('plots/database_comps/TPP_comps/TPP_abun_correlations.pdf', height = 4.25, width = 4, dpi = 600, useDingbats = FALSE) +
  NULL

Coefficients of variation

We calculate the coefficient of variation for each treatment (mated or virgin within each species); (1) between biological replicates within each treatment using each species database as a measure of precision and (2) between each sample using each database as a measure of variation between databases.

# function to calculate CV
cv <- function(x) sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)

# combine data across individual databases
all_vals <- bind_rows(tmt_ame_prot %>% 
                        dplyr::select(protein, AM1:VV3) %>% 
                        rename_with(.cols = 2:17, function(x) {paste0('ame.', x)}) %>% 
                        pivot_longer(2:17),
                      tmt_nov_prot %>% 
                        dplyr::select(protein, AM1:VV3) %>% 
                        rename_with(.cols = 2:17, function(x) {paste0('nov.', x)}) %>% 
                        pivot_longer(2:17),
                      tmt_vir_prot %>% 
                        dplyr::select(protein, AM1:VV3) %>% 
                        rename_with(.cols = 2:17, function(x) {paste0('vir.', x)}) %>% 
                        pivot_longer(2:17)) %>% 
  mutate(log_val = log10(value + 1),
         db = str_sub(name, 1, 3),
         condition = str_sub(name, 5, 6),
         .sample = str_sub(name, 5, 7))

# CV for entire experiment using each database
all_vals %>% 
  group_by(protein, db) %>% 
  summarise(CV = cv(log_val)) %>% 
  group_by(db) %>% 
  summarise(N = n(),
            md = median(CV, na.rm = TRUE),
            mn = mean(CV, na.rm = TRUE),
            se = sd(CV, na.rm = TRUE)/sqrt(N)) %>% 
  ggplot(aes(x = db, y = mn, fill = db)) + 
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = 0) +
  geom_point(pch = 21, size = 5) + 
  scale_fill_viridis_d() + 
  scale_x_discrete(labels = c('ame' = expression(italic('D. ame')), 
                              'nov' = expression(italic('D. nov')),
                              'vir' = expression(italic('D. vir')))) +
  labs(x = 'Database', y = 'CV') +
  theme_bw() + 
  theme(legend.position = '') + 
  NULL

# compare CVs within each condition using each database
all_vals %>% 
  group_by(protein, db, condition) %>% 
  summarise(CV = cv(log_val)) %>% 
  group_by(db, condition) %>% 
  summarise(N = n(),
            md = median(CV, na.rm = TRUE),
            mn = mean(CV, na.rm = TRUE),
            se = sd(CV, na.rm = TRUE)/sqrt(N)) %>% 
  ggplot(aes(x = condition, y = mn, fill = db)) + 
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = 0,
                position = position_dodge(width = .5)) +
  geom_point(pch = 21, position = position_dodge(width = .5)) + 
  scale_fill_viridis_d(name = "Database:",
                       labels = c('ame' = expression(italic('D. ame')), 
                                  'nov' = expression(italic('D. nov')),
                                  'vir' = expression(italic('D. vir')))) + 
  labs(y = 'CV') +
  theme_bw() + 
  theme(legend.position = 'bottom') + 
  #ggsave('plots/database_comps/TPP_comps/TPP_CV_within_treatment_db.pdf', height = 8, width = 12, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

# compare samples across different databases using the combined database to allow comparison between the same proteins across databases. 
all_abund %>% 
  pivot_longer(2:49) %>% 
  mutate(log_val = log10(value),
         db = str_sub(name, 5, 7),
         condition = str_sub(name, 1, 2),
         .sample = str_sub(name, 1, 3)) %>% 
  group_by(protein_vir, .sample) %>% 
  summarise(CV = cv(log_val)) %>% 
  group_by(.sample) %>% 
  summarise(N = n(),
            md = median(CV, na.rm = TRUE),
            mn = mean(CV, na.rm = TRUE),
            se = sd(CV, na.rm = TRUE)/sqrt(N)) %>% 
  ggplot(aes(x = .sample, y = mn)) + 
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = 0) +
  geom_point(size = 5) + 
  labs(y = 'CV') +
  theme_bw() + 
  theme(legend.position = 'bottom') + 
  #ggsave('plots/database_comps/TPP_comps/TPP_CV_btw_sample_db.pdf', height = 5, width = 15, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

Test differences in abundances between databases

We compared differences in abundance between databases for each species/mating combination using edgeR. Significant differences in abundance indicate that using each species data provides different estimates of protein abundance.

Heatmap

abun_mn <- all_abund %>% 
  mutate(mn = rowMeans(dplyr::select(., -c(1)), na.rm = TRUE)) %>% 
  filter(mn > 0) %>% 
  mutate(AMA = rowMeans(dplyr::select(., matches("AM._ame")), na.rm = TRUE),
         AVA = rowMeans(dplyr::select(., matches("AV._ame")), na.rm = TRUE),
         AMN = rowMeans(dplyr::select(., matches("AM._nov")), na.rm = TRUE),
         AVN = rowMeans(dplyr::select(., matches("AV._nov")), na.rm = TRUE),
         AMV = rowMeans(dplyr::select(., matches("AM._vir")), na.rm = TRUE),
         AVV = rowMeans(dplyr::select(., matches("AV._vir")), na.rm = TRUE),
         
         NMA = rowMeans(dplyr::select(., matches("NM._ame")), na.rm = TRUE),
         NVA = rowMeans(dplyr::select(., matches("NV._ame")), na.rm = TRUE),
         NMN = rowMeans(dplyr::select(., matches("NM._nov")), na.rm = TRUE),
         NVN = rowMeans(dplyr::select(., matches("NV._nov")), na.rm = TRUE),
         NMV = rowMeans(dplyr::select(., matches("NM._vir")), na.rm = TRUE),
         NVV = rowMeans(dplyr::select(., matches("NV._vir")), na.rm = TRUE),
         
         VMA = rowMeans(dplyr::select(., matches("VM._ame")), na.rm = TRUE),
         VVA = rowMeans(dplyr::select(., matches("VV._ame")), na.rm = TRUE),
         VMN = rowMeans(dplyr::select(., matches("VM._nov")), na.rm = TRUE),
         VVN = rowMeans(dplyr::select(., matches("VV._nov")), na.rm = TRUE),
         VMV = rowMeans(dplyr::select(., matches("VM._vir")), na.rm = TRUE),
         VVV = rowMeans(dplyr::select(., matches("VV._vir")), na.rm = TRUE)) %>% 
  dplyr::select(protein_vir, AMA:VVV) %>% 
  drop_na(AMA:VVV)

# scale data
scaled_comp <- as.matrix(pheatmap:::scale_rows(log2(abun_mn[, -1] + 1)))

colnames(scaled_comp) <- colnames(abun_mn[, -1])

sp_anno <- str_sub(colnames(abun_mn[, -1]), 1, 1)
mt_anno <- str_sub(colnames(abun_mn[, -1]), 2, 2)
db_anno <- str_sub(colnames(abun_mn[, -1]), 3, 3)

# get column names for ordering
v <- colnames(scaled_comp)

ha <- HeatmapAnnotation(
  species = str_sub(colnames(abun_mn[, -1]), 1, 1), 
  mating = str_sub(colnames(abun_mn[, -1]), 2, 2),
  database = str_sub(colnames(abun_mn[, -1]), 3, 3),
  col = list(species = c('V' = v.pal[1], 
                         'N' = v.pal[2], 
                         'A' = v.pal[3]),
             mating = c('M' = viridis::magma(n = 4)[3], 
                        'V' = viridis::magma(n = 4)[2]), 
             database = c('V' = v.pal[1], 
                          'N' = v.pal[2], 
                          'A' = v.pal[3])
             )
  # cors = anno_points(
  #   abund_cor[match(v[order(substr(v, start = 1,
  #                                  stop = max(nchar(v))))], abund_cor$NAMEX), 'mn'])
  )

#pdf('plots/database_comps/TPP_comps/TPP_db_comp_compheatmap.pdf', height = 8, width = 5)
Heatmap(scaled_comp, 
        col = RColorBrewer::brewer.pal(name = 'Spectral', n = 10),
        heatmap_legend_param = list(title = "log2(intensities)",
                                    title_position = "leftcenter-rot"),
        column_order = v[order(substr(v, start = 1, stop = max(nchar(v))))],
        #right_annotation = frt_labs,
        #left_annotation = frt_lab,
        top_annotation = ha, 
        show_row_names = FALSE, 
        show_column_names = FALSE, 
        column_gap = unit(0, "mm"),
        row_title = NULL,
        column_title = NULL)

#dev.off()

Differential abundance analysis

# subset data for each condition using all databases
am_dat <- all_abund %>% dplyr::select(protein_vir, starts_with('AM')) %>% 
  mutate(across(2:10, ~replace_na(.x, 0)))
av_dat <- all_abund %>% dplyr::select(protein_vir, starts_with('AV')) %>% 
  mutate(across(2:7, ~replace_na(.x, 0)))

nm_dat <- all_abund %>% dplyr::select(protein_vir, starts_with('NM')) %>% 
  mutate(across(2:10, ~replace_na(.x, 0)))
nv_dat <- all_abund %>% dplyr::select(protein_vir, starts_with('NV')) %>% 
  mutate(across(2:7, ~replace_na(.x, 0)))

vm_dat <- all_abund %>% dplyr::select(protein_vir, starts_with('VM')) %>% 
  mutate(across(2:10, ~replace_na(.x, 0)))
vv_dat <- all_abund %>% dplyr::select(protein_vir, starts_with('VV')) %>% 
  mutate(across(2:10, ~replace_na(.x, 0)))

# function to perform differential abundance analysis between databases for the 
# same samples

do_limma <- function(df) {
  
  expr_dat = df[, -1]
  
  # get sample info
  sampInfo = data.frame(database = str_sub(colnames(expr_dat), 5, 7),
                        Replicate = str_sub(colnames(expr_dat), 3, 3))

  # make design matrix to test diffs between groups
  design = model.matrix(~0 + sampInfo$database)
  colnames(design) = unique(sampInfo$database)
  rownames(design) = sampInfo$Replicate

  # create DGElist and fit model 
  dgeList = DGEList(counts = expr_dat, genes = df$protein_vir, group = sampInfo$database)
  dgeList = calcNormFactors(dgeList, method = 'TMM')
  dgeList = estimateCommonDisp(dgeList)
  dgeList = estimateTagwiseDisp(dgeList)

  # make contrasts - higher values = higher alphabetically
  cont.matrix = makeContrasts(a.vs.n = ame - nov,
                              a.vs.v = ame - vir,
                              n.vs.v = nov - vir,
                              levels = design)

  # voom normalisation
  dgeListV = voom(dgeList, design, plot = FALSE)

  # fit linear model
  lm_fit <- lmFit(dgeListV, design = design)

  # compare differential abundance between databases 
  # ame vs. nov
  lm_an = contrasts.fit(lm_fit, cont.matrix[,"a.vs.n"])
  lm_an = eBayes(lm_an)
  lm_an.tTags.table = topTable(lm_an, adjust.method = "BH", number = Inf) %>% 
    mutate(comparison = "a.vs.n",
           threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

  # ame vs. vir
  lm_av <- contrasts.fit(lm_fit, cont.matrix[,"a.vs.v"])
  lm_av <- eBayes(lm_av)
  lm_av.tTags.table <- topTable(lm_av, adjust.method = "BH", number = Inf) %>% 
    mutate(comparison = "a.vs.v",
           threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

  # nov vs. vir
  lm_nv <- contrasts.fit(lm_fit, cont.matrix[,"n.vs.v"])
  lm_nv <- eBayes(lm_nv)
  lm_nv.tTags.table <- topTable(lm_nv, adjust.method = "BH", number = Inf) %>% 
    mutate(comparison = "n.vs.v",
           threshold = if_else(adj.P.Val < 0.05 & abs(logFC) > 1, "SD", "NS"))

  # combine results
  return(rbind(lm_an.tTags.table, lm_av.tTags.table, lm_nv.tTags.table))
  
}

# do tests
combi_2 <- bind_rows(do_limma(am_dat) %>% mutate(dat = 'AM'),
                     do_limma(av_dat) %>% mutate(dat = 'AV'),
                     do_limma(nm_dat) %>% mutate(dat = 'NM'),
                     do_limma(nv_dat) %>% mutate(dat = 'NV'),
                     do_limma(vm_dat) %>% mutate(dat = 'VM'),
                     do_limma(vv_dat) %>% mutate(dat = 'VV')) %>% 
  left_join(vir_fbgn, by = c('genes' = 'gene_id')) %>% 
  mutate(comparison = recode(comparison, 
                             a.vs.n = "D.amr_vs_D.nov", 
                             a.vs.v = "D.amr_vs_D.vir", 
                             n.vs.v = 'D.nov_vs_D.vir')) %>% 
  # add evolutionary rates
  left_join(kaks_results %>% dplyr::select(Ka:COMPARISON, FBgn), 
            by = c('FBgn', c('comparison' = 'COMPARISON'))) %>% 
  distinct(FBgn, comparison, dat, .keep_all = TRUE)

# volcanos
combi_2 %>% 
  ggplot(aes(x = logFC, y = -log10(P.Value), colour = threshold)) +
  geom_point(alpha = .5) + 
  labs(y = '-log10(p-value)') +
  facet_grid(dat~comparison) +
  theme_bw() +
  theme(legend.position = '') +
  NULL

# # numbers differentially abundant
# combi_2 %>% 
#   group_by(dat, comparison) %>% 
#   count(threshold) %>% 
#   ggplot(aes(x = comparison, y = n, fill = threshold)) +
#   geom_col() +
#   facet_wrap(~dat)
# 
# # percent different in estimate
# combi_2 %>% 
#   group_by(dat, comparison) %>% 
#   count(threshold) %>% 
#   group_by(dat, comparison) %>% 
#   summarise(p.diff = 100 * n[threshold == 'SD']/(n[threshold == 'SD'] + n[threshold == 'NS'])) %>% 
#   summary

Evolutionary rates

We compare dN/dS for proteins that showed differences in abundance when using each species database. We expect faster evolving proteins will be more likely differentially abundant between databases if amino acid substitutions lead to less accurate peptide assignment/abundance estimates.

combi_2 %>% filter(Ka.Ks < 10) %>% 
  mutate(condition = str_sub(dat, 2, 2)) %>% 
  group_by(comparison, threshold, condition) %>% 
  summarise(N = n(),
            mn = mean(Ka.Ks),
            se = sd(Ka.Ks)/sqrt(N)) %>% 
  ggplot(aes(x = comparison, y = mn, fill = threshold, shape = condition)) + 
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = 0,
                position = position_dodge(width = .5)) +
  geom_point(size = 3, position = position_dodge(width = .5)) +
  labs(x = 'Database comparison', y = expression(omega)) +
  scale_shape_manual(values = c(21:22),
                     labels = c('M' = 'Mated',
                                'V' = 'Virgin')) +
  scale_fill_manual(values = cbPalette[-1]) +
  scale_x_discrete(labels = 
                     c('D.amr_vs_D.nov' = 
                         expression(paste(italic('D. ame'), ' vs. ', italic('D. nov'))), 
                       'D.amr_vs_D.vir' = 
                         expression(paste(italic('D. ame'), ' vs. ', italic('D. vir'))),
                       'D.nov_vs_D.vir' = 
                         expression(paste(italic('D. nov'), ' vs. ', italic('D. vir'))))) +
  guides(fill = guide_legend(override.aes = list(shape = 21))) + 
  theme_bw() +
  theme(#legend.position = 'bottom',
        legend.title = element_blank(),
        axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.y = element_text(size = 25)) +
  #ggsave('plots/database_comps/TPP_comps/TPP_db_comp_rates.pdf', height = 9, width = 11, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

# perform KW tests
kw_rates <- combi_2 %>% filter(Ka.Ks < 10) %>% 
  group_by(dat, comparison) %>%
  do(fit = broom::tidy(wilcox.test(Ka.Ks ~ threshold, data = .))) %>% 
  unnest(fit)

kw_rates$FDR <- p.adjust(kw_rates$p.value, method = 'fdr') 

kw_rates %>% 
  mutate(FDR = ifelse(FDR < 0.001, '< 0.001', round(FDR, 3)),
         sigLabel = case_when(FDR < 0.001 ~ "***", 
                              FDR < 0.01 ~ "**",
                              FDR < 0.05 ~ "\\*",
                              TRUE ~ '')) %>% 
  dplyr::select(-c(p.value:alternative)) %>% 
  kable(digits = 3, 
        caption = 'Wilcoxon rank sum test with continuity correction') %>% 
  kable_styling(full_width = FALSE)

Wilcoxon rank sum test with continuity correction
dat	comparison	statistic	FDR	sigLabel
AM	D.amr_vs_D.nov	734584.5	< 0.001	***
AM	D.amr_vs_D.vir	767478.0	< 0.001	***
AM	D.nov_vs_D.vir	530712.0	0.017	*
AV	D.amr_vs_D.nov	751014.5	< 0.001	***
AV	D.amr_vs_D.vir	762942.0	< 0.001	***
AV	D.nov_vs_D.vir	514985.5	0.005	**
NM	D.amr_vs_D.nov	745570.5	< 0.001	***
NM	D.amr_vs_D.vir	752606.5	< 0.001	***
NM	D.nov_vs_D.vir	514221.0	0.004	**
NV	D.amr_vs_D.nov	748220.5	< 0.001	***
NV	D.amr_vs_D.vir	756255.0	< 0.001	***
NV	D.nov_vs_D.vir	517407.0	0.007	**
VM	D.amr_vs_D.nov	739669.0	< 0.001	***
VM	D.amr_vs_D.vir	768722.5	< 0.001	***
VM	D.nov_vs_D.vir	518474.0	0.019	*
VV	D.amr_vs_D.nov	751389.5	< 0.001	***
VV	D.amr_vs_D.vir	762515.5	< 0.001	***
VV	D.nov_vs_D.vir	491083.5	0.013	*

Similarly, we expect proteins without orthologs to be less abundant.

all_dat <- bind_rows(
  tmt_ame_prot %>% 
    left_join(ame_fbgn %>% dplyr::select(gene_id, Orthogroup),
              by = c('protein' = 'gene_id')) %>%
    mutate(AM = rowMeans(select(., starts_with('AM')), na.rm = TRUE),
           AV = rowMeans(select(., starts_with('AV')), na.rm = TRUE),
           NM = rowMeans(select(., starts_with('NM')), na.rm = TRUE),
           NV = rowMeans(select(., starts_with('NV')), na.rm = TRUE),
           VM = rowMeans(select(., starts_with('VM')), na.rm = TRUE),
           VV = rowMeans(select(., starts_with('VV')), na.rm = TRUE),
           ave_abundance = rowMeans(select(., AM1:VV3), na.rm = TRUE)) %>% 
    mutate(DB = 'ame'),
  tmt_nov_prot %>% 
    left_join(ame_fbgn %>% dplyr::select(gene_id, Orthogroup),
              by = c('protein' = 'gene_id')) %>% 
    mutate(AM = rowMeans(select(., starts_with('AM')), na.rm = TRUE),
           AV = rowMeans(select(., starts_with('AV')), na.rm = TRUE),
           NM = rowMeans(select(., starts_with('NM')), na.rm = TRUE),
           NV = rowMeans(select(., starts_with('NV')), na.rm = TRUE),
           VM = rowMeans(select(., starts_with('VM')), na.rm = TRUE),
           VV = rowMeans(select(., starts_with('VV')), na.rm = TRUE),
           ave_abundance = rowMeans(select(., AM1:VV3), na.rm = TRUE)) %>% 
    mutate(DB = 'nov'),
  tmt_vir_prot %>% 
    left_join(ame_fbgn %>% dplyr::select(gene_id, Orthogroup),
              by = c('protein' = 'gene_id')) %>% 
    mutate(AM = rowMeans(select(., starts_with('AM')), na.rm = TRUE),
           AV = rowMeans(select(., starts_with('AV')), na.rm = TRUE),
           NM = rowMeans(select(., starts_with('NM')), na.rm = TRUE),
           NV = rowMeans(select(., starts_with('NV')), na.rm = TRUE),
           VM = rowMeans(select(., starts_with('VM')), na.rm = TRUE),
           VV = rowMeans(select(., starts_with('VV')), na.rm = TRUE),
           ave_abundance = rowMeans(select(., AM1:VV3), na.rm = TRUE)) %>% 
    mutate(DB = 'vir')) %>% 
  mutate(ortholog = if_else(is.na(Orthogroup), 'no', 'yes'))

# # check average expression is similar between edgeR/manual calculation
# ame_MATED %>%
#   left_join(all_dat, by = c('genes' = 'protein')) %>%
#   drop_na(ave_abundance) %>%
#   ggplot(aes(x = AveExpr, y = log2(ave_abundance), colour = ortholog)) +
#   geom_point()

all_dat %>% 
  ggplot(aes(x = DB, y = log2(ave_abundance), fill = ortholog)) + 
  #stat_summary(position = position_dodge(width = .5), size = 2, pch = 21) +
  geom_boxplot(notch = TRUE) +
  labs(x = 'Database', y = 'log2(abundance)') +
  scale_x_discrete(labels = c('ame' = expression(italic('D. ame')), 
                              'nov' = expression(italic('D. nov')),
                              'vir' = expression(italic('D. vir')))) + 
  theme_bw() +
  theme(legend.position = 'bottom') +
  ggpubr::stat_compare_means(aes(label = ..p.signif..),
                             # comparisons = list(c('D..ame', 'D..nov'),
                             #                    c('D..ame', 'D..vir'),
                             #                    c('D..nov', 'D..vir')),
                             method = 't.test') +
  #ggsave('plots/database_comps/TPP_comps/TPP_orth_abun.pdf', height = 9, width = 8.8, units = 'cm', dpi = 600, useDingbats = FALSE) +
  NULL

# perform KW tests
all_dat %>% 
  group_by(DB) %>%
  do(fit = broom::tidy(wilcox.test(ave_abundance ~ ortholog, data = .))) %>% 
  unnest(fit) %>% 
  mutate(p.value = ifelse(p.value < 0.001, '< 0.001', round(p.value, 3)),
         sigLabel = case_when(p.value < 0.001 ~ "***", 
                              p.value < 0.01 ~ "**",
                              p.value < 0.05 ~ "*",
                              TRUE ~ '')) %>% 
  dplyr::select(-c(method:alternative)) %>% 
  kable(digits = 3, 
        caption = 'Wilcoxon rank sum test with continuity correction') %>% 
  kable_styling(full_width = FALSE)

Wilcoxon rank sum test with continuity correction
DB	statistic	p.value	sigLabel
ame	1113358	0.003	**
nov	1469558	0.024
vir	1506247	0.622

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] kableExtra_1.3.4      edgeR_3.36.0          limma_3.50.3         
 [4] eulerr_6.1.1          RColorBrewer_1.1-3    UpSetR_1.4.0         
 [7] factoextra_1.0.7      cluster_2.1.3         ggalluvial_0.12.3    
[10] ComplexHeatmap_2.11.1 ggrepel_0.9.1         tidybayes_3.0.2      
[13] forcats_0.5.1         stringr_1.4.0         dplyr_1.0.9          
[16] purrr_0.3.4           readr_2.1.2           tidyr_1.2.0          
[19] tibble_3.1.7          ggplot2_3.3.6         tidyverse_1.3.1      
[22] workflowr_1.7.0      

loaded via a namespace (and not attached):
  [1] readxl_1.4.0         backports_1.4.1      circlize_0.4.15     
  [4] systemfonts_1.0.4    plyr_1.8.7           polylabelr_0.2.0    
  [7] svUnit_1.0.6         digest_0.6.29        foreach_1.5.2       
 [10] htmltools_0.5.2      viridis_0.6.2        fansi_1.0.3         
 [13] magrittr_2.0.3       checkmate_2.1.0      doParallel_1.0.17   
 [16] tzdb_0.3.0           modelr_0.1.8         matrixStats_0.62.0  
 [19] vroom_1.5.7          svglite_2.1.0        colorspace_2.0-3    
 [22] rvest_1.0.2          ggdist_3.1.1         haven_2.5.0         
 [25] xfun_0.31            callr_3.7.1          crayon_1.5.1        
 [28] jsonlite_1.8.0       iterators_1.0.14     glue_1.6.2          
 [31] polyclip_1.10-0      gtable_0.3.0         webshot_0.5.3       
 [34] GetoptLong_1.0.5     distributional_0.3.0 car_3.1-0           
 [37] shape_1.4.6          BiocGenerics_0.40.0  abind_1.4-5         
 [40] scales_1.2.0         pheatmap_1.0.12      DBI_1.1.3           
 [43] rstatix_0.7.0        Rcpp_1.0.9           viridisLite_0.4.0   
 [46] clue_0.3-61          bit_4.0.4            stats4_4.1.3        
 [49] MetBrewer_0.2.0      httr_1.4.3           arrayhelpers_1.1-0  
 [52] posterior_1.2.2      ellipsis_0.3.2       pkgconfig_2.0.3     
 [55] farver_2.1.1         sass_0.4.1           dbplyr_2.2.1        
 [58] locfit_1.5-9.6       utf8_1.2.2           reshape2_1.4.4      
 [61] tidyselect_1.1.2     labeling_0.4.2       rlang_1.0.4         
 [64] later_1.3.0          munsell_0.5.0        cellranger_1.1.0    
 [67] tools_4.1.3          cli_3.3.0            generics_0.1.3      
 [70] broom_1.0.0          evaluate_0.15        fastmap_1.1.0       
 [73] yaml_2.3.5           processx_3.7.0       knitr_1.39          
 [76] bit64_4.0.5          fs_1.5.2             whisker_0.4         
 [79] xml2_1.3.3           compiler_4.1.3       rstudioapi_0.13     
 [82] png_0.1-7            ggsignif_0.6.3       reprex_2.0.1        
 [85] bslib_0.3.1          stringi_1.7.8        highr_0.9           
 [88] ps_1.7.1             lattice_0.20-45      Matrix_1.4-1        
 [91] tensorA_0.36.2       vctrs_0.4.1          pillar_1.7.0        
 [94] lifecycle_1.0.1      jquerylib_0.1.4      GlobalOptions_0.1.2 
 [97] httpuv_1.6.5         R6_2.5.1             promises_1.2.0.1    
[100] gridExtra_2.3        IRanges_2.28.0       codetools_0.2-18    
[103] assertthat_0.2.1     rprojroot_2.0.3      rjson_0.2.21        
[106] withr_2.5.0          S4Vectors_0.32.4     parallel_4.1.3      
[109] hms_1.1.1            gghalves_0.1.3       coda_0.19-4         
[112] rmarkdown_2.14       carData_3.0-5        Cairo_1.6-0         
[115] ggpubr_0.4.0         git2r_0.30.1         getPass_0.2-2       
[118] lubridate_1.8.0

TPP-analysis

MartinGarlovsky

2022-10-25

Introduction

Load packages

Load data

Load peptide intensities and calculate protein abudnances

Load Orthofinder results

Load signal peptides predictions

Load other data

get gene ids and orthologs

Differential abundance analysis

Single species database analysis

Identifying ejaculate candidates

Diagnostic plots

ame.db

nov.db

vir.db

compare DA between mated/virgin

Volcano plots

Plot species specific database

Overlap in ‘Sfps’ between species

ame db

nov db

vir db

Divergence between ejaculate candidates

Compare results using each species database

Volcano plots all vs. all

Numbers of differentially abundant proteins more highly abundant in one species vs. the other

Divergence between (virgin) female reproductive tracts

Multi-database analysis

Compiling the species specific database

Differential abundance analysis

Heatmap

All replicates

Sample medians

Diagnostic plots

Principal component analysis

Identifying ejaculate candidates

Volcano plot

Overlap of ejaculate candidates between species

Sfp overlap with previous studies

overlap with Ahmed-Braimah et al. 2017

Overlap with female reproductive tract genes

Detecting artifactual differences

Abundance of ejaculate proteins compared to remaining FRT proteins

Abundance of serine-type endopeptidases

Divergence between ejaculate candidates

MA plots

Overlap between differentially abundant proteins

Are signal peps more likely to be differentially abundant?

Heatmap

Model fits

Divergence between (virgin) reproductive tracts

Volcano plots

Are signal peps more likely to be DA?

Heatmap

All proteins

Duplicated genes

All duplicate data

Mated vs. virgin

Divergence between species - ejaculate candidates

circos

Divergence between species - FRT

Evolutionary rates

Signal peptide rates

STEP rates

KW tests

Database comparisons

Numbers of differentially abundant proteins detected

Number of ejaculate candidates

Differentially abundant ejaculate candidates

Differentially abundant female reproductive tract proteins

Proportions detected

Abundance correlations

Coefficients of variation

Test differences in abundances between databases

Heatmap

Differential abundance analysis

Evolutionary rates