08 · VCF Mutation Analysis

Mutation frequency, Shannon entropy heatmaps, and tissue-type comparisons

genomics

virology

VCF

mutation

entropy

Per-sample VCF files from a variant caller. Coverage filtering, Shannon entropy per position, mutation frequency bar charts and entropy heatmaps per tissue type, Kruskal-Wallis comparison across tissue sources.

Overview

Item	Details
Input	Directory of `.vcf` files: `aa_change_{ID}-{Source}_{Segment}_{Threshold}.vcf`
Key packages	`vcfpy`, `pandas`, `scipy`, `seaborn`, `matplotlib`, `sklearn`
Statistics	Shannon entropy (natural log) · Kruskal–Wallis
Output	Mutation bar charts · entropy heatmaps (SVG) · frequency tables · KW results CSV
Download	template.ipynb

Edit SEGMENT_CONFIG to match your pathogen’s genome (total length, CDS boundaries, AA count). TISSUE_TYPES controls which sources are included in statistical comparisons.

← Gallery Download .ipynb

User Configuration

Code

# ── USER CONFIGURATION ────────────────────────────────────────────────────────

# Directory containing VCF files (relative to this file)
VCF_DIR = "data/aa_change_vcf"

# VCF filename format (regex). Named groups: id, source, segment, threshold
# Default: aa_change_{id}-{source}_{segment}_{threshold}.vcf
VCF_FILENAME_PATTERN = r"aa_change_(?P<id>[^-]+)-(?P<source>[^_]+)_(?P<segment>[SM])_(?P<threshold>\d+pct)\.vcf"

# Map threshold labels in filenames to display strings
THRESHOLD_LABEL_MAP = {"5pct": "5%", "2pct": "2%", "1pct": "1%"}

# Minimum read depth to retain a variant (coverage QC)
DEPTH_THRESHOLD = 500

# Tissue / source types to include in the analysis
TISSUE_TYPES = ["Lung", "Saliva", "Urine"]

# Segment geometry — edit to match your pathogen's genome
SEGMENT_CONFIG = {
    "S": {
        "length": 1902,
        "cds_start": 250,
        "cds_end": 1329,
        "aa_length": 360,
    },
    "M": {
        "length": 3675,
        "cds_start": 52,
        "cds_end": 3461,
        "aa_length": 1135,
    },
}

# VAF thresholds to display in bar charts (ordered low → high)
VAF_THRESHOLDS = ["1%", "2%", "5%"]

# Colors and transparency for each VAF threshold in bar charts
VAF_COLORS = {
    "5%": "#3594cc",
    "2%": "#ea801c",
    "1%": "#a00000",
}
VAF_ALPHA = {"5%": 0.3, "2%": 0.6, "1%": 0.9}

# Output directory for SVG plots (relative to this file)
OUTPUT_DIR = "Plots"

# ─────────────────────────────────────────────────────────────────────────────

Setup

Code

import os
import re
import warnings

import numpy as np
import pandas as pd
import vcfpy
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import entropy, kruskal

warnings.filterwarnings("ignore")

plt.rcParams.update({
    'axes.titlesize': 12,
    'axes.labelsize': 10,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10,
})
pd.set_option('display.width', 1000)

os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Output will be saved to: {os.path.abspath(OUTPUT_DIR)}")

Output will be saved to: /home/runner/work/lab-bioinfo-templates/lab-bioinfo-templates/templates/08_vcf-mutation-analysis/Plots

Load & Parse VCF Files

Code

pattern = re.compile(VCF_FILENAME_PATTERN)

records = []
vcf_files = [f for f in os.listdir(VCF_DIR) if f.endswith(".vcf")]
print(f"Found {len(vcf_files)} VCF files in {VCF_DIR}")

for fname in vcf_files:
    m = pattern.match(fname)
    if not m:
        print(f"  Skipping (name mismatch): {fname}")
        continue

    sample_id  = m.group("id")
    source     = m.group("source")
    segment    = m.group("segment")
    thresh_raw = m.group("threshold")
    threshold  = THRESHOLD_LABEL_MAP.get(thresh_raw, thresh_raw)

    fpath = os.path.join(VCF_DIR, fname)
    try:
        reader = vcfpy.Reader.from_path(fpath)
        for rec in reader:
            if not rec.ALT:
                continue
            alt_allele = rec.ALT[0].value if rec.ALT else "."
            aa_change  = rec.INFO.get("AA_CHANGE", "N/A")
            if isinstance(aa_change, list):
                aa_change = aa_change[0]

            sample_call = (
                rec.call_for_sample["SAMPLE"]
                if "SAMPLE" in rec.call_for_sample
                else rec.calls[0]
            )
            ad = sample_call.data.get("AD", [0, 0])
            dp = sample_call.data.get("DP", 0)

            records.append({
                "Position":          rec.POS,
                "Reference_Allele":  rec.REF,
                "Alternate_Allele":  alt_allele,
                "Quality":           rec.QUAL,
                "Amino_Acid_Change": aa_change,
                "Allelic_Depth_Ref": ad[0] if len(ad) > 0 else 0,
                "Allelic_Depth_Alt": ad[1] if len(ad) > 1 else 0,
                "Total_Depth":       dp,
                "Source":            source,
                "Segment":           segment,
                "Variant_Threshold": threshold,
                "ID":                sample_id,
            })
    except Exception as e:
        print(f"  Error reading {fname}: {type(e).__name__}: {e}")

combined_df = pd.DataFrame(records)
print(f"Total records parsed: {len(combined_df)}")

if combined_df.empty:
    raise RuntimeError(
        f"No VCF records were parsed from '{os.path.abspath(VCF_DIR)}'. "
        f"Check that simulate_data.py ran successfully and that files match "
        f"VCF_FILENAME_PATTERN. Files found: {vcf_files[:5]}"
    )

combined_df.head()

Found 108 VCF files in data/aa_change_vcf
Total records parsed: 881

	Position	Reference_Allele	Alternate_Allele	Quality	Amino_Acid_Change	Allelic_Depth_Ref	Allelic_Depth_Alt	Total_Depth	Source	Segment	Variant_Threshold	ID
0	304	A	C	55.6	Arg412Ser	1886	43	1929	Saliva	S	2%	SMPL006
1	366	T	C	50.7	Asn310Asp	1570	94	1664	Saliva	S	2%	SMPL006
2	1574	G	T	54.6	N/A	1861	148	2009	Saliva	S	2%	SMPL006
3	1613	T	G	31.1	N/A	2752	119	2871	Saliva	S	2%	SMPL006
4	1717	G	T	42.5	N/A	839	20	859	Saliva	S	2%	SMPL006

Quality Control: Coverage Filtering

Code

pre_filter = len(combined_df)
combined_df = combined_df[combined_df["Total_Depth"] >= DEPTH_THRESHOLD].copy()
print(f"Retained {len(combined_df)} / {pre_filter} records (depth ≥ {DEPTH_THRESHOLD}x)")

# Fix Allelic_Depth_Ref when reported as 0
mask = combined_df["Allelic_Depth_Ref"] == 0
combined_df.loc[mask, "Allelic_Depth_Ref"] = (
    combined_df.loc[mask, "Total_Depth"] - combined_df.loc[mask, "Allelic_Depth_Alt"]
)

os.makedirs("data", exist_ok=True)
combined_df["VAF"] = combined_df["Allelic_Depth_Alt"] / combined_df["Total_Depth"]

combined_df.to_csv("data/mutation_df.csv", index=False)
print("Saved: data/mutation_df.csv")
combined_df.describe()

Retained 881 / 881 records (depth ≥ 500x)
Saved: data/mutation_df.csv

	Position	Quality	Allelic_Depth_Ref	Allelic_Depth_Alt	Total_Depth	VAF
count	881.000000	881.000000	881.000000	881.000000	881.000000	881.000000
mean	1422.601589	44.644722	1684.480136	90.888763	1775.368899	0.051563
std	954.085243	8.462602	681.173586	91.714621	713.368280	0.043777
min	5.000000	30.100000	424.000000	6.000000	505.000000	0.009498
25%	661.000000	37.300000	1101.000000	35.000000	1152.000000	0.023283
50%	1295.000000	44.300000	1711.000000	60.000000	1792.000000	0.034854
75%	1939.000000	51.800000	2265.000000	110.000000	2379.000000	0.064785
max	3648.000000	60.000000	2960.000000	566.000000	3000.000000	0.198986

Mutation Distribution Bar Charts

Code

bar_widths = {"S": 9.0, "M": 14.0}

for source in TISSUE_TYPES:
    src_df = combined_df[combined_df["Source"] == source]
    if src_df.empty:
        continue

    fig, axes = plt.subplots(1, 2, figsize=(14, 3), constrained_layout=True)

    for ax, seg in zip(axes, ["S", "M"]):
        cfg   = SEGMENT_CONFIG[seg]
        seg_df = src_df[src_df["Segment"] == seg]

        for thresh in VAF_THRESHOLDS:
            t_df = seg_df[seg_df["Variant_Threshold"] == thresh]
            if t_df.empty:
                continue
            ax.bar(
                t_df["Position"],
                t_df["VAF"],
                width=bar_widths[seg],
                color=VAF_COLORS[thresh],
                alpha=VAF_ALPHA[thresh],
                label=thresh,
            )

        ax.set_xlim(1, cfg["length"])
        ax.set_ylim(0, 0.55)
        ax.set_xlabel("Genomic position (nt)")
        ax.set_ylabel("Variant allele frequency (VAF)")
        ax.set_title(f"{seg} segment — {source}")
        ax.axvspan(cfg["cds_start"], cfg["cds_end"], alpha=0.06, color="grey", label="CDS")
        ax.legend(loc="upper right", fontsize=8)

    fig.savefig(os.path.join(OUTPUT_DIR, f"mutation_map_{source}.svg"),
                format="svg", bbox_inches="tight")
    plt.show()
    print(f"Saved: mutation_map_{source}.svg")

Saved: mutation_map_Lung.svg

Saved: mutation_map_Saliva.svg

Saved: mutation_map_Urine.svg

Shannon Entropy

Code

combined_df["Freq_Ref"] = combined_df["Allelic_Depth_Ref"] / combined_df["Total_Depth"]
combined_df["Freq_Alt"] = combined_df["Allelic_Depth_Alt"] / combined_df["Total_Depth"]

def calculate_entropy(row):
    freqs = np.nan_to_num([row["Freq_Ref"], row["Freq_Alt"]], nan=0.0)
    return entropy(freqs, base=np.e)

combined_df["Entropy"] = combined_df.apply(calculate_entropy, axis=1)
print(f"Entropy range: {combined_df['Entropy'].min():.4f} – {combined_df['Entropy'].max():.4f}")
combined_df[["Position", "Segment", "Source", "Variant_Threshold", "Entropy"]].head(10)

Entropy range: 0.0537 – 0.4990

	Position	Segment	Source	Variant_Threshold	Entropy
0	304	S	Saliva	2%	0.106827
1	366	S	Saliva	2%	0.217199
2	1574	S	Saliva	2%	0.263026
3	1613	S	Saliva	2%	0.172522
4	1717	S	Saliva	2%	0.110554
5	1290	M	Urine	5%	0.482915
6	1833	M	Urine	5%	0.497796
7	2342	M	Urine	5%	0.365460
8	3644	M	Urine	5%	0.458609
9	3647	M	Urine	5%	0.497180

Entropy Heatmaps

Code

custom_cmap = sns.color_palette("Blues", as_cmap=True)

for seg in SEGMENT_CONFIG:
    seg_df = combined_df[
        (combined_df["Segment"] == seg) &
        (combined_df["Source"].isin(TISSUE_TYPES))
    ]

    pivot = (
        seg_df
        .groupby(["Source", "Position"])["Entropy"]
        .mean()
        .unstack(fill_value=0)
    )

    if pivot.empty:
        print(f"No data for {seg} segment — skipping heatmap")
        continue

    g = sns.clustermap(
        pivot,
        cmap=custom_cmap,
        row_cluster=True,
        col_cluster=False,
        figsize=(12, max(2, len(pivot) * 0.8 + 1)),
        linewidths=0,
        cbar_pos=(1.05, 0.2, 0.03, 0.6),
        xticklabels=False,
    )
    g.ax_heatmap.set_xlabel("Genomic position (nt)")
    g.ax_heatmap.set_ylabel("Tissue source")
    g.fig.suptitle(f"{seg} segment — Shannon Entropy by Source", y=1.02, fontsize=12)

    out_svg = os.path.join(OUTPUT_DIR, f"entropy_map_{seg}.svg")
    g.fig.savefig(out_svg, format="svg", bbox_inches="tight")
    plt.show()
    print(f"Saved: {out_svg}")

Saved: Plots/entropy_map_S.svg

Saved: Plots/entropy_map_M.svg

Mutation Frequency per Sample

Code

def filter_coding(df):
    masks = []
    for seg, cfg in SEGMENT_CONFIG.items():
        m = (
            (df["Segment"] == seg) &
            (df["Position"].between(cfg["cds_start"], cfg["cds_end"]))
        )
        masks.append(m)
    combined_mask = masks[0]
    for mask in masks[1:]:
        combined_mask = combined_mask | mask
    return df[combined_mask]


coding_df  = filter_coding(combined_df)
analysis_df = coding_df[coding_df["Source"].isin(TISSUE_TYPES)].copy()

freq_rows = []
for (sample_id, source, seg, thresh), grp in analysis_df.groupby(
    ["ID", "Source", "Segment", "Variant_Threshold"]
):
    cfg = SEGMENT_CONFIG[seg]
    n_total  = len(grp)
    n_nonsyn = (grp["Amino_Acid_Change"] != "N/A").sum()

    freq_rows.append({
        "ID":                   sample_id,
        "Source":               source,
        "Segment":              seg,
        "Threshold":            thresh,
        "Mutations_per_1000bp": (n_total / cfg["length"]) * 1000,
        "Mutations_per_100aa":  (n_nonsyn / cfg["aa_length"]) * 100,
    })

freq_df = pd.DataFrame(freq_rows)
print(freq_df.groupby(["Segment", "Threshold"])[
    ["Mutations_per_1000bp", "Mutations_per_100aa"]
].mean().round(2))
freq_df.head()

                   Mutations_per_1000bp  Mutations_per_100aa
Segment Threshold                                           
M       1%                         3.05                 0.85
        2%                         1.77                 0.48
        5%                         1.15                 0.31
S       1%                         3.74                 1.73
        2%                         2.25                 1.02
        5%                         1.31                 0.56

	ID	Source	Segment	Threshold	Mutations_per_1000bp	Mutations_per_100aa
0	SMPL001	Lung	M	1%	3.809524	1.145374
1	SMPL001	Lung	M	2%	1.904762	0.528634
2	SMPL001	Lung	M	5%	1.360544	0.352423
3	SMPL001	Lung	S	1%	3.680336	1.944444
4	SMPL001	Lung	S	2%	1.577287	0.555556

Mutation Frequency Violin Plots

Code

for seg in SEGMENT_CONFIG:
    seg_freq = freq_df[freq_df["Segment"] == seg]
    if seg_freq.empty:
        continue

    fig, axes = plt.subplots(1, 2, figsize=(10, 4), constrained_layout=True)
    fig.suptitle(f"{seg} segment — Mutation Frequency by Source", fontsize=12)

    metrics = [
        ("Mutations_per_1000bp", "Mutations per 1,000 bp"),
        ("Mutations_per_100aa",  "Non-syn. mutations per 100 aa"),
    ]
    for ax, (col, label) in zip(axes, metrics):
        sns.violinplot(
            data=seg_freq[seg_freq["Source"].isin(TISSUE_TYPES)],
            x="Source", y=col, hue="Threshold",
            palette=VAF_COLORS, order=TISSUE_TYPES,
            inner="quartile", ax=ax,
        )
        ax.set_xlabel("Tissue type")
        ax.set_ylabel(label)
        ax.set_title(label)

    fig.savefig(os.path.join(OUTPUT_DIR, f"mutation_freq_{seg}.svg"),
                format="svg", bbox_inches="tight")
    plt.show()

Kruskal–Wallis Tests

Code

kruskal_rows = []

metrics = [
    ("Mutations_per_1000bp", "Nucleotide (per 1,000 bp)"),
    ("Mutations_per_100aa",  "Amino acid (per 100 aa)"),
]

for thresh in VAF_THRESHOLDS:
    t_df = freq_df[freq_df["Threshold"] == thresh]
    for seg in SEGMENT_CONFIG:
        s_df = t_df[t_df["Segment"] == seg]
        for col, metric_label in metrics:
            groups = [s_df[s_df["Source"] == src][col].dropna()
                      for src in TISSUE_TYPES]
            groups = [g for g in groups if len(g) >= 2]
            if len(groups) < 2:
                continue
            stat, pval = kruskal(*groups)
            kruskal_rows.append({
                "Threshold":   thresh,
                "Segment":     seg,
                "Metric":      metric_label,
                "Statistic":   round(stat, 4),
                "p-value":     round(pval, 4),
                "Significant": pval < 0.05,
            })

kruskal_df = pd.DataFrame(kruskal_rows)
kruskal_df.to_csv(os.path.join(OUTPUT_DIR, "kruskal_results.csv"), index=False)
print("Kruskal–Wallis results:")
kruskal_df

Kruskal–Wallis results:

	Threshold	Segment	Metric	Statistic	p-value	Significant
0	1%	S	Nucleotide (per 1,000 bp)	1.5857	0.4526	False
1	1%	S	Amino acid (per 100 aa)	0.9485	0.6223	False
2	1%	M	Nucleotide (per 1,000 bp)	0.2406	0.8867	False
3	1%	M	Amino acid (per 100 aa)	0.3942	0.8211	False
4	2%	S	Nucleotide (per 1,000 bp)	0.3483	0.8402	False
5	2%	S	Amino acid (per 100 aa)	0.1621	0.9222	False
6	2%	M	Nucleotide (per 1,000 bp)	1.6571	0.4367	False
7	2%	M	Amino acid (per 100 aa)	2.9753	0.2259	False
8	5%	S	Nucleotide (per 1,000 bp)	0.8281	0.6610	False
9	5%	S	Amino acid (per 100 aa)	0.6719	0.7146	False
10	5%	M	Nucleotide (per 1,000 bp)	1.2049	0.5475	False
11	5%	M	Amino acid (per 100 aa)	0.9969	0.6075	False

Summary Table

Code

summary = (
    freq_df
    .groupby(["Segment", "Threshold", "Source"])[
        ["Mutations_per_1000bp", "Mutations_per_100aa"]
    ]
    .agg(["mean", "std"])
    .round(3)
)
summary.columns = ["_".join(c) for c in summary.columns]
summary.reset_index().to_csv(os.path.join(OUTPUT_DIR, "mutation_summary.csv"), index=False)
print("Saved: mutation_summary.csv")
summary

Saved: mutation_summary.csv

			Mutations_per_1000bp_mean	Mutations_per_1000bp_std	Mutations_per_100aa_mean	Mutations_per_100aa_std
Segment	Threshold	Source
M	1%	Lung	3.039	0.676	0.852	0.213
		Saliva	2.993	0.770	0.808	0.252
		Urine	3.129	0.784	0.896	0.258
	2%	Lung	1.950	0.469	0.543	0.117
		Saliva	1.542	0.477	0.426	0.103
		Urine	1.814	0.444	0.485	0.134
	5%	Lung	1.270	0.281	0.323	0.120
		Saliva	1.134	0.401	0.338	0.152
		Urine	1.043	0.401	0.279	0.130
S	1%	Lung	3.593	0.906	1.667	0.430
		Saliva	3.505	0.718	1.620	0.409
		Urine	4.118	0.965	1.898	0.567
	2%	Lung	2.366	0.644	1.019	0.380
		Saliva	2.103	0.880	0.972	0.421
		Urine	2.278	0.921	1.065	0.478
	5%	Lung	1.227	0.543	0.509	0.409
		Saliva	1.227	0.921	0.509	0.369
		Urine	1.490	0.699	0.648	0.380