This Streamlit app builds on the data produced by following the blog post written by Stuart Ozer at Snowflake. The post demonstrates how Snowflake can handle the ingestion and querying of thousands of complete DNA sequences, which translates to over 17 billion rows of data.
The blog describes how this data is complemented with the following two additional data sets:
An Annotation dataset
A Panel Dataset
Simple SQL queries are then used to answer a multitude of questions held within this vast amount of data.
I used the same datasets to create a Streamlit app.
Before I started
I followed the blog to load all the data into Snowflake. Before inserting data into the GENOTYPES_BY_SAMPLE table, I uncommented the following line:
--where relative_path rlike '.*.hard-filtered.vcf.gz' //select all 3202 genomes
I then commented out the original 'where clause' which ingests a set of 8 genomes.
I used a 4XL warehouse to load nearly 1TB of data and it took around 20 minutes to complete.
Once I loaded the data, this resulted in a table that looks just like this:
The table consists of 17.3 billion rows of both structured and semi-structured data. The custom functions used for processing the VCF files ensure that the data's format is suitable for advanced analytical querying.
I then continued to follow the instructions and populated the CLINVAR and the PANEL table.
The PANEL table is really useful as it also ties each sample back to its father and mother, as well as recording the sample's gender and the population it comes from.
The CLINVAR table effectively provides lookups to each component of the genome, which further enriches the data. Information such as related diseases is held here.
The blog continues to explore several analytical examples of querying the data using simple SQL syntax. In my example, I will be exclusively querying with Snowpark DataFrames for Python.
Analysing the data before creating the app
In the supplied CLINVAR table, I focused on the CLNDN field and filtered to only view rows which contained the disease 'Multiple Sclerosis'.
You will note that I am using the flatten function to create a list of all distinct diseases, then I simply filtered to produce a list of annotations.
disease = clinvar.join_table_function('flatten', F.col('CLNDN')).select(F.cast('VALUE', T.StringType()).alias('DISEASE')).distinct()
panel = session.table('DCR_DCR_GENOMES_RELEASE.PUBLIC.PANEL')
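The filter step that produces ms_annotations is not shown above. As a plain-Python sketch of the logic only (the sample rows and the helper name distinct_diseases are made up for illustration, and the spelling of the disease name in CLNDN is an assumption), flattening, de-duplicating and filtering works roughly like this:

```python
def distinct_diseases(rows):
    """Flatten each row's CLNDN list and return the distinct disease names."""
    seen = set()
    for row in rows:
        for name in row["CLNDN"]:
            seen.add(name)
    return sorted(seen)


# Hypothetical annotation rows; real CLNDN values come from the CLINVAR table.
clinvar_rows = [
    {"POS": 101, "CLNDN": ["Multiple_sclerosis", "not_provided"]},
    {"POS": 202, "CLNDN": ["Hereditary_cancer"]},
    {"POS": 303, "CLNDN": ["Multiple_sclerosis"]},
]

disease = distinct_diseases(clinvar_rows)
# Keep only the disease names that mention multiple sclerosis (assumed spelling).
ms_annotations = [d for d in disease if "multiple_sclerosis" in d.lower()]
```

In Snowpark the same idea would be a filter on the DISEASE column of the disease dataframe; the exact match string depends on how the disease is written in your copy of the data.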
Next, I created a join to keep only the annotation rows whose CLNDN array contains any of the values in the dataframe which I called ms_annotations.
variants_ms = clinvar.join(
    ms_annotations,
    F.array_contains(F.cast(F.lit(ms_annotations['DISEASE']), T.VariantType()),
                     F.col('CLNDN')))
Next, I used this dataframe to filter the Genome table. Just like the SQL examples in the published blog, I am joining on the fields POS, CHROM and REF.
patients_ms = all_genomes.join(variants_ms, on=(all_genomes['POS'] == variants_ms['POS']) & (all_genomes['CHROM'] == variants_ms['CHROM']) & (all_genomes['REF'] == variants_ms['REF']),lsuffix='annotation_')
Next, I created a dataframe which shows me all the distinct sample codes which are linked to the variants.
patients_ms_grp = patients_ms.group_by('SAMPLE_ID').count()
patients_ms_grp_filtered = patients_ms_grp.filter(F.col('COUNT')>2).sample(0.3)  # sample of 30% of the selection
patients_ms_grp_filtered
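To make the grouping step concrete, here is a small plain-Python sketch of "count matches per sample, then keep samples with more than two". The sample IDs are invented for illustration, and the random 30% sampling is left out so the result is deterministic:

```python
from collections import Counter

# Hypothetical matched rows: one entry per (sample, variant) match.
matched_sample_ids = [
    "HG001", "HG001", "HG001",
    "HG002",
    "HG003", "HG003", "HG003", "HG003",
]

# Equivalent of group_by('SAMPLE_ID').count()
counts = Counter(matched_sample_ids)

# Equivalent of filter(F.col('COUNT') > 2)
samples_over_two = sorted(s for s, n in counts.items() if n > 2)
```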
Now I have a panel subset (based on the above logic). I will use this to filter the complete genome dataframe to retrieve the complete sequence of each person in the panel subset.
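panel_sample = panel.join(patients_ms_grp_filtered,on='SAMPLE_ID')
# Assumed step, not shown in the original: keep only the genome rows for samples in the panel subset
patients_ms_complete_genomes = all_genomes.join(panel_sample, on='SAMPLE_ID')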
Joining the annotation (clinvar) table to this new dataframe enriches it with the annotation details.
join_annotations = patients_ms_complete_genomes\
.join(clinvar,
on=(patients_ms_complete_genomes['POS'] == clinvar['POS']) & (patients_ms_complete_genomes['CHROM'] == clinvar['CHROM']) & (patients_ms_complete_genomes['REF'] == clinvar['REF'])
,lsuffix='annotation_')
I then decided to create a heatmap of this dataset, but to include only significant genes. To start with, I created a summary table of the dataset which gave me counts by sample and gene symbol.
join_annotations.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))\
.write.mode("overwrite").save_as_table('"Panel Summary Results"',table_type='temporary')
Next, I created a subset to show only significant genes with the use of a window calculation.
from snowflake.snowpark.window import Window
window1=Window.partition_by("SAMPLE_ID").order_by(F.col('"Count"').desc())
significant_genes = session.table('"Panel Summary Results"').with_column('Row',(F.row_number().over(window1))).filter(F.col('ROW')<=10).drop('ROW')
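The window calculation above ranks genes within each sample by count and keeps the top rows. As a rough plain-Python sketch of that "top N per partition" idea (the rows and the helper name top_n_per_sample are hypothetical, not part of the original code):

```python
from collections import defaultdict


def top_n_per_sample(rows, n):
    """Keep the n rows with the highest Count within each SAMPLE_ID."""
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["SAMPLE_ID"]].append(row)
    kept = []
    for sample_id in sorted(by_sample):
        # Same role as row_number() over (partition by SAMPLE_ID order by Count desc)
        ranked = sorted(by_sample[sample_id], key=lambda r: r["Count"], reverse=True)
        kept.extend(ranked[:n])
    return kept


# Hypothetical summary rows.
summary_rows = [
    {"SAMPLE_ID": "HG001", "GENESYMBOL": "BRCA2", "Count": 7},
    {"SAMPLE_ID": "HG001", "GENESYMBOL": "TTN", "Count": 12},
    {"SAMPLE_ID": "HG001", "GENESYMBOL": "APOE", "Count": 3},
    {"SAMPLE_ID": "HG002", "GENESYMBOL": "TTN", "Count": 9},
    {"SAMPLE_ID": "HG002", "GENESYMBOL": "CFTR", "Count": 4},
]

top_two = top_n_per_sample(summary_rows, 2)
```

Doing this in Snowflake rather than in Python matters here because the ranking runs over the full summary table before anything is pulled into pandas.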
I used seaborn and Matplotlib to visualise this information in a heatmap.
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
matrix = significant_genes.to_pandas().pivot_table(index="GENESYMBOL", columns="SAMPLE_ID", values="Count")
sns.set(font_scale=0.4)
sns.heatmap(matrix.fillna(0), cmap="YlGnBu", annot=False, fmt="0.0f")
st.pyplot(plt)
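The pivot step is what turns the long list of (sample, gene, count) rows into the grid that the heatmap draws. A minimal sketch of that reshaping in plain Python, using invented rows and a hypothetical helper called pivot_counts, and assuming one row per gene and sample pair:

```python
def pivot_counts(rows):
    """Reshape long rows into a gene-by-sample grid, filling gaps with 0."""
    samples = sorted({r["SAMPLE_ID"] for r in rows})
    genes = sorted({r["GENESYMBOL"] for r in rows})
    lookup = {(r["GENESYMBOL"], r["SAMPLE_ID"]): r["Count"] for r in rows}
    matrix = [[lookup.get((g, s), 0) for s in samples] for g in genes]
    return genes, samples, matrix


# Hypothetical rows in the same shape as the summary table.
rows = [
    {"SAMPLE_ID": "HG001", "GENESYMBOL": "TTN", "Count": 12},
    {"SAMPLE_ID": "HG001", "GENESYMBOL": "BRCA2", "Count": 7},
    {"SAMPLE_ID": "HG002", "GENESYMBOL": "TTN", "Count": 9},
]

genes, samples, matrix = pivot_counts(rows)
```

Genes become the heatmap rows and samples the columns; a gene that does not appear for a sample shows as 0, which is what the fillna(0) call does in the pandas version.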
Creating the Streamlit Application
The Streamlit application compares one sample with the parents' samples, if there are any.
Here is the complete code. Simply copy it into a new 'Streamlit in Snowflake' app, replacing the default code, and then add the seaborn, numpy and matplotlib packages using the packages dropdown.
# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from snowflake.snowpark.window import Window
window1=Window.partition_by("SAMPLE_ID").order_by(F.col('"Count"').desc())
window2=Window.partition_by('"CLNSIG"').order_by(F.col('"Count"').desc())
# Write directly to the app
st.set_page_config(layout="wide")
st.subheader("PATIENT GENOMIC DETAILS - CLINICIAN VIEW")
st.write(
'Clinician View of Patient Genome Details'
)
# Get the current credentials
session = get_active_session()
clinvar = session.table('DCR_GENOMES.SHARED_PUBLIC.CLINVAR')
PANEL0 = session.table('DCR_GENOMES.SHARED_PUBLIC.PANEL')
PANEL = PANEL0.filter((F.col('FATHER_ID')!='0') | (F.col('MOTHER_ID')!='0'))
genomes = session.table('DCR_GENOMES.SHARED_PUBLIC.GENOTYPES_BY_SAMPLE_ALL')
PANELU = PANEL0.filter((F.col('FATHER_ID')=='0') | (F.col('MOTHER_ID')=='0'))
# create the side bar selector
with st.sidebar:
    s_patient = st.selectbox('Choose Sample:', PANEL.select('SAMPLE_ID')
                             .filter(F.col('SAMPLE_ID')!='HG00155'))
    s_patient_1 = st.selectbox('Choose Unrelated Sample for reference:', PANELU.select('SAMPLE_ID')
                               .filter(F.col('SAMPLE_ID')!='HG00155'))
    topn = st.slider('Number of Significant Genes:',5,50,10)
def genomesone(sample):
    # Genotype rows for one sample, enriched with the matching ClinVar annotations
    genomesone = genomes.filter(F.col('SAMPLE_ID')== sample)
    join_annotations = genomesone\
        .join(clinvar,
              on=(genomesone['POS'] == clinvar['POS']) & (genomesone['CHROM'] == clinvar['CHROM']) & (genomesone['REF'] == clinvar['REF'])
              ,lsuffix='annotation_')
    return join_annotations
def disease(sample):
    # Heatmap of the 10 most frequent diseases for the sample
    plt.clf()
    significant_genes = sample.with_column('Row',(F.row_number().over(window1))).filter(F.col('ROW')<=10).drop('ROW')
    matrix = significant_genes.to_pandas().pivot_table(index='Disease', columns="SAMPLE_ID", values="Count")
    sns.set(font_scale=1)
    sns.heatmap(matrix.fillna(0), cmap="YlGnBu", annot=True, fmt="0.0f")
    return st.pyplot(plt)
# DISEASE SIGNIFICANCE
disease_significance = genomesone(s_patient)\
.join_table_function('flatten','CLNDN').select('SAMPLE_ID',F.cast('VALUE',T.StringType()).alias('"Disease"'))\
.group_by('SAMPLE_ID','"Disease"').agg(F.count('*').alias('"Count"'))\
.order_by(F.col('"Count"').desc())\
.filter((F.col('"Disease"')!='not provided') & (F.col('"Disease"')!='not specified'))
PANEL_INFO = PANEL.filter(F.col('SAMPLE_ID')==s_patient).to_pandas().iloc[0]
MOTHER_ID = PANEL_INFO.MOTHER_ID
FATHER_ID = PANEL_INFO.FATHER_ID
MOTHER = PANEL0.filter(F.col('SAMPLE_ID')==MOTHER_ID).to_pandas().iloc[0]
FATHER = PANEL0.filter(F.col('SAMPLE_ID')==FATHER_ID).to_pandas().iloc[0]
col1, col2 = st.columns(2)
with col1:
    st.markdown(f'''Sample ID: {PANEL_INFO.SAMPLE_ID}''')
    st.markdown(f'''Father ID: {FATHER_ID}''')
    st.markdown(f'''Mother ID: {MOTHER_ID}''')
with col2:
    st.markdown(f'''Gender: {PANEL_INFO.GENDER}''')
    st.markdown(f'''Population: {PANEL_INFO.POPULATION}''')
    st.markdown(f'''Super Population: {PANEL_INFO.SUPERPOPULATION}''')
# dataframes of sample plus parents and unrelated
patient_full = genomesone(s_patient).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
patient_full_copy = genomesone(s_patient).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
mother_full = genomesone(MOTHER.SAMPLE_ID).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
father_full = genomesone(FATHER.SAMPLE_ID).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
unrelated = genomesone(s_patient_1).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
# group_together
summary = patient_full.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
father = father_full.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
mother = mother_full.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
unrel = unrelated.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
st.divider()
# join to parent
def patient_parent(parent):
    # Rows where the selected sample and the other sample carry the same variant and alleles
    return patient_full\
        .join(parent,on=(patient_full['POS'] == parent['POS'])
              & (patient_full['CHROM'] == parent['CHROM'])
              & (patient_full['REF'] == parent['REF'])
              & (patient_full['GENESYMBOL'] == parent['GENESYMBOL'])
              & (patient_full['ALT'] == parent['ALT'])
              & (patient_full['ALLELE1'] == parent['ALLELE1'])
              & (patient_full['ALLELE2'] == parent['ALLELE2'])
              ,lsuffix='patient_')
col1,col2,col3,col4 = st.columns(4)
with col1:
    st.metric('Sample the same as Mother:',patient_parent(mother_full).count())
with col2:
    st.metric('Sample the same as Father:',patient_parent(father_full).count())
with col3:
    st.metric('Total Sample Count:',patient_parent(patient_full_copy).count())
with col4:
    # compare against the unrelated sample in the same way as the parents
    st.metric('Sample the same as Unrelated Sample:',patient_parent(unrelated).count())
st.divider()
def genes(sample):
    # Heatmap of the top N genes (N comes from the sidebar slider)
    plt.clf()
    significant_genes = sample.with_column('Row',(F.row_number().over(window1))).filter(F.col('ROW')<=topn).drop('ROW')
    matrix = significant_genes.to_pandas().pivot_table(index="GENESYMBOL", columns="SAMPLE_ID", values="Count")
    sns.set(font_scale=0.6)
    #sns.set(xticklabels=[])
    sns.heatmap(matrix.fillna(0), cmap="YlGnBu", annot=False, fmt="0.0f")
    return st.pyplot(plt)
st.markdown('''### GENE COMPOSITION''')
col1,col2,col3 = st.columns([0.3,0.7,0.3])
with col2:
    genes(summary)
st.divider()
st.markdown('''### DISEASE SIGNIFICANCE''')
disease(disease_significance)
st.divider()
st.markdown('''### FAMILY''')
col1,col2,col3 = st.columns(3)
with col1:
    col1a,col1b = st.columns([0.1,0.9])
    with col1b:
        st.markdown('###### Selected Sample')
        st.markdown(f'''Sample ID: {PANEL_INFO.SAMPLE_ID}''')
        st.markdown(f'''Family ID: {PANEL_INFO.FAMILY_ID}''')
        st.markdown(f'''Mother ID: {PANEL_INFO.MOTHER_ID}''')
        st.markdown(f'''Father ID: {PANEL_INFO.FATHER_ID}''')
        st.markdown(f'''Gender: {PANEL_INFO.GENDER}''')
        st.markdown(f'''Population: {PANEL_INFO.POPULATION}''')
        st.markdown(f'''Super Population: {PANEL_INFO.SUPERPOPULATION}''')
        genes(summary)
with col2:
    col1a,col1b = st.columns([0.1,0.9])
    with col1b:
        st.markdown('###### Father')
        st.markdown(f'''Sample ID: {FATHER.SAMPLE_ID}''')
        st.markdown(f'''Family ID: {FATHER.FAMILY_ID}''')
        st.markdown(f'''Mother ID: {FATHER.MOTHER_ID}''')
        st.markdown(f'''Father ID: {FATHER.FATHER_ID}''')
        st.markdown(f'''Gender: {FATHER.GENDER}''')
        st.markdown(f'''Population: {FATHER.POPULATION}''')
        st.markdown(f'''Super Population: {FATHER.SUPERPOPULATION}''')
        genes(father)
with col3:
    col1a,col1b = st.columns([0.1,0.9])
    with col1b:
        st.markdown('###### Mother')
        st.markdown(f'''Sample ID: {MOTHER.SAMPLE_ID}''')
        st.markdown(f'''Family ID: {MOTHER.FAMILY_ID}''')
        st.markdown(f'''Mother ID: {MOTHER.MOTHER_ID}''')
        st.markdown(f'''Father ID: {MOTHER.FATHER_ID}''')
        st.markdown(f'''Gender: {MOTHER.GENDER}''')
        st.markdown(f'''Population: {MOTHER.POPULATION}''')
        st.markdown(f'''Super Population: {MOTHER.SUPERPOPULATION}''')
        genes(mother)
st.divider()
st.markdown('### PATIENT RAW DATA')
st.write(genomesone(s_patient).limit(5))
The app allows you to select a sample and gives you a slider to select the number of significant genes. It will then reveal the gene composition heatmap with an additional disease significance visual. It will also compare this with the genetics of the mother and father.
There is also a drop down to compare with an unrelated sample, which is useful for comparing the number of similar traits between related and unrelated samples.
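The "same as mother / father / unrelated" numbers come from joining two samples on the full variant key. A small plain-Python sketch of that comparison, using invented variants and hypothetical helper names (variant_keys, shared_count), shows why a parent would normally score higher than an unrelated sample:

```python
KEY_FIELDS = ("CHROM", "POS", "REF", "GENESYMBOL", "ALT", "ALLELE1", "ALLELE2")


def variant_keys(rows):
    """Build the set of full variant keys for one sample."""
    return {tuple(r[f] for f in KEY_FIELDS) for r in rows}


def shared_count(sample_a, sample_b):
    """Number of variants that match on every key field."""
    return len(variant_keys(sample_a) & variant_keys(sample_b))


def variant(pos, alt):
    # Hypothetical variant on chr1 in a made-up gene; only POS and ALT vary.
    return {"CHROM": "chr1", "POS": pos, "REF": "A", "GENESYMBOL": "GENE_X",
            "ALT": alt, "ALLELE1": "0", "ALLELE2": "1"}


patient = [variant(100, "G"), variant(200, "T"), variant(300, "C")]
mother = [variant(100, "G"), variant(200, "T"), variant(900, "C")]
unrelated = [variant(300, "C"), variant(400, "G")]

shared_with_mother = shared_count(patient, mother)
shared_with_unrelated = shared_count(patient, unrelated)
```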
Adding genome specific Visualisations
After some investigation I came across the Python tool pygenomeviz.
The package relies on the libraries matplotlib and biopython - both of which Streamlit in Snowflake supports.
To load the pygenomeviz code into your Streamlit app, you will need to import the files into the stage which Streamlit created. If you cannot remember where this is located, simply copy the stage name from the Streamlit URL and paste it into the data search area.
For example, copy XS6XZV1KR438EAMB
Paste in the search
Click on the stage and press 'Enable Directory Table'
Download the compressed file from pygenomeviz
Extract the files and locate the src folder. In here there is a folder called pygenomeviz, which is all you will need in Snowflake.
Open the genbank.py file and replace the imports at the top with the following:
from functools import lru_cache
from io import TextIOWrapper
from pathlib import Path
from typing import Any, List, Optional, Tuple, Union
import numpy as np
from Bio import SeqIO, SeqUtils
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
Load all files into the streamlit stage under a new folder called pygenomeviz
You will see something like this
Back in your streamlit app add the package biopython
Finally replace all the code with the code below:
# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from snowflake.snowpark.window import Window
import random
window1=Window.partition_by("SAMPLE_ID").order_by(F.col('"Count"').desc())
window2=Window.partition_by('"CLNSIG"').order_by(F.col('"Count"').desc())
# Write directly to the app
st.set_page_config(layout="wide")
st.subheader("PATIENT GENOMIC DETAILS - CLINICIAN VIEW")
st.write(
'Clinician View of Patient Genome Details'
)
# Get the current credentials
session = get_active_session()
clinvar = session.table('DCR_GENOMES.SHARED_PUBLIC.CLINVAR')
PANEL0 = session.table('DCR_GENOMES.SHARED_PUBLIC.PANEL')
PANEL = PANEL0.filter((F.col('FATHER_ID')!='0') | (F.col('MOTHER_ID')!='0'))
PANELU = PANEL0.filter((F.col('FATHER_ID')=='0') | (F.col('MOTHER_ID')=='0'))
genomes = session.table('DCR_GENOMES.SHARED_PUBLIC.GENOTYPES_BY_SAMPLE_ALL')
with st.sidebar:
    s_patient = st.selectbox('Choose Sample:', PANEL.select('SAMPLE_ID')
                             .filter(F.col('SAMPLE_ID')!='HG00155'))
    s_patient_1 = st.selectbox('Choose Unrelated Sample for reference:', PANELU.select('SAMPLE_ID')
                               .filter(F.col('SAMPLE_ID')!='HG00155'))
    topn = st.slider('Number of Significant Genes:',5,50,10)
def genomesone(sample):
    # Genotype rows for one sample, enriched with the matching ClinVar annotations
    genomesone = genomes.filter(F.col('SAMPLE_ID')== sample)
    join_annotations = genomesone\
        .join(clinvar,
              on=(genomesone['POS'] == clinvar['POS']) & (genomesone['CHROM'] == clinvar['CHROM']) & (genomesone['REF'] == clinvar['REF'])
              ,lsuffix='annotation_')
    return join_annotations
def disease(sample):
    # Heatmap of the 10 most frequent diseases for the sample
    plt.clf()
    significant_genes = sample.with_column('Row',(F.row_number().over(window1))).filter(F.col('ROW')<=10).drop('ROW')
    matrix = significant_genes.to_pandas().pivot_table(index='Disease', columns="SAMPLE_ID", values="Count")
    sns.set(font_scale=1)
    sns.heatmap(matrix.fillna(0), cmap="YlGnBu", annot=True, fmt="0.0f")
    return st.pyplot(plt)
#DISEASE SIGNIFICANCE
disease_significance = genomesone(s_patient)\
.join_table_function('flatten','CLNDN').select('SAMPLE_ID',F.cast('VALUE',T.StringType()).alias('"Disease"'))\
.group_by('SAMPLE_ID','"Disease"').agg(F.count('*').alias('"Count"'))\
.order_by(F.col('"Count"').desc())\
.filter((F.col('"Disease"')!='not provided') & (F.col('"Disease"')!='not specified'))
PANEL_INFO = PANEL.filter(F.col('SAMPLE_ID')==s_patient).to_pandas().iloc[0]
MOTHER_ID = PANEL_INFO.MOTHER_ID
FATHER_ID = PANEL_INFO.FATHER_ID
MOTHER = PANEL0.filter(F.col('SAMPLE_ID')==MOTHER_ID).to_pandas().iloc[0]
FATHER = PANEL0.filter(F.col('SAMPLE_ID')==FATHER_ID).to_pandas().iloc[0]
col1, col2 = st.columns(2)
with col1:
    st.markdown(f'''Sample ID: {PANEL_INFO.SAMPLE_ID}''')
    st.markdown(f'''Father ID: {FATHER_ID}''')
    st.markdown(f'''Mother ID: {MOTHER_ID}''')
with col2:
    st.markdown(f'''Gender: {PANEL_INFO.GENDER}''')
    st.markdown(f'''Population: {PANEL_INFO.POPULATION}''')
    st.markdown(f'''Super Population: {PANEL_INFO.SUPERPOPULATION}''')
# dataframes of sample plus parents and unrelated
patient_full = genomesone(s_patient).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
patient_full_copy = genomesone(s_patient).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
mother_full = genomesone(MOTHER.SAMPLE_ID).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
father_full = genomesone(FATHER.SAMPLE_ID).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
unrelated = genomesone(s_patient_1).select('POS','CHROM','REF','SAMPLE_ID','GENESYMBOL','ALLELE1','ALLELE2','ALT').distinct()
# group_together
summary = patient_full.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
father = father_full.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
mother = mother_full.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
unrel = unrelated.group_by('SAMPLE_ID','GENESYMBOL').agg(F.count('*').alias('"Count"'))
st.divider()
# join to parent
def patient_parent(parent):
    # Rows where the selected sample and the other sample carry the same variant and alleles
    return patient_full\
        .join(parent,on=(patient_full['POS'] == parent['POS'])
              & (patient_full['CHROM'] == parent['CHROM'])
              & (patient_full['REF'] == parent['REF'])
              & (patient_full['GENESYMBOL'] == parent['GENESYMBOL'])
              & (patient_full['ALT'] == parent['ALT'])
              & (patient_full['ALLELE1'] == parent['ALLELE1'])
              & (patient_full['ALLELE2'] == parent['ALLELE2'])
              ,lsuffix='patient_')
col1,col2,col3,col4 = st.columns(4)
with col1:
    st.metric('Sample the same as Mother:',patient_parent(mother_full).count())
with col2:
    st.metric('Sample the same as Father:',patient_parent(father_full).count())
with col3:
    st.metric('Total Sample Count:',patient_parent(patient_full_copy).count())
with col4:
    # compare against the unrelated sample in the same way as the parents
    st.metric('Sample the same as Unrelated Sample:',patient_parent(unrelated).count())
st.divider()
def genes(sample):
    # Heatmap of the top N genes (N comes from the sidebar slider)
    plt.clf()
    significant_genes = sample.with_column('Row',(F.row_number().over(window1))).filter(F.col('ROW')<=topn).drop('ROW')
    matrix = significant_genes.to_pandas().pivot_table(index="GENESYMBOL", columns="SAMPLE_ID", values="Count")
    sns.set(font_scale=0.6)
    #sns.set(xticklabels=[])
    sns.heatmap(matrix.fillna(0), cmap="YlGnBu", annot=False, fmt="0.0f")
    return st.pyplot(plt)
st.markdown('''### GENE COMPOSITION''')
col1,col2,col3 = st.columns([0.3,0.7,0.3])
with col2:
    genes(summary)
st.divider()
st.markdown('''### DISEASE SIGNIFICANCE''')
disease(disease_significance)
st.divider()
st.markdown('''### FAMILY''')
col1,col2,col3 = st.columns(3)
with col1:
    col1a,col1b = st.columns([0.1,0.9])
    with col1b:
        st.markdown('###### Selected Sample')
        st.markdown(f'''Sample ID: {PANEL_INFO.SAMPLE_ID}''')
        st.markdown(f'''Family ID: {PANEL_INFO.FAMILY_ID}''')
        st.markdown(f'''Mother ID: {PANEL_INFO.MOTHER_ID}''')
        st.markdown(f'''Father ID: {PANEL_INFO.FATHER_ID}''')
        st.markdown(f'''Gender: {PANEL_INFO.GENDER}''')
        st.markdown(f'''Population: {PANEL_INFO.POPULATION}''')
        st.markdown(f'''Super Population: {PANEL_INFO.SUPERPOPULATION}''')
        genes(summary)
with col2:
    col1a,col1b = st.columns([0.1,0.9])
    with col1b:
        st.markdown('###### Father')
        st.markdown(f'''Sample ID: {FATHER.SAMPLE_ID}''')
        st.markdown(f'''Family ID: {FATHER.FAMILY_ID}''')
        st.markdown(f'''Mother ID: {FATHER.MOTHER_ID}''')
        st.markdown(f'''Father ID: {FATHER.FATHER_ID}''')
        st.markdown(f'''Gender: {FATHER.GENDER}''')
        st.markdown(f'''Population: {FATHER.POPULATION}''')
        st.markdown(f'''Super Population: {FATHER.SUPERPOPULATION}''')
        genes(father)
with col3:
    col1a,col1b = st.columns([0.1,0.9])
    with col1b:
        st.markdown('###### Mother')
        st.markdown(f'''Sample ID: {MOTHER.SAMPLE_ID}''')
        st.markdown(f'''Family ID: {MOTHER.FAMILY_ID}''')
        st.markdown(f'''Mother ID: {MOTHER.MOTHER_ID}''')
        st.markdown(f'''Father ID: {MOTHER.FATHER_ID}''')
        st.markdown(f'''Gender: {MOTHER.GENDER}''')
        st.markdown(f'''Population: {MOTHER.POPULATION}''')
        st.markdown(f'''Super Population: {MOTHER.SUPERPOPULATION}''')
        genes(mother)
st.divider()
############ for this part of the code you need to upload the files from the pygenomeviz library to the Streamlit stage
##### delete everything below if you do not wish to load the files from pygenomeviz
######### these are the files you will need inside a folder called pygenomeviz within your Streamlit stage
# go to the following URL - https://pypi.org/project/pygenomeviz/
# download the following file and unzip it: pygenomeviz-0.4.4.tar.gz
# open genbank.py from the folder src/pygenomeviz and, after all the imports at the top of the page, add: from Bio.Seq import Seq
# in line 7 remove the word Seq
import sys
sys.path.append('/pygenomeviz')
from pygenomeviz import GenomeViz
genomes_start_end = clinvar.agg(F.min('STARTPOS').alias('STARTPOS'),F.max('STOPPOS').alias('STOPPOS')).to_pandas()
plotstyles = ("bigarrow", "arrow", "bigbox", "box", "bigrbox", "rbox")
colors = ("#071e58", "#3ab0c3", "#f0f9b7", "#215ca7","#bbe4b5")
gene_colors = clinvar.select('GENESYMBOL').distinct().to_pandas()
chroms = clinvar.select('CHROM').distinct()
REF = clinvar.select('REF').distinct()
gene_colors['color'] = np.random.choice(colors,len(gene_colors))
def genome_viewer(sample,chrom):
    # One row per gene for the chosen chromosomes, limited to the selected position range
    data = genomesone(sample)\
        .with_column('STRAND',F.replace('CLINSIGSIMPLE',0,1))\
        .with_column('SIZE',F.col('STOPPOS')-F.col('STARTPOS'))\
        .filter(F.col('CHROM').in_(chrom))\
        .group_by(F.col('GENESYMBOL')).agg(F.any_value('STRAND').alias('STRAND'),F.sum('POS'),F.min('STARTPOS').alias('START'),F.max('STOPPOS').alias('STOP'),F.sum('SIZE').alias('SIZE')).sort(F.col('SIZE').desc())\
        .filter(F.col('START')>=pos_slider[0])\
        .filter(F.col('STOP')<=pos_slider[1])\
        .filter(F.col('SIZE')>=0)
    return data
st.markdown('### PATIENT GENOME VISUALISATION')
col1,col2 = st.columns(2)
with st.form('dna_visual'):
    pos_slider = st.slider('choose position range:', genomes_start_end.STARTPOS.iloc[0],genomes_start_end.STOPPOS.iloc[0],(50000,2000000))
    select_chroms = st.multiselect('Choose Chromosome: ', chroms)
    submitted = st.form_submit_button('Submit')
if submitted:
    def genome_size(sample):
        genome_size = genome_viewer(sample,select_chroms).agg(F.max('STOP').alias('"max_size"'),F.min('START').alias('"Start_pos"'))
        genome_size = genome_size.with_column('"max_size"',(F.col('"max_size"')+F.col('"Start_pos"'))).select('"max_size"')
        return genome_size.to_pandas().max_size.iloc[0]
    def sample_track(sample):
        datapd = genome_viewer(sample,select_chroms).to_pandas()
        datapd['color']= np.random.choice(colors, len(datapd))
        gv = GenomeViz()
        track = gv.add_feature_track(sample, genome_size(sample))
        # add one feature per gene, coloured by gene symbol
        for a in datapd.index:
            color = random.choice(colors)
            start = datapd.START.iloc[a]
            end = datapd.STOP.iloc[a]
            strand = datapd.STRAND.iloc[a]
            track.add_feature(start, end,
                              strand, label=datapd.GENESYMBOL.iloc[a],
                              plotstyle="bigbox",
                              labelsize=10,
                              facecolor=gene_colors.loc[gene_colors['GENESYMBOL'] == datapd.GENESYMBOL.iloc[a]].color.iloc[0])
        fig = gv.plotfig()
        return st.pyplot(fig)
    try:
        st.markdown('#### PATIENT')
        sample_track(s_patient)
    except Exception:
        st.warning('no data available')
    try:
        st.markdown('#### MOTHER')
        sample_track(MOTHER.SAMPLE_ID)
    except Exception:
        st.warning('no data available')
    try:
        st.markdown('#### FATHER')
        sample_track(FATHER.SAMPLE_ID)
    except Exception:
        st.warning('no data available')
    try:
        st.markdown('#### UNRELATED SAMPLE')
        sample_track(s_patient_1)
    except Exception:
        st.warning('no data available')
st.divider()
st.markdown('### PATIENT RAW DATA')
st.write(genomesone(s_patient).limit(5))
When you reload the app, you will see a new option appear at the bottom.
Use the selector to choose the position range of the genomes and select the chromosomes you would like the results to be based on.
Finally press submit
You will see that the genomes are the previously selected patient with corresponding mother and father. I have also added the unrelated sample for comparison purposes.
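Behind each track, the app reduces the annotated rows to one span per gene and keeps only the genes that fall inside the selected position range. A rough plain-Python sketch of that reduction (the rows, gene names and the helper gene_spans are invented for illustration; the real app does this in Snowpark):

```python
def gene_spans(rows, range_start, range_stop):
    """Collapse rows to (min start, max stop) per gene, then filter by range."""
    spans = {}
    for r in rows:
        start, stop = spans.get(r["GENESYMBOL"], (r["STARTPOS"], r["STOPPOS"]))
        spans[r["GENESYMBOL"]] = (min(start, r["STARTPOS"]), max(stop, r["STOPPOS"]))
    # Keep genes that sit entirely inside the selected range.
    return {g: s for g, s in spans.items()
            if s[0] >= range_start and s[1] <= range_stop}


# Hypothetical annotated rows for one sample.
rows = [
    {"GENESYMBOL": "GENE_A", "STARTPOS": 100, "STOPPOS": 200},
    {"GENESYMBOL": "GENE_A", "STARTPOS": 150, "STOPPOS": 400},
    {"GENESYMBOL": "GENE_B", "STARTPOS": 5000, "STOPPOS": 6000},
]

visible = gene_spans(rows, 50, 1000)
```

Each remaining span is what ends up as one labelled box on the track, which is why narrowing the slider range reduces the number of genes drawn.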
I really hope you find this blog useful to help bring all that genome data to life.
Thanks