European Genome-Phenome Archive

File Quality

File Information: EGAF00004829671

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000); y-axis: # Bases (log scale, 100 to 100M).]
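As a minimal sketch of how such a distribution could be tabulated, the snippet below counts how many positions have each coverage value. The input list is hypothetical, and the `cap` of 1000 is an assumption chosen to mirror the chart's final ">1000" bin; it is not taken from the archive's actual pipeline.

```python
from collections import Counter


def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position coverage values (non-negative ints).
    cap: values above this are pooled into one bin, stored under key cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    # Sort by coverage value so the result reads like the chart's x-axis.
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1200]
print(coverage_distribution(example_depths))
```

Plotting the counts on a logarithmic y-axis would then give a curve comparable in shape to the chart described above.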

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0M to 400M).]
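To make the scale concrete, here is a small sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls.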

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 87.4 % (16 952 808 reads). Unmapped: 12.6 %.
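The rate described above is the share of mapped reads among all reads. A minimal sketch of that calculation follows; the counts in the example are hypothetical, not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 874 mapped reads and 126 unmapped reads.
print(f"{mapped_rate(874, 126):.1%}")
```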

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; such pairs are not counted in the Proper Pairs chart.

Both mates mapped: 84.2 % (16 335 968 reads). Other reads: 15.8 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 3.6 % (616 840 reads). Other reads: 96.4 %.
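One way the categories above could be derived is from the bitwise FLAG field of SAM/BAM alignment records. The sketch below assumes the data is in that format, which this report does not state explicitly. The bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) come from the SAM specification; the function and category names are illustrative. Secondary and supplementary alignments are ignored for simplicity.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag):
    """Assign a read to a mapping category based on its SAM FLAG."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped, but its mate is not
    return "both_mates_mapped"


# FLAG 99: both mates mapped; 73: mapped read with unmapped mate;
# 133: unmapped read whose mate is mapped.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```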

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (9 701 950 reads). Other reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 79.6 % (15 446 042 reads). Other reads: 20.4 %.
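The expected separation in the example above is simply the fragment length minus the two read lengths. A small sketch of that arithmetic, with an illustrative function name:

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between two mates sequenced from either end of a fragment."""
    return fragment_len - 2 * read_len


# The example from the text: ~500 bp fragments and 100 bp reads.
print(expected_inner_distance(500, 100))
```

Real aligners typically judge a pair against an estimated distribution of fragment sizes rather than a single fixed value, so this should be read as a back-of-the-envelope check only.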

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads are sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads map to the same location and are marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 38.2 % (7 402 672 reads). Other reads: 61.8 %.
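The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the method of any particular tool: each pair is reduced to a hypothetical tuple of (chromosome, position, strand) for the read and its mate, and every pair after the first with an identical tuple is flagged. Real duplicate markers usually also decide which copy to keep, for example by base quality.

```python
def mark_duplicates(pairs):
    """Flag pairs whose read and mate coordinates were already seen.

    pairs: list of hashable tuples such as
           (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of booleans, True where the pair repeats an earlier one.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates (0.0 for empty input)."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical pairs: the second is an exact coordinate copy of the first.
example_pairs = [
    ("1", 100, "+", "1", 350, "-"),
    ("1", 100, "+", "1", 350, "-"),
    ("1", 100, "+", "1", 360, "-"),
    ("2", 50, "+", "2", 300, "-"),
]
print(mark_duplicates(example_pairs), duplicate_rate(example_pairs))
```

The sketch also illustrates the caveat in the text: two independent fragments that happen to share coordinates would be flagged, which becomes more likely as depth grows.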

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 80); y-axis: # Reads (1M to 10M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads may be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, because it concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chromosome   Mapped    Unmapped
1            96.32%    3.68%
2            95.72%    4.28%
3            96.67%    3.33%
4            96.26%    3.74%
5            96.4%     3.6%
6            96.69%    3.31%
7            96.21%    3.79%
8            96.8%     3.2%
9            96.76%    3.24%
10           96.02%    3.98%
11           96.78%    3.22%
12           96.59%    3.41%
13           96.55%    3.45%
14           96.27%    3.73%
15           95.85%    4.15%
16           96.72%    3.28%
17           96.99%    3.01%
18           96.26%    3.74%
19           97.32%    2.68%
20           96.68%    3.32%
21           96.99%    3.01%
X            95.96%    4.04%
Y            92%       8%
M            93.57%    6.43%
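A per-chromosome tally like the one charted here could be sketched as below, assuming SAM-style records in which an unmapped read carries the chromosome of its mapped mate. The `(rname, flag)` tuples are hypothetical inputs standing in for the RNAME and FLAG columns, and 0x4 is the SAM "read unmapped" bit; none of this is confirmed as the archive's own implementation.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    records: iterable of (rname, flag) tuples. Reads with rname "*"
    have no chromosome assigned and are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records: chromosome 1 has three mapped reads and one
# unmapped read placed with its mate; chromosome X has one mapped read.
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
print(per_chromosome_mapped_rate(example_records))
```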