How can I remove (non-trivial) duplicates from a VCF file?Are duplicate variants against the VCF standard?How...

Dilemma of explaining to interviewer that he is the reason for declining second interview

Line of Bones to Travel and Conform to Curve (Like Train on a Track, Snake...)

New package vs new version?

How to deal with possible delayed baggage?

Why did the villain in the first Men in Black movie care about Earth's Cockroaches?

Which communication protocol is used in AdLib sound card?

When can a QA tester start his job?

Is using an 'empty' metaphor considered bad style?

What is the data structure of $@ in shell?

Constexpr if with a non-bool condition

What's a good word to describe a public place that looks like it wouldn't be rough?

Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?

In Linux what happens if 1000 files in a directory are moved to another location while another 300 files were added to the source directory?

Consequences of lack of rigour

How can I remove (non-trivial) duplicates from a VCF file?

What is a good reason for every spaceship to carry a weapon on board?

Early credit roll before the end of the film

Why are the books in the Game of Thrones citadel library shelved spine inwards?

Graph with overlapping labels

What is the wife of a henpecked husband called?

Am I a Rude Number?

Why avoid shared user accounts?

Do authors have to be politically correct in article-writing?

False written accusations not made public - is there law to cover this?



How can I remove (non-trivial) duplicates from a VCF file?


Are duplicate variants against the VCF standard?How to read structural variant VCF?How do I carry out an ancestry/admixture test on a single VCF file?How can I extract only insertions from a VCF file?How to manipulate a reference FASTA or bam to include variants from a VCF?How to represent a deletion at position 1 in a VCF file?Where can I get the population allele frequency vcf file?How to subset samples from a VCF file?Meaning of the FORMAT fields of the VCF file coming from GIAB projectHigh Phred Quality score VCF fileAre duplicate variants against the VCF standard?













3












$begingroup$


This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1


These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1


However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.










share|improve this question











$endgroup$

















    3












    $begingroup$


    This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



    ##fileformat=VCFv4.1
    ##reference=foo
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr12>
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
    chr12 529514 . AACAC AATAC . PASS . GT 0/1
    chr12 529516 . C T . PASS . GT 0/1


    These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



    Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



    $ ./bin/vcfuniq test.vcf
    ##fileformat=VCFv4.1
    ##reference=foo
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr12>
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
    chr12 529514 . AACAC AATAC . PASS . GT 0/1
    chr12 529516 . C T . PASS . GT 0/1


    However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.










    share|improve this question











    $endgroup$















      3












      3








      3





      $begingroup$


      This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



      Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



      $ ./bin/vcfuniq test.vcf
      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.










      share|improve this question











      $endgroup$




      This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



      Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



      $ ./bin/vcfuniq test.vcf
      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.







      vcf variation






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited 2 hours ago







      terdon

















      asked 2 hours ago









      terdonterdon

      4,3701729




      4,3701729






















          2 Answers
          2






          active

          oldest

          votes


















          2












          $begingroup$

          It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



          $ bcftools norm -d none -f hg19.fa test.vcf
          ##fileformat=VCFv4.1
          ##FILTER=<ID=PASS,Description="All filters passed">
          ##reference=foo
          ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
          ##contig=<ID=chr12>
          ##bcftools_normVersion=1.8+htslib-1.8
          ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
          #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
          chr12 529516 . C T . PASS . GT 0/1
          Lines total/split/realigned/skipped: 2/0/1/0





          share|improve this answer









          $endgroup$





















            1












            $begingroup$

            I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




            • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

            • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

            • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


            Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



            In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



            The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



            #!/usr/bin/env python3


            def canonicalize(instream):
            for line in instream:
            if not line.startswith('#'):
            values = line.split('t')
            pos = int(values[1])
            ref, alt = values[3:5]
            if len(ref) > 1 and len(ref) == len(alt):
            # How many bp to trim off the end
            for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
            if r != a:
            revoffset = -1 * n
            break

            # How many bp to trim off the front
            for n, (r, a) in enumerate(zip(ref, alt)):
            if r != a:
            offset = n
            values[1] = str(pos + offset)
            values[3] = ref[offset:revoffset]
            values[4] = alt[offset:revoffset]
            break
            line = 't'.join(values)
            yield line


            if __name__ == '__main__':
            import sys
            for line in canonicalize(sys.stdin):
            print(line, end='')





            share|improve this answer











            $endgroup$









            • 1




              $begingroup$
              It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
              $endgroup$
              – terdon
              1 hour ago











            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "676"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            2












            $begingroup$

            It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



            $ bcftools norm -d none -f hg19.fa test.vcf
            ##fileformat=VCFv4.1
            ##FILTER=<ID=PASS,Description="All filters passed">
            ##reference=foo
            ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
            ##contig=<ID=chr12>
            ##bcftools_normVersion=1.8+htslib-1.8
            ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
            #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
            chr12 529516 . C T . PASS . GT 0/1
            Lines total/split/realigned/skipped: 2/0/1/0





            share|improve this answer









            $endgroup$


















              2












              $begingroup$

              It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



              $ bcftools norm -d none -f hg19.fa test.vcf
              ##fileformat=VCFv4.1
              ##FILTER=<ID=PASS,Description="All filters passed">
              ##reference=foo
              ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
              ##contig=<ID=chr12>
              ##bcftools_normVersion=1.8+htslib-1.8
              ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
              #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
              chr12 529516 . C T . PASS . GT 0/1
              Lines total/split/realigned/skipped: 2/0/1/0





              share|improve this answer









              $endgroup$
















                2












                2








                2





                $begingroup$

                It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



                $ bcftools norm -d none -f hg19.fa test.vcf
                ##fileformat=VCFv4.1
                ##FILTER=<ID=PASS,Description="All filters passed">
                ##reference=foo
                ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                ##contig=<ID=chr12>
                ##bcftools_normVersion=1.8+htslib-1.8
                ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
                chr12 529516 . C T . PASS . GT 0/1
                Lines total/split/realigned/skipped: 2/0/1/0





                share|improve this answer









                $endgroup$



                It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



                $ bcftools norm -d none -f hg19.fa test.vcf
                ##fileformat=VCFv4.1
                ##FILTER=<ID=PASS,Description="All filters passed">
                ##reference=foo
                ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                ##contig=<ID=chr12>
                ##bcftools_normVersion=1.8+htslib-1.8
                ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
                chr12 529516 . C T . PASS . GT 0/1
                Lines total/split/realigned/skipped: 2/0/1/0






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered 1 hour ago









                terdonterdon

                4,3701729




                4,3701729























                    1












                    $begingroup$

                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')





                    share|improve this answer











                    $endgroup$









                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      1 hour ago
















                    1












                    $begingroup$

                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')





                    share|improve this answer











                    $endgroup$









                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      1 hour ago














                    1












                    1








                    1





                    $begingroup$

                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')





                    share|improve this answer











                    $endgroup$



                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')






                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited 1 hour ago

























                    answered 1 hour ago









                    Daniel StandageDaniel Standage

                    2,333329




                    2,333329








                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      1 hour ago














                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      1 hour ago








                    1




                    1




                    $begingroup$
                    It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                    $endgroup$
                    – terdon
                    1 hour ago




                    $begingroup$
                    It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                    $endgroup$
                    – terdon
                    1 hour ago


















                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Bioinformatics Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Can't compile dgruyter and caption packagesLaTeX templates/packages for writing a patent specificationLatex...

                    Schneeberg (Smreczany) Bibliografia | Menu...

                    IEEEtran - How to include ORCID in TeX/PDF with PdfLatexIs there a standard way to include ORCID in TeX /...