How can I remove (non-trivial) duplicates from a VCF file?Are duplicate variants against the VCF standard?How...
Dilemma of explaining to interviewer that he is the reason for declining second interview
Line of Bones to Travel and Conform to Curve (Like Train on a Track, Snake...)
New package vs new version?
How to deal with possible delayed baggage?
Why did the villain in the first Men in Black movie care about Earth's Cockroaches?
Which communication protocol is used in AdLib sound card?
When can a QA tester start his job?
Is using an 'empty' metaphor considered bad style?
What is the data structure of $@ in shell?
Constexpr if with a non-bool condition
What's a good word to describe a public place that looks like it wouldn't be rough?
Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?
In Linux what happens if 1000 files in a directory are moved to another location while another 300 files were added to the source directory?
Consequences of lack of rigour
How can I remove (non-trivial) duplicates from a VCF file?
What is a good reason for every spaceship to carry a weapon on board?
Early credit roll before the end of the film
Why are the books in the Game of Thrones citadel library shelved spine inwards?
Graph with overlapping labels
What is the wife of a henpecked husband called?
Am I a Rude Number?
Why avoid shared user accounts?
Do authors have to be politically correct in article-writing?
False written accusations not made public - is there law to cover this?
How can I remove (non-trivial) duplicates from a VCF file?
Are duplicate variants against the VCF standard?How to read structural variant VCF?How do I carry out an ancestry/admixture test on a single VCF file?How can I extract only insertions from a VCF file?How to manipulate a reference FASTA or bam to include variants from a VCF?How to represent a deletion at position 1 in a VCF file?Where can I get the population allele frequency vcf file?How to subset samples from a VCF file?Meaning of the FORMAT fields of the VCF file coming from GIAB projectHigh Phred Quality score VCF fileAre duplicate variants against the VCF standard?
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
$endgroup$
add a comment |
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
$endgroup$
add a comment |
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
$endgroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
vcf variation
edited 2 hours ago
terdon
asked 2 hours ago
terdon♦terdon
4,3701729
4,3701729
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
1 hour ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
add a comment |
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
add a comment |
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
answered 1 hour ago
terdon♦terdon
4,3701729
4,3701729
add a comment |
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
1 hour ago
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
1 hour ago
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
edited 1 hour ago
answered 1 hour ago
Daniel StandageDaniel Standage
2,333329
2,333329
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
1 hour ago
add a comment |
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
1 hour ago
1
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by
freebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.$endgroup$
– terdon♦
1 hour ago
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by
freebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.$endgroup$
– terdon♦
1 hour ago
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown