def index(filename, format, alphabet=None, key_function=None):
    """Indexes a sequence file and returns a dictionary-like object.

     - filename - string giving name of file to be indexed
     - format   - lower case string describing the file format
     - alphabet - optional Alphabet object, useful when the sequence type
                  cannot be automatically inferred from the file itself
                  (e.g. format="fasta" or "tab")
     - key_function - optional callback function which when given a
                  SeqRecord identifier string should return a unique
                  key for the dictionary.

    This indexing function will return a dictionary-like object, giving the
    SeqRecord objects as values:

    >>> from Bio import SeqIO
    >>> records = SeqIO.index("Quality/example.fastq", "fastq")
    >>> len(records)
    3
    >>> sorted(records)
    ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
    >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta")
    >EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    <BLANKLINE>
    >>> "EAS54_6_R1_2_1_540_792" in records
    True
    >>> print records.get("Missing", None)
    None

    Note that this pseudo dictionary will not support all the methods of a
    true Python dictionary, for example values() is not defined since this
    would require loading all of the records into memory at once.

    When you call the index function, it will scan through the file, noting
    the location of each record. When you access a particular record via the
    dictionary methods, the code will jump to the appropriate part of the
    file and then parse that section into a SeqRecord.

    Note that not all the input formats supported by Bio.SeqIO can be used
    with this index function. It is designed to work only with sequential
    file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any
    interlaced file format (e.g. alignment formats such as "clustal").

    For small files, it may be more efficient to use an in-memory Python
    dictionary, e.g.

    >>> from Bio import SeqIO
    >>> records = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fastq"), "fastq"))
    >>> len(records)
    3
    >>> sorted(records)
    ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
    >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta")
    >EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    <BLANKLINE>

    As with the to_dict() function, by default the id string of each record
    is used as the key. You can specify a callback function to transform
    this (the record identifier string) into your preferred key. For example:

    >>> from Bio import SeqIO
    >>> def make_tuple(identifier):
    ...     parts = identifier.split("_")
    ...     return int(parts[-2]), int(parts[-1])
    >>> records = SeqIO.index("Quality/example.fastq", "fastq",
    ...                       key_function=make_tuple)
    >>> len(records)
    3
    >>> sorted(records)
    [(413, 324), (443, 348), (540, 792)]
    >>> print records[(540, 792)].format("fasta")
    >EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    <BLANKLINE>
    >>> (540, 792) in records
    True
    >>> "EAS54_6_R1_2_1_540_792" in records
    False
    >>> print records.get("Missing", None)
    None

    Another common use case would be indexing an NCBI style FASTA file,
    where you might want to extract the GI number from the FASTA identifier
    to use as the dictionary key.

    Notice that unlike the to_dict() function, here the key_function does
    not get given the full SeqRecord to use to generate the key. Doing so
    would impose a severe performance penalty as it would require the file
    to be completely parsed while building the index. Right now such full
    parsing is usually avoided.

    See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict()
    """
    #Try and give helpful error messages:
    if not isinstance(filename, basestring):
        raise TypeError("Need a filename (not a handle)")
    if not isinstance(format, basestring):
        raise TypeError("Need a string for the file format (lower case)")
    if not format:
        raise ValueError("Format required (lower case string)")
    if format != format.lower():
        raise ValueError("Format string '%s' should be lower case" % format)
    if alphabet is not None and not (isinstance(alphabet, Alphabet) or
                                     isinstance(alphabet, AlphabetEncoder)):
        raise ValueError("Invalid alphabet, %s" % repr(alphabet))

    #Map the file format to a sequence iterator:
    import _index #Lazy import
    return _index._IndexedSeqFileDict(filename, format, alphabet, key_function)
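
# Illustrative sketch only, not the Bio.SeqIO implementation. The docstring
# above describes scanning the file once to note where each record starts,
# then jumping to that location on access. The snippet below shows that idea
# for plain FASTA using nothing but the standard library. All names here
# (build_fasta_offsets, read_fasta_record, gi_key) are hypothetical helpers,
# and the demo identifiers and sequences are made up. gi_key assumes an NCBI
# style identifier of the form "gi|<number>|..." and mirrors the key_function
# contract: it is given only the identifier string, never a full record.

```python
import os
import tempfile


def build_fasta_offsets(filename, key_function=None):
    """Scan a FASTA file once, mapping each record key to its byte offset."""
    offsets = {}
    handle = open(filename, "rb")
    try:
        while True:
            position = handle.tell()  # offset of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The identifier is the first word after ">"
                parts = line[1:].split(None, 1)
                identifier = parts[0].decode("ascii") if parts else ""
                if key_function:
                    key = key_function(identifier)
                else:
                    key = identifier
                if key in offsets:
                    raise ValueError("Duplicate key %r" % (key,))
                offsets[key] = position
    finally:
        handle.close()
    return offsets


def read_fasta_record(filename, offset):
    """Seek to offset and parse just that record into (title, sequence)."""
    handle = open(filename, "rb")
    try:
        handle.seek(offset)
        title = handle.readline()[1:].rstrip().decode("ascii")
        chunks = []
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break  # end of file, or start of the next record
            chunks.append(line.strip().decode("ascii"))
    finally:
        handle.close()
    return title, "".join(chunks)


def gi_key(identifier):
    """Return the GI number from an identifier like 'gi|111|gb|AAA.1|'."""
    return int(identifier.split("|")[1])


# Demo on a throwaway two-record file (made-up data)
fd, demo_path = tempfile.mkstemp(suffix=".fasta")
os.write(fd, b">gi|111|gb|AAA.1| first\nACGT\nTTGA\n"
             b">gi|222|gb|BBB.1| second\nGGCC\n")
os.close(fd)
demo_index = build_fasta_offsets(demo_path, key_function=gi_key)
demo_record = read_fasta_record(demo_path, demo_index[222])
os.remove(demo_path)
```

# Only the offsets are held in memory, so the cost of building the index
# depends on the number of records rather than on their total size.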