eugene.preprocess.ohe_seqs_sdata

eugene.preprocess.ohe_seqs_sdata(sdata, alphabet='DNA', seq_var='seq', ohe_var='ohe_seq', fill_value=0, copy=False)

One-hot encode sequences in a SeqData object.

Wraps the ohe function from SeqPro and applies it to the sequences in a SeqData object. Adds a new variable holding the one-hot encoded sequences to the SeqData object, named by ohe_var (“ohe_seq” by default), with dimensions (“_sequence”, “length”, “_ohe”). Any existing variable with the same name is overwritten.

Parameters:
  • sdata (xr.Dataset) – SeqData object.

  • alphabet (str, optional) – Alphabet to use for one-hot encoding, by default “DNA”

  • seq_var (str, optional) – Name of the variable holding the sequences to be encoded, by default “seq”

  • ohe_var (str, optional) – Name of the variable to store the one-hot encoded sequences in, by default “ohe_seq”

  • fill_value (Union[int, float], optional) – Value to fill the one-hot encoded sequences with, by default 0

  • copy (bool, optional) – Whether to return a copy of the SeqData object, by default False

Returns:

SeqData object with one-hot encoded sequences. If copy is True, a copy of the SeqData object is returned; otherwise the original SeqData object is modified in place.

Return type:

xr.Dataset
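
Example:

The layout of the encoded variable, with dimensions (“_sequence”, “length”, “_ohe”), can be illustrated with a minimal pure-Python sketch. This is not the SeqPro implementation: the helper one_hot_encode, the column order "ACGT", and the choice to leave characters outside the alphabet as all-zero rows are assumptions made only for illustration.

```python
from typing import Dict, List

DNA_ALPHABET = "ACGT"  # assumed column order, for illustration only


def one_hot_encode(seqs: List[str], alphabet: str = DNA_ALPHABET) -> List[List[List[int]]]:
    """Encode equal-length sequences as nested lists shaped (sequence, length, alphabet)."""
    index: Dict[str, int] = {ch: i for i, ch in enumerate(alphabet)}
    length = len(seqs[0]) if seqs else 0
    encoded = []
    for seq in seqs:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        rows = []
        for ch in seq:
            row = [0] * len(alphabet)
            # Assumption: characters outside the alphabet (e.g. "N") stay all-zero.
            if ch in index:
                row[index[ch]] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded


ohe = one_hot_encode(["ACGT", "TTNA"])
shape = (len(ohe), len(ohe[0]), len(ohe[0][0]))
print(shape)      # (2, 4, 4)
print(ohe[0][0])  # [1, 0, 0, 0]
```

With eugene installed, the corresponding call on a SeqData object would be eugene.preprocess.ohe_seqs_sdata(sdata), after which the encoded sequences are available under sdata["ohe_seq"] (or under the name passed as ohe_var).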