eugene.preprocess.pad_seqs_sdata

eugene.preprocess.pad_seqs_sdata(sdata, length, seq_var='seq', pad='right', pad_value=None, copy=False)

Pad sequences in a SeqData object.

Wraps the pad_seqs function from SeqPro and applies it to the sequences in a SeqData object. The padded sequences are automatically added to the SeqData object as a new variable named “{seq_var}_padded”, overwriting any existing variable with that name. Assumes that the dimension for the number of sequences is named “_sequence”, and adds a dimension called length to the padded sequences.

Parameters:
  • sdata (xr.Dataset) – SeqData object.

  • length (int) – Length to pad or truncate sequences to.

  • seq_var (str, optional) – Name of the variable holding the sequences, by default “seq”

  • pad (Literal["left", "both", "right"], optional) – Which side(s) to pad. When padding both sides and an odd amount of padding is needed, the extra pad value goes on the right side, by default “right”

  • pad_value (str, optional) – Single character to pad sequences with. Required for string input and ignored for one-hot encoded (OHE) sequences, by default None

  • copy (bool, optional) – Whether to return a copy of the SeqData object, by default False

Returns:

SeqData object with padded sequences. If copy is True, a copy of the SeqData object is returned, else the original SeqData object is modified in place.

Return type:

xr.Dataset
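
Example:

The pad parameter described above can be illustrated on a single plain string. The following is a minimal sketch of the documented padding rule only, not the SeqPro or EUGENe implementation: the helper name pad_str is hypothetical, it works on one Python string rather than a SeqData object, and it leaves sequences that are already at least as long as length unchanged, because this page does not specify how truncation is done.

```python
def pad_str(seq: str, length: int, pad: str = "right", pad_value: str = "N") -> str:
    """Pad one string to `length` (hypothetical helper, illustration only)."""
    if len(pad_value) != 1:
        raise ValueError("pad_value must be a single character")
    if pad not in ("left", "both", "right"):
        raise ValueError("pad must be 'left', 'both', or 'right'")

    missing = length - len(seq)
    if missing <= 0:
        # Truncation is deliberately not modeled in this sketch.
        return seq

    if pad == "left":
        return pad_value * missing + seq
    if pad == "right":
        return seq + pad_value * missing

    # pad == "both": with an odd amount, the extra pad value goes on the right.
    left = missing // 2
    right = missing - left
    return pad_value * left + seq + pad_value * right


for mode in ("left", "both", "right"):
    print(mode, pad_str("ACGT", 7, pad=mode))
```

Going by the signature at the top of this page, the corresponding call on a SeqData object would be along the lines of eugene.preprocess.pad_seqs_sdata(sdata, length=7, pad="both", pad_value="N"), with the result stored in the “seq_padded” variable when seq_var is left at its default.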