#Can this brute force code be easily sped up?

26 messages · Page 1 of 1 (latest)

hearty bolt
#

This is a segment of some bioinformatics code where i take some "word" and "alphabet" as input that is a dna string and possible DNA values:

word = "ACGTTCACGTCGATGCTATGCGATGCATGT";
alphabet = "ACGTN";

and this code is supposed to take that word and create a vector that lists every possible DNA strand that can be created via adding X number of additional letters, where the letters can be any listed in the alphabet. It needs to work for any value of X (num_mismatches in my code).

I do a very brute force method where I just loop through the Vec<DNA> and add each possible letter at each possible spot and append that all to a big list. Then, if the mismatches > 1, i loop through THAT list, and add each possible letter at each possible spot in the new words.

This looks like a lot of cloning and allocation/insertions that look expensive, but I can't figure out how to reduce that number.


#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum DNA {
    A,
    C,
    G,
    T,
    N,
}


// Insertions, slow?

    let mut potential_mismatches: Vec<Vec<DNA>> = Vec::new();

    let mut insertion_mismatches: Vec<Vec<DNA>> = Vec::new();
   
    let mut inserts = 1;
    while inserts <= num_mismatches {
        let mut word_insert_list: Vec<Vec<DNA>> = Vec::new();
        if inserts == 1 {
            word_insert_list.push(word.to_vec()); 
        }
        else {
            word_insert_list = insertion_mismatches.clone();
        }
        
        let word_length = word.len() + inserts;
        for word_to_insert_in in &mut word_insert_list {
            for i in 0..word_length {
                for nt in alphabet {
                    word_to_insert_in.insert(i, *nt); 
                    insertion_mismatches.push(word_to_insert_in.clone());
                    word_to_insert_in.remove(i); 
                }
            }
        }

        potential_mismatches.extend(insertion_mismatches.clone());
        inserts += 1;
    }

stone parrot
# hearty bolt This is a segment of some bioinformatics code where i take some "word" and "alph...

This is weirdly reminiscent of an Advent of Code problem. I can't find it anymore, but if you manage, the aoc reddit usually has really clever solutions posted.

Anyway, the goal is to iterate every possible string you can gain by inserting every possible letter at every possible location? That's an infinite list: you could just keep inserting increasingly large amounts of the first letter at the start.

I guess you can't reuse letters, but can avoid using them all?

hearty bolt
#

Basically you feed the funciton a word "ATG" a list of possible letters "ACGTN" and the maximum number of insertions (num_mismatches = 1, let's say).

Then it would output:

ATGN
ATGC
ATGA
ATGG
ATGT

ATNG
ATCG
ATGG
ATAG
...
etc.

So it's not infinite, it's just outputting all possible changes to the word that can occur due to insertions, given a set of letters that can be inserted, and the maximum total number of insertions

#

In biology terms it's trying to find all possible DNA strings given your input string, that result from inserting up to X new base pairs into your original string. And those base pairs are constrained by the 4 DNA base pairs ACGT and N.

stone parrot
#

Ah, you have a maximum number of mismatches, I'd missed that. But you can reuse your letters any number of times?

Hmm. Gimme a while, I'll see if I can come up with something. It feels like this should be feasible in zero allocations (well, except for the fact we need to make a list per DNA string)

hearty bolt
#

Yeah, you can re-use letters. So if you have 2 insertions and the original word is ACT, then GGACT is valid

#

I assume there's something clever here but can't find it. And when I re-wrote it using itertools it got 2x slower than my brute force method. I think because itertools actually has Vec<> allocations in a lot of their methods.

#

Also quick note that this is heavily parallelizable but that will occur further upstream this funciton, so mostly hyperfocused on the single threaded speed here

stone parrot
# hearty bolt I assume there's something clever here but can't find it. And when I re-wrote it...

I was trying to make a code example, but I'm struggling with off-by-one errors. For the sake of giving you something at least before I run out of time, here's a description with some basic pseudocode:

Since you want to insert letters of alphabet into word, we can reduce the number of allocations drastically by making a helper vector, which I'll call state, whose length is word.len() + 1. Elements of this vector can be thought of as being the spaces between elements of word: state[0] is the position just before the first element of word, state[1] is between the first and the second, and so on. state.last(), of course, is the position just after the last element of word. Most slots will be empty (See later, though. There's a neat way to make state a "dense" vector with no empty slots at all)

Now that we've done this, we want to enumerate all possible ways to distribute up to num_mismatches letters from alphabet in the state vector. Once we have a state, we can merge it with word to construct a new strand in a single allocation.

Let's start by focusing on a simple case: alphabet has only one letter.

  1. Start with a totally empty state vector (this assumes you want to consider the unchanged word an output: it's what happens if you perform zero substitutions, which is definitely at most as many as num_mismatches)
  2. move the letter in the last non-empty slot to an empty slot one to the right, if you can. If you can't (because it's already in the last slot or because all slots after it are filled), try the next step
  3. If you couldn't, move the previous letter in a non-empty slot to an empty slot one to the right. If you can't (because there is no previous letter), try the next step
  4. if you couldn't, reset the state (moving all letters as left as possible), then add one more letter. If you can't (because you already have num_mismatches letters), you're done.
  5. If you're not done after that, goto after step 1
#

For example, the above algorithm creates a progression like this:

state: [_, _, _]
num_mismatches: 2
outputs: //I'm assuming you want to return the unchanged input as a valid output
[_, _, _] //no more letters to move right, increase their number to 1
[a, _, _]
[_, a, _]
[_, _, a] //no more letters to move right, increase their number to 2
[a, a, _]
[a, _, a] //we can't move the current letter anymore, but we can move a previous letter
[_, a, a] //no more states to move right, and we can't increase the number of letters. We're done.
```Now, of course, realistic alphabets will have more than one letter. To account for that, we need to add a single step: 
1.5. Before moving a letter to the right, try to replace it with the next letter in `alphabet`. If you can't (because it's already the last letter), move it one to the right and reset it to the first letter. Either way, when a letter is changed or moved, all letters after it need to be reset to the first letter in the alphabet.
```rs
state: [_, _, _]
num_mismatches: 2,
alphabet: "ab"
outputs:
[_, _, _]
[a, _, _]
[b, _, _]
[_, a, _]
[_, b, _]
[_, _, a]
[_, _, b]
[a, a, _]
[a, b, _]
[a, _, a]
[a, _, b]
[_, a, a]
[_, a, b]
[_, b, a]
[_, b, b]
#

As for making state a dense vector, you can store it as a Vec<(usize, char)>. The usize represents the position of the character, rather than using its actual index in the vector to represent that. Incidentally, this makes it slightly easier to merge them by using Itertools::merge_by (btw, that function is neat. It has an optimal size_hint, so if you collect, or map().collect(), its output, it will preallocate exactly the correct capacity for the vector)

#

And of course, once you've merged alphabet and state, state can be reused for the next iteration, so overall you will have n + 1 allocations, where n is the number of outputs

stone parrot
#

Hmm, nevermind, this isn't all the possibilities. Using the above example, [b, a, _] was never produced. I think that could be amended by "preferring" replacing all letters with the next letter before moving them to the right? That is, the correct sequel to [a, b, _] isn't [a, _, a], it's [b, a, _], [b, b, _], [a, _, a]: if the last letter would have to be moved, first try changing/moving the letters before it, and only then move the last letter and reset the ones before it

#

(you may want to test it, I didn't manage to get it in a testable state 😄)

topaz crown
#

you got nerdsniped hard anyway lmao

#

I'm about to be too

topaz crown
stone parrot
#

If you do, you can reduce it to two allocations assuming my algorithm

#

The state helper, plus a vec that holds the current merged result

hearty bolt
#

I had to sleep, I will catch up on the help provided soon!

hearty bolt
#

I can see how to generate this state vector, but how do I then use the state vector to produce the set of Vec<Vec<DNA>> that I need at the end? Am I just looping through the vector and insert/removing and pushing to a Vec?

stone parrot
#

This should be feasible with itertools' merge_by, if you feel like staring at its docs for longer than I did

hearty bolt
#

I see, so the hopeful performance gain vs just two brute force loops comes from not having to .insert() and .remove() , instead we can just .push() everything?

Because I think this is the same number of allocations (maybe one more because it requires a state vector now?) as the brute force loop?

Unless I'm missing something

stone parrot