Combine a list element

Problem

Say you have a tool that takes a parameter and a set of files and processes them together. The process CAT below stands in for such a tool. My first approach was to collect the files into a list and then combine that list with the parameter, as in the following workflow.

problem.nf
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

/*******************************************************************************
 * Define processes
 ******************************************************************************/

process CREATE {
    input:
    val filename

    output:
    path filename

    script:  // (1)
    """
    echo ${filename} > ${filename}
    """
}

process CAT {
    input:
    tuple val(number), path(files)

    output:
    tuple val(number), path('result.txt')

    script:  // (2)
    """
    cat ${files} > result.txt
    echo 'Parameter: ${number}' >> result.txt
    """
}

/*******************************************************************************
 * Define main workflow
 ******************************************************************************/

workflow {
    ch_param = Channel.of(1..5)
    ch_files = Channel.of('foo.txt', 'bar.txt', 'baz.txt')

    CREATE(ch_files)

    CREATE.out.map { it.name } .view()  // (3)

    ch_input = ch_param.combine( CREATE.out.collect() )  // (4)

    ch_input.map { row ->
            [row.head()] + row.tail().collect{ it.name }  // (5)
        }
        .view()

    CAT(ch_input)

    CAT.out.map { it[1].text }.view()  // (6)
}
  1. Create a file with distinct content.
  2. Concatenate all files into one and use the parameter value.
  3. For better display, I'm only showing the filenames here and not the whole paths.
  4. The collect operator should turn this into a list.
  5. Don't worry too much about this, I'm again transforming the output to only display filenames and not the entire paths.
  6. Here, I want to show the content of the resulting file, which is the second element of each output pair.

Run the above workflow with:

NXF_VER='21.10.6' nextflow run examples/combine-list/problem.nf

which gives the following output. It looks like the combine operator does not keep the collected list as a single element. Instead, it unpacks the list into the resulting tuple, so each item has four elements (the parameter and the three files) rather than a pair of parameter and list. There is also a warning that the input cardinality does not match the one declared in CAT, and indeed the output shows that only one file is written to the result while the others are ignored.

executor >  local (8)
[32/a72ef8] process > CREATE (1) [100%] 3 of 3 ✔
[0a/dfd2e7] process > CAT (4)    [100%] 5 of 5 ✔
baz.txt
bar.txt
foo.txt
[1, baz.txt, bar.txt, foo.txt]
[2, baz.txt, bar.txt, foo.txt]
[3, baz.txt, bar.txt, foo.txt]
[4, baz.txt, bar.txt, foo.txt]
[5, baz.txt, bar.txt, foo.txt]
baz.txt
Parameter: 1

baz.txt
Parameter: 5

baz.txt
Parameter: 2

baz.txt
Parameter: 3

baz.txt
Parameter: 4

WARN: Input tuple does not match input set cardinality declared by process `CAT`
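
To look at the shape problem in isolation, here is a minimal sketch without any processes. It assumes plain string values in place of the files; the channel name ch_letters is made up for illustration. It only prints what combine emits.

```groovy
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

workflow {
    // Hypothetical stand-ins: two parameter values and three strings instead of files.
    ch_param = Channel.of(1, 2)
    ch_letters = Channel.of('a', 'b', 'c')

    // collect emits a single list, but combine unpacks that list into the
    // resulting tuple, so each item is flat: the parameter followed by the strings.
    ch_param.combine( ch_letters.collect() ).view()
}
```

Each item printed by view should be a flat list of three or more elements rather than a pair, which is the same shape that triggers the cardinality warning in CAT.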

Solution

Well, if a single list gets unpacked by combine, maybe we can nest that list so that we have a list with a single element that is itself a list. I tried quite a few different ways:

  1. Can we collect twice?

    ch_input = ch_param.combine( CREATE.out.collect().collect() )
    

    This does not work. Just like in the problem, we get a flat list, since collect flattens nested lists by default.

  2. What if we place it into a list manually?

    ch_input = ch_param.combine( [ CREATE.out.collect() ] )
    

    This yields an error

    Not a valid path value type: groovyx.gpars.dataflow.DataflowVariable
    

    which makes sense since we place the collected variable (of type DataflowVariable) inside the literal list and thus it gets passed to our CAT process directly.

  3. Instead of collect there is also toList...

    ch_input = ch_param.combine( [ CREATE.out.toList() ] )
    

    Same error 😞

    Not a valid path value type: groovyx.gpars.dataflow.DataflowVariable
    
  4. Then I got the correct advice:

    ch_input = ch_param.combine( CREATE.out.toList().toList() )
    

    The corresponding comment on Slack was:

    Harshil Patel

    Don't ask me why.

    🙈 🙊

  5. Turns out that the following combination also works.

    ch_input = ch_param.combine( CREATE.out.collect().toList() )
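
A possible reading of the attempts above (my interpretation, not something stated in the Slack thread): combine unpacks one level of list, so the right-hand channel has to emit a list that contains the file list. If that is right, wrapping the collected list explicitly with map should work as well. The sketch below assumes plain strings in place of files, and the names ch_letters and letters are illustrative only.

```groovy
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

workflow {
    // Hypothetical stand-ins: two parameter values and three strings instead of files.
    ch_param = Channel.of(1, 2)
    ch_letters = Channel.of('a', 'b', 'c')

    // Wrap the collected list in an outer list explicitly. combine then
    // unpacks only the outer level and keeps the inner list as one element.
    ch_param.combine( ch_letters.collect().map { letters -> [letters] } ).view()
}
```

Compared with calling toList twice, the explicit map makes the intent (one extra level of nesting) visible in the code, at the cost of being slightly longer.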
    

So in full the solution looks as follows.

solution.nf
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

/*******************************************************************************
 * Define processes
 ******************************************************************************/

process CREATE {
    input:
    val filename

    output:
    path filename

    script:  // (1)
    """
    echo ${filename} > ${filename}
    """
}

process CAT {
    input:
    tuple val(number), path(files)

    output:
    tuple val(number), path('result.txt')

    script:  // (2)
    """
    cat ${files} > result.txt
    echo 'Parameter: ${number}' >> result.txt
    """
}

/*******************************************************************************
 * Define main workflow
 ******************************************************************************/

workflow {
    ch_param = Channel.of(1..5)
    ch_files = Channel.of('foo.txt', 'bar.txt', 'baz.txt')

    CREATE(ch_files)

    CREATE.out.map { it.name } .view()  // (3)

    ch_input = ch_param.combine( CREATE.out.toList().toList() )  // (4)

    ch_input.map { row ->
            [row.head(), row.last().collect{ it.name }]  // (5)
        }
        .view()

    CAT(ch_input)

    CAT.out.map { it[1].text }.view()  // (6)
}
  1. Create a file with distinct content.
  2. Concatenate all files into one and use the parameter value.
  3. For better display, I'm only showing the filenames here and not the whole paths.
  4. Use the winning solution from above. The toList operator applied twice creates the nested list.
  5. Don't worry too much about this, I'm again transforming the output to only display filenames and not the entire paths.
  6. Again, I want to show the content of the resulting file, which is the second element of each output pair.

Run the above workflow with:

NXF_VER='21.10.6' nextflow run examples/combine-list/solution.nf

This time, both the shape of the input for CAT and the content of the resulting files are as expected. ✌

executor >  local (8)
[0c/731285] process > CREATE (3) [100%] 3 of 3 ✔
[e0/670c78] process > CAT (5)    [100%] 5 of 5 ✔
bar.txt
foo.txt
baz.txt
[1, [bar.txt, foo.txt, baz.txt]]
[2, [bar.txt, foo.txt, baz.txt]]
[3, [bar.txt, foo.txt, baz.txt]]
[4, [bar.txt, foo.txt, baz.txt]]
[5, [bar.txt, foo.txt, baz.txt]]
bar.txt
foo.txt
baz.txt
Parameter: 3

bar.txt
foo.txt
baz.txt
Parameter: 1

bar.txt
foo.txt
baz.txt
Parameter: 4

bar.txt
foo.txt
baz.txt
Parameter: 2

bar.txt
foo.txt
baz.txt
Parameter: 5

Alternative solutions

DataflowVariable value

We saw above that the following code caused an error because we are passing a groovyx.gpars.dataflow.DataflowVariable to the process.

ch_input = ch_param.combine( [ CREATE.out.collect() ] )

It is possible, though highly discouraged, to access a DataflowVariable's inner value.

ch_input = ch_param.combine( [ CREATE.out.collect() ] )  // (1)
    .map { first, second -> [first, second.val] }
  1. This combination generates pairs where the first element is the parameter value and the second is the DataflowVariable holding the list. The map step then unwraps the list via val.

Creating a list through transformation

In our problem statement we saw:

ch_input = ch_param.combine( CREATE.out.collect() )

which created lists of four elements each: the parameter and the three files. We can transform this shape ourselves by splitting each list into its head (the parameter) and its tail (the files).

ch_input = ch_param.combine( CREATE.out.collect() )
    .map { [it.head(), it.tail()] }

Done 🙂

Using combine and groupTuple

A very different approach is to first combine every parameter value with every file. This generates pairs of one value and one file. We can then group the pairs together as tuples.

group-tuple.nf
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

/*******************************************************************************
 * Define processes
 ******************************************************************************/

process CREATE {
    input:
    val filename

    output:
    path filename

    script:  // (1)
    """
    echo ${filename} > ${filename}
    """
}

process CAT {
    input:
    tuple val(number), path(files)

    output:
    tuple val(number), path('result.txt')

    script:  // (2)
    """
    cat ${files} > result.txt
    echo 'Parameter: ${number}' >> result.txt
    """
}

/*******************************************************************************
 * Define main workflow
 ******************************************************************************/

workflow {
    ch_param = Channel.of(1..5)
    ch_files = Channel.of('foo.txt', 'bar.txt', 'baz.txt')

    CREATE(ch_files)

    CREATE.out.map { it.name } .view()  // (3)

    ch_input = ch_param.combine( CREATE.out )  // (4)
        .groupTuple()

    ch_input.map { row ->
            [row.head(), row.last().collect{ it.name }]  // (5)
        }
        .view()

    CAT(ch_input)

    CAT.out.map { it[1].text }.view()  // (6)
}
  1. Create a file with distinct content.
  2. Concatenate all files into one and use the parameter value.
  3. For better display, I'm only showing the filenames here and not the whole paths.
  4. Use combine on the flat channels to generate pairs. Then collect tuples of files by grouping the pairs by their first element, the numeric value, with groupTuple.
  5. Don't worry too much about this, I'm again transforming the output to only display filenames and not the entire paths.
  6. Again, I want to show the content of the resulting file, which is the second element of each output pair.

Run the above workflow with:

NXF_VER='21.10.6' nextflow run examples/combine-list/group-tuple.nf

This produces the same result. However, if you have a lot of elements in your channels, this might perform slightly worse, since you first generate many more pairs (one per parameter and file) that then have to be grouped again.

executor >  local (8)
[e9/c7a72b] process > CREATE (2) [100%] 3 of 3 ✔
[cb/44c510] process > CAT (5)    [100%] 5 of 5 ✔
baz.txt
foo.txt
bar.txt
[1, [baz.txt, foo.txt, bar.txt]]
[2, [baz.txt, foo.txt, bar.txt]]
[3, [baz.txt, foo.txt, bar.txt]]
[4, [baz.txt, foo.txt, bar.txt]]
[5, [baz.txt, foo.txt, bar.txt]]
baz.txt
foo.txt
bar.txt
Parameter: 4

baz.txt
foo.txt
bar.txt
Parameter: 2

baz.txt
foo.txt
bar.txt
Parameter: 3

baz.txt
foo.txt
bar.txt
Parameter: 1

baz.txt
foo.txt
bar.txt
Parameter: 5
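
One way to soften the grouping cost: by default groupTuple cannot emit a group until the upstream channel has finished, because it cannot know whether more items with the same key will arrive. If the number of files per parameter is known in advance, the size option lets it emit each group as soon as it is complete. This does not reduce the number of pairs that combine creates. Below is a minimal sketch with plain strings in place of files; the name ch_letters is illustrative, and the size of 3 is an assumption that matches the three files in this example.

```groovy
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

workflow {
    // Hypothetical stand-ins: two parameter values and three strings instead of files.
    ch_param = Channel.of(1, 2)
    ch_letters = Channel.of('a', 'b', 'c')

    // combine yields pairs of one parameter and one string; groupTuple groups
    // them by the first element. size: 3 declares that every group holds
    // exactly three items, so a group can be emitted once it is full.
    ch_param.combine( ch_letters )
        .groupTuple(size: 3)
        .view()
}
```

If the number of files varies between runs, the size would have to be computed beforehand; otherwise the plain groupTuple call from the workflow above is the safer choice.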