Assemblers
Developing a package provider primarily deals with extending or re-using Assembler-related abstract classes and
implementations, but it is helpful to understand the context in which Assemblers operate. Assemblers are responsible for
gathering custodial and supplemental resources associated with a submission and returning a PackageStream of those
resources according to a packaging specification. Deposit Services will then stream the package to a downstream
repository via a Transport.
Use Case
An Assembler implementation is required for every packaging specification you wish to support. For example, if you
want to produce BagIt packages and DSpace METS packages,
you would need two Assembler implementations, each responsible for producing packages that align with their respective
specifications.
Another reason to develop an Assembler is to control how the metadata of a submission is mapped into your package. For
example, if your DSpace installation requires custom metadata elements, you would need to develop or extend an
existing Assembler to include the custom metadata as appropriate to your environment, by way of implementing a custom
Package Provider.
The easiest approach is usually to extend an existing base Assembler class rather than writing one from scratch.
Quick Start
- Create your Assembler class that extends org.eclipse.pass.deposit.assembler.AbstractAssembler
- Create your Package Provider class that implements org.eclipse.pass.deposit.assembler.PackageProvider
To get started with testing:
- Create your package verifier that implements org.eclipse.pass.deposit.assembler.PackageVerifier
- Extend and implement org.eclipse.pass.deposit.assembler.AbstractThreadedAssemblyIT
API Overview
Assembler API
The main entrypoint into the Assembler API is on the Assembler interface:
PackageStream assemble(DepositSubmission, Map<String, Object>)
where the DepositSubmission is the internal representation of a Submission, and the Map is a set of package
options read from repositories.json.
The AbstractAssembler provides an implementation of assemble(DepositSubmission, Map), and requires its subclasses to
implement:
PackageStream createPackageStream(DepositSubmission, List<DepositFileResource>, MetadataBuilder, ResourceBuilderFactory, Map<String, Object>)
where the List<DepositFileResource> is the custodial content of the submission, the MetadataBuilder allows
modification of the package-level metadata, and the ResourceBuilderFactory is used to generate an instance
of ResourceBuilder for each DepositFileResource.
The primary benefit of extending AbstractAssembler is that the logic for identifying the custodial resources in the
submission and creating their representation as List<DepositFileResource> is shared. Subclasses of AbstractAssembler
must instantiate and return a PackageStream.
Examples can be found at: DspaceAssembler, NihmsAssembler, InvenioRdmAssembler, and BagItAssembler
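The division of labor between assemble(...) and createPackageStream(...) can be sketched as a template method. The
block below is a minimal illustration, not the Deposit Services code: SubmissionSketch, PackageStreamSketch,
AbstractAssemblerSketch, ToyAssembler, and the "archive" option key are all hypothetical stand-ins, and the real
signatures carry more parameters (MetadataBuilder, ResourceBuilderFactory) than shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for DepositSubmission.
final class SubmissionSketch {
    final String name;
    final List<String> fileNames;

    SubmissionSketch(String name, List<String> fileNames) {
        this.name = name;
        this.fileNames = fileNames;
    }
}

// Hypothetical stand-in for PackageStream; it only describes itself.
interface PackageStreamSketch {
    String describe();
}

// Plays the role of AbstractAssembler: assemble(...) performs the shared work
// of resolving custodial resources, then delegates to the subclass.
abstract class AbstractAssemblerSketch {
    final PackageStreamSketch assemble(SubmissionSketch submission, Map<String, Object> options) {
        // Shared logic: build the list of custodial resources once, here.
        List<String> custodial = new ArrayList<>(submission.fileNames);
        return createPackageStream(submission, custodial, options);
    }

    abstract PackageStreamSketch createPackageStream(
            SubmissionSketch submission, List<String> custodial, Map<String, Object> options);
}

// A concrete assembler only decides how the resolved resources become a stream.
final class ToyAssembler extends AbstractAssemblerSketch {
    @Override
    PackageStreamSketch createPackageStream(
            SubmissionSketch submission, List<String> custodial, Map<String, Object> options) {
        // "archive" is an assumed option key, used for illustration only.
        Object archive = options.getOrDefault("archive", "ZIP");
        return () -> submission.name + "." + archive + " with " + custodial.size() + " file(s)";
    }
}
```

In this sketch the subclass never touches the logic that resolves custodial resources, which is the benefit the
section attributes to extending AbstractAssembler.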
PackageStream API
Assemblers are invoked by Deposit Services and return
a PackageStream. The PackageStream represents the content to be sent to a downstream repository. Conceptually,
the PackageStream behaves like a Java InputStream: the bytes for the stream can come from anywhere (memory, a
file on disk, or retrieved from another network resource), and can generally only be read once.
Practically, the PackageStream represents an archive file: either a ZIP, TAR, or some variant like TAR.GZ. This is
encapsulated by the ArchivingPackageStream class. Re-using the ArchivingPackageStream class has the advantage
that your package resources will be bundled up in a single archive file according to the options supplied to
the Assembler (e.g. compression and archive type to use).
Instantiating an ArchivingPackageStream requires an instance of PackageProvider.
There is also a SimplePackageStream class that contains the associated DepositSubmission, the List of
DepositFileResources, and the metadata to be sent to the repository. This class may be used in integrations where the
individual file resources are needed for processing the repository deposit. The InvenioRDM integration is one such
case.
PackageProvider API
The PackageProvider interface was developed as an ad hoc lifecycle for streaming a package: there are start(...)
and finish(...) methods, along with a packagePath(...) method. PackageProvider also defines a new interface:
SupplementalResource. SupplementalResource is what the finish(...) method returns, allowing the PackageProvider
implementation to generate supplemental content (e.g. BagIt tag files or METS.xml files) after the rest of the package
has been streamed.
Implementing PackageProvider therefore allows you to customize both where resources appear in the package and the
metadata that appears in the package.
Because packaging specifications generally have something to say about what resources are included where in the package,
a Package Provider is loosely coupled to a package specification. For example, a Package Provider that placed custodial
resources in the <package root>/foo directory would be incompatible with a BagIt packaging specification, which
requires custodial resources to appear under <package root>/data. Similarly, if your Package Provider is to align
with a DSpace METS packaging scheme, it will need to produce a <package root>/METS.xml file with the required content.
Therefore, any PackageProvider implementation can be used with any Assembler implementation as long as the package
specification shared between the two is not violated.
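The start/packagePath/finish lifecycle can be illustrated with a small sketch. Everything below is an assumption made
for illustration: ProviderSketch is a hypothetical stand-in that uses plain strings where the real PackageProvider
uses richer types, DataDirProvider is a made-up implementation, and index.txt is an invented supplemental file
standing in for BagIt tag files or METS.xml.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the PackageProvider lifecycle.
interface ProviderSketch {
    void start(String submissionName);        // called before any resource is streamed

    String packagePath(String fileName);      // decides where a custodial file lands

    Map<String, String> finish();             // supplemental path -> content, produced last
}

// Places custodial files under data/ and emits one supplemental file at the end.
// It keeps per-package state, so a new instance should be used for each package.
final class DataDirProvider implements ProviderSketch {
    private final List<String> streamedPaths = new ArrayList<>();
    private String submissionName;

    @Override
    public void start(String submissionName) {
        this.submissionName = submissionName;
        streamedPaths.clear();
    }

    @Override
    public String packagePath(String fileName) {
        String path = "data/" + fileName;
        streamedPaths.add(path); // remembered so finish() can describe the payload
        return path;
    }

    @Override
    public Map<String, String> finish() {
        // The supplemental content depends on what was streamed, which is why
        // it can only be generated after the custodial resources.
        Map<String, String> supplemental = new LinkedHashMap<>();
        supplemental.put("index.txt",
                submissionName + "\n" + String.join("\n", streamedPaths) + "\n");
        return supplemental;
    }
}
```

A provider that returned "foo/" + fileName from packagePath(...) would compile equally well, which is the sense in
which the coupling to a packaging specification is loose: nothing but the specification itself rules that layout out.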
Assembler Development Recap
Implementations of AbstractAssembler that return a single archive for deposit will return an ArchivingPackageStream
which uses a PackageProvider to path resources and generate supplemental metadata contained in the package. For
integrations that process each file resource, the AbstractAssembler implementation should return a SimplePackageStream
that can be used later by the Transport for processing.
When developing your own Assembler that returns an ArchivingPackageStream, you will need to:
- Extend AbstractAssembler
- Implement PackageProvider, including the logic to produce supplemental package content like BagIt tag files or DSpace METS.xml files
- Construct an ArchivingPackageStream with your PackageProvider and return that from your AbstractAssembler implementation
Examples of implemented package providers:
- NihmsPackageProvider
- BagItPackageProvider
When developing your own Assembler that returns a SimplePackageStream, you will need to:
- Extend AbstractAssembler
- Construct a SimplePackageStream and return that from your AbstractAssembler implementation
Examples of such assemblers:
- InvenioRdmAssembler
Concurrency
Assemblers exist in the Deposit Services runtime as singletons. A single Assembler instance may be invoked from
multiple threads; therefore, all the code paths executed by an Assembler must be thread-safe.
AbstractAssembler and ArchivingPackageStream are already thread-safe; your concrete implementation
of AbstractAssembler and PackageProvider will need to maintain that thread safety. Streaming a package inherently
involves maintaining state, including the updating of metadata for resources as they are streamed.
One strategy for maintaining thread safety is to scope any state maintained over the course of streaming a package to
the executing thread. Assembler implementations are free to use whatever mechanisms they wish to ensure thread
safety, but Deposit Services accomplishes this in its codebase by simply instantiating a new instance of
state-maintaining classes each time Assembler.assemble(...) is invoked, and by ensuring that state is not shared (i.e.
it is referenced only from local variables on the executing thread's stack, not from shared instance fields). For
example:
- AbstractAssembler instantiates a new MetadataBuilder each time using a factory pattern.
- AbstractAssembler implementations instantiate a new ArchivingPackageStream each time.
- DefaultStreamWriterImpl instantiates a new ResourceBuilder for each resource being streamed using a factory pattern.
The factory objects may be kept in shared memory (i.e. as instance member variables), but the objects produced by the
factories are referenced only from the executing thread's stack (as method-local variables). After a PackageStream has
been opened and subsequently closed, these objects are no longer referenced and become eligible for garbage collection
by the JVM. To help ensure thread safety, there is an integration test fixture, ThreadedAssemblyIT, which can be
subclassed and used by Assembler integration tests to verify thread safety.
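The "shared factory, per-invocation state" strategy can be sketched as follows. This is a minimal illustration under
stated assumptions, not the Deposit Services implementation: PackageState, StatelessAssembler, and ConcurrencySketch
are hypothetical names, and PackageState merely stands in for a state-maintaining class such as a MetadataBuilder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical per-package state; deliberately not thread-safe on its own.
final class PackageState {
    private final List<String> entries = new ArrayList<>();

    void record(String entry) {
        entries.add(entry);
    }

    List<String> entries() {
        return entries;
    }
}

// Hypothetical singleton assembler: the factory is a shared instance field,
// but the state it produces is referenced only from a local variable.
final class StatelessAssembler {
    private final Supplier<PackageState> stateFactory = PackageState::new;

    List<String> assemble(String submission, int fileCount) {
        PackageState state = stateFactory.get(); // fresh state for every invocation
        for (int i = 0; i < fileCount; i++) {
            state.record(submission + "/file-" + i);
        }
        return state.entries();
    }
}

final class ConcurrencySketch {
    // Invokes one shared assembler from several threads and collects the results
    // in submission order.
    static List<List<String>> runConcurrently(int threads, int filesPerSubmission) {
        StatelessAssembler shared = new StatelessAssembler();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final String submission = "submission-" + t;
                Callable<List<String>> task = () -> shared.assemble(submission, filesPerSubmission);
                futures.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("concurrent assembly failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Had PackageState been stored in an instance field of StatelessAssembler instead, concurrent invocations would write
into the same list and the entries of different submissions could interleave.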
Testing
Adequate test coverage of Assemblers includes proper unit testing. This document presumes that you've adequately unit
tested your implementation, and instead focuses on integration testing.
Integration testing of Assemblers is supported by some shared test fixtures in the core Deposit Services codebase.
ThreadedAssemblyIT
The approach taken by the shared ThreadedAssemblyIT is to invoke the Assembler under test directly using
random DepositSubmissions:
- A singleton Assembler implementation under test is retrieved from the IT subclass that you provide.
- A number of different DepositSubmissions are used to concurrently invoke Assembler.assemble(...) on the singleton instance under test.
- The PackageStreams returned by the Assembler under test are streamed to and stored on the filesystem.
- A package verifier supplied by the IT subclass verifies the content of the packages.
The advantage of extending ThreadedAssemblyIT is that it ensures that your Assembler can be invoked concurrently by
multiple threads while avoiding the complexity of setting up and configuring the Deposit Services runtime. The Spring
Framework is not used, the Deposit Services runtime is not required, and no Docker containers are needed: the IT is
simple Java and JUnit. The downside is that your full runtime is not being integration tested, only your Assembler.
To use ThreadedAssemblyIT, extend it, and implement the required methods:
- assemblerUnderTest(): provides an AbstractAssembler instance, fully initialized and ready to be invoked.
- packageOptions(): provides a set of package options, used when creating the PackageStream and storing it on disk. The package options include:
  - The package specification to be used
  - The compression algorithm used when creating the package
  - The checksumming algorithm to be used when calculating package and package resource checksums
- packageVerifier(): returns a PackageVerifier which inspects a package stored on the filesystem and verifies its content. You must implement a PackageVerifier for each Assembler being tested.
The test logic automatically executes in ThreadedAssemblyIT.testMultiplePackageStreams(). The PackageVerifier is
very important: it does most of the heavy lifting with respect to passing or failing the integration test, so it must be
well written and test all aspects of a generated package.
Examples of these ITs:
- BagItThreadedAssemblyIT
- NihmsThreadedAssemblyIT
PackageVerifier
Each Assembler that is developed should have a corresponding PackageVerifier. The PackageVerifier is the primary
interface for verifying that a package written to disk contains the expected content. The primary method to implement
is:
void verify(DepositSubmission, ExplodedPackage, Map<String, Object>)
where the DepositSubmission is the original submission, ExplodedPackage is the generated package on disk, and the Map
includes the options supplied to the Assembler that created the package.
The verifier is responsible for:
- Ensuring that every custodial file from the submission is present and accounted for in the package. 
- Ensuring there are no extraneous custodial files in the package that are not in the submission. 
- Ensuring that the custodial files' checksums are correct.
- Ensuring that the proper supplemental files are present in the package and have the correct content. 
Essentially all aspects of a generated package must be verified through a PackageVerifier.
The PackageVerifier interface includes a helper method, verifyCustodialFiles, for ensuring that there is a
custodial file in the package for each submitted file, and that there are no unexplained custodial files present in the
package:
void verifyCustodialFiles(DepositSubmission, File, FileFilter, BiFunction<File, File, DepositFile>)
where DepositSubmission is the original submission, the File is the directory on the filesystem that contains the
exploded package, the FileFilter selects custodial files from the package directory, and the BiFunction accepts
a DepositFile from the submission and maps it to its expected location in the package directory.
Examples:
- NihmsPackageVerifier
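The first two verifier responsibilities (every submitted file is present, and nothing extraneous is) can be sketched
with plain java.io types. This is an illustration only: CustodialCheckSketch, its check(...) and demoPackage(...)
methods, and the data directory name are hypothetical and are not part of the PackageVerifier interface. Checksum and
supplemental-file verification are left out.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class CustodialCheckSketch {

    // Compares the submitted file names with the custodial files selected by the
    // filter. Returns a sorted list of problems; an empty list means the check passed.
    static List<String> check(List<String> submittedNames, File custodialDir, FileFilter custodialFilter) {
        Set<String> expected = new TreeSet<>(submittedNames);
        Set<String> actual = new TreeSet<>();
        File[] found = custodialDir.listFiles(custodialFilter);
        if (found != null) {
            for (File file : found) {
                actual.add(file.getName());
            }
        }
        List<String> problems = new ArrayList<>();
        for (String name : expected) {
            if (!actual.contains(name)) {
                problems.add("missing: " + name);
            }
        }
        for (String name : actual) {
            if (!expected.contains(name)) {
                problems.add("unexpected: " + name);
            }
        }
        return problems;
    }

    // Builds a throwaway exploded package with a data/ directory, for demonstration.
    static File demoPackage(String... names) {
        try {
            File root = Files.createTempDirectory("exploded-package").toFile();
            File data = new File(root, "data");
            if (!data.mkdir()) {
                throw new IOException("could not create " + data);
            }
            for (String name : names) {
                Files.write(new File(data, name).toPath(), name.getBytes(StandardCharsets.UTF_8));
            }
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, checking a submission of a.pdf and b.pdf against a package directory that holds a.pdf and extra.bin
reports one missing and one unexpected file. A real verifier would fail the test on either finding.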
Runtime
Deposit Services is a Spring Boot application, and Assemblers are simply components executed within the application.
If you are familiar with Spring and/or Spring Boot, you are welcome to leverage its features as you wish. Regardless of
your views of Spring, you need to be aware of Spring in these cases:
- When extending SubmitAndValidatePackagesIT, your IT will need to use the SpringRunner
- Your Assembler implementation must be annotated with @Component
Wiring
So, how are your Assembler, PackageStream, and PackageProvider wired together? As outlined above, the wiring of
these components is straightforward. You can either "hardwire" your implementations at compile-time, or you can leverage
Spring dependency injection.
Deposit Services uses Spring Auto Configuration to discover your Assembler on the classpath on boot. Supporting Spring
Auto Configuration is simple: ensure that your Assembler implementation is annotated with @Component.