NIHMS Loader
The NIH Manuscript Submission Loader (NIHMS Loader) contains the components required to download, transform, and load Submission information from NIHMS to PASS.
Summary
The NIHMS Loader is a module contained in Pass-Support, and is composed of two Java command line tools. The first uses the NIH API to download the CSV(s) containing compliant, non-compliant, and in-process publication information. The second tool reads those files, transforms the data to the PASS data model, and then loads them to PASS.
For background information on the NIH Public Access Compliance Monitor (PACM), see the user guide. Limited information on the API is provided by the NLM Technical Bulletin
The NIHMS Loader operates in two stages:
Harvests data from the NLM’s Public Access Compliance Monitor (PACM) website about the compliance status of publications written by researcher PIs
It compares PACM data with
Submission
data in PASS and then adds or updatesSubmission
,Publication
, andRepositoryCopy
.
These two processes are separate Java command line interface (CLI) applications: the NIHMS Data Harvest CLI and the NIHMS Transform and Load CLI.
Knowledge Needed / Skills Inventory
Development of the NIHMS Loader
Programming in Java
Basic understanding of PMC data
REST/HTTP
Running the NIHMS Loader
CLI commands
Technologies Utilized
Technical Deep Dive
NIHMS Data Harvest CLI
The NIHMS Data Harvest CLI uses the NIH API to download the PACM data.
The following are required to run this tool:
Java 17+
Download the latest docker image or download the latest pass-support release and compile the
pass-nihms-loader
moduleAn account for the NIH PACM website, and obtain an API key. The API key is only valid for 3 months, so it will need to be updated periodically. There are a couple of ways to obtain an account, and the process is institution-specific.
Data Harvest Configuration
There are several ways to configure the Data Harvest CLI. Data Harvest CLI is a Spring Boot Application, so it can be configured using Spring Boot Configuration
You will need to set values for the following properties:
nihmsetl.api.url.param.inst
nihmsetl.api.url.param.ipf
nihmsetl.api.url.param.api-token
The full set of properties for the NIHMS Harvester is listed below. These properties are set in the resources/application.properties
file in the nihms-data-harvest
module.
Property | Default Value | Notes |
---|---|---|
nihmsetl.data.dir | /data/nihmsloader/data | Defines a directory where the files will be downloaded. If not specified, the default directory will be created for you. |
nihmsetl.api.host | www.ncbi.nlm.nih.gov | The host name of the API. The default should not change. If the API is moved to another host this will be updated by the PASS team. |
nihmsetl.api.scheme | https | The HTTP scheme of the API. The default should not change. |
nihmsetl.api.path | /pmc/utils/pacm/ | The API URL Path. The default should not change. If the API does change the URL, the default value will be updated by the PASS team. |
nihmsetl.http.read-timeout-ms | 30000 | Allow 30 seconds for a request to be read before timing out. |
nihmsetl.http.connect-timeout-ms | 30000 | Allow 30 seconds for establishing connections before timing out. |
nihmsetl.api.url.param.format | csv | The format for the downloaded file. The default should not change. The NIHMS Transform and Load requires a csv file. |
nihmsetl.api.url.param.inst | Name of the institution making the API request. The value for your organization can be found in the PACM website. | |
nihmsetl.api.url.param.ipf | IPF (Institutional Profile File) number, the unique ID assigned to a grantee organization in the eRA system. | |
nihmsetl.api.url.param.api-token | The API token retrieved from the PACM website. The token expires every three months. | |
nihmsetl.api.url.param.pdf | See Notes | Date in MM/YYYY format that the PACM data should start from. Can be set using the |
nihmsetl.api.url.param.pdt | See Notes | Date in MM/YYYY format that the PACM data should end. Leave blank to default to the current month. |
Running the Data Harvester
The simplest way to run the data harvester is to use the java command in a CLI with Java system property option, by using the -D parameter. Note if you're running on Windows you may have double quote the parameters, for example "-Dnihmsetl.api.url.param.inst=my-inst"
.
Once the Data Harvest CLI has been configured, there are a few additional options you can add when running from the command line.
By default, all 3 publication statuses - compliant, non-compliant, and in-process CSVs will be downloaded. To download one or two of them, you can add them individually at the command line:
You can also specify a start date, by default the PACM system sets the start date to 1 year prior to the date of the download. You can change this by adding a start date parameter. This will return all records published since the date provided. The syntax for this parameter is mm-yyyy .
So, for example, to download the compliant publications published since December 2012, you would do the following:
On running this command, files will be downloaded and renamed with a prefix according to their status ("compliant", " noncompliant", or "inprocess") and a timestamp integer e.g. noncompliant_nihmspubs_20180507104323.csv.
NIHMS Transform and Load
The NIHMS Data Transform-Load CLI reads data in from CSVs that were downloaded from the PACM system, converts them to PASS compliant data and loads them into the PASS database. Pre-requisites
The following is required to run this tool:
Java 17+
Download the latest docker image or download the latest pass-support release and compile the
pass-nihms-loader
module. The jar to run for this process isnihms-data-transform-load-exec.jar
found in thenihms-data-transform-load
module.
Data Transform-Load Configuration
There are several ways to configure the Data Transform-Load CLI. Data Transform-Load CLI is a Spring Boot Application, so it can be configured as described here: Spring Boot Configuration
You will need to set values for the following props: nihmsetl.repository.id, pass.client.url, pass.client.user and pass.client.password. There are a number of ways to do this with a Spring Boot app, which are described in the link above.
Here is an example using Java system properties -D.
java -Dnihmsetl.repository.id=my-repo-id -Dpass.client.url=my-url -Dpass.client.user=my-user -Dpass.client.password=my-pw -jar nihms-data-transform-load-exec.jar
Running the Data Transform-Load
Once the Data Transform-Load CLI has been configured, there are a few additional options you can add when running from the command line.
By default, all 3 publication statuses: compliant
, non-compliant
, and in-process
CSVs will be downloaded. To download one or two of them, you can add them individually at the command line:
So, for example, to process non-compliant spreadsheets only:
When run, each row will be loaded into the application and new Publications, Submissions, and RepositoryCopies will be created in PASS as needed. The application will also update any Deposit.repositoryCopy links where a new one is discovered. Once a CSV file has been processed, it will be renamed with a suffix of ".done" e.g. noncompliant_nihmspubs_20180507104323.csv.done. To re-process the file, simply rename it to remove the .done suffix and re-run the application.
Running the Harvester and Data Transform-Load using Docker
To run both the Harvest and Data Transform-Load using Docker you will also need PASS docker running, otherwise it will fail on the Transform-Load step.
Once PASS Docker is running, use docker pull to get the Pass NIHMS Loader image, and be sure to replace the image tag 1.8.0-snapshot
with the version that you want to pull and run.
Run the docker image with the following environment variables, using the docker -e parameter. On Windows you may have double quote the parameters, for example "-eNIHMS_API_INST=YOUR_INST"
.
NIHMS Data Harvester Classes and Relationships
Data Flow Overview
Initialization:
NihmsHarvesterCLIRunner
starts and processes command-line arguments.NihmsHarvesterCLI
configures and initializesNihmsHarvester
usingNihmsHarvesterConfig
.
URL Construction:
NihmsHarvester
usesUrlBuilder
andUrlType
to create URLs needed to access NIHMS data.
Data Download:
NihmsHarvesterDownloader
retrieves data from the constructed URLs.Downloaded data is passed back to
NihmsHarvester
.
Data Processing:
NihmsHarvester
processes the downloaded data.
Interactions and Dependencies
NihmsHarvester
is the central class, relying on configurations fromNihmsHarvesterConfig
, URL construction fromUrlBuilder
andUrlType
, and data downloading fromNihmsHarvesterDownloader
.NihmsHarvesterCLI
andNihmsHarvesterCLIRunner
are entry points for the application, primarily handling user interaction and delegation to the coreNihmsHarvester
.UrlBuilder
andUrlType
ensure that the URLs used for data retrieval are correctly constructed and categorized.
NIHMS Data Transform-Load Classes and Relationships
Data Flow Overview
Data Transformation and Loading:
NihmsTransformLoadCLIRunner
andNihmsTransformLoadCLI
handle command-line interactions for data transformation and loading.NihmsTransformLoadService
transforms publication data and loads it usingSubmissionLoader
.PmidLookup
service to retrieve a PMID record from the NBCI Entrez API.
Data Transfer:
SubmissionDTO
encapsulates transformed submission data for transfer between components.NihmsPassClientService
service to provide interactions with the data via the PASS client.SubmissionLoader
loads the final submission data into PASS and accomplishes this by interfacing with theNihmsPassClientService
.
Interactions and Dependencies
NihmsTransformLoadCLI
is the entry point and passes control toNihmsTransformLoadCLIRunner
.NihmsTransformLoadCLIRunner
initializes the process using configurations fromNihmsTransformLoadConfig
and performs the transformation and loading by invokingNihmsTransformLoadService
.NihmsTransformLoadConfig
provides necessary configuration settings toNihmsTransformLoadService
.NihmsTransformLoadService
orchestrates the transformation of data usingNihmsPublicationToSubmission
and the loading of data usingSubmissionLoader
.NihmsPublicationToSubmission
is responsible for transforming PMC publication data and associated data to aSubmissionDTO
, which is composed ofGrant
,Publication
,RepositoryCopy
, andSubmission
objects.SubmissionDTO
acts as a container for the transformed data.SubmissionLoader
is responsible for the final step of loading the data into PASS.
Next Step / Institution Configuration
Configuring the NIHMS loader to run at an institution requires several NIH/NLM accounts to be setup. The first step would be to familiarize yourself with the PACM Guide. Institutions that are universities typically have an Office of Research and that is a good starting point to finding out more information how PACR roles are assigned, typically a Signing Officer at the Office of Research can perform this function.
Once access to PACM has been established, configuring the infrastructure where the Data Harvester and the Data Transform-Load applications is the next step. Since they are Java applications they can be scheduled as cron jobs on a server or run in the cloud. At Johns Hopkins University these applications are run by using AWS Batch and ECS.
Optionally as a last step, setting up the NIH Manuscript Submission account would enable seeing which submissions have been made and how far along they are in the process. Once a submission makes its way through this process it appears in the PubMed Central data. This can be useful for troubleshooting records expected to be in the PubMed Central data.
Manual NIHMS Token Generation
Once a PACM account is established, generating the token can be done via the PACM Website. Use the eRA Commons login and sign in with your username and password. Note that the login instructions may be different for every institution as the authentication mechanism may differ. After logging in, use the link at the top of the page that is labeled API Token
. This page will generate a new token every time it is accessed and your new API token will appear here. A token is valid for three months, and currently there is no way of deleting a token, so each page access will create a new token.
NIHMS Token Refresh
The NIHMS harvester process requires an Authentication token. This token is available from the PACM Utils page and is valid for three months. There is currently no API available to refresh the token.
In order to provide an automatic token refresh, there is a module within the NIHMS loader called nihms-token-refresh
that performs an automatic refresh of the token and updates the AWS Parameter Store. More information about how this module works in a production environment can be found in the Operation/Production Data Loaders section.
Related Information
The resources below will assist in setting up the accounts required to run the NIHMS Loader and understanding the submission process.
PubMed Central Resources
NIHMS Resources
Last updated