Data Loaders
This section describes how the data loaders are implemented in the JHU AWS production environment.
AWS Batch: The Journal Loader is executed as a batch job using AWS Batch. The Journal Loader batch job retrieves, processes, and uploads journal data to the PASS system.
ECS: The batch job is executed in ECS Fargate within AWS. The environment is deployed in the main VPC.
EventBridge: AWS EventBridge is used to schedule the execution of the Journal Loader job.
AWS Batch: The Grant Loader is executed as a batch job using AWS Batch. The Grant Loader batch job pulls grant data from the grant management databases used by Johns Hopkins University (JHU).
ECS: The batch job is executed in ECS Fargate within AWS. The environment is deployed in the main VPC and the JHU-connected VPC.
S3: The configuration of the Grant Loader is persisted in S3. The policy information (policy.properties) and grant_update_timestamps are stored in an S3 bucket.
EventBridge: AWS EventBridge is used to schedule the execution of the Grant Loader job. The target of the schedule is a Step Function that splits pulling and loading the data into separate steps. The Step Function also retries failed loads, up to a set number of attempts with a wait interval between them.
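The pull-then-load split with retries can be illustrated with a small Amazon States Language definition built in Python. This is only a sketch under stated assumptions: the state names, job names, job queue, job definition, and the retry values (3 attempts, 300 seconds apart) are hypothetical placeholders, not the values used in the JHU deployment.

```python
import json


def batch_task(job_name, next_state=None, max_attempts=3, interval_seconds=300):
    """Build a Task state that submits an AWS Batch job and waits for it to finish."""
    state = {
        "Type": "Task",
        # Step Functions service integration: run a Batch job synchronously.
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": job_name,
            "JobQueue": "example-job-queue",            # hypothetical name
            "JobDefinition": "example-job-definition",  # hypothetical name
        },
        # Retry any failure a fixed number of times with a fixed wait in between.
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": interval_seconds,
                "MaxAttempts": max_attempts,
                "BackoffRate": 1.0,
            }
        ],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def grant_loader_definition():
    """Two-step state machine: pull the grant data, then load it."""
    return {
        "Comment": "Sketch of a pull-then-load workflow (placeholder values)",
        "StartAt": "PullGrantData",
        "States": {
            "PullGrantData": batch_task("grant-pull", next_state="LoadGrantData"),
            "LoadGrantData": batch_task("grant-load"),
        },
    }


if __name__ == "__main__":
    print(json.dumps(grant_loader_definition(), indent=2))
```

Keeping the pull and the load as separate states means a failed load can be retried on its own without pulling the data again.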
AWS Batch: The NIHMS Loader is executed as a batch job using AWS Batch. It pulls data from the NLM’s Public Access Compliance Monitor (PACM) API and uses it to create and update submissions.
ECS: The batch job is executed in ECS Fargate within AWS. The environment is deployed in the main VPC.
EventBridge: AWS EventBridge is used to schedule the execution of the NIHMS Loader job.
The NIHMS Data Loader Harvester process requires an NIHMS API Authentication token. This token is available from the NIHMS/PACM utils page and is valid for three months.
If any modifications to the account need to be made, such as changing the email associated with the account, they must be made through the eRA administrator.
There currently is no NIHMS/PACM API available to refresh the NIHMS API authentication token. Obtaining/refreshing the token must be done using the NIHMS PACM website.
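Because the token expires and cannot be refreshed through an API, the refresh has to be triggered before the expiry date. The check below is a minimal sketch of that decision. It assumes the three-month validity can be approximated as 90 days and that a refresh two weeks before expiry is acceptable; both values and the function name are illustrative assumptions, not part of PASS.

```python
from datetime import date, timedelta

# Assumption: "three months" approximated as 90 days.
TOKEN_LIFETIME = timedelta(days=90)
# Assumption: refresh two weeks before the assumed expiry date.
REFRESH_MARGIN = timedelta(days=14)


def refresh_due(issued_on: date, today: date) -> bool:
    """Return True once the token is within REFRESH_MARGIN of its assumed expiry."""
    return today >= issued_on + TOKEN_LIFETIME - REFRESH_MARGIN
```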
PASS has created a Robotic Process Automation (RPA) script that generates a new API token and sets the token into a store. The default store is AWS Parameter Store. The token is generated by running a script that logs into the PACM utils page using the eRA Commons login, clicks the API Token link, and writes the new token to the file named in the NIHMS_OUTFILE environment variable. The value in the NIHMS_OUTFILE file is then set into the parameter store.
There is a Docker image available named ghcr.io/eclipse-pass/pass-nihms-token-refresh:<version>.
To run the token refresh RPA, the following must be passed to the Docker image as environment variables:
NIHMS_USER: The eRA Commons login username
NIHMS_PASSWORD: The eRA Commons login password
NIHMS_OUTFILE: The full path of the file to which the new token is written
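As an illustration of passing these variables, the sketch below assembles a docker run command. It assumes NIHMS_USER and NIHMS_PASSWORD are already set in the calling environment (docker forwards a variable given as -e NAME without a value), and it uses a hypothetical /data mount point and output file name so the token file is visible on the host. The image may need further options that are not described here.

```python
import subprocess

IMAGE = "ghcr.io/eclipse-pass/pass-nihms-token-refresh"


def build_docker_command(version, host_dir, outfile_name="nihms_token.txt"):
    """Assemble the docker run arguments for the token refresh image."""
    container_dir = "/data"  # hypothetical mount point inside the container
    return [
        "docker", "run", "--rm",
        # Forward the credentials from the calling environment without
        # placing their values on the command line.
        "-e", "NIHMS_USER",
        "-e", "NIHMS_PASSWORD",
        # Path, as seen inside the container, where the token is written.
        "-e", f"NIHMS_OUTFILE={container_dir}/{outfile_name}",
        # Bind mount so the token file ends up in host_dir on the host.
        "-v", f"{host_dir}:{container_dir}",
        f"{IMAGE}:{version}",
    ]


def run_token_refresh(version, host_dir):
    """Run the container and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_docker_command(version, host_dir), check=True)
```

Call run_token_refresh with the image version you deploy and a host directory that should receive the token file.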
If you are setting the token into AWS Parameter Store, the RPA assumes that the needed AWS CLI authentication is in place, for example by passing AWS keys as environment variables or by using IAM roles.
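The step of moving the token from the NIHMS_OUTFILE file into Parameter Store could look roughly like the following. This is a sketch of the idea rather than the RPA's actual implementation: the function name, the parameter name, and the choice of a SecureString parameter are assumptions. The client is passed in so that credentials are resolved however the environment provides them.

```python
from pathlib import Path


def store_token(ssm_client, outfile, parameter_name):
    """Read the token file and write its content to AWS Parameter Store.

    ssm_client is expected to behave like boto3.client("ssm").
    Returns the token length so callers can log something non-secret.
    """
    token = Path(outfile).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"No token found in {outfile}")
    ssm_client.put_parameter(
        Name=parameter_name,
        Value=token,
        Type="SecureString",  # assumption: the token is stored encrypted
        Overwrite=True,       # replace the previous, expiring token
    )
    return len(token)
```

With boto3 installed, a call might look like store_token(boto3.client("ssm"), os.environ["NIHMS_OUTFILE"], "/example/nihms/api-token"), where the parameter name is a placeholder.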
To obtain an NIHMS API Authentication token, you must create an account with NIH/eRA. See this page for instructions on creating a user account with NIH/eRA Commons:
System Owner: The system owner is the NIH, but account management is delegated to a University’s Office of Sponsored Research. In JHU’s case, this is: . For any other university setting up their own NIHMS data loader, it will be the Office of Sponsored Research that creates the account for the API key.
Account Setup: In the case of JHU, the account needs to be set up by . They need one of the following permissions to create an account: SO, AO, AA, or BO. They also cannot be affiliated with more than one institution.
More information regarding the NIHMS Loader can be found in the