Usage manual

After Pub2Tools is installed, it can be executed on the command line by running the java command (from a Java Runtime Environment (JRE) supporting at least Java 8) with the compiled Pub2Tools .jar file as argument. For example, executing Pub2Tools with the argument -h or --help outputs a list of all possible parameters and commands:

$ java -jar path/to/pub2tools-<version>.jar --help

Running Pub2Tools consists of executing it multiple times: first with Setup commands that copy or fetch the files required as prerequisites, then with the Steps commands that, once all have been run, generate the end results. Commands of Pub2Tools begin with one dash (-) and parameters, which supply required arguments to commands or influence their behaviour, begin with two dashes (--). All commands must be followed by the output directory path where all files of a Pub2Tools run will end up.

Setup commands

Setup commands can be run in any order, and also multiple times – previous files will just be overwritten. Setup commands must be run such that all files required for the Steps are present in the output directory: EDAM.owl, tf.idf, tf.stemmed.idf, biotools.json, pub.txt and db.db.

Note

Once any of the Steps has completed successfully, Setup commands can no longer be run – only Steps commands are allowed, to finalise the current Pub2Tools run.

-copy-edam

Copy the EDAM ontology file in OWL format given with the --edam parameter to EDAM.owl in the given output directory. The EDAM ontology is used in the last -map step to add EDAM annotations to the results.

The given OWL file can be a path to a file in the local file system or a web link, in which case the file will be downloaded from the link to the given output directory. Fetching of the link can be influenced by the --timeout and --user-agent parameters.

Note

In the same way, either a local file or a web resource can be given as the source file for all following -copy commands.

Examples copying the EDAM ontology file to the directory results:

$ java -jar path/to/pub2tools-<version>.jar -copy-edam results --edam path/to/EDAM.owl
$ java -jar path/to/pub2tools-<version>.jar -copy-edam results --edam http://edamontology.org/EDAM.owl

-copy-idf

Copy two IDF files (one where words are stemmed and another where they are not) given with the --idf and --idf-stemmed parameters to tf.idf and tf.stemmed.idf in the given output directory. tf–idf weighting is used in multiple parts of Pub2Tools: the unstemmed version (tf.idf) is used in -pass1 and -pass2, and in the -map step either tf.idf or tf.stemmed.idf is used, depending on the parameters used (stemming is the default).

Pre-generated IDF files are provided in the EDAMmap repo: biotools.idf and biotools.stemmed.idf. However, these files can also be generated from scratch using EDAMmap: more info in the IDF section of the EDAMmap manual.

Example copying the (either downloaded or generated) IDF files from their location on local disk to the results directory:

$ java -jar path/to/pub2tools-<version>.jar -copy-idf results --idf path/to/tf.idf --idf-stemmed path/to/tf.stemmed.idf

-get-biotools

Fetch the entire current content of bio.tools through the bio.tools API to the file biotools.json in the given output directory. This file of existing bio.tools content is used in the -pass2 step to determine which results are already present in bio.tools.

Calls the same code as the EDAMmap-Util command -biotools-full.

Example fetching bio.tools content to the results directory:

$ java -jar path/to/pub2tools-<version>.jar -get-biotools results

-copy-biotools

Copy the file containing bio.tools content in JSON format given with the --biotools parameter to the file biotools.json in the given output directory. This -copy command can be used instead of -get-biotools to re-use a biotools.json file downloaded with -get-biotools as part of some previous Pub2Tools run or to use some alternative JSON file of bio.tools entries.

Example copying bio.tools content to the results directory:

$ java -jar path/to/pub2tools-<version>.jar -copy-biotools results --biotools path/to/biotools.json

-select-pub

Fetch publication IDs of journal articles from the given period that are potentially suitable for bio.tools, writing them to the file pub.txt in the given output directory. The resulting file is used as input to the -fetch-pub step, which downloads the content of these publications, forming the basis of the search for new tools and services to add to bio.tools. Only articles matching certain (changeable) criteria are selected, as otherwise the number of publications to download for the given period would be too large – this filtering reduces the number of publications roughly 50-fold (with default options).

The granularity of the selectable period is one day and the range can be specified with the parameters --from and --to. As argument to these parameters, an ISO-8601 date must be given, e.g. 2019-08-23. Instead of --from and --to, the parameters --month or --day can be used. The parameter --month specifies one concrete month as the period (e.g. 2019-08), so that the number of days in a month doesn’t have to be known to cover a whole month. The parameter --day specifies just one whole day as the period (e.g. 2019-08-23).

Fetching the publication IDs works by sending large query strings to the Europe PMC API and extracting the PMID, PMCID and DOI from the returned results. The query string consists of the date range, the content source, (OR-ed together) phrases that must be present in the abstract to try to restrict the output to articles about tools, an optional custom search string to restrict the output further to a desired theme, and (AND-ed together) phrases that must not appear in the publication abstract or title to try to remove some false positives (a sketch of this query construction follows the list below):

  • The date range is specified with --from and --to or --month or --day, as explained above. The search field filled with the specified date is “CREATION_DATE”, which is the first date of entry of the publication into Europe PMC. This is not necessarily equal to the (print or electronic) publication date of the journal article, as a publication can be added to Europe PMC some time after it has been published, but also ahead of publication. The search field “CREATION_DATE” is used instead of the publication date to try to ensure that the set of publications returned for some date range remains the same at different points in time. For example, if all publications of August are queried at some date in September, we would want to get more or less the same results at some query date in October. If the article publication date were used as the search field, then some articles published in August might have been added to Europe PMC in the meantime, meaning that the query made in October would return more results and the query made in September would have missed those newly added publications. Using “CREATION_DATE” enables us to query the publications added to Europe PMC in August only once and not bother with that date range in later queries.

  • Europe PMC has content from different sources (Sources of content in Europe PMC). Pub2Tools searches “MED” (PubMed/MEDLINE NLM) and “PMC” (PubMed Central). As we are only interested in the PMID and PMCID, we request a minimal amount of information in the results to save bandwidth (resultType=idlist).

    But Pub2Tools also searches the source “PPR” (Preprints). The inclusion of preprints in Europe PMC is a nice feature that enables Pub2Tools to easily extend the search for tools to publications in services like bioRxiv and F1000Research. In the case of preprints, usually only a DOI can be obtained and, as the minimal results do not contain it, the query for publication IDs from preprints is executed separately (resultType=lite).

  • Phrases that must appear in a publication abstract are divided into categories, and by combining these categories we get the final requirement a publication abstract must meet to be included in the selection. The files corresponding to these categories are the following:

    • excellent, e.g. “github”, “implemented in r”, “freely available”
    • good, e.g. “available for academic”, “sequence annotation”, “unix”
    • mediocre1, e.g. “our tool”, “paired-end”, “ontology”
    • mediocre2, e.g. “computationally”, “high-throughput”, “shiny”
    • http, e.g. “https”, “index.html”
    • tool_good, e.g. “server”, “plugin”
    • tool, e.g. “tool”, “pipeline”, “repository”

    Out of these, only phrases from excellent are sufficient on their own to meet the inclusion requirement. Phrases from other categories must be combined, e.g. an abstract matching one phrase from good and another from http will also meet the requirement. This can be directly encoded as a Europe PMC query by AND-ing together the OR-ed phrases of good with the OR-ed phrases of http. However, some combinations can’t be expressed as Europe PMC queries, for example one phrase from good, another from tool and a third, different one also from tool. In that case, results for all phrases of tool must be fetched from Europe PMC one by one and programmatically combined in Pub2Tools. In total, the following combinations are done:

    • excellent
    • good + http
    • good + tool_good
    • mediocre1 + http + tool
    • mediocre2 + http + tool
    • mediocre1 + tool_good + tool
    • mediocre2 + tool_good + tool
    • http + tool_good
    • good + tool + tool
    • mediocre + tool + tool + tool
    • http + tool + tool
    • tool_good + tool_good
    • tool_good + tool + tool

    Note

    The category mediocre is split into two in the implementation simply because otherwise query strings sent to Europe PMC would get too long for it to handle.

    Note

    It seems that common words (stop words maybe) are filtered out, e.g. ABSTRACT:"available as web" and ABSTRACT:"available at web" give the same results (however ABSTRACT:"available web" gives different results, so some stop word must be present in-between).

    This restricting by phrase combinations is done to considerably narrow down the returned number of publication IDs, so that the amount of publication content to be downloaded remains reasonable. It also reduces potential false positives further down the line, but inevitably, some good articles about tools also get discarded in the process. To disable this restriction, the parameter --disable-tool-restriction can be supplied. However, doing so would significantly increase the number of results (mostly in the form of false positives), so --disable-tool-restriction should really be used in conjunction with --custom-restriction.

  • A custom search string can be specified after the parameter --custom-restriction to further restrict the results, for example to some desired custom theme. How to construct such a search string, what fields can be searched and how to combine them can be seen in Search syntax reference of Europe PMC. For example, something like --custom-restriction '"COVID-19" OR "SARS-CoV-2" OR ABSTRACT:"Coronavirus"' could be used to restrict the output of Pub2Tools to tools related to the COVID-19 pandemic (in reality, the search string should be made a bit more elaborate). Constructed search strings can be tested using the search box at https://europepmc.org/. If the search string is specific enough and looking through every hit is important, then other restrictions can potentially be disabled with --disable-tool-restriction and potentially also --disable-exclusions.

  • As the last part of the query string sent to Europe PMC, phrases that must not appear in the publication abstract or title are specified to remove some systematic false positives. These help to exclude a few publications that are otherwise selected, but that are actually not about a tool or service (mostly, publications about medical trials and review articles are excluded this way). The exclusion phrases are specified in the files not_abstract.txt (e.g. “trial registration”, “http://clinicaltrials.gov”) and not_title.txt (e.g. “systematic review”, “controlled trial”). To disable these exclusions, the parameter --disable-exclusions can be supplied (this would mostly just introduce a small number of additional false positives).
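
To illustrate how these parts fit together, the following is a minimal sketch (in Python, not the actual Java implementation of Pub2Tools) of composing such a query and sending it to the Europe PMC REST API; the phrase lists are small stand-ins for the category files described above:

import requests

def build_query(start, end, good, http_phrases, not_abstract):
    # date range on the CREATION_DATE field, as explained above
    date = f'(CREATION_DATE:[{start} TO {end}])'
    # content sources for the non-preprint query
    src = '(SRC:MED OR SRC:PMC)'
    # one of the phrase combinations: good + http
    good_q = ' OR '.join(f'ABSTRACT:"{p}"' for p in good)
    http_q = ' OR '.join(f'ABSTRACT:"{p}"' for p in http_phrases)
    # exclusion phrases that must not appear in the abstract
    excl = ' '.join(f'NOT ABSTRACT:"{p}"' for p in not_abstract)
    return f'{date} AND {src} AND ({good_q}) AND ({http_q}) {excl}'

query = build_query('2019-08-01', '2019-08-31',
                    good=['available for academic', 'unix'],
                    http_phrases=['https', 'index.html'],
                    not_abstract=['trial registration'])
# resultType=idlist returns just the ID and source, which is enough to
# derive the PMID (source MED) or PMCID (source PMC)
resp = requests.get('https://www.ebi.ac.uk/europepmc/webservices/rest/search',
                    params={'query': query, 'resultType': 'idlist',
                            'format': 'json', 'pageSize': 1000})
for result in resp.json()['resultList']['result']:
    print(result['source'], result['id'])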

Some journals publish articles suitable for bio.tools more often than others. As the selection of publications with phrases that must appear in the abstract is not perfect and sometimes excludes good articles, it makes sense to not use this mechanism for some highly relevant journals and instead download all publications of the given period from these journals. If the number of such journals is not too high, this does not significantly increase the total number of publications that must be downloaded. The list of such high-priority journals is specified in the file journal.txt. Phrase exclusion with not_abstract.txt and not_title.txt is still done (unless --disable-exclusions is specified) and additional restrictions from --custom-restriction also apply. Separate selection from these journals is not done if the parameter --disable-tool-restriction is specified.

Two equivalent examples fetching all publication IDs for the month of August 2019 to the directory results:

$ java -jar path/to/pub2tools-<version>.jar -select-pub results --from 2019-08-01 --to 2019-08-31
$ java -jar path/to/pub2tools-<version>.jar -select-pub results --month 2019-08

Example selecting publication IDs from publications added to Europe PMC on the 23rd of August 2019:

$ java -jar path/to/pub2tools-<version>.jar -select-pub results --day 2019-08-23

Example selecting publication IDs of COVID-19 related articles for the year 2020 (normally, selecting a large time span can get too slow because of the large number of phrase combinations and too many returned IDs; here, --custom-restriction narrows the output down enough):

$ java -jar path/to/pub2tools-<version>.jar -select-pub results --from 2020-01-01 --to 2020-12-31 --custom-restriction '"2019-nCoV" OR "2019nCoV" OR "COVID-19" OR "SARS-CoV-2" OR "COVID19" OR "COVID" OR "SARS-nCoV" OR ("wuhan" AND "coronavirus") OR "Coronavirus" OR "Corona virus" OR "corona-virus" OR "corona viruses" OR "coronaviruses" OR "SARS-CoV" OR "Orthocoronavirinae" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome" OR ("SARS" AND "virus") OR "soluble ACE2" OR ("ACE2" AND "virus") OR ("ARDS" AND "virus") or ("angiotensin-converting enzyme 2" AND "virus")'

-copy-pub

Copy the file containing publication IDs to download with -fetch-pub from the path given with the --pub parameter to the file pub.txt in the given output directory. This -copy command can be used instead of -select-pub in order to use publication IDs obtained through some other means (for example, a manually created list of publication IDs).

Example copying the file containing publication IDs to the results directory:

$ java -jar path/to/pub2tools-<version>.jar -copy-pub results --pub path/to/pub.txt

-init-db

Initialise an empty PubFetcher database to the file db.db in the given output directory. The database is used to store the contents of the publications fetched with -fetch-pub and webpages and docs fetched with -fetch-web. The database is read for getting the contents of publications in -pass1, for the contents of webpages and docs in -pass2 and for all content in -map.

Note

In contrast to other Setup commands, -init-db will not automatically overwrite an existing file (as the filling of an existing database file might have taken a lot of resources), so db.db must be explicitly removed by the user if -init-db is to be run a second time.

Calls the same code as the PubFetcher CLI command -db-init.

Example initialising an empty database file to the results directory:

$ java -jar path/to/pub2tools-<version>.jar -init-db results

-copy-db

Copy the PubFetcher database given with the --db parameter to the file db.db in the given output directory. This can be used instead of -init-db in order to use a database full of publications and web pages obtained through some other means.

Example copying an existing database to the results directory:

$ java -jar path/to/pub2tools-<version>.jar -copy-db results --db path/to/db.db

Steps

Once setup is done, the steps must be run in the given order: -fetch-pub, -pass1, -fetch-web, -pass2 and -map. A re-run, starting from any step, is also possible – previous results will be overwritten. If there is confidence that the set of publications and web pages has not changed, then -fetch-pub and -fetch-web can be skipped, provided they have been run at least once – although running -fetch-pub and -fetch-web a second time might be beneficial, as some previously inaccessible or slow web resources might now be online. After a step successfully concludes, the next step to be run is written to step.txt. Once all steps have completed successfully, the files results.csv, diff.csv and to_biotools.json will be present in the given output directory.

-fetch-pub

This will fetch publications for publication IDs given in pub.txt to the database file db.db in the given output directory. Fetching is done like in the -db-fetch-end method of PubFetcher. Fetching behaviour can be influenced by the Fetching parameters and by --fetcher-threads that sets how many threads to use for parallel fetching (default is 8).

For best results, before a major run the PubFetcher scraping rules could be tested with the PubFetcher CLI command -test-site, especially if this hasn’t been done in a while. Also for better results, -fetch-pub could potentially be run multiple times, spaced out by a few days, as some web pages might have been temporarily inaccessible the first time. A re-run is quicker, as fetching is not retried for resources that were fetched to final state the first time. And for better results still: sometimes full texts of publications are downloaded directly from publisher sites, so using Pub2Tools in a network with better access to those is beneficial.

Example of running the step with some non-default parameter values:

$ java -jar path/to/pub2tools-<version>.jar -fetch-pub results --timeout 30000 --journalsYaml fixes.yaml --fetcher-threads 16

-pass1

The first pass of the Pub2Tools algorithm will load all publications from db.db corresponding to the publication IDs in pub.txt and iterate over these publications, trying to find a name for the tool or service each publication is potentially about, assigning a goodness score to the name suggestion and trying to find web links of the tool in the publication abstract and full text. The unstemmed tf.idf is also read, as tf–idf weighting is used as part of the scoring and link matching. Results are output to pass1.json (for input to the second pass -pass2) and matched links to web.txt and doc.txt (so that -fetch-web can download their contents).

Publications whose abstract length is larger than 5000 or whose full text length is larger than 200000 are discarded altogether. Then, text from publications needs to be preprocessed – support for this comes from EDAMmap. Input is tokenised and processed, for example everything is converted to lowercase and stop words are removed. Such processing is useful for comparisons etc.; however, tokens closer to the original form (e.g. preserving the capitalisation) are also kept, as these might be what we want to output to the user. Code to divide the input into sentences and to extract web links has also been implemented in EDAMmap and is used here. This implementation might not be perfect, but it has enabled devising regexes and hacks dealing with quirks and mistakes specific to the input got from publications. The removal of stop words and some other preprocessing (except --stemming) can be influenced by the Preprocessing parameters.

Then, the process of looking at all possible phrases in the publication title and abstract as potential names of a tool or service begins. The goodness scores of the phrases are calculated and modified along the way (a simplified sketch follows the list below):

  • First, words in the title and abstract are scored according to tf–idf weighting, using the tf.idf file generated from bio.tools content. A unique word (according to bio.tools content) appearing once in the abstract will get a score of 1. The more common the word is, the lower the score, according to a formula. If the word occurs more than once in the title and abstract, then the score will be higher. Scores are also calculated for short phrases (a multi-word tool name), using the scores of their constituent words.

  • Quite often, the tool name is present in a publication title as “Tool name: foo bar”, “Tool name - a foo bar”, etc. Extracting the phrase before “: ”, “ - ”, etc, and removing some common words like “database”, “software”, “version”, “update”, etc, from that phrase would result in a phrase (the tool_title) that we have more confidence in being the tool name. Thus, we increase the score of that phrase by multiplying its score from the last step with a constant (or initialise it to a constant if the extracted phrase is a new combination). The tool_title could also be an acronym with the expanded name occurring somewhere in the abstract. Fittingly, matching acronyms to their expanded forms is also supported (here and in the next steps).

  • In a publication abstract about a tool, certain words tend to occur more often just before or after the tool’s name than they occur elsewhere. So, if a candidate phrase has one such word before or after it, the probability that the phrase is a tool name is higher and we can increase its score. The list of such words that often occur just before or after (or one step away from) a tool name was bootstrapped by tentatively taking the tool_title (where available) as the tool name. These bootstrapped words were divided into tiers based on how strongly they prefer to occur around the tool_title, and thus how much they should increase the score. For example, the best words to occur before a tool name are in before_tier1.txt (e.g. “called”, “named”) and after a tool name in after_tier1.txt (e.g. “freely”, “outperforms”). Words raising the score less are in before_tier2.txt and after_tier2.txt, before_tier3.txt and after_tier3.txt. Having these word lists, we can iterate through each candidate phrase in the title and abstract and raise the score by some amount depending on the tier (but up to a limit) each time a “before” or “after” word is found in the neighbourhood.

  • If an abstract contains web links, we can be somewhat more certain that the publication is about a software tool or service, as in such publications links to the tool are often put in the abstract. However, such links can point to other things as well, for example to some resource used in the publication. So what we would like to do is match these links to phrases in the abstract and increase the score of candidate phrases that have matching links. In addition to matching links in the abstract, it also makes sense to match the candidate phrases from the abstract to links in the full text of the publication (with a smaller matching score in that case), as often the homepage of the tool is not put into the abstract or additional links appear in the full text (the repository of the tool, some documentation links, etc). The matching of links to phrases increases the score of some phrases and thus helps in finding the most likely tool name; in addition, once the name has been chosen, we can possibly suggest a homepage, documentation links, etc (done in -pass2) based on the links attached to the name.

    The matching of links is done by extracting the part of the link URL string that is most likely a tool or service name and matching it in various forms (including in acronym form) to the candidate phrase. The part extracted from the URL string is either a path part, or if there is no path or all path parts are too unlikely, then it is extracted from the domain name. Choosing the correct path part is done from right to left with the unlikeliness of being the tool name decided mainly by tf–idf weighting. If the name has to be extracted from the domain name, then the lowest level domain name part is chosen, unless it matches some hardcoded patterns or any of the words in host_ignore.txt (in which case, the link can’t be matched to any phrases at all).

    In some cases, the tool or service name can correctly be extracted from the link, yet it doesn’t match any phrases in the publication title or abstract simply because the tool name is not mentioned there. To also catch and potentially include such publications, such orphaned link parts are added to the pool of candidate phrases (from_abstract_link of such name suggestions is set to true). In other cases, the matching of links fails for some other reason or the extraction of the link part fails to work correctly, so as a backup mechanism, candidate phrases are also matched to any part of a link URL (but in case of a match, the score of the phrase is not increased); and if this also fails, then in case a variation of the word “available” appears in the same sentence as the unmatched link, that link will be attached to the phrase (again, without increasing the score of the phrase), unless it appears to be a link to a dataset repository.
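
The following is a simplified, hypothetical sketch of this scoring scheme (the exact formulas, multipliers and limits in Pub2Tools differ); it only illustrates how IDF weights and the tiered neighbour words combine:

import math

def idf_weight(word, doc_freq, n_docs):
    # rarer words (rare in bio.tools content, as captured by tf.idf)
    # get a higher weight
    return math.log((n_docs + 1) / (doc_freq.get(word, 0) + 1))

def phrase_score(words, counts, doc_freq, n_docs):
    # a phrase's base score combines the weights of its constituent
    # words, boosted when a word occurs repeatedly in title and abstract
    return sum(idf_weight(w, doc_freq, n_docs) * (1 + math.log(counts[w]))
               for w in words) / len(words)

BEFORE_TIER1 = {'called', 'named'}       # from before_tier1.txt
AFTER_TIER1 = {'freely', 'outperforms'}  # from after_tier1.txt

def context_boost(tokens, start, end, score, step=0.25, cap=2.0):
    # raise the score each time a tier-1 neighbour word is found next to
    # the candidate phrase tokens[start:end], up to a limit
    # (lower tiers would use a smaller step)
    boost = 1.0
    if start > 0 and tokens[start - 1] in BEFORE_TIER1:
        boost += step
    if end < len(tokens) and tokens[end] in AFTER_TIER1:
        boost += step
    return score * min(boost, cap)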

Once the final score has been calculated, candidate phrases are ordered by it and the top one suggested as the tool or service name. Up to 4 more candidates can be output, if their scores are not too low compared to the top one. This usually means that for publications where the confidence in the top choice is not very high, other options besides the top one will also appear.

The publications themselves can also be ordered based on the scores of their top choices, and possibly a threshold could be drawn somewhere, below which we would say that the publication is not about a tool or service (however, the final decision is made at the end of the second pass, taking some additional aspects into account).

Note

A higher score does not mean a “better”, higher impact, etc tool. It just means that Pub2Tools is more confident about the correctness of the extracted tool name.

Note

One publication can possibly be about more than one tool. Currently, we only detect this when the names of such tools are in the title and separated by “and” or “&” – in that case, we split the publication into independent results for each tool.

The final output of the first pass of Pub2Tools needs some cleaning:

  • For example, there are often problems with web links: sometimes links are “glued” together and should be broken into two separate links, sometimes there seems to be garbage at the end of a link, sometimes the scheme (protocol) string in front of the URL is truncated, etc. We can fix some such mistakes by guesswork, or sometimes a problematic link in the abstract has a correct version in the full text or vice versa. After fixing the links, we also keep the unfixed versions, because they might have been correct after all (this will be known after trying to resolve the links).
  • Mistakes in the source material can also make other output invalid. For example, the publication DOIs sometimes contain garbage that causes them to be discarded.
  • Even if the output seems to be correct, it has to be valid according to biotoolsSchema, and this can cause further modifications to be made. For example, some attribute values might need to be truncated because of maximum length requirements. Or, according to biotoolsSchema, the extracted tool name can only contain Latin letters, numbers and a few punctuation symbols and thus, invalid characters are either replaced (accents, Greek letters, etc) or discarded altogether.

In the end, results are written to pass1.json for further processing by -pass2. Results contain the publication IDs and other information about the publication, like the title, the possible name extracted from the title (tool_title), sentences from the abstract, journal title, publication date, citations count and corresponding authors, but also the suggested tool name (or names, in case of multiple suggestions) along with the suggestion’s score and links from the abstract and full text matching the name. All matched links are divided into documentation links and other links (based on the URL string alone) and written to doc.txt and web.txt for fetching by the next step.

Example of running the step:

$ java -jar path/to/pub2tools-<version>.jar -pass1 results

-fetch-web

This will fetch webpages and docs for URLs given in web.txt and doc.txt to the database file db.db in the given output directory. Fetching is done like in the -db-fetch-end method of PubFetcher. Fetching behaviour can be influenced by the Fetching parameters and by --fetcher-threads that sets how many threads to use for parallel fetching (default is 8).

For best results, before a major run PubFetcher scraping rules could be tested with the PubFetcher CLI command -test-webpage, especially if this hasn’t been done in a while. Also for better results, -fetch-web could potentially be run multiple times, spaced out by a few days, as some web pages might have been temporarily inaccessible the first time. A re-run is quicker as fetching is not retried for resources that were fetched to final state the first time.

Example of running the step with some non-default parameter values:

$ java -jar path/to/pub2tools-<version>.jar -fetch-web results --timeout 30000 --webpagesYaml fixes.yaml --fetcher-threads 16

-pass2

The second pass of the Pub2Tools algorithm will load all results of the first pass from pass1.json and, while iterating over these results, it will: reassess and reorder entries with lower scores by calculating a second score; merge entries if they are determined to be about the same tool; look into biotools.json to see if an entry is already present in bio.tools; assign types to all the links and decide which one of them is the homepage. The unstemmed tf.idf is also read, as tf–idf weighting is used as one part of calculating the second score, and db.db is read to get the license, language and description candidate phrases from webpages and docs, and to check if webpages and docs are broken and if the final URL (after redirections) differs from the initial one. Final results of all entries are output to results.csv along with intermediate results. New entries determined to be suitable for entry into bio.tools are output to new.json, with the output adhering to biotoolsSchema, and differences between the content of entries determined to already exist in bio.tools and the corresponding content in bio.tools are output to diff.csv.

Calculate score2

A second score is calculated for entries whose first score (calculated in -pass1) is below 1000. The goal is to elevate entries that are quite likely about a tool but got a low score in the first pass (indicating low confidence in the extracted tool name). Up to 5 tool names are suggested for an entry after the first pass, with the top name being suggested as the correct one; however, this top name choice can potentially change while calculating score2 for all tool name suggestions of an entry. For calculating score2, it is first set equal to the first score and then:

  1. If a suggestion has matching non-broken links in the publication abstract, then increase its score2 (matching non-broken links in the fulltext also increase score2, but less). It is possible that the top name suggestion changes after this step, as the current top name might not have matching links, but some lesser choice might.
  2. If a suggestion matches the tool name that can be extracted from the publication title (tool_title), then increase its score2 depending on how good the match is and how many words the name contains (less is better). Again, the suggestions can get reordered and the top name suggestion change.
  3. Next, increase score2 of a suggestion based on the capitalisation of the name. A mix of lower- and uppercase letters gets the highest increase and all lowercase letters the smallest, and if the name consists of many words, then the score increase is lowered. The score2 of suggestions other than the current top one is only increased if it was already increased by either of the two previous methods.
  4. Increase score2 of a suggestion based on the average uniqueness of the words making up the tool name (calculated based on the input tf.idf file). The score2 of suggestions other than the current top one is only increased if it was already increased by either of the first two methods.

Determine confidence

If the first score is at least 1000 or score2 is more than 3000, then confidence is “high”. If score2 is more than 1750 and less than or equal to 3000, then confidence is “medium”. If score2 is at least 1072.1 and less than or equal to 1750, then confidence is “low”. If score2 is less than 1072.1, then confidence is “very low”.
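
These thresholds translate directly into code; restated as a function:

def confidence(score1, score2):
    # thresholds exactly as documented above
    if score1 >= 1000 or score2 > 3000:
        return 'high'
    if score2 > 1750:
        return 'medium'
    if score2 >= 1072.1:
        return 'low'
    return 'very low'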

Note

The confidence value is more like a confidence in the correctness of the extracted tool name. Whether an entry should be considered to be about a tool, and thus eligible for entry into bio.tools, is not based solely on this confidence – the final decision about inclusion is made later.

Merge same suggestions

In the first pass (-pass1) a few publications were split, as sometimes a publication can be about more than one tool. Conversely, different publications can also be about the same tool, so here we try to merge different entries into one entry for such publications. This merging is done if the top name suggestions of the entries are exactly equal and the confidences are not “very low”. Entries with a “very low” confidence are not merged and are instead connected through the same_suggestion field.

Note

Entries with a “very low” confidence can in some occasions also be included in new.json, thus the tool name is not a unique identifier there.

Find existing bio.tools entries

If an entry is found to already exist in bio.tools, then it should not be suggested as a new addition in new.json. However, differences with the current bio.tools content are highlighted in diff.csv.

Note

What is meant by the current bio.tools content is not the content of bio.tools at the exact time of running -pass2, but the content in the supplied biotools.json file.

An entry is determined to already exist in bio.tools in the following cases:

  • Some publication IDs of the new entry match some publication IDs of an entry in bio.tools.
  • The name of the entry is equal to or matches (ignoring version, capitalisation, symbols, etc) a name or ID of an entry in bio.tools, and some link from the entry matches (ignoring the lowest subdomain and the last path part) some link from that bio.tools entry. As an additional requirement in this case, the confidence must not be “very low” and the final decision about inclusion must be positive.
  • The name of the entry matches a name or ID of an entry in bio.tools and also shares a credit (through the name, ORCID iD or e-mail) with that bio.tools entry. Again, the confidence must not be “very low” and the final decision about inclusion must be positive.

The web links extracted in -pass1 from publication abstracts and full texts that match the tool name suggestion are divided into bio.tools link groups and assigned a type. One of these links is chosen to be the tool homepage and broken links are removed.

The link groups of bio.tools are link, download and documentation. Further, inside each group every link is assigned a type, e.g. “Mailing list”, “Source code” or “Installation instructions”. In Pub2Tools, the division of links and the assignment of types is done solely by matching regular expressions against the link URL string. For example, a URL ending with ".jar" would be a download with type “Binaries”, a match of "(?i)(^|[^\\p{L}])faqs?([^\\p{L}]|$)" would be documentation with type “FAQ” and a match of the host "github.com" would be a link with type “Repository”. Note that some other link types might also have the host "github.com": for example, the GitHub tab “Issues” is put under link type “Issue tracker”, the GitHub tab “Wiki” is put under documentation type “User manual” and “Releases” is put under download type “Software package” – so these options have to be explored before link type “Repository”. Links whose URL can’t be matched by any of the rules will end up under link type “Other”.
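
A sketch of this URL-only classification (the ".jar" and FAQ patterns are quoted from the text above – the FAQ pattern is Java regex syntax with \p{L}, approximated here with ASCII letters – while the remaining rules are abbreviated):

import re
from urllib.parse import urlparse

def classify(url):
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.rstrip('/').lower()
    if path.endswith('.jar'):
        return ('download', 'Binaries')
    if re.search(r'(^|[^a-z])faqs?([^a-z]|$)', url, re.IGNORECASE):
        return ('documentation', 'FAQ')
    if host == 'github.com':
        # more specific GitHub tabs are tried before the generic rule
        if path.endswith('/issues'):
            return ('link', 'Issue tracker')
        if path.endswith('/wiki'):
            return ('documentation', 'User manual')
        if path.endswith('/releases'):
            return ('download', 'Software package')
        return ('link', 'Repository')
    return ('link', 'Other')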

After division, links determined to be broken (i.e. that were not successfully resolved or returned a non-successful HTTP status code) are removed from link, download and documentation.

In the end, one of the links is chosen as the bio.tools homepage. First, only links in the abstract are considered with the following priority:

  1. link “Other”
  2. link “Repository”
  3. documentation “General”
  4. other documentation
  5. any non-broken link whose URL does not end with an extension suggesting it is a downloadable file

If no suitable homepage is found this way, then links in the fulltext are considered, following the same logic. If a homepage is still not found, then broken links in the abstract or fulltext are also allowed to be the homepage (and the status is set to “broken” if such a homepage is found). And if there is still no suitable homepage, then a link to the publication (in PubMed, in PubMed Central or a DOI link) is set as the homepage (and the status is set to “missing”). This selection is restated in the sketch below.
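
Restated as a sketch (the link record layout and the extension list are assumptions):

PRIORITY = [('link', 'Other'), ('link', 'Repository'),
            ('documentation', 'General')]
DOWNLOAD_EXTENSIONS = ('.jar', '.zip', '.tar.gz')  # illustrative

def choose_homepage(abstract_links, fulltext_links, publication_url):
    for links in (abstract_links, fulltext_links):
        working = [l for l in links if not l['broken']]
        for wanted in PRIORITY:                      # priorities 1-3
            for l in working:
                if (l['group'], l['type']) == wanted:
                    return l['url'], 'ok'
        for l in working:                            # priority 4
            if l['group'] == 'documentation':
                return l['url'], 'ok'
        for l in working:                            # priority 5
            if not l['url'].lower().endswith(DOWNLOAD_EXTENSIONS):
                return l['url'], 'ok'
    for l in abstract_links + fulltext_links:        # allow broken links
        return l['url'], 'broken'
    return publication_url, 'missing'                # publication link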

Description

As the bio.tools description attribute needs to be filled, some candidate phrases are generated from all the fetched and gathered content; the curator can choose from these or combine them into the final description. The more difficult task of automatic text summarisation has not been undertaken, thus the description is an output of Pub2Tools that definitely needs further manual curation (along with the EDAM terms output in -map).

As the length of the description is limited to 1000 characters, phrases from different sources need to be prioritised and only some can be chosen for consideration. First, the publication title is output as a potential description (with the potential tool name, i.e. tool_title, removed from it).

Then, a suitable description phrase is looked for in web page titles. From these titles, any irrelevant content and the tool name are potentially removed with simple heuristics, and a minimum length is required for such modified titles to be considered. The next priority goes to the first few sentences of a web page that are long enough. In any case, one or two sentences (one short and one longer) that contain the tool name are also looked for, with the priority of such sentences depending on how far from the top of the page they are. Also, the priority of sentences from a web page is a bit higher if scraping rules for that web page exist in PubFetcher. In case of equal priority, phrases from the homepage are preferred (then phrases from certain link types, then from documentation). Menus, headers and other non-main content of web pages are mostly ignored thanks to PubFetcher’s ability to filter out such content. Deduplication of the final suggested phrases is also attempted.

If the search for potential phrases from web pages does not yield any results, then the description candidate phrases after the publication title will be from the publication abstract.

In addition to the Pub2Tools results themselves, we might want to communicate to the curator things that should be checked or kept in mind when curating the entry (for example, that the homepage is potentially “broken” or that there is a slight chance that the entry already exists in bio.tools). As there is no suitable, non-hidden place for such messages in the output meant for bio.tools, the description attribute is abused for this purpose. The messages to the curator, if any, are appended to the description candidate phrases, separated by "\n\n", prefixed with "|||" and written in all caps. The space reserved for the messages is up to half of the description (500 characters); any non-fitting messages are discarded and logged. The list of possible messages can be seen in the output documentation at description.
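
For example, a (hypothetical) final description value could be assembled like this:

phrases = ['A tool for assembling hypothetical examples',
           'Web application that demonstrates description assembly']
messages = ['homepage is possibly broken']  # hypothetical message
description = '\n\n'.join(phrases)
if messages:
    description += '\n\n' + '\n\n'.join('||| ' + m.upper() for m in messages)
# -> "A tool for ...\n\nWeb application ...\n\n||| HOMEPAGE IS POSSIBLY BROKEN"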

License

License information for the bio.tools license attribute is looked for in the homepage and in the link, download, documentation links (in the license field provided by PubFetcher) and also in the publication abstract. The most frequently encountered license is chosen as the suggested license.

An encountered license string must be mapped to the SPDX-inspired enumeration used in bio.tools (see license.txt). Difficulties in doing so include the fact that many licenses have versions (which can be specified in different ways, like “GPL-3”, “GPL(>= 3)” or “GPLv3”, or not specified at all) and that even the license name can be written differently (as an acronym, fully spelled out, something in-between, or there are just different ways to refer to the same license). In free text, like the publication abstract, some license strings from the enumeration can sometimes match words that are actually not about a license, so the presence of "Licen[sc]" in the immediate neighbourhood of such matches is required to avoid false positives.
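
A hedged sketch of such mapping (the patterns and the size of the neighbourhood window are assumptions, not the actual Pub2Tools rules):

import re

LICENSE_PATTERNS = [
    # matches "GPL-3", "GPL(>= 3)", "GPLv3", "GPL 3", ...
    (re.compile(r'GPL\s*[(v-]?\s*(?:>=\s*)?3', re.IGNORECASE), 'GPL-3.0'),
    (re.compile(r'\bMIT\b'), 'MIT'),
]

def find_license(text, free_text=False):
    for pattern, spdx in LICENSE_PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        if free_text:
            # in free text, require "Licen[sc]" near the match
            window = text[max(0, match.start() - 50):match.end() + 50]
            if not re.search(r'licen[sc]', window, re.IGNORECASE):
                continue
        return spdx
    return None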

Language

Language information for the bio.tools language attribute is looked for in the homepage and in the link, download, documentation links (in the language field provided by PubFetcher) and also in the publication abstract. All encountered languages (with duplicates removed) are chosen as the suggested languages.

An encountered language string must be mapped to the programming language enumeration used in bio.tools (see language.txt). Most language strings are rather unique, so these can relatively safely be matched by carefully extracting the characters from the target text. However, some languages (like “C”, “R”, “Scheme”) can easily be mistaken for other things, so in such cases the presence of a keyword (like “implemented”, “software”, “language”; full list in language_keywords.txt) in the immediate neighbourhood is required. Somewhat conversely, some words automatically imply a language (like “bioconductor” -> “R”, “django” -> “Python”).
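
A sketch of this keyword-gated matching (the word lists abbreviate language.txt, language_keywords.txt and the inference rules):

LANGUAGES = {'Python', 'Java', 'Perl'}   # unambiguous names, abbreviated
AMBIGUOUS = {'C', 'R', 'Scheme'}         # need a nearby keyword
KEYWORDS = {'implemented', 'software', 'language'}
IMPLIED = {'bioconductor': 'R', 'django': 'Python'}

def find_languages(text):
    found = set()
    tokens = [t.strip('.,;()') for t in text.split()]
    for i, token in enumerate(tokens):
        if token in LANGUAGES:
            found.add(token)
        elif token in AMBIGUOUS:
            # require a keyword in the immediate neighbourhood
            window = {t.lower() for t in tokens[max(0, i - 3):i + 4]}
            if window & KEYWORDS:
                found.add(token)
        if token.lower() in IMPLIED:
            found.add(IMPLIED[token.lower()])
    return sorted(found)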

Credit

Currently, information to add to the bio.tools credit attribute comes from only one place: the corresponding authors of publications. Corresponding authors can be extracted from articles in PubMed Central, i.e. for publications that have a PMCID, or names and e-mails can also be extracted from many journal web pages of articles, i.e. from web pages got after resolving the DOIs. The task of extracting corresponding author information is done by code from PubFetcher and stored in the publication field correspAuthor. With PubFetcher, we can possibly get the name, ORCID iD, e-mail, phone and web page of a corresponding author. The only thing we can additionally do here is merge potential duplicate authors coming from different sources (with some things to keep in mind when merging; for example, the name of the same person can be written with or without potentially abbreviated middle names, academic titles, accents, etc).

Final decision

The final decision whether an entry is suggested for entry to bio.tools (when it is determined to not already exist in bio.tools) is not based solely on confidence; there are some further considerations. First, if confidence is “high”, then it is suggested for entry. For any lower confidence, it is suggested for entry if the first score (not score2) is high enough (the threshold depending on the exact confidence) or, in case the first score is too low, 1 or 2 (depending on the score) of the following have to hold:

  1. the homepage is not “missing”;
  2. a license is found;
  3. at least one language is found;
  4. none of the publications have a PMID or PMCID (i.e. all publications have only a DOI).

In addition, certain homepage suggestions (like “clinicaltrials.gov”) or journals (like “Systematic reviews”) will bar the entry from being suggested (even if confidence is “high”). Also, the presence of words from not_abstract.txt in the abstract or words from not_title.txt in the publication title will exclude the entry (although publications containing these words were already excluded by -select-pub, if the normal workflow was used). This decision logic is restated in the sketch below.
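
A hedged sketch of this decision logic (the numeric thresholds are placeholders – the real values are internal to Pub2Tools – and the exclusion checks are abbreviated):

SCORE1_THRESHOLD = {'medium': 400, 'low': 500, 'very low': 600}  # assumed

def is_excluded(entry):
    # stand-in for the homepage/journal/phrase exclusions described above
    return entry.get('excluded', False)

def include(entry):
    if is_excluded(entry):
        return False
    if entry['confidence'] == 'high':
        return True
    if entry['score1'] >= SCORE1_THRESHOLD[entry['confidence']]:
        return True
    # otherwise 1 or 2 of the supporting conditions must hold
    # (how many depends on the score; simplified to 2 here)
    support = sum([entry['homepage_status'] != 'missing',
                   entry['license'] is not None,
                   bool(entry['languages']),
                   not entry['has_pmid_or_pmcid']])
    return support >= 2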

All entries (irrespective of the final decision on whether the entry should be added to bio.tools) are output to results.csv along with all possible data (explained in results.csv columns). Merged entries are output as one entry, with values of the constituent entries separated by “ | ” in the field values.

If an entry is determined to be about a tool already existing in bio.tools, then it is not suggested for entry to bio.tools. However, if there are any differences between values of the entry and values of the corresponding entry in bio.tools or if bio.tools seems to be missing information, then these differences and potential extra information are added to diff.csv (with a detailed explanation of the values in diff.csv columns).

Note

Sometimes, an entry seems to exist in bio.tools, but the evidence for existence is not strong enough (and more often than not misleading) – in that case, the new entry is still suggested for entry to bio.tools, but information about the potentially related existing entry in bio.tools is output as a message in the description.

All entries for which the final decision is positive and that are determined to not already exist in bio.tools are added to new.json as biotoolsSchema-compatible JSON. The attributes that can be filled are the ones described above: name, description, homepage, the link, download and documentation groups, publication, credit, license and language.

Example of running the step:

$ java -jar path/to/pub2tools-<version>.jar -pass2 results

-map

This step will add EDAM ontology annotations to the Pub2Tools results in new.json using EDAMmap, outputting the annotated results to to_biotools.json in the given output directory. Additional inputs to the mapping algorithm are the EDAM ontology file EDAM.owl, the database file db.db containing the contents of publications, webpages and docs, and the IDF files tf.idf and tf.stemmed.idf (which IDF file is used depends on the supplied --stemming parameter, with the stemmed version being the default). As additional output, more details about the mapping results are provided, in different formats and levels of detail, in map.txt, the map/ directory of HTML files and map.json.

The input and output file names of the mapping step are fixed and cannot be changed. However, other aspects of the mapping process can be influenced by a multitude of parameters: Preprocessing parameters, Fetching parameters and Mapping parameters. With default parameter values, up to 5 terms from the “topic” and “operation” branches are output (results from the “data” and “format” branches are omitted by default, as currently EDAMmap does not work well for “data” and “format”). In addition, the parameter --mapper-threads can be used to set how many threads to use for parallel mapping of entries (default is 4).

The mapping step will in essence just fill and add the operation attribute (under a new function group) and the topic attribute, containing a list of EDAM term URIs and labels, to the new bio.tools entries in new.json, outputting the result to to_biotools.json.
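
For illustration, the added annotations in an entry of to_biotools.json could look like the following (a hypothetical fragment; the chosen concepts are invented for this example):

"function": [
  {
    "operation": [
      {"uri": "http://edamontology.org/operation_0346",
       "term": "Sequence similarity search"}
    ]
  }
],
"topic": [
  {"uri": "http://edamontology.org/topic_0080",
   "term": "Sequence analysis"}
]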

Annotations from the “data” and “format” branches can also be added if requested. However, in the function group of bio.tools, it has to be specified whether the data attribute and the format attribute are input or output data and format, and EDAMmap can’t do that. So, if mapping results for the “data” and “format” branches are to be output, they will be output to the function note attribute as text and not to the data and format attributes.

In addition to the description attribute, the EDAM terms output by Pub2Tools are attributes that definitely need further manual curation. Unfortunately, EDAMmap scores of the found terms cannot be output to to_biotools.json, which makes it a bit harder to decide on the correctness of the suggested terms. However, suggested concepts are ordered by score, so it can be assumed that the last few term suggestions are more probably wrong. And if needed, more detailed mapping results (including scores) can be found in map.txt, the map/ directory of HTML files or map.json.

Example of running the step with some non-default parameter values:

$ java -jar path/to/pub2tools-<version>.jar -map results --stemming false --branches topic operation data format --mapper-threads 8

-all

This command will run all setup commands necessary for getting the files required by the steps and then run all steps in order (-fetch-pub, -pass1, -fetch-web, -pass2, -map). So in essence, it could be the sole command run to get a batch of new results from Pub2Tools.

The command has a few mandatory parameters: --edam to copy the EDAM ontology (as in -copy-edam), --idf and --idf-stemmed to copy the IDF files (as in -copy-idf) and --from/--to or --month or --day to specify a date range for fetching publication IDs (as in -select-pub). The date range parameters can alternatively be replaced with --pub, that is, instead of fetching new publication IDs, a file containing publication IDs can be copied (as in -copy-pub).

In addition, the parameter --biotools can be specified to copy a JSON file containing the entire content of bio.tools (as in -copy-biotools) and the parameter --db can be used to copy a PubFetcher database (as in -copy-db). If these parameters are omitted, then the entire current bio.tools content is fetched as part of the command (as in -get-biotools) and an empty PubFetcher database is initialised (as in -init-db).

All parameters (like --timeout, --fetcher-threads, --matches) influencing the step commands can also be specified and are passed on to the steps accepting them.

An example of running the command:

$ java -jar path/to/pub2tools-<version>.jar -all results --edam http://edamontology.org/EDAM.owl --idf https://github.com/edamontology/edammap/raw/master/doc/biotools.idf --idf-stemmed https://github.com/edamontology/edammap/raw/master/doc/biotools.stemmed.idf --month 2019-08

-resume

This command will run all steps starting with the step stored in step.txt until the last step (-map). Setup must have been completed separately beforehand. If step.txt is missing, then the command will run all steps, starting with -fetch-pub.

In general, after a step is successfully completed, the next step is written to step.txt. Thus, given some output directory where running Pub2Tools has been aborted (but setup has been completed), the -resume command allows finishing a Pub2Tools run, while not re-executing already done steps, with just one command.

As no setup commands are run, there are no mandatory parameters. However, if resuming from an interrupted command, the same parameters that were used for that command should be specified again here.

An example of running the command:

$ java -jar path/to/pub2tools-<version>.jar -resume results

Parameters

Parameters supply required arguments to, or influence the behaviour of, the setup commands and steps, and begin with two dashes (--). All the -copy setup commands have a mandatory parameter specifying the source of the file to be copied. The -select-pub setup command needs parameters specifying the date range for fetching publication IDs. All other parameters are optional and influence the default behaviour of the commands.

--edam <file or URL>
    The EDAM ontology OWL file to be copied to the output directory with -copy-edam (or -all)
--idf <file or URL>
    The unstemmed IDF file to be copied to the output directory with -copy-idf (or -all)
--idf-stemmed <file or URL>
    The stemmed IDF file to be copied to the output directory with -copy-idf (or -all)
--biotools <file or URL>
    The JSON file containing the entire bio.tools content to be copied to the output directory with -copy-biotools (or -all)
--from <ISO-8601 date>
    The start date (in the form 2019-08-23) of the date range used to fetch publication IDs with -select-pub (or -all)
--to <ISO-8601 date>
    The end date (in the form 2019-08-23) of the date range used to fetch publication IDs with -select-pub (or -all)
--month <ISO-8601 month>
    One month (in the form 2019-08) for which publication IDs should be fetched with -select-pub (or -all)
--day <ISO-8601 date>
    One day (in the form 2019-08-23) for which publication IDs should be fetched with -select-pub (or -all)
--disable-tool-restriction
    If specified, using phrase combinations to narrow down publication IDs to only those potentially about tools is not done with -select-pub (or -all)
--custom-restriction <string>
    Additional restrictions for publication IDs to be fetched with -select-pub (or -all), specified using the Europe PMC search syntax (https://europepmc.org/searchsyntax)
--disable-exclusions
    If specified, some further restrictions to eliminate a few wrong publication IDs are not used with -select-pub (or -all)
--pub <file or URL>
    The file containing publication IDs to be copied to the output directory with -copy-pub (or -all)
--db <file or URL>
    The PubFetcher database file to be copied to the output directory with -copy-db (or -all)
--fetcher-threads <integer> (default: 8)
    Number of threads to use for parallel fetching in -fetch-pub and -fetch-web (or -all or -resume)
--mapper-threads <integer> (default: 4)
    Number of threads to use for parallel mapping in -map (or -all or -resume)
--verbose <LogLevel> (default: OFF)
    The level of log messages that code called from PubFetcher (like fetching publications and web pages) and EDAMmap (like progress of mapping) can output to the console. For example, a value of WARN would enable printing of ERROR and WARN level log messages from PubFetcher and EDAMmap code. Possible values are OFF, ERROR, WARN, INFO, DEBUG. Note that this affects only log messages output to the console, as log messages of any level from PubFetcher and EDAMmap code are written to the log file in any case.

In addition, some commands are influenced by parameters defined in PubFetcher or EDAMmap: Preprocessing parameters (influences -pass1, -pass2 and -map), Fetching parameters (influences -fetch-pub, -fetch-web, -pass2 and -map) and Mapping parameters (influences -map).

Note

The --stemming parameter in the Preprocessing parameters is always false for -pass1 and -pass2, but it can be set to either true or false for -map (default is true).

Examples

A quickstart example for August 2019, where the EDAM ontology and IDF files and the entire content of bio.tools are downloaded from the web and there are no deviations from default values in any of the steps, is the following:

$ java -jar path/to/pub2tools-<version>.jar -all results \
--edam http://edamontology.org/EDAM.owl \
--idf https://github.com/edamontology/edammap/raw/master/doc/biotools.idf \
--idf-stemmed https://github.com/edamontology/edammap/raw/master/doc/biotools.stemmed.idf \
--month 2019-08

An example getting potential tools about COVID-19 from the year 2020:

$ java -jar path/to/pub2tools-<version>.jar -all results \
--edam http://edamontology.org/EDAM.owl \
--idf https://github.com/edamontology/edammap/raw/master/doc/biotools.idf \
--idf-stemmed https://github.com/edamontology/edammap/raw/master/doc/biotools.stemmed.idf \
--from 2020-01-01 --to 2020-12-31 \
--custom-restriction '"2019-nCoV" OR "2019nCoV" OR "COVID-19" OR "SARS-CoV-2" OR "COVID19" OR "COVID" OR "SARS-nCoV" OR ("wuhan" AND "coronavirus") OR "Coronavirus" OR "Corona virus" OR "corona-virus" OR "corona viruses" OR "coronaviruses" OR "SARS-CoV" OR "Orthocoronavirinae" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome" OR ("SARS" AND "virus") OR "soluble ACE2" OR ("ACE2" AND "virus") OR ("ARDS" AND "virus") or ("angiotensin-converting enzyme 2" AND "virus")'

The next example executes each individual setup and step command from start to final results, while changing some default values:

# The EDAM ontology was previously downloaded to the local file system
# to path/to/EDAM.owl and is copied from there to results/EDAM.owl
$ java -jar path/to/pub2tools-<version>.jar -copy-edam results \
--edam path/to/EDAM.owl
# The IDF files have been downloaded to the local file system and are
# copied from there to the output directory "results"
$ java -jar path/to/pub2tools-<version>.jar -copy-idf results \
--idf path/to/tf.idf --idf-stemmed path/to/tf.stemmed.idf
# All bio.tools content is fetched to the file results/biotools.json
$ java -jar path/to/pub2tools-<version>.jar -get-biotools results
# Candidate publication IDs from August 2019 are fetched to the file
# results/pub.txt
$ java -jar path/to/pub2tools-<version>.jar -select-pub results \
--from 2019-08-01 --to 2019-08-31
# An empty PubFetcher database is initialised to results/db.db
$ java -jar path/to/pub2tools-<version>.jar -init-db results
# In the first step, the content of publications listed in
# results/pub.txt is fetched to results/db.db, while the connect and
# read timeout is changed to 30 seconds, some fixes for outdated
# journal scraping rules are loaded from a YAML file and the number
# of threads used for parallel fetching is doubled from the default 8
$ java -jar path/to/pub2tools-<version>.jar -fetch-pub results \
--timeout 30000 --journalsYaml journalsFixes.yaml --fetcher-threads 16
# The first pass of Pub2Tools is run
$ java -jar path/to/pub2tools-<version>.jar -pass1 results
# Web pages extracted by the first pass are fetched, with some default
# parameters modified analogously to -fetch-pub
$ java -jar path/to/pub2tools-<version>.jar -fetch-web results \
--timeout 30000 --webpagesYaml webpagesFixes.yaml --fetcher-threads 16
# The second pass of Pub2Tools is run, culminating in the files
# results/results.csv, results/diff.csv and results/new.json
$ java -jar path/to/pub2tools-<version>.jar -pass2 results
# EDAM annotations are added to results/new.json and output to
# results/to_biotools.json, with stemming turned off, mapping done in
# parallel in 8 threads and up to 5 terms output for all EDAM branches
$ java -jar path/to/pub2tools-<version>.jar -map results \
--stemming false --branches topic operation data format --matches 5 \
--mapper-threads 8

The following example is equivalent to the previous one, except that all the individual commands have been replaced with a single -all command:

$ java -jar path/to/pub2tools-<version>.jar -all results \
--edam path/to/EDAM.owl --idf path/to/tf.idf \
--idf-stemmed path/to/tf.stemmed.idf --month 2019-08 \
--timeout 30000 --journalsYaml journalsFixes.yaml \
--webpagesYaml webpagesFixes.yaml --fetcher-threads 16 \
--stemming false --branches topic operation data format --matches 5 \
--mapper-threads 8

All the files required by the setup can also be obtained through external means and simply copied to the output directory results:

# Copy a previously downloaded EDAM ontology to the output directory
$ java -jar path/to/pub2tools-<version>.jar -copy-edam results \
--edam path/to/EDAM.owl
# Copy previously downloaded IDF files to the output directory
$ java -jar path/to/pub2tools-<version>.jar -copy-idf results \
--idf path/to/tf.idf --idf-stemmed path/to/tf.stemmed.idf
# Copy the existing content of bio.tools in JSON format, obtained
# through a different tool or through a previous run of Pub2Tools,
# to the output directory
$ java -jar path/to/pub2tools-<version>.jar -copy-biotools results \
--biotools path/to/biotools.json
# Copy publication IDs obtained through some different means, for
# example a small list of manually entered IDs meant for testing,
# to the output directory
$ java -jar path/to/pub2tools-<version>.jar -copy-pub results \
--pub path/to/pub.txt
# Copy a PubFetcher database preloaded with potentially useful
# content to the output directory
$ java -jar path/to/pub2tools-<version>.jar -copy-db results \
--db path/to/db.db
# We can use the -resume command to run all step commands in one go;
# the number of threads has been doubled from default values and up to
# 5 EDAM terms are output in the mapping step for the default branches
# of "topic" and "operation"
$ java -jar path/to/pub2tools-<version>.jar -resume results \
--fetcher-threads 16 --mapper-threads 8 --matches 5

The following -all command is equivalent to the previous list of commands:

$ java -jar path/to/pub2tools-<version>.jar -all results \
--edam path/to/EDAM.owl --idf path/to/tf.idf \
--idf-stemmed path/to/tf.stemmed.idf --biotools path/to/biotools.json \
--pub path/to/pub.txt --db path/to/db.db \
--fetcher-threads 16 --mapper-threads 8 --matches 5 ^C (Interrupted)
# But for some reason, the -all command was interrupted. If this
# happened during a step command (when all setup was already done),
# then finishing the run can be done with the -resume command. The
# process is resumed by restarting the step that was interrupted and
# running the remaining steps up to the end. The same step parameters
# that were supplied to -all must also be supplied to -resume.
$ java -jar path/to/pub2tools-<version>.jar -resume results \
--fetcher-threads 16 --mapper-threads 8 --matches 5
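
If it is unclear whether the interruption happened during a step command, one could first inspect step.txt (the same file that is removed in a later example to force rerunning all steps); a minimal sketch, assuming step.txt records the name of the step to be restarted:

$ cat results/step.txt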

When fetching publications and web pages, some resources might be temporarily down. So, for slightly better results, one option is to wait a few days after an initial fetch, in the hope that a few extra resources will be available by then. Due to PubFetcher's logic, publications and web pages that were successfully fetched in full the first time are not retried during refetching:

$ java -jar path/to/pub2tools-<version>.jar -fetch-pub results
$ java -jar path/to/pub2tools-<version>.jar -pass1 results
$ java -jar path/to/pub2tools-<version>.jar -fetch-web results
$ # Wait a few days
$ java -jar path/to/pub2tools-<version>.jar -fetch-pub results
$ java -jar path/to/pub2tools-<version>.jar -pass1 results
$ java -jar path/to/pub2tools-<version>.jar -fetch-web results
$ java -jar path/to/pub2tools-<version>.jar -pass2 results
$ java -jar path/to/pub2tools-<version>.jar -map results

To run all steps again after the wait, another option would be to just use the -resume command after removing step.txt:

$ java -jar path/to/pub2tools-<version>.jar -fetch-pub results
$ java -jar path/to/pub2tools-<version>.jar -pass1 results
$ java -jar path/to/pub2tools-<version>.jar -fetch-web results
$ # Wait a few days
$ rm results/step.txt
$ java -jar path/to/pub2tools-<version>.jar -resume results ^C (Interrupted)
# The -resume command was interrupted for some reason. If -resume is
# now run again, it will not start again from -fetch-pub, but from the
# step that was interrupted.
$ java -jar path/to/pub2tools-<version>.jar -resume results

Note

Before a bigger run of Pub2Tools, it could be beneficial to test whether the scraping rules are still up to date. Running Pub2Tools in a network with good access to journal articles could also be beneficial, as publisher web sites sometimes have to be consulted.

Improving existing bio.tools entries

When Pub2Tools is run on a relatively new batch of publication IDs, most results will end up in to_biotools.json as suggestions for new bio.tools entries and only a few results will be diverted to diff.csv as fix suggestions for existing bio.tools entries. This is expected, as most new articles will be about tools not seen before, and only some will be update articles about tools already in bio.tools, or articles about tools entered into bio.tools through some means other than Pub2Tools.

But as an alternative and additional example of Pub2Tools usage, consider the following: let's get all publication IDs (and nothing else) currently in bio.tools and run Pub2Tools on these IDs. Now, most results will end up in diff.csv, as all entries are determined to already exist in bio.tools. What we get out of this is a large spreadsheet of suggestions (diff.csv) on what to improve in existing bio.tools content. A lot of the suggestions are incorrect, but there should also be many valuable fixes and additions among them. If content missing from bio.tools was found, the tool will suggest adding it (to “publication”, “link”, “download”, “documentation”, “language” or “credit”), or, for some content types, modifying existing content (in “name”, “homepage”, “license” or “credit”). No suggestions (for removal) are given for content that is present in bio.tools but was not found by the tool. All values suggested by the tool are valid according to biotoolsSchema, but they don't necessarily follow the curation guidelines.

In addition, the file to_biotools.json will probably not be totally empty, but will contain a few suggested new entries for bio.tools. This happens because some publications seem to be about multiple tools and are broken up, and some of these broken-up tools appear to be missing from bio.tools. As detecting multiple tools in one publication is relatively crude right now, the quality of these new entries in the JSON file might not be the best, but going through them might still be worthwhile.

Running Pub2Tools on existing bio.tools content can be done in the following way (needs EDAMmap-Util from EDAMmap):

# Download all bio.tools content to biotools.json
$ java -jar path/to/edammap-util-<version>.jar -biotools-full biotools.json
# Extract all publication IDs present in biotools.json to pub.txt
$ java -jar path/to/edammap-util-<version>.jar -pub-query biotools.json \
--query-type biotools -txt-ids-pub pub.txt --plain
# Run Pub2Tools with default parameters, outputting all to directory "results"
$ java -jar path/to/pub2tools-<version>.jar -all results \
--edam http://edamontology.org/EDAM.owl \
--idf https://github.com/edamontology/edammap/raw/master/doc/biotools.idf \
--idf-stemmed https://github.com/edamontology/edammap/raw/master/doc/biotools.stemmed.idf \
--biotools biotools.json \
--pub pub.txt
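
Afterwards, the fix suggestions end up in results/diff.csv; this file can be skimmed on the command line before it is opened as a spreadsheet, for example:

$ head results/diff.csv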

Note

The example works best while only a few entries curated from Pub2Tools results have been added to bio.tools, as for such entries the same mistakes made by Pub2Tools would have to be gone through repeatedly.