Output

All results and intermediate files will be output to the output directory specified by the user. The main results files of interest are results.csv, where all results with all possible extra data are output, diff.csv, where potential fixes to some bio.tools content are output (based on differences between new entries and the corresponding entries already existing in bio.tools), and to_biotools.json, where new additions to bio.tools are output. All intermediate files are also copied and kept in the output directory for potential reproducibility and debugging purposes.

The output of Pub2Tools does not lead to fully automatic growth and improvement of bio.tools content. The results have to be curated: filled attributes might need checking or editing (especially description), some suggested entries might actually not be tools or otherwise not be suitable for bio.tools, information for many attributes is not found (as seen in performance), fixes in diff.csv have to be checked and applied manually, etc.

Output directory

The name of the output directory is chosen by the user and must be specified after all commands, like -copy-edam directory_name or -pass1 directory_name. If the output directory does not exist, it will be created. While the name of the output directory can be chosen, the names of the results and intermediate files are fixed. Short descriptions of these files are given below.

pub2tools.log

The log file is created in a new output directory by the first command that is run with the output directory as argument. The first two lines of the log file are the arguments given to Pub2Tools and the version of Pub2Tools. Then, the same lines that are output to the console while the command is running are output to the log file. In addition, all DEBUG level messages and all log messages coming from PubFetcher and EDAMmap code are output to the log file (outputting those to the console is turned off by default with the parameter --verbose OFF). Any subsequent commands run on the output directory will just append to the existing log file.

The log file could later be used for debugging. One option would be analysing ERROR level messages, for example with grep ERROR pub2tools.log | less. More information about the structure of a log line and analysing the logs can be found in the PubFetcher documentation about the log file.

EDAM.owl

The EDAM ontology file is needed by the -map step to add EDAM terms to the results. It can be downloaded from the EDAM ontology GitHub and copied to the output directory with the setup command -copy-edam.

tf.idf

A simple file of tab-separated values containing normalised tf–idf scores of (unstemmed) words occurring in the bio.tools corpus. It can either be downloaded or generated; more information can be found at -copy-idf, which is the command used to copy the file to the output directory.

Note

As the IDF files are among the largest files in the output directory and are potentially equal across many runs of Pub2Tools, these could be the first files to delete in a finalised output directory to save disk space (or they could be linked to some master IDF files with ln -s).

tf.stemmed.idf

Like tf.idf, except the words are stemmed.

biotools.json

The entire content of bio.tools in JSON format and adhering to biotoolsSchema, either downloaded with -get-biotools or copied with -copy-biotools.

pub.txt

A list of candidate publication IDs to search tools from. It’s a simple text file containing publication IDs in the form <pmid>\t<pmcid>\t<doi>, one per line. Empty lines and lines beginning with # are ignored. It can be either fetched with -select-pub or copied with -copy-pub or created manually.
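The line format described above can be read with a few lines of Python. This is only a minimal sketch, not Pub2Tools code: the names PublicationIds and parse_pub_lines are hypothetical, and it assumes that a missing ID is written as an empty field.

```python
from typing import Iterable, List, NamedTuple


class PublicationIds(NamedTuple):
    pmid: str
    pmcid: str
    doi: str


def parse_pub_lines(lines: Iterable[str]) -> List[PublicationIds]:
    # Each data line has the form <pmid>TAB<pmcid>TAB<doi>.
    result = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Empty lines and lines beginning with # are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        # Assumption: a missing ID is an empty field; pad short lines.
        parts += [""] * (3 - len(parts))
        result.append(PublicationIds(*(p.strip() for p in parts[:3])))
    return result
```

The function accepts any iterable of lines, so an open file handle for pub.txt can be passed to it directly.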

db.db

A PubFetcher database file containing the contents of publications and web pages fetched as part of a Pub2Tools run. It needs to be initialised with -init-db or a database with prefetched content can be copied with -copy-db. The database can be queried or manipulated with PubFetcher-CLI or EDAMmap-Util.

step.txt

Used to keep track of the current step being run. Can contain the value None, -fetch-pub, -pass1, -fetch-web, -pass2, -map or Done. The value indicates which step should be run next. So for example, after -pass1 completes successfully, the value -fetch-web will be written to the file. If the file is present and contains any value other than None, then some steps have been run and no Setup commands can be run anymore. But the main use of the file is enabling the -resume command: when that command is run, it checks which step should be run next and runs that step and all subsequent steps until the last step, -map, is completed and Done is written out.
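The resume behaviour described above can be illustrated as follows. This is a sketch of the idea only, not the actual implementation: the function name remaining_steps is hypothetical, and the handling of the value None (treated here as "everything is still to be run") is an assumption.

```python
from typing import List

# Steps in the order listed for step.txt; "None" and "Done" are bookkeeping values.
STEPS = ["-fetch-pub", "-pass1", "-fetch-web", "-pass2", "-map"]


def remaining_steps(step_value: str) -> List[str]:
    """Return the steps still to be run, given the content of step.txt."""
    value = step_value.strip()
    if value == "Done":
        return []
    if value == "None":
        # Assumption: with no step recorded yet, everything is still to be run.
        return list(STEPS)
    if value not in STEPS:
        raise ValueError(f"Unexpected step value: {value!r}")
    # The recorded value is the next step; run it and all steps after it.
    return STEPS[STEPS.index(value):]
```

For example, if step.txt contains -fetch-web, the sketch returns -fetch-web, -pass2 and -map, in that order.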

web.txt

A list of webpage URLs extracted from the publication abstracts and fulltexts by the -pass1 command that are matching the (up to 5 per publication) names suggested for the tools the publications are potentially about. These URLs are candidates for the tool homepage and other link attributes in bio.tools and the content of these links needs to be fetched in -fetch-web. The URLs are simply written one per line, with empty lines and lines beginning with # being ignored.

doc.txt

Same as web.txt, except links determined to be about documentation are written here instead (because the PubFetcher database has a separate store for docs).

pass1.json

Results of the -pass1 command that are later used as input for -pass2. The results include information about the publication (like its IDs, title, publication date and journal, number of citations and corresponding authors) and about the up to 5 candidate names for the potential tool the publication is about (including the name in processed form, the score assigned to the name and links attached to it). Most of the values passed on to -pass2 also end up in results.csv, so more thorough documentation about these values can be found in results.csv columns.

results.csv

This file will contain all results of Pub2Tools as output by the -pass2 command, including entries that were excluded for entry to bio.tools or found to be already existing there. In addition to the end results that can be inserted to bio.tools attributes, each entry will contain all possible other data related to the entry and values of intermediate results, but also values currently present in bio.tools for entries that were found to be existing there. All these values are documented in results.csv columns. The first row of the file specifies the column names and the second row contains links to the column documentations in results.csv columns.
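Because the second row holds documentation links rather than data, it has to be skipped when loading the file programmatically. A minimal sketch, assuming the file can be parsed with the default dialect of Python's csv module (the function name read_results is hypothetical):

```python
import csv
from typing import Dict, Iterable, List


def read_results(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read results.csv content into a list of {column name: value} rows."""
    reader = csv.reader(lines)
    header = next(reader)   # first row: column names
    next(reader, None)      # second row: links to the column documentation
    return [dict(zip(header, row)) for row in reader]
```

When reading from disk, the file would typically be opened with open("results.csv", newline="", encoding="utf-8") and the file handle passed to read_results.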

diff.csv

This file will contain entries that were found to be existing in bio.tools in -pass2. More precisely, it will only contain entries that were found to be existing in bio.tools and for which some value was found to be different or missing in bio.tools, and the contents of the file will be a listing of these differences (i.e. differing or missing values). Many of these differences are mistakes made by Pub2Tools, but many are also pointing to incorrect or missing information in bio.tools, thus the contents of this file can be used to improve existing entries of bio.tools. In rare circumstances, some entries that are not actually already existing in bio.tools might be mistakenly diverted here (instead of to_biotools.json) – such entries should be added to bio.tools manually. This file can be especially useful if Pub2Tools is run on all publications currently in bio.tools, like exemplified in Improving existing bio.tools entries. The structure of the file is documented in diff.csv columns. The first row of the file specifies the column names and the second row contains links to the column documentations in diff.csv columns.

new.json

This file will contain all new entries suggested for addition to bio.tools, as decided and output by -pass2 and adhering to biotoolsSchema. The file is fed as input to the command -map, producing to_biotools.json, which is the file that should actually be used to add the new entries to bio.tools.

The following bio.tools attributes will always be filled: name attribute, description attribute (if nothing else is found, then it is filled with the publication abstract), homepage attribute (if no links found, then filled with a link to the publication itself) and publication attribute. Additionally, an effort is made to fill the following attributes: language attribute, license attribute, link attribute, download attribute, documentation attribute and credit attribute. Further information about possible values of these attributes (for example about the messages to the curator in the description) can be found in to_biotools.json attributes.

map.txt

Additional data about the EDAMmap results obtained with the -map command, in plain text format.

map/

Additional data about the EDAMmap results obtained with the -map command, in a directory of HTML files. To see this mapping data, open map/index.html in a web browser.

map.json

Additional data about the EDAMmap results obtained with the -map command, in JSON format.

to_biotools.json

Same as new.json, except EDAMmap terms have been added by the -map command to the function attribute and topic attribute. This is the file that should be used to add new entries to bio.tools. Rarely, some entries here are actually already existing in bio.tools (and thus should have been output to diff.csv instead) – such entries should evidently not be added to bio.tools (however, such entries might still contain useful information on what to change in those existing entries). Further information about possible values of the attributes can be found in to_biotools.json attributes.

results.csv columns

pmid
As results are extracted from publications, then the first 3 columns are the IDs of the publication – here, the PubMed ID of the publication is output. These publication IDs are used to fill the publication attribute of bio.tools. Sometimes, multiple publications seem to be about the same tool – in that case the corresponding results are merged into one row and the PubMed IDs of these different publications will be separated by " | " here.
pmcid
Like pmid, but for the PubMed Central ID of publications.
doi
Like pmid, but for the Digital Object Identifier (DOI) of publications.
same_suggestions
Currently, results got from two different publications are merged into one result, if their top name suggestion is exactly equal and confidence is not “very low”. If the names are equal, but confidence of at least one of the names is “very low”, then the publications are not merged, but instead linked through this column (where one result will contain publication IDs of the other result and vice versa). If multiple such links are made, then the publication IDs of the different linked results are separated by " | ".
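The rule above amounts to a three-way decision, which can be sketched like this. The function name and the returned labels are hypothetical and only illustrate the rule as described, not the actual code.

```python
def merge_decision(name_a: str, confidence_a: str,
                   name_b: str, confidence_b: str) -> str:
    """Decide how two results with given top name suggestions relate."""
    if name_a != name_b:
        # Top name suggestions must be exactly equal for any relation.
        return "separate"
    if "very low" in (confidence_a, confidence_b):
        # Equal names, but at least one "very low": link via same_suggestions.
        return "link"
    return "merge"
```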
score
The goodness score of the suggestion is calculated in the first pass (-pass1) and shows confidence in the extracted tool name (and not in how “good” or high impact the tool itself is). Entries in the results file are sorted by score (for entries whose score is at least 1000), but there are a few other things to consider in assessing whether an entry is about a tool and suitable for suggestion to bio.tools – whether an entry is suggested can be seen in the include column.
score2
If score is lower than 1000, then this second score is calculated in the second pass (-pass2) for further fine-tuning of entries of lower confidence. Entries that have this second score are sorted by it instead of score.
score2_parts
Values of the four parts of score2. Summing these four parts and the value of score gives score2.
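As a quick illustration of this relationship, the sketch below recomputes score2 from score and the four parts. The function name is hypothetical and it takes the parts as already parsed numbers, since the exact cell format of score2_parts is not covered here.

```python
from typing import Sequence


def score2_from_parts(score: float, parts: Sequence[float]) -> float:
    """Recompute score2 as score plus the sum of its four parts."""
    if len(parts) != 4:
        raise ValueError("score2 is described as having exactly four parts")
    return score + sum(parts)
```

With made-up values, a score of 800 and parts of 50, 25, 0 and 100 would give a score2 of 975.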
confidence
A confidence of “high”, “medium”, “low” or “very low” is determined based on the values of score and score2.
include
true, if the final decision of Pub2Tools, based on some additional aspects in addition to score and score2, is that the entry is about a tool. In the true case, the entry will be suggested as a new tool to add to bio.tools, unless the value in the existing column is not empty. Also, if confidence is “very low”, but include is still true, then the entry is quite possibly about a tool and suggested for entry, however, the confidence in the tool name suggestion is very low and should be checked.
existing
Will contain bio.tools ID(s) of entries that are found to be already existing in bio.tools. If multiple entries in bio.tools are matched, then the IDs are separated by " | ". Entries that are found to be already existing in bio.tools are not suggested as new tools, however, if there are differences in information currently in bio.tools and information extracted by Pub2Tools for these entries, then these differences are highlighted in diff.csv (and for entries that were found to be existing due to matching publication IDs in bio.tools, entry to diff.csv is done even if include is false).
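Taken together, the include and existing columns determine whether a row is suggested as a new tool. A minimal sketch of that check on a row loaded as a dictionary of strings (the helper names are hypothetical, and it is assumed that Boolean values appear as the literal strings true and false):

```python
from typing import Dict, List

SEPARATOR = " | "


def split_values(cell: str) -> List[str]:
    """Split a multi-valued cell on " | ", dropping empty values."""
    return [value for value in cell.split(SEPARATOR) if value]


def is_new_suggestion(row: Dict[str, str]) -> bool:
    """True if the entry is included and no existing bio.tools ID was found."""
    return row.get("include", "") == "true" and not split_values(row.get("existing", ""))
```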
suggestion_original
The name suggested for the tool, in original form as extracted from the title and abstract of the publication. As there are syntactic restrictions and a limited set of characters allowed in the name (Latin letters, numbers and some punctuation symbols, as seen in name attribute API docs), then for some entries the original suggestion must be edited: invalid characters are either replaced (done for accents, Greek letters, etc) or discarded altogether, and overly long suggestions are truncated. Only syntactic rules mandated by biotoolsSchema are followed; curation guidelines for the name attribute are not necessarily followed. The value in this column will be empty if no such modifications need to be made, otherwise this column will contain the original name and the suggestion column the modified form of the name.
suggestion
The name suggested as the name attribute of the tool for bio.tools, extracted from the title and abstract of the publication in the first pass (-pass1).
suggestion_processed
A further processed version of suggestion (with letters converted to lowercase and symbols removed), used in many parts of the Pub2Tools algorithm (like matching the name to extracted links).
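A rough approximation of such processing is shown below. It is not the Pub2Tools implementation: which characters count as "symbols" is an assumption here (everything except letters, digits and whitespace), and the function name is hypothetical.

```python
import re


def process_name(name: str) -> str:
    """Lowercase a name and drop symbols (approximation, see assumptions)."""
    lowered = name.lower()
    # Assumption: remove anything that is not a letter, digit or whitespace.
    return re.sub(r"[^\w\s]|_", "", lowered)
```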
publication_and_name_existing
Contains bio.tools IDs (separated by " | ") of entries in bio.tools that have exactly the same name and whose publications are also present in this entry constructed by Pub2Tools. Matching publication IDs mean that the entry is considered existing in bio.tools and it is added to the existing column (even if include is false).
name_existing_some_publication_different
Contains bio.tools IDs (separated by " | ") of entries in bio.tools that have exactly the same name and for which some publications are also present in this entry constructed by Pub2Tools, but some are not (IDs of publications found by Pub2Tools but not present in bio.tools are written in parenthesis after the bio.tools ID, with possible multiple publications separated by " ; "). Some matching publication IDs mean that the entry is considered existing in bio.tools and it is added to the existing column (even if include is false).
some_publication_existing_name_different
Contains bio.tools IDs (separated by " | ") of entries in bio.tools whose publications are also present in this entry constructed by Pub2Tools, but whose name is different than the name found by Pub2Tools (the tool name of the entry in bio.tools is written in parenthesis after the ID; in addition, if Pub2Tools has found publications that are not present in the matching bio.tools entry, then the IDs of these publications are written to another set of parenthesis after the ID and name, with potential multiple publications separated by " ; "). Some matching publication IDs mean that the entry is considered existing in bio.tools and it is added to the existing column (even if include is false). The difference in name is highlighted in diff.csv.
name_existing_publication_different
Contains bio.tools IDs (separated by " | ") of entries in bio.tools that have exactly the same name as this entry constructed by Pub2Tools, but that have no matching publication IDs with this entry (publications found by Pub2Tools are written in parenthesis after the bio.tools ID, with possible multiple publications separated by " ; "). The new entry is considered existing in bio.tools only if one of the bio.tools IDs in this column also occurs in the link_match column or if a credit of the new entry matches a credit in a bio.tools entry corresponding to these bio.tools IDs (and additionally, confidence must not be “very low” and include must be true), in which case bio.tools IDs matching these criteria are added to the existing column.
name_match
Like name_existing_publication_different, except the name of the bio.tools entry is not exactly equal to the name of the new entry constructed by Pub2Tools, just their processed names are equal (the processed name being like in suggestion_processed but with potential version information removed from the end). Also, non-matching publication IDs will not be output in parenthesis after the bio.tools ID – the name of the tool in bio.tools will be output instead.
link_match
Contains bio.tools IDs (separated by " | ") of entries in bio.tools that have any matching link with any link extracted by Pub2Tools for this suggestion (as seen in links_abstract or links_fulltext). Links don’t have to be equal: in addition to the standard www and index.html parts, the lowest subdomain and last path of the links are ignored when matching. The common matching part of the matching link is output in parenthesis after the bio.tools ID, with potential multiple partial links separated by " ; ". This column is not filled with bio.tools IDs already occurring in publication_and_name_existing, name_existing_some_publication_different or some_publication_existing_name_different. If any of the bio.tools IDs occurring here also occur in name_existing_publication_different or name_match, then this entry is considered existing in bio.tools and these common bio.tools IDs are added to the existing column.
name_word_match
Contains bio.tools IDs (separated by " | ") of entries in bio.tools whose name has a matching word with a word from the name of this entry constructed by Pub2Tools. The name of the entry in bio.tools follows in parenthesis. If a bio.tools ID is already in any of the columns from publication_and_name_existing to link_match, then it is not added here. Also, if too many bio.tools IDs would be added (over 5), then nothing is output here. The values in this column are not used anywhere in the Pub2Tools algorithm.
links_abstract
Contains URLs (separated by " | ") extracted from the abstracts of publications and matched to the suggestion. This is done in the first pass (-pass1).
links_fulltext
Contains URLs (separated by " | ") extracted from the full texts of publications and matched to the suggestion. This is done in the first pass (-pass1).
true, if the tool name in suggestion was extracted from a link in the publication abstract (as that name was only occurring in a link and not elsewhere in the text of the abstract or title). If there are other_suggestions, then the Boolean values (separated by " | ") for those will be appended after " | ".
homepage
A URL suggested as the homepage attribute of the tool for bio.tools. The homepage is selected when dividing links (i.e. the links in links_abstract and links_fulltext are divided) in the second pass (-pass2).
homepage_broken

true, if the homepage link seems to be broken. A broken page is suggested as the homepage, as no better alternatives were found. The broken status of a web page is determined in PubFetcher code called by Pub2Tools based on reachability and the HTTP status code.

Note

A reportedly broken homepage can sometimes still be functional (for example, maybe it was temporarily down at the time Pub2Tools was run) – this could be manually checked in a web browser.

homepage_missing
true, if no links (even broken ones) matching the suggestion were found, i.e. a homepage could not be extracted. In that case, the homepage column is still filled, but with a link to the publication. A missing homepage does not necessarily mean that the entry is not a tool, it just means that no suitable links in the publication abstract or fulltext were matched to the extracted tool name in suggestion (either Pub2Tools failed to find the homepage or the publication just doesn’t mention any links of the tool).
homepage_biotools
Contains homepages (separated by " | ") of the bio.tools entries corresponding to the bio.tools IDs in existing, that is, if the current entry constructed by Pub2Tools is found to be existing in bio.tools, then the homepage currently in bio.tools is output here to contrast with the value in the column homepage. If a homepage currently in bio.tools is determined to be broken by Pub2Tools, then "(broken)" will follow the homepage URL and in addition, if the homepage is determined to be problematic in bio.tools itself, then "(homepage_status: x)" will follow the homepage URL (where x is a status number other than 0, as got through the bio.tools API).
link
A list of URLs (separated by " | ") suggested for the link attribute of the tool for bio.tools. These links are selected when dividing links (the links in links_abstract and links_fulltext) in the second pass (-pass2). After each URL, the type of the link will follow in parenthesis (in case of the link attribute, for example “Repository” or “Mailing list”).
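Cells in this format (a URL followed by its type in parenthesis, with entries separated by " | ") can be split into pairs as sketched below. The exact spacing between the URL and the parenthesis is an assumption, so the pattern tolerates optional whitespace; the function name and the example URLs used in the test are hypothetical.

```python
import re
from typing import List, Tuple

# A URL without whitespace, then the link type in a trailing parenthesis.
ENTRY = re.compile(r"^(?P<url>\S+)\s*\((?P<type>[^()]*)\)$")


def parse_typed_links(cell: str) -> List[Tuple[str, str]]:
    """Split a cell into (url, type) pairs; type is "" if none is found."""
    links = []
    for entry in cell.split(" | "):
        entry = entry.strip()
        if not entry:
            continue
        match = ENTRY.match(entry)
        if match:
            links.append((match.group("url"), match.group("type")))
        else:
            links.append((entry, ""))
    return links
```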
link_biotools
Contains lists (separated by " | ") of links (separated by " ; ") of the bio.tools entries corresponding to the bio.tools IDs in existing, that is, if the current entry constructed by Pub2Tools is found to be existing in bio.tools, then the links currently in bio.tools are output here to contrast with the values in the column link. After each URL, the type of the link will follow in parenthesis (in case of the link attribute, for example “Repository” or “Mailing list”).
download
Like link, except links meant for the download attribute of bio.tools are output.
download_biotools
Like link_biotools, except download attribute links of existing bio.tools entries are output.
documentation
Like link, except links meant for the documentation attribute of bio.tools are output.
documentation_biotools
Like link_biotools, except documentation attribute links of existing bio.tools entries are output.
broken_links
Contains link attribute, download attribute and documentation attribute URLs (separated by " | ") that were found to be broken when dividing links (the links in links_abstract and links_fulltext) in the second pass (-pass2). After each URL, the type of the link will follow in parenthesis (in case of the link attribute, for example “Repository” or “Mailing list”). Links occurring here will not be output to link, download and documentation (and thus not suggested for input to bio.tools), however, if the homepage is broken, then the homepage URL will appear both here and in the homepage column.
other_scores
The rounded scores (separated by " | ") of other_suggestions, analogous to the score column of the main suggestion.
other_scores2
The rounded second scores (separated by " | ") of other_suggestions, analogous to the score2 column of the main suggestion.
other_scores2_parts
The parts of the rounded second scores (separated by " | ") of other_suggestions, analogous to the score2_parts column of the main suggestion.
other_suggestions_original
The unedited names (separated by " | ") of other_suggestions, analogous to the suggestion_original column of the main suggestion.
other_suggestions
Up to 4 alternative suggestions for the tool name are extracted in the first pass (-pass1). The order of these suggestions was possibly changed (with one of them possibly even elevated to be the main suggestion) when score2 was calculated in the second pass (-pass2). There may also be no alternative suggestions, which shows higher confidence in the main suggestion. This column contains the names (for the name attribute of bio.tools) of these alternative suggestions (separated by " | "). Alternative suggestions are not suggested for entry to bio.tools, however a message in the description will draw the attention of the curator to the existence of possible alternative names of the tool.
other_suggestions_processed
The processed names (separated by " | ") of other_suggestions, analogous to the suggestion_processed column of the main suggestion.
other_publication_and_name_existing
A column analogous to publication_and_name_existing, but for other_suggestions. Values of different suggestions are separated by " | " and IDs within a suggestion are separated by " ; ".
other_name_existing_some_publication_different
A column analogous to name_existing_some_publication_different, but for other_suggestions. Values of different suggestions are separated by " | " and IDs within a suggestion are separated by " ; ".
other_some_publication_existing_name_different
A column analogous to some_publication_existing_name_different, but for other_suggestions. Values of different suggestions are separated by " | " and IDs within a suggestion are separated by " ; ".
other_name_existing_publication_different
A column analogous to name_existing_publication_different, but for other_suggestions. Values of different suggestions are separated by " | " and IDs within a suggestion are separated by " ; ".
other_links_abstract
Contains links found in the publication abstract that are matching other_suggestions. Links of different suggestions are separated by " | " and links within a suggestion are separated by " ; ".
other_links_fulltext
Contains links found in the publication fulltext that are matching other_suggestions. Links of different suggestions are separated by " | " and links within a suggestion are separated by " ; ".
leftover_links_abstract
Contains all links (separated by " | ") that were extracted from the publication abstract, but not matched to the main suggestion (thus, not output to the links_abstract column) or to any other_suggestions (thus, not output to the other_links_abstract column). These links are just output to this column and not used anywhere else in Pub2Tools.
leftover_links_fulltext
Contains all links (separated by " | ") that were extracted from the publication fulltext, but not matched to the main suggestion (thus, not output to the links_fulltext column) or to any other_suggestions (thus, not output to the other_links_fulltext column). These links are just output to this column and not used anywhere else in Pub2Tools.
title
Contains the title(s) of the publication(s) (separated by " | ").
tool_title_others
Contains the other tool_title of a publication that was split into two entries (based on a " and ", " & " or ", " in the entire tool_title part of a publication title). If a publication is split into more than two entries, then the other tool_titles will be separated by " ; ". If the entry has more than one publication, then the other tool_titles of different publications are separated by " | ". Keeping track of these other tool_titles is needed, because if a publication is split into many entries, then all these entries will have a common publication and Pub2Tools would otherwise suggest merging them back into one entry in diff.csv.
tool_title_extracted_original
The tool_title as originally extracted from the publication title. If no tool_title can be extracted from the publication title, then this column will be empty. Note, that some processing steps have still been done, for example, other tool_titles have been separated to tool_title_others, whitespace has been normalised, some punctuation removed from the start and end of words, etc. This form of the tool_title is used as part of the calculations of the score2 part concerning the tool_title.
tool_title
The tool_title is the part of the publication title that precedes ": ", " - ", ", a", etc. The tool_titles of different publications are separated by " | ". In this column, the intermediate extraction step of the tool_title, as presented in tool_title_extracted_original, is further processed, for example stop words are removed (this can be further influenced by Preprocessing parameters). Also, if tool_title_extracted_original contains an acronym in parenthesis, then this acronym is removed (to tool_title_acronym). If this processing does not alter the value in tool_title_extracted_original, then the value in this column is left empty for readability purposes. The tool_title is often equal to the name of the tool and thus often (but not always) ends up as the name of the entry in suggestion.
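The basic idea of taking the part of the title that precedes a separator can be sketched as follows. This is a simplification under stated assumptions: only the three separators named above are used (the "etc." is not filled in), no stop word or acronym processing is done, and the function name and the titles in the test are made up for illustration.

```python
# Separators as named in the text; the real list is longer ("etc.").
SEPARATORS = [": ", " - ", ", a"]


def extract_tool_title(title: str) -> str:
    """Return the part of the title before the earliest separator, or ""."""
    positions = [title.find(separator) for separator in SEPARATORS]
    # Keep only separators that occur and leave a non-empty prefix.
    positions = [position for position in positions if position > 0]
    if not positions:
        return ""
    return title[:min(positions)].strip()
```

Note that the plain ", a" separator also matches titles continuing with ", and", which a real implementation would need to handle separately.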
tool_title_pruned
A further processed tool_title, where version information and some common words (like “database”, “server”, “pipeline”) have been pruned. If this pruning doesn’t remove anything and thus the value is equal to tool_title, then an empty string would be output to this column instead. Like tool_title_extracted_original, the pruned version of tool_title is used in the calculations of the score2 part concerning the tool_title.
tool_title_acronym
Contains the acronym version of the tool_title, with values of different publications separated by " | ". The acronym must be in parenthesis after the expanded name and it is found and extracted when processing tool_title_extracted_original. Like tool_title_extracted_original and tool_title_pruned, the acronym version of tool_title is used in the calculations of the score2 part concerning the tool_title.
description

A list of descriptions (separated by "\n\n") suggested as the description attribute of the tool for bio.tools. This is the one column that definitely needs curation: a curator can choose one of the descriptions from the list or combine multiple description suggestions into the final description of the tool in bio.tools. More information can be found in the description part of the second pass (-pass2), where the descriptions are constructed.

In addition to the list of descriptions, a list of messages to the curator (also separated by "\n\n") is appended to the descriptions (after a "\n\n"). The messages start with "|||" and are uppercase. If there are any messages to the curator, then these should be acknowledged, potentially acted upon and deleted. Messages could be the following:

description_biotools
Contains the values of the description attributes (separated by " | ") of the bio.tools entries corresponding to the bio.tools IDs in existing, that is, if the current entry constructed by Pub2Tools is found to be existing in bio.tools, then the descriptions currently in bio.tools are output here to contrast with the value in the column description. Line breaks and tabs in the bio.tools description will be replaced with the strings "\n", "\r", "\t".
license_homepage
Contains the value of the license field of the PubFetcher webpage corresponding to the homepage URL. Nothing is output, if the field is empty – the field can usually be filled when it’s a URL of a repository. The license string is output as got from PubFetcher and needs to be mapped to a valid bio.tools license Enum value in the second pass (-pass2).
license_link
Contains the non-empty values (separated by " | ") of the license fields of the PubFetcher webpages corresponding to the link URLs. The URL follows the license string in parenthesis. The license strings are output as got from PubFetcher and need to be mapped to valid bio.tools license Enum values in the second pass (-pass2).
license_download
Like license_link, but for licenses from download URLs.
license_documentation
Like license_link, but for licenses from documentation URLs.
license_abstract
Contains all bio.tools licenses found from the abstracts of the publications of this entry. Licenses found from one publication abstract are separated by " ; " and values from different publications are separated by " | ". The publication IDs of the abstract where a license was found will follow the license value. The license value is extracted in the second pass (-pass2).
license
The license suggested as the value of the license attribute of the tool for bio.tools. This license value is chosen as the most common value occurring among the values of license_homepage, license_link, license_download, license_documentation and license_abstract. URLs and publication IDs (separated by ", ") of the webpages and abstracts where the chosen license was encountered will follow the license value in parenthesis.
license_biotools
Contains the values of the license attribute (separated by " | ") of the bio.tools entries corresponding to the bio.tools IDs in existing, that is, if the current entry constructed by Pub2Tools is found to be existing in bio.tools, then the licenses currently in bio.tools are output here to contrast with the value in the column license.
language_homepage
Contains the value of the language field of the PubFetcher webpage corresponding to the homepage URL. Nothing is output, if the field is empty – the field can usually be filled when it’s a URL of a repository. The language value is output as got from PubFetcher and needs to be mapped to valid bio.tools language Enum value(s) in the second pass (-pass2).
language_link
Contains the non-empty values (separated by " | ") of the language fields of the PubFetcher webpages corresponding to the link URLs. The URL follows the language value in parenthesis. The language value is output as got from PubFetcher and needs to be mapped to valid bio.tools language Enum values in the second pass (-pass2).
language_download
Like language_link, but for languages from download URLs.
language_documentation
Like language_link, but for languages from documentation URLs.
language_abstract
Contains all bio.tools languages found from the abstracts of the publications of this entry. Languages found from one publication abstract are separated by " ; " and values from different publications are separated by " | ". The publication IDs of the abstract where a language was found will follow the language value. The language value is extracted in the second pass (-pass2).
language
The languages (separated by " ; ") suggested as the content of the language attribute of the tool for bio.tools. The languages are put together from all language values found in language_homepage, language_link, language_download, language_documentation and language_abstract (duplicate values are merged). URLs and publication IDs (separated by ", ") of the webpages and abstracts where a language was encountered will follow each language value in parenthesis.
language_biotools
Contains the values of the language attribute of the bio.tools entries corresponding to the bio.tools IDs in existing, that is, if the current entry constructed by Pub2Tools is found to be existing in bio.tools, then the languages currently in bio.tools are output here to contrast with the values in the column language. Languages of a bio.tools entry are separated by " ; " and languages of different entries are separated by " | ".
oa
true, if the publication is Open Access (according to PubFetcher’s oa field of the publication). Values of different publications are separated by " | ". This information is obtained merely as a side effect of fetching publications in -fetch-pub and it is not used anywhere in Pub2Tools.
journal_title
Journal titles of publications (separated by " | ") as got from the PubFetcher journalTitle field. Journal titles are used as part of the publication IDs selection process in -select-pub and in excluding a few publications from certain journals.
pub_date
Publication dates of publications (separated by " | ") as got from the PubFetcher pubDateHuman field (the value of the pubDate field follows in parenthesis). The publication date is the date of first publication, whichever is first, electronic or print publication, which is not the same as the “CREATION_DATE” used in -select-pub. Therefore, if Pub2Tools is run for some concrete month (using --month), then not all publications will necessarily have a publication date from that month (it can be from a previous month, but for some upcoming publications also from a future month). Currently, the publication date is used only to calculate citations_count_normalised.
citations_count
Numbers (separated by " | ") showing how many times publications have been cited as got from the PubFetcher citationsCount field. This information is obtained from Europe PMC, which usually has lower numbers than other citation databases. Furthermore, if Pub2Tools is run on recent publications, then the value is usually 0, as not enough time has passed for others to cite the articles. The count can be normalised by pub_date, giving the value in citations_count_normalised.
citations_timestamp
The timestamps (separated by " | ") when citations_count of publications were last updated as got from the PubFetcher citationsTimestampHuman field (the value of the citationsTimestamp field follows in parenthesis). Used when calculating citations_count_normalised.
citations_count_normalised
The citations_count normalised by pub_date. The exact formula is citations_count / (citations_timestamp - pub_date) * 1000000000, where the unit of citations_timestamp and pub_date is milliseconds (since Unix epoch). Currently, the result is not used anywhere in Pub2Tools, but it might be useful for prioritising or selecting candidates from a large batch of older publications.
corresp_author_name
Names of the corresponding authors of the publications as got from the PubFetcher correspAuthor field. The names of corresponding authors of a publication are separated by " ; " and values from different publications are separated by " | ".
credit_name_biotools
Contains the values of the credit name attribute of the credit group of the bio.tools entries corresponding to the bio.tools IDs in existing, that is, if the current entry constructed by Pub2Tools is found to be existing in bio.tools, then the credit names currently in bio.tools are output here to contrast with the values in the column corresp_author_name. Values of different credit name attributes of a bio.tools entry are separated by " ; " and values from different bio.tools entries are separated by " | ".
corresp_author_orcid
Like corresp_author_name, but for ORCID iDs of corresponding authors.
credit_orcidid_biotools
Like credit_name_biotools, but for the ORCID iD attribute.
corresp_author_email
Like corresp_author_name, but for e-mails of corresponding authors.
credit_email_biotools
Like credit_name_biotools, but for the email attribute.
corresp_author_phone
Like corresp_author_name, but for telephone numbers of corresponding authors.
corresp_author_uri
Like corresp_author_name, but for web pages of corresponding authors.
credit_url_biotools
Like credit_name_biotools, but for the URL attribute.
credit
The credit is constructed in the second pass (-pass2) from the corresponding authors of publications (with possible duplicates being merged). The name, ORCID iD, e-mail and URL can be filled, with only non-empty values output to the column and separated by ", " and values of different credits separated by " | ". The value of this column is suggested as the content of the credit attribute of the tool for bio.tools.
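
The normalisation formula given above for citations_count_normalised can be sketched in Python as follows. This only illustrates the arithmetic and is not Pub2Tools code: the function name and the handling of a non-positive time span (returning None) are assumptions made here, as is the example publication.

```python
from typing import Optional


def citations_count_normalised(citations_count: int,
                               citations_timestamp_ms: int,
                               pub_date_ms: int) -> Optional[float]:
    """citations_count / (citations_timestamp - pub_date) * 1000000000.

    Both time arguments are milliseconds since the Unix epoch.
    """
    elapsed_ms = citations_timestamp_ms - pub_date_ms
    if elapsed_ms <= 0:
        # Assumption: the documentation does not say what happens in this case.
        return None
    return citations_count / elapsed_ms * 1_000_000_000


# Hypothetical publication cited 12 times in the 100 days after publication.
MS_PER_DAY = 24 * 60 * 60 * 1000
value = citations_count_normalised(12, 100 * MS_PER_DAY, 0)
print(round(value, 2))  # 1.39
```

Since the unit is milliseconds, the factor 1000000000 means the result is the number of citations per 10^9 ms, that is, per roughly 11.6 days.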

diff.csv columns

biotools_id

The first column lists the bio.tools ID of an existing bio.tools entry the current row of suggestions is about. If a new entry constructed by Pub2Tools is determined to be existing in bio.tools, then it will not be output to to_biotools.json, but instead redirected here. Values of both the new entry and the entry existing in bio.tools are output to results.csv and the corresponding row there can be found by searching for the ID present here in the column existing of results.csv.

However, if no differences are found between the new entry and the entry existing in bio.tools (and possibly_related is also empty), then nothing is output to diff.csv either. To be more precise, by differences we mean clashes between values of the new entry and the bio.tools entry or values which exist only in the new entry – so values that exist in the bio.tools entry and not in the new entry constructed by Pub2Tools are not considered to be different and nothing is suggested about them.

score_score2
A combined score (either equal to score2 or to score + 10000 in case score2 is not calculated) of a new entry constructed by Pub2Tools, which more or less shows the confidence that the correct tool name was extracted from the publication(s) in the new entry. Entries of the diff.csv spreadsheet are sorted by this score, unless there are multiple entries with the same biotools_id, in which case these entries are grouped together next to the highest scored such entry (this can happen for example when a bio.tools entry has multiple publications and distinct new Pub2Tools entries each match one of these publications).
current_publications
The publication IDs (separated by " | ") of the existing bio.tools entry. The value in this column is only filled if any of the columns modify_publications, add_publications or modify_name contain some non-empty value.
modify_publications

Contains publication IDs of the new entry constructed by Pub2Tools that have a conflict with some existing publication IDs of the current bio.tools entry. A conflict means that there is a match between some members of the publication ID triplets [PMID, PMCID, DOI] of the entries, but some other non-empty members are not equal. This indicates a mistake either in bio.tools (which happened for example when manually entering a publication ID) or in the entry constructed by Pub2Tools (where publication information came from an external service, like Europe PMC). So publication IDs here could be compared to the corresponding publication IDs in current_publications and by checking the publication online it can be decided which one is correct and if modifications have to be made in bio.tools.

Note

In principle, this column could also contain cases, where some existing publication ID has some empty parts (PMID, PMCID or DOI), which could be filled by information found by Pub2Tools, however such cases are not output here as such filling could be done automatically without any need for curation (see https://github.com/bio-tools/biotoolsLint/issues/2#issuecomment-427509431).
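
The conflict rule described for modify_publications can be illustrated with a short Python sketch. It is not the Pub2Tools implementation: the PubIds type and the function names are hypothetical, ID comparison is simplified to trimmed, case-insensitive string equality, and the second DOI in the example is made up.

```python
from typing import NamedTuple, Optional


class PubIds(NamedTuple):
    """A publication ID triplet; a missing part is None (hypothetical type)."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None


def _norm(value: Optional[str]) -> Optional[str]:
    # Simplification: trim and lowercase; an empty string counts as missing.
    if value is None or not value.strip():
        return None
    return value.strip().lower()


def is_conflict(new: PubIds, existing: PubIds) -> bool:
    """True if some part matches while another non-empty part differs."""
    pairs = [(_norm(a), _norm(b)) for a, b in zip(new, existing)]
    # Only parts that are filled on both sides can match or clash.
    both = [(a, b) for a, b in pairs if a is not None and b is not None]
    any_match = any(a == b for a, b in both)
    any_clash = any(a != b for a, b in both)
    return any_match and any_clash


in_biotools = PubIds(pmid="32942983", doi="10.1186/s12859-020-03722-z")
same_pmid_other_doi = PubIds(pmid="32942983", doi="10.1186/s12859-020-00000-0")
fills_a_gap = PubIds(pmid="32942983", pmcid="PMC7499988")

print(is_conflict(same_pmid_other_doi, in_biotools))  # True
print(is_conflict(fills_a_gap, in_biotools))          # False
```

The second case only has an empty part that could be filled in, so it is not treated as a conflict, in line with the note above.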

add_publications
Contains publication IDs (separated by " | ") of the new entry constructed by Pub2Tools that are missing in the matched existing entry currently in bio.tools. Thus, the publication IDs listed here could be added to the existing bio.tools entry. However, sometimes the suggestion in this column is wrong (for example, when suggestions were merged incorrectly in Pub2Tools because the names of distinct tools were exactly equal), but sometimes a value here could also indicate mistakes in bio.tools (like an incorrect publication attached to a tool or the same tool duplicated in bio.tools, but with different publications).
current_name
The name of the existing bio.tools entry. The value in this column is only filled if modify_name contains some non-empty value, that is, if it is suggested to change the name currently in bio.tools.
modify_name

Contains the name suggestion of the new entry constructed by Pub2Tools if it differs from the name currently existing in bio.tools (output to current_name). Whether the name should actually be modified in bio.tools, is up to the curator.

In many cases, both current_name and modify_name list quite obviously the same tool name, but with a slight difference in capitalisation, punctuation, whitespace, version number being present, name being an acronym, etc. And these small differences can matter, for example the tools coMET (1), Comet (2), CoMet (3) or PRISM (1), PriSM (2), PrISM (3) are all distinct tools with the only difference in the names being the capitalisation.

Note

Pub2Tools doesn’t really take into account the Curators Guide’s rules for the name attribute, thus in some cases the value in current_name will actually be correct.

In some cases, very different names are listed by current_name and modify_name. This can happen, if a wrong publication is attached to a tool in bio.tools, if Pub2Tools failed to extract the correct name, if a bio.tools entry is a conglomeration of differently named subtools, if a very general publication is attached to a more specific constituent subtool, if an attached publication is only indirectly related to the tool, etc.

The lower in the table, the more probable it is, that Pub2Tools failed to extract the correct name, thus for entries with “very low” confidence (score_score2 is less than 1072.1) the columns current_name and modify_name will be empty even if there are differences in names.

possibly_related
Contains bio.tools IDs (separated by " | ", with each ID followed by the name in parenthesis) of existing entries of bio.tools that might be related to the new entry constructed by Pub2Tools. It lists entries where evidence was not enough to say that the new entry is a duplicate of the listed entries. This happens, when names were matched (name_existing_publication_different or name_match), but no publications, links or credits could additionally be matched, or when solely some links could be matched (link_match). As such, this column contains mostly unrelated entries, however, sometimes the entries could actually be related and require some curation decisions (removal, combining of entries, etc).
current_homepage
The homepage of the existing bio.tools entry (also output to homepage_biotools of results.csv). Not filled, if modify_homepage is empty. If the homepage is determined to be broken in bio.tools, then (homepage_status: 1) will follow the URL. If it is determined to be broken by Pub2Tools, then (broken) will follow.
modify_homepage

The new homepage as suggested by Pub2Tools. A new homepage is suggested as replacement for current_homepage if the homepage of the new entry constructed by Pub2Tools does not match the homepage of the existing bio.tools entry and one of the following holds: current_homepage is broken (according to both bio.tools and Pub2Tools) or the URL of the new homepage is determined to be a link with type “Other”. Note, that the new and existing homepages are also considered equal if they redirect to the same final URL, also, www, index.html, etc are ignored and comparison of the domain name part is done case-insensitively.

If current_homepage is suggested to be replaced, then Pub2Tools might add the URL in current_homepage to add_links, add_downloads or add_documentations, that is, the homepage of the existing bio.tools entry should not simply be thrown away but added to some other bio.tools link attribute. If current_homepage is not suggested to be replaced, then this column will be empty and Pub2Tools might instead add the homepage of the new entry to add_links, add_downloads or add_documentations.

The URL suggested as the new homepage has the limitation that it must have occurred somewhere in a publication abstract or full text. This means that the URL in current_homepage might actually be a better homepage that just doesn’t occur in the publication text. It’s up to the curator to decide whether to perform the replacement – and if the replacement is not done, then the new homepage should not simply be thrown away, but considered for addition to link, download or documentation beforehand. The new homepage extracted by Pub2Tools could also be plainly incorrect and the probability of this increases the further down the entries we move. So, if confidence is “very low” (score_score2 is less than 1072.1), then the new homepage is always thrown away and current_homepage and modify_homepage will always be empty.
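
The URL comparison rules mentioned for modify_homepage (ignoring www and index.html, comparing the domain name case-insensitively) can be approximated as below. This is a sketch under several assumptions and not the code Pub2Tools uses: the exact list of ignored parts (the "etc" in the text) is not specified, ignoring the scheme and a trailing slash is a choice made here, and the redirect check is left out because it needs network access. The second URL in the first comparison is a made-up variant of the PAWER homepage.

```python
from urllib.parse import urlsplit

# Assumption: only these index file names are ignored; the real list may differ.
INDEX_FILES = {"index.html", "index.htm", "index.php"}


def homepage_key(url: str) -> tuple:
    """Reduce a URL to a (host, port, path, query) key for approximate equality."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()  # domain part compared case-insensitively
    if host.startswith("www."):
        host = host[len("www."):]
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[-1].lower() in INDEX_FILES:
        segments.pop()
    # The path itself stays case-sensitive.
    return (host, parts.port, "/".join(segments), parts.query)


def same_homepage(a: str, b: str) -> bool:
    return homepage_key(a) == homepage_key(b)


print(same_homepage("https://biit.cs.ut.ee/pawer",
                    "http://WWW.biit.cs.ut.ee/pawer/index.html"))  # True
print(same_homepage("https://biit.cs.ut.ee/pawer",
                    "https://biit.cs.ut.ee/PAWER"))                # False
```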

current_links
URLs currently in the link attribute of the existing bio.tools entry (also output to link_biotools of results.csv). Links are separated by " | " and each URL is followed by the link type in parenthesis. Not filled, if no new links to add are present in the entry constructed by Pub2Tools (that is, add_links is empty) or if there are simply no link attribute links currently in the existing bio.tools entry.
add_links
URLs from link of the new entry constructed by Pub2Tools that are missing in the currently existing entry of bio.tools and thus could be added there. Links are separated by " | " and each URL is followed by the link type in parenthesis. Sometimes, a link could be incorrectly categorised, as whether it should go to link, download or documentation is based solely on the URL string. Also, if confidence is “very low” (score_score2 is less than 1072.1), then confidence in the correctness of the new links found by Pub2Tools is too low and thus these new links will be thrown away and current_links and add_links will be empty.
current_downloads
Like current_links, but concerning the download attribute and download_biotools.
add_downloads
Like add_links, but concerning download and adding to current_downloads.
current_documentations
Like current_links, but concerning the documentation attribute and documentation_biotools.
add_documentations
Like add_links, but concerning documentation and adding to current_documentations.
current_license
The license currently set as the value of the license attribute of the existing bio.tools entry (also output to license_biotools of results.csv). Not filled, if modify_license is empty, that is, no licenses were extracted by Pub2Tools for the new entry or the found license is equal to the license in the existing bio.tools entry.
modify_license
The license of the new entry constructed by Pub2Tools that should replace the (either different or missing) license information of the existing bio.tools entry displayed in current_license. New license information is extracted from web pages (mostly repositories, like GitHub and Bioconductor) and publication abstracts, which means we can add provenance information, that is web page URLs and publication IDs (separated by ", "), after the license string in parenthesis. If confidence is “very low” (score_score2 is less than 1072.1), then confidence in the correctness of the extracted tool name and thus in the correctness of the extracted web pages is too low, so in that case only license information extracted from publication abstracts is considered (that is, license_abstract is used instead of license).
current_languages
The languages (separated by " | ") currently set as the value of the language attribute of the existing bio.tools entry (also output to language_biotools of results.csv). Not filled, if add_languages is empty, that is, no languages were extracted by Pub2Tools for the new entry or all found languages are already present in the existing bio.tools entry.
add_languages
A list of language strings (separated by " | ") from the new entry constructed by Pub2Tools that are different from the languages in the existing bio.tools entry (displayed in current_languages) and thus should be added there. New language information is extracted from web pages (mostly repositories, like GitHub and Bioconductor) and publication abstracts, which means we can add provenance information, that is web page URLs and publication IDs (separated by ", "), after each language string in parenthesis. If confidence is “very low” (score_score2 is less than 1072.1), then confidence in the correctness of the extracted tool name and thus in the correctness of the extracted web pages is too low, so in that case only language information extracted from publication abstracts is considered (that is, language_abstract is used instead of language).
current_credits
The credit information currently set as the value of the credit attribute of the existing bio.tools entry (also output to credit_name_biotools, credit_orcidid_biotools, credit_email_biotools and credit_url_biotools of results.csv). The credit entries are separated by " | " with each entry in the form name, ORCID iD, e-mail, URL, where any missing attribute is simply omitted. Not filled, if modify_credits and add_credits are empty.
modify_credits
Credit entries from the new entry constructed by Pub2Tools that have a match with an existing credit in current_credits through the name, ORCID iD or e-mail (a match does not mean equality, for example a person’s name can be written with an academic title and abbreviated middle name, while omitting accents), but where the new credit has information missing in the existing credit or there are slight differences in the name, ORCID iD or e-mail. Whether the missing information or the slight variations are important is left for the curator to decide.
add_credits
Credit entries from the new entry constructed by Pub2Tools that are missing in the existing bio.tools entry (displayed in current_credits) and thus could possibly be added to the existing entry. Credits are displayed as in current_credits: separated by " | " with each credit in the form name, ORCID iD, e-mail, URL, where any missing attribute is simply omitted. One possible caveat: if bio.tools contains only a person’s e-mail and Pub2Tools extracts only the name of the same person, then these cannot be automatically connected currently and the name is added here instead of the correct column modify_credits.
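
The score_score2 column, the “very low” threshold and the row ordering described above can be sketched as follows. This is an illustration only: the function names, the row dictionaries and the tool IDs are hypothetical, and descending order (best score first) is inferred from the remark that entries lower in the table are less reliable.

```python
from typing import Dict, List, Optional

VERY_LOW_LIMIT = 1072.1  # threshold quoted in the column descriptions above


def combined_score(score: float, score2: Optional[float]) -> float:
    """score2 if it was calculated, otherwise score + 10000."""
    return score2 if score2 is not None else score + 10000


def is_very_low(score_score2: float) -> bool:
    return score_score2 < VERY_LOW_LIMIT


def order_rows(rows: List[Dict]) -> List[Dict]:
    """Sort by score, keeping rows with the same biotools_id next to their best row."""
    best: Dict[str, float] = {}
    for row in rows:
        key = row["biotools_id"]
        best[key] = max(best.get(key, float("-inf")), row["score_score2"])
    return sorted(
        rows,
        key=lambda r: (-best[r["biotools_id"]], r["biotools_id"], -r["score_score2"]),
    )


rows = [
    {"biotools_id": "tool_a", "score_score2": combined_score(900.0, 1000.0)},
    {"biotools_id": "tool_b", "score_score2": combined_score(2000.0, None)},
    {"biotools_id": "tool_a", "score_score2": combined_score(1000.0, None)},
]
for row in order_rows(rows):
    print(row["biotools_id"], row["score_score2"], is_very_low(row["score_score2"]))
```

Here the two tool_a rows stay together after tool_b, even though the weaker tool_a row (1000.0, flagged as very low) would otherwise sort far below.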

to_biotools.json attributes

The final results file to_biotools.json will contain entries where include is true and existing is empty. It is a JSON file containing a number (named “count”) specifying how many entries there are and an array (named “list”) containing each entry as a JSON object with the following structure:

name
The name of the tool from suggestion. The name is not necessarily unique within a JSON file – equal names are indeed merged into one entry, but this is not done for entries with a “very low” confidence. Generating a unique bio.tools ID is also not done, this is left to the importer of the JSON file.
description
The description candidates and messages to the curator from description.
homepage
The homepage of the tool from homepage.
function[]

The function attribute is an array containing EDAM operations (but also EDAM data and format) found by the -map step. Pub2Tools outputs all found EDAM operations under one function (see tool functions), so the size of the array is always 1 when any EDAM operations are found.

operation[]

An array containing the found EDAM operation terms.

uri
The URI of the EDAM term.
term
The label of the EDAM term.
note
The -map step can also propose candidate EDAM terms from the data and format branches (if requested), however, these will need to be divided into the input object and output object and EDAMmap can’t differentiate between inputs and outputs. Thus, EDAM data and format terms will be output under note as a string with the following format: EDAM_URI (EDAM_label) | EDAM_URI (EDAM_label) | ....
topic[]

The topic attribute is an array containing the EDAM topic terms found by the -map step.

uri
The URI of the EDAM term.
term
The label of the EDAM term.
language[]
An array containing the strings of all languages of the tool from language. Unfortunately, biotoolsSchema does not leave space for outputting the web page URLs and publication IDs where these languages were found, so if this extra information seems important for making curation decisions, then it can be looked up from the language column of results.csv.
license
The license of the tool from license. Unfortunately, biotoolsSchema does not leave space for outputting the web page URLs and publication IDs where the license was found from, so if this extra information seems important for making curation decisions, then it can be looked up from the license column of results.csv.
link[]

An array of miscellaneous links of the tool from link.

url
The URL of the link.
type[]
The link type; an array with exactly one element as currently Pub2Tools only finds exactly one type for each link.
download[]

An array of download links of the tool from download.

url
The URL of the link.
type[]
The download type; an array with exactly one element as currently Pub2Tools only finds exactly one type for each download link.
documentation[]

An array of documentation links of the tool from documentation.

url
The URL of the link.
type[]
The documentation type; an array with exactly one element as currently Pub2Tools only finds exactly one type for each documentation link.
publication[]

The publication attribute is an array filled with publications where the tool was extracted from. Normally, one publication can produce one tool entry for bio.tools, but sometimes multiple tool suggestions can be merged into one result, thus the size of the array can be greater than 1.

doi
The DOI of a publication from doi.
pmid
The PMID of a publication from pmid.
pmcid
The PMCID of a publication from pmcid.
credit[]

An array of credits of the tool from credit.

name
The name of a credit.
email
The e-mail of a credit.
url
The URL of a credit.
orcidid
The ORCID iD of a credit.
typeEntity
The entity type of a credit. Always “Person”, because currently the only source for credits is the corresponding authors of publications.
confidence_flag
From confidence, so either “high”, “medium”, “low” or “very low”.

Note

Empty or null values will be omitted from the output.

As an example, consider the following new entry:

{
  "name" : "PAWER",
  "description" : "Protein Array Web ExploreR.\n\npaweR is an R package for analysing protein microarray data.\n\nWeb interface for PAWER tool (https://biit.cs.ut.ee/pawer/).",
  "homepage" : "https://biit.cs.ut.ee/pawer",
  "function" : [ {
    "operation" : [ {
      "uri" : "http://edamontology.org/operation_3435",
      "term" : "Standardisation and normalisation"
    }, {
      "uri" : "http://edamontology.org/operation_0337",
      "term" : "Visualisation"
    }, {
      "uri" : "http://edamontology.org/operation_2436",
      "term" : "Gene-set enrichment analysis"
    } ],
    "note" : "http://edamontology.org/data_2603 (Expression data) | http://edamontology.org/data_2082 (Matrix) | http://edamontology.org/data_0958 (Tool metadata) | http://edamontology.org/data_3932 (Q-value) | http://edamontology.org/format_3829 (GPR) | http://edamontology.org/format_1208 (protein) | http://edamontology.org/format_3752 (CSV)"
  } ],
  "topic" : [ {
    "uri" : "http://edamontology.org/topic_3518",
    "term" : "Microarray experiment"
  }, {
    "uri" : "http://edamontology.org/topic_0121",
    "term" : "Proteomics"
  }, {
    "uri" : "http://edamontology.org/topic_0203",
    "term" : "Gene expression"
  }, {
    "uri" : "http://edamontology.org/topic_0769",
    "term" : "Workflows"
  }, {
    "uri" : "http://edamontology.org/topic_0632",
    "term" : "Probes and primers"
  } ],
  "language" : [ "R" ],
  "link" : [ {
    "url" : "https://gl.cs.ut.ee/biit/paweR",
    "type" : [ "Other" ]
  }, {
    "url" : "https://gl.cs.ut.ee/biit/pawer_web_client",
    "type" : [ "Other" ]
  } ],
  "publication" : [ {
    "doi" : "10.1101/692905"
  }, {
    "doi" : "10.1186/S12859-020-03722-Z",
    "pmid" : "32942983",
    "pmcid" : "PMC7499988"
  } ],
  "confidence_flag" : "high"
}

The example is missing the following fields: license, because license information could not be extracted from the publication abstract and there were also no (usually repository) links where this information could be found; credit, which has been manually removed from the example; download and documentation, as none of the links that were matched to the name of the tool and extracted from the publication abstract and full text is categorised as such.
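
A consumer of to_biotools.json could read the file along the following lines. This is a minimal sketch and not part of Pub2Tools: a trimmed copy of the example above is embedded as a string so that the snippet runs on its own, and parse_note is a hypothetical helper for the EDAM_URI (EDAM_label) | ... format of note.

```python
import json
import re

# A trimmed stand-in for to_biotools.json; in practice, read the file with json.load().
SAMPLE = """
{
  "count" : 1,
  "list" : [ {
    "name" : "PAWER",
    "homepage" : "https://biit.cs.ut.ee/pawer",
    "function" : [ {
      "operation" : [ {
        "uri" : "http://edamontology.org/operation_0337",
        "term" : "Visualisation"
      } ],
      "note" : "http://edamontology.org/data_2082 (Matrix) | http://edamontology.org/format_3752 (CSV)"
    } ],
    "confidence_flag" : "high"
  } ]
}
"""

NOTE_ITEM = re.compile(r"^(?P<uri>\S+) \((?P<term>.*)\)$")


def parse_note(note: str) -> list:
    """Split 'EDAM_URI (EDAM_label) | ...' into (uri, label) pairs."""
    pairs = []
    for item in note.split(" | "):
        match = NOTE_ITEM.match(item.strip())
        if match:
            pairs.append((match.group("uri"), match.group("term")))
    return pairs


data = json.loads(SAMPLE)
print(data["count"], "entries")

for entry in data["list"]:
    # Empty or null attributes are omitted from the file, so use .get() with defaults.
    license_value = entry.get("license", "")
    for function in entry.get("function", []):
        operations = [op["term"] for op in function.get("operation", [])]
        data_and_formats = parse_note(function.get("note", ""))
        print(entry["name"], operations, data_and_formats, repr(license_value))
```

Because note mixes data and format terms, a curator still has to decide which of the parsed terms describe inputs and which describe outputs.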

Performance

On the 6th of August 2019, Pub2Tools was run for the months of May, June and July 2019. The results can give a rough estimate of its performance.

Extracting new tools from 1 month worth of publications took Pub2Tools about 1h 40min (1h 15min of it was spent on downloading the publications) with default parameters.

The total number of publications returned from Europe PMC for CREATION_DATE:[2019-05-01 TO 2019-05-31] was around 123000, for June the number was 115000 and for July 111000. After prefiltering with the -select-pub step (with default options), these numbers were reduced to 2429, 2365 and 2253 for May, June and July respectively. So, such prefiltering reduces the number of publications to be fetched to around 2% of those initially available (of course, the cost is that a few valid publications will also be thrown away). After running all the steps of Pub2Tools, the number of entries written to to_biotools.json was 689, 670 and 670 for May, June and July respectively. A manual inspection of the results revealed that around 20% of the entries were not publications about tools, databases or services and thus were unsuitable for bio.tools (but even for “very low” confidence entries, roughly half seemed to be about a tool, though the name was quite often wrongly extracted). So, in the year 2019, roughly 500 new entries per month could be added to bio.tools, which is a bit less than 0.5% of all articles available through PubMed. In addition to the new entries, some results were found to be already existing in bio.tools: the file diff.csv contained 82, 37 and 29 entries for May, June and July respectively.
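
The percentages quoted above follow directly from the counts; the short Python calculation below reproduces them. It uses the approximate totals from the text, and the 80% share of suitable entries is the rough estimate from the manual inspection, not a measured value.

```python
# Approximate counts quoted above:
# (returned by Europe PMC, kept by -select-pub, written to to_biotools.json)
months = {
    "2019-05": (123000, 2429, 689),
    "2019-06": (115000, 2365, 670),
    "2019-07": (111000, 2253, 670),
}

SUITABLE_SHARE = 0.8  # rough estimate: around 20% of entries were unsuitable

kept_fraction = {}
suitable_entries = {}
for month, (total, kept, entries) in months.items():
    kept_fraction[month] = kept / total
    suitable_entries[month] = round(entries * SUITABLE_SHARE)
    print(f"{month}: {kept_fraction[month]:.2%} kept by -select-pub, "
          f"about {suitable_entries[month]} suitable entries")
```

This gives 1.97%, 2.06% and 2.03% for the prefiltering step and about 551, 536 and 536 suitable entries, consistent with the rounded figures of 2% and roughly 500 per month.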

On the 1st of October 2020, Pub2Tools was run for the months of August and September 2020. Extracting new tools from 1 month worth of publications took Pub2Tools about 2h with default parameters (around 1h of it was spent on downloading publications and 30min on downloading web pages).

The following table shows the percentage of potential new entries whose attribute was filled with at least some value per each attribute:

attribute      2019-05  2019-06  2019-07  2020-08  2020-09
pmid           75.04%   71.64%   68.21%   68.55%   61.99%
pmcid          43.11%   39.10%   40.60%   35.36%   28.92%
doi            99.85%   99.55%   99.40%   99.67%   98.69%
homepage       80.41%   80.00%   84.18%   85.64%   83.79%
link           17.42%   17.31%   16.57%   18.82%   17.74%
download        1.89%    1.19%    2.84%    3.26%    3.07%
documentation   3.77%    3.13%    4.33%    4.79%    3.50%
license        22.35%   25.52%   25.82%   23.39%   23.77%
language       47.90%   52.69%   54.03%   50.27%   49.73%
credit         42.09%   35.67%   37.91%   72.58%   70.65%

The corresponding figure:

_images/pub2tools_perf.svg

The name and publication are always filled, because all entries are extracted from some publication and a name has to be extracted and chosen. The description is also always filled, though it always requires curation and, in case of missing links, will contain only text from the publication abstract. The homepage is also a required attribute; however, it is reported as unfilled here if a homepage could not be found and the homepage attribute was just filled with a link to the publication.

The falling fill rate of PMID and PMCID points to the growing share of pre-prints (findable through Europe PMC). For 2020-08 and 2020-09 the fill rate of credit information is a lot higher because in January 2020 support was added for getting corresponding author information directly from web pages of articles of many journals (resolved through the publication DOI) in addition to getting corresponding authors information from PubMed Central (i.e. only for publications that have a PMCID).

Note

Pub2Tools sometimes also extracts and writes incorrect information to an attribute (except publication and credit information, which is mostly correct), so the percentages presented in the table would be slightly lower if only correctly filled attributes were taken into account. On the other hand, if only high confidence entries were taken into account, then the fill rates of homepage, link, license, language and credit would be roughly 10 percentage points higher.