Ideas for future

Ideas emerge from time to time and are written down here for future reference. A written-down idea is not necessarily a good one, so not every point here should be implemented.

  • Fix potential systematic issues discovered through feedback from the curation of Pub2Tools results.
  • Currently, Pub2Tools has been tuned towards producing fewer false positives (for example, roughly half of even the entries with “very low” confidence are about tools), so that curators are not discouraged from looking at all the results of Pub2Tools. But maybe the number of false negatives could also be decreased further without too much cost, for example by running Pub2Tools on existing bio.tools content and checking whether there are ways to reduce the number of entries where include is false.
  • Also, if some entries are added through some other means than Pub2Tools, then Pub2Tools could be run on publications of these entries and then it could be checked how many of these entries Pub2Tools fails to include and why.
  • The default values of the mapping parameters in the -map step are equal to the default values defined for the parameters in EDAMmap. Maybe some defaults could be changed for Pub2Tools, for example the number of terms output per branch?
  • Maybe the score or score2 could also be modified by how well the entry could be mapped to EDAM terms by EDAMmap in the -map step?
  • Could the currently unused citations_count_normalised also be used for modifying the goodness of entries? Note that its value would only be useful for older publications, as these have had enough time to be cited.
  • Could MeSH terms be used in -select-pub to prefilter the publications, in addition to or instead of selection based on the occurrence of phrases in the abstract and title? PubFetcher and the mesh field of its database can be used to look at the prevalence of MeSH terms in publications currently in bio.tools. Note that selection by MeSH terms can’t return recent publications, as these terms are not added to articles right away (for example, it might take half a year after the publication date for the terms to appear).
  • How to best present the provenance (i.e. publication IDs and web page URLs where information was found) of the licenses and languages to the curator? This extra information can’t be embedded into to_biotools.json, so it must currently be looked up manually in results.csv.
  • Currently, the merging of supposedly duplicate results is done based on the exact equality of the suggested names only. Maybe further features should be taken into account, like matching links or credits, to reduce the number of wrongly merged results?
  • Further links could potentially be found by looking for links in the content of web pages extracted from the publication abstract and fulltext. But to avoid false positives, maybe only links from the same domain should be considered, and maybe “about”, “help”, etc. should be looked for in the URL string? Also, in some cases further links could be automatically inferred, like a documentation URL for a given CRAN URL.
  • Currently, each link receives exactly one type (like “Repository” or “Mailing list”); however, biotoolsSchema allows more than one type to be specified.
  • Explore the possibilities of extracting tools through other means than publications present in Europe PMC. Maybe from repositories, like GitHub, CRAN, Zenodo, etc?
  • Try running Pub2Tools on all publications from a given period, i.e. without prefiltering with -select-pub. This could be achieved for the Open Access subset of articles, which can be downloaded in bulk (as using PubFetcher to download millions of articles would be too wasteful and slow).
  • Try to automatically fill further attributes, without causing too many false positives, for example the operating system or the bio.tools-specific tool type.
  • Currently, the description is filled with candidate phrases that the curator must choose from or combine. Automatic text summarisation could be tried to construct the final description proposal automatically. Or, one or two of the candidate phrases could simply be chosen automatically (instead of the curator making the choice).
  • Try to figure out which of the description messages are actually useful.
  • The credit is currently filled only from the corresponding authors of publications. Explore other ways of finding credit information; for example, contact information is sometimes mentioned in the publication abstract.
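
As an illustration of the duplicate-merging idea in the list above (taking matching links or credits into account in addition to the name), a minimal sketch in Python follows. It is not taken from Pub2Tools itself: the dictionary keys "name", "links" and "credits", and the functions normalise_link and should_merge, are hypothetical names chosen for this example, and the link normalisation (ignore scheme, host case and trailing slash) is only one possible assumption about what "matching links" could mean.

```python
from urllib.parse import urlparse


def normalise_link(url):
    """Reduce a URL to lower-cased host plus path without trailing slash.

    Assumption: scheme, host case and a trailing slash do not matter
    when deciding whether two links point to the same place.
    """
    parts = urlparse(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def should_merge(a, b):
    """Decide whether two results (hypothetical dicts) look like duplicates.

    Names must be exactly equal, as in the current behaviour described
    above; in addition, at least one link or one credit must be shared.
    """
    if a["name"] != b["name"]:
        return False
    links_a = {normalise_link(u) for u in a.get("links", [])}
    links_b = {normalise_link(u) for u in b.get("links", [])}
    credits_a = set(a.get("credits", []))
    credits_b = set(b.get("credits", []))
    return bool(links_a & links_b) or bool(credits_a & credits_b)
```

With such a rule, two results named identically but sharing no link and no credit would be kept apart, at the cost of possibly leaving some true duplicates unmerged.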