Project:Analytics/PubPeer

API

https://dashboards.pubpeer.com/docs/api#/operations/partner

Relevant parameters (a paginated-fetch sketch follows this list):

  • page: start at 1, then keep iterating as long as the API returns more results
  • per_page: set to the maximum value of 300
  • sort:
  • published_at: concerns when the document was published; I only care about comments
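
A minimal sketch of walking the partner endpoint with those parameters, assuming a bearer token and a JSON response whose results live under a data key; the exact URL path, auth scheme, and response shape are assumptions to confirm against the API docs above.

```python
import requests

API_URL = "https://dashboards.pubpeer.com/api/partner"  # assumed path; confirm against the API docs
TOKEN = "..."  # assumed bearer-token auth

def fetch_all(params):
    """Yield every result for a query, advancing `page` until the API stops returning rows."""
    page = 1
    while True:
        response = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={**params, "page": page, "per_page": 300},  # per_page maxes out at 300
            timeout=60,
        )
        response.raise_for_status()
        results = response.json().get("data", [])  # assumed response shape
        if not results:
            break
        yield from results
        page += 1
```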

Resources

  • Wikimedia Cloud Services
    • Toolforge: project "pubpeer"
      • Doesn't appear to be used for anything
    • Cloud VPS: project "wikicite"
      • VM wikicite-refsdb-proc-1.wikicite.eqiad1.wikimedia.cloud
      • Trove DB: ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud

Process

  • Initial seed:
    • Build pageset
      • Pull usages of identifiers from Wikimedia Cloud DB Replicas
      • Create database table (wikipedia); a schema sketch for this and the two tables below follows the Process list:
        • id (incrementing key)
        • language_code
        • mw_page_id
        • mw_page_title (probably should have a process to refresh this before the full process runs)
    • API query: 2006-01-01..2025-12-31
      • To avoid overloading the script, I'll probably go one month at a time, e.g. 2006-01-01..2006-01-31 (see the harvest-loop sketch after this list)
    • Iterate through as many pages as needed to get to the end
    • Build internal database (pubpeer_articles table):
      • id_pubpeer (key)
      • id_doi (update on conflict)
      • id_pubmed (update on conflict)
      • id_arxiv (update on conflict)
    • Build minimal citations database (citations table):
      • id_pubpeer (key to pubpeer_articles table)
      • id_wiki_page (key to wikipedia table)
      • cited_id_type (integer)
        • 0 = unknown/other
        • 1 = doi
        • 2 = pubmed
        • 3 = arxiv
        • Why not use enum? It's easier to add a new value Python-side than it is to carry out a database schema migration
      • cited_id_value (string)
      • time_last_updated_table (null when created)
      • time_last_talk_page_post (null when created)
      • time_most_recent_comment (on conflict, update only if the submitted value is newer than the stored one; see the upsert sketch after this list)
    • Post initial report to wiki as a table
    • Post initial notification that the report is posted
  • Subsequent builds:
    • Get most recent time_most_recent_comment from database
    • API query: from that date to the present day
    • Iterate through as many result pages as needed (probably only one page)
    • Submit into database, which should transparently handle conflicts
    • Build new wiki table based on the citations database table (see the rebuild sketch after this list)
      • Check database for
        • null time_last_updated_table
        • time_most_recent_comment > time_last_updated_table
    • Come up with alerts describing changes to the table.
      • Retire old notifications to a subpage
  • If/when talk page notifications are approved:
    • Check database for
      • null time_last_talk_page_post
      • time_most_recent_comment > time_last_talk_page_post
    • Queue up talk pages to notify
    • Check whether a notification message is already present on the talk page (see the talk-page sketch after this list)
      • No message present: add a post to the talk page
      • Message already present: skip that talk page
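
The following sketches flesh out the steps above. First, the three tables, written as SQL run from Python; this assumes the Trove instance is MySQL/MariaDB-compatible, and the column types, key choices, and unique constraints are guesses beyond the column names listed above.

```python
import pymysql  # assumed MySQL-compatible driver for the Trove instance

SCHEMA = """
CREATE TABLE IF NOT EXISTS wikipedia (
    id INT AUTO_INCREMENT PRIMARY KEY,
    language_code VARCHAR(16) NOT NULL,
    mw_page_id INT UNSIGNED NOT NULL,
    mw_page_title VARCHAR(255) NOT NULL,
    UNIQUE KEY uq_page (language_code, mw_page_id)
);

CREATE TABLE IF NOT EXISTS pubpeer_articles (
    id_pubpeer INT UNSIGNED PRIMARY KEY,
    id_doi VARCHAR(255),
    id_pubmed VARCHAR(32),
    id_arxiv VARCHAR(64)
);

CREATE TABLE IF NOT EXISTS citations (
    id_pubpeer INT UNSIGNED NOT NULL,
    id_wiki_page INT NOT NULL,
    cited_id_type TINYINT NOT NULL,  -- 0 unknown/other, 1 doi, 2 pubmed, 3 arxiv
    cited_id_value VARCHAR(255) NOT NULL,
    time_last_updated_table DATETIME NULL,
    time_last_talk_page_post DATETIME NULL,
    time_most_recent_comment DATETIME NULL,
    PRIMARY KEY (id_pubpeer, id_wiki_page),
    FOREIGN KEY (id_pubpeer) REFERENCES pubpeer_articles (id_pubpeer),
    FOREIGN KEY (id_wiki_page) REFERENCES wikipedia (id)
);
"""

def connect():
    # Connection details are placeholders; real credentials come from configuration.
    return pymysql.connect(host="ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud",
                           user="...", password="...", database="pubpeer")

def create_tables(connection):
    """Run each CREATE TABLE statement against the Trove database."""
    with connection.cursor() as cursor:
        for statement in SCHEMA.split(";"):
            if statement.strip():
                cursor.execute(statement)
    connection.commit()
```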
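A sketch of the initial seed harvest, going one month at a time from 2006-01-01 through 2025-12-31 and reusing the fetch_all() pagination helper sketched under API. The name of the date-range parameter and the upsert helpers are placeholders.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def month_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end one calendar month at a time."""
    current = start
    while current <= end:
        next_month = current + relativedelta(months=1)
        yield current, min(next_month - relativedelta(days=1), end)
        current = next_month

def initial_seed(cursor):
    for window_start, window_end in month_windows(date(2006, 1, 1), date(2025, 12, 31)):
        # "date_range" is a placeholder; the real filter parameter comes from the partner API docs
        params = {"date_range": f"{window_start.isoformat()}..{window_end.isoformat()}"}
        for record in fetch_all(params):
            upsert_article(cursor, record)    # see the upsert sketch below
            upsert_citations(cursor, record)  # hypothetical helper joining against the wikipedia table
```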
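A sketch of the conflict handling: identifier columns simply update on conflict, while time_most_recent_comment only moves forward when the submitted value is newer than the stored one. The integer codes for cited_id_type live in Python, so adding a new identifier type is a code change rather than a schema migration. The record keys are assumptions about the partner API's response shape.

```python
# Integer codes for cited_id_type; adding a new one here avoids a database enum migration.
ID_TYPE_UNKNOWN, ID_TYPE_DOI, ID_TYPE_PUBMED, ID_TYPE_ARXIV = 0, 1, 2, 3

UPSERT_ARTICLE = """
INSERT INTO pubpeer_articles (id_pubpeer, id_doi, id_pubmed, id_arxiv)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    id_doi = VALUES(id_doi),
    id_pubmed = VALUES(id_pubmed),
    id_arxiv = VALUES(id_arxiv)
"""

UPSERT_CITATION = """
INSERT INTO citations
    (id_pubpeer, id_wiki_page, cited_id_type, cited_id_value, time_most_recent_comment)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    time_most_recent_comment = IF(
        time_most_recent_comment IS NULL
        OR VALUES(time_most_recent_comment) > time_most_recent_comment,
        VALUES(time_most_recent_comment),
        time_most_recent_comment
    )
"""

def upsert_article(cursor, record):
    # Keys such as "doi" and "pubmed_id" are guesses; map them to whatever the API actually returns.
    cursor.execute(UPSERT_ARTICLE, (
        record["id"],
        record.get("doi"),
        record.get("pubmed_id"),
        record.get("arxiv_id"),
    ))

def identifier_for(record):
    """Pick the best available identifier and its integer type code for the citations table."""
    if record.get("doi"):
        return ID_TYPE_DOI, record["doi"]
    if record.get("pubmed_id"):
        return ID_TYPE_PUBMED, record["pubmed_id"]
    if record.get("arxiv_id"):
        return ID_TYPE_ARXIV, record["arxiv_id"]
    return ID_TYPE_UNKNOWN, ""
```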
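A sketch of a subsequent build: pick up from the newest stored comment time, fetch anything since then, then select the rows whose on-wiki table entry is stale (null time_last_updated_table, or a newer comment). The wikitable layout at the end is only illustrative.

```python
from datetime import date

LAST_COMMENT_SQL = "SELECT MAX(time_most_recent_comment) FROM citations"

STALE_ROWS_SQL = """
SELECT c.id_pubpeer, c.cited_id_value, w.language_code, w.mw_page_title
FROM citations c
JOIN wikipedia w ON w.id = c.id_wiki_page
WHERE c.time_last_updated_table IS NULL
   OR c.time_most_recent_comment > c.time_last_updated_table
"""

def subsequent_build(cursor):
    cursor.execute(LAST_COMMENT_SQL)
    (last_comment,) = cursor.fetchone()
    if last_comment is None:
        return None  # nothing stored yet; run the initial seed first

    # Fetch everything from the last stored comment date through today, reusing the seed-run upserts.
    window = f"{last_comment.date().isoformat()}..{date.today().isoformat()}"
    for record in fetch_all({"date_range": window}):  # placeholder parameter name, as above
        upsert_article(cursor, record)

    # Rebuild the on-wiki report from the stale rows; one wikitable row per citation.
    cursor.execute(STALE_ROWS_SQL)
    lines = ['{| class="wikitable"', "! Article !! Cited identifier !! PubPeer ID"]
    for id_pubpeer, id_value, lang, title in cursor.fetchall():
        lines.append("|-")
        lines.append(f"| [[:{lang}:{title}]] || {id_value} || {id_pubpeer}")
    lines.append("|}")
    return "\n".join(lines)
```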
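A sketch of the talk-page step (if/when approved), assuming pywikibot and a hidden HTML comment used as a marker so repeat runs can tell that a notice was already posted; the marker string and message text are placeholders.

```python
import pywikibot

MARKER = "<!-- pubpeer-notice -->"  # hypothetical marker comment embedded in the posted message

def notify_talk_page(language_code, page_title, message):
    """Post `message` to the article's talk page unless a marked notice is already there."""
    site = pywikibot.Site(language_code, "wikipedia")
    talk_page = pywikibot.Page(site, f"Talk:{page_title}")
    existing = talk_page.text if talk_page.exists() else ""
    if MARKER in existing:
        return False  # a notice is already present; skip this talk page
    talk_page.text = existing + f"\n\n{MARKER}\n{message}"
    talk_page.save(summary="Notifying about PubPeer commentary on a cited source")
    return True
```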