Project:Analytics/PubPeer

From Librarybase
Revision as of 00:52, 8 January 2026

API

https://dashboards.pubpeer.com/docs/api#/operations/partner

Relevant parameters:

  • page: start at 1, then increment for as long as there are more results
  • per_page: set to the maximum value, 300
  • sort:
  • published_at: refers to when the document itself was published, not when it was commented on; only comments matter here
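
The page/per_page iteration described above could be sketched roughly as follows. This is only an illustration: fetch_page is a hypothetical callable standing in for the actual API request, and the stopping rule (a page shorter than per_page means there is nothing more) is an assumption, since these notes do not say how the API signals that more results exist.

```python
from __future__ import annotations

from typing import Callable, Iterator

PER_PAGE_MAX = 300  # maximum per_page value noted in the parameter list


def iterate_results(
    fetch_page: Callable[[int, int], list[dict]],
    per_page: int = PER_PAGE_MAX,
) -> Iterator[dict]:
    """Yield every result, requesting pages 1, 2, 3, ... in order.

    fetch_page(page, per_page) is a hypothetical helper that performs one
    API request and returns that page's results as a list of dicts.
    Assumption: a page with fewer than per_page items is the last one.
    """
    page = 1
    while True:
        results = fetch_page(page, per_page)
        yield from results
        if len(results) < per_page:
            break  # short (or empty) page: nothing more to fetch
        page += 1
```

Keeping the HTTP call behind fetch_page means the loop can be exercised with a fake function before any real request is made.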

Resources

  • Wikimedia Cloud Services
    • Toolforge: project "pubpeer"
    • Cloud VPS: project "wikicite", Trove DB instance

Process

  • Initial seed:
    • Build pageset
      • Pull usages of identifiers from Wikimedia Cloud DB Replicas
      • Create database table (wikipedia):
        • id (incrementing key)
        • language_code
        • mw_page_id
        • mw_page_title (probably needs a refresh step before the full process runs)
    • API query: 2006-01-01...2025-12-31
    • Iterate through as many pages as needed to get to the end
    • Build internal database (pubpeer_articles table):
      • id_pubpeer (key)
      • id_doi (update on conflict)
      • id_pubmed (update on conflict)
      • id_arxiv (update on conflict)
    • Build minimal citations database (citations table):
      • id_pubpeer (key to pubpeer_articles table)
      • id_wiki_page (key to wikipedia table)
      • time_last_updated_table (null when created)
      • time_last_talk_page_post (null when created)
      • time_most_recent_comment (on conflict, update if submitted > stored)
    • Post initial report to wiki as a table
    • Post initial notification that the report is posted
  • Subsequent builds:
    • Get most recent time_most_recent_comment from database
    • API query: that date...present day
    • Iterate through as many result pages as needed (probably only one page)
    • Submit into database, which should transparently handle conflicts
    • Build new wiki table based on citations database table
      • Check database for
        • null time_last_updated_table
        • time_most_recent_comment > time_last_updated_table
    • Generate alerts describing changes to the table
      • Retire old notifications to a subpage
  • If/when talk page notifications are approved:
    • Check database for
      • null time_last_talk_page_post
      • time_most_recent_comment > time_last_talk_page_post
    • Queue up talk pages to notify
    • Check whether a message is already present on the talk page
      • No message comment present: add post to talk page
      • Message comment present: skip the talk page
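
The three tables and the conflict handling described in the process above might look something like the sketch below. It uses Python's sqlite3 module purely as a stand-in so the example is self-contained; the actual Trove DB engine is not stated here, and the ON CONFLICT syntax shown is the SQLite/PostgreSQL form (SQLite 3.24 or later). The column types, the composite key on citations, the uniqueness constraint on wikipedia, the function names, and the use of ISO 8601 text timestamps (which compare correctly as strings) are all assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS wikipedia (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    language_code TEXT NOT NULL,
    mw_page_id INTEGER NOT NULL,
    mw_page_title TEXT,
    UNIQUE (language_code, mw_page_id)
);
CREATE TABLE IF NOT EXISTS pubpeer_articles (
    id_pubpeer TEXT PRIMARY KEY,
    id_doi TEXT,
    id_pubmed TEXT,
    id_arxiv TEXT
);
CREATE TABLE IF NOT EXISTS citations (
    id_pubpeer TEXT NOT NULL REFERENCES pubpeer_articles (id_pubpeer),
    id_wiki_page INTEGER NOT NULL REFERENCES wikipedia (id),
    time_last_updated_table TEXT,   -- null when created
    time_last_talk_page_post TEXT,  -- null when created
    time_most_recent_comment TEXT,
    PRIMARY KEY (id_pubpeer, id_wiki_page)
);
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_article(conn, id_pubpeer, id_doi=None, id_pubmed=None, id_arxiv=None):
    """Insert an article; on conflict, overwrite the external identifiers."""
    conn.execute(
        """INSERT INTO pubpeer_articles (id_pubpeer, id_doi, id_pubmed, id_arxiv)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (id_pubpeer) DO UPDATE SET
               id_doi = excluded.id_doi,
               id_pubmed = excluded.id_pubmed,
               id_arxiv = excluded.id_arxiv""",
        (id_pubpeer, id_doi, id_pubmed, id_arxiv),
    )


def upsert_citation(conn, id_pubpeer, id_wiki_page, comment_time):
    """Insert a citation; on conflict, keep whichever comment time is newer."""
    conn.execute(
        """INSERT INTO citations (id_pubpeer, id_wiki_page, time_most_recent_comment)
           VALUES (?, ?, ?)
           ON CONFLICT (id_pubpeer, id_wiki_page) DO UPDATE SET
               time_most_recent_comment = excluded.time_most_recent_comment
           WHERE citations.time_most_recent_comment IS NULL
              OR excluded.time_most_recent_comment > citations.time_most_recent_comment""",
        (id_pubpeer, id_wiki_page, comment_time),
    )


def latest_comment_time(conn):
    """Start date for a subsequent build (None if the table is empty)."""
    return conn.execute(
        "SELECT MAX(time_most_recent_comment) FROM citations"
    ).fetchone()[0]


def rows_needing_table_update(conn):
    """Citations never written to the wiki table, or commented on since."""
    return conn.execute(
        """SELECT id_pubpeer, id_wiki_page FROM citations
           WHERE time_last_updated_table IS NULL
              OR time_most_recent_comment > time_last_updated_table"""
    ).fetchall()
```

The talk page check would follow the same pattern as rows_needing_table_update, with time_last_talk_page_post in place of time_last_updated_table.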