Project:Analytics/PubPeer

API

https://dashboards.pubpeer.com/docs/api#/operations/partner

Relevant parameters:

  • page: start with 1, then iterate based on whether there are more results
  • per_page: set to the maximum value of 300
  • sort:
  • published_at: concerns when the document was published rather than when comments were posted; I only care about comments
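
A minimal sketch of the paginated pull (the requests library, the endpoint path, the auth header, and the date-filter parameter names are my assumptions here; only page and per_page come from the API docs above):

  # Sketch of a paginated pull from the PubPeer partner API.
  # ASSUMPTIONS: the endpoint path, the auth header, and the date-filter
  # parameter names are guesses; only `page` and `per_page` come from the
  # API docs linked above.
  import requests

  API_URL = "https://dashboards.pubpeer.com/api/partner"  # assumed path
  TOKEN = "..."                                           # partner API token

  def fetch_all(start_date: str, end_date: str):
      """Yield every record in the given date window, page by page."""
      page = 1
      while True:
          resp = requests.get(
              API_URL,
              headers={"Authorization": f"Bearer {TOKEN}"},
              params={
                  "page": page,
                  "per_page": 300,                # maximum allowed value
                  "updated_at_from": start_date,  # assumed parameter name
                  "updated_at_to": end_date,      # assumed parameter name
              },
              timeout=60,
          )
          resp.raise_for_status()
          records = resp.json().get("data", [])
          if not records:
              break                               # no more results
          yield from records
          page += 1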

Resources

  • Wikimedia Cloud Services
    • Toolforge: project "pubpeer"
      • I don't think it's currently being used for anything
    • Cloud VPS: project "wikicite"
      • VM wikicite-refsdb-proc-1.wikicite.eqiad1.wikimedia.cloud
      • Trove DB: ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud

Process

Data collection and indexing

PubPeer Data (index_pubpeer.py)
  • Initial Seed:
    • Starts from 2000-01-01
    • Uses an initial large date window (2000-01-01 to 2014-12-31) followed by 30-day increments
    • API query handles pagination (per_page: 300) and automatically reduces the time window if the result set hits the 10,000-record limit to ensure no data is missed (see the window-splitting sketch after this list)
  • Subsequent Builds:
    • Triggered via python index_pubpeer.py --update
    • Identifies the latest_comment_date from the local database and starts fetching from that date to the present
  • Database Updates:
    • Updates the pubpeer_articles table
    • Fields: id_pubpeer (URL), id_doi, id_pubmed, id_arxiv, title (truncated to 250 chars), and time_last_comment
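
Roughly, the window narrowing works like this (a sketch: fetch and upsert are hypothetical stand-ins for the real API pull and the pubpeer_articles write, and only the 10,000-record cap is taken from the notes above):

  # Sketch of the adaptive date-window logic in index_pubpeer.py.
  # `fetch` and `upsert` are hypothetical callables standing in for the
  # real API pull and the pubpeer_articles write.
  from datetime import date, timedelta
  from typing import Callable, Sequence

  RESULT_CAP = 10_000  # the API stops returning records past this point

  def index_window(start: date, end: date,
                   fetch: Callable[[date, date], Sequence[dict]],
                   upsert: Callable[[Sequence[dict]], None]) -> None:
      records = fetch(start, end)
      if len(records) >= RESULT_CAP and start < end:
          # Window too large to be fully retrievable: split it in half
          # and index each half separately so nothing is missed.
          mid = start + (end - start) // 2
          index_window(start, mid, fetch, upsert)
          index_window(mid + timedelta(days=1), end, fetch, upsert)
      else:
          upsert(records)
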
Wikipedia Citations (index_citations.py)
  • Process:
    • Pulls current external links from Wikimedia Cloud DB Replicas for DOI (org.doi.), PubMed (gov.nih.nlm.ncbi.pubmed.), and arXiv (org.arxiv.)
    • Restricted to Main (0) and Draft (118) namespaces
    • Matches these links against the local pubpeer_articles table (see the query sketch after this list)
  • Database Updates:
    • wikipedia table: Stores language_code, mw_page_id, mw_page_title, and mw_talk_page_id. Page titles are refreshed during each run
    • citations table: Maps id_pubpeer to id_wiki_page
    • Stale Data: Automatically removes citations from the local database if the link has been removed from Wikipedia
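
The replica query behind this step looks roughly like the sketch below (assuming the current externallinks layout with el_to_domain_index/el_to_path, an "https://" scheme prefix in the domain index, and the standard replica credentials file; enwiki is only an example host, since the wikipedia table also records a language_code):

  # Sketch of the DB-replica pull behind index_citations.py.
  # ASSUMPTIONS: the modern externallinks layout and the "https://" scheme
  # prefix in el_to_domain_index; enwiki is used as an example host.
  import os
  import pymysql

  QUERY = """
  SELECT p.page_id, p.page_title, el.el_to_domain_index, el.el_to_path
  FROM externallinks AS el
  JOIN page AS p ON p.page_id = el.el_from
  WHERE p.page_namespace IN (0, 118)              -- Main and Draft
    AND (el.el_to_domain_index LIKE 'https://org.doi.%'
         OR el.el_to_domain_index LIKE 'https://gov.nih.nlm.ncbi.pubmed.%'
         OR el.el_to_domain_index LIKE 'https://org.arxiv.%')
  """

  conn = pymysql.connect(
      host="enwiki.analytics.db.svc.wikimedia.cloud",  # example replica host
      database="enwiki_p",
      read_default_file=os.path.expanduser("~/replica.my.cnf"),
  )
  with conn.cursor() as cur:
      cur.execute(QUERY)     # no query parameters, so '%' needs no escaping
      rows = cur.fetchall()  # rows are then matched against pubpeer_articles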

Reporting and wiki updates

  • Wiki Maintenance:
    • Automatically handles page moves and deletions via sync_wikipedia_titles.py before updating reports
  • Report Generation (report.py):
    • Alerts Report: Lists new citations (time_last_updated_table IS NULL) or existing citations with new comments (time_last_comment > time_last_updated_table); see the selection sketch after this list
    • Most Affected Report: Lists Wikipedia articles with the highest number of unique PubPeer-commented citations
    • Article List Reports: Large table and alphabetical subpages (/By article/A, etc.) listing all matched citations
    • Frequency Report: Aggregates by PubPeer article to show which research is most cited across Wikipedia
  • User Interactions (Dismissals): pending a working implementation
    • The bot reads the current wiki report and compares it to its previous version
    • If an editor removes a row from a wiki table, the bot marks that citation as dismissed = TRUE in the database and stops reporting it
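
A sketch of the alert selection described under "Alerts Report" above; the joins follow the schema summarized in the next section and are illustrative rather than a copy of report.py:

  # Illustrative version of the alerts selection: citations never reported
  # before, plus citations whose PubPeer thread has newer comments than the
  # last report that listed them; dismissed rows are skipped. Column names
  # follow the schema summary below, not necessarily report.py itself.
  ALERTS_QUERY = """
  SELECT w.language_code, w.mw_page_title, a.id_pubpeer, a.title,
         a.time_last_comment
  FROM citations AS c
  JOIN pubpeer_articles AS a ON a.id_pubpeer = c.id_pubpeer
  JOIN wikipedia AS w ON w.id = c.id_wiki_page
  WHERE c.dismissed = FALSE
    AND (c.time_last_updated_table IS NULL                   -- new citation
         OR a.time_last_comment > c.time_last_updated_table) -- new comments
  ORDER BY a.time_last_comment DESC
  """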

Database Schema (schema.sql)

  • wikipedia: id, language_code, mw_page_id, mw_page_title, mw_talk_page_id, librarybase_id
  • pubpeer_articles: id_pubpeer, id_doi, id_pubmed, id_arxiv, title, time_last_comment
  • citations:
    • Links articles to wiki pages
    • cited_id_type: 1 (DOI), 2 (PubMed), 3 (arXiv)
      • 0 is reserved for "other"
    • time_last_updated_table: Tracks when the wiki report last included this citation
    • time_last_talk_page_post: (Reserved for future talk page notifications)
    • dismissed: Boolean flag for editor-driven dismissals
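
An illustrative DDL sketch of the citations table based on the fields listed above; the column types, the composite primary key, and the key choices are my assumptions, and schema.sql in the repository is the authoritative definition:

  # Illustrative DDL for the citations table; types, keys, and constraints
  # are assumptions, and schema.sql is authoritative.
  CITATIONS_DDL = """
  CREATE TABLE IF NOT EXISTS citations (
      id_pubpeer               VARCHAR(255) NOT NULL,  -- references pubpeer_articles.id_pubpeer
      id_wiki_page             INT NOT NULL,           -- references wikipedia.id
      cited_id_type            TINYINT NOT NULL,       -- 0 other, 1 DOI, 2 PubMed, 3 arXiv
      time_last_updated_table  DATETIME NULL,          -- last time a wiki report listed it
      time_last_talk_page_post DATETIME NULL,          -- reserved for talk page notifications
      dismissed                BOOLEAN NOT NULL DEFAULT FALSE,
      PRIMARY KEY (id_pubpeer, id_wiki_page)
  );
  """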

Post-Implementation Status

  • Talk Page Notifications: The code includes fields for tracking them (time_last_talk_page_post), but the active workflow currently focuses on centralized reports (Wikipedia:PubPeer/*) rather than automated talk page posting
  • Frequency of Runs: Designed to be run periodically (e.g., via cron) using the --update flag for index_pubpeer.py
  • Dismissals Not Recognized: The logic to recognize user removals of report entries does not work yet and has been disabled.