Project:Analytics/PubPeer

API

https://dashboards.pubpeer.com/docs/api#/operations/partner

Relevant parameters:

  • page: start with 1, then iterate based on whether there are more results
  • per_page: set to the maximum value of 300
  • sort:
  • published_at: concerns when the document was published rather than when comments were posted; I only care about comments
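
A minimal sketch of the paginated pull (the requests library, the endpoint path, the auth header, and the date-filter parameter names are my assumptions here; only page and per_page come from the API docs above):

  # Sketch of a paginated pull from the PubPeer partner API.
  # ASSUMPTIONS: the endpoint path, the auth header, and the date-filter
  # parameter names are guesses; only `page` and `per_page` come from the
  # API docs linked above.
  import requests

  API_URL = "https://dashboards.pubpeer.com/api/partner"  # assumed path
  TOKEN = "..."                                           # partner API token

  def fetch_all(start_date: str, end_date: str):
      """Yield every record in the given date window, page by page."""
      page = 1
      while True:
          resp = requests.get(
              API_URL,
              headers={"Authorization": f"Bearer {TOKEN}"},
              params={
                  "page": page,
                  "per_page": 300,                # maximum allowed value
                  "updated_at_from": start_date,  # assumed parameter name
                  "updated_at_to": end_date,      # assumed parameter name
              },
              timeout=60,
          )
          resp.raise_for_status()
          records = resp.json().get("data", [])
          if not records:
              break                               # no more results
          yield from records
          page += 1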

Resources

  • Wikimedia Cloud Services
    • Toolforge: project "pubpeer"
      • I don't think it's currently being used for anything
    • Cloud VPS: project "wikicite"
      • VM wikicite-refsdb-proc-1.wikicite.eqiad1.wikimedia.cloud
      • Trove DB: ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud

Process

Data collection and indexing

PubPeer Data (index_pubpeer.py)
  • Initial Seed:
    • Starts from 2000-01-01
    • Uses an initial large date window (2000-01-01 to 2014-12-31) followed by 30-day increments
    • API query handles pagination (per_page: 300) and automatically reduces the time window if the result set hits the 10,000-record limit to ensure no data is missed (see the window-splitting sketch after this list)
  • Subsequent Builds:
    • Triggered via python index_pubpeer.py --update
    • Identifies the latest_comment_date from the local database and starts fetching from that date to the present
  • Database Updates:
    • Updates the pubpeer_articles table
    • Fields: id_pubpeer (URL), id_doi, id_pubmed, id_arxiv, title (truncated to 250 chars), and time_last_comment
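
Roughly, the window narrowing works like this (a sketch: fetch and upsert are hypothetical stand-ins for the real API pull and the pubpeer_articles write, and only the 10,000-record cap is taken from the notes above):

  # Sketch of the adaptive date-window logic in index_pubpeer.py.
  # `fetch` and `upsert` are hypothetical callables standing in for the
  # real API pull and the pubpeer_articles write.
  from datetime import date, timedelta
  from typing import Callable, Sequence

  RESULT_CAP = 10_000  # the API stops returning records past this point

  def index_window(start: date, end: date,
                   fetch: Callable[[date, date], Sequence[dict]],
                   upsert: Callable[[Sequence[dict]], None]) -> None:
      records = fetch(start, end)
      if len(records) >= RESULT_CAP and start < end:
          # Window too large to be fully retrievable: split it in half
          # and index each half separately so nothing is missed.
          mid = start + (end - start) // 2
          index_window(start, mid, fetch, upsert)
          index_window(mid + timedelta(days=1), end, fetch, upsert)
      else:
          upsert(records)
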
Wikipedia Citations (index_citations.py)
  • Process:
    • Pulls current external links from Wikimedia Cloud DB Replicas for DOI (org.doi.), PubMed (gov.nih.nlm.ncbi.pubmed.), and arXiv (org.arxiv.)
    • Restricted to Main (0) and Draft (118) namespaces
    • Matches these links against the local pubpeer_articles table (see the query sketch after this list)
  • Database Updates:
    • wikipedia table: Stores language_code, mw_page_id, mw_page_title, and mw_talk_page_id. Page titles are refreshed during each run
    • citations table: Maps id_pubpeer to id_wiki_page
    • Stale Data: Automatically removes citations from the local database if the link has been removed from Wikipedia
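
The replica query behind this step looks roughly like the sketch below (assuming the current externallinks layout with el_to_domain_index/el_to_path, an "https://" scheme prefix in the domain index, and the standard replica credentials file; enwiki is only an example host, since the wikipedia table also records a language_code):

  # Sketch of the DB-replica pull behind index_citations.py.
  # ASSUMPTIONS: the modern externallinks layout and the "https://" scheme
  # prefix in el_to_domain_index; enwiki is used as an example host.
  import os
  import pymysql

  QUERY = """
  SELECT p.page_id, p.page_title, el.el_to_domain_index, el.el_to_path
  FROM externallinks AS el
  JOIN page AS p ON p.page_id = el.el_from
  WHERE p.page_namespace IN (0, 118)              -- Main and Draft
    AND (el.el_to_domain_index LIKE 'https://org.doi.%'
         OR el.el_to_domain_index LIKE 'https://gov.nih.nlm.ncbi.pubmed.%'
         OR el.el_to_domain_index LIKE 'https://org.arxiv.%')
  """

  conn = pymysql.connect(
      host="enwiki.analytics.db.svc.wikimedia.cloud",  # example replica host
      database="enwiki_p",
      read_default_file=os.path.expanduser("~/replica.my.cnf"),
  )
  with conn.cursor() as cur:
      cur.execute(QUERY)     # no query parameters, so '%' needs no escaping
      rows = cur.fetchall()  # rows are then matched against pubpeer_articles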

Reporting and wiki updates

  • Wiki Maintenance:
    • Automatically handles page moves and deletions via sync_wikipedia_titles.py before updating reports
  • Report Generation (report.py):
    • Alerts Report: Lists new citations (time_last_updated_table IS NULL) or existing citations with new comments (time_last_comment > time_last_updated_table); see the selection sketch after this list
    • Most Affected Report: Lists Wikipedia articles with the highest number of unique PubPeer-commented citations
    • Article List Reports: Large table and alphabetical subpages (/By article/A, etc.) listing all matched citations
    • Frequency Report: Aggregates by PubPeer article to show which research is most cited across Wikipedia
  • User Interactions (Dismissals): pending a working implementation
    • The bot reads the current wiki report and compares it to its previous version
    • If an editor removes a row from a wiki table, the bot marks that citation as dismissed = TRUE in the database and stops reporting it
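
A sketch of the alert selection described under "Alerts Report" above; the joins follow the schema summarized in the next section and are illustrative rather than a copy of report.py:

  # Illustrative version of the alerts selection: citations never reported
  # before, plus citations whose PubPeer thread has newer comments than the
  # last report that listed them; dismissed rows are skipped. Column names
  # follow the schema summary below, not necessarily report.py itself.
  ALERTS_QUERY = """
  SELECT w.language_code, w.mw_page_title, a.id_pubpeer, a.title,
         a.time_last_comment
  FROM citations AS c
  JOIN pubpeer_articles AS a ON a.id_pubpeer = c.id_pubpeer
  JOIN wikipedia AS w ON w.id = c.id_wiki_page
  WHERE c.dismissed = FALSE
    AND (c.time_last_updated_table IS NULL                   -- new citation
         OR a.time_last_comment > c.time_last_updated_table) -- new comments
  ORDER BY a.time_last_comment DESC
  """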

Database Schema (schema.sql)

  • wikipedia: id, language_code, mw_page_id, mw_page_title, mw_talk_page_id, librarybase_id
  • pubpeer_articles: id_pubpeer, id_doi, id_pubmed, id_arxiv, title, time_last_comment
  • citations:
    • Links articles to wiki pages
    • cited_id_type: 1 (DOI), 2 (PubMed), 3 (arXiv)
      • 0 is reserved for "other"
    • time_last_updated_table: Tracks when the wiki report last included this citation
    • time_last_talk_page_post: (Reserved for future talk page notifications)
    • dismissed: Boolean flag for editor-driven dismissals
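
An illustrative DDL sketch of the citations table based on the fields listed above; the column types, the composite primary key, and the key choices are my assumptions, and schema.sql in the repository is the authoritative definition:

  # Illustrative DDL for the citations table; types, keys, and constraints
  # are assumptions, and schema.sql is authoritative.
  CITATIONS_DDL = """
  CREATE TABLE IF NOT EXISTS citations (
      id_pubpeer               VARCHAR(255) NOT NULL,  -- references pubpeer_articles.id_pubpeer
      id_wiki_page             INT NOT NULL,           -- references wikipedia.id
      cited_id_type            TINYINT NOT NULL,       -- 0 other, 1 DOI, 2 PubMed, 3 arXiv
      time_last_updated_table  DATETIME NULL,          -- last time a wiki report listed it
      time_last_talk_page_post DATETIME NULL,          -- reserved for talk page notifications
      dismissed                BOOLEAN NOT NULL DEFAULT FALSE,
      PRIMARY KEY (id_pubpeer, id_wiki_page)
  );
  """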

Post-Implementation Status

  • Talk Page Notifications: The code includes fields for tracking them (time_last_talk_page_post), but the active workflow currently focuses on centralized reports (Wikipedia:PubPeer/*) rather than automated talk page posting
  • Frequency of Runs: Designed to be run periodically (e.g., via cron) using the --update flag for index_pubpeer.py
  • Dismissals Not Recognized: The logic to recognize user removals of report entries does not work yet and has been disabled.