Project:Analytics/PubPeer

==API==
https://dashboards.pubpeer.com/docs/api#/operations/partner

Relevant parameters:
* <code>page</code>: start with 1, then iterate based on whether there are more results
* <code>per_page</code>: set to the maximum value, 300
* <code>sort</code>:
* <code>published_at</code>: concerns when the document was published; I only care about comments
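
Rough sketch of how the pagination could work. The endpoint path, auth header, date-filter parameter, and response shape below are assumptions; only <code>page</code>, <code>per_page</code>, <code>sort</code>, and <code>published_at</code> come from the docs.

<syntaxhighlight lang="python">
import requests

API_URL = "https://dashboards.pubpeer.com/api/partner"  # hypothetical path; see the docs link above
TOKEN = "..."  # partner credential, however PubPeer issues it

def fetch_articles(start_date, end_date):
    """Yield every result for the given date window, one API page at a time."""
    page = 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={
                "page": page,
                "per_page": 300,  # documented maximum
                # The date-filter parameter name is an assumption; the docs list
                # published_at, but this project cares about comment dates.
                "updated_at": f"{start_date}..{end_date}",
            },
            timeout=60,
        )
        resp.raise_for_status()
        items = resp.json().get("data", [])  # response shape is an assumption
        if not items:  # stop once a page comes back empty
            break
        yield from items
        page += 1

# Example: one month at a time, as planned for the initial seed
# for article in fetch_articles("2006-01-01", "2006-01-31"):
#     ...
</syntaxhighlight>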

==Resources==
* Wikimedia Cloud Services
** Toolforge: project "pubpeer"
*** Don't think it's being used for anything
** Cloud VPS: project "wikicite"
*** VM wikicite-refsdb-proc-1.wikicite.eqiad1.wikimedia.cloud
*** Trove DB: ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud


==Process==
* Initial seed:
** Build pageset
*** Pull usages of identifiers from Wikimedia Cloud DB Replicas
*** Create database table (wikipedia):
**** id (incrementing key)
**** language_code
**** mw_page_id
**** mw_page_title (probably should have a process to refresh this before the full process runs)
** API query: <code>2006-01-01..2025-12-31</code>
*** To prevent overloading the script, I'll probably end up going one month at a time, 2006-01-01..2006-01-31
** Iterate through as many pages as needed to get to the end
** Build internal database (pubpeer_articles table; see the schema sketch after this list):
*** id_pubpeer (key)
*** id_doi (update on conflict)
*** id_pubmed (update on conflict)
*** id_arxiv (update on conflict)
** Build minimal citations database (citations table):
*** id_pubpeer (key to pubpeer_articles table)
*** id_wiki_page (key to wikipedia table)
*** cited_id_type (integer)
**** 0 = unknown/other
**** 1 = doi
**** 2 = pubmed
**** 3 = arxiv
**** Why not use an enum? It's easier to add a new value Python-side than it is to carry out a database schema migration
*** cited_id_value (string)
*** time_last_updated_table (<code>null</code> when created)
*** time_last_talk_page_post (<code>null</code> when created)
*** time_most_recent_comment (on conflict, update if submitted > stored)
** Post initial report to wiki as a table
** Post initial notification that the report is posted
* Subsequent builds:
** Get most recent <code>time_most_recent_comment</code> from database
** API query: <code>that date...present day</code>
** Iterate through as many result pages as needed (probably only one page)
** Submit into database, which should transparently handle conflicts
** Build new wiki table based on the citations database table
*** Check database for:
**** null <code>time_last_updated_table</code>
**** <code>time_most_recent_comment</code> > <code>time_last_updated_table</code>
** Come up with alerts describing changes to the table
*** Retire old notifications to a subpage
* If/when talk page notifications are approved:
** Check database for:
*** null <code>time_last_talk_page_post</code>
*** <code>time_most_recent_comment</code> > <code>time_last_talk_page_post</code>
** Queue up talk pages to notify
** Check whether a message is already present on the talk page (see the Pywikibot sketch after this list)
*** No message comment: add a post to the talk page
*** Message comment present: skip the talk page
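
Rough sketch of the three tables and the conflict handling above. SQLite syntax is used here only to keep the example self-contained; the real data would live in the Trove/MySQL instance listed under Resources, so exact types, keys, and DDL are assumptions.

<syntaxhighlight lang="python">
import sqlite3

conn = sqlite3.connect("pubpeer.db")  # stand-in for the Trove database

conn.executescript("""
CREATE TABLE IF NOT EXISTS wikipedia (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    language_code TEXT NOT NULL,
    mw_page_id    INTEGER NOT NULL,
    mw_page_title TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS pubpeer_articles (
    id_pubpeer INTEGER PRIMARY KEY,
    id_doi     TEXT,
    id_pubmed  TEXT,
    id_arxiv   TEXT
);

CREATE TABLE IF NOT EXISTS citations (
    id_pubpeer               INTEGER REFERENCES pubpeer_articles(id_pubpeer),
    id_wiki_page             INTEGER REFERENCES wikipedia(id),
    cited_id_type            INTEGER NOT NULL,  -- 0 = unknown/other, 1 = doi, 2 = pubmed, 3 = arxiv
    cited_id_value           TEXT NOT NULL,
    time_last_updated_table  TEXT,              -- null when created
    time_last_talk_page_post TEXT,              -- null when created
    time_most_recent_comment TEXT,              -- ISO-8601 strings so < comparisons work
    PRIMARY KEY (id_pubpeer, id_wiki_page)      -- assumed composite key
);
""")

def upsert_article(row):
    """Insert a PubPeer article, refreshing the identifiers on conflict."""
    conn.execute(
        """
        INSERT INTO pubpeer_articles (id_pubpeer, id_doi, id_pubmed, id_arxiv)
        VALUES (:id_pubpeer, :id_doi, :id_pubmed, :id_arxiv)
        ON CONFLICT (id_pubpeer) DO UPDATE SET
            id_doi    = excluded.id_doi,
            id_pubmed = excluded.id_pubmed,
            id_arxiv  = excluded.id_arxiv
        """,
        row,
    )

def bump_most_recent_comment(id_pubpeer, id_wiki_page, submitted):
    """Apply the 'update if submitted > stored' rule for a citation row."""
    conn.execute(
        """
        UPDATE citations
        SET time_most_recent_comment = :submitted
        WHERE id_pubpeer = :id_pubpeer
          AND id_wiki_page = :id_wiki_page
          AND (time_most_recent_comment IS NULL
               OR time_most_recent_comment < :submitted)
        """,
        {"id_pubpeer": id_pubpeer, "id_wiki_page": id_wiki_page, "submitted": submitted},
    )

def last_seen_comment_time():
    """Starting point for subsequent builds: the newest comment time on record."""
    (value,) = conn.execute(
        "SELECT MAX(time_most_recent_comment) FROM citations"
    ).fetchone()
    return value
</syntaxhighlight>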

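If/when talk page notifications are approved, the duplicate-message check could look for a hidden marker comment in the talk page wikitext before posting. A sketch using Pywikibot; the marker string, edit summary, and message format are placeholders, not anything that has been agreed on.

<syntaxhighlight lang="python">
import pywikibot

MARKER = "<!-- pubpeer-notification -->"  # hypothetical marker identifying this bot's posts

def notify_talk_page(site, article_title, message):
    """Post a PubPeer notice on the article's talk page unless one is already there.

    Returns True if a post was made, False if the talk page was skipped.
    """
    talk = pywikibot.Page(site, article_title).toggleTalkPage()
    text = talk.text if talk.exists() else ""
    if MARKER in text:
        # Message comment already present: skip over this talk page
        return False
    # No message comment: append the post (plus the marker) to the talk page
    talk.text = text.rstrip() + "\n\n" + MARKER + "\n" + message
    talk.save(summary="Notify about PubPeer comments on a cited source")
    return True

# Example:
# site = pywikibot.Site("en", "wikipedia")
# notify_talk_page(site, "Some article", "== PubPeer comments ==\nPubPeer has comments on a source cited in this article. ~~~~")
</syntaxhighlight>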