Project:Analytics/PubPeer
API
https://dashboards.pubpeer.com/docs/api#/operations/partner
Relevant parameters:
- page: start with 1, then iterate based on whether there are more results
- per_page: set at maximum value, 300
- sort: published_at concerns when the document was published; I only care about comments
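As a sketch, the pagination loop might look like the following. The endpoint path, auth scheme, and response envelope key are assumptions to be checked against the API docs linked above.

```python
import requests

BASE_URL = "https://dashboards.pubpeer.com/api/partner"  # assumed path
API_KEY = "..."  # however partner credentials are issued

def fetch_all(params):
    """Yield every result, starting at page 1 and stopping on an empty page."""
    page = 1
    while True:
        resp = requests.get(
            BASE_URL,
            params={**params, "page": page, "per_page": 300},
            headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("data", [])  # assumed envelope key
        if not results:
            break
        yield from results
        page += 1
```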
Resources
- Wikimedia Cloud Services
- Toolforge: project "pubpeer"
- Don't think it's being used for anything
- Cloud VPS: project "wikicite"
- VM wikicite-refsdb-proc-1.wikicite.eqiad1.wikimedia.cloud
- Trove DB: ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud
Process
- Initial seed:
- Build pageset
- Pull usages of identifiers from Wikimedia Cloud DB Replicas
- Create database table (wikipedia):
- id (incrementing key)
- language_code
- mw_page_id
- mw_page_title (probably should have a process to refresh this before the full process runs)
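A possible DDL for this table, assuming a MariaDB-flavored Trove instance; the column types and the unique key are my additions, not settled decisions.

```python
# Sketch of the wikipedia table; types and the unique key are assumptions.
CREATE_WIKIPEDIA = """
CREATE TABLE IF NOT EXISTS wikipedia (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    language_code VARCHAR(16)  NOT NULL,
    mw_page_id    INT UNSIGNED NOT NULL,
    mw_page_title VARCHAR(255) NOT NULL,  -- refresh before each full run
    UNIQUE KEY uq_page (language_code, mw_page_id)
);
"""
```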
- API query: 2006-01-01..2025-12-31
- To prevent overloading the script, I'll probably end up going one month at a time (2006-01-01..2006-01-31, and so on; see the sketch after this list)
- Iterate through as many pages as needed to get to the end
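A month-windowing helper for this could look like the sketch below; the name of the date-range parameter is a placeholder, since the real parameter name comes from the API docs.

```python
from datetime import date
import calendar

def month_windows(start=date(2006, 1, 1), end=date(2025, 12, 31)):
    """Yield (first_day, last_day) pairs, one calendar month at a time."""
    y, m = start.year, start.month
    while date(y, m, 1) <= end:
        last_day = date(y, m, calendar.monthrange(y, m)[1])
        yield date(y, m, 1), min(last_day, end)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

# for lo, hi in month_windows():
#     for item in fetch_all({"date_range": f"{lo}..{hi}"}):  # parameter name assumed
#         ...
```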
- Build internal database (pubpeer_articles table):
- id_pubpeer (key)
- id_doi (update on conflict)
- id_pubmed (update on conflict)
- id_arxiv (update on conflict)
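The "update on conflict" behavior maps naturally onto an upsert. A MariaDB-style sketch (syntax assumed); COALESCE is a design choice here so an incoming row without an identifier doesn't clobber a stored one.

```python
UPSERT_ARTICLE = """
INSERT INTO pubpeer_articles (id_pubpeer, id_doi, id_pubmed, id_arxiv)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    id_doi    = COALESCE(VALUES(id_doi),    id_doi),
    id_pubmed = COALESCE(VALUES(id_pubmed), id_pubmed),
    id_arxiv  = COALESCE(VALUES(id_arxiv),  id_arxiv);
"""
```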
- Build minimal citations database (citations table):
- id_pubpeer (key to pubpeer_articles table)
- id_wiki_page (key to wikipedia table)
- cited_id_type (integer)
- 0 = unknown/other
- 1 = doi
- 2 = pubmed
- 3 = arxiv
- Why not use an enum? It's easier to add a new value Python-side than it is to carry out a database schema migration
- cited_id_value (string)
- time_last_updated_table (null when created)
- time_last_talk_page_post (null when created)
- time_most_recent_comment (on conflict, update if submitted > stored)
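Under the same MariaDB assumptions, the citations table and its comment-timestamp upsert might look like this. The composite primary key is my guess at what "conflict" means here; GREATEST plus COALESCE implements "update if submitted > stored" while handling the initial null.

```python
CREATE_CITATIONS = """
CREATE TABLE IF NOT EXISTS citations (
    id_pubpeer               INT NOT NULL,      -- references pubpeer_articles
    id_wiki_page             INT NOT NULL,      -- references wikipedia
    cited_id_type            TINYINT NOT NULL,  -- 0=unknown/other, 1=doi, 2=pubmed, 3=arxiv
    cited_id_value           VARCHAR(255) NOT NULL,
    time_last_updated_table  DATETIME NULL,     -- null when created
    time_last_talk_page_post DATETIME NULL,     -- null when created
    time_most_recent_comment DATETIME NULL,
    PRIMARY KEY (id_pubpeer, id_wiki_page)      -- assumed conflict target
);
"""

UPSERT_CITATION = """
INSERT INTO citations (id_pubpeer, id_wiki_page, cited_id_type,
                       cited_id_value, time_most_recent_comment)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    time_most_recent_comment = GREATEST(
        COALESCE(time_most_recent_comment, VALUES(time_most_recent_comment)),
        VALUES(time_most_recent_comment));
"""
```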
- Post initial report to wiki as a table
- Post initial notification that the report is posted
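For the report itself, rows from the citations table could be rendered as wikitext; the column set below is illustrative, not a final layout.

```python
def build_wikitable(rows):
    """Render report rows as a sortable wikitable.

    rows: iterable of (page_title, id_type, id_value, latest_comment) tuples;
    the columns are illustrative only.
    """
    lines = [
        '{| class="wikitable sortable"',
        "! Article !! ID type !! Identifier !! Most recent PubPeer comment",
    ]
    for title, id_type, id_value, latest in rows:
        lines.append("|-")
        lines.append(f"| [[{title}]] || {id_type} || {id_value} || {latest}")
    lines.append("|}")
    return "\n".join(lines)
```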
- Subsequent builds:
- Get most recent time_most_recent_comment from database
- API query: that date..present day
- Iterate through as many result pages as needed (probably only one page)
- Submit into database, which should transparently handle conflicts
- Build new wiki table based on citations database table
- Check database for
- null time_last_updated_table
- time_most_recent_comment > time_last_updated_table
- Come up with alerts describing changes to the table
- Retire old notifications to a subpage
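Tying the subsequent-build steps together, under the same schema assumptions and reusing fetch_all from the earlier sketch:

```python
from datetime import date

# Rows whose wiki table entry is stale: never rendered, or a newer
# comment has arrived since the last render.
SELECT_STALE = """
SELECT id_pubpeer, id_wiki_page
FROM citations
WHERE time_last_updated_table IS NULL
   OR time_most_recent_comment > time_last_updated_table;
"""

def incremental_build(conn):
    """One subsequent-build pass; conn is a DB-API connection (assumed)."""
    with conn.cursor() as cur:
        cur.execute("SELECT MAX(time_most_recent_comment) FROM citations;")
        (since,) = cur.fetchone()
    window = f"{since.date()}..{date.today()}"
    for item in fetch_all({"date_range": window}):  # parameter name assumed
        upsert_citation(conn, item)  # hypothetical helper around UPSERT_CITATION
    with conn.cursor() as cur:
        cur.execute(SELECT_STALE)
        return cur.fetchall()  # feed these into the table rebuild and alerts
```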
- If/when talk page notifications are approved:
- Check database for
- null time_last_talk_page_post
- time_most_recent_comment > time_last_talk_page_post
- Queue up talk pages to notify
- Check for the presence of the message already on the talk page
- No message comment: add post to talk page
- Message comment present: skip over talk page
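If approved, the already-posted check could key off a hidden comment marker in the talk page wikitext. A sketch using Pywikibot; the marker string and edit summary are hypothetical.

```python
import pywikibot

MARKER = "<!-- pubpeer-notice -->"  # hypothetical marker comment

def notify_talk_page(site, page_title, notice_wikitext):
    """Post the notice unless the marker comment is already on the talk page."""
    talk = pywikibot.Page(site, page_title).toggleTalkPage()
    existing = talk.text if talk.exists() else ""
    if MARKER in existing:
        return  # message comment present: skip this talk page
    talk.text = f"{existing}\n\n{MARKER}\n{notice_wikitext}".strip()
    talk.save(summary="Notify about PubPeer comments on a cited source")
```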