Project:Analytics/PubPeer

==API==
https://dashboards.pubpeer.com/docs/api#/operations/partner

Relevant parameters:
* <code>page</code>: start with <code>1</code>, then iterate based on whether there are more results
* <code>per_page</code>: set at the maximum value, <code>300</code>
* <code>sort</code>: <code>published_at</code> concerns when the document was published; I only care about comments
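A minimal request-loop sketch for these parameters. Only <code>page</code>, <code>per_page</code>, and <code>sort</code> come from the docs above; the endpoint path, the sort value for comment dates, the date-range filter name, and the response envelope are placeholder assumptions to be checked against the API documentation.

<syntaxhighlight lang="python">
import requests

# Placeholder endpoint: the real path comes from the partner API docs linked
# above, not from this sketch.
ENDPOINT = "https://dashboards.pubpeer.com/PLACEHOLDER"
STATIC_PARAMS = {
    "per_page": 300,         # documented maximum
    "sort": "comment_date",  # assumption: some comment-based sort, not published_at
}

def fetch_all(date_range):
    """Yield every result for one date range, walking page=1, 2, ... until empty."""
    page = 1
    while True:
        resp = requests.get(
            ENDPOINT,
            params={**STATIC_PARAMS, "page": page, "range": date_range},  # "range" is a guess
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json().get("data", [])  # assumption about the response envelope
        if not items:
            break
        yield from items
        page += 1
</syntaxhighlight>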
==Resources==
* Wikimedia Cloud Services
** Toolforge: project "pubpeer"
*** Don't think it's being used for anything
** Cloud VPS: project "wikicite"
*** VM wikicite-refsdb-proc-1.wikicite.eqiad1.wikimedia.cloud
*** Trove DB: ouqdvgrbzf3.svc.trove.eqiad1.wikimedia.cloud
==Process==
* Initial seed:
** Build pageset
*** Pull usages of identifiers from Wikimedia Cloud DB Replicas (see the replica query sketch after this list)
*** Create database table (wikipedia):
**** id (incrementing key)
**** language_code
**** mw_page_id
**** mw_page_title (probably should have a process to refresh this before the full process runs)
** API query: <code>2006-01-01..2025-12-31</code>
*** To prevent overloading the script, I'll probably end up going one month at a time, e.g. <code>2006-01-01..2006-01-31</code> (see the month-window sketch after this list)
** Iterate through as many pages as needed to get to the end
** Build internal database (pubpeer_articles table; schema sketch after this list):
*** id_pubpeer (key)
*** id_doi (update on conflict)
*** id_pubmed (update on conflict)
*** id_arxiv (update on conflict)
** Build minimal citations database (citations table):
*** id_pubpeer (key to pubpeer_articles table)
*** id_wiki_page (key to wikipedia table)
*** cited_id_type (integer)
**** 0 = unknown/other
**** 1 = doi
**** 2 = pubmed
**** 3 = arxiv
**** Why not use an enum? It's easier to add a new value Python-side than it is to carry out a database schema migration.
*** cited_id_value (string)
*** time_last_updated_table (<code>null</code> when created)
*** time_last_talk_page_post (<code>null</code> when created)
*** time_most_recent_comment (on conflict, update if submitted > stored; see the upsert sketch after this list)
** Post initial report to wiki as a table
** Post initial notification that the report is posted
* Subsequent builds:
** Get most recent <code>time_most_recent_comment</code> from database
** API query: <code>that date...present day</code>
** Iterate through as many result pages as needed (probably only one page)
** Submit into database, which should transparently handle conflicts
** Build new wiki table based on citations database table (see the staleness query after this list)
*** Check database for
**** null time_last_updated_table
**** time_most_recent_comment > time_last_updated_table
** Come up with alerts describing changes to the table
*** Retire old notifications to a subpage
* If/when talk page notifications are approved:
** Check database for
*** null time_last_talk_page_post
*** time_most_recent_comment > time_last_talk_page_post
** Queue up talk pages to notify
** Check whether a message is already present on the talk page (see the marker-check sketch after this list)
*** No message comment: add post to talk page
*** Message comment present: skip over talk page
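The sketches below expand on individual steps in the list above. First, pulling identifier usages from the Wiki Replicas: a minimal sketch assuming pymysql, the standard <code>replica.my.cnf</code> credentials available on Toolforge/Cloud VPS, and the current <code>externallinks</code> schema (<code>el_to_domain_index</code>/<code>el_to_path</code>); verify the schema against the live replica before relying on it.

<syntaxhighlight lang="python">
import pymysql

def doi_link_usages(dbname="enwiki"):
    """Yield (page_id, page_title, el_to_path) for articles linking to doi.org."""
    conn = pymysql.connect(
        host=f"{dbname}.analytics.db.svc.wikimedia.cloud",
        database=f"{dbname}_p",
        read_default_file="~/replica.my.cnf",  # standard Cloud Services credentials
        charset="utf8mb4",
    )
    try:
        with conn.cursor() as cur:
            # el_to_domain_index stores the reversed domain, e.g. "https://org.doi."
            cur.execute("""
                SELECT p.page_id, p.page_title, el.el_to_path
                FROM externallinks el
                JOIN page p ON p.page_id = el.el_from AND p.page_namespace = 0
                WHERE el.el_to_domain_index LIKE 'https://org.doi.%'
            """)
            yield from cur
    finally:
        conn.close()
</syntaxhighlight>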
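The month-at-a-time windowing for the initial seed is plain date arithmetic; a sketch:

<syntaxhighlight lang="python">
import calendar
from datetime import date

def month_windows(start=date(2006, 1, 1), end=date(2025, 12, 31)):
    """Yield "YYYY-MM-01..YYYY-MM-<last>" ranges, one calendar month each."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        last_day = calendar.monthrange(y, m)[1]  # number of days in this month
        yield f"{y:04d}-{m:02d}-01..{y:04d}-{m:02d}-{last_day:02d}"
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
</syntaxhighlight>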
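A schema sketch for the three tables, assuming a MariaDB-style Trove instance; column types and sizes are guesses, and the keys follow the notes in the list. The <code>IntEnum</code> is the Python-side identifier-type mapping that avoids a database enum: adding a value is a one-line code change, not a schema migration.

<syntaxhighlight lang="python">
from enum import IntEnum

class CitedIdType(IntEnum):
    """Integer codes for citations.cited_id_type; extend here, not in the schema."""
    UNKNOWN = 0
    DOI = 1
    PUBMED = 2
    ARXIV = 3

SCHEMA = """
CREATE TABLE wikipedia (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    language_code VARCHAR(16)  NOT NULL,
    mw_page_id    INT UNSIGNED NOT NULL,
    mw_page_title VARCHAR(255) NOT NULL,  -- refresh before each full run
    UNIQUE KEY (language_code, mw_page_id)
);

CREATE TABLE pubpeer_articles (
    id_pubpeer INT PRIMARY KEY,
    id_doi     VARCHAR(255),  -- update on conflict
    id_pubmed  VARCHAR(64),   -- update on conflict
    id_arxiv   VARCHAR(64)    -- update on conflict
);

CREATE TABLE citations (
    id_pubpeer     INT NOT NULL,      -- key to pubpeer_articles
    id_wiki_page   INT NOT NULL,      -- key to wikipedia
    cited_id_type  TINYINT NOT NULL,  -- CitedIdType value
    cited_id_value VARCHAR(255) NOT NULL,
    time_last_updated_table  DATETIME NULL,  -- null when created
    time_last_talk_page_post DATETIME NULL,  -- null when created
    time_most_recent_comment DATETIME NULL,
    PRIMARY KEY (id_pubpeer, id_wiki_page)
);
"""
</syntaxhighlight>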
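The "update on conflict" behavior maps onto <code>INSERT ... ON DUPLICATE KEY UPDATE</code> in MariaDB. A sketch of the two statements, including the "only if submitted > stored" rule for <code>time_most_recent_comment</code>; placeholder style assumes pymysql:

<syntaxhighlight lang="python">
UPSERT_ARTICLE = """
INSERT INTO pubpeer_articles (id_pubpeer, id_doi, id_pubmed, id_arxiv)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    id_doi    = VALUES(id_doi),
    id_pubmed = VALUES(id_pubmed),
    id_arxiv  = VALUES(id_arxiv)
"""

UPSERT_CITATION = """
INSERT INTO citations
    (id_pubpeer, id_wiki_page, cited_id_type, cited_id_value, time_most_recent_comment)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    -- keep the stored timestamp unless the submitted one is newer;
    -- COALESCE handles the null-when-created case
    time_most_recent_comment = GREATEST(
        COALESCE(time_most_recent_comment, VALUES(time_most_recent_comment)),
        VALUES(time_most_recent_comment)
    )
"""
</syntaxhighlight>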
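Both the wiki table rebuild and the (future) talk page queue use the same staleness test: never processed, or newer comments since processing. A sketch for the table case; the talk page variant swaps in <code>time_last_talk_page_post</code>:

<syntaxhighlight lang="python">
STALE_FOR_TABLE = """
SELECT id_pubpeer, id_wiki_page, cited_id_type, cited_id_value
FROM citations
WHERE time_last_updated_table IS NULL
   OR time_most_recent_comment > time_last_updated_table
"""
</syntaxhighlight>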
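Finally, a sketch of the duplicate-post guard for talk pages, assuming pywikibot and a hidden HTML comment as the marker; the marker string and message text are illustrative only:

<syntaxhighlight lang="python">
import pywikibot

MARKER = "<!-- pubpeer-bot-notice -->"  # hypothetical marker comment

def notify(site, article_title, message):
    """Post at most once per talk page: skip if the marker is already present."""
    talk = pywikibot.Page(site, article_title).toggleTalkPage()
    if MARKER in talk.text:
        return  # message comment present: skip this talk page
    talk.text = talk.text + "\n\n" + MARKER + "\n" + message
    talk.save(summary="Notify about new PubPeer comments")
</syntaxhighlight>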