Project:Analytics/PubPeer
API
https://dashboards.pubpeer.com/docs/api#/operations/partner
Relevant parameters:
- page: start with 1, then iterate based on whether there are more results
- per_page: set at the maximum value, 300
- sort: the published_at option concerns when the document was published; I only care about comments
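A minimal pagination sketch in Python. The endpoint path, the bearer-token auth scheme, the name of the date-range parameter, and the response fields (data, meta.current_page, meta.last_page) are all assumptions, not confirmed details from the API docs linked above; only page, per_page, and the range syntax come from the notes on this page.

<syntaxhighlight lang="python">
import requests

API_URL = "https://dashboards.pubpeer.com/api/partner"  # assumed endpoint path
TOKEN = "..."  # partner API token; bearer auth is an assumption

def fetch_results(date_range):
    """Yield every result for one date range, page by page."""
    page = 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={
                "page": page,        # start with 1, then iterate
                "per_page": 300,     # maximum value
                "date": date_range,  # hypothetical name for the range filter,
                                     # e.g. "2006-01-01..2006-01-31"
            },
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()
        yield from body["data"]
        # Iterate based on whether there are more results.
        if body["meta"]["current_page"] >= body["meta"]["last_page"]:
            break
        page += 1
</syntaxhighlight>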
Resources
- Wikimedia Cloud Services
- Toolforge: project "pubpeer"
- Cloud VPS: project "wikicite", Trove DB instance
Process
- Initial seed:
  - Build pageset
    - Pull usages of identifiers from Wikimedia Cloud DB Replicas
    - Create database table (wikipedia; see the schema sketch at the end of this page):
      - id (incrementing key)
      - language_code
      - mw_page_id
      - mw_page_title (probably should have a process to refresh this before the full process runs)
  - API query: 2006-01-01..2025-12-31
    - To prevent overloading the script, I'll probably end up going one month at a time, 2006-01-01..2006-01-31 (see the month-window sketch at the end of this page)
  - Iterate through as many pages as needed to get to the end
  - Build internal database (pubpeer_articles table):
    - id_pubpeer (key)
    - id_doi (update on conflict)
    - id_pubmed (update on conflict)
    - id_arxiv (update on conflict)
  - Build minimal citations database (citations table):
    - id_pubpeer (key to pubpeer_articles table)
    - id_wiki_page (key to wikipedia table)
    - cited_id_type (integer)
      - 0 = unknown/other
      - 1 = doi
      - 2 = pubmed
      - 3 = arxiv
      - Why not use an enum? It's easier to add a new value Python-side than it is to carry out a database schema migration
    - cited_id_value (string)
    - time_last_updated_table (null when created)
    - time_last_talk_page_post (null when created)
    - time_most_recent_comment (on conflict, update if submitted > stored; see the upsert sketch at the end of this page)
  - Post initial report to wiki as a table (see the report-builder sketch at the end of this page)
  - Post initial notification that the report is posted
- Subsequent builds:
  - Get most recent time_most_recent_comment from database
  - API query: that date...present day
  - Iterate through as many result pages as needed (probably only one page)
  - Submit into database, which should transparently handle conflicts
  - Build new wiki table based on citations database table
  - Check database for:
    - null time_last_updated_table
    - time_most_recent_comment > time_last_updated_table
  - Come up with alerts describing changes to the table
  - Retire old notifications to a subpage
- If/when talk page notifications are approved:
  - Check database for:
    - null time_last_talk_page_post
    - time_most_recent_comment > time_last_talk_page_post
  - Queue up talk pages to notify
    - Check for the presence of a message already on the talk page (see the marker-check sketch at the end of this page):
      - No message comment: add post to talk page
      - Presence of message comment: skip over talk page
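Sketches

The sketches that follow illustrate individual steps from the process above. First, the month-at-a-time windowing: a small generator that splits the full 2006-01-01..2025-12-31 span into per-month ranges in the range syntax used above. This is plain Python with no assumptions beyond that syntax.

<syntaxhighlight lang="python">
from datetime import date, timedelta

def month_windows(start=date(2006, 1, 1), stop=date(2025, 12, 31)):
    """Yield 'YYYY-MM-DD..YYYY-MM-DD' ranges, one calendar month at a time."""
    current = start.replace(day=1)
    while current <= stop:
        # First day of the following month.
        if current.month == 12:
            following = current.replace(year=current.year + 1, month=1)
        else:
            following = current.replace(month=current.month + 1)
        last_day = min(following - timedelta(days=1), stop)
        yield f"{current.isoformat()}..{last_day.isoformat()}"
        current = following
</syntaxhighlight>

Each yielded range (the first is 2006-01-01..2006-01-31) plugs straight into the date filter of the fetch sketch in the API section.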
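Next, a schema sketch for the three tables, using SQLite through Python for brevity; the actual Trove instance is likely MySQL/MariaDB, so types and upsert syntax would differ. Column names follow the lists above; the unique constraint on wikipedia and the composite primary key on citations are my assumptions.

<syntaxhighlight lang="python">
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS wikipedia (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,  -- incrementing key
    language_code TEXT NOT NULL,
    mw_page_id    INTEGER NOT NULL,
    mw_page_title TEXT NOT NULL,  -- refresh before the full process runs
    UNIQUE (language_code, mw_page_id)  -- assumption: one row per wiki page
);

CREATE TABLE IF NOT EXISTS pubpeer_articles (
    id_pubpeer INTEGER PRIMARY KEY,  -- key
    id_doi     TEXT,  -- update on conflict
    id_pubmed  TEXT,  -- update on conflict
    id_arxiv   TEXT   -- update on conflict
);

CREATE TABLE IF NOT EXISTS citations (
    id_pubpeer               INTEGER REFERENCES pubpeer_articles (id_pubpeer),
    id_wiki_page             INTEGER REFERENCES wikipedia (id),
    cited_id_type            INTEGER NOT NULL,  -- 0/1/2/3, see the list above
    cited_id_value           TEXT NOT NULL,
    time_last_updated_table  TEXT,  -- null when created
    time_last_talk_page_post TEXT,  -- null when created
    time_most_recent_comment TEXT,
    PRIMARY KEY (id_pubpeer, id_wiki_page)  -- assumption: one row per pairing
);
"""

with sqlite3.connect("pubpeer.db") as conn:
    conn.executescript(DDL)
</syntaxhighlight>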
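Conflict handling for citations, assuming the SQLite schema above and ISO-8601 timestamp strings (which compare correctly as text). The integer codes live Python-side, matching the rationale above: adding a value is just a new constant, not a schema migration.

<syntaxhighlight lang="python">
import sqlite3

# cited_id_type codes, kept in Python rather than a database enum.
CITED_ID_UNKNOWN = 0
CITED_ID_DOI = 1
CITED_ID_PUBMED = 2
CITED_ID_ARXIV = 3

UPSERT = """
INSERT INTO citations
    (id_pubpeer, id_wiki_page, cited_id_type, cited_id_value,
     time_most_recent_comment)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (id_pubpeer, id_wiki_page) DO UPDATE SET
    time_most_recent_comment = excluded.time_most_recent_comment
WHERE citations.time_most_recent_comment IS NULL
   OR excluded.time_most_recent_comment > citations.time_most_recent_comment;
"""

def record_citation(conn, id_pubpeer, id_wiki_page, id_type, id_value, commented_at):
    """Insert a citation row; on conflict, update only if submitted > stored."""
    conn.execute(UPSERT, (id_pubpeer, id_wiki_page, id_type, id_value, commented_at))
</syntaxhighlight>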
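For posting the report, a sketch of turning citations rows into a wikitext table. The column choice is a placeholder, and actually posting the page would go through the MediaWiki API (e.g. via a bot framework), which isn't shown here.

<syntaxhighlight lang="python">
TYPE_NAMES = {0: "other", 1: "doi", 2: "pubmed", 3: "arxiv"}

def build_wikitable(rows):
    """Render (page_title, id_type, id_value, most_recent_comment) rows
    as a sortable wikitext table."""
    lines = [
        '{| class="wikitable sortable"',
        "! Article !! Identifier type !! Identifier !! Most recent PubPeer comment",
    ]
    for title, id_type, id_value, commented_at in rows:
        lines.append("|-")
        lines.append(
            f"| [[{title}]] || {TYPE_NAMES.get(id_type, 'other')} "
            f"|| {id_value} || {commented_at}"
        )
    lines.append("|}")
    return "\n".join(lines)
</syntaxhighlight>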
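Finally, the talk-page check: a sketch of skipping pages that already carry the notification, assuming the bot leaves an HTML comment marker in each post it makes. The marker string and the notice text are hypothetical placeholders.

<syntaxhighlight lang="python">
MARKER = "<!-- pubpeer-bot-notice -->"  # hypothetical marker left in every post

def maybe_notify(talk_text, notice):
    """Return updated talk-page wikitext, or None to skip this page."""
    if MARKER in talk_text:
        # Presence of message comment: skip over talk page.
        return None
    # No message comment: add post (with marker) to talk page.
    return talk_text.rstrip() + "\n\n" + MARKER + "\n" + notice + "\n"
</syntaxhighlight>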