Property Detection
Repository Property Detection
This page documents how repository properties are inferred automatically from each repository's README.md.
Scope
The current README-based heuristics detect:
- DOI publication links
- Posit Connect Cloud (PCC) publication links
Both are discovered without modifying repository content.
README Badge Pattern
The parser searches badge links in standard markdown form:
[![ALT_TEXT](IMAGE_URL)](TARGET_URL)
All matching badge links are scanned, then filtered by the rules below.
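The scan step can be sketched as follows. This is an illustrative Python sketch, not the project's parser: it assumes a badge is an image wrapped in a link, [![ALT](IMAGE_URL)](TARGET_URL), and the names BADGE_RE and find_badges are hypothetical.

```python
import re

# Assumed badge form: [![alt](IMAGE_URL "optional title")](TARGET_URL "optional title")
BADGE_RE = re.compile(
    r'\[!\[[^\]]*\]\(\s*(?P<image>[^)\s]+)[^)]*\)\]\(\s*(?P<target>[^)\s]+)[^)]*\)'
)


def find_badges(readme_text):
    """Return (image_url, target_url) pairs for every badge link in the text."""
    return [
        (m.group("image"), m.group("target"))
        for m in BADGE_RE.finditer(readme_text)
    ]
```

Plain links and standalone images are ignored by this pattern; only the nested image-in-link form is returned, ready to be filtered by the DOI and PCC rules.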
DOI Heuristic
A DOI is detected only when a Zenodo badge and DOI target are both present:
- IMAGE_URL contains zenodo.org/badge
- TARGET_URL contains doi.org/
The DOI is then normalized to canonical format:
- remove leading https://doi.org/
- remove trailing .svg if present
- remove trailing punctuation
- validate pattern 10.<prefix>/<suffix>
The website stores this as doi and doi_url.
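A minimal Python sketch of the DOI rule and its normalization steps, for illustration only. Several details are assumptions rather than documented behavior: the exact regular expression for 10.<prefix>/<suffix>, the set of trailing punctuation characters, accepting http:// and dx.doi.org prefixes, and building doi_url as https://doi.org/ plus the DOI. The function names are hypothetical.

```python
import re

# Assumed reading of "10.<prefix>/<suffix>": numeric prefix, non-empty suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(target_url):
    """Apply the normalization steps in order; return None if not DOI-shaped."""
    doi = target_url.strip()
    # Remove the leading resolver prefix (http and dx. variants are an assumption).
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
    # Remove a trailing .svg if present.
    if doi.lower().endswith(".svg"):
        doi = doi[:-4]
    # Remove trailing punctuation (character set is an assumption).
    doi = doi.rstrip(".,;:)")
    return doi if DOI_RE.match(doi) else None


def detect_doi(badges):
    """badges: iterable of (image_url, target_url). Return the first valid DOI record."""
    for image_url, target_url in badges:
        if "zenodo.org/badge" in image_url and "doi.org/" in target_url:
            doi = normalize_doi(target_url)
            if doi:
                return {"doi": doi, "doi_url": "https://doi.org/" + doi}
    return None
```

Requiring both the Zenodo image and the doi.org target keeps unrelated DOI links elsewhere in a README from being picked up.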
PCC Heuristic
A Posit Connect Cloud URL is detected when a shield badge indicates Connect Cloud app publishing:
- IMAGE_URL contains img.shields.io/badge/
- IMAGE_URL contains both Connect Cloud and Open App
- TARGET_URL starts with http:// or https://
The TARGET_URL is stored as the repository PCC URL.
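The three PCC conditions can be sketched in Python as below, for illustration only. One assumption is labeled in the code: badge text in a shields.io URL is usually encoded (spaces as %20 or _), so the sketch decodes the image URL before looking for the two phrases. Whether the project matches on the decoded or the raw URL is not stated here, and detect_pcc_url is a hypothetical name.

```python
from urllib.parse import unquote


def detect_pcc_url(badges):
    """badges: iterable of (image_url, target_url). Return the first PCC URL or None."""
    for image_url, target_url in badges:
        if "img.shields.io/badge/" not in image_url:
            continue
        # Assumption: decode %20 and treat "_" as a space, as shields.io does.
        label = unquote(image_url).replace("_", " ")
        if "Connect Cloud" not in label or "Open App" not in label:
            continue
        if target_url.startswith(("http://", "https://")):
            return target_url
    return None
```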
Caching
Detection results are cached to reduce API requests:
- DOI cache: input/cache/repo_doi_map.csv
- PCC cache: input/cache/repo_pcc_map.csv
TTL:
- positive hits: 24 hours
- negative hits: 1 hour
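The split TTL can be expressed as a small freshness check. This Python sketch is illustrative: it assumes each cache row records when it was last checked (a hypothetical checked_at timestamp) and that an empty value marks a negative hit.

```python
from datetime import datetime, timedelta, timezone

POSITIVE_TTL = timedelta(hours=24)  # a DOI or PCC URL was found
NEGATIVE_TTL = timedelta(hours=1)   # nothing was found


def is_fresh(value, checked_at, now=None):
    """Return True if a cached result can be reused without a new API request."""
    now = now or datetime.now(timezone.utc)
    ttl = POSITIVE_TTL if value else NEGATIVE_TTL
    return now - checked_at < ttl
```

The shorter TTL for misses means a newly added badge is rechecked within about an hour, while confirmed hits are re-queried only once a day.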
Data Flow
Detected properties are used in:
- landing-page metrics (Have DOIs, Published on PCC)
- repository table publication icons and scope filters
Detection Source Fields
Caches include source columns for traceability:
- DOI cache: doi_source
- PCC cache: pcc_source
Current values:
- DOI: readme_zenodo_badge or none
- PCC: readme_connect_badge or none
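As an illustration of how the source columns relate to detection results, the Python sketch below builds cache rows from a detection outcome. Only doi_source, pcc_source, and their values come from this page; the repo, doi, and pcc_url column names and the helper functions are hypothetical.

```python
def source_for(value, hit_label):
    """Return the source label for a hit, or "none" for a miss."""
    return hit_label if value else "none"


def doi_cache_row(repo, doi):
    # "repo" and "doi" column names are assumptions for illustration.
    return {"repo": repo, "doi": doi or "",
            "doi_source": source_for(doi, "readme_zenodo_badge")}


def pcc_cache_row(repo, pcc_url):
    # "repo" and "pcc_url" column names are assumptions for illustration.
    return {"repo": repo, "pcc_url": pcc_url or "",
            "pcc_source": source_for(pcc_url, "readme_connect_badge")}
```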
Data Freshness
The website data is refreshed by scheduled GitHub Actions deploys every 6 hours. The README detection caches use a shorter TTL for misses (1 hour) and a longer TTL for hits (24 hours).
Practical effect:
- newly added repositories usually appear after the next scheduled deploy,
- new DOI/PCC badges are picked up automatically once cache refresh conditions are met.
Current Limits
- Heuristics rely on badge conventions in README.md.
- Non-standard badge formats may not be detected.
- If GitHub API access fails temporarily, detection may lag until cache refresh.
Validation
Heuristic parsing is checked with a lightweight script:
scripts/validate_property_detection.R