Commit Graph

25 Commits

Author SHA1 Message Date
Luke Mino-Altherr
8ff4d38ad1 refactor(assets): merge AssetInfo and AssetCacheState into AssetReference
This change solves the basename collision bug by using UNIQUE(file_path) on the
unified asset_references table. Key changes:

Database:
- Migration 0005 merges asset_cache_states and asset_infos into asset_references
- AssetReference now contains: cache state fields (file_path, mtime_ns, needs_verify,
  is_missing, enrichment_level) plus info fields (name, owner_id, preview_id, etc.)
- AssetReferenceMeta replaces AssetInfoMeta
- AssetReferenceTag replaces AssetInfoTag
- UNIQUE constraint on file_path prevents duplicate entries for same file

Code:
- New unified query module: asset_reference.py (replaces asset_info.py, cache_state.py)
- Updated scanner, seeder, and services to use AssetReference
- Updated API routes to use reference_id instead of asset_info_id

Tests:
- All 175 unit tests updated and passing
- Integration tests require server environment (not run here)

Amp-Thread-ID: https://ampcode.com/threads/T-019c4fe8-9dcb-75ce-bea8-ea786343a581
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 20:03:10 -08:00
Luke Mino-Altherr
3fee0962c6 feat: implement two-phase scanning architecture (fast + enrich)
Phase 1 (FAST): Creates stub records with filesystem metadata only
- path, size, mtime - no file content reading
- Populates asset database quickly on startup

Phase 2 (ENRICH): Extracts metadata and computes hashes
- Safetensors header parsing, MIME types
- Optional blake3 hash computation
- Updates existing stub records

Changes:
- Add ScanPhase enum (FAST, ENRICH, FULL)
- Add enrichment_level column to AssetCacheState (0=stub, 1=metadata, 2=hashed)
- Add build_stub_specs() for fast scanning without metadata extraction
- Add get_unenriched_cache_states(), enrich_asset(), enrich_assets_batch()
- Add start_fast(), start_enrich() convenience methods to AssetSeeder
- Update start() to accept phase parameter (defaults to FULL)
- Split _run_scan() into _run_fast_phase() and _run_enrich_phase()
- Add migration 0003_add_enrichment_level.py
- Update tests for new architecture

Amp-Thread-ID: https://ampcode.com/threads/T-019c4eef-1568-778f-aede-38254728f848
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
53869fb0c7 Fix FK constraint violation in bulk_ingest by filtering dropped assets
Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c3626-c6ad-7139-a570-62da4e656a1a
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
7519a556df Add optional blake3 hashing during asset scanning
- Make blake3 import lazy in hashing.py (only imported when needed)
- Add compute_hashes parameter to AssetSeeder.start(), build_asset_specs(), and seed_assets()
- Fix missing tag clearing: include is_missing states in sync when update_missing_tags=True
- Clear is_missing flag on cache states when files are restored with matching mtime/size
- Fix validation error serialization in routes.py (use json.loads(ve.json()))

Amp-Thread-ID: https://ampcode.com/threads/T-019c3614-56d4-74a8-a717-19922d6dbbee
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
947ca6b61b Fix is_missing state updates for asset cache states on startup
- Add bulk_update_is_missing() to efficiently update is_missing flag
- Update sync_cache_states_with_filesystem() to mark non-existent files as is_missing=True
- Call restore_cache_states_by_paths() in batch_insert_seed_assets() to restore
  previously-missing states when files reappear during scanning

Amp-Thread-ID: https://ampcode.com/threads/T-019c3177-e591-7666-ac6b-7e05c71c8ebf
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
4e431b5969 Rename bulk_set_needs_verify to bulk_update_needs_verify for readability
Amp-Thread-ID: https://ampcode.com/threads/T-019c3167-e8f1-7409-904b-5fc0edaeef37
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
f13b7aeba2 feat: non-destructive asset pruning with is_missing flag
- Add is_missing column to AssetCacheState for soft-delete
- Replace hard-delete pruning with mark_cache_states_missing_outside_prefixes
- Auto-restore missing cache states when files are re-scanned
- Filter out missing cache states from queries by default
- Rename functions for clarity:
  - mark_cache_states_missing_outside_prefixes (was delete_cache_states_outside_prefixes)
  - get_unreferenced_unhashed_asset_ids (was get_orphaned_seed_asset_ids)
  - mark_assets_missing_outside_prefixes (was prune_orphaned_assets)
  - mark_missing_outside_prefixes_safely (was prune_orphans_safely)
- Add restore_cache_states_by_paths for explicit restoration
- Add cleanup_unreferenced_assets for explicit hard-delete when needed
- Update API endpoint /api/assets/prune to use new soft-delete behavior

This preserves user metadata (tags, etc.) when base directories change,
allowing assets to be restored when the original paths become available again.

Amp-Thread-ID: https://ampcode.com/threads/T-019c3114-bf28-73a9-a4d2-85b208fd5462
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
bc92ae4a0d refactor(assets): extract scanner logic into service modules
- Create file_utils.py with shared file utilities:
  - get_mtime_ns() - extract mtime in nanoseconds from stat
  - get_size_and_mtime_ns() - get both size and mtime
  - verify_file_unchanged() - check file matches DB mtime/size
  - list_files_recursively() - recursive directory listing

- Create bulk_ingest.py for bulk operations:
  - BulkInsertResult dataclass
  - batch_insert_seed_assets() - batch insert with conflict handling
  - prune_orphaned_assets() - clean up orphaned assets

- Update scanner.py to use new service modules instead of
  calling database queries directly

- Update ingest.py to use shared get_size_and_mtime_ns()

- Export new functions from services/__init__.py

Amp-Thread-ID: https://ampcode.com/threads/T-019c2ae7-f701-716a-a0dd-1feb988732fb
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
8c4eb9a659 refactor(assets): consolidate duplicated query utilities and remove unused code
- Extract shared helpers to database/queries/common.py:
  - MAX_BIND_PARAMS, calculate_rows_per_statement, iter_chunks, iter_row_chunks
  - build_visible_owner_clause

- Remove duplicate _compute_filename_for_asset, consolidate in path_utils.py

- Remove unused get_asset_info_with_tags (duplicated get_asset_detail)

- Remove redundant __all__ from cache_state.py

- Make internal helpers private (_check_is_scalar)

Amp-Thread-ID: https://ampcode.com/threads/T-019c2ad9-9432-7451-94a8-79287dbbb19e
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
2ddf91c6d9 refactor: add explicit types to asset service functions
- Add typed result dataclasses: IngestResult, AddTagsResult,
  RemoveTagsResult, SetTagsResult, TagUsage
- Add UserMetadata type alias for user_metadata parameters
- Type helper functions with Session parameters
- Use TypedDicts at query layer to avoid circular imports
- Update manager.py and tests to use attribute access

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
4695694263 chore: remove obvious/self-documenting comments from assets package
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
a6a8d3ad74 chore: remove module-level comments and docstrings from assets package
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
fe59234476 chore: sort imports in assets package
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:38 -08:00
Luke Mino-Altherr
1b169a2b2e refactor: use query functions instead of direct ORM modifications in service layer
Add update_asset_info_name and update_asset_info_updated_at query functions
and update asset_management.py to use them instead of modifying ORM objects
directly. This ensures the service layer only uses explicit operations from
the queries package.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
6d6ab81b72 refactor: flatten nested try blocks and if statements in assets package
Extract helper functions to eliminate nested try-except blocks in scanner.py
and remove duplicated type-checking logic in asset_info.py. Simplify nested
conditionals in asset_management.py for clearer control flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
915d21afcb refactor: improve function naming for clarity and consistency
Rename functions to use clearer verb-based names:
- pick_best_live_path → select_best_live_path
- escape_like_prefix → escape_sql_like_string
- list_tree → list_files_recursively
- check_asset_file_fast → verify_asset_file_unchanged
- _seed_from_paths_batch → _batch_insert_assets_from_paths
- reconcile_cache_states_for_root → sync_cache_states_with_filesystem
- touch_asset_info_by_id → update_asset_info_access_time
- replace_asset_info_metadata_projection → set_asset_info_metadata
- expand_metadata_to_rows → convert_metadata_to_rows
- _rows_per_stmt → _calculate_rows_per_statement
- ensure_within_base → validate_path_within_base
- _cleanup_temp → _delete_temp_file_if_exists
- validate_hash_format → normalize_and_validate_hash
- get_relative_to_root_category_path_of_asset → get_asset_category_and_relative_path

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
1f9272bc94 refactor: rename functions to verb-based naming convention
Rename functions across app/assets/ to follow verb-based naming:
- is_scalar → check_is_scalar
- project_kv → expand_metadata_to_rows
- _visible_owner_clause → _build_visible_owner_clause
- _chunk_rows → _iter_row_chunks
- _at_least_one → _validate_at_least_one_field
- _tags_norm → _normalize_tags_field
- _ser_dt → _serialize_datetime
- _ser_updated → _serialize_updated_at
- _error_response → _build_error_response
- _validation_error_response → _build_validation_error_response
- file_sender → stream_file_chunks
- seed_assets_endpoint → seed_assets
- utcnow → get_utc_now
- _safe_sort_field → _validate_sort_field
- _safe_filename → _sanitize_filename
- fast_asset_file_check → check_asset_file_fast
- prefixes_for_root → get_prefixes_for_root
- blake3_hash → compute_blake3_hash
- blake3_hash_async → compute_blake3_hash_async
- _is_within → _check_is_within
- _rel → _compute_relative

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
c546e9315b fix: ruff linting errors and add comprehensive test coverage for asset queries
- Fix unused imports in routes.py, asset.py, manager.py, asset_management.py, ingest.py
- Fix whitespace issues in upload.py, asset_info.py, ingest.py
- Fix typo in manager.py (stray character after result["asset"])
- Fix broken import in test_metadata.py (project_kv moved to asset_info.py)
- Add fixture override in queries/conftest.py for unit test isolation

Add 48 new tests covering all previously untested query functions:
- asset.py: upsert_asset, bulk_insert_assets
- cache_state.py: upsert_cache_state, delete_cache_states_outside_prefixes,
  get_orphaned_seed_asset_ids, delete_assets_by_ids, get_cache_states_for_prefixes,
  bulk_set_needs_verify, delete_cache_states_by_ids, delete_orphaned_seed_asset,
  bulk_insert_cache_states_ignore_conflicts, get_cache_states_by_paths_and_asset_ids
- asset_info.py: insert_asset_info, get_or_create_asset_info,
  update_asset_info_timestamps, replace_asset_info_metadata_projection,
  bulk_insert_asset_infos_ignore_conflicts, get_asset_info_ids_by_ids
- tags.py: bulk_insert_tags_and_meta

Total: 119 tests pass (up from 71)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
24ca007bf6 Refactor helpers.py: move functions to their respective modules
- Move scanner-only functions to scanner.py
- Move query-only functions (is_scalar, project_kv) to asset_info.py
- Move get_query_dict to routes.py
- Create path_utils.py service for path-related functions
- Reduce helpers.py to shared utilities only

Amp-Thread-ID: https://ampcode.com/threads/T-019c2510-33fa-7199-ae4b-bc31102277a7
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
ef97ea8880 refactor: move bulk_ops to queries and scanner service
- Delete bulk_ops.py, moving logic to appropriate layers
- Add bulk insert query functions:
  - queries/asset.bulk_insert_assets
  - queries/cache_state.bulk_insert_cache_states_ignore_conflicts
  - queries/cache_state.get_cache_states_by_paths_and_asset_ids
  - queries/asset_info.bulk_insert_asset_infos_ignore_conflicts
  - queries/asset_info.get_asset_info_ids_by_ids
  - queries/tags.bulk_insert_tags_and_meta
- Move seed_from_paths_batch orchestration to scanner._seed_from_paths_batch

Amp-Thread-ID: https://ampcode.com/threads/T-019c24fd-157d-776a-ad24-4f19cf5d3afe
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
48bfd29fb6 refactor: move scanner to services layer with pure query extraction
- Move app/assets/scanner.py to app/assets/services/scanner.py
- Extract pure queries from fast_db_consistency_pass:
  - get_cache_states_for_prefixes()
  - bulk_set_needs_verify()
  - delete_cache_states_by_ids()
  - delete_orphaned_seed_asset()
- Split prune_orphaned_assets into pure queries:
  - delete_cache_states_outside_prefixes()
  - get_orphaned_seed_asset_ids()
  - delete_assets_by_ids()
- Add reconcile_cache_states_for_root() service function
- Add prune_orphaned_assets() service function
- Remove function injection pattern
- Update imports in main.py, server.py, routes.py

Amp-Thread-ID: https://ampcode.com/threads/T-019c24f1-3385-701b-87e0-8b6bc87e841b
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
7372858f12 refactor: move in-function imports to top-level and remove keyword-only argument pattern
- Move imports from inside functions to module top-level in:
  - app/assets/database/queries/asset.py
  - app/assets/database/queries/asset_info.py
  - app/assets/database/queries/cache_state.py
  - app/assets/manager.py
  - app/assets/services/asset_management.py
  - app/assets/services/ingest.py

- Remove keyword-only argument markers (*,) from app/assets/ to match codebase conventions

Amp-Thread-ID: https://ampcode.com/threads/T-019c24eb-bfa2-727f-8212-8bc976048604
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
02117ae130 Refactor asset database: separate business logic from queries
Architecture changes:
- API Routes -> manager.py (thin adapter) -> services/ (business logic) -> queries/ (atomic DB ops)
- Services own session lifecycle via create_session()
- Queries accept Session as parameter, do single-table atomic operations

New app/assets/services/ layer:
- __init__.py - exports all service functions
- ingest.py - ingest_file_from_path(), register_existing_asset()
- asset_management.py - get_asset_detail(), update_asset_metadata(), delete_asset_reference(), set_asset_preview()
- tagging.py - apply_tags(), remove_tags(), list_tags()

Removed from queries/asset_info.py:
- ingest_fs_asset (moved to services/ingest.py as ingest_file_from_path)
- update_asset_info_full (moved to services/asset_management.py as update_asset_metadata)
- create_asset_info_for_existing_asset (moved to services/ingest.py as register_existing_asset)

Updated manager.py:
- Now a thin adapter that transforms API schemas to/from service calls
- Delegates all business logic to services layer
- No longer imports sqlalchemy.orm.Session or models directly

Test updates:
- Fixed test_cache_state.py import of pick_best_live_path (moved to helpers.py)
- Added comprehensive service layer tests (41 new tests)
- All 112 query + service tests pass

Amp-Thread-ID: https://ampcode.com/threads/T-019c24e2-7ae4-707f-ad19-c775ed8b82b5
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
e708054f7d chore: remove unused Asset import from manager.py
Amp-Thread-ID: https://ampcode.com/threads/T-019c24bb-475b-7442-9ff9-8288edea3345
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00
Luke Mino-Altherr
e46c602c18 refactor(assets): split queries.py into modular query modules
Split the ~1000 line app/assets/database/queries.py into focused modules:

- queries/asset.py - Asset entity queries (asset_exists_by_hash, get_asset_by_hash)
- queries/asset_info.py - AssetInfo queries (~15 functions)
- queries/cache_state.py - AssetCacheState queries (list_cache_states_by_asset_id,
  pick_best_live_path, prune_orphaned_assets, fast_db_consistency_pass)
- queries/tags.py - Tag queries (8 functions including ensure_tags_exist,
  add/remove tag functions, list_tags_with_usage)
- queries/__init__.py - Re-exports all public functions for backward compatibility

Also adds comprehensive unit tests using in-memory SQLite:
- tests-unit/assets_test/queries/conftest.py - Session fixture
- tests-unit/assets_test/queries/test_asset.py - 5 tests
- tests-unit/assets_test/queries/test_asset_info.py - 23 tests
- tests-unit/assets_test/queries/test_cache_state.py - 8 tests
- tests-unit/assets_test/queries/test_metadata.py - 12 tests for _apply_metadata_filter
- tests-unit/assets_test/queries/test_tags.py - 23 tests

All 71 unit tests pass. Existing integration tests unaffected.

Amp-Thread-ID: https://ampcode.com/threads/T-019c24bb-475b-7442-9ff9-8288edea3345
Co-authored-by: Amp <amp@ampcode.com>
2026-02-11 17:41:37 -08:00