* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
[ROCm/composable_kernel commit: c3a4800c5f]
* [CK_TILE] Image to Column kernel
* Fixes
* Vector loads and stores
* Fixes
* Fixes
* change test dir name
[ROCm/composable_kernel commit: de3e3b6424]