Rpc improvement (#480)

* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
This commit is contained in:
firecoperana
2025-06-08 11:43:21 +00:00
committed by GitHub
parent 08273f7e52
commit ed9e8ecc9b
11 changed files with 1255 additions and 505 deletions

View File

@@ -6,7 +6,6 @@ include(CheckIncludeFileCXX)
set(CMAKE_WARN_UNUSED_CLI YES)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED true)

View File

@@ -81,7 +81,9 @@
#endif
#define LLAMA_CURL_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
#endif // LLAMA_USE_CURL
#ifdef GGML_USE_RPC
# include "ggml-rpc.h"
#endif
using json = nlohmann::ordered_json;
//
@@ -1004,6 +1006,35 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
if (arg == "--rpc") {
CHECK_ARG
params.rpc_servers = argv[i];
std::string servers(params.rpc_servers);
size_t pos = 0;
while ((pos = servers.find(",")) != std::string::npos) {
std::string server = servers.substr(0, pos);
ggml_backend_rpc_buffer_type(server.c_str());
servers.erase(0, pos + 1);
}
ggml_backend_rpc_buffer_type(servers.c_str());
return true;
}
if (arg == "--override-kv") {
CHECK_ARG
if (!string_parse_kv_override(argv[i], params.kv_overrides)) {
fprintf(stderr, "error: Invalid type for KV override: %s\n", argv[i]);
invalid_param = true;
return true;
}
return true;
}
if (arg == "--override-tensor" || arg == "-ot") {
CHECK_ARG
/*for (auto endpoint : params.rpc_servers.split)
{
}*/
if (!parse_buft_overrides(std::string{ argv[i] }, params.tensor_buft_overrides)) {
fprintf(stderr, "error: Invalid tensor buffer type override: %s\n", argv[i]);
invalid_param = true;
}
return true;
}
if (arg == "--no-mmap") {
@@ -1211,23 +1242,7 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
sparams.grammar = json_schema_to_grammar(json::parse(argv[i]));
return true;
}
if (arg == "--override-kv") {
CHECK_ARG
if (!string_parse_kv_override(argv[i], params.kv_overrides)) {
fprintf(stderr, "error: Invalid type for KV override: %s\n", argv[i]);
invalid_param = true;
return true;
}
return true;
}
if (arg == "--override-tensor" || arg == "-ot") {
CHECK_ARG
if (!parse_buft_overrides(std::string{argv[i]}, params.tensor_buft_overrides)) {
fprintf(stderr, "error: Invalid tensor buffer type override: %s\n", argv[i]);
invalid_param = true;
}
return true;
}
if (arg == "--offload-policy" || arg == "-op") {
CHECK_ARG
auto p = string_split_pairs<int,int>(argv[i], ',');

View File

@@ -1,2 +1,4 @@
add_executable(rpc-server rpc-server.cpp)
target_link_libraries(rpc-server PRIVATE ggml llama)
set(TARGET rpc-server)
add_executable(${TARGET} rpc-server.cpp)
target_link_libraries(${TARGET} PRIVATE ggml)
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -5,33 +5,166 @@
#ifdef GGML_USE_METAL
#include "ggml-metal.h"
#endif
#ifdef GGML_USE_VULKAN
#include "ggml-vulkan.h"
#endif
#ifdef GGML_USE_SYCL
#include "ggml-sycl.h"
#endif
#include "ggml-rpc.h"
#ifdef _WIN32
# define DIRECTORY_SEPARATOR '\\'
# define NOMINMAX
# include <locale>
# include <windows.h>
# include <fcntl.h>
# include <io.h>
#else
# define DIRECTORY_SEPARATOR '/'
# include <unistd.h>
# include <sys/stat.h>
#endif
#include <string>
#include <stdio.h>
#include <algorithm>
#include <thread>
#include <fstream>
#include <filesystem>
#include <codecvt>
namespace fs = std::filesystem;
// NOTE: this is copied from common.cpp to avoid linking with libcommon
// returns true if successful, false otherwise
static bool fs_create_directory_with_parents(const std::string& path) {
#ifdef _WIN32
std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
std::wstring wpath = converter.from_bytes(path);
// if the path already exists, check whether it's a directory
const DWORD attributes = GetFileAttributesW(wpath.c_str());
if ((attributes != INVALID_FILE_ATTRIBUTES) && (attributes & FILE_ATTRIBUTE_DIRECTORY)) {
return true;
}
size_t pos_slash = 0;
// process path from front to back, procedurally creating directories
while ((pos_slash = path.find('\\', pos_slash)) != std::string::npos) {
const std::wstring subpath = wpath.substr(0, pos_slash);
const wchar_t* test = subpath.c_str();
const bool success = CreateDirectoryW(test, NULL);
if (!success) {
const DWORD error = GetLastError();
// if the path already exists, ensure that it's a directory
if (error == ERROR_ALREADY_EXISTS) {
const DWORD attributes = GetFileAttributesW(subpath.c_str());
if (attributes == INVALID_FILE_ATTRIBUTES || !(attributes & FILE_ATTRIBUTE_DIRECTORY)) {
return false;
}
}
else {
return false;
}
}
pos_slash += 1;
}
return true;
#else
// if the path already exists, check whether it's a directory
struct stat info;
if (stat(path.c_str(), &info) == 0) {
return S_ISDIR(info.st_mode);
}
size_t pos_slash = 1; // skip leading slashes for directory creation
// process path from front to back, procedurally creating directories
while ((pos_slash = path.find('/', pos_slash)) != std::string::npos) {
const std::string subpath = path.substr(0, pos_slash);
struct stat info;
// if the path already exists, ensure that it's a directory
if (stat(subpath.c_str(), &info) == 0) {
if (!S_ISDIR(info.st_mode)) {
return false;
}
}
else {
// create parent directories
const int ret = mkdir(subpath.c_str(), 0755);
if (ret != 0) {
return false;
}
}
pos_slash += 1;
}
return true;
#endif // _WIN32
}
// NOTE: this is copied from common.cpp to avoid linking with libcommon
static std::string fs_get_cache_directory() {
std::string cache_directory = "";
auto ensure_trailing_slash = [](std::string p) {
// Make sure to add trailing slash
if (p.back() != DIRECTORY_SEPARATOR) {
p += DIRECTORY_SEPARATOR;
}
return p;
};
if (getenv("LLAMA_CACHE")) {
cache_directory = std::getenv("LLAMA_CACHE");
}
else {
#if defined(__linux__) || defined(__FreeBSD__) || defined(_AIX)
if (std::getenv("XDG_CACHE_HOME")) {
cache_directory = std::getenv("XDG_CACHE_HOME");
}
else {
cache_directory = std::getenv("HOME") + std::string("/.cache/");
}
#elif defined(__APPLE__)
cache_directory = std::getenv("HOME") + std::string("/Library/Caches/");
#elif defined(_WIN32)
cache_directory = std::getenv("LOCALAPPDATA");
#else
# error Unknown architecture
#endif
cache_directory = ensure_trailing_slash(cache_directory);
cache_directory += "llama.cpp";
}
return ensure_trailing_slash(cache_directory);
}
struct rpc_server_params {
std::string host = "127.0.0.1";
int port = 50052;
size_t backend_mem = 0;
bool use_cache = false;
int n_threads = std::max(1U, std::thread::hardware_concurrency() / 2);
};
static void print_usage(int /*argc*/, char ** argv, rpc_server_params params) {
static void print_usage(int /*argc*/, char** argv, rpc_server_params params) {
fprintf(stderr, "Usage: %s [options]\n\n", argv[0]);
fprintf(stderr, "options:\n");
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -H HOST, --host HOST host to bind to (default: %s)\n", params.host.c_str());
fprintf(stderr, " -p PORT, --port PORT port to bind to (default: %d)\n", params.port);
fprintf(stderr, " -m MEM, --mem MEM backend memory size (in MB)\n");
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -t, --threads number of threads for the CPU backend (default: %d)\n", params.n_threads);
fprintf(stderr, " -H HOST, --host HOST host to bind to (default: %s)\n", params.host.c_str());
fprintf(stderr, " -p PORT, --port PORT port to bind to (default: %d)\n", params.port);
fprintf(stderr, " -m MEM, --mem MEM backend memory size (in MB)\n");
fprintf(stderr, " -c, --cache enable local file cache\n");
fprintf(stderr, "\n");
}
static bool rpc_server_params_parse(int argc, char ** argv, rpc_server_params & params) {
static bool rpc_server_params_parse(int argc, char** argv, rpc_server_params& params) {
std::string arg;
for (int i = 1; i < argc; i++) {
arg = argv[i];
@@ -40,7 +173,18 @@ static bool rpc_server_params_parse(int argc, char ** argv, rpc_server_params &
return false;
}
params.host = argv[i];
} else if (arg == "-p" || arg == "--port") {
}
else if (arg == "-t" || arg == "--threads") {
if (++i >= argc) {
return false;
}
params.n_threads = std::stoi(argv[i]);
if (params.n_threads <= 0) {
fprintf(stderr, "error: invalid number of threads: %d\n", params.n_threads);
return false;
}
}
else if (arg == "-p" || arg == "--port") {
if (++i >= argc) {
return false;
}
@@ -48,15 +192,21 @@ static bool rpc_server_params_parse(int argc, char ** argv, rpc_server_params &
if (params.port <= 0 || params.port > 65535) {
return false;
}
} else if (arg == "-m" || arg == "--mem") {
}
else if (arg == "-c" || arg == "--cache") {
params.use_cache = true;
}
else if (arg == "-m" || arg == "--mem") {
if (++i >= argc) {
return false;
}
params.backend_mem = std::stoul(argv[i]) * 1024 * 1024;
} else if (arg == "-h" || arg == "--help") {
}
else if (arg == "-h" || arg == "--help") {
print_usage(argc, argv, params);
exit(0);
} else {
}
else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
print_usage(argc, argv, params);
exit(0);
@@ -65,7 +215,7 @@ static bool rpc_server_params_parse(int argc, char ** argv, rpc_server_params &
return true;
}
static ggml_backend_t create_backend() {
static ggml_backend_t create_backend(const rpc_server_params& params) {
ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
fprintf(stderr, "%s: using CUDA backend\n", __func__);
@@ -79,12 +229,25 @@ static ggml_backend_t create_backend() {
if (!backend) {
fprintf(stderr, "%s: ggml_backend_metal_init() failed\n", __func__);
}
#elif GGML_USE_VULKAN
fprintf(stderr, "%s: using Vulkan backend\n", __func__);
backend = ggml_backend_vk_init(0); // init device 0
if (!backend) {
fprintf(stderr, "%s: ggml_backend_vulkan_init() failed\n", __func__);
}
#elif GGML_USE_SYCL
fprintf(stderr, "%s: using SYCL backend\n", __func__);
backend = ggml_backend_sycl_init(0); // init device 0
if (!backend) {
fprintf(stderr, "%s: ggml_backend_sycl_init() failed\n", __func__);
}
#endif
// if there aren't GPU Backends fallback to CPU backend
if (!backend) {
fprintf(stderr, "%s: using CPU backend\n", __func__);
backend = ggml_backend_cpu_init();
ggml_backend_cpu_set_n_threads(backend, params.n_threads);
}
return backend;
}
@@ -92,6 +255,10 @@ static ggml_backend_t create_backend() {
static void get_backend_memory(size_t * free_mem, size_t * total_mem) {
#ifdef GGML_USE_CUDA
ggml_backend_cuda_get_device_memory(0, free_mem, total_mem);
#elif GGML_USE_VULKAN
ggml_backend_vk_get_device_memory(0, free_mem, total_mem);
#elif GGML_USE_SYCL
ggml_backend_sycl_get_device_memory(0, free_mem, total_mem);
#else
#ifdef _WIN32
MEMORYSTATUSEX status;
@@ -125,7 +292,7 @@ int main(int argc, char * argv[]) {
fprintf(stderr, "\n");
}
ggml_backend_t backend = create_backend();
ggml_backend_t backend = create_backend(params);
if (!backend) {
fprintf(stderr, "Failed to create backend\n");
return 1;
@@ -135,11 +302,28 @@ int main(int argc, char * argv[]) {
if (params.backend_mem > 0) {
free_mem = params.backend_mem;
total_mem = params.backend_mem;
} else {
}
else {
get_backend_memory(&free_mem, &total_mem);
}
printf("Starting RPC server on %s, backend memory: %zu MB\n", endpoint.c_str(), free_mem / (1024 * 1024));
start_rpc_server(backend, endpoint.c_str(), free_mem, total_mem);
const char * cache_dir = nullptr;
std::string cache_dir_str;
if (params.use_cache) {
cache_dir_str = fs_get_cache_directory() + "rpc/";
if (!fs_create_directory_with_parents(cache_dir_str)) {
fprintf(stderr, "Failed to create cache directory: %s\n", cache_dir_str.c_str());
return 1;
}
cache_dir = cache_dir_str.c_str();
}
printf("Starting RPC server v%d.%d.%d\n",
RPC_PROTO_MAJOR_VERSION,
RPC_PROTO_MINOR_VERSION,
RPC_PROTO_PATCH_VERSION);
printf(" endpoint : %s\n", endpoint.c_str());
printf(" local cache : %s\n", cache_dir ? cache_dir : "n/a");
printf(" backend memory : %zu MB\n", free_mem / (1024 * 1024));
ggml_backend_rpc_start_server(backend, endpoint.c_str(), cache_dir, free_mem, total_mem);
ggml_backend_free(backend);
return 0;
}

View File

@@ -12,6 +12,8 @@
#endif
// increase max payload length to allow use of larger context size
#define CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 1048576
// disable Nagle's algorithm
#define CPPHTTPLIB_TCP_NODELAY true
#include "httplib.h"
// Change JSON_ASSERT from assert() to GGML_ASSERT:
#define JSON_ASSERT GGML_ASSERT

View File

@@ -6,7 +6,6 @@
// Change JSON_ASSERT from assert() to GGML_ASSERT:
#define JSON_ASSERT GGML_ASSERT
#include "json.hpp"
#include <string>
#include <vector>
#include <sstream>

View File

@@ -7,6 +7,9 @@
extern "C" {
#endif
#define RPC_PROTO_MAJOR_VERSION 2
#define RPC_PROTO_MINOR_VERSION 0
#define RPC_PROTO_PATCH_VERSION 1
#define GGML_RPC_MAX_SERVERS 16
// backend API
@@ -17,7 +20,9 @@ GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_rpc_buffer_type(const
GGML_API GGML_CALL void ggml_backend_rpc_get_device_memory(const char * endpoint, size_t * free, size_t * total);
GGML_API GGML_CALL void start_rpc_server(ggml_backend_t backend, const char * endpoint, size_t free_mem, size_t total_mem);
GGML_API GGML_CALL void ggml_backend_rpc_start_server(ggml_backend_t backend, const char * endpoint,
const char * cache_dir,
size_t free_mem, size_t total_mem);
#ifdef __cplusplus
}

View File

@@ -1,6 +1,7 @@
#include "ggml-backend-impl.h"
#include "ggml-alloc.h"
#include "ggml-impl.h"
#include "ggml-rpc.h"
#include <assert.h>
#include <limits.h>
@@ -468,6 +469,10 @@ GGML_CALL static void ggml_backend_registry_init(void) {
extern GGML_CALL int ggml_backend_cann_reg_devices(void);
ggml_backend_cann_reg_devices();
#endif
#ifdef GGML_USE_RPC
extern GGML_CALL void ggml_backend_rpc_reg_devices(void);
ggml_backend_rpc_reg_devices();
#endif
}
GGML_CALL void ggml_backend_register(const char * name, ggml_backend_init_fn init_fn, ggml_backend_buffer_type_t default_buffer_type, void * user_data) {
@@ -943,6 +948,13 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cpu_init(const char * params, v
GGML_UNUSED(user_data);
}
GGML_CALL static ggml_backend_t ggml_backend_reg_rpc_init(const char* params, void* user_data) {
return ggml_backend_rpc_init((const char*)user_data);
GGML_UNUSED(params);
GGML_UNUSED(user_data);
}
// multi-buffer buffer
struct ggml_backend_multi_buffer_context {

File diff suppressed because it is too large Load Diff

View File

@@ -4935,7 +4935,7 @@ static struct ggml_object * ggml_new_object(struct ggml_context * ctx, enum ggml
if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
GGML_PRINT("%s: not enough space in the context's memory pool (needed %zu, available %zu)\n",
__func__, cur_end + size_needed, ctx->mem_size);
__func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
assert(false);
return NULL;
}

View File

@@ -18,6 +18,7 @@
#include <vector>
#include <locale>
#include <codecvt>
#include <iostream>
size_t unicode_len_utf8(char src) {
const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
@@ -25,7 +26,7 @@ size_t unicode_len_utf8(char src) {
return lookup[highbits];
}
static std::string unicode_cpts_to_utf8(const std::vector<uint32_t> & cps) {
static std::string unicode_cpts_to_utf8(const std::vector<uint32_t>& cps) {
std::string result;
for (size_t i = 0; i < cps.size(); ++i) {
result.append(unicode_cpt_to_utf8(cps[i]));
@@ -33,7 +34,7 @@ static std::string unicode_cpts_to_utf8(const std::vector<uint32_t> & cps) {
return result;
}
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset) {
uint32_t unicode_cpt_from_utf8(const std::string& utf8, size_t& offset) {
assert(offset < utf8.size());
if (!(utf8[offset + 0] & 0x80)) {
auto result = utf8[offset + 0];
@@ -44,7 +45,7 @@ uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset) {
throw std::invalid_argument("invalid character");
}
if (!(utf8[offset + 0] & 0x20)) {
if (offset + 1 >= utf8.size() || ! ((utf8[offset + 1] & 0xc0) == 0x80)) {
if (offset + 1 >= utf8.size() || !((utf8[offset + 1] & 0xc0) == 0x80)) {
throw std::invalid_argument("invalid character");
}
auto result = ((utf8[offset + 0] & 0x1f) << 6) | (utf8[offset + 1] & 0x3f);
@@ -52,7 +53,7 @@ uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset) {
return result;
}
if (!(utf8[offset + 0] & 0x10)) {
if (offset + 2 >= utf8.size() || ! ((utf8[offset + 1] & 0xc0) == 0x80) || ! ((utf8[offset + 2] & 0xc0) == 0x80)) {
if (offset + 2 >= utf8.size() || !((utf8[offset + 1] & 0xc0) == 0x80) || !((utf8[offset + 2] & 0xc0) == 0x80)) {
throw std::invalid_argument("invalid character");
}
auto result = ((utf8[offset + 0] & 0x0f) << 12) | ((utf8[offset + 1] & 0x3f) << 6) | (utf8[offset + 2] & 0x3f);
@@ -60,7 +61,7 @@ uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset) {
return result;
}
if (!(utf8[offset + 0] & 0x08)) {
if (offset + 3 >= utf8.size() || ! ((utf8[offset + 1] & 0xc0) == 0x80) || ! ((utf8[offset + 2] & 0xc0) == 0x80) || !((utf8[offset + 3] & 0xc0) == 0x80)) {
if (offset + 3 >= utf8.size() || !((utf8[offset + 1] & 0xc0) == 0x80) || !((utf8[offset + 2] & 0xc0) == 0x80) || !((utf8[offset + 3] & 0xc0) == 0x80)) {
throw std::invalid_argument("invalid character");
}
auto result = ((utf8[offset + 0] & 0x07) << 18) | ((utf8[offset + 1] & 0x3f) << 12) | ((utf8[offset + 2] & 0x3f) << 6) | (utf8[offset + 3] & 0x3f);
@@ -122,10 +123,10 @@ uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset) {
static std::vector<codepoint_flags> unicode_cpt_flags_array() {
std::vector<codepoint_flags> cpt_flags(MAX_CODEPOINTS, codepoint_flags::UNDEFINED);
assert (unicode_ranges_flags.front().first == 0);
assert (unicode_ranges_flags.back().first == MAX_CODEPOINTS);
assert(unicode_ranges_flags.front().first == 0);
assert(unicode_ranges_flags.back().first == MAX_CODEPOINTS);
for (size_t i = 1; i < unicode_ranges_flags.size(); ++i) {
const auto range_ini = unicode_ranges_flags[i-1]; // codepoint_ini, flags
const auto range_ini = unicode_ranges_flags[i - 1]; // codepoint_ini, flags
const auto range_end = unicode_ranges_flags[i]; // codepoint_end, flags
for (uint32_t cpt = range_ini.first; cpt < range_end.first; ++cpt) {
cpt_flags[cpt] = range_ini.second;
@@ -144,7 +145,7 @@ static std::vector<codepoint_flags> unicode_cpt_flags_array() {
cpt_flags[p.second].is_uppercase = true;
}
for (auto &range : unicode_ranges_nfd) { // start, last, nfd
for (auto& range : unicode_ranges_nfd) { // start, last, nfd
cpt_flags[range.nfd].is_nfd = true;
}
@@ -199,22 +200,55 @@ static std::unordered_map<std::string, uint8_t> unicode_utf8_to_byte_map() {
return map;
}
static inline std::wstring unicode_wstring_from_utf8(const std::string & s) {
static inline bool is_valid_utf8(const std::string& str) {
int remaining_bytes = 0; // 当前多字节字符剩余的字节数
for (unsigned char c : str) {
if (remaining_bytes == 0) {
if ((c & 0x80) == 0x00) continue; // 1字节字符
else if ((c & 0xE0) == 0xC0) remaining_bytes = 1; // 2字节
else if ((c & 0xF0) == 0xE0) remaining_bytes = 2; // 3字节
else if ((c & 0xF8) == 0xF0) remaining_bytes = 3; // 4字节
else return false; // 非法起始字节
}
else {
// 检查后续字节是否为10xxxxxx
if ((c & 0xC0) != 0x80)
{
return false;
}
remaining_bytes--;
}
}
return (remaining_bytes == 0); // 确保多字节字符完整
}
static inline std::wstring unicode_wstring_from_utf8(const std::string& s) {
#if defined(__clang__)
// disable C++17 deprecation warning for std::codecvt_utf8
# pragma clang diagnostic push
# pragma clang diagnostic ignored "-Wdeprecated-declarations"
#endif
bool isvalid = is_valid_utf8(s);
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
#if defined(__clang__)
# pragma clang diagnostic pop
#endif
return conv.from_bytes(s);
}
static std::vector<std::string> unicode_byte_encoding_process(const std::vector<std::string> & bpe_words) {
static std::vector<std::string> unicode_byte_encoding_process(const std::vector<std::string>& bpe_words) {
std::vector<std::string> bpe_encoded_words;
for (const auto & word : bpe_words) {
for (const auto& word : bpe_words) {
std::string text_utf;
auto utf_word = unicode_cpts_from_utf8(word);
auto utf_word = unicode_cpts_from_utf8(word);
for (size_t i = 0; i < utf_word.size(); ++i) {
text_utf += unicode_cpt_to_utf8(utf_word[i]);
}
std::string encoded_token;
for (char & c : text_utf) {
for (char& c : text_utf) {
encoded_token += unicode_byte_to_utf8(c);
}
bpe_encoded_words.emplace_back(encoded_token);
@@ -223,7 +257,7 @@ static std::vector<std::string> unicode_byte_encoding_process(const std::vector<
}
// GPT2 system regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & text, const std::vector<size_t> & offsets) {
static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string& text, const std::vector<size_t>& offsets) {
std::vector<size_t> bpe_offsets; // store the offset of each word
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
@@ -237,16 +271,16 @@ static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & t
start = offset_end;
static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
auto _get_cpt = [&] (const size_t pos) -> uint32_t {
auto _get_cpt = [&](const size_t pos) -> uint32_t {
return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
};
auto _get_flags = [&] (const size_t pos) -> codepoint_flags {
auto _get_flags = [&](const size_t pos) -> codepoint_flags {
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : codepoint_flags{};
};
size_t _prev_end = offset_ini;
auto _add_token = [&] (const size_t end) -> size_t {
auto _add_token = [&](const size_t end) -> size_t {
assert(_prev_end <= end && end <= offset_end);
size_t len = end - _prev_end;
if (len > 0) {
@@ -262,29 +296,29 @@ static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & t
return len;
};
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/ ) {
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/) {
const uint32_t cpt = _get_cpt(pos);
const auto flags = _get_flags(pos);
// regex: 's|'t|'re|'ve|'m|'ll|'d
if (cpt == '\'' && pos+1 < offset_end) {
uint32_t cpt_next = _get_cpt(pos+1);
if (cpt == '\'' && pos + 1 < offset_end) {
uint32_t cpt_next = _get_cpt(pos + 1);
if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
pos += _add_token(pos+2);
pos += _add_token(pos + 2);
continue;
}
if (pos+2 < offset_end) {
uint32_t cpt_next_next = _get_cpt(pos+2);
if (pos + 2 < offset_end) {
uint32_t cpt_next_next = _get_cpt(pos + 2);
if ((cpt_next == 'r' && cpt_next_next == 'e') ||
(cpt_next == 'v' && cpt_next_next == 'e') ||
(cpt_next == 'l' && cpt_next_next == 'l')) {
pos += _add_token(pos+3);
pos += _add_token(pos + 3);
continue;
}
}
}
auto flags2 = (cpt == ' ' ? _get_flags(pos+1) : flags);
auto flags2 = (cpt == ' ' ? _get_flags(pos + 1) : flags);
// regex: <space>?\p{L}+
if (flags2.is_letter) {
pos += (cpt == ' ');
@@ -314,12 +348,12 @@ static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & t
}
size_t num_whitespaces = 0;
while (_get_flags(pos+num_whitespaces).is_whitespace) {
while (_get_flags(pos + num_whitespaces).is_whitespace) {
num_whitespaces++;
}
// regex: \s+(?!\S)
if (num_whitespaces > 1 && _get_cpt(pos+num_whitespaces) != OUT_OF_RANGE) {
if (num_whitespaces > 1 && _get_cpt(pos + num_whitespaces) != OUT_OF_RANGE) {
pos += num_whitespaces - 1;
_add_token(pos);
continue;
@@ -341,7 +375,7 @@ static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & t
}
// LLAMA3 system regex: "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string & text, const std::vector<size_t> & offsets) {
static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string& text, const std::vector<size_t>& offsets) {
std::vector<size_t> bpe_offsets; // store the offset of each word
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
@@ -355,16 +389,16 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
start = offset_end;
static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
auto _get_cpt = [&] (const size_t pos) -> uint32_t {
auto _get_cpt = [&](const size_t pos) -> uint32_t {
return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
};
auto _get_flags = [&] (const size_t pos) -> codepoint_flags {
auto _get_flags = [&](const size_t pos) -> codepoint_flags {
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : codepoint_flags{};
};
size_t _prev_end = offset_ini;
auto _add_token = [&] (const size_t end) -> size_t {
auto _add_token = [&](const size_t end) -> size_t {
assert(_prev_end <= end && end <= offset_end);
size_t len = end - _prev_end;
if (len > 0) {
@@ -380,23 +414,23 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
return len;
};
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/ ) {
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/) {
const uint32_t cpt = _get_cpt(pos);
const auto flags = _get_flags(pos);
// regex: (?i:'s|'t|'re|'ve|'m|'ll|'d) // case insensitive
if (cpt == '\'' && pos+1 < offset_end) {
uint32_t cpt_next = unicode_tolower(_get_cpt(pos+1));
if (cpt == '\'' && pos + 1 < offset_end) {
uint32_t cpt_next = unicode_tolower(_get_cpt(pos + 1));
if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
pos += _add_token(pos+2);
pos += _add_token(pos + 2);
continue;
}
if (pos+2 < offset_end) {
uint32_t cpt_next_next = unicode_tolower(_get_cpt(pos+2));
if (pos + 2 < offset_end) {
uint32_t cpt_next_next = unicode_tolower(_get_cpt(pos + 2));
if ((cpt_next == 'r' && cpt_next_next == 'e') ||
(cpt_next == 'v' && cpt_next_next == 'e') ||
(cpt_next == 'l' && cpt_next_next == 'l')) {
pos += _add_token(pos+3);
pos += _add_token(pos + 3);
continue;
}
}
@@ -404,7 +438,7 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
// regex: [^\r\n\p{L}\p{N}]?\p{L}+
if (!(cpt == '\r' || cpt == '\n' || flags.is_number)) {
if (flags.is_letter || _get_flags(pos+1).is_letter) { // one or more letters
if (flags.is_letter || _get_flags(pos + 1).is_letter) { // one or more letters
pos++;
while (_get_flags(pos).is_letter) {
pos++;
@@ -418,7 +452,7 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
if (flags.is_number) {
size_t ini = pos;
while (_get_flags(pos).is_number) {
if (++pos - ini >= 3 ) {
if (++pos - ini >= 3) {
_add_token(pos);
ini = pos;
}
@@ -428,7 +462,7 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
}
// regex: <space>?[^\s\p{L}\p{N}]+[\r\n]*
auto flags2 = (cpt == ' ' ? _get_flags(pos+1) : flags);
auto flags2 = (cpt == ' ' ? _get_flags(pos + 1) : flags);
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags.as_uint()) {
pos += (cpt == ' ');
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
@@ -444,8 +478,8 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
size_t num_whitespaces = 0;
size_t last_end_r_or_n = 0;
while (_get_flags(pos+num_whitespaces).is_whitespace) {
uint32_t cpt2 = _get_cpt(pos+num_whitespaces);
while (_get_flags(pos + num_whitespaces).is_whitespace) {
uint32_t cpt2 = _get_cpt(pos + num_whitespaces);
if (cpt2 == '\r' || cpt2 == '\n') {
last_end_r_or_n = pos + num_whitespaces + 1;
}
@@ -460,7 +494,7 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
}
// regex: \s+(?!\S)
if (num_whitespaces > 1 && _get_cpt(pos+num_whitespaces) != OUT_OF_RANGE) {
if (num_whitespaces > 1 && _get_cpt(pos + num_whitespaces) != OUT_OF_RANGE) {
pos += num_whitespaces - 1;
_add_token(pos);
continue;
@@ -482,7 +516,7 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
}
// use std::wregex to split the text
static std::vector<size_t> unicode_regex_split_stl(const std::wstring & wtext, const std::wstring & regex_expr, const std::vector<size_t> & offsets) {
static std::vector<size_t> unicode_regex_split_stl(const std::wstring& wtext, const std::wstring& regex_expr, const std::vector<size_t>& offsets) {
std::wregex expr(regex_expr);
std::vector<size_t> bpe_offsets; // store the offset of each word
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
@@ -502,7 +536,7 @@ static std::vector<size_t> unicode_regex_split_stl(const std::wstring & wtext, c
++it;
}
if (start_idx < (int64_t) offset) {
if (start_idx < (int64_t)offset) {
bpe_offsets.emplace_back(offset - start_idx);
}
start += offset;
@@ -512,7 +546,7 @@ static std::vector<size_t> unicode_regex_split_stl(const std::wstring & wtext, c
}
// use std::regex to split the text
static std::vector<size_t> unicode_regex_split_stl(const std::string & text, const std::string & regex_expr, const std::vector<size_t> & offsets) {
static std::vector<size_t> unicode_regex_split_stl(const std::string& text, const std::string& regex_expr, const std::vector<size_t>& offsets) {
std::regex expr(regex_expr);
std::vector<size_t> bpe_offsets; // store the offset of each word
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
@@ -532,7 +566,7 @@ static std::vector<size_t> unicode_regex_split_stl(const std::string & text, con
++it;
}
if (start_idx < (int64_t) offset) {
if (start_idx < (int64_t)offset) {
bpe_offsets.emplace_back(offset - start_idx);
}
start += offset;
@@ -541,14 +575,15 @@ static std::vector<size_t> unicode_regex_split_stl(const std::string & text, con
return bpe_offsets;
}
static std::vector<size_t> unicode_regex_split_custom(const std::string & text, const std::string & regex_expr, const std::vector<size_t> & offsets) {
static std::vector<size_t> unicode_regex_split_custom(const std::string& text, const std::string& regex_expr, const std::vector<size_t>& offsets) {
std::vector<size_t> bpe_offsets;
if (regex_expr == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
bpe_offsets = unicode_regex_split_custom_gpt2(text, offsets);
} else if (
regex_expr == "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" ||
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
}
else if (
regex_expr == "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" ||
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
bpe_offsets = unicode_regex_split_custom_llama3(text, offsets);
}
@@ -589,8 +624,8 @@ std::string unicode_cpt_to_utf8(uint32_t cp) {
throw std::invalid_argument("invalid codepoint");
}
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts) {
auto comp = [] (const uint32_t cpt, const range_nfd & range) {
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t>& cpts) {
auto comp = [](const uint32_t cpt, const range_nfd& range) {
return cpt < range.first;
};
std::vector<uint32_t> result(cpts.size());
@@ -602,7 +637,7 @@ std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & c
return result;
}
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8) {
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string& utf8) {
std::vector<uint32_t> result;
result.reserve(utf8.size());
size_t offset = 0;
@@ -618,7 +653,7 @@ codepoint_flags unicode_cpt_flags(const uint32_t cp) {
return cp < cpt_flags.size() ? cpt_flags[cp] : undef;
}
codepoint_flags unicode_cpt_flags(const std::string & utf8) {
codepoint_flags unicode_cpt_flags(const std::string& utf8) {
static const codepoint_flags undef(codepoint_flags::UNDEFINED);
if (utf8.empty()) {
return undef; // undefined
@@ -632,7 +667,7 @@ std::string unicode_byte_to_utf8(uint8_t byte) {
return map.at(byte);
}
uint8_t unicode_utf8_to_byte(const std::string & utf8) {
uint8_t unicode_utf8_to_byte(const std::string& utf8) {
static std::unordered_map<std::string, uint8_t> map = unicode_utf8_to_byte_map();
return map.at(utf8);
}
@@ -642,7 +677,7 @@ uint32_t unicode_tolower(uint32_t cp) {
return it == unicode_map_lowercase.end() ? cp : it->second;
}
std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs) {
std::vector<std::string> unicode_regex_split(const std::string& text, const std::vector<std::string>& regex_exprs) {
// unicode categories
static const std::map<std::string, int> k_ucat_enum = {
{ "\\p{N}", codepoint_flags::NUMBER },
@@ -671,9 +706,9 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
// compute collapsed codepoints only if needed by at least one regex
bool need_collapse = false;
for (auto & regex_expr : regex_exprs) {
for (auto& regex_expr : regex_exprs) {
// search for unicode categories
for (const auto & ucat : k_ucat_enum) {
for (const auto& ucat : k_ucat_enum) {
if (std::string::npos != regex_expr.find(ucat.first)) {
need_collapse = true;
break;
@@ -702,18 +737,20 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
if (flags.is_whitespace) {
//NOTE: C++ std::regex \s does not mach 0x85, Rust and Python regex does.
//text_collapsed[i] = (char) 0x85; // <Next Line> as whitespace fallback
text_collapsed[i] = (char) 0x0B; // <vertical tab> as whitespace fallback
} else if (k_ucat_cpt.find(flags.category_flag()) != k_ucat_cpt.end()) {
text_collapsed[i] = (char)0x0B; // <vertical tab> as whitespace fallback
}
else if (k_ucat_cpt.find(flags.category_flag()) != k_ucat_cpt.end()) {
text_collapsed[i] = k_ucat_cpt.at(flags.category_flag());
} else {
text_collapsed[i] = (char) 0xD0; // fallback
}
else {
text_collapsed[i] = (char)0xD0; // fallback
}
}
}
std::vector<size_t> bpe_offsets = { cpts.size() };
for (auto & regex_expr : regex_exprs) {
for (auto& regex_expr : regex_exprs) {
// first, see if we have an efficient custom regex implementation
auto tmp = unicode_regex_split_custom(text, regex_expr, bpe_offsets);
@@ -727,7 +764,7 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
// if a unicode category is used in the regex, we use the collapsed text and replace the unicode category
// with the corresponding collapsed representation
bool use_collapsed = false;
for (auto & ucat : k_ucat_enum) {
for (auto& ucat : k_ucat_enum) {
if (std::string::npos != regex_expr.find(ucat.first)) {
use_collapsed = true;
break;
@@ -786,7 +823,8 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
//printf("text_collapsed: %s\n", text_collapsed.c_str());
//printf("regex_expr_collapsed: %s\n", regex_expr_collapsed.c_str());
bpe_offsets = unicode_regex_split_stl(text_collapsed, regex_expr_collapsed, bpe_offsets);
} else {
}
else {
// no unicode category used, we can use std::wregex directly
const std::wstring wregex_expr = unicode_wstring_from_utf8(regex_expr);
@@ -802,7 +840,8 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
//printf("regex_expr: %s\n", regex_expr.c_str());
bpe_offsets = unicode_regex_split_stl(wtext, wregex_expr, bpe_offsets);
}
} catch (std::regex_error & e) {
}
catch (std::regex_error& e) {
fprintf(stderr, "Failed to process regex: '%s'\n", regex_expr.c_str());
fprintf(stderr, "Regex error: %s\n", e.what());
throw std::runtime_error("Failed to process regex");
@@ -813,7 +852,7 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
bpe_words.reserve(bpe_offsets.size()); // reserve memory for the approximate size
size_t start = 0;
for (size_t & offset : bpe_offsets) {
for (size_t& offset : bpe_offsets) {
bpe_words.emplace_back();
for (size_t i = start; i < start + offset; ++i) {
bpe_words.back() += unicode_cpt_to_utf8(cpts[i]);