mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-27 01:29:51 +00:00
* Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - **Error Propagation:** - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. * fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tpc no delay for rpc * add back webui * fix rpc function not found error --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>
865 lines
31 KiB
C++
865 lines
31 KiB
C++
#if defined(_MSC_VER)
|
|
#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
|
|
#endif
|
|
|
|
#include "unicode.h"
|
|
#include "unicode-data.h"
|
|
|
|
#include <cassert>
|
|
#include <cstddef>
|
|
#include <cstdint>
|
|
#include <map>
|
|
#include <regex>
|
|
#include <stdexcept>
|
|
#include <string>
|
|
#include <unordered_map>
|
|
#include <unordered_set>
|
|
#include <utility>
|
|
#include <vector>
|
|
#include <locale>
|
|
#include <codecvt>
|
|
#include <iostream>
|
|
|
|
size_t unicode_len_utf8(char src) {
|
|
const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
|
|
uint8_t highbits = static_cast<uint8_t>(src) >> 4;
|
|
return lookup[highbits];
|
|
}
|
|
|
|
static std::string unicode_cpts_to_utf8(const std::vector<uint32_t>& cps) {
|
|
std::string result;
|
|
for (size_t i = 0; i < cps.size(); ++i) {
|
|
result.append(unicode_cpt_to_utf8(cps[i]));
|
|
}
|
|
return result;
|
|
}
|
|
|
|
uint32_t unicode_cpt_from_utf8(const std::string& utf8, size_t& offset) {
|
|
assert(offset < utf8.size());
|
|
if (!(utf8[offset + 0] & 0x80)) {
|
|
auto result = utf8[offset + 0];
|
|
offset += 1;
|
|
return result;
|
|
}
|
|
if (!(utf8[offset + 0] & 0x40)) {
|
|
throw std::invalid_argument("invalid character");
|
|
}
|
|
if (!(utf8[offset + 0] & 0x20)) {
|
|
if (offset + 1 >= utf8.size() || !((utf8[offset + 1] & 0xc0) == 0x80)) {
|
|
throw std::invalid_argument("invalid character");
|
|
}
|
|
auto result = ((utf8[offset + 0] & 0x1f) << 6) | (utf8[offset + 1] & 0x3f);
|
|
offset += 2;
|
|
return result;
|
|
}
|
|
if (!(utf8[offset + 0] & 0x10)) {
|
|
if (offset + 2 >= utf8.size() || !((utf8[offset + 1] & 0xc0) == 0x80) || !((utf8[offset + 2] & 0xc0) == 0x80)) {
|
|
throw std::invalid_argument("invalid character");
|
|
}
|
|
auto result = ((utf8[offset + 0] & 0x0f) << 12) | ((utf8[offset + 1] & 0x3f) << 6) | (utf8[offset + 2] & 0x3f);
|
|
offset += 3;
|
|
return result;
|
|
}
|
|
if (!(utf8[offset + 0] & 0x08)) {
|
|
if (offset + 3 >= utf8.size() || !((utf8[offset + 1] & 0xc0) == 0x80) || !((utf8[offset + 2] & 0xc0) == 0x80) || !((utf8[offset + 3] & 0xc0) == 0x80)) {
|
|
throw std::invalid_argument("invalid character");
|
|
}
|
|
auto result = ((utf8[offset + 0] & 0x07) << 18) | ((utf8[offset + 1] & 0x3f) << 12) | ((utf8[offset + 2] & 0x3f) << 6) | (utf8[offset + 3] & 0x3f);
|
|
offset += 4;
|
|
return result;
|
|
}
|
|
throw std::invalid_argument("failed to convert utf8 to codepoint");
|
|
}
|
|
|
|
//static std::vector<uint16_t> unicode_cpt_to_utf16(uint32_t cp) {
|
|
// std::vector<uint16_t> result;
|
|
// if (/* 0x0000 <= cp && */ cp <= 0xffff) {
|
|
// result.emplace_back(cp);
|
|
// return result;
|
|
// }
|
|
// if (0x10000 <= cp && cp <= 0x10ffff) {
|
|
// result.emplace_back(0xd800 | ((cp - 0x10000) >> 10));
|
|
// result.emplace_back(0xdc00 | ((cp - 0x10000) & 0x03ff));
|
|
// return result;
|
|
// }
|
|
// throw std::invalid_argument("failed to convert codepoint to utf16");
|
|
//}
|
|
|
|
//static std::vector<uint16_t> unicode_cpts_to_utf16(const std::vector<uint32_t> & cps) {
|
|
// std::vector<uint16_t> result;
|
|
// for (size_t i = 0; i < cps.size(); ++i) {
|
|
// auto temp = unicode_cpt_to_utf16(cps[i]);
|
|
// result.insert(result.end(), temp.begin(), temp.end());
|
|
// }
|
|
// return result;
|
|
//}
|
|
|
|
//static uint32_t unicode_cpt_from_utf16(const std::vector<uint16_t> & utf16, size_t & offset) {
|
|
// assert(offset < utf16.size());
|
|
// if (((utf16[0] >> 10) << 10) != 0xd800) {
|
|
// auto result = utf16[offset + 0];
|
|
// offset += 1;
|
|
// return result;
|
|
// }
|
|
//
|
|
// if (offset + 1 >= utf16.size() || !((utf16[1] & 0xdc00) == 0xdc00)) {
|
|
// throw std::invalid_argument("invalid character");
|
|
// }
|
|
//
|
|
// auto result = 0x10000 + (((utf16[0] & 0x03ff) << 10) | (utf16[1] & 0x03ff));
|
|
// offset += 2;
|
|
// return result;
|
|
//}
|
|
|
|
//static std::vector<uint32_t> unicode_cpts_from_utf16(const std::vector<uint16_t> & utf16) {
|
|
// std::vector<uint32_t> result;
|
|
// size_t offset = 0;
|
|
// while (offset < utf16.size()) {
|
|
// result.push_back(unicode_cpt_from_utf16(utf16, offset));
|
|
// }
|
|
// return result;
|
|
//}
|
|
|
|
static std::vector<codepoint_flags> unicode_cpt_flags_array() {
|
|
std::vector<codepoint_flags> cpt_flags(MAX_CODEPOINTS, codepoint_flags::UNDEFINED);
|
|
|
|
assert(unicode_ranges_flags.front().first == 0);
|
|
assert(unicode_ranges_flags.back().first == MAX_CODEPOINTS);
|
|
for (size_t i = 1; i < unicode_ranges_flags.size(); ++i) {
|
|
const auto range_ini = unicode_ranges_flags[i - 1]; // codepoint_ini, flags
|
|
const auto range_end = unicode_ranges_flags[i]; // codepoint_end, flags
|
|
for (uint32_t cpt = range_ini.first; cpt < range_end.first; ++cpt) {
|
|
cpt_flags[cpt] = range_ini.second;
|
|
}
|
|
}
|
|
|
|
for (auto cpt : unicode_set_whitespace) {
|
|
cpt_flags[cpt].is_whitespace = true;
|
|
}
|
|
|
|
for (auto p : unicode_map_lowercase) {
|
|
cpt_flags[p.second].is_lowercase = true;
|
|
}
|
|
|
|
for (auto p : unicode_map_uppercase) {
|
|
cpt_flags[p.second].is_uppercase = true;
|
|
}
|
|
|
|
for (auto& range : unicode_ranges_nfd) { // start, last, nfd
|
|
cpt_flags[range.nfd].is_nfd = true;
|
|
}
|
|
|
|
return cpt_flags;
|
|
}
|
|
|
|
static std::unordered_map<uint8_t, std::string> unicode_byte_to_utf8_map() {
|
|
std::unordered_map<uint8_t, std::string> map;
|
|
for (int ch = 0x21; ch <= 0x7E; ++ch) { // u'!' to u'~'
|
|
assert(0 <= ch && ch < 256);
|
|
map[ch] = unicode_cpt_to_utf8(ch);
|
|
}
|
|
for (int ch = 0xA1; ch <= 0xAC; ++ch) { // u'¡' to u'¬'
|
|
assert(0 <= ch && ch < 256);
|
|
map[ch] = unicode_cpt_to_utf8(ch);
|
|
}
|
|
for (int ch = 0xAE; ch <= 0xFF; ++ch) { // u'®' to u'ÿ'
|
|
assert(0 <= ch && ch < 256);
|
|
map[ch] = unicode_cpt_to_utf8(ch);
|
|
}
|
|
auto n = 0;
|
|
for (int ch = 0; ch < 256; ++ch) {
|
|
if (map.find(ch) == map.end()) {
|
|
map[ch] = unicode_cpt_to_utf8(256 + n);
|
|
++n;
|
|
}
|
|
}
|
|
return map;
|
|
}
|
|
|
|
static std::unordered_map<std::string, uint8_t> unicode_utf8_to_byte_map() {
|
|
std::unordered_map<std::string, uint8_t> map;
|
|
for (int ch = 0x21; ch <= 0x7E; ++ch) { // u'!' to u'~'
|
|
assert(0 <= ch && ch < 256);
|
|
map[unicode_cpt_to_utf8(ch)] = ch;
|
|
}
|
|
for (int ch = 0xA1; ch <= 0xAC; ++ch) { // u'¡' to u'¬'
|
|
assert(0 <= ch && ch < 256);
|
|
map[unicode_cpt_to_utf8(ch)] = ch;
|
|
}
|
|
for (int ch = 0xAE; ch <= 0xFF; ++ch) { // u'®' to u'ÿ'
|
|
assert(0 <= ch && ch < 256);
|
|
map[unicode_cpt_to_utf8(ch)] = ch;
|
|
}
|
|
auto n = 0;
|
|
for (int ch = 0; ch < 256; ++ch) {
|
|
if (map.find(unicode_cpt_to_utf8(ch)) == map.end()) {
|
|
map[unicode_cpt_to_utf8(256 + n)] = ch;
|
|
++n;
|
|
}
|
|
}
|
|
return map;
|
|
}
|
|
|
|
static inline bool is_valid_utf8(const std::string& str) {
|
|
int remaining_bytes = 0; // 当前多字节字符剩余的字节数
|
|
for (unsigned char c : str) {
|
|
if (remaining_bytes == 0) {
|
|
if ((c & 0x80) == 0x00) continue; // 1字节字符
|
|
else if ((c & 0xE0) == 0xC0) remaining_bytes = 1; // 2字节
|
|
else if ((c & 0xF0) == 0xE0) remaining_bytes = 2; // 3字节
|
|
else if ((c & 0xF8) == 0xF0) remaining_bytes = 3; // 4字节
|
|
else return false; // 非法起始字节
|
|
}
|
|
else {
|
|
// 检查后续字节是否为10xxxxxx
|
|
if ((c & 0xC0) != 0x80)
|
|
{
|
|
return false;
|
|
}
|
|
remaining_bytes--;
|
|
}
|
|
}
|
|
return (remaining_bytes == 0); // 确保多字节字符完整
|
|
}
|
|
|
|
static inline std::wstring unicode_wstring_from_utf8(const std::string& s) {
|
|
#if defined(__clang__)
|
|
// disable C++17 deprecation warning for std::codecvt_utf8
|
|
# pragma clang diagnostic push
|
|
# pragma clang diagnostic ignored "-Wdeprecated-declarations"
|
|
#endif
|
|
bool isvalid = is_valid_utf8(s);
|
|
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
|
|
|
|
#if defined(__clang__)
|
|
# pragma clang diagnostic pop
|
|
#endif
|
|
|
|
return conv.from_bytes(s);
|
|
}
|
|
|
|
static std::vector<std::string> unicode_byte_encoding_process(const std::vector<std::string>& bpe_words) {
|
|
std::vector<std::string> bpe_encoded_words;
|
|
for (const auto& word : bpe_words) {
|
|
std::string text_utf;
|
|
auto utf_word = unicode_cpts_from_utf8(word);
|
|
for (size_t i = 0; i < utf_word.size(); ++i) {
|
|
text_utf += unicode_cpt_to_utf8(utf_word[i]);
|
|
}
|
|
|
|
std::string encoded_token;
|
|
for (char& c : text_utf) {
|
|
encoded_token += unicode_byte_to_utf8(c);
|
|
}
|
|
bpe_encoded_words.emplace_back(encoded_token);
|
|
}
|
|
return bpe_encoded_words;
|
|
}
|
|
|
|
// GPT2 system regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
|
|
static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string& text, const std::vector<size_t>& offsets) {
|
|
std::vector<size_t> bpe_offsets; // store the offset of each word
|
|
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
|
|
|
|
const auto cpts = unicode_cpts_from_utf8(text);
|
|
|
|
size_t start = 0;
|
|
for (auto offset : offsets) {
|
|
const size_t offset_ini = start;
|
|
const size_t offset_end = start + offset;
|
|
assert(offset_end <= cpts.size());
|
|
start = offset_end;
|
|
|
|
static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
|
|
auto _get_cpt = [&](const size_t pos) -> uint32_t {
|
|
return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
|
|
};
|
|
|
|
auto _get_flags = [&](const size_t pos) -> codepoint_flags {
|
|
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : codepoint_flags{};
|
|
};
|
|
|
|
size_t _prev_end = offset_ini;
|
|
auto _add_token = [&](const size_t end) -> size_t {
|
|
assert(_prev_end <= end && end <= offset_end);
|
|
size_t len = end - _prev_end;
|
|
if (len > 0) {
|
|
bpe_offsets.push_back(len);
|
|
}
|
|
_prev_end = end;
|
|
//if (len > 0) {
|
|
// std::string s = "";
|
|
// for(size_t p = end-len; p < end; p++)
|
|
// s += unicode_cpt_to_utf8(cpts[p]);
|
|
// printf(">>> '%s'\n", s.c_str());
|
|
//}
|
|
return len;
|
|
};
|
|
|
|
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/) {
|
|
const uint32_t cpt = _get_cpt(pos);
|
|
const auto flags = _get_flags(pos);
|
|
|
|
// regex: 's|'t|'re|'ve|'m|'ll|'d
|
|
if (cpt == '\'' && pos + 1 < offset_end) {
|
|
uint32_t cpt_next = _get_cpt(pos + 1);
|
|
if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
|
|
pos += _add_token(pos + 2);
|
|
continue;
|
|
}
|
|
if (pos + 2 < offset_end) {
|
|
uint32_t cpt_next_next = _get_cpt(pos + 2);
|
|
if ((cpt_next == 'r' && cpt_next_next == 'e') ||
|
|
(cpt_next == 'v' && cpt_next_next == 'e') ||
|
|
(cpt_next == 'l' && cpt_next_next == 'l')) {
|
|
pos += _add_token(pos + 3);
|
|
continue;
|
|
}
|
|
}
|
|
}
|
|
|
|
auto flags2 = (cpt == ' ' ? _get_flags(pos + 1) : flags);
|
|
// regex: <space>?\p{L}+
|
|
if (flags2.is_letter) {
|
|
pos += (cpt == ' ');
|
|
while (flags2.is_letter) {
|
|
flags2 = _get_flags(++pos);
|
|
}
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
// regex: <space>?\p{N}+
|
|
if (flags2.is_number) {
|
|
pos += (cpt == ' ');
|
|
while (flags2.is_number) {
|
|
flags2 = _get_flags(++pos);
|
|
}
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
// regex: <space>?[^\s\p{L}\p{N}]+
|
|
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
|
|
pos += (cpt == ' ');
|
|
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
|
|
flags2 = _get_flags(++pos);
|
|
}
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
size_t num_whitespaces = 0;
|
|
while (_get_flags(pos + num_whitespaces).is_whitespace) {
|
|
num_whitespaces++;
|
|
}
|
|
|
|
// regex: \s+(?!\S)
|
|
if (num_whitespaces > 1 && _get_cpt(pos + num_whitespaces) != OUT_OF_RANGE) {
|
|
pos += num_whitespaces - 1;
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
// regex: \s+
|
|
if (num_whitespaces > 0) {
|
|
pos += num_whitespaces;
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
// no matches
|
|
_add_token(++pos);
|
|
}
|
|
}
|
|
|
|
return bpe_offsets;
|
|
}
|
|
|
|
// LLAMA3 system regex: "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
|
|
static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string& text, const std::vector<size_t>& offsets) {
|
|
std::vector<size_t> bpe_offsets; // store the offset of each word
|
|
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
|
|
|
|
const auto cpts = unicode_cpts_from_utf8(text);
|
|
|
|
size_t start = 0;
|
|
for (auto offset : offsets) {
|
|
const size_t offset_ini = start;
|
|
const size_t offset_end = start + offset;
|
|
assert(offset_end <= cpts.size());
|
|
start = offset_end;
|
|
|
|
static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
|
|
auto _get_cpt = [&](const size_t pos) -> uint32_t {
|
|
return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
|
|
};
|
|
|
|
auto _get_flags = [&](const size_t pos) -> codepoint_flags {
|
|
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : codepoint_flags{};
|
|
};
|
|
|
|
size_t _prev_end = offset_ini;
|
|
auto _add_token = [&](const size_t end) -> size_t {
|
|
assert(_prev_end <= end && end <= offset_end);
|
|
size_t len = end - _prev_end;
|
|
if (len > 0) {
|
|
bpe_offsets.push_back(len);
|
|
}
|
|
_prev_end = end;
|
|
//if (len > 0) {
|
|
// std::string s = "";
|
|
// for(size_t p = end-len; p < end; p++)
|
|
// s += unicode_cpt_to_utf8(cpts[p]);
|
|
// printf(">>> '%s'\n", s.c_str());
|
|
//}
|
|
return len;
|
|
};
|
|
|
|
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/) {
|
|
const uint32_t cpt = _get_cpt(pos);
|
|
const auto flags = _get_flags(pos);
|
|
|
|
// regex: (?i:'s|'t|'re|'ve|'m|'ll|'d) // case insensitive
|
|
if (cpt == '\'' && pos + 1 < offset_end) {
|
|
uint32_t cpt_next = unicode_tolower(_get_cpt(pos + 1));
|
|
if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
|
|
pos += _add_token(pos + 2);
|
|
continue;
|
|
}
|
|
if (pos + 2 < offset_end) {
|
|
uint32_t cpt_next_next = unicode_tolower(_get_cpt(pos + 2));
|
|
if ((cpt_next == 'r' && cpt_next_next == 'e') ||
|
|
(cpt_next == 'v' && cpt_next_next == 'e') ||
|
|
(cpt_next == 'l' && cpt_next_next == 'l')) {
|
|
pos += _add_token(pos + 3);
|
|
continue;
|
|
}
|
|
}
|
|
}
|
|
|
|
// regex: [^\r\n\p{L}\p{N}]?\p{L}+
|
|
if (!(cpt == '\r' || cpt == '\n' || flags.is_number)) {
|
|
if (flags.is_letter || _get_flags(pos + 1).is_letter) { // one or more letters
|
|
pos++;
|
|
while (_get_flags(pos).is_letter) {
|
|
pos++;
|
|
}
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
}
|
|
|
|
// regex: \p{N}{1,3}
|
|
if (flags.is_number) {
|
|
size_t ini = pos;
|
|
while (_get_flags(pos).is_number) {
|
|
if (++pos - ini >= 3) {
|
|
_add_token(pos);
|
|
ini = pos;
|
|
}
|
|
}
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
// regex: <space>?[^\s\p{L}\p{N}]+[\r\n]*
|
|
auto flags2 = (cpt == ' ' ? _get_flags(pos + 1) : flags);
|
|
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags.as_uint()) {
|
|
pos += (cpt == ' ');
|
|
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
|
|
flags2 = _get_flags(++pos);
|
|
}
|
|
uint32_t cpt2 = _get_cpt(pos);
|
|
while (cpt2 == '\r' || cpt2 == '\n') {
|
|
cpt2 = _get_cpt(++pos);
|
|
}
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
size_t num_whitespaces = 0;
|
|
size_t last_end_r_or_n = 0;
|
|
while (_get_flags(pos + num_whitespaces).is_whitespace) {
|
|
uint32_t cpt2 = _get_cpt(pos + num_whitespaces);
|
|
if (cpt2 == '\r' || cpt2 == '\n') {
|
|
last_end_r_or_n = pos + num_whitespaces + 1;
|
|
}
|
|
num_whitespaces++;
|
|
}
|
|
|
|
// regex: \s*[\r\n]+
|
|
if (last_end_r_or_n > 0) {
|
|
pos = last_end_r_or_n;
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
// regex: \s+(?!\S)
|
|
if (num_whitespaces > 1 && _get_cpt(pos + num_whitespaces) != OUT_OF_RANGE) {
|
|
pos += num_whitespaces - 1;
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
// regex: \s+
|
|
if (num_whitespaces > 0) {
|
|
pos += num_whitespaces;
|
|
_add_token(pos);
|
|
continue;
|
|
}
|
|
|
|
// no matches
|
|
_add_token(++pos);
|
|
}
|
|
}
|
|
|
|
return bpe_offsets;
|
|
}
|
|
|
|
// use std::wregex to split the text
|
|
static std::vector<size_t> unicode_regex_split_stl(const std::wstring& wtext, const std::wstring& regex_expr, const std::vector<size_t>& offsets) {
|
|
std::wregex expr(regex_expr);
|
|
std::vector<size_t> bpe_offsets; // store the offset of each word
|
|
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
|
|
size_t start = 0;
|
|
for (auto offset : offsets) {
|
|
std::wcregex_iterator it(wtext.data() + start, wtext.data() + start + offset, expr);
|
|
std::wcregex_iterator end;
|
|
|
|
int64_t start_idx = 0;
|
|
while (it != end) {
|
|
std::wcmatch match = *it;
|
|
if (match.position() > start_idx) {
|
|
bpe_offsets.emplace_back(match.position() - start_idx);
|
|
}
|
|
bpe_offsets.emplace_back(match.length());
|
|
start_idx = match.position() + match.length();
|
|
++it;
|
|
}
|
|
|
|
if (start_idx < (int64_t)offset) {
|
|
bpe_offsets.emplace_back(offset - start_idx);
|
|
}
|
|
start += offset;
|
|
}
|
|
|
|
return bpe_offsets;
|
|
}
|
|
|
|
// use std::regex to split the text
|
|
static std::vector<size_t> unicode_regex_split_stl(const std::string& text, const std::string& regex_expr, const std::vector<size_t>& offsets) {
|
|
std::regex expr(regex_expr);
|
|
std::vector<size_t> bpe_offsets; // store the offset of each word
|
|
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
|
|
size_t start = 0;
|
|
for (auto offset : offsets) {
|
|
std::cregex_iterator it(text.data() + start, text.data() + start + offset, expr);
|
|
std::cregex_iterator end;
|
|
|
|
int64_t start_idx = 0;
|
|
while (it != end) {
|
|
std::cmatch match = *it;
|
|
if (match.position() > start_idx) {
|
|
bpe_offsets.emplace_back(match.position() - start_idx);
|
|
}
|
|
bpe_offsets.emplace_back(match.length());
|
|
start_idx = match.position() + match.length();
|
|
++it;
|
|
}
|
|
|
|
if (start_idx < (int64_t)offset) {
|
|
bpe_offsets.emplace_back(offset - start_idx);
|
|
}
|
|
start += offset;
|
|
}
|
|
|
|
return bpe_offsets;
|
|
}
|
|
|
|
static std::vector<size_t> unicode_regex_split_custom(const std::string& text, const std::string& regex_expr, const std::vector<size_t>& offsets) {
|
|
std::vector<size_t> bpe_offsets;
|
|
|
|
if (regex_expr == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
|
|
bpe_offsets = unicode_regex_split_custom_gpt2(text, offsets);
|
|
}
|
|
else if (
|
|
regex_expr == "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" ||
|
|
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
|
|
|
|
bpe_offsets = unicode_regex_split_custom_llama3(text, offsets);
|
|
}
|
|
|
|
return bpe_offsets;
|
|
}
|
|
|
|
//
|
|
// interface
|
|
//
|
|
|
|
std::string unicode_cpt_to_utf8(uint32_t cp) {
|
|
std::string result;
|
|
|
|
if (/* 0x00 <= cp && */ cp <= 0x7f) {
|
|
result.push_back(cp);
|
|
return result;
|
|
}
|
|
if (0x80 <= cp && cp <= 0x7ff) {
|
|
result.push_back(0xc0 | ((cp >> 6) & 0x1f));
|
|
result.push_back(0x80 | (cp & 0x3f));
|
|
return result;
|
|
}
|
|
if (0x800 <= cp && cp <= 0xffff) {
|
|
result.push_back(0xe0 | ((cp >> 12) & 0x0f));
|
|
result.push_back(0x80 | ((cp >> 6) & 0x3f));
|
|
result.push_back(0x80 | (cp & 0x3f));
|
|
return result;
|
|
}
|
|
if (0x10000 <= cp && cp <= 0x10ffff) {
|
|
result.push_back(0xf0 | ((cp >> 18) & 0x07));
|
|
result.push_back(0x80 | ((cp >> 12) & 0x3f));
|
|
result.push_back(0x80 | ((cp >> 6) & 0x3f));
|
|
result.push_back(0x80 | (cp & 0x3f));
|
|
return result;
|
|
}
|
|
|
|
throw std::invalid_argument("invalid codepoint");
|
|
}
|
|
|
|
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t>& cpts) {
|
|
auto comp = [](const uint32_t cpt, const range_nfd& range) {
|
|
return cpt < range.first;
|
|
};
|
|
std::vector<uint32_t> result(cpts.size());
|
|
for (size_t i = 0; i < cpts.size(); ++i) {
|
|
const uint32_t cpt = cpts[i];
|
|
auto it = std::upper_bound(unicode_ranges_nfd.cbegin(), unicode_ranges_nfd.cend(), cpt, comp) - 1;
|
|
result[i] = (it->first <= cpt && cpt <= it->last) ? it->nfd : cpt;
|
|
}
|
|
return result;
|
|
}
|
|
|
|
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string& utf8) {
|
|
std::vector<uint32_t> result;
|
|
result.reserve(utf8.size());
|
|
size_t offset = 0;
|
|
while (offset < utf8.size()) {
|
|
result.push_back(unicode_cpt_from_utf8(utf8, offset));
|
|
}
|
|
return result;
|
|
}
|
|
|
|
codepoint_flags unicode_cpt_flags(const uint32_t cp) {
|
|
static const codepoint_flags undef(codepoint_flags::UNDEFINED);
|
|
static const auto cpt_flags = unicode_cpt_flags_array();
|
|
return cp < cpt_flags.size() ? cpt_flags[cp] : undef;
|
|
}
|
|
|
|
codepoint_flags unicode_cpt_flags(const std::string& utf8) {
|
|
static const codepoint_flags undef(codepoint_flags::UNDEFINED);
|
|
if (utf8.empty()) {
|
|
return undef; // undefined
|
|
}
|
|
size_t offset = 0;
|
|
return unicode_cpt_flags(unicode_cpt_from_utf8(utf8, offset));
|
|
}
|
|
|
|
std::string unicode_byte_to_utf8(uint8_t byte) {
|
|
static std::unordered_map<uint8_t, std::string> map = unicode_byte_to_utf8_map();
|
|
return map.at(byte);
|
|
}
|
|
|
|
uint8_t unicode_utf8_to_byte(const std::string& utf8) {
|
|
static std::unordered_map<std::string, uint8_t> map = unicode_utf8_to_byte_map();
|
|
return map.at(utf8);
|
|
}
|
|
|
|
uint32_t unicode_tolower(uint32_t cp) {
|
|
auto it = unicode_map_lowercase.find(cp);
|
|
return it == unicode_map_lowercase.end() ? cp : it->second;
|
|
}
|
|
|
|
std::vector<std::string> unicode_regex_split(const std::string& text, const std::vector<std::string>& regex_exprs) {
|
|
// unicode categories
|
|
static const std::map<std::string, int> k_ucat_enum = {
|
|
{ "\\p{N}", codepoint_flags::NUMBER },
|
|
{ "\\p{L}", codepoint_flags::LETTER },
|
|
{ "\\p{P}", codepoint_flags::PUNCTUATION },
|
|
{ "\\p{M}", codepoint_flags::ACCENT_MARK },
|
|
{ "\\p{S}", codepoint_flags::SYMBOL },
|
|
};
|
|
|
|
static const std::map<int, int> k_ucat_cpt = {
|
|
{ codepoint_flags::NUMBER, 0xD1 },
|
|
{ codepoint_flags::LETTER, 0xD2 },
|
|
{ codepoint_flags::PUNCTUATION, 0xD3 },
|
|
{ codepoint_flags::ACCENT_MARK, 0xD4 },
|
|
{ codepoint_flags::SYMBOL, 0xD5 },
|
|
|
|
};
|
|
|
|
static const std::map<int, std::string> k_ucat_map = {
|
|
{ codepoint_flags::NUMBER, "\x30-\x39" }, // 0-9
|
|
{ codepoint_flags::LETTER, "\x41-\x5A\x61-\x7A" }, // A-Za-z
|
|
{ codepoint_flags::PUNCTUATION, "\x21-\x23\x25-\x2A\x2C-\x2F\x3A-\x3B\x3F-\x40\\\x5B-\\\x5D\x5F\\\x7B\\\x7D" }, // !-#%-*,-/:-;?-@\[-\]_\{\}i
|
|
{ codepoint_flags::ACCENT_MARK, "" }, // no sub-128 codepoints
|
|
{ codepoint_flags::SYMBOL, "\\\x24\\\x2B\x3C-\x3E\x5E\x60\\\x7C" }, // $+<=>^`|
|
|
};
|
|
|
|
// compute collapsed codepoints only if needed by at least one regex
|
|
bool need_collapse = false;
|
|
for (auto& regex_expr : regex_exprs) {
|
|
// search for unicode categories
|
|
for (const auto& ucat : k_ucat_enum) {
|
|
if (std::string::npos != regex_expr.find(ucat.first)) {
|
|
need_collapse = true;
|
|
break;
|
|
}
|
|
}
|
|
}
|
|
|
|
const auto cpts = unicode_cpts_from_utf8(text);
|
|
|
|
// generate a "collapsed" representation of the text, where all codepoints are replaced by a single byte
|
|
// ref: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2081479935
|
|
std::string text_collapsed;
|
|
if (need_collapse) {
|
|
// collapse all unicode categories
|
|
text_collapsed.resize(cpts.size());
|
|
|
|
for (size_t i = 0; i < cpts.size(); ++i) {
|
|
// keep single-byte codepoints as is
|
|
if (cpts[i] < 128) {
|
|
text_collapsed[i] = cpts[i];
|
|
continue;
|
|
}
|
|
|
|
const auto flags = unicode_cpt_flags(cpts[i]);
|
|
|
|
if (flags.is_whitespace) {
|
|
//NOTE: C++ std::regex \s does not mach 0x85, Rust and Python regex does.
|
|
//text_collapsed[i] = (char) 0x85; // <Next Line> as whitespace fallback
|
|
text_collapsed[i] = (char)0x0B; // <vertical tab> as whitespace fallback
|
|
}
|
|
else if (k_ucat_cpt.find(flags.category_flag()) != k_ucat_cpt.end()) {
|
|
text_collapsed[i] = k_ucat_cpt.at(flags.category_flag());
|
|
}
|
|
else {
|
|
text_collapsed[i] = (char)0xD0; // fallback
|
|
}
|
|
}
|
|
}
|
|
|
|
std::vector<size_t> bpe_offsets = { cpts.size() };
|
|
|
|
for (auto& regex_expr : regex_exprs) {
|
|
// first, see if we have an efficient custom regex implementation
|
|
auto tmp = unicode_regex_split_custom(text, regex_expr, bpe_offsets);
|
|
|
|
if (!tmp.empty()) {
|
|
bpe_offsets = std::move(tmp);
|
|
continue;
|
|
}
|
|
|
|
// fallback to general-purpose std::regex / std::wregex
|
|
try {
|
|
// if a unicode category is used in the regex, we use the collapsed text and replace the unicode category
|
|
// with the corresponding collapsed representation
|
|
bool use_collapsed = false;
|
|
for (auto& ucat : k_ucat_enum) {
|
|
if (std::string::npos != regex_expr.find(ucat.first)) {
|
|
use_collapsed = true;
|
|
break;
|
|
}
|
|
}
|
|
|
|
if (use_collapsed) {
|
|
// sanity-check that the original regex does not contain any non-ASCII characters
|
|
const auto cpts_regex = unicode_cpts_from_utf8(regex_expr);
|
|
for (size_t i = 0; i < cpts_regex.size(); ++i) {
|
|
if (cpts_regex[i] >= 128) {
|
|
throw std::runtime_error("Regex includes both unicode categories and non-ASCII characters - not supported");
|
|
}
|
|
}
|
|
|
|
// generate a collapsed representation of the regex
|
|
std::string regex_expr_collapsed;
|
|
|
|
// track if we are inside [], because nested [] are not allowed
|
|
bool inside = false;
|
|
for (size_t i = 0; i < regex_expr.size(); ++i) {
|
|
if (regex_expr[i] == '[' && (i == 0 || regex_expr[i - 1] != '\\')) {
|
|
regex_expr_collapsed += '[';
|
|
inside = true;
|
|
continue;
|
|
}
|
|
|
|
if (inside && regex_expr[i] == ']' && regex_expr[i - 1] != '\\') {
|
|
regex_expr_collapsed += ']';
|
|
inside = false;
|
|
continue;
|
|
}
|
|
|
|
if (regex_expr[i + 0] == '\\' && i + 4 < regex_expr.size() &&
|
|
regex_expr[i + 1] == 'p' &&
|
|
regex_expr[i + 2] == '{' &&
|
|
regex_expr[i + 4] == '}') {
|
|
const std::string pat = regex_expr.substr(i, 5);
|
|
if (k_ucat_enum.find(pat) != k_ucat_enum.end()) {
|
|
if (!inside) {
|
|
regex_expr_collapsed += '[';
|
|
}
|
|
regex_expr_collapsed += k_ucat_cpt.at(k_ucat_enum.at(pat));
|
|
regex_expr_collapsed += k_ucat_map.at(k_ucat_enum.at(pat));
|
|
if (!inside) {
|
|
regex_expr_collapsed += ']';
|
|
}
|
|
i += 4;
|
|
continue;
|
|
}
|
|
}
|
|
|
|
regex_expr_collapsed += regex_expr[i];
|
|
}
|
|
|
|
//printf("text_collapsed: %s\n", text_collapsed.c_str());
|
|
//printf("regex_expr_collapsed: %s\n", regex_expr_collapsed.c_str());
|
|
bpe_offsets = unicode_regex_split_stl(text_collapsed, regex_expr_collapsed, bpe_offsets);
|
|
}
|
|
else {
|
|
// no unicode category used, we can use std::wregex directly
|
|
const std::wstring wregex_expr = unicode_wstring_from_utf8(regex_expr);
|
|
|
|
// std::wregex \s does not mach non-ASCII whitespaces, using 0x0B as fallback
|
|
std::wstring wtext(cpts.begin(), cpts.end());
|
|
for (size_t i = 0; i < wtext.size(); ++i) {
|
|
if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
|
|
wtext[i] = 0x0B;
|
|
}
|
|
}
|
|
|
|
//printf("text: %s\n", text.c_str());
|
|
//printf("regex_expr: %s\n", regex_expr.c_str());
|
|
bpe_offsets = unicode_regex_split_stl(wtext, wregex_expr, bpe_offsets);
|
|
}
|
|
}
|
|
catch (std::regex_error& e) {
|
|
fprintf(stderr, "Failed to process regex: '%s'\n", regex_expr.c_str());
|
|
fprintf(stderr, "Regex error: %s\n", e.what());
|
|
throw std::runtime_error("Failed to process regex");
|
|
}
|
|
}
|
|
|
|
std::vector<std::string> bpe_words;
|
|
bpe_words.reserve(bpe_offsets.size()); // reserve memory for the approximate size
|
|
|
|
size_t start = 0;
|
|
for (size_t& offset : bpe_offsets) {
|
|
bpe_words.emplace_back();
|
|
for (size_t i = start; i < start + offset; ++i) {
|
|
bpe_words.back() += unicode_cpt_to_utf8(cpts[i]);
|
|
}
|
|
start += offset;
|
|
}
|
|
|
|
return unicode_byte_encoding_process(bpe_words);
|
|
}
|