Commit 4cd7347

stasoid authored and tordex committed
Support standard HTML character encodings
1 parent 289c153 commit 4cd7347

12 files changed

+2095
-53
lines changed
‎doc/document_createFromString.txt

62 additions & 0 deletions
@@ -0,0 +1,62 @@
+static document::ptr document::createFromString(
+    const estring& str,
+    document_container* container,
+    const string& master_styles = litehtml::master_css,
+    const string& user_styles = "");
+
+---------------------------------------------------------------------------------------------------------
+Terminology:
+
+BOM encoding is the encoding indicated by the byte order mark (BOM). It can be UTF-8, UTF-16LE, or UTF-16BE.
+It cannot be UTF-32, because UTF-32 is not a valid HTML encoding. See bom_sniff.
+
+meta encoding is the HTML encoding declared by a valid <meta> charset tag.
+
+A valid <meta> charset tag:
+  * must be inside <head>
+  * must have one of these forms:
+      <meta charset="utf-8"> or
+      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  * must name the encoding with one of the encoding labels listed at https://encoding.spec.whatwg.org/#names-and-labels (see get_encoding)
+
+HTTP encoding is the encoding specified in the HTTP Content-Type header, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
+
+user override encoding is the encoding the user chose manually for a particular page or site, if your program offers that option
+---------------------------------------------------------------------------------------------------------
+
+
+Call without specifying an encoding:
+createFromString(string, container), where string is a std::string or const char*
+  * if a BOM is present, the BOM encoding is used
+  * otherwise, if a valid <meta> tag is present, the meta encoding is used
+  * otherwise, UTF-8 is used
+
+Call with an encoding, confidence certain:
+createFromString({string, encoding::big5}, container)
+  * if a BOM is present, the BOM encoding is used
+  * otherwise, Big5 is used
+NOTE: the encoding from the <meta> tag is ignored
+
+Call with an encoding, confidence tentative (very rare, you probably don't need this):
+createFromString({string, encoding::big5, confidence::tentative}, container)
+  * if a BOM is present, the BOM encoding is used
+  * otherwise, if a valid <meta> tag is present, the meta encoding is used
+  * otherwise, Big5 is used
+
+---------------------------------------------------------------------------------------------------------
+
+User override encoding and HTTP encoding must be passed with confidence certain. If both are present, the user
+override encoding should take precedence.
+
+If neither a user override encoding nor an HTTP encoding is available, your program may guess the encoding, for example
+from the encoding the page had when it was last visited, by frequency analysis, from the URL domain, or
+from the current user locale. Any such guess should be passed with confidence tentative.
+The precedence of these guesses is specified in the encoding sniffing algorithm; see litehtml::encoding_sniffing_algorithm
+and https://html.spec.whatwg.org/multipage/parsing.html#encoding-sniffing-algorithm
+
+litehtml implements only the first and the last step of this algorithm:
+it sets the encoding to the BOM encoding if a BOM is present, and it tentatively sets the encoding to UTF-8 if the encoding is unknown.
+
+If your program displays HTML files from the web, it is recommended to detect the HTTP encoding, because
+it is not unusual for a web page to specify its encoding only in the HTTP header, or for its meta encoding to differ
+from its HTTP encoding (the HTTP encoding takes precedence in this case).
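Reviewer note: the precedence rules in this text file can be condensed into two small functions. The following is a minimal sketch, not litehtml code. The enums are reduced stand-ins for `litehtml::encoding` and `litehtml::confidence`, and `passed_encoding`, `choose_passed_encoding` and `resolve_encoding` are hypothetical names introduced only for illustration. The first function models what the calling program is asked to do (user override, then HTTP, then a tentative guess); the second models the outcome described by the three call forms.

```cpp
#include <cassert>

// Hypothetical, reduced stand-ins for litehtml::encoding and litehtml::confidence.
enum class encoding { null, utf_8, utf_16le, big5, windows_1251 };
enum class confidence { tentative, certain };

// Hypothetical pair: the encoding and confidence a caller passes with the bytes.
struct passed_encoding
{
    encoding   enc;
    confidence conf;
};

// Caller side: user override beats HTTP, both are certain; a guess is tentative.
passed_encoding choose_passed_encoding(encoding user_override, encoding http, encoding guess)
{
    if (user_override != encoding::null) return {user_override, confidence::certain};
    if (http != encoding::null)          return {http, confidence::certain};
    if (guess != encoding::null)         return {guess, confidence::tentative};
    return {encoding::null, confidence::certain}; // nothing known, same as passing no encoding
}

// Outcome per the documented call forms:
// BOM > certain encoding > meta encoding > tentative encoding > UTF-8.
encoding resolve_encoding(encoding bom, encoding meta, passed_encoding passed)
{
    if (bom != encoding::null) return bom;
    if (passed.enc != encoding::null && passed.conf == confidence::certain) return passed.enc;
    if (meta != encoding::null) return meta;
    if (passed.enc != encoding::null) return passed.enc; // tentative fallback
    return encoding::utf_8;
}

void demo_precedence()
{
    // User override wins over HTTP, and a certain encoding makes <meta> irrelevant.
    passed_encoding p = choose_passed_encoding(encoding::big5, encoding::windows_1251, encoding::null);
    assert(p.enc == encoding::big5 && p.conf == confidence::certain);
    assert(resolve_encoding(encoding::null, encoding::utf_8, p) == encoding::big5);

    // A guess is tentative, so a valid <meta> tag overrides it.
    passed_encoding g = choose_passed_encoding(encoding::null, encoding::null, encoding::windows_1251);
    assert(resolve_encoding(encoding::null, encoding::utf_8, g) == encoding::utf_8);
    assert(resolve_encoding(encoding::null, encoding::null, g) == encoding::windows_1251);

    // A BOM beats everything.
    assert(resolve_encoding(encoding::utf_16le, encoding::utf_8, p) == encoding::utf_16le);
}
```

In a real program the pair returned by the first function would be passed as `{html, enc, conf}` to `createFromString`; the second function only restates what the documentation says happens after that.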

‎include/litehtml/document.h

10 additions & 1 deletion
@@ -4,6 +4,8 @@
 #include "style.h"
 #include "types.h"
 #include "master_css.h"
+#include "encodings.h"
+typedef struct GumboInternalOutput GumboOutput;

 namespace litehtml
 {
@@ -70,6 +72,7 @@ namespace litehtml
         media_features m_media;
         string m_lang;
         string m_culture;
+        string m_text;
     public:
         document(document_container* objContainer);
         virtual ~document();
@@ -105,11 +108,17 @@ namespace litehtml
         void append_children_from_string(element& parent, const char* str);
         void dump(dumper& cout);

-        static litehtml::document::ptr createFromString(const char* str, litehtml::document_container* objPainter, const char* master_styles = litehtml::master_css, const char* user_styles = "");
+        // see doc/document_createFromString.txt
+        static document::ptr createFromString(
+            const estring& str,
+            document_container* container,
+            const string& master_styles = litehtml::master_css,
+            const string& user_styles = "");

     private:
         uint_ptr add_font(const char* name, int size, const char* weight, const char* style, const char* decoration, font_metrics* fm);

+        GumboOutput* parse_html(estring str);
         void create_node(void* gnode, elements_list& elements, bool parseTextNode);
         bool update_media_lists(const media_features& features);
         void fix_tables_layout();
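Reviewer note: the first parameter changes from `const char*` to `const estring&`, yet callers that pass a string literal or a `std::string` should keep compiling, because `estring` (declared in include/litehtml/encodings.h in this commit) has non-explicit converting constructors. The sketch below illustrates that with a reduced copy of `estring`. It is an assumption-laden illustration: the `sketch` namespace, the shortened enums, the renamed members `enc` and `conf`, and the `describe` function (a stand-in for `createFromString`) are all hypothetical.

```cpp
#include <cassert>
#include <string>

namespace sketch
{
    // Reduced stand-ins for litehtml::encoding and litehtml::confidence.
    enum class encoding { null, utf_8, big5 };
    enum class confidence { tentative, certain };

    // Reduced copy of estring; members renamed to enc/conf for this sketch.
    struct estring : std::string
    {
        encoding   enc;
        confidence conf;

        estring(const std::string& str, encoding e = encoding::null, confidence c = confidence::certain)
            : std::string(str), enc(e), conf(c) {}

        estring(const char* str) : std::string(str), enc(encoding::null), conf(confidence::certain) {}
    };

    // Hypothetical stand-in for document::createFromString: reports what it received.
    std::string describe(const estring& str)
    {
        const char* enc = str.enc == encoding::null ? "unspecified"
                        : str.enc == encoding::big5 ? "big5" : "utf-8";
        const char* conf = str.conf == confidence::certain ? "certain" : "tentative";
        return std::string(enc) + "/" + conf;
    }
}

void demo_call_forms()
{
    std::string html = "<p>text</p>";

    // Old-style calls still work through the converting constructors.
    assert(sketch::describe("<p>text</p>") == "unspecified/certain");
    assert(sketch::describe(html) == "unspecified/certain");

    // New call forms use a braced initializer list.
    assert(sketch::describe({html, sketch::encoding::big5}) == "big5/certain");
    assert(sketch::describe({html, sketch::encoding::big5, sketch::confidence::tentative}) == "big5/tentative");
}
```

Note that the style arguments also change from `const char*` to `const string&`, so a caller that passed a null pointer for either of them would need to be updated.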

‎include/litehtml/encodings.h

90 additions & 0 deletions
@@ -0,0 +1,90 @@
+#ifndef LH_ENCODINGS_H
+#define LH_ENCODINGS_H
+
+namespace litehtml
+{
+
+// https://encoding.spec.whatwg.org/#names-and-labels
+enum class encoding
+{
+    null, // indicates error or absence of encoding
+    utf_8,
+
+    // Legacy single-byte encodings; must be in sync with single_byte_indexes
+    ibm866,
+    iso_8859_2,
+    iso_8859_3,
+    iso_8859_4,
+    iso_8859_5,
+    iso_8859_6,
+    iso_8859_7,
+    iso_8859_8,
+    iso_8859_8_i,
+    iso_8859_10,
+    iso_8859_13,
+    iso_8859_14,
+    iso_8859_15,
+    iso_8859_16,
+    koi8_r,
+    koi8_u,
+    macintosh,
+    windows_874,
+    windows_1250,
+    windows_1251,
+    windows_1252,
+    windows_1253,
+    windows_1254,
+    windows_1255,
+    windows_1256,
+    windows_1257,
+    windows_1258,
+    x_mac_cyrillic,
+
+    // Legacy multi-byte East Asian encodings
+    gbk,
+    gb18030,
+    big5,
+    euc_jp,
+    iso_2022_jp,
+    shift_jis,
+    euc_kr,
+
+    // Legacy miscellaneous encodings
+    replacement,
+    utf_16be,
+    utf_16le,
+    x_user_defined
+};
+
+// https://html.spec.whatwg.org/multipage/parsing.html#concept-encoding-confidence
+enum class confidence // encoding confidence
+{
+    tentative,
+    certain,
+    // irrelevant // not used here
+};
+
+// Used as argument for document::createFromString, parse_html and encoding_sniffing_algorithm.
+struct estring : string // string with encoding
+{
+    litehtml::encoding encoding;
+    litehtml::confidence confidence;
+
+    estring(const string& str, litehtml::encoding encoding = encoding::null, litehtml::confidence confidence = confidence::certain)
+        : string(str), encoding(encoding), confidence(confidence) {}
+
+    estring(const char* str) : string(str), encoding(encoding::null), confidence(confidence::certain) {}
+};
+
+
+encoding bom_sniff(const string& str);
+void encoding_sniffing_algorithm(estring& str);
+
+encoding get_encoding(string label);
+encoding extract_encoding_from_meta_element(string str);
+
+void decode(string input, encoding coding, string& output);
+
+} // namespace litehtml
+
+#endif // LH_ENCODINGS_H
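Reviewer note: src/encodings.cpp is not part of this view, so the body of `bom_sniff` is not visible here. As a rough illustration of what a BOM check involves, here is a minimal sketch that recognizes the three byte order marks named in doc/document_createFromString.txt (UTF-8, UTF-16BE, UTF-16LE). The name `bom_sniff_sketch` and the reduced enum are hypothetical, and the real function may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reduced stand-in for litehtml::encoding (hypothetical, for this sketch only).
enum class encoding { null, utf_8, utf_16be, utf_16le };

// Returns the encoding indicated by a leading byte order mark, or encoding::null.
encoding bom_sniff_sketch(const std::string& str)
{
    // Read bytes as unsigned so that comparisons with 0xEF etc. work when char is signed.
    auto byte = [&str](std::size_t i) { return static_cast<unsigned char>(str[i]); };

    if (str.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return encoding::utf_8;
    if (str.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return encoding::utf_16be;
    if (str.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return encoding::utf_16le;
    return encoding::null;
}

void demo_bom_sniff()
{
    assert(bom_sniff_sketch("\xEF\xBB\xBF<p>") == encoding::utf_8);
    assert(bom_sniff_sketch(std::string("\xFE\xFF", 2)) == encoding::utf_16be);
    assert(bom_sniff_sketch(std::string("\xFF\xFE", 2)) == encoding::utf_16le);
    assert(bom_sniff_sketch("<p>no BOM</p>") == encoding::null);
}
```

The sketch returns `encoding::null` when no BOM is found, matching the header comment that `null` "indicates error or absence of encoding".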

‎include/litehtml/html.h

7 additions & 1 deletion
@@ -23,7 +23,7 @@

 namespace litehtml
 {
-    void trim(string &s, const string& chars_to_trim = " \n\r\t");
+    string& trim(string &s, const string& chars_to_trim = " \n\r\f\t");
     void lcase(string &s);
     int value_index(const string& val, const string& strings, int defValue = -1, char delim = ';');
     string index_value(int index, const string& strings, char delim = ';');
@@ -39,6 +39,12 @@ namespace litehtml

     bool is_number(const string& string, const bool allow_dot = 1);

+    // https://infra.spec.whatwg.org/#ascii-whitespace
+    inline bool is_whitespace(char c)
+    {
+        return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
+    }
+
     inline int t_isdigit(int c)
     {
         return (c >= '0' && c <= '9');
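Reviewer note: two things change for `trim` in this hunk. Form feed (`\f`) joins the default trim set, which brings it in line with the ASCII whitespace definition used by `is_whitespace`, and the function now returns `string&`, which makes it possible to use the trimmed string in the same expression. The body of `trim` is not shown in this view, so the following is only a sketch of one plausible implementation; `trim_sketch` is a hypothetical name, while `is_whitespace` is copied from the hunk.

```cpp
#include <cassert>
#include <string>

// Copied from the hunk: ASCII whitespace as defined by the Infra standard.
inline bool is_whitespace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

// Assumed implementation; only the signature and default argument come from the diff.
std::string& trim_sketch(std::string& s, const std::string& chars_to_trim = " \n\r\f\t")
{
    const std::string::size_type first = s.find_first_not_of(chars_to_trim);
    if (first == std::string::npos)
    {
        s.clear(); // nothing but trimmable characters
        return s;
    }
    const std::string::size_type last = s.find_last_not_of(chars_to_trim);
    s.erase(last + 1);  // drop the tail first so that `first` stays valid
    s.erase(0, first);  // then drop the head
    return s;
}

void demo_trim()
{
    std::string label = "\f  UTF-8 \r\n";
    assert(trim_sketch(label) == "UTF-8");   // return value used directly
    assert(label == "UTF-8");                // the argument itself was modified

    std::string blank = " \t\f";
    assert(trim_sketch(blank).empty());

    assert(is_whitespace('\f'));
    assert(!is_whitespace('\v'));            // vertical tab is not ASCII whitespace
}
```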

‎include/litehtml/os_types.h

4 additions & 1 deletion
@@ -2,12 +2,15 @@
 #define LH_OS_TYPES_H

 #include <string>
+#include <memory>
 #include <cstdint>

 namespace litehtml
 {
     using std::string;
-    typedef std::uintptr_t uint_ptr;
+    using std::shared_ptr;
+    using std::make_shared;
+    using uint_ptr = std::uintptr_t;

 #if defined( WIN32 ) || defined( _WIN32 ) || defined( WINCE )


‎include/litehtml/utf8_strings.h

3 additions & 0 deletions
@@ -6,6 +6,9 @@

 namespace litehtml
 {
+    // converts UTF-32 ch to UTF-8 and appends it to str
+    void append_char(string& str, int ch);
+
     class utf8_to_wchar
     {
         const byte* m_utf8;
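Reviewer note: only the declaration of `append_char` appears in this view. For readers unfamiliar with the conversion the comment describes, here is a minimal sketch of standard UTF-8 encoding of a single code point. `append_char_sketch` is a hypothetical name; the choice to silently ignore values outside 0..0x10FFFF is an assumption of this sketch, and surrogate code points are not treated specially.

```cpp
#include <cassert>
#include <string>

// Encodes the code point ch as UTF-8 and appends the bytes to str.
// Assumption of this sketch: out-of-range values are ignored.
void append_char_sketch(std::string& str, int ch)
{
    const unsigned int c = static_cast<unsigned int>(ch); // negative values become very large

    if (c < 0x80)                       // 1 byte: 0xxxxxxx
    {
        str += static_cast<char>(c);
    }
    else if (c < 0x800)                 // 2 bytes: 110xxxxx 10xxxxxx
    {
        str += static_cast<char>(0xC0 | (c >> 6));
        str += static_cast<char>(0x80 | (c & 0x3F));
    }
    else if (c < 0x10000)               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        str += static_cast<char>(0xE0 | (c >> 12));
        str += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        str += static_cast<char>(0x80 | (c & 0x3F));
    }
    else if (c <= 0x10FFFF)             // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        str += static_cast<char>(0xF0 | (c >> 18));
        str += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        str += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        str += static_cast<char>(0x80 | (c & 0x3F));
    }
}

void demo_append_char()
{
    std::string s;
    append_char_sketch(s, 'A');
    assert(s == "A");

    std::string e;
    append_char_sketch(e, 0xE9);        // U+00E9
    assert(e == "\xC3\xA9");

    std::string euro;
    append_char_sketch(euro, 0x20AC);   // U+20AC
    assert(euro == "\xE2\x82\xAC");

    std::string emoji;
    append_char_sketch(emoji, 0x1F600); // U+1F600
    assert(emoji == "\xF0\x9F\x98\x80");
}
```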

‎litehtml.vcxproj

2 additions & 0 deletions
@@ -191,6 +191,7 @@
     <ClCompile Include="src\el_text.cpp" />
     <ClCompile Include="src\el_title.cpp" />
     <ClCompile Include="src\el_tr.cpp" />
+    <ClCompile Include="src\encodings.cpp" />
     <ClCompile Include="src\flex_item.cpp" />
     <ClCompile Include="src\flex_line.cpp" />
     <ClCompile Include="src\formatting_context.cpp" />
@@ -261,6 +262,7 @@
     <ClInclude Include="include\litehtml\el_text.h" />
     <ClInclude Include="include\litehtml\el_title.h" />
     <ClInclude Include="include\litehtml\el_tr.h" />
+    <ClInclude Include="include\litehtml\encodings.h" />
     <ClInclude Include="include\litehtml\flex_item.h" />
     <ClInclude Include="include\litehtml\flex_line.h" />
     <ClInclude Include="include\litehtml\master_css.h" />

‎litehtml.vcxproj.filters

6 additions & 0 deletions
@@ -218,6 +218,9 @@
     <ClCompile Include="src\flex_line.cpp">
       <Filter>Source Files</Filter>
     </ClCompile>
+    <ClCompile Include="src\encodings.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
   </ItemGroup>
   <ItemGroup>
     <ClInclude Include="include\litehtml\background.h">
@@ -418,5 +421,8 @@
     <ClInclude Include="include\litehtml\flex_line.h">
       <Filter>Header Files</Filter>
     </ClInclude>
+    <ClInclude Include="include\litehtml\encodings.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
   </ItemGroup>
 </Project>
