Posts tagged "character set"

Base64-encoding a UTF-8 string (C++)

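// Convert an ANSI string to UTF-8, then Base64-encode it for use in e-mail
#include <stdio.h>
#include <string.h>
#include <windows.h>
#include <string>
#include "base64.h"
using namespace std;

int main()
{
    string test1 = "輸入模式";
    WCHAR test1_w[255] = {0};
    char test1_u8[255] = {0};
    // mb -> wc: the first call returns the required size in WCHARs (including the terminator)
    int len = MultiByteToWideChar(CP_ACP, 0, test1.c_str(), -1, NULL, 0);
    if (len <= 0 || len > 255) return 1;
    MultiByteToWideChar(CP_ACP, 0, test1.c_str(), -1, test1_w, len);
    // wc -> utf8: the first call returns the required size in bytes (including the terminator);
    // with CP_UTF8 the flags should be 0 and the last two arguments must be NULL
    DWORD dwNum = WideCharToMultiByte(CP_UTF8, 0, test1_w, -1, NULL, 0, NULL, NULL);
    if (dwNum == 0 || dwNum > sizeof(test1_u8)) return 1;
    WideCharToMultiByte(CP_UTF8, 0, test1_w, -1, test1_u8, dwNum, NULL, NULL);
    // encode: pass strlen() rather than strlen()+1 so the terminating '\0' is not encoded
    string test2 = base64_encode((const unsigned char *)test1_u8, strlen(test1_u8));
    printf("%s\n", test2.c_str());
    return 0;
}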

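A string needs to be Base64-encoded from its UTF-8 representation. To explain the code above: test1 is an ANSI string. MultiByteToWideChar converts it to the UTF-16 LE wide string test1_w, WideCharToMultiByte then converts that wide string to the UTF-8 string test1_u8, and base64_encode finally produces the Base64 text. The base64_encode function comes from René Nyffenegger: http://www.adp-gmbh.ch/cpp/common/base64.html

The ANSI encoding of the four traditional Chinese characters "輸入模式" (on a Simplified Chinese system this is GBK), in hexadecimal, is: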

DD94 C8EB C4A3 CABD

Converted to UTF-8, the hexadecimal encoding is:

E8BCB8 E585A5 E6A8A1 E5BC8F
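The UTF-8 bytes above follow directly from the Unicode code points of the four characters (U+8F38, U+5165, U+6A21, U+5F0F). As a rough illustration of what the wide-character to UTF-8 step does, here is a minimal portable sketch. The function name encodeUtf8 is hypothetical and is not part of the Windows API; the real code in this post uses WideCharToMultiByte instead.

```cpp
#include <string>

// Hypothetical helper for illustration only (not a Windows API function):
// encode one Unicode code point (up to U+10FFFF) as UTF-8.
std::string encodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(0x8F38) yields the three bytes E8 BC B8, which matches the first group listed above. All four characters are in the three-byte range, which is why the string takes 12 bytes in UTF-8.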

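In the code above, MultiByteToWideChar and WideCharToMultiByte are each called twice. The first call of each pair passes a NULL output buffer and returns the required buffer size: in WCHARs for MultiByteToWideChar and in bytes for WideCharToMultiByte. len is 5: 4 WCHARs plus one terminating '\0'. dwNum is 13: 12 bytes plus one terminating '\0'.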
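The final step of the example, Base64-encoding the UTF-8 bytes, can be sketched without any external library. The function below is a hypothetical stand-in written for illustration. It is not René Nyffenegger's base64_encode, and it assumes the standard Base64 alphabet with '=' padding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not the base64_encode from base64.h.
std::string base64EncodeSketch(const unsigned char* data, std::size_t len)
{
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((len + 2) / 3) * 4);
    for (std::size_t i = 0; i < len; i += 3) {
        // Pack up to three input bytes into one 24-bit group.
        unsigned long group = static_cast<unsigned long>(data[i]) << 16;
        if (i + 1 < len) group |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (i + 2 < len) group |= data[i + 2];
        // Emit four 6-bit symbols, padding with '=' where input bytes are missing.
        out += kAlphabet[(group >> 18) & 0x3F];
        out += kAlphabet[(group >> 12) & 0x3F];
        out += (i + 1 < len) ? kAlphabet[(group >> 6) & 0x3F] : '=';
        out += (i + 2 < len) ? kAlphabet[group & 0x3F] : '=';
    }
    return out;
}
```

Fed the 12 UTF-8 bytes E8BCB8 E585A5 E6A8A1 E5BC8F, this sketch produces 6Ly45YWl5qih5byP. If the terminating '\0' is included in the input length, the output changes, so whether to count it depends on what the mail receiver expects.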

One possible cause of GetPrivateProfileString returning 0 with last error 2

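wstring strSysDir;
WCHAR wszBuf[MAX_PATH] = {0};
WCHAR wszSysDir[MAX_PATH] = {0};
DWORD dwRet = 0;
DWORD dwErr = 0;
GetSystemDirectory(wszSysDir, MAX_PATH);
// The buffers are WCHAR, so use the wide-character routine instead of the TCHAR macros
wcscat_s(wszSysDir, MAX_PATH, L"\\test.ini");
SetLastError(0);
dwRet = GetPrivateProfileString(L"aaa", L"bbb", NULL, wszBuf, MAX_PATH, wszSysDir);
// Store the error code separately so the return value in dwRet is not overwritten
dwErr = GetLastError();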

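After the code above calls GetPrivateProfileString, GetLastError() returns 2 (ERROR_FILE_NOT_FOUND), even though the file definitely exists. On inspection the file was encoded as UTF-8 without BOM. After converting it to ANSI or to UTF-16 Little Endian, it could be read normally.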
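When checking a file's encoding as described above, the first few bytes are the quickest thing to look at. The following is a minimal sketch that classifies a buffer by its byte order mark. The function name sniffBom and the label strings are hypothetical, and the sketch deliberately ignores UTF-32 and makes no attempt to guess the encoding of files that have no BOM.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: classify the leading bytes of a file by byte order mark.
// UTF-32 BOMs are not handled in this sketch.
std::string sniffBom(const unsigned char* data, std::size_t len)
{
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8 with BOM";
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16 LE";
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16 BE";
    // Without a BOM the bytes could be ANSI or UTF-8; they cannot be told apart here.
    return "no BOM";
}
```

An INI file that starts with '[' and has no BOM falls into the last case, which is exactly the "UTF-8 without BOM" situation described in this post.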

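Later I tested the encodings again and found that GetPrivateProfileString actually supports all of them. Keep this in mind as a lesson if the problem shows up again.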

How to store UTF-8 encoding data to sqlite3 using Visual C++

I've created a SQLite database with the default encoding, UTF-8.

Then I use the following statement to insert data:

strcpy(sql,"insert into blog(title) values('呵呵')");
sqlite3_exec(db,sql,0,0,0);

Then I open the SQLite database with a tool called SQLite Developer. With Data encoding set to UNICODE, the value of the title field shows up as garbage (ºǺ�). When I change Data encoding to ANSI, the title is displayed correctly.

As far as I know, the sqlite3_exec prototype is:

int sqlite3_exec(
  sqlite3*,                                  /* An open database */
  const char *sql,                           /* SQL to be evaluated */
  int (*callback)(void*,int,char**,char**),  /* Callback function */
  void *,                                    /* 1st argument to callback */
  char **errmsg                              /* Error msg written here */
);

I also tried passing a wchar_t string as sql, but that did not work either.

My Visual C++ project already defines UNICODE and _UNICODE, so my question is: how do I store UTF-8 encoded data in sqlite3 using Visual C++?


Update(question solved)

Inspired by msandiford, I used iconv to convert the GBK-encoded string to UTF-8. Thanks so much, msandiford.

char* pOut;
char* pIn;
size_t inLen, outLen = 2000 - 1;  // leave room for the terminating '\0'
strcpy(sql, "insert into blog(title) values('呵呵')");
string strSQL = sql;
char* sql2 = (char*)malloc(2000);
memset(sql2, 0, 2000);
pOut = &sql2[0];
inLen = strlen(strSQL.c_str());
pIn = const_cast<char*>(strSQL.c_str());
// Convert the GBK-encoded statement to UTF-8 before handing it to SQLite
iconv_t g2u8 = iconv_open("UTF-8", "GBK");
iconv(g2u8, (const char**)&pIn, &inLen, &pOut, &outLen);
iconv_close(g2u8);
sqlite3_exec(db, sql2, 0, 0, 0);
free(sql2);

Collecting comments into answer form:

From the question comments, apparently the source files are not encoded in UTF-8. Converting to UTF-8 or using the UTF-8 encoding directly seems to work.

Using UTF-8 encoding directly:

strcpy(sql,"insert into blog (title) values ('\xE5\x91\xB5\xE5\x91\xB5')");
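Since the root problem is that the bytes in the string literal are not UTF-8, a cheap structural check before calling sqlite3_exec can catch this early. The sketch below is an illustration only: looksLikeUtf8 is a hypothetical helper, not part of SQLite, and it only checks lead and continuation byte patterns. It does not reject overlong forms or surrogates, and some GBK byte sequences can pass by coincidence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (lead byte plus continuation bytes).
// Overlong encodings and surrogate code points are not rejected in this sketch.
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;                    // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;     // 110xxxxx
        else if ((c & 0xF0) == 0xE0) extra = 2;     // 1110xxxx
        else if ((c & 0xF8) == 0xF0) extra = 3;     // 11110xxx
        else return false;                          // stray continuation or invalid lead byte
        if (extra > s.size() - i - 1) return false; // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;   // continuation must be 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

The escaped literal '\xE5\x91\xB5\xE5\x91\xB5' above passes this check, while the GBK bytes for the same two characters (BA C7 BA C7) fail it because BA is not a valid UTF-8 lead byte.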


You could avoid having to convert all your source files to UTF-8 by doing something like this:

sprintf(sql, "insert into blog (title) values('%s')", AnsiToUtf8("呵呵"));

Unfortunately the AnsiToUtf8() function is going to be pretty platform specific.


Looking further into this, it appears that Visual Studio saves source files in the default encoding for your Windows locale settings. Based on this, there could potentially be an assortment of encodings if your dev team's computers are set up for different locales.

I think it would be quite difficult, if not impossible, to implement an AnsiToUtf8() function that would cope in all the possible cases, especially given that the locale settings for the computer that the code is developed on may not be the same as the computer that ultimately runs the code.

I think the cleanest way to resolve this would be to use UTF-8 encoding uniformly in source files, assuming you want to use code points in string literals outside the areas where the default encoding and Unicode overlap.

Another way would be to internationalise the code so that the source files did not contain extended characters, and use something like GNU gettext or similar to handle translations.

via http://stackoverflow.com/questions/8753812/how-to-store-utf-8-encoding-data-to-sqlite3-using-visual-c

Garbled text when importing with the mysql source command

Environment: WinXP / MySQL 5.5.8

mysql -uroot -pxxxxxx
use testdb;
source d:/www/testdb.sql

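The import was slow, and every row produced 6 warnings.

All Chinese text came out as question marks (garbled).

I looked at testdb.sql and found that the CREATE TABLE statements did not set a default charset, so I added one to the end of each statement: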

create table `tb1` (
`col1` varchar (45)
)ENGINE=MyISAM DEFAULT CHARSET=utf8;

The import then succeeded, and it was fast.

link and script tags end up inside body, blank space at the top of the page, "锘匡豢" garbage characters: UTF-8 BOM, EF BB BF

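I have recently been building a simple bookkeeping system with PHP and MySQL. Just as it was nearly finished I ran into a problem that took 2 days of digging to solve. Here is a screenshot of the top of the page: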

[Screenshot: top of the page]

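There is a blank area at the top of the page. Some people might suspect that margin, padding, or border in the CSS had not been reset to 0, but that is not the cause: I had already reset all of them to 0. Moreover, inspecting the HTML in Firebug shows that the link and script tags have moved inside body: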

[Screenshot: HTML tree in Firebug]

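There was also an empty line between the body tag and the link tag. I went through my template files carefully and was convinced there was nothing wrong with them. After thinking about it for a long time I ran out of ideas and asked on Stack Overflow, where the experts gather.

One person suggested that the page contained stray text.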

You’ve got some stray text content inside the <head>, before the <body> tag. The
browser sees the text and decides this means you’re starting the main
document body but have forgotten to include the <body> tag.

This is actually valid (if inadvisable) in HTML4: the </head> end-tag
and <body> start-tag are both optional. This is how you can have just
<title>x</title>Hello! as a valid HTML document. But it’s not permissible in XHTML, so if you
validate your document you should get a “character data is not allowed
here” error at the point the stray text occurs.

The browser then parses the rest of the document as body content,
putting the <link> inside the body (which is not valid, but which is
nonetheless commonplace). It ignores the real <body> when that comes along
because it already has a body.

If you can’t see the stray text, perhaps it’s an invisible character
like U+00A0 No-break space or (most likely for Chinese documents) U+3000
Ideographic space, which you may get when you press space in some
input method modes. These characters won’t be visible, but they’re not
‘ignorable whitespace’ like a normal U+0020 Space or newline, so they
trigger ‘text content’ processing and force the <body>.

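In other words, the page contained whitespace that the browser cannot ignore, such as U+00A0 or U+3000. All of my PHP scripts are UTF-8 encoded and the charset of the HTML page is also utf-8. I forced the browser to interpret the page with a different encoding, such as GB2312, and got the result shown in the screenshots: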

[Screenshots: the page rendered as GB2312]

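The blank area turned into garbage characters, and they were all "锘匡豢", which confused me even more. Following bobince's advice on SO, I uploaded the template page to the W3C markup validator and got the following result: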

[Screenshot: W3C markup validator result]

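Roughly, it says that the BOM in UTF-8 is poorly supported by some editors and browsers and may cause problems. I then searched the web for information about the Byte Order Mark: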

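UCS defines a character called "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in real data. The UCS specification recommends sending the "ZERO WIDTH NO-BREAK SPACE" character before the byte stream itself. If the receiver sees FE FF, the stream is big-endian; if it sees FF FE, the stream is little-endian. For this reason "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.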

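UTF-8 does not need a BOM to indicate byte order, but the BOM can be used to mark the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8.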

Windows uses the BOM in exactly this way to mark the encoding of text files.
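Because the UTF-8 BOM is a fixed three-byte prefix, detecting and removing it from a file's contents is straightforward. Here is a minimal sketch of that check. The names hasUtf8Bom and stripUtf8Bom are hypothetical, and reading and writing the actual file is left out.

```cpp
#include <string>

// Hypothetical helper: true if the content starts with the UTF-8 BOM (EF BB BF).
bool hasUtf8Bom(const std::string& content)
{
    return content.size() >= 3 &&
           static_cast<unsigned char>(content[0]) == 0xEF &&
           static_cast<unsigned char>(content[1]) == 0xBB &&
           static_cast<unsigned char>(content[2]) == 0xBF;
}

// Hypothetical helper: return the content without a leading UTF-8 BOM.
// Content that has no BOM is returned unchanged.
std::string stripUtf8Bom(const std::string& content)
{
    return hasUtf8Bom(content) ? content.substr(3) : content;
}
```

Only a BOM at the very start is removed. A file that is included from another file contributes its own leading BOM to the output, so each file has to be cleaned separately.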

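I then opened the code in UltraEdit's hex editing mode: every file started with EF BB BF, so they all carried a BOM. I converted every file to UTF-8 without BOM by hand, and the page finally rendered correctly. The link and script tags went back under head and the blank area at the top disappeared. Oh yeah. That was the answer that took 2 days to find. Afterwards I downloaded the UTF-8 editions of a few well-known PHP applications and found that they all use UTF-8 without BOM.

Looking back at the symptoms, everything now has an explanation. "锘匡豢" appeared several times at the top of the page because index.php, the entry script for the home page, require_once'd several class and library files, all of which were saved as UTF-8 with BOM. PHP does not recognize the BOM, so it sends EF BB BF straight to the output. In a page with charset="utf-8" that shows up as blank space; in a GB2312 page it shows up as rare Chinese characters. If you don't believe it, look up the GB2312 codes of the characters 锘匡豢: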

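UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character (the original design allowed up to 6). Chinese characters take 3 or 4 bytes. EF BB BF happened to occur several times in a row: two consecutive EF BB BF sequences regroup as EFBB BFEF BBBF, and EFBB BFEF BBBF is exactly the GB2312 encoding of "锘匡豢" (not its UTF-8 encoding). This is also why many websites show "锘" followed by a box at the top of the page. Looks like I need to find some time to brush up on character encodings.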
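The regrouping of bytes described above can be reproduced with a few lines of code. The sketch below pairs bytes the way a double-byte decoder such as GB2312/GBK would, under a simplifying assumption: any byte of 0x81 or above is treated as a lead byte and consumes the next byte, without validating the trail byte range. The function name pairAsDoubleByte is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: group a byte stream into the units a double-byte
// decoder would see, returned as uppercase hex strings.
// Simplification: the trail byte is not range-checked.
std::vector<std::string> pairAsDoubleByte(const std::string& bytes)
{
    std::vector<std::string> units;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned int lead = static_cast<unsigned char>(bytes[i]);
        char hex[8];
        if (lead >= 0x81 && i + 1 < bytes.size()) {
            unsigned int trail = static_cast<unsigned char>(bytes[i + 1]);
            std::snprintf(hex, sizeof(hex), "%02X%02X", lead, trail);
            i += 2;
        } else {
            // Single byte: ASCII, or a lead byte with nothing after it.
            std::snprintf(hex, sizeof(hex), "%02X", lead);
            i += 1;
        }
        units.push_back(hex);
    }
    return units;
}
```

Two BOMs in a row (EF BB BF EF BB BF) come out as the three units EFBB, BFEF, BBBF. A single BOM comes out as EFBB plus a dangling BF, which is consistent with the "锘" followed by a box seen on many sites.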