Data Pollution

Search Google for “乱花渐…” and the automatically suggested search hints are all wrong: every one of them reads “乱花渐…”, when the correct text should be “乱花渐…”.

Data pollution: incorrect search hints offered by the search engine

Evidently, a large amount of erroneous user input has been fed into the search engine. When too much polluted data goes into an algorithm, the accuracy of its output naturally drops.

Beyond the pollution caused by users' own input errors, in the AI era generative AI produces vast amounts of automatically generated content, which often flows back onto the internet without being checked. This is a more severe form of data pollution.

This is a real problem, and solving it has value. Data cleansing therefore matters greatly in the AI era, and the jobs, products, and services related to it are correspondingly valuable.
