描述
开 本: 16开纸 张: 胶版纸包 装: 平装-胶订是否套装: 否国际标准书号ISBN: 9787121335310丛书名: 高级大数据人才培养丛书
总序
短短几年间,大数据就以一日千里的发展速度,快速实现了从概念到落地,直接带
动了相关产业井喷式发展。全球多家研究机构统计数据显示,大数据产业将迎来发展黄
金期:IDC 预计,大数据和分析市场将从2016 年的1300 亿美元增长到2020 年的2030
亿美元以上;中国报告大厅发布的大数据行业报告数据也说明,自2017 年起,我国大数据
产业将迎来发展黄金期,未来2~3 年的市场规模增长率将保持在35%左右。
数据采集、数据存储、数据挖掘、数据分析等大数据技术在越来越多的行业中得到
应用,随之而来的就是大数据人才问题的凸显。麦肯锡预测,每年数据科学专业的应届
毕业生将增加7%,然而仅高质量项目对于专业数据科学家的需求每年就会增加12%,完
全供不应求。根据《人民日报》的报道,未来3~5 年,中国需要180 万数据人才,但目
前只有约30 万人,人才缺口达到150 万之多。
以贵州大学为例,其首届大数据专业研究生就业率就达到100%,可以说“一抢而空”。
急切的人才需求直接催热了大数据专业,国家正式设立“数据科学与大数据技术”
本科新专业。目前已经有两批共计35 所大学获批,包括北京大学、中南大学、对外经
济贸易大学、中国人民大学、北京邮电大学、复旦大学等。估计2018 年会有几百所高
校获批。
不过,就目前而言,在大数据人才培养和大数据课程建设方面,大部分高校仍然处
于起步阶段,需要探索的还有很多。首先,大数据是个新生事物,懂大数据的老师少之
又少,院校缺“人”;其次,尚未形成完善的大数据人才培养和课程体系,院校缺“机制”;
再次,大数据实验需要为每位学生提供集群计算机,院校缺“机器”;*后,院校没有海
量数据,开展大数据教学科研工作缺“原材料”。
其实,早在网格计算和云计算兴起时,我国科技工作者就曾遇到过类似的挑战,我
有幸参与了这些问题的解决过程。为了解决网格计算问题,我在清华大学读博期间,于
2001 年创办了中国网格信息中转站网站,每天花几个小时收集和分享有价值的资料给学
术界,此后我也多次筹办和主持全国性的网格计算学术会议,进行信息传递与知识分享。
2002 年,我与其他专家合作的《网格计算》教材也正式面世。
2008 年,当云计算开始萌芽之时,我创办了中国云计算网站(chinacloud.cn)(在各
大搜索引擎“云计算”关键词中排名*),2010 年出版了《云计算(*版)》、2011
年出版了《云计算(第二版)》、2015 年出版了《云计算(第三版)》,每一版都花费了大
量成本制作并免费分享对应的几十个教学PPT。目前,这些PPT 的下载总量达到了几百
万次之多。同时,《云计算》教材也成为国内高校的*教材,在CNKI 公布的高被引图
书名单中,对于2010 年以来出版的所有图书,《云计算(*版)》在自动化和计算机领域
排名全国*。除了资料分享,在2010 年,我也在南京组织了全国高校云计算师资培训
班,培养了国内*批云计算老师,并通过与华为、中兴、360 等知名企业合作,输出云
计算技术,培养云计算研发人才。这些工作获得了大家的认可与好评,此后我接连担任
了工信部云计算研究中心专家、中国云计算专家委员会云存储组组长等职位。
近几年,面对日益突出的大数据发展难题,我也正在尝试使用此前类似的办法去应
对这些挑战。为了解决大数据技术资料缺乏和交流不够通透的问题,我于2013 年创办了
中国大数据网站(thebigdata.cn),投入大量的人力进行日常维护,该网站目前已经在各
大搜索引擎的“大数据”关键词排名中位居*;为了解决大数据师资匮乏的问题,我
面向全国院校陆续举办多期大数据师资培训班。2016 年末至今,在南京多次举办全国高
校/高职/中职大数据免费培训班,基于《大数据》《大数据实验手册》以及云创大数据提
供的大数据实验平台,帮助到场老师们跑通了Hadoop、Spark 等多个大数据实验,使他
们跨过了“从理论到实践,从知道到用过”的门槛。2017 年5 月,还举办了全国千所高
校大数据师资免费讲习班,盛况空前。
其中,为了解决大数据实验难的问题而开发的大数据实验平台,正在为越来越多高
校的教学科研带去方便:2016 年,我带领云创大数据(www.cstor.cn,股票代码:835305)
的科研人员,应用Docker 容器技术,成功开发了BDRack 大数据实验一体机,它打破虚
拟化技术的性能瓶颈,可以为每一位参加实验的人员虚拟出Hadoop 集群、Spark 集群、
Storm 集群等,自带实验所需数据,并准备了详细的实验手册(包含42 个大数据实验)、
PPT 和实验过程视频,可以开展大数据管理、大数据挖掘等各类实验,并可进行精确营
销、信用分析等多种实战演练。目前,大数据实验平台已经在郑州大学、西京学院、郑
州升达经贸管理学院、镇江高等职业技术学校等多所院校成功应用,并广受校方好评。
该平台也以云服务的方式在线提供(大数据实验平台,https://bd.cstor.cn),帮助师生通过
自学,用一个月左右成为大数据动手的高手。
同时,为了解决缺乏权威大数据教材的问题,我所负责的南京大数据研究院,联合
金陵科技学院、河南大学、云创大数据、中国地震局等多家单位,历时两年,编著出版
了适合本科教学的《大数据》《大数据库》《大数据实验手册》等教材。另外,《数据挖掘》
《虚拟化与容器》《大数据可视化》《深度学习》等本科教材也将于近期出版。在大数据教
学中,本科院校的实践教学应更加系统性,偏向新技术的应用,且对工程实践能力要求
更高。而高职、高专院校则更偏向于技术性和技能训练,理论以够用为主,学生将主要
从事数据清洗和运维方面的工作。基于此,我们还联合多家高职院校专家准备了《云计
算基础》《大数据基础》《数据挖掘基础》《R 语言》《数据清洗》《大数据系统运维》《大
数据实践》系列教材,目前也已经陆续进入定稿出版阶段。
此外,我们也将继续在中国大数据(thebigdata.cn)和中国云计算(chinacloud.cn)
等网站免费提供配套PPT 和其他资料。同时, 持续开放大数据实验平台
(https://bd.cstor.cn)、免费的物联网大数据托管平台万物云(wanwuyun.com)和环境大数
据免费分享平台环境云(envicloud.cn),使资源与数据随手可得,让大数据学习变得更加
轻松。
在此,特别感谢我的硕士导师谢希仁教授和博士导师李三立院士。谢希仁教授所著
的《计算机网络》已经更新到第7 版,与时俱进且日臻完美,时时提醒学生要以这样的
标准来写书。李三立院士是留苏博士,为我国计算机事业做出了杰出贡献,曾任国家攀
登计划项目首席科学家。他的严谨治学带出了一大批杰出的学生。
本丛书是集体智慧的结晶,在此谨向付出辛勤劳动的各位作者致敬!书中难免会有
不当之处,请读者不吝赐教。我的邮箱:[email protected],微信公众号:刘鹏看未来
(lpoutlook)。
刘鹏 教授
于南京大数据研究院
前言
21 世纪初,人类迈入大数据时代,各行各业拥抱大数据,希冀借大数据挖掘与分
析来促进产业升级与变革。因此,大数据人才的需求呈现井喷之势。
中国云计算专家咨询委员会秘书长刘鹏教授顺势而为,周密思考,提出高级大数据
人才培养课程体系,并邀请全国上百家高校中从事一线教学科研任务的教师一起,编撰
高级大数据人才培养丛书。本书即该套丛书之一。
本书的定位是大数据挖掘技术与应用。以“让学习变得轻松”为根本出发点,本书
努力回答:数据挖掘是什么?发展如何?经典的数据挖掘算法有哪些?大数据环境下数
据挖掘有哪些新特点和新延展?如何分析实际问题,如何应用?本书编写的指导思想有
三:一是理论与应用相呼应。从数据挖掘算法理论与方法、工具和应用两方面进行阐述,
既注重理论,同时贴近实战,解行结合,希望学习者既能很快将理论应用于实际领域的
数据分析中,同时也具备厚积薄发的能力;二是基础与发展一脉相承。大数据新常态下
经典数据挖掘的基本原理仍然适用,不同之处在于,根据现有分布式、并行环境,对原
有算法进行优化。本书拟循序渐进地介绍经典数据挖掘算法,以及大数据环境下数据挖
掘算法的新特点和新延展,有助于学习者全面掌握数据挖掘理论;三是局部与全局整体
联动。本书属于高级大数据人才培养丛书系列教材,因此,在本书内容组织上,需要考
虑与丛书其他教材的关系,既紧密联系又自成一体,共同组成高级大数据人才培养课程
体系。
基于上述指导思想,本书内容分为四部分:一是概念与基础,见第1 章绪论和第2
章;二是经典的数据挖掘算法,见第3 章分类、第4 章回归、第5 章聚类和第6 章关联
规则;三是大数据挖掘技术,其中,第7 章重点介绍了大数据环境下经典数据挖掘算法
的优化与改进,第8 章介绍了系统的理论与方法,第9 章则对链接分析与网页排序、
互联网信息抽取、日志挖掘与查询分析等技术进行了介绍;四是常用数据挖掘工具(包),
见附录。
本书成稿过程中得到丛书主编刘鹏教授和丛书副主编金陵科技学院张燕副院长的大
力支持,在书稿提纲和内容组织上提出了诸多建设性意见。同时,两轮审稿评审专家对
本书给予了全面指导和帮助,在此一并致谢。
当前,大数据挖掘技术仍处在高速发展的历史阶段,其概念内涵、技术方法、应用
模式还在不断创新演化之中,由于时间和水平所限,本书还存在缺点和不足,欢迎大家
不吝赐教。
1.1 数据挖掘基本概念 ··································································································1
1.1.1 数据挖掘的概念 ··························································································1
1.1.2 大数据环境下的数据挖掘 ···········································································2
1.1.3 数据挖掘的特性 ··························································································3
1.1.4 数据挖掘的过程 ··························································································3
1.2 数据挖掘起源及发展历史 ······················································································4
1.3 数据挖掘常用工具 ··································································································7
1.3.1 商用工具 ······································································································7
1.3.2 开源工具 ······································································································8
1.4 数据挖掘应用场景 ································································································ 10
习题 ································································································································ 12
参考文献 ························································································································ 13
第2 章 数据预处理与相似性 ····························································································· 14
2.1 数据类型 ··············································································································· 14
2.1.1 属性与度量 ································································································ 14
2.1.2 数据集的类型 ···························································································· 15
2.2 数据预处理 ··········································································································· 16
2.2.1 数据清理 ···································································································· 16
2.2.2 数据集成 ···································································································· 18
2.2.3 数据规范化 ································································································ 19
2.2.4 数据约简 ···································································································· 20
2.2.5 数据离散化 ································································································ 22
2.3 数据的相似性 ······································································································· 23
2.3.1 数值属性的相似性度量 ············································································· 23
2.3.2 标称属性的相似性度量 ············································································· 26
2.3.3 组合异种属性的相似性度量 ····································································· 27
2.3.4 文档相似性度量 ························································································ 28
2.3.5 离散序列相似性度量 ················································································· 30
习题 ································································································································ 31
参考文献 ························································································································ 32
第3 章 分类 ························································································································ 33
3.1 分类的基本概念、分类过程及分类器性能的评估 ············································· 33
3.1.1 分类的基本概念 ························································································ 33
3.1.2 分类的过程 ································································································ 33
3.1.3 分类器性能的评估方法 ············································································· 34
3.2 决策树 ···········································································
评论
还没有评论。