"景先生毕设|www.jxszl.com

Prediction of Titanic Rescue Based on Machine Learning Algorithms [word count: 10,339]

Edited 2024-11-03 10:52 by www.jxszl.com 景先生毕设

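Contents
1 Background of the era
2 Technical and theoretical preparation
2.1 Python scientific computing libraries and development environment
2.1.1 numpy scientific computing library
2.1.2 pandas data processing library
2.1.3 matplotlib visualization library
2.1.4 seaborn visualization library
2.1.5 scikit-learn machine learning library
2.1.6 Anaconda, an open-source Python package manager
2.2 The data mining workflow
2.3 Machine learning algorithm theory and model validation methods
2.3.1 Machine learning algorithm theory
2.3.2 Model validation methods
3 Code implementation
3.1 Feature analysis
3.2 Feature engineering and data cleaning
3.3 Modeling with machine learning algorithms and model evaluation
4 Analysis of results
Acknowledgements
References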
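Prediction of Titanic Rescue Based on Machine Learning Algorithms
Abstract
With the arrival of the big data era, the volume of data generated each day has grown at a staggering rate, and conventional data storage and management techniques are no longer adequate, which created an urgent need for new data processing techniques. Data mining emerged against this background. It is now widely applied across many fields, and even a simple dataset can yield considerable value when data mining techniques are applied to it. This project takes a dataset on the rescue of Titanic passengers and uses data mining to analyze which characteristics the surviving passengers shared. During the mining process we analyze features both individually and in combination, and we present not only tabular statistics but also a large number of visualizations, so that the influence of each feature on the outcome is clear at a glance. After analyzing the influence of the features, we carry out feature engineering and data cleaning, namely removing unnecessary features, merging features, and filling in missing values, to prepare a sound dataset for the subsequent modeling and validation. For modeling, we apply several machine learning classification algorithms and tune their parameters to improve the models; to obtain a more reliable and convincing model, we also perform cross-validation. Finally, we give a ranking of feature importance, and the results are satisfactory.
Keywords: big data; data mining; Titanic; machine learning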
PREDICTION OF TITANIC RESCUE BASED ON MACHINE LEARNING ALGORITHM
ABSTRACT
With the advent of the era of big data, the daily growth of data has reached an unprecedented level. Conventional data storage and management technology can no longer keep up, so new data processing technology is urgently needed. Data mining was born against this background and has since been broadly applied in various fields. Even for a simple dataset, it can be used to discover the intrinsic value of the data. In this paper, we choose a dataset on the rescue of Titanic passengers and use data mining to analyze the features of those who survived. The mining process includes both individual and joint analyses of the features, and both tabular statistics and a large number of visualizations, which make the impact of every feature on the result apparent. After analyzing the impact of the features, we apply feature engineering and data cleaning, that is, removing unnecessary features, merging features, and filling in missing values, to prepare a better dataset for the later modeling and verification. For modeling, several machine learning classification algorithms are used, with their parameters tuned to improve the models. Cross-validation is then performed to obtain a more reliable and persuasive model, and finally an importance ranking of the features is given. The outcome is satisfactory.
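The workflow summarized in the abstract (filling missing values, merging and dropping features, fitting several classifiers, cross-validating them, and ranking feature importance) can be sketched roughly as follows. This is a minimal illustration and not the thesis code: the data are randomly generated, the column names are borrowed from the public Kaggle Titanic dataset as an assumption, the `FamilySize` feature and the `engineer` helper are hypothetical, and the two classifiers are chosen only as examples. Parameter tuning is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the passenger table. Column names follow the public
# Kaggle Titanic dataset (an assumption); every value is randomly generated.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.normal(30, 12, n).clip(1, 80),
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Ticket": [f"T{i}" for i in range(n)],
})
# Make survival depend on sex and class so the models have a signal to find.
logit = 1.5 * (df["Sex"] == "female") - 0.6 * (df["Pclass"] - 2)
df["Survived"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Blank out some ages to imitate missing values.
df.loc[rng.choice(n, 40, replace=False), "Age"] = np.nan


def engineer(frame):
    """Hypothetical helper: fill, merge, encode, and drop features."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())    # fill missing values
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1     # merge two features
    out["Sex"] = (out["Sex"] == "female").astype(int)       # encode the category
    return out.drop(columns=["SibSp", "Parch", "Ticket"])   # drop unneeded columns


data = engineer(df)
X = data.drop(columns="Survived")
y = data["Survived"]

# Compare classifiers by their mean 5-fold cross-validated accuracy.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

# Rank features by the importance scores of a forest fitted on all rows.
forest = models["random forest"].fit(X, y)
importance = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(cv_scores)
print(importance)
```

Averaging the scores over five folds gives a steadier estimate than a single train/test split, which is the reason the abstract gives for adding cross-validation. Because the data here are synthetic, the printed scores and ranking say nothing about the real passenger data.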

Original link: http://www.jxszl.com/jsj/xxaq/606983.html