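Udacity Baseball Data Analysis Project

Height and Weight Characteristics of Baseball Players

I obtained a dataset of physical measurements for baseball players born between 1820 and 1995. I am interested in how players' height and weight vary by birthplace, how they change over time, and how they relate to players' lifespans. The analysis below explores these questions.

Questions:

1. How are players' birthplaces distributed geographically?

2. How do players' height and weight change with birth year?

3. How does players' lifespan relate to their height and weight?

Here, players' height and weight are the dependent variables; birth year and birthplace are the independent variables.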

# -*- coding: utf-8 -*-
# Import libraries
# A __future__ import must be the first statement in the cell
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the data

def read_csv(filename):
    # Read a CSV file into a DataFrame
    data = pd.read_csv(filename)
    return data

player_df = read_csv('Master.csv')
#stars_df=read_csv('AllstarFull.csv')

Let's first look at the structure of the loaded data

player_df.head()

   playerID   birthYear  birthMonth  birthDay  birthCountry  birthState  birthCity   deathYear  deathMonth  deathDay  ...
0  aardsda01  1981.0     12.0        27.0      USA           CO          Denver      NaN        NaN         NaN       ...
1  aaronha01  1934.0     2.0         5.0       USA           AL          Mobile      NaN        NaN         NaN       ...
2  aaronto01  1939.0     8.0         5.0       USA           AL          Mobile      1984.0     8.0         16.0      ...
3  aasedo01   1954.0     9.0         8.0       USA           CA          Orange      NaN        NaN         NaN       ...
4  abadan01   1972.0     8.0         25.0      USA           FL          Palm Beach  NaN        NaN         NaN       ...

   ...  nameLast  nameGiven       weight  height  bats  throws  debut      finalGame  retroID   bbrefID
0  ...  Aardsma   David Allan     220.0   75.0    R     R       2004/4/6   2015/8/23  aardd001  aardsda01
1  ...  Aaron     Henry Louis     180.0   72.0    R     R       1954/4/13  1976/10/3  aaroh101  aaronha01
2  ...  Aaron     Tommie Lee      190.0   75.0    R     R       1962/4/10  1971/9/26  aarot101  aaronto01
3  ...  Aase      Donald William  190.0   75.0    R     R       1977/7/26  1990/10/3  aased001  aasedo01
4  ...  Abad      Fausto Andres   184.0   73.0    L     L       2001/9/10  2006/4/13  abada001  abadan01

5 rows × 24 columns

The column headers have the following meanings:

1.playerID A unique code assigned to each player. The playerID links the data in this file with records in the other files.

2.birthYear Year player was born

3.birthMonth Month player was born

4.birthDay Day player was born

5.birthCountry Country where player was born

6.birthState State where player was born

7.birthCity City where player was born

8.deathYear Year player died

9.deathMonth Month player died

10.deathDay Day player died

11.deathCountry Country where player died

12.deathState State where player died

13.deathCity City where player died

14.nameFirst Player's first name

15.nameLast Player's last name

16.nameGiven Player's given name (typically first and middle)

17.weight Player's weight in pounds

18.height Player's height in inches

19.bats Player's batting hand (left, right, or both)

20.throws Player's throwing hand (left or right)

21.debut Date that player made first major league appearance

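The dataset has many fields, but we only need the player ID, birth year, death year, birth country, state, and city, plus weight and height, so we extract just those columns

# .copy() gives an independent frame, so later column assignments do not touch player_df
data1_df=player_df[['playerID','birthYear','deathYear','birthCountry','birthState','birthCity','weight','height']].copy()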

Let's look at the structure of the new data

data1_df.head()

   playerID   birthYear  deathYear  birthCountry  birthState  birthCity   weight  height
0  aardsda01  1981.0     NaN        USA           CO          Denver      220.0   75.0
1  aaronha01  1934.0     NaN        USA           AL          Mobile      180.0   72.0
2  aaronto01  1939.0     1984.0     USA           AL          Mobile      190.0   75.0
3  aasedo01   1954.0     NaN        USA           CA          Orange      190.0   75.0
4  abadan01   1972.0     NaN        USA           FL          Palm Beach  184.0   73.0


Next, let's look at the summary statistics of the data

data1_df.describe()

       birthYear     deathYear    weight        height
count  18703.000000  9336.000000  17975.000000  18041.000000
mean   1930.664118   1963.850364  185.980862    72.255640
std    41.229079     31.506369    21.226988     2.598983
min    1820.000000   1872.000000  65.000000     43.000000
25%    1894.000000   1942.000000  170.000000    71.000000
50%    1936.000000   1966.000000  185.000000    72.000000
75%    1968.000000   1989.000000  200.000000    74.000000
max    1995.000000   2016.000000  320.000000    83.000000

The summary shows that the players' mean height is 72.26 inches, ranging from 43 to 83 inches; weight ranges from 65 to 320 pounds, with a mean of 185.98 pounds

Let's check whether any data is missing

data1_df.info()

RangeIndex: 18846 entries, 0 to 18845

Data columns (total 8 columns):

playerID 18846 non-null object

birthYear 18703 non-null float64

deathYear 9336 non-null float64

birthCountry 18773 non-null object

birthState 18220 non-null object

birthCity 18647 non-null object

weight 17975 non-null float64

height 18041 non-null float64

dtypes: float64(4), object(4)

memory usage: 1.2+ MB

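As shown, the weight, height, birth year, and death year columns are incomplete.

Missing height and weight values will be filled with the preceding value, while rows with a missing birth year need to be dropped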

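# Define the fill helper
def enfull_ave(letter):
    # ffill() returns a new Series, so assign it back to the column
    data1_df[letter] = data1_df[letter].ffill()

# Fill in weight
enfull_ave('weight')
# Fill in height
enfull_ave('height')
# Drop rows with a missing birth year
data1_df = data1_df.dropna(subset=['birthYear'])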
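The fill-and-drop step described above can be tried on a small frame first. The sketch below uses made-up values (the `toy` frame is hypothetical, not taken from Master.csv) and shows two details that are easy to miss: `ffill()` returns a new Series instead of modifying the column in place, and `dropna(how='all')` only removes rows in which every column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01', 'd01'],
    'birthYear': [1900.0, np.nan, 1950.0, 1960.0],
    'weight': [180.0, np.nan, np.nan, 200.0],
})

# ffill() returns a new Series, so the result has to be assigned back.
toy['weight'] = toy['weight'].ffill()

# how='all' drops a row only if every column is missing; playerID is
# always present here, so no row is removed.
unchanged = toy.dropna(how='all')

# subset=['birthYear'] drops exactly the rows whose birth year is missing.
cleaned = toy.dropna(subset=['birthYear'])

print(len(unchanged), len(cleaned))
print(cleaned)
```

Forward filling copies the value from the previous row, so the result depends on row order; whether that is an acceptable estimate for height and weight is a judgment call for the analysis.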

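Now let's analyze how baseball players are distributed by birth country and, within the USA, by state

# Define a few commonly used functions
# Group players by `name` and count the players in each group
def player_count(data, name):
    return data.groupby(name)['playerID'].count()

# Share of all players that falls into each group
def player_count_rate(data, name):
    b = player_count(data, name)
    a = data['playerID'].count()
    return b / a

# Draw a pie chart
def print_pie(group_data, title):
    group_data.plot.pie(title=title, figsize=(12, 12), autopct='%3.1f%%', startangle=90, legend=True)

# Draw a bar chart
def print_bar(data, title):
    bar = data.plot.bar(title=title, width=10)
    # Label each bar with its value as a percentage
    for p in bar.patches:
        bar.annotate('%3.1f%%' % (p.get_height()*100), (p.get_x() * 1.005, p.get_height() * 1.005))

# Plot one column against the index
def print_plot(data, name1, title):
    x = data.index
    y = data[name1]
    plt.figure(figsize=(12, 6))  # create the figure
    plt.plot(x, y, 'ro')  # plot on the current figure (x values, y values, red circle markers)
    plt.xlabel("year")
    plt.ylabel(name1)
    plt.title(title)  # figure title
    plt.savefig("line.jpg")  # save first; saving after show() can write an empty figure
    plt.show()  # display the figure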
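To make the behavior of the share calculation concrete, here is a minimal sketch on a hypothetical four-row frame (`toy` and its values are invented for illustration). It repeats the same expression that `player_count_rate` uses and compares it with `value_counts(normalize=True)`: the first divides by all rows, including those with a missing group key, so its shares can sum to less than 1, while the second divides by the non-missing rows only.

```python
import pandas as pd

# Hypothetical toy data; one player has no recorded birth country.
toy = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3', 'p4'],
    'birthCountry': ['USA', 'USA', 'CAN', None],
})

# Same expression as player_count_rate: group size divided by the total row count.
# groupby() skips the row with a missing key, but the denominator still counts it.
rate = toy.groupby('birthCountry')['playerID'].count() / toy['playerID'].count()

# value_counts(normalize=True) divides by the number of non-missing keys instead.
normalized = toy['birthCountry'].value_counts(normalize=True)

print(rate)
print(normalized)
```

Which denominator is appropriate depends on whether players with an unknown birthplace should count toward the total.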

Next, let's look at the share of baseball players born in each country

player_count_rate(data1_df,'birthCountry').sort_values(ascending=False)

birthCountry

USA 0.875730

D.R. 0.034119

Venezuela 0.018094

P.R. 0.013425

CAN 0.012947

Cuba 0.010506

Mexico 0.006261

Japan 0.003290

Panama 0.002918

Ireland 0.002653

United Kingdom 0.002600

Germany 0.002441

Australia 0.001486

South Korea 0.000902

Colombia 0.000902

Nicaragua 0.000743

Curacao 0.000743

V.I. 0.000637

Netherlands 0.000637

Taiwan 0.000584

Russia 0.000424

France 0.000424

Italy 0.000371

Bahamas 0.000318

Aruba 0.000265

Poland 0.000265

Austria 0.000212

Sweden 0.000212

Spain 0.000212

Czech Republic 0.000212

Jamaica 0.000212

Brazil 0.000159

Norway 0.000159

Saudi Arabia 0.000106

At Sea 0.000053

American Samoa 0.000053

Belgium 0.000053

Belize 0.000053

China 0.000053

Viet Nam 0.000053

Denmark 0.000053

Finland 0.000053

Greece 0.000053

Guam 0.000053

Honduras 0.000053

Indonesia 0.000053

Lithuania 0.000053

Philippines 0.000053

Singapore 0.000053

Slovakia 0.000053

Switzerland 0.000053

Afghanistan 0.000053

Name: playerID, dtype: float64

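As shown, the players come from 52 countries and regions (counting the "At Sea" entry). The vast majority were born in the USA, which accounts for 87.6%; the next largest shares belong to D.R., Venezuela, P.R., CAN, and Cuba, each above 1%. Next, let's look at how US-born players are distributed by state

# Extract the US-born players
data_usa=data1_df[data1_df['birthCountry']=='USA']
# Draw the pie chart
print_pie(player_count_rate(data_usa,'birthState'),'Share of US-born players by birth state')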

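The chart shows that CA is the birthplace of the most players, with a 13% share, followed by PA with 8.5%. The top five states are CA, PA, NY, IL, and OH; more than 44% of US-born players were born in these states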
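A combined share for the leading states can be read off the same kind of Series that feeds the pie chart. The sketch below is an assumption about how such a figure might be computed; the `state_rate` values are hypothetical and are not the real per-state shares.

```python
import pandas as pd

# Hypothetical shares by state; not the real figures from the dataset.
state_rate = pd.Series({'CA': 0.40, 'PA': 0.25, 'NY': 0.20, 'TX': 0.10, 'OH': 0.05})

# Sort from largest to smallest share and keep the top three states.
top3 = state_rate.sort_values(ascending=False).head(3)

# Their combined share of all players.
top3_share = top3.sum()

print(list(top3.index), round(top3_share, 3))
```

With the real data, the same two steps applied to the result of `player_count_rate(data_usa,'birthState')` with `head(5)` would give the combined share of the top five states.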

Now let's look at players' height and weight by birth country

data2=data1_df[['birthCountry','birthState','height','weight']]
# Sort by mean height; numeric_only skips the text column birthState
data3=data2.groupby('birthCountry').mean(numeric_only=True).sort_values(by='height',ascending=False)
print('%d countries are at or above the overall mean height' % (data3['height'][data3['height']>=data1_df['height'].mean()].count()))
data3

26 countries are at or above the overall mean height

birthCountry    height     weight
Indonesia       78.000000  220.000000
Belgium         77.000000  205.000000
Jamaica         75.250000  201.250000
Afghanistan     75.000000  215.000000
Brazil          74.333333  205.000000
Singapore       74.000000  205.000000
Honduras        74.000000  185.000000
Guam            74.000000  210.000000
Australia       73.500000  200.500000
Netherlands     73.454545  183.333333
South Korea     73.411765  198.294118
Curacao         73.357143  207.857143
Spain           73.250000  189.666667
Switzerland     73.000000  170.000000
Lithuania       73.000000  185.000000
Norway          73.000000  180.000000
China           73.000000  165.000000
Philippines     73.000000  188.000000
Aruba           73.000000  200.000000
Panama          72.890909  186.018182
D.R.            72.819596  192.916019
Taiwan          72.727273  194.454545
Sweden          72.666667  185.000000
Nicaragua       72.571429  189.785714
Germany         72.375000  182.871795
USA             72.257213  185.427646
Venezuela       72.225806  197.222874
Japan           72.209677  192.354839
Mexico          72.127119  189.118644
Saudi Arabia    72.000000  200.000000
Greece          72.000000  185.000000
American Samoa  72.000000  210.000000
Bahamas         72.000000  180.833333
Slovakia        72.000000  196.000000
CAN             71.979167  185.212500
P.R.            71.881423  185.818182
France          71.833333  184.666667
Austria         71.750000  190.250000
Cuba            71.682051  185.451282
Colombia        71.647059  199.125000
Poland          71.600000  179.800000
V.I.            71.333333  186.250000
Italy           71.142857  180.428571
Czech Republic  71.000000  184.000000
At Sea          71.000000  170.000000
Viet Nam        71.000000  200.000000
United Kingdom  70.377778  174.500000
Belize          70.000000  180.000000
Russia          69.857143  167.428571
Ireland         69.552632  170.131579
Finland         69.000000  165.000000
Denmark         67.000000  158.000000

As shown, Indonesia has the highest mean height at 78 inches, followed by Belgium at 77 inches; each of these two figures rests on a single player (a share of 0.000053 of the 18,846 rows). No country's mean height is below 67 inches, and 26 countries are at or above the overall mean. Next, let's look at weight
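Because several countries contribute only one or two players, it can help to show the group size next to each mean. The following is a minimal sketch on hypothetical data (the `toy` frame and the threshold of 3 are assumptions for illustration, not part of the original analysis) using `agg` to compute both statistics at once.

```python
import pandas as pd

# Hypothetical toy data; the heights are made up.
toy = pd.DataFrame({
    'birthCountry': ['USA', 'USA', 'USA', 'Indonesia'],
    'height': [70.0, 72.0, 74.0, 78.0],
})

# Mean height and number of players per country, sorted by mean height.
summary = toy.groupby('birthCountry')['height'].agg(['mean', 'count'])
summary = summary.sort_values(by='mean', ascending=False)

# Keep only countries with at least 3 players (an arbitrary, illustrative threshold).
reliable = summary[summary['count'] >= 3]

print(summary)
print(reliable)
```

In this toy example the single-player country tops the ranking, yet it disappears once a minimum group size is required.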

c=data2.groupby('birthCountry').mean(numeric_only=True).sort_values(by='weight',ascending=False)
# Count the countries at or above the overall mean
print('%d countries are at or above the overall mean weight' % (c['weight'][c['weight']>=data1_df['weight'].mean()].count()))
c

27 countries are at or above the overall mean weight

birthCountry    height     weight
Indonesia       78.000000  220.000000
Afghanistan     75.000000  215.000000
American Samoa  72.000000  210.000000
Guam            74.000000  210.000000
Curacao         73.357143  207.857143
Singapore       74.000000  205.000000
Belgium         77.000000  205.000000
Brazil          74.333333  205.000000
Jamaica         75.250000  201.250000
Australia       73.500000  200.500000
Saudi Arabia    72.000000  200.000000
Viet Nam        71.000000  200.000000
Aruba           73.000000  200.000000
Colombia        71.647059  199.125000
South Korea     73.411765  198.294118
Venezuela       72.225806  197.222874
Slovakia        72.000000  196.000000
Taiwan          72.727273  194.454545
D.R.            72.819596  192.916019
Japan           72.209677  192.354839
Austria         71.750000  190.250000
Nicaragua       72.571429  189.785714
Spain           73.250000  189.666667
Mexico          72.127119  189.118644
Philippines     73.000000  188.000000
V.I.            71.333333  186.250000
Panama          72.890909  186.018182
P.R.            71.881423  185.818182
Cuba            71.682051  185.451282
USA             72.257213  185.427646
CAN             71.979167  185.212500
Lithuania       73.000000  185.000000
Greece          72.000000  185.000000
Honduras        74.000000  185.000000
Sweden          72.666667  185.000000
France          71.833333  184.666667
Czech Republic  71.000000  184.000000
Netherlands     73.454545  183.333333
Germany         72.375000  182.871795
Bahamas         72.000000  180.833333
Italy           71.142857  180.428571
Norway          73.000000  180.000000
Belize          70.000000  180.000000
Poland          71.600000  179.800000
United Kingdom  70.377778  174.500000
Ireland         69.552632  170.131579
At Sea          71.000000  170.000000
Switzerland     73.000000  170.000000
Russia          69.857143  167.428571
Finland         69.000000  165.000000
China           73.000000  165.000000
Denmark         67.000000  158.000000

Here we can see that Indonesia again has the highest mean weight at 220 pounds, followed by Afghanistan at 215 pounds; 27 countries are at or above the overall mean

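Next, let's look at how mean height and mean weight change with birth year

# Extract the data; numeric_only skips the text columns
b=data1_df.groupby('birthYear').mean(numeric_only=True)
d=b.dropna()
# Plot mean weight against birth year
print_plot(d,'weight','Mean weight by birth year')
# Plot mean height against birth year
print_plot(d,'height','Mean height by birth year')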

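The plots show that players' mean height and mean weight both rise with birth year. How strong is this relationship? Let's check

# Extract the data
e=pd.DataFrame(d,columns=['birthyear','weight','height'])
e['birthyear']=e.index
# Compute the correlation coefficients
e.corrwith(e['birthyear'])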

birthyear 1.000000

weight 0.929546

height 0.947681

dtype: float64

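As shown, the correlation coefficient between birth year and mean height is 0.948, and between birth year and mean weight it is 0.930. Players' mean height and weight are therefore strongly correlated with birth year. However, without further data, the cause of this trend cannot be determined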
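The coefficients above are Pearson correlations, which is the default method in pandas. As a minimal sketch of what is being computed, the example below compares the pandas result with the textbook formula on a small hypothetical series of yearly means (the `yearly` values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical yearly means; not taken from the dataset.
yearly = pd.DataFrame({
    'birthyear': [1900, 1910, 1920, 1930, 1940],
    'height': [70.0, 70.5, 71.0, 71.2, 72.0],
})

# Pearson correlation as computed by pandas.
r_pandas = yearly['birthyear'].corr(yearly['height'])

# The same quantity from the definition: covariance divided by the
# product of the standard deviations (the 1/n factors cancel).
x = yearly['birthyear'] - yearly['birthyear'].mean()
y = yearly['height'] - yearly['height'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

Note that the analysis correlates per-year means rather than individual players, which smooths out individual variation.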

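Next, let's look at players' lifespan in relation to height and weight

# Keep only deceased players with a known birth year, then extract the columns
data_age=data1_df.dropna(subset=['birthYear','deathYear'])
data_age=data_age[['playerID','birthYear','deathYear','weight','height']]
# Compute each player's lifespan
data_age=pd.DataFrame(data_age,columns=['playerID','birthYear','deathYear','Age','weight','height'])
data_age['Age']=data_age['deathYear']-data_age['birthYear']

Remove any remaining missing values

# Drop rows that still contain missing values
data_age=data_age.dropna()
# Compute the means; numeric_only skips the text column playerID
f=data_age.groupby('Age').mean(numeric_only=True)
f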

Age    birthYear    deathYear    weight      height
20.0   1907.500000  1927.500000  176.500000  70.500000
21.0   1867.000000  1888.000000  181.500000  72.500000
22.0   1925.800000  1947.800000  179.000000  71.400000
23.0   1915.000000  1938.000000  169.600000  72.000000
24.0   1916.200000  1940.200000  177.400000  71.300000
25.0   1898.307692  1923.307692  176.153846  72.461538
26.0   1903.400000  1929.400000  177.533333  71.733333
27.0   1887.769231  1914.769231  172.884615  70.884615
28.0   1894.500000  1922.500000  178.500000  71.500000
29.0   1907.432432  1936.432432  176.297297  71.486486
30.0   1888.709677  1918.709677  172.774194  71.064516
31.0   1881.666667  1912.666667  169.259259  70.777778
32.0   1889.393939  1921.393939  173.333333  70.727273
33.0   1894.258065  1927.258065  167.290323  70.516129
34.0   1898.900000  1932.900000  177.040000  71.820000
35.0   1899.135135  1934.135135  183.405405  71.756757
36.0   1891.051282  1927.051282  176.717949  70.128205
37.0   1886.538462  1923.538462  171.461538  70.333333
38.0   1892.083333  1930.083333  178.250000  71.354167
39.0   1897.589744  1936.589744  179.435897  71.641026
40.0   1892.311111  1932.311111  178.555556  71.133333
41.0   1893.500000  1934.500000  177.704545  70.727273
42.0   1893.225000  1935.225000  179.225000  71.275000
43.0   1891.204082  1934.204082  175.673469  70.816327
44.0   1885.344262  1929.344262  173.016393  70.377049
45.0   1898.121212  1943.121212  178.848485  71.136364
46.0   1893.938776  1939.938776  179.040816  71.061224
47.0   1893.441558  1940.441558  175.012987  70.805195
48.0   1894.000000  1942.000000  174.164557  70.949367
49.0   1894.213115  1943.213115  175.590164  70.868852
...    ...          ...          ...         ...
75.0   1900.285024  1975.285024  174.782609  71.164251
76.0   1897.894977  1973.894977  175.808219  71.118721
77.0   1897.607143  1974.607143  173.991071  71.004464
78.0   1897.606635  1975.606635  176.327014  71.033175
79.0   1898.990991  1977.990991  175.644144  71.157658
80.0   1899.351240  1979.351240  177.000000  71.190083
81.0   1899.879630  1980.879630  176.351852  70.925926
82.0   1900.754464  1982.754464  176.075893  71.281250
83.0   1901.454128  1984.454128  175.665138  71.243119
84.0   1898.257895  1982.257895  175.415789  70.915789
85.0   1900.005263  1985.005263  172.215789  70.968421
86.0   1903.913978  1989.913978  175.811828  71.209677
87.0   1897.798611  1984.798611  175.402778  71.090278
88.0   1904.540541  1992.540541  177.425676  71.533784
89.0   1900.299213  1989.299213  174.866142  71.228346
90.0   1901.486726  1991.486726  173.495575  70.858407
91.0   1899.068182  1990.068182  173.750000  70.681818
92.0   1901.673684  1993.673684  175.831579  71.157895
93.0   1901.513158  1994.513158  173.828947  71.000000
94.0   1898.088889  1992.088889  173.533333  71.311111
95.0   1899.461538  1994.461538  172.576923  70.826923
96.0   1902.222222  1998.222222  176.500000  71.111111
97.0   1893.647059  1990.647059  171.823529  70.352941
98.0   1900.882353  1998.882353  174.705882  70.705882
99.0   1897.222222  1996.222222  163.444444  69.666667
100.0  1899.700000  1999.700000  168.600000  70.100000
101.0  1900.400000  2001.400000  167.000000  70.400000
102.0  1900.000000  2002.000000  165.000000  71.000000
103.0  1911.000000  2014.000000  158.000000  65.000000
107.0  1891.000000  1998.000000  162.000000  69.000000

85 rows × 4 columns

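# Extract the age
age_df=pd.DataFrame(f,columns=['age','weight','height'])
age_df['age']=f.index
# Draw the plots
print_plot(age_df,'weight','weight-age')
print_plot(age_df,'height','height-age')
# Compute the correlation coefficients
age_df.corr()

        age        weight     height
age     1.000000   -0.430298  -0.371683
weight  -0.430298  1.000000   0.724237
height  -0.371683  0.724237   1.000000

As shown, lifespan is weakly and negatively correlated with mean weight (-0.430) and mean height (-0.372), far less strongly than height and weight are correlated with birth year. Even so, this suggests that height and weight may, to some extent, affect players' lifespan, although a correlation alone cannot confirm it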
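The lifespan analysis above correlates per-age means, and such a coefficient can differ from one computed over individual players. The sketch below walks through the same steps on hypothetical data (the `toy` frame is invented; `NaN` in `deathYear` marks a living player) and computes both versions so they can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN in deathYear means the player is still living.
toy = pd.DataFrame({
    'birthYear': [1900.0, 1905.0, 1910.0, 1920.0, 1980.0],
    'deathYear': [1970.0, 1975.0, 1990.0, 2000.0, np.nan],
    'weight': [190.0, 180.0, 175.0, 170.0, 210.0],
})

# Lifespan in whole years; it is NaN for living players.
toy['Age'] = toy['deathYear'] - toy['birthYear']

# Keep only deceased players.
deceased = toy.dropna(subset=['Age'])

# Mean weight for each age at death, as in the analysis above.
by_age = deceased.groupby('Age')['weight'].mean()

# Correlation over the grouped means versus over individual players.
r_grouped = by_age.index.to_series().corr(by_age)
r_players = deceased['Age'].corr(deceased['weight'])

print(r_grouped, r_players)
```

In this toy example the grouped coefficient is more extreme than the per-player one, because averaging within each age removes the spread between individuals. Running the per-player version on the real data would be a useful cross-check.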
