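The historical COVID-19 data I open-sourced earlier is available in the COVID-19 historical data GitHub repository. Many people have asked how I obtained it. I have been busy with my innovation practical training project for the past month and had no time to write this post until now, so here is how I collected the data.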

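I scraped the data with a Python crawler from the 丁香园 (DXY) global COVID-19 real-time epidemic map during the early weeks of my Data Visualization course. Many thanks to DXY!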

Stage 1

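COVID-19 data was published by platforms such as 每日头条, 腾讯网 (Tencent News), 丁香医生 (DXY Doctor), and the National Health Commission website. After weighing data accuracy, availability of historical data, publication time, and completeness, I chose DXY Doctor's real-time epidemic tracker as the data source.
I used a Python crawler to fetch the data from DXY, so the first task is to work out where the page gets its data.
In my view, to scrape a web page you first need to understand the page structure and its data flow.
Viewing the page source (right-click, or F12) reveals the following: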

Each entry here holds the data for one country, for example the United States:

{"id":3746165,"createTime":1591183434000,"modifyTime":1591183434000,"tags":"","countryType":2,"continents":"北美洲","provinceId":"8","provinceName":"美国","provinceShortName":"","cityName":"","currentConfirmedCount":1261772,"confirmedCount":1831821,"confirmedCountRank":1,"suspectedCount":0,"curedCount":463868,"deadCount":106181,"deadCountRank":1,"deadRate":"5.79","deadRateRank":35,"comment":"","sort":0,"operator":"hejiashu","locationId":971002,"countryShortCode":"USA","countryFullName":"United States of America","statisticsData":"https://file1.dxycdn.com/2020/0315/553/3402160512808052518-135.json","incrVo":{"currentConfirmedIncr":90,"confirmedIncr":91,"curedIncr":0,"deadIncr":1},"showRank":true}

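One attribute, statisticsData, points to a JSON file. Opening that link shows:

This file records the historical epidemic data for the United States. Each entry in data contains the fields:

confirmedCount, confirmedIncr, curedCount, curedIncr, currentConfirmedCount, currentConfirmedIncr, dateId, deadCount, deadIncr, suspectedCount, suspectedCountIncr

The data is very complete.
Further down in the page source is the data for China's provinces and cities:

Each entry corresponds to one province, municipality, autonomous region, or special administrative region of China, for example Hong Kong: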

{"provinceName":"香港","provinceShortName":"香港","currentConfirmedCount":51,"confirmedCount":1093,"suspectedCount":63,"curedCount":1038,"deadCount":4,"comment":"疑似1例","locationId":810000,"statisticsData":"https://file1.dxycdn.com/2020/0223/331/3398299755968040033-135.json","cities":[]}

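Here too, statisticsData points to a JSON file. Opening that link shows:

This file records the historical epidemic data for Hong Kong. Each entry in data contains the same fields:

confirmedCount, confirmedIncr, curedCount, curedIncr, currentConfirmedCount, currentConfirmedIncr, dateId, deadCount, deadIncr, suspectedCount, suspectedCountIncr

The data is very complete. Analyzing the page in this way located detailed historical epidemic data for both China and the rest of the world.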
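
Since the later steps rely on every history entry carrying the fields listed above, a quick check can be useful before importing anything. The following is only a minimal sketch: the helper names missing_fields and check_history are hypothetical, and it assumes a history file has the shape {"data": [...]} described above.

```python
# Field names as listed above for each entry in a history file's "data" array.
EXPECTED_FIELDS = {
    "confirmedCount", "confirmedIncr", "curedCount", "curedIncr",
    "currentConfirmedCount", "currentConfirmedIncr", "dateId",
    "deadCount", "deadIncr", "suspectedCount", "suspectedCountIncr",
}


def missing_fields(entry):
    """Return the expected field names that are absent from one history entry."""
    return EXPECTED_FIELDS - set(entry)


def check_history(history):
    """Map the index of each incomplete entry in history["data"] to its missing fields."""
    problems = {}
    for index, entry in enumerate(history.get("data", [])):
        missing = missing_fields(entry)
        if missing:
            problems[index] = missing
    return problems
```

An empty result from check_history means every entry has all eleven fields.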

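Stage 2

The next step is to fetch the data with a Python crawler.
Based on the analysis above, we first need the list of records for every country and region and for every Chinese province.
The page is fetched with Python's requests library: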

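import requests

def getOriHtmlText(url,code='utf-8'):
    """Fetch url and return the page text, or an error message on failure."""
    try:
        # Send a browser-like User-Agent header with the request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        }
        r=requests.get(url,timeout=30,headers=headers)
        r.raise_for_status()
        r.encoding=code
        return r.text
    except requests.RequestException:
        # Network errors, timeouts, and HTTP error statuses all end up here
        return "An error occurred while fetching the original HTML!"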

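The country and region list is assigned to window.getListByCountryTypeService2true in the page. The code parses the page with BeautifulSoup, locates that assignment in the body text with the string find method, and slices out the string that follows it. Because that string is valid JSON, it is parsed and written to a file: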

import json
from bs4 import BeautifulSoup

# url is the address of the DXY epidemic page
html=getOriHtmlText(url)
soup=BeautifulSoup(html,'html.parser')
htmlBodyText=soup.body.text
# Extract the country data: keep the text from the assignment onward,
# then slice out the JSON array that ends just before "}catch"
worldDataText=htmlBodyText[htmlBodyText.find('window.getListByCountryTypeService2true = '):]
worldDataStr = worldDataText[worldDataText.find('[{'):worldDataText.find('}catch')]
worldDataJson=json.loads(worldDataStr)
with open("../data/worldData.json","w") as f:
    json.dump(worldDataJson,f)
    print("Country data file written successfully!")
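
The slicing above depends on the literal '[{' and '}catch' markers. As an alternative sketch, not the code used for the dataset, the same extraction can be done on the raw page source with the standard library's JSON decoder, which stops by itself at the end of the array. The helper name extract_window_json is hypothetical. Searching the raw HTML also avoids relying on whether soup.body.text includes script contents, which I believe differs between BeautifulSoup versions.

```python
import json


def extract_window_json(html, variable_name):
    """Return the JSON array assigned to window.<variable_name> in the page source."""
    marker = "window." + variable_name
    start = html.find(marker)
    if start == -1:
        raise ValueError(marker + " not found in page source")
    # The array literal begins at the first "[" after the variable name.
    array_start = html.find("[", start)
    if array_start == -1:
        raise ValueError("no JSON array found after " + marker)
    # raw_decode parses one JSON value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(html, array_start)
    return value
```

With the page text in html, extract_window_json(html, "getListByCountryTypeService2true") would give the country list and extract_window_json(html, "getAreaStat") the province list.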

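Similarly, the list of China's provinces is assigned to window.getAreaStat in the page and is extracted the same way.
The result looks like this:

The second step is to use these lists to fetch the historical data for each country or region and each Chinese province, that is, the JSON file that statisticsData links to. Extract each link from the list and download it with requests. The code below fetches the country and region data as an example: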

import json
import time

import requests

def deal_worlddatalist():
    """Load the saved country list and print a summary line for each entry."""
    with open("../data/worldData.json",'r') as f:
        worldDataJson=json.load(f)
    for country in worldDataJson:
        print(country['provinceName']+" "+country['countryShortCode']+" "+country['countryFullName']+" "+country['statisticsData'])
    return worldDataJson

def get_the_world_data():
    """Download each country's history JSON into ../data/worldData/."""
    # Load the list that holds each country's statisticsData link
    worldDataJson=deal_worlddatalist()
    # Number of countries whose data could not be fetched
    errorNum=0
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    for country in worldDataJson:
        provinceName=country['provinceName']
        try:
            r = requests.get(country['statisticsData'], timeout=30, headers=headers)
            r.raise_for_status()
            r.encoding = 'utf-8'
            everCountryDataJson = json.loads(r.text)
            toWriteFilePath="../data/worldData/"+provinceName+".json"
            with open(toWriteFilePath,'w') as file:
                json.dump(everCountryDataJson, file)
            print(provinceName + ": data fetched!")
            # Pause between requests
            time.sleep(10)
        except Exception:
            # Count the failure and continue with the next country
            errorNum+=1
            print("Error while fetching data for "+provinceName+"!")
    print("Finished fetching data for all countries!")
    print("Number of errors: "+str(errorNum))
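
The loop above only counts failures. If it matters which countries failed, a small variation can retry each item and return the failures by name. This is a sketch under assumptions: fetch_all is a hypothetical helper, and fetch stands for any callable that downloads one item (for example a wrapper around requests.get) and raises on error.

```python
import time


def fetch_all(items, fetch, retries=2, delay=0.0):
    """Call fetch(item) for each item, retrying on failure.

    Returns (results, failed): results maps each item to what fetch returned,
    and failed lists (item, error text) for items that never succeeded.
    """
    results = {}
    failed = []
    for item in items:
        for attempt in range(retries + 1):
            try:
                results[item] = fetch(item)
                break
            except Exception as exc:
                if attempt == retries:
                    # Out of attempts: remember the item and the last error.
                    failed.append((item, repr(exc)))
                else:
                    time.sleep(delay)
    return results, failed
```

The failed list can then be passed back to fetch_all later instead of rerunning the whole crawl.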

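The result is:

Each JSON file holds the historical epidemic data for one country or region. The Chinese provinces are handled the same way, with the following result:

At this point, the historical data has been saved to individual JSON files.

Stage 3

To make the data easier to query later, the contents of the JSON files are stored in a MySQL database. The tables are designed around the attributes in the JSON files. The world data table:

The table for China's provinces, municipalities, autonomous regions, and special administrative regions:

The next step is to load the JSON data into the corresponding tables using Python's pymysql library. The code below writes the world data as an example: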

import json

import pymysql

# Write each country's JSON data into the database.
# nameMap (a mapping keyed by provinceName) is not defined in this snippet.
def importWorldJsonToDB():
    # Open the database connection (the password is redacted)
    db = pymysql.connect(
        host="127.0.0.1",
        user="root",
        password="*****",
        database="epidemic"
    )
    # Create a cursor object
    cursor=db.cursor()
    # No incremental import: the data volume is small and the earlier steps
    # did not consider increments, so each run truncates the table and re-imports
    deleteSql="truncate countrydata"
    try:
        cursor.execute(deleteSql)
        db.commit()
        print("Country data deleted; re-importing!")
    except Exception:
        print("Error while deleting country data!")
        db.rollback()
    with open("../data/worldData.json",'r') as f:
        worldDataJson=json.load(f)
    # Per-day fields, in the column order used by the INSERT statement
    fields=('confirmedCount','confirmedIncr','curedCount','curedIncr','currentConfirmedCount',
            'currentConfirmedIncr','dateId','deadCount','deadIncr','suspectedCount','suspectedCountIncr')
    # Rows for the batch insert
    insertValue=[]
    # Primary key of the next row
    dataCount=1
    for country in worldDataJson:
        # Get each country's name and open its JSON file
        countryName=country['provinceName']
        countryShortCode=country['countryShortCode']
        continent=country['continents']
        countryFullName=nameMap[country['provinceName']]
        countryJsonPath="../data/worldData/"+countryName+".json"
        with open(countryJsonPath) as f:
            countryJson=json.load(f)
        for dayData in countryJson['data']:
            tupleData=(dataCount,)+tuple(dayData[k] for k in fields)+(countryName,countryShortCode,continent,countryFullName)
            insertValue.append(tupleData)
            dataCount+=1
    insertSql="INSERT INTO countrydata (id,confirmedCount,confirmedIncr,curedCount,curedIncr,currentConfirmedCount,currentConfirmedIncr,dateId,deadCount,deadIncr,suspectedCount,suspectedCountIncr,countryName,countryShortCode,continent,countryFullName) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    # Run the batch insert
    try:
        cursor.executemany(insertSql,insertValue)
        db.commit()
        print("Country data inserted successfully!")
    except Exception:
        print("Failed to insert country data!")
        db.rollback()
    # Close the cursor and the connection
    cursor.close()
    db.close()
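
For readers who want to try the import without a MySQL server, here is a minimal sketch of a table compatible with the INSERT statement above. The column names come from that statement, but the column types are my assumption, and the sketch swaps in sqlite3 from the standard library in place of MySQL (note that sqlite3 uses ? placeholders where pymysql uses %s). The helper name make_demo_db is hypothetical.

```python
import sqlite3

# Column names follow the INSERT statement above; the types are assumed.
CREATE_SQL = """
CREATE TABLE countrydata (
    id INTEGER PRIMARY KEY,
    confirmedCount INTEGER,
    confirmedIncr INTEGER,
    curedCount INTEGER,
    curedIncr INTEGER,
    currentConfirmedCount INTEGER,
    currentConfirmedIncr INTEGER,
    dateId INTEGER,
    deadCount INTEGER,
    deadIncr INTEGER,
    suspectedCount INTEGER,
    suspectedCountIncr INTEGER,
    countryName TEXT,
    countryShortCode TEXT,
    continent TEXT,
    countryFullName TEXT
)
"""


def make_demo_db(rows):
    """Create an in-memory countrydata table and bulk-insert 16-column row tuples."""
    db = sqlite3.connect(":memory:")
    db.execute(CREATE_SQL)
    placeholders = ",".join("?" * 16)
    db.executemany("INSERT INTO countrydata VALUES (" + placeholders + ")", rows)
    db.commit()
    return db
```

Each row passed to make_demo_db has the same shape as the tuples collected in insertValue.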

In addition, to support cumulative figures, I added a world cumulative data table, with the following structure:
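
As a rough illustration of what such a cumulative table might hold, the sketch below sums a few fields over all countries for each date. Which fields the real table stores is an assumption on my part, as is treating dateId as a sortable value shared across countries. The name world_totals is hypothetical.

```python
from collections import defaultdict

# Fields summed across countries; this selection is an assumption.
SUMMED_FIELDS = ("confirmedCount", "curedCount", "currentConfirmedCount", "deadCount")


def world_totals(histories):
    """Sum each field over all countries for every dateId.

    histories maps a country name to its history dict ({"data": [...]}).
    Returns a dict ordered by dateId.
    """
    totals = defaultdict(lambda: dict.fromkeys(SUMMED_FIELDS, 0))
    for history in histories.values():
        for entry in history["data"]:
            day = totals[entry["dateId"]]
            for field in SUMMED_FIELDS:
                day[field] += entry[field]
    return dict(sorted(totals.items()))
```

One caveat: a country with no entry for a given date contributes nothing to that date, so the totals are only as complete as the individual histories.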

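At this point, all the data the system needs has been stored and processed.

Complete code

The complete code is open source; see the Complete Code link.

Finally, a salute to all the heroes fighting the epidemic!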