本文将会详细的写从零开始制作网易云音乐爬虫的过程,可以下载网易云音乐音乐付费歌曲,使用Python3开发。
声明:本文从前端入手,再获得歌曲数据。如果您不想走那么多弯路,请参考这篇文章:https://zhuanlan.zhihu.com/p/30246788和这篇文章:https://www.shangyexinzhi.com/article/details/id-297404
本文采用环境:
- Pycharm + Python 3.7.5
所需模块:
- fake-useragent
- requests
- re
安装方法:
pip install beautifulsoup4 fake-useragent requests
爬虫思路:
graph TD;
初始化程序-->获取链接;
获取链接-->遍历链接;
遍历链接-->获取真实下载链接;
获取真实下载链接-->下载音乐;
看起来并不是非常困难,是吧?(^o^)/~
经过分析,发现网易云歌单中的音乐格式如下:

好。我们新建一个CloudMusicSpider.py,开始码代码:
#!/usr/bin/env python
#-*- coding:utf8 -*-
import requests #导入模块
from fake_useragent import UserAgent #导入模块
def getArtUrl(artlist_url): #获取音乐真实ID的函数
outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #外链地址
HEADER = { #伪造请求头,防止被封
'Accept': '*/*',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'music.163.com',
'Referer': 'https://music.163.com/search/',
'User-Agent': UserAgent().random
}
response = requests.get(url=url.replace('#/', ''), headers=HEADER) #网易云音乐使用异步加载,需要去掉"#/"
soup = BeautifulSoup(response.text, "html.parser") #bs4读取
aList = soup.select("a") #获取全部a标签
Songs = {} #格式为{音乐名: "外链地址"}的字典
for music in aList:
if music.has_attr("href"): #判断是否有href属性
if str(music.attrs["href"]).startswith("/song?id="): #是不是音乐ID
id = str(music.attrs["href"]).replace("/song?id=", "") #获取真实ID
try:
Songs[music.text] = outerUrl.format(id) #放入字典
except:
pass
return Songs #返回这个字典
测试:
print(getUrl('https://music.163.com/#/playlist?id=4874770876'))
结果应该是这样的:
{'Undertale': 'http://music.163.com/song/media/outer/url?id=39227624.mp3', 'sans.': 'http://music.163.com/song/media/outer/url?id=39224615.mp3', 'Tem Shop': 'http://music.163.com/song/media/outer/url?id=39224629.mp3', 'Undertale - Snowdin Town (****** Bootleg)': 'http://music.163.com/song/media/outer/url?id=463498197.mp3', "It's Not Like I Like You!!(Instrumental)": 'http://music.163.com/song/media/outer/url?id=1385676402.mp3', '似顔絵広場 (似顔絵チャンネル)': 'http://music.163.com/song/media/outer/url?id=1378269172.mp3', 'ステージ:コミカル (サンドキャニオン)': 'http://music.163.com/song/media/outer/url?id=28546263.mp3', 'I Love You': 'http://music.163.com/song/media/outer/url?id=31460297.mp3', 'Ice Cream': 'http://music.163.com/song/media/outer/url?id=35407675.mp3', 'Dying': 'http://music.163.com/song/media/outer/url?id=506092100.mp3', '${song.name|mark}-${listArtists(song.artists)}': 'http://music.163.com/song/media/outer/url?id=${song.id}.mp3', '${soil(x.name)}': 'http://music.163.com/song/media/outer/url?id=${x.id}.mp3', '{if x.album}{/if}': 'http://music.163.com/song/media/outer/url?id=${x.id}.mp3', '${x.name}': 'http://music.163.com/song/media/outer/url?id=${x.id}.mp3'}
为啥会有'${x.id}'和'${song.id}'??我也不是很清楚,但是只要把它过滤就行了。
def getArtUrl(artlist_url): #获取音乐真实ID的函数
outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #外链地址
HEADER = { #伪造请求头,防止被封
'Accept': '*/*',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'music.163.com',
'Referer': 'https://music.163.com/search/',
'User-Agent': UserAgent().random
}
response = requests.get(url=url.replace('#/', ''), headers=HEADER) #网易云音乐使用异步iframe加载,需要去掉"#/"
soup = BeautifulSoup(response.text, "html.parser") #bs4读取
aList = soup.select("a") #获取全部a标签
Songs = {} #格式为{音乐名: "外链地址"}的字典
for music in aList:
if music.has_attr("href"): #判断是否有href属性
if str(music.attrs["href"]).startswith("/song?id="): #是不是音乐ID
id = str(music.attrs["href"]).replace("/song?id=", "") #获取真实ID
try:
if not '${x.id}' in id and not '${song.id}' in id: #过滤
Songs[music.text] = outerUrl.format(id) #放入字典
except:
pass
return Songs #返回这个字典
结果:
{'Undertale': 'http://music.163.com/song/media/outer/url?id=39227624.mp3', 'sans.': 'http://music.163.com/song/media/outer/url?id=39224615.mp3', 'Tem Shop': 'http://music.163.com/song/media/outer/url?id=39224629.mp3', 'Undertale - Snowdin Town (****** Bootleg)': 'http://music.163.com/song/media/outer/url?id=463498197.mp3', "It's Not Like I Like You!!(Instrumental)": 'http://music.163.com/song/media/outer/url?id=1385676402.mp3', '似顔絵広場 (似顔絵チャンネル)': 'http://music.163.com/song/media/outer/url?id=1378269172.mp3', 'ステージ:コミカル (サンドキャニオン)': 'http://music.163.com/song/media/outer/url?id=28546263.mp3', 'I Love You': 'http://music.163.com/song/media/outer/url?id=31460297.mp3', 'Ice Cream': 'http://music.163.com/song/media/outer/url?id=35407675.mp3', 'Dying': 'http://music.163.com/song/media/outer/url?id=506092100.mp3'}
这就很好了。不过这种方式总觉得不是很妥。毕竟这个问题我们不知道还会有多少个类似错误。于是我要祭出我的正则表达式大法了!(看来BeautifulSoup也不需要了)
导入模块:
import requests, re #导入模块
from fake_useragent import UserAgent #导入模块
正则表达式:
<a href="/song\?id=(\d+)">(.*?)</a>
\?:问号要转义,不然会被认为是匹配操作符。
(\d+):多个数字形成的子组。
(.*?):匹配任何文字(歌曲名称)。
这样,匹配一个,它就会返回一个格式为(歌曲ID, 歌曲名称)的元组。
于是我们最终的代码如下:
#!/usr/bin/env python
#-*- coding:utf8 -*-
'''
@Author : Ray
@Contact : ray@raycoder.me
@Software: PyCharm
@File : init.py
@Time : 2020/2/23 21:19
'''
import requests, re
from fake_useragent import UserAgent
def getArtUrl(artlist_url): #获取音乐真实ID的函数
outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #外链地址
HEADER = { #伪造请求头,防止被封
'Accept': '*/*',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'music.163.com',
'Referer': 'https://music.163.com/search/',
'User-Agent': UserAgent().random
}
response = requests.get(url=url.replace('#/', ''), headers=HEADER) #网易云音乐使用异步iframe加载,需要去掉"#/"
songs = {} #含有音乐ID及名称的字典
for i in re.findall(r'<a href="/song\?id=(\d+)">(.*?)</a>', response.text):#遍历匹配的对象
songs[i[1]] = outerUrl.format(i[0]) #存入字典
return songs
测试:
print(getUrl('https://music.163.com/#/playlist?id=4874770876'))
返回如下:
{'Prelude': 'http://music.163.com/song/media/outer/url?id=441612739.mp3', 'Marble Soda [Explicit]': 'http://music.163.com/song/media/outer/url?id=400876011.mp3', 'Snowdin Town': 'http://music.163.com/song/media/outer/url?id=39227596.mp3', 'Monkeys Spinning Monkeys': 'http://music.163.com/song/media/outer/url?id=403711640.mp3', 'Overworld Theme (GFM Trap Remix)': 'http://music.163.com/song/media/outer/url?id=467590509.mp3', 'Pizza': 'http://music.163.com/song/media/outer/url?id=1302090422.mp3', 'King Dedede': 'http://music.163.com/song/media/outer/url?id=27518579.mp3', 'Undertale': 'http://music.163.com/song/media/outer/url?id=39227624.mp3', 'sans.': 'http://music.163.com/song/media/outer/url?id=39224615.mp3', 'Tem Shop': 'http://music.163.com/song/media/outer/url?id=39224629.mp3', 'Undertale - Snowdin Town (****** Bootleg)': 'http://music.163.com/song/media/outer/url?id=463498197.mp3', "It's Not Like I Like You!!(Instrumental)": 'http://music.163.com/song/media/outer/url?id=1385676402.mp3', '似顔絵広場 (似顔絵チャンネル)': 'http://music.163.com/song/media/outer/url?id=1378269172.mp3', 'ステージ:コミカル (サンドキャニオン)': 'http://music.163.com/song/media/outer/url?id=28546263.mp3', 'I Love You': 'http://music.163.com/song/media/outer/url?id=31460297.mp3', 'Ice Cream': 'http://music.163.com/song/media/outer/url?id=35407675.mp3', 'Dying': 'http://music.163.com/song/media/outer/url?id=506092100.mp3'}
最好将其做成一个class:
#!/usr/bin/env python
#-*- coding:utf8 -*-
'''
@Author : Ray
@Contact : ray@raycoder.me
@Software: PyCharm
@File : init.py
@Time : 2020/2/23 21:19
'''
import requests, re
from fake_useragent import UserAgent
class CloudMusicSpider():
def getArtUrl(self, artlist_url): #获取音乐真实ID的函数
outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #外链地址
HEADER = { #伪造请求头,防止被封
'Accept': '*/*',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'music.163.com',
'Referer': 'https://music.163.com/',
'User-Agent': UserAgent().random
}
response = requests.get(url=url.replace('#/', ''), headers=HEADER) #网易云音乐使用异步iframe加载,需要去掉"#/"
songs = {}
for i in re.findall(r'<a href="/song\?id=(\d+)">(.*?)</a>', response.text):
songs[i[1]] = outerUrl.format(i[0])
return songs
测试:
from CloudMusicSpider import CloudMusicSpider
spider=CloudMusicSpider()
print(spider.getArtUrl('https://music.163.com/#/playlist?id=4874770876'))
好的,我们这样子把它做成一个模块备用。
下一篇文章我会和大家讲爬虫下载文件的方法。我之后会写一篇有关正则表达式的文章。