Using Python to Automatically Scrape a Web Novel and Generate a TXT File - 番剧百科

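I only looked into this because a reader asked when the proofread TXT edition of 《剑来》 would be updated. I had previously tried using TextForever to batch-convert HTML web novels to TXT in one click, but web novels now embed ads inside the body text, and some even split chapters across pages automatically. Cleaning up the layout afterwards is tedious and slow, at least on 缙哥哥's current laptop, which is fairly sluggish. Since Python is popular these days, it is a good fit for automatically scraping a web novel and generating a TXT file.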

Installing the Python environment

1. Download the Python 3.7.3 source package

wget -c https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz

2. Extract the Python 3.7.3 source package

tar -xzvf Python-3.7.3.tgz

3. Enter the Python 3.7.3 directory

cd Python-3.7.3

4. Configure the build

./configure --with-ssl

"Unlucky" readers may run into the following error:

configure: error: in `/root/Python-3.7.3':
configure: error: no acceptable C compiler found in $PATH

The fix for this error is to install the GCC toolchain:

yum install gcc

Then simply re-run the configure step.

5. Install openssl-devel support

yum install openssl-devel

Compile and install

make && make install

6. At this point you will be "lucky" enough to discover that it has failed yet again:

File "/root/Python-3.7.3/Lib/ctypes/__init__.py", line 7, in <module>
    from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'
make: *** [install] Error 1

The cause is a missing dependency; installing libffi fixes it.

yum install libffi-devel -y

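Once that is fixed, re-run the compile-and-install step (make && make install). If nothing else goes wrong, you will see a Successfully installed message confirming that the installation succeeded.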
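The two build failures above come down to optional modules that are only compiled when their development headers are present. As a rough sanity check after installation, a sketch like the following can be run with the newly built python3. The helper name check_modules is an illustrative assumption, not something from the original steps.

```python
import importlib


def check_modules(names):
    """Try importing each module; return a dict mapping name -> True/False."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError
            results[name] = False
    return results


if __name__ == "__main__":
    # ssl depends on openssl-devel at build time; _ctypes depends on libffi-devel.
    for name, ok in check_modules(["ssl", "_ctypes"]).items():
        print(name, "OK" if ok else "MISSING")
```

The same helper could later be pointed at bs4 and requests to confirm that the pip3 installs worked.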

Running the Python script to scrape the novel

The script relies on third-party Python libraries, so install BeautifulSoup and requests first.

pip3 install beautifulsoup4
pip3 install requests

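Then create a folder anywhere (or just use the root directory) and put the 17549.py script in it.

Run python3 17549.py (if you created a folder, remember to cd into it first) and the scraping begins.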

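Python web-novel scraping script

For convenience, the script has been packaged and uploaded to a network drive, so you can download it and use it directly.

Download the sample Python script for scraping a novel into a TXT file: http://ct.dujin.org/f/5210373-485800393-60f362

The source code is as follows: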

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests
import sys
import time


class downloader(object):

    def __init__(self, url):
        self.target = url  # chapter index page
        self.names = []    # chapter names
        self.urls = []     # chapter links
        self.nums = 0      # number of chapters
        self.title = ""    # novel title

    def get_one_text(self, url_i):
        """Fetch one chapter page and return its text with tags stripped."""
        text = ' '
        url_i = "https://www.nitianxieshen.com" + url_i
        r = requests.get(url=url_i)
        r.encoding = r.apparent_encoding
        html = r.text
        html_bf = BeautifulSoup(html, features='html.parser')
        texts = html_bf.find_all('p')
        # Discard the first and last <p> (presumably not body text on this site)
        texts[0].decompose()
        texts[len(texts) - 1].decompose()
        for t in texts:
            text += str(t)
        # Strip leftover tags and turn paragraph boundaries into newlines
        text = text.replace('<None>', '')
        text = text.replace('</None>', '')
        text = text.replace('</div>', '\n')
        text = text.replace('<br/>', '\n')
        text = text.replace('<p>', '\n')
        text = text.replace('</p>', '\n')
        return text

    def get_name_address_list(self):
        """Read the index page and collect the title, chapter names and links."""
        r = requests.get(self.target)
        r.encoding = r.apparent_encoding
        html = r.text
        div_bf = BeautifulSoup(html, features='html.parser')
        self.title = div_bf.find('h1').text
        div = div_bf.find_all('div', attrs={"id": "play_0"})[0]
        li = div.find_all('li')
        self.nums = len(li)
        for i in range(len(li)):
            self.names.append(li[i].find('a').string)      # .string returns the chapter name
            self.urls.append(li[i].find('a').get('href'))  # get('href') returns the relative link
        print("Total: " + str(self.nums) + " chapters")

    def writer(self, name, path, text):
        """Append one chapter (name, then body) to the target file."""
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.write('\n\n')


if __name__ == "__main__":
    dl = downloader("https://www.nitianxieshen.com/zhuxian/")
    dl.get_name_address_list()

    print('Starting download of "' + dl.title + '":')
    for i in range(dl.nums):
        time.sleep(0.2)
        try:
            dl.writer(dl.names[i], dl.title + '.txt', dl.get_one_text(dl.urls[i]))
        except IndexError as e:
            print(repr(e))
        sys.stdout.write(" Downloaded: %.3f%%" % float((i / dl.nums) * 100) + '\r' + 'Current chapter: ' + str(i))
        sys.stdout.flush()
    print(dl.title + ' download complete')
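One small detail in the progress output: the loop index starts at 0, so the percentage printed by the script never quite reaches 100% on the last chapter. A minimal sketch of a progress helper that takes a one-based count instead is shown here; format_progress is a hypothetical name and is not part of the original script.

```python
import sys


def format_progress(done, total):
    """Build a single-line progress string; `done` is the count of finished chapters."""
    percent = done / total * 100 if total else 0.0
    # The leading '\r' returns the cursor to the line start so each update overwrites the last
    return "\rDownloaded: %.3f%%  chapter %d of %d" % (percent, done, total)


if __name__ == "__main__":
    total = 5  # stand-in for dl.nums
    for i in range(total):
        sys.stdout.write(format_progress(i + 1, total))
        sys.stdout.flush()
    print()  # finish the progress line with a newline
```

In the script, this would replace the sys.stdout.write(...) line inside the download loop, called as format_progress(i + 1, dl.nums).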

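To save a different web novel as a single TXT file, you only need to change the novel's index URL in the second-to-last block (the downloader(...) call under if __name__ == "__main__"). This works as-is for other novels on the same site; for a different site, the root URL hard-coded in get_one_text and the page selectors (such as play_0) would also need adjusting.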
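Because get_one_text prepends a fixed site root to every chapter link, changing only the index URL is not enough when the novel lives on another domain. One possible way around this, sketched here under the assumption that chapter links are ordinary relative or absolute hrefs, is to resolve each link against the index URL with the standard library's urljoin. The helper name chapter_url and the sample paths are illustrative, not taken from the original script or site.

```python
from urllib.parse import urljoin


def chapter_url(index_url, href):
    """Resolve a chapter href (relative or absolute) against the index page URL."""
    return urljoin(index_url, href)


if __name__ == "__main__":
    index = "https://www.nitianxieshen.com/zhuxian/"
    # Hypothetical hrefs, shown only to illustrate the three common cases
    print(chapter_url(index, "/zhuxian/1.html"))   # root-relative link
    print(chapter_url(index, "2.html"))            # link relative to the index page
    print(chapter_url(index, "https://example.com/3.html"))  # already absolute
```

With this in place, get_one_text could call chapter_url(self.target, url_i) instead of concatenating the hard-coded root, so only the URL passed to downloader(...) would need to change.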
