Simulating a cnblogs (博客园) crawler

Posted: 2017-06-07 12:40:09


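            /*
             How it works:
             1. Use a packet-capture tool to find the request URL: http://www.cnblogs.com/liuxiaoji/p/4689119.html
             2. The capture shows that this is a GET request.
             3. Fetch the page with an HTTP request.
             4. The HttpHelper helper class is available for purchase from the author.
             Namespaces used below: System.IO, System.Text, System.Text.RegularExpressions.
            */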
            HttpHelper http = new HttpHelper();
            string htmlText = http.HttpGet("http://www.cnblogs.com/liuxiaoji/p/4689119.html",string.Empty, Encoding.UTF8, false, false, 5000);


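            // Regex that captures the href value of each <link> tag (the stylesheet paths).
            // The href may be wrapped in double quotes, single quotes, or no quotes at all.
            Regex linkCss = new Regex(@"<link\b[^<>]*?\bhref\s*=\s*[""']?\s*(?<url>[^\s""'<>]*)[^<>]*?/?\s*>", RegexOptions.IgnoreCase);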

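            // Rewrite the href inside each matched <link> tag. Editing the match itself, rather
            // than calling htmlText.Replace on the whole page, keeps a repeated href from being
            // prefixed twice and leaves text outside the <link> tags alone.
            htmlText = linkCss.Replace(htmlText, match =>
            {
                Group url = match.Groups["url"];
                var item = url.Value;
                // Skip empty values and hrefs that already point at the site.
                if (item.Length == 0 || item.Contains("http://www.cnblogs.com"))
                {
                    return match.Value;
                }
                // Prefix the site root, adding "/skins" when the path does not contain it.
                var absolute = item.Contains("/skins") ? $"http://www.cnblogs.com{item}" : $"http://www.cnblogs.com/skins{item}";
                int offset = url.Index - match.Index;
                return match.Value.Remove(offset, url.Length).Insert(offset, absolute);
            });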

            // Final result
            var result = htmlText;
            // Save to a file as UTF-8
            using (FileStream fs = new FileStream("E:\\liuxiaoji.html", FileMode.Create))
            {
                var data = Encoding.UTF8.GetBytes(result);
                fs.Write(data, 0, data.Length);
            }
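
Since HttpHelper is a paid class whose source is not shown here, the same steps (GET the page, rewrite the `<link>` hrefs, save the file) can be sketched with the standard `HttpClient` instead. This is only an illustration under stated assumptions: the `PageSaver` class and its method names are hypothetical, the 5-second timeout is a guess at what the `5000` argument above means, and `GetStringAsync` picks the encoding from the response headers rather than forcing UTF-8.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical stand-in for the article's HttpHelper-based snippet; built on HttpClient.
public static class PageSaver
{
    private const string Site = "http://www.cnblogs.com";

    // Same idea as the article's pattern: capture the href value of each <link> tag.
    private static readonly Regex LinkHref = new Regex(
        @"<link\b[^<>]*?\bhref\s*=\s*[""']?\s*(?<url>[^\s""'<>]*)[^<>]*?>",
        RegexOptions.IgnoreCase);

    // Rewrites site-relative hrefs inside <link> tags to absolute URLs,
    // following the article's rule: add "/skins" when the path lacks it.
    public static string RewriteLinkHrefs(string html)
    {
        return LinkHref.Replace(html, match =>
        {
            Group url = match.Groups["url"];
            string href = url.Value;
            // Leave empty values and hrefs that already point at the site untouched.
            if (href.Length == 0 || href.Contains(Site))
            {
                return match.Value;
            }
            string absolute = href.Contains("/skins") ? Site + href : Site + "/skins" + href;
            // Replace only the captured href, at its position inside the matched tag.
            int offset = url.Index - match.Index;
            return match.Value.Remove(offset, url.Length).Insert(offset, absolute);
        });
    }

    // Downloads the page with a GET request, rewrites the links, and saves the result.
    // File.WriteAllText(path, text) writes UTF-8 without a byte-order mark.
    public static async Task SaveAsync(string pageUrl, string outputPath)
    {
        using (var client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) })
        {
            string html = await client.GetStringAsync(pageUrl);
            File.WriteAllText(outputPath, RewriteLinkHrefs(html));
        }
    }
}
```

Keeping `RewriteLinkHrefs` separate from the download makes the rewriting rule easy to check on a literal HTML string without touching the network. An async entry point would call `await PageSaver.SaveAsync("http://www.cnblogs.com/liuxiaoji/p/4689119.html", "E:\\liuxiaoji.html");` to reproduce the article's flow.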


Original article: http://www.cnblogs.com/liuxiaoji/p/6956015.html
