Description

Given a URL startUrl and an interface HtmlParser, implement a multi-threaded web crawler that crawls all links under the same hostname as startUrl.

Return all URLs obtained by your web crawler in any order.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all the URLs from the webpage at a given URL.
  • Do not crawl the same link twice.
  • Explore only the links that are under the same hostname as startUrl.

For a URL such as http://example.org/foo/bar, the hostname is example.org. For simplicity's sake, you may assume all URLs use the HTTP protocol without any port specified. For example, the URLs http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while http://example.org/test and http://example.com/abc are not.
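Since every URL is plain HTTP with no port, the hostname is simply the network-location part of the URL. A minimal sketch using Python's standard urllib.parse (the helper names here are mine, not part of the problem):

from urllib.parse import urlparse

def get_hostname(url: str) -> str:
    # "http://news.yahoo.com/news" -> "news.yahoo.com"
    return urlparse(url).netloc

def same_hostname(a: str, b: str) -> bool:
    return get_hostname(a) == get_hostname(b)

assert same_hostname("http://leetcode.com/problems", "http://leetcode.com/contest")
assert not same_hostname("http://example.org/test", "http://example.com/abc")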

The HtmlParser interface is defined as follows:

interface HtmlParser {
  // Return a list of all URLs from the webpage of the given URL.
  // This is a blocking call: it performs an HTTP request and returns when that request finishes.
  public List<String> getUrls(String url);
}

Note that getUrls(String url) simulates performing an HTTP request. You can treat it as a blocking call that waits for an HTTP request to finish. It is guaranteed that getUrls(String url) will return the URLs within 15 ms. A single-threaded solution will exceed the time limit, so can your multi-threaded web crawler do better?
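To see why threads help: each getUrls call can block for up to 15 ms, so crawling N pages sequentially costs roughly N × 15 ms, while a thread pool overlaps those waits. A minimal sketch of the pattern (fetch_all and its arguments are illustrative stand-ins, not part of the problem's API):

from concurrent.futures import ThreadPoolExecutor

def fetch_all(parser, urls):
    # Issue the blocking getUrls calls concurrently; the wall time for the
    # batch is close to one 15 ms round trip instead of len(urls) of them.
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(parser.getUrls, urls))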

Idea

Below is an example illustrating how the problem works. For custom testing, there are three variables: urls, edges, and startUrl. Note that your code only has access to startUrl; urls and edges are not directly accessible in code.

Input:  
urls = [  
  "http://news.yahoo.com",  
  "http://news.yahoo.com/news",  
  "http://news.yahoo.com/news/topics/",  
  "http://news.google.com",  
  "http://news.yahoo.com/us"  
]  
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]  
startUrl = "http://news.yahoo.com/news/topics/"  
Output: [  
  "http://news.yahoo.com",  
  "http://news.yahoo.com/news",  
  "http://news.yahoo.com/news/topics/",  
  "http://news.yahoo.com/us"  
]
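Here edges[i] = [u, v] indicates a link from urls[u] to urls[v]; index 3 (http://news.google.com) sits on a different hostname, so it never appears in the output. For local testing you can wrap urls and edges in a stand-in parser; this MockHtmlParser is my own helper, not something the judge provides:

class MockHtmlParser:
    # Builds the link graph from the test variables urls and edges.
    def __init__(self, urls, edges):
        self._graph = {url: [] for url in urls}
        for u, v in edges:
            self._graph[urls[u]].append(urls[v])

    def getUrls(self, url):
        return self._graph.get(url, [])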

For multi-threading (or multi-processing) in Python, ThreadPoolExecutor (or ProcessPoolExecutor) is the recommended tool, since it keeps the worker threads (or processes) in a safe state while they execute. This question likely runs in a sandboxed environment on the LeetCode platform, so ThreadPoolExecutor is probably the better choice.

So the Solution class gets two helper methods: get_hostname(), which extracts the hostname from a URL, and visit_url(), which fetches a page and enqueues its same-hostname links that have not been visited yet.

Then a ThreadPoolExecutor submits a visit_url task for every URL currently in the queue, and future.result() is called on each returned future to wait for the corresponding visit_url call to finish.

Finally, the executor is shut down to release its resources, and the set of visited URLs is returned as a list.

Solution

# """  
# This is HtmlParser's API interface.  
# You should not implement it, or speculate about its implementation  
# """  
#class HtmlParser(object):  
#    def getUrls(self, url):  
#        """  
#        :type url: str  
#        :rtype List[str]  
#        """  
  
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import List

class Solution:
    def __init__(self) -> None:
        self._queue = list()   # frontier: URLs discovered but not yet crawled
        self._lock = Lock()    # guards _queue and _visited across worker threads
        self._visited = set()  # everything ever enqueued, so no URL is crawled twice

    def get_hostname(self, url: str) -> str:
        # "http://news.yahoo.com/news" -> "news.yahoo.com". Keep the whole
        # hostname: dropping its first label would wrongly treat
        # news.yahoo.com and sports.yahoo.com as the same host.
        return url.split('/')[2]

    def visit_url(self, url: str) -> None:
        # Blocking HTTP call; it runs on a worker thread, outside the lock.
        next_urls: List[str] = self._parser.getUrls(url)

        with self._lock:
            for next_url in next_urls:
                if next_url not in self._visited and self.current_hostname == self.get_hostname(next_url):
                    self._visited.add(next_url)
                    self._queue.append(next_url)

    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        self._queue.append(startUrl)
        self._visited.add(startUrl)
        self.current_hostname = self.get_hostname(startUrl)
        self._parser = htmlParser

        executor = ThreadPoolExecutor()

        while self._queue:
            # Drain the whole frontier; order within a batch does not matter.
            urls, self._queue = self._queue, []

            # Fetch the batch concurrently, then wait for every task so the
            # newly discovered URLs are all queued before the next round.
            futures = [executor.submit(self.visit_url, url) for url in urls]
            for future in futures:
                future.result()

        executor.shutdown()

        return list(self._visited)
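As a quick local check, the solution can be run against the MockHtmlParser sketched earlier with the example's data (again, a test harness of my own, not part of the judge):

if __name__ == "__main__":
    urls = [
        "http://news.yahoo.com",
        "http://news.yahoo.com/news",
        "http://news.yahoo.com/news/topics/",
        "http://news.google.com",
        "http://news.yahoo.com/us",
    ]
    edges = [[2, 0], [2, 1], [3, 2], [3, 1], [0, 4]]

    parser = MockHtmlParser(urls, edges)
    result = Solution().crawl("http://news.yahoo.com/news/topics/", parser)
    print(sorted(result))  # the four news.yahoo.com URLs, in any order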