{"id":275,"date":"2013-10-18T19:56:53","date_gmt":"2013-10-18T19:56:53","guid":{"rendered":"http:\/\/www.webtipblog.com\/?p=275"},"modified":"2013-12-03T20:16:07","modified_gmt":"2013-12-03T20:16:07","slug":"creating-a-crawler-in-symfony-2-using-the-domcrawler-and-client","status":"publish","type":"post","link":"https:\/\/www.webtipblog.com\/creating-a-crawler-in-symfony-2-using-the-domcrawler-and-client\/","title":{"rendered":"Creating a Crawler in Symfony 2 Using the DomCrawler and Client"},"content":{"rendered":"
A few times I have needed to crawl a Symfony 2 site to index pages and execute code, so I built a crawler console command using the Symfony 2 DomCrawler and Client. This is a fun alternative to using curl, and the Client offers plenty of browser-like features that come in handy, such as a saved history of visited pages and the ability to test forward and back button functionality on your pages. The authentication cookie below can also be used in a curl request to protected pages if desired.<\/p>\n
The DomCrawler class lets you traverse and query the DOM, while the Client class functions like a browser: it makes requests and receives responses, follows links, and submits forms. Symfony documents how this works in the Testing chapter of The Book<\/a>, but I needed something that would work outside of unit and functional tests, in the form of a console command that could be scheduled to run.<\/p>\n <\/p>\n The crawler command takes two required arguments: the starting link to crawl and the username to authenticate with so restricted pages can be crawled. It also takes three options: a limit on the number of pages to crawl, which keeps the command from crawling indefinitely; keywords to look for in a route, where any matching route is indexed only once to prevent infinite crawling of dynamic links; and the name of the security firewall to authenticate with. To start, create the command class and set up the arguments and options.<\/p>\n <\/p>\n The configure and interact methods set up the command to run and take arguments; more information on how that works can be found in the Symfony console documentation<\/a>. The execute method starts by setting some class properties based on user input. 
At this point you should be able to open your terminal and, from your project directory, run the command with php app\/console crawler:crawl<\/em>.<\/p>\n <\/p>\n The next step is to create and boot a kernel in the test environment. Add this method to the SiteCrawlerCommand class.<\/p>\n <\/p>\n Then call _createKernel() by adding the following to the execute() method:<\/p>\n <\/p>\n Next, get the Symfony Client, which is used to make the requests and retrieve page content. The test.client service is only registered when the framework's test option is enabled, which is why the kernel is booted in the test environment.<\/p>\n <\/p>\n<?php\r\nnamespace Acme\\Bundle\\Command;\r\n\r\nuse Symfony\\Bundle\\FrameworkBundle\\Command\\ContainerAwareCommand;\r\nuse Symfony\\Component\\Console\\Input\\InputArgument;\r\nuse Symfony\\Component\\Console\\Input\\InputOption;\r\nuse Symfony\\Component\\Console\\Input\\InputInterface;\r\nuse Symfony\\Component\\Console\\Output\\OutputInterface;\r\n\r\nuse Symfony\\Component\\HttpFoundation\\RedirectResponse;\r\nuse Symfony\\Component\\DomCrawler\\Crawler;\r\nuse Symfony\\Component\\HttpKernel\\Client;\r\nuse Symfony\\Component\\BrowserKit\\Cookie;\r\n\r\nuse Symfony\\Component\\Security\\Core\\Authentication\\Token\\UsernamePasswordToken;\r\n\r\n\/**\r\n * This class crawls the Acme site\r\n *\r\n * @author Joe Sexton <joe@webtipblog.com\r\n *\/\r\nclass SiteCrawlerCommand extends ContainerAwareCommand\r\n{\r\n \/**\r\n * @var OutputInterface\r\n *\/\r\n protected $output;\r\n\r\n \/**\r\n * @var Router\r\n *\/\r\n protected $router;\r\n\r\n \/**\r\n * @var EntityManager\r\n *\/\r\n protected $entityManager;\r\n\r\n \/**\r\n * @var string\r\n *\/\r\n protected $username = null;\r\n\r\n \/**\r\n * @var string\r\n *\/\r\n protected $securityFirewall = null;\r\n\r\n \/**\r\n * @var integer\r\n *\/\r\n protected $searchLimit;\r\n\r\n \/**\r\n * index routes containing these keywords only once\r\n * @var array\r\n *\/\r\n protected $ignoredRouteKeywords;\r\n\r\n \/**\r\n * @var string\r\n *\/\r\n protected $domain = null;\r\n\r\n \/**\r\n * Configure\r\n *\r\n * @author Joe Sexton <joe@webtipblog.com\r\n 
*\/\r\n protected function configure()\r\n {\r\n $this\r\n ->setName( 'crawler:crawl' )\r\n ->setDescription( 'Crawls the Acme website.' )\r\n ->setDefinition(array(\r\n new InputArgument( 'startingLink', InputArgument::REQUIRED, 'Link to start crawling' ),\r\n new InputArgument( 'username', InputArgument::REQUIRED, 'Username' ),\r\n new InputOption( 'limit', null, InputOption::VALUE_REQUIRED, 'Limit the number of links to process, prevents infinite crawling', 20 ),\r\n new InputOption( 'security-firewall', null, InputOption::VALUE_REQUIRED, 'Firewall name', 'default_firewall' ),\r\n new InputOption( 'ignore-duplicate-keyword', null, InputOption::VALUE_IS_ARRAY|InputOption::VALUE_REQUIRED, 'Index routes containing this keyword only one time (prevents infinite crawling of routes containing query parameters)', array() ),\r\n ))\r\n ->setHelp(<<<EOT\r\nThe <info>crawler:crawl<\/info> command crawls the Acme website:\r\n\r\n<info>php app\/console crawler:crawl <startingLink> <username><\/info>\r\nEOT\r\n );\r\n }\r\n\r\n \/**\r\n * Execute\r\n *\r\n * @author Joe Sexton <joe@webtipblog.com\r\n * @param InputInterface $input\r\n * @param OutputInterface $output\r\n * @todo use product sitemap to crawl product pages\r\n *\/\r\n protected function execute( InputInterface $input, OutputInterface $output )\r\n {\r\n \/\/ user input\r\n $startingLink = $input->getArgument( 'startingLink' );\r\n $this->domain = parse_url( $startingLink, PHP_URL_HOST );\r\n $this->username = $input->getArgument( 'username' );\r\n $this->searchLimit = $input->getOption( 'limit' );\r\n $this->securityFirewall = $input->getOption( 'security-firewall' );\r\n $this->ignoredRouteKeywords = $input->getOption( 'ignore-duplicate-keyword' );\r\n $this->output = $output;\r\n $this->router = $this->getContainer()->get( 'router' );\r\n $this->entityManager = $this->getContainer()->get( 'doctrine.orm.entity_manager' );\r\n\r\n \/\/ start\r\n $output->writeln('\r\n<info>A super-duper web crawler written 
by:\r\n\r\n ___ _____ _\r\n |_ | \/ ___| | |\r\n | | ___ ___ \\ `--. _____ _| |_ ___ _ __\r\n | |\/ _ \\ \/ _ \\ `--. \\\/ _ \\ \\\/ \/ __\/ _ \\| |_ \\\r\n\/\\__\/ \/ (_) | __\/ \/\\__\/ \/ __\/> <| || (_) | | | |\r\n\\____\/ \\___\/ \\___| \\____\/ \\___\/_\/\\_\\\\__\\___\/|_| |_|\r\n\r\n<\/info>');\r\n\r\n }\r\n\r\n \/**\r\n * Interact\r\n *\r\n * @author Joe Sexton <joe@webtipblog.com\r\n * @param InputInterface $input\r\n * @param OutputInterface $output\r\n *\/\r\n protected function interact( InputInterface $input, OutputInterface $output )\r\n {\r\n if ( ! $input->getArgument( 'startingLink' ) ) {\r\n $startingLink = $this->getHelper( 'dialog' )->askAndValidate(\r\n $output,\r\n 'Please enter the link to start crawling:',\r\n function( $startingLink ) {\r\n if ( empty( $startingLink ) ) {\r\n throw new \\Exception('starting link can not be empty');\r\n }\r\n\r\n return $startingLink;\r\n }\r\n );\r\n $input->setArgument( 'startingLink', $startingLink );\r\n }\r\n\r\n if ( ! $input->getArgument( 'username' ) ) {\r\n $username = $this->getHelper( 'dialog' )->askAndValidate(\r\n $output,\r\n 'Please choose a username:',\r\n function( $username ) {\r\n if ( empty( $username ) ) {\r\n throw new \\Exception( 'Username can not be empty' );\r\n }\r\n\r\n return $username;\r\n }\r\n );\r\n $input->setArgument( 'username', $username );\r\n }\r\n }\r\n\r\n}<\/pre>\n
\/**\r\n * createKernel\r\n *\r\n * @author Joe Sexton <joe@webtipblog.com\r\n * @return \\AppKernel\r\n *\/\r\nprotected function _createKernel() {\r\n\r\n $rootDir = $this->getContainer()->get( 'kernel' )->getRootDir();\r\n require_once( $rootDir . '\/AppKernel.php' );\r\n $kernel = new \\AppKernel( 'test', true );\r\n $kernel->boot();\r\n\r\n return $kernel;\r\n}<\/pre>\n
$kernel = $this->_createKernel();<\/pre>\n
$client = $kernel->getContainer()->get( 'test.client' );<\/pre>\n