php - Scraping data using simple html dom and simpleXML

March 15, 2015

I'm trying to scrape data from several links that I retrieve from an XML file. However, I keep getting an error that seems to appear only on some of the news items. Below you can see the output I get:

http://www.hltv.org/news/14971-rgn-pro-series-groups-drawnrgn pro series groups drawn
http://www.hltv.org/news/14969-k1ck-reveal-new-teamk1ck reveal new team
http://www.hltv.org/news/14968-world-championships-captains-unveiled
Fatal error: Call to a member function find() on a non-object in /app/scrape.php on line 266

Line 266 is:

$hltv_full_text = $hltv_deep_link->find("//div[@class='rnewscontent']", 0);

Full code of the scrape function:

function scrape_hltv() {
    $hltv = "http://www.hltv.org/news.rss.php";
    $sxml = simplexml_load_file($hltv);
    global $con;

    foreach ($sxml->channel->item as $item) {
        $hltv_title = (string)$item->title;
        $hltv_link = (string)$item->link;
        // RSS uses <pubDate>; SimpleXML element names are case-sensitive.
        $hltv_date = date('Y-m-d H:i:s', strtotime((string)$item->pubDate));
        echo $hltv_link;

        //if (date('Y-m-d', strtotime((string)$item->pubDate)) == date('Y-m-d')) {
            if (strpos($hltv_title, 'video:') === false) {
                $hltv_deep_link = file_get_html($hltv_link);
                // This is line 266 in the original file.
                $hltv_full_text = $hltv_deep_link->find("//div[@class='rnewscontent']", 0);
                echo $hltv_title . '<br><br>';
            }
        //}
    }
}

scrape_hltv();

There are several cases in which file_get_html() returns false instead of a DOM object.

See the source code here: http://sourceforge.net/p/simplehtmldom/code/head/tree/trunk/simple_html_dom.php#l79

if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) {
    return false;
}

For the link

http://www.hltv.org/news/14968-world-championships-captains-unveiled

I think this happens because the content of the page is larger than MAX_FILE_SIZE (600,000 bytes). The page size is around 3 MB.
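One way to check this diagnosis is to apply the same condition to the raw download and see which branch fires. The following is only a minimal sketch: classify_contents() and ASSUMED_MAX_FILE_SIZE are hypothetical names introduced here, and the 600,000 limit is taken from the figure quoted above rather than read from the library.

```php
<?php
// Assumed value of the library's MAX_FILE_SIZE, as quoted in the answer.
const ASSUMED_MAX_FILE_SIZE = 600000;

// Hypothetical helper: classifies raw page contents using the same two
// conditions as the check quoted from simple_html_dom.php.
function classify_contents($contents, int $limit = ASSUMED_MAX_FILE_SIZE): string
{
    if (empty($contents)) {
        return 'empty';      // download failed or returned nothing
    }
    if (strlen($contents) > $limit) {
        return 'too large';  // file_get_html() would return false here
    }
    return 'ok';
}

// Usage (needs network access, so it is left commented out):
// $raw = file_get_contents('http://www.hltv.org/news/14968-world-championships-captains-unveiled');
// echo strlen((string)$raw) . ' bytes: ' . classify_contents($raw) . PHP_EOL;
```

If the result is 'too large', the size limit is the cause; if it is 'empty', the download itself failed and removing the size check will not help.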

If you want to process larger files, you can try this modified version of the function:

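// The library already defines these constants; define them only if missing.
defined('DEFAULT_TARGET_CHARSET') || define('DEFAULT_TARGET_CHARSET', 'UTF-8');
defined('DEFAULT_BR_TEXT') || define('DEFAULT_BR_TEXT', "\r\n");
defined('DEFAULT_SPAN_TEXT') || define('DEFAULT_SPAN_TEXT', " ");

// Same as file_get_html(), but without the MAX_FILE_SIZE check.
// $offset defaults to 0 here (the original used -1) because PHP 7.1+
// treats a negative offset as counting from the end of the stream.
function file_get_html_modified($url, $use_include_path = false, $context = null, $offset = 0, $maxlen = -1, $lowercase = true, $forcetagsclosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $striprn = true, $defaultbrtext = DEFAULT_BR_TEXT, $defaultspantext = DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom(null, $lowercase, $forcetagsclosed, $target_charset, $striprn, $defaultbrtext, $defaultspantext);
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    if (empty($contents)) {
        return false;
    }
    $dom->load($contents, $lowercase, $striprn);
    return $dom;
}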

The ... || strlen($contents) > MAX_FILE_SIZE part of the condition has been removed.
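Even with the size check removed, the loader can still return false (for example when the download is empty), so the call to find() is safer behind a guard. Below is a minimal sketch of such a guard. extract_first_text() is a hypothetical helper, not part of Simple HTML DOM; it takes the loader as a callable so that either file_get_html or file_get_html_modified can be passed in, and it assumes the returned node exposes the library's plaintext property.

```php
<?php
// Hypothetical helper: returns the plain text of the first element matching
// $selector, or null when the page could not be loaded or nothing matched.
function extract_first_text(callable $loader, string $url, string $selector): ?string
{
    $dom = $loader($url);
    if (!is_object($dom)) {
        // The loader returned false: empty download or rejected content.
        return null;
    }

    $node = $dom->find($selector, 0);
    if (!is_object($node)) {
        // No element matched the selector.
        return null;
    }

    return $node->plaintext;
}

// Usage inside the foreach loop (requires simple_html_dom.php to be loaded):
// $text = extract_first_text('file_get_html_modified', $hltv_link, $selector);
// if ($text === null) {
//     continue; // skip this news item instead of crashing the whole run
// }
```

Returning null rather than letting the fatal error happen means one unreadable article no longer stops the remaining items from being scraped.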
