Description: KrawlSite is a web crawler / spider / offline browser / download manager application. It is a KPart component with its own shell, so it can run standalone in its shell or be embedded into KPart-aware applications like Konqueror. To integrate it with Konqueror, open the file associations page in the configuration dialog, select the text/html MIME type, and add KrawlSite_Part to the embedded viewers list. Now when you right-click on a web page in Konqueror, the "Preview in" menu will show KrawlSite; selecting it embeds the component into Konqueror, as in the second screenshot. The first screenshot shows the shell in which the component runs, and the third shows the configuration dialog.
If you like it, please rate it as good.
Feel free to send in your bug reports and comments. I'll look into them when I have some spare time.
Also, I am lousy at creating icons, so if someone out there likes this application (a lot), please make an icon for it. I'll include your name in the credits.
TIP: To use this app to download tutorials, turn offline mode on and start crawling from the first page of the tutorial. If the start page is the table of contents, set the crawl depth to 1; if the start page contains the TOC along with the first chapter, set the crawl depth to 0. If each chapter page only has next and previous links, set the crawl depth to the number of chapters.
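To make the depth rule concrete, here is a minimal, hypothetical Python sketch of a depth-limited crawl (my illustration, not KrawlSite's actual code): depth 0 fetches only the start page, depth 1 also fetches the pages it links to, and so on, which is why a TOC start page needs depth 1 to pull in the chapters.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(start_url, max_depth):
    """Fetch start_url and follow links breadth-first, down to max_depth."""
    pages = {}                       # url -> html; stands in for saving pages to disk
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # skip unresponsive links and keep crawling
        pages[url] = html
        if depth >= max_depth:
            continue                 # depth limit reached: do not follow further links
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages

For example, crawl("http://example.org/tutorial/toc.html", 1) would fetch the TOC plus every chapter it links to (the URL is only an example).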
I'd like to put all this information in the handbook, but due to lack of time I haven't been able to do so. If someone understands the functionality and is willing to write the handbook, please contact me.
If someone builds an RPM for this, please contact me so that I can link to your RPM from this page. Many thanks!
Last changelog:
ver 0.7 Finally!
* Crash free (AFAIK!), especially after KDE 3.4 came around.
* Support for HTML frames.
* Better UI.
patch to v0.6
* Removes a bug that crashes the app.
* Removes a bug in multiple job mode.
ver 0.6 This one took a long time to come out, but it removes almost all of the bugs that caused the app to crash intermittently, apparently without any reason! There is one KNOWN BUG: if icon thumbnail previews are generated in real time as files are created/deleted, the app crashes. This has something to do with the internal implementation of the file browser (a KDE component), so to remove this bug I'll either have to write my own component (a lot of work), or I am doing something wrong with it (I will look into it). Thumbnail previews are disabled by default (but can be enabled from the context menu).
Changes:
* Almost crash proof (see above).
* New file browser, much cleaner to use.
* More work on the leech mode, so it's easier to use as a download manager.
If you use this app with some regularity, I strongly suggest that you upgrade from 0.5.1, not because of any major new features but for a much easier and crash-free experience. Last of all, thanks for bearing with the crashes. I know it must have been exasperating.
ver 0.5.1 * corrected a bug in leech mode
ver 0.5 Some more features:
* Leech mode is finally functional. In leech mode, the app simply parses an HTML file and presents its links and images as checkable items; select the files to download and save them to disk. Handy when you need to download 20-30 links (files) from a list of 50-100, rather than right-clicking and saving a link 30 times (see the sketch after this entry).
* Multiple job support with a drop-target window. Click on the drop-target window and drop URLs on it; you can then give each URL different crawl settings, e.g. crawl the first URL to depth 1 in offline mode, the second URL to depth 2 in simple mode, and so on. By default each URL takes the current main settings.
* Notification window, which notifies when all jobs have completed.
* The user can jump to the next link (in case the current link is unresponsive), jump to the next dropped URL, and pause and restart crawling.
* UI improvements (hopefully!) :-)
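As an aside, here is a rough, hypothetical sketch of what leech mode does conceptually (my illustration, not the app's source): parse a single page and collect its links and images so the user can pick which ones to download.

from html.parser import HTMLParser
from urllib.parse import urljoin

class LeechParser(HTMLParser):
    """Collects the href/src targets of <a> and <img> tags on one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []     # (kind, absolute_url) pairs to offer as checkable items

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.candidates.append(("link", urljoin(self.base_url, attrs["href"])))
        elif tag == "img" and attrs.get("src"):
            self.candidates.append(("image", urljoin(self.base_url, attrs["src"])))

parser = LeechParser("http://example.org/list.html")   # example URL
parser.feed('<a href="file1.zip">one</a> <img src="pic.png">')
print(parser.candidates)   # the UI would present these with checkboxes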
ver 0.4.1 * corrected a bug in downloading external links.
ver 0.4 0.4 is a huge jump from 0.3. Almost everything has been spruced up, and some new features have been added, though leech mode is still unimplemented.
Changes:
* Total rework of offline mode browsing; links are now correctly cross-linked.
* Handles dynamic content correctly.
* Tar file support is fully functional (see the sketch after this entry). It turned out tougher to implement than I initially thought, thanks to the tar protocol; the archive tool in Konqueror is really simplistic and doesn't do the job right. My version does. :-)
* Regular-expression parsing to correctly parse HTML pages; it can chew through almost 12000 links (in one page) in no time. :-)
* A proper file manager with drag support.
* Spruced-up URL list view.
* Quick-set options available on the page.
* UI improvements.
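On the tar support: purely as an illustration (using Python's standard tarfile module rather than the app's own KDE code), a web archive can be thought of as just a gzipped tarball of the downloaded tree.

import tarfile

def archive_site(download_dir, archive_path):
    """Pack everything under download_dir into a gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(download_dir, arcname=".")   # keep paths relative to the site root

# Example (hypothetical paths):
# archive_site("downloads/example.org", "example.org.tar.gz")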
ver 0.3
* Offline browser mode added. Crawl through a site with this setting on, and the app modifies the links in the parsed files to point to local files if they exist on the local disk (a sketch follows after this entry).
* Improved error reporting; errors encountered are reported in a separate window in real time.
* File types can be excluded (don't download these file types) or exclusive (only download these file types, besides text/html).
* UI improvements in the main window and config dialog.
* Web archive support: not working completely; more complicated than I initially thought. Right now it only creates a compressed tarball.
* Leech mode: not implemented yet.
* More code cleanup.
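For the offline mode, here is a minimal sketch of the link-rewriting idea (an assumption about the behaviour, not KrawlSite's source): after a page is saved, rewrite each href so that links whose targets were also downloaded point at the local copy, and leave everything else alone.

import os
import re
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def rewrite_links(html, page_url, url_to_local):
    """url_to_local is a hypothetical index mapping downloaded URLs to local file paths."""
    def replace(match):
        absolute = urljoin(page_url, match.group(1))
        local = url_to_local.get(absolute)
        if local and os.path.exists(local):
            return 'href="%s"' % local       # point the link at the local copy
        return match.group(0)                # leave external or missing links untouched
    return HREF_RE.sub(replace, html)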
ver 0.2
* Major code cleanup.
* Ugly Qt event loop hack replaced with an elegant threaded model.
* Ugly crashes caused by the Qt event loop hack removed.
* Minor UI improvements.
Ratings & Comments
41 Comments
It's a pity this project is somehow forgotten :/
v0.7 RPM for SLED 10: http://donnie.110mb.com/downloads.php?cat_id=2 The GPG key is on the front page of my website.
My webhost domain is temporarily inaccessible because some idiots used it for phishing activity. They provided me with a temporary domain at http://donnie.911mb.com. If you have trouble downloading the RPM, just replace 110mb.com with 911mb.com.
Hi all, what do I do when I want to download e.g. only JPEG images between 100 KB and 500 KB in size? Thanks.
Hi, at first sight KrawlSite was what I was looking for... but when I try to copy a site with e.g. picture index pages, where each page contains a link to every other page, KrawlSite does not recognize that this results in a nearly endless loop... cheers, frantek
You could try leech mode. That would show the links on the page; then select the picture links and choose save. Hope this works for you. I should add a "visited" URL list, though. That should speed things up.
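For what it's worth, a tiny sketch of that "visited" list idea (an illustration of a possible fix, not existing KrawlSite code): normalise each link and remember every URL already queued, so index pages that link to each other cannot cause an endless loop.

from urllib.parse import urljoin, urldefrag

visited = set()

def should_crawl(base_url, link):
    """Resolve the link, strip any #fragment, and allow it through only once."""
    url, _fragment = urldefrag(urljoin(base_url, link))
    if url in visited:
        return False
    visited.add(url)
    return True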
I made a package for VECTOR LINUX SoHo 5.01.
A Slackware 10.2 package with a SlackBuild script is ready to download: http://www.slacky.it/index.php?option=com_remository&Itemid=1&func=fileinfo&filecatid=382&parent=category
krawlsite-0.7-S10K35.i586.rpm at http://home.tiscali.be/raoul.linux/downloadSuSE10.0.htm ENJOY !!!
...really great!!! Well done.
I have always been looking for a website checker for KDE... something that can check copies of a webpage for updates. The only thing I could find was KWebWatch, but that hasn't been updated in ages. Is there any possibility of such a feature being included in KrawlSite... or better yet, a completely new independent program? :)
A content checker could be included. Nice idea.
KrawlSite is available in Debian Sid at http://pacotesdeb.codigolivre.org.br Requirement: KDE 3.4
Due to a very unstable version, I removed the SuSE KrawlSite from my server... Sorry, waiting for a new one. http://home.tiscali.be/raoul.linux/download.htm
This version crashes every time I download anything. Older versions worked fine, although they crashed from time to time. The output (it crashes at the first Xlib error):
flex@gardenia:~> krawlsite
krawlsite: splitter width: 80
krawlsite: KrawlSitePart...checking to see if there's an active thread
krawlsite: KrawlSitePart...mode from part:1
krawlsite: Krawler... m_url:http://www.gulic.org/static/diveintopython-5.4-es/toc/index.html
krawlsite: Krawler... start krawlinghttp://www.gulic.org/static/diveintopython-5.4-es/toc/index.html
krawlsite: Krawler... mode: 1
krawlsite: ERROR: : couldn't create slave : Unable to create io-slave:
klauncher said: Error en loading 'kmailservice %u'.
krawlsite:
krawlsite: ERROR: : couldn't create slave : Unable to create io-slave:
klauncher said: Error loading 'kmailservice %u'.
krawlsite:
krawlsite: ERROR: : couldn't create slave : Unable to create io-slave:
klauncher said: Error loading 'kmailservice %u'.
krawlsite:
krawlsite: ERROR: : couldn't create slave : Unable to create io-slave:
klauncher said: Error loading 'kmailservice %u'.
krawlsite:
Xlib: unexpected async reply (sequence 0x795b9)!
Xlib: sequence lost (0xc547a > 0xbe03e) in reply type 0xdf!
Xlib: sequence lost (0xc967f > 0xbe03e) in reply type 0x9f!
Xlib: sequence lost (0xc547a > 0xbe03e) in reply type 0xdf!
Xlib: sequence lost (0xc967f > 0xbe03e) in reply type 0x9f!
KCrash: Application 'krawlsite' crashing...
(Is this useful, or do you need the full debug output like Amarok needs?)
did you apply the patch?
Hi, whatever URL I put into the URL bar, it starts crawling and immediately stops, saying "Malformed URL". How am I supposed to enter URLs? I tried http://www.unibz.it and www.unibz.it... but nothing. As soon as I get your app working I will use it every day, so thanks in advance! Bye
Hi, http://www.unibz.it/ worked for me; no malformed URL error. Did you try opening it in a browser or pinging the site? Perhaps you haven't applied the patch? Hope this helps, happy crawling :-)
I downloaded the 0.6 RPM and it gives me an undefined link error on startup. I tried compiling from source, and make errors out when it gets to libart_lgpl.la, even when configured with extra libs and dirs, because it looks for it in the wrong lib dir (/usr/lib as opposed to /usr/lib64). My system is SuSE 9.2 x86_64. What do I need to do to get it to compile correctly?
I am not maintaining the RPM. As far as the source is concerned, I have not tried it on an x86_64 arch. Maybe you can Google to see if some other application has had the same error, and what the resolution for it was?
Thanks. I think it's the x86_64 installation that's the problem; I've run into this on a couple of other things and can usually get around it by compiling with --with-extra-libs=xxx, but only if I need just one extra lib. I can't find how to specify multiple extra libs, as are needed here. I just mentioned the RPM to note that I tried that route also.
krawlsite-0.6-s92k33rc.i586.rpm at http://home.tiscali.be/raoul.linux/download.htm
Why is the link broken on the next site? I can't download.... :-(
Sorry, my private server only stays on from 10:00 to 22:00 Belgian hours.
krawlsite-0.5.1-suse92.i586.rpm at http://home.tiscali.be/raoul.linux/download.htm (my contribution)