How do I parse the HTML from a System.Net.HttpWebRequest - w/o using WebBrowser control?

Hi,

Let me first explain what I'm trying to do.

Let's say you have an embedded script on your page that creates dynamic HTML, for example:
<script type="text/javascript" src="http://someplace.com/1234556.js"></script>
Using HttpWebRequest alone only gets back the HTML as it is before the script runs.
In other words, I actually see that script tag in the response stream.

I'd like to see the source HTML after it's rendered, that is, the HTML the script produced.

Are there any code samples out there, or even third-party controls, that can accomplish this?

BTW, this is a Windows application I'm building, so no ASP.NET is involved.
I'm also not going to use the WebBrowser control, because I have to create the web requests
in parallel, rendering multiple sites of mine at the same time. With the WebBrowser control, it must be visible in order to
render. Besides, I'm not making a browser; it's more of a scraper.

And it would take too long to process the sites serially.

Much Thanks,
Denvas
Denvas  Monday, September 28, 2009 12:31 AM
This popular solution is free. However, I kinda doubt it supports JavaScript. There's little reason not to use a WebBrowser; it doesn't have to be visible:

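private void button1_Click(object sender, EventArgs e) {
    // The control is never added to a form, so it stays invisible.
    WebBrowser wb = new WebBrowser();
    wb.DocumentCompleted += delegate(object s, WebBrowserDocumentCompletedEventArgs we) {
        // Runs once the page has loaded; the body now includes script-generated HTML.
        Console.WriteLine(wb.Document.Body.OuterHtml);
        wb.Dispose();
    };
    // Setting Url starts the navigation.
    wb.Url = new Uri("http://google.com");
}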
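
The question asked for several sites to be rendered in parallel. Below is a minimal sketch of one way the hidden WebBrowser idea could be extended to do that, assuming each control gets its own STA thread with its own message loop. The names RenderedHtmlFetcher and Fetch, and the example URLs, are hypothetical and not part of the original reply.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper (assumption, not from the original reply).
static class RenderedHtmlFetcher
{
    // Loads one URL in an invisible WebBrowser on a dedicated STA thread and
    // returns the body HTML as it stands when the document reports completion.
    public static string Fetch(string url)
    {
        string html = null;
        Thread worker = new Thread(() =>
        {
            WebBrowser wb = new WebBrowser();
            wb.ScriptErrorsSuppressed = true; // no script error dialogs
            wb.DocumentCompleted += (s, e) =>
            {
                // DocumentCompleted can also fire for frames; wait for the whole page.
                if (wb.ReadyState != WebBrowserReadyState.Complete) return;
                if (wb.Document != null && wb.Document.Body != null)
                    html = wb.Document.Body.OuterHtml;
                Application.ExitThread(); // ends the message loop started below
            };
            wb.Navigate(url);
            Application.Run(); // the control needs a message loop to raise its events
            wb.Dispose();
        });
        worker.SetApartmentState(ApartmentState.STA); // required by the underlying ActiveX control
        worker.Start();
        worker.Join();
        return html;
    }
}

static class Program
{
    [STAThread]
    static void Main()
    {
        string[] urls = { "http://example.com/", "http://example.org/" }; // placeholder URLs
        string[] results = new string[urls.Length];
        Thread[] callers = new Thread[urls.Length];
        for (int i = 0; i < urls.Length; i++)
        {
            int index = i; // copy the loop variable so each lambda captures its own value
            callers[i] = new Thread(() => results[index] = RenderedHtmlFetcher.Fetch(urls[index]));
            callers[i].Start();
        }
        foreach (Thread t in callers) t.Join();
        for (int i = 0; i < urls.Length; i++)
            Console.WriteLine("{0}: {1} characters", urls[i], results[i] == null ? 0 : results[i].Length);
    }
}
```

This sketch has no timeout, so a page that never finishes loading would block its thread indefinitely. It also only captures the HTML at the moment the document completes, so anything a script adds later (for example from a timer or an asynchronous request) would be missed.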


Hans Passant.
nobugz  Tuesday, September 29, 2009 2:33 AM
Hi,

I don't think this is going to be easy. What you effectively need to do is write your own, or at least host an existing, JScript engine together with an HTML DOM that it can find and work with. You would then retrieve and execute the script embedded in the page and read the resulting HTML back from the DOM. The problem is that the HTML present after the script has run is created and/or modified by the script itself. Depending on the script, there isn't even any guarantee the HTML will look the same on every render.

I don't have any samples, but if you really want to do this I'd start Googling for open source JScript engines or ways of plugging existing JScript engines into .NET programs.
Yort  Monday, September 28, 2009 4:03 AM
Thanks for your help Yort.

Jeez, that does seem complicated. I can't believe some company hasn't already created a component/control to do this. (Hmmm... money idea?).

I'll do some research the next few days on this....

Thanks again,
- Denvas
Denvas  Monday, September 28, 2009 6:14 PM