DPChallenge Forums >> General Discussion >> another programmer guru question - data extraction
07/27/2006 04:45:33 AM · #1
I am looking to build a function into my website that continually updates information received from another website.

For example (although this is not what I want to do): have a page, perhaps in PHP, which continually updates my score from DPChallenge.

Is this possible, or how hard is it to do?
07/27/2006 04:59:23 AM · #2
Depends on how you are "receiving" information from another website. If the website provides RSS feeds, you could get it that way. If you have direct access to the other website's database, you could use php to pull it from that. Unless you have control over the other website(s), it would be fairly difficult. And even if you do, the level of difficulty depends on your skill / experience level.
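If the other site did offer an RSS feed, pulling the number out of it would be far more robust than scraping HTML. A minimal sketch — the feed URL and its contents here are purely hypothetical stand-ins, since neither site is known to offer a feed:

```shell
# On a real run the feed would come from something like:
#   feed="$(wget -qO- 'http://example.com/stats.rss')"   # hypothetical URL
feed='<item><title>Downloads: 42</title></item>'          # stand-in sample
# Pull out the <title> text, then strip the tags.
printf '%s\n' "$feed" | grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g'
# → Downloads: 42
```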
07/27/2006 05:16:27 AM · #3
hmm....

well i have a little 'hash my way through' experience with php

What i want to do is create a page which automatically updates with for example the number of downloads of This Image

all i want is one little number :)

is that so tough??
07/27/2006 05:19:24 AM · #4
or, that is to say,

if i could do it in php or java I could probably figure it out... or if i knew what to search for on google.
07/27/2006 05:20:55 AM · #5
Originally posted by leaf:

or, that is to say,

if i could do it in php or java I could probably figure it out... or if i knew what to search for on google.

Search for "screen scraping" or something like that. That may get you all the stuff on a page and then you have to figure out how to get the "one little number" from that.
07/27/2006 05:57:55 AM · #6
ok thanks
07/27/2006 10:12:01 AM · #7
What platform are you doing this on? I run FreeBSD at home, a flavor of Unix. To do what you want to do, I would do it the quick-and-dirty way:

wget -qO- http://www.dreamstime.com/woodtexture-image13352 | grep 'downloads:' > download_count.txt

(-qO- makes wget write the page to standard output so it can be piped, instead of saving it to a file.) And then in PHP I would echo the contents of download_count.txt into my own page. Wget is a handy little utility, and if you are running Windows there are equivalents out there.
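Roughly, the quick-and-dirty pipeline behaves like this — with a stand-in string in place of the fetched page, since a real run would use the wget command above:

```shell
# Stand-in for: page="$(wget -qO- 'http://www.dreamstime.com/woodtexture-image13352')"
page='downloads: 42'
# Keep only the line with the count; a PHP page can then echo this file's contents.
printf '%s\n' "$page" | grep 'downloads:' > download_count.txt
cat download_count.txt   # → downloads: 42
```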
07/27/2006 11:32:29 AM · #8
Like Ken said it's called Screen Scraping and it can be done with many scripting languages including Perl and PHP.
07/27/2006 12:51:36 PM · #9
Just a warning, screen scraping is unreliable, because generally the slightest change to the interface of the site you're trying to scrape renders your script useless.
07/27/2006 01:00:25 PM · #10
Originally posted by Louis:

Just a warning, screen scraping is unreliable, because generally the slightest change to the interface of the site you're trying to scrape renders your script useless.

True, if done wrong. That's why you do it right the first time. ;)
07/27/2006 01:20:32 PM · #11
Originally posted by _eug:

Originally posted by Louis:

Just a warning, screen scraping is unreliable, because generally the slightest change to the interface of the site you're trying to scrape renders your script useless.

True, if done wrong. That's why you do it right the first time. ;)

I don't think there's a right way, to be perfectly honest... it's way too hacky for such a disciplined guy like myself. ;)
07/27/2006 01:24:54 PM · #12
The only change to the interface that would break what I suggested is removing the 'Downloads: #' line. If they remove that, then they've removed the piece of info he's interested in anyway. If they change it to 'Dwnlds' or something ignorant, it requires a two-second change on his part. What's the issue? This is like arguing that boating is only reliable so long as the ocean continues to be filled with water.
07/27/2006 01:39:42 PM · #13
Originally posted by routerguy666:

What platform are you doing this on? I run FreeBSD at home, a flavor of Unix. To do what you want to do, I would do it the quick-and-dirty way:

wget -qO- http://www.dreamstime.com/woodtexture-image13352 | grep 'downloads:' > download_count.txt

And then in PHP I would echo the contents of download_count.txt into my own page. Wget is a handy little utility, and if you are running Windows there are equivalents out there.


Yeah, wget is an awesome little tool. The lynx web browser can also be used to dump out the contents of a page to standard output. BTW, though it's nitpicky, the script used to retrieve the number of downloads should probably also account for the fact that the site may change capitalization of the downloads line. Using lynx for dumping the page contents, you can use the following command to grab JUST the number of downloads and stuff it into download_count.txt:

lynx -dump http://www.dreamstime.com/woodtexture-image13352 | grep -i 'downloads:' | awk '{print $2}' > download_count.txt
07/27/2006 01:41:34 PM · #14
I knew someone would come by and throw some awk at the thread ;)
07/27/2006 01:55:56 PM · #15
Originally posted by routerguy666:

I knew someone would come by and throw some awk at the thread ;)


Of course! :D Us damn Gentoo geeks are so detail-oriented sometimes... ;)
07/27/2006 01:59:37 PM · #16
Originally posted by routerguy666:

The only change to the interface that would break what I suggested is removing the 'Downloads: #' line. If they remove that, then they've removed the piece of info he's interested in anyway. If they change it to 'Dwnlds' or something ignorant, it requires a two-second change on his part. What's the issue? This is like arguing that boating is only reliable so long as the ocean continues to be filled with water.

Like I said... it's a hack.
07/27/2006 02:32:24 PM · #17
Originally posted by Louis:

Originally posted by routerguy666:

The only change to the interface that would break what I suggested is removing the 'Downloads: #' line. If they remove that, then they've removed the piece of info he's interested in anyway. If they change it to 'Dwnlds' or something ignorant, it requires a two-second change on his part. What's the issue? This is like arguing that boating is only reliable so long as the ocean continues to be filled with water.

Like I said... it's a hack.


It's certainly not the cleanest way to do it, but he may not have any other choice. It's not like many sites provide RSS feeds for that type of information, or access to an API. Sometimes one has to make do with cheap hacks. Hey, it works for Microsoft! ;)
07/27/2006 02:37:48 PM · #18
Originally posted by cutlassdude70:

Originally posted by Louis:

Originally posted by routerguy666:

The only change to the interface that would break what I suggested is removing the 'Downloads: #' line. If they remove that, then they've removed the piece of info he's interested in anyway. If they change it to 'Dwnlds' or something ignorant, it requires a two-second change on his part. What's the issue? This is like arguing that boating is only reliable so long as the ocean continues to be filled with water.

Like I said... it's a hack.


It's certainly not the cleanest way to do it, but he may not have any other choice. It's not like many sites provide RSS feeds for that type of information, or access to an API. Sometimes one has to make do with cheap hacks. Hey, it works for Microsoft! ;)

Heh... :) I was then going to suggest that the admins (Langdon?) expose the scores, stats, and everything else via SOAP and let us tinker, but that may be a tad too much work. ;)
07/27/2006 02:45:46 PM · #19
Among the other hacks in my bag are graphing my challenge score via MRTG. Ahh, geek life.
07/27/2006 05:12:15 PM · #20
Originally posted by Louis:

Originally posted by cutlassdude70:

Originally posted by Louis:

Originally posted by routerguy666:

The only change to the interface that would break what I suggested is removing the 'Downloads: #' line. If they remove that, then they've removed the piece of info he's interested in anyway. If they change it to 'Dwnlds' or something ignorant, it requires a two-second change on his part. What's the issue? This is like arguing that boating is only reliable so long as the ocean continues to be filled with water.

Like I said... it's a hack.


It's certainly not the cleanest way to do it, but he may not have any other choice. It's not like many sites provide RSS feeds for that type of information, or access to an API. Sometimes one has to make do with cheap hacks. Hey, it works for Microsoft! ;)

Heh... :) I was then going to suggest that the admins (Langdon?) expose the scores, stats, and everything else via SOAP and let us tinker, but that may be a tad too much work. ;)


Damn that would be cool though! Can you imagine how much easier that would make Southern Gentleman's life with the WPL stuff?!
07/27/2006 05:16:31 PM · #21
Originally posted by routerguy666:

Among the other hacks in my bag are graphing my challenge score via MRTG. Ahh, geek life.


Oooo, good idea! Now if they ever got around to implementing Louis' SOAP idea, one could have quite a bit of fun with rrdtool without even breaking a sweat! Life as a nerd is sweet... :D
07/27/2006 05:28:01 PM · #22
Originally posted by cutlassdude70:

Originally posted by Louis:

Originally posted by cutlassdude70:

Originally posted by Louis:

Originally posted by routerguy666:

The only change to the interface that would break what I suggested is removing the 'Downloads: #' line. If they remove that, then they've removed the piece of info he's interested in anyway. If they change it to 'Dwnlds' or something ignorant, it requires a two-second change on his part. What's the issue? This is like arguing that boating is only reliable so long as the ocean continues to be filled with water.

Like I said... it's a hack.


It's certainly not the cleanest way to do it, but he may not have any other choice. It's not like many sites provide RSS feeds for that type of information, or access to an API. Sometimes one has to make do with cheap hacks. Hey, it works for Microsoft! ;)

Heh... :) I was then going to suggest that the admins (Langdon?) expose the scores, stats, and everything else via SOAP and let us tinker, but that may be a tad too much work. ;)


Damn that would be cool though! Can you imagine how much easier that would make Southern Gentleman's life with the WPL stuff?!


Yeah, I thought he had some sort of automatic thing going... but perhaps he is doing it all manually.
07/27/2006 05:32:11 PM · #23
Originally posted by cutlassdude70:

Oooo, good idea! Now if they ever got around to implementing Louis' SOAP idea, one could have quite a bit of fun with rrdtool without even breaking a sweat! Life as a nerd is sweet... :D


I heard a rumor that Langdon doesn't use SOAP.
07/27/2006 07:08:33 PM · #24
Originally posted by Art Roflmao:

I heard a rumor that Langdon doesn't use SOAP.


ROFLMAO! :D
07/27/2006 07:42:39 PM · #25
FWIW, screen-scraping is generally frowned upon, and considered by many to be stealing bandwidth, especially if done on every page hit.

The more "ethical" way to do it is to set up a cron job to update your page with the "scrapings" from the other site once an hour or so.
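As a sketch, the crontab entry (edited with `crontab -e`) might look like this — the script name and paths are assumptions, standing in for the wget/grep pipeline from earlier in the thread:

```shell
# Hypothetical crontab entry: refresh the scraped count once an hour
# (at minute 0) instead of hitting the remote site on every page view.
0 * * * * /home/leaf/bin/scrape_count.sh > /home/leaf/public_html/download_count.txt
```

The PHP page then just echoes the cached file, so visitors never trigger a fetch of the remote site themselves.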
DPChallenge, and website content and design, Copyright © 2001-2025 Challenging Technologies, LLC.
All digital photo copyrights belong to the photographers and may not be used without permission.