Scraping data from web pages

I'm trying to retrieve some transaction data from various web pages where there is no means of simply downloading the data I require. I have saved around 20 web pages, but when I try loading them in Chrome they just show up as text. I don't see anything like an index.html which would provide a way of holding all the 33 files together. The file types are .css and .js.download, and I can't figure out where the data I'm interested in is located.

How would I display a .js file? Presumably the .download suffix was added by Chrome when I saved the web page.
 
Chrome doesn't add anything to any file you download.

Chances are, the data is retrieved dynamically from the server and then displayed. So it's fetched on the fly and is not present in the original index file (and one such file must exist).
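If that's the case, your browser's developer tools (the Network tab in Chrome) will show you the request the page makes for the data, and you can often just replay it outside the browser. A minimal sketch, assuming a purely hypothetical JSON endpoint spotted there (the URL and header below are placeholders, not anything from your site):

# replay a request found in the DevTools Network tab; URL and header are made-up placeholders
curl -s 'https://example.com/api/transactions?account=123&page=1' \
     -H 'Accept: application/json' \
     -o transactions.json

Right-clicking the request in Chrome and choosing "Copy as cURL" gives you the exact command, cookies and headers included.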
 
You want to start using a proper tool for recursive web data retrieval, such as ftp/wget. It can fetch a whole web site, including all dependencies like image, JavaScript and CSS files, into a specified local directory. This is usually suitable for offline reading and any further processing. The command below would retrieve my website https://obsigna.com/ into the local directory obsigna.com.

pkg install wget
wget -r -l inf -N -e robots=off --ca-certificate=/usr/local/share/certs/ca-root-nss.crt -P "obsigna.com" -nH -np "https://obsigna.com/"

For the different options see wget(1).

You would then point your browser to the file index.html in the retrieved directory. For the sake of security, browsers do not load local files into <iframe> elements, and for this reason sites may be displayed incompletely when viewed locally.
 
The .css file is very likely a Cascading Style Sheet. It tells the browser how to render things. For example: "when the HTML says emphasis, use red", or "put a 3-pixel-wide border around all tables", or "use a pink screen background". It is very unlikely to contain any useful data.

The .js files are JavaScript. That is a general-purpose programming language, which happens to be used heavily in web pages. Most likely, the web page that you viewed was a JavaScript program. I think there should be at least one .html file (I didn't know a web server could return just a .js file and no HTML at all), but I may be wrong about that. As drhowarddrfine said, it is unfortunately likely that the JavaScript program does not contain any data at all, and simply goes back to the server and reads the data. You should simply use a text editor on the JavaScript, and start reading and understanding it.
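Before reading the whole program, it can help to simply search the saved files for URLs and for the calls that request data. A rough sketch, assuming the files Chrome saved end in .js.download as you described (adjust the glob to your actual file names):

# list every URL mentioned in the saved JavaScript files
grep -hoE 'https?://[^"[:space:]]+' *.js.download | sort -u

# spot the places where data is actually requested
grep -nE 'fetch\(|XMLHttpRequest|\.ajax\(' *.js.download

Any URL that shows up next to fetch() or XMLHttpRequest is a good candidate for where the transaction data really comes from.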
 
ralphbsz To be clear, there must be at least one original HTML file to start things off; otherwise the browser's parser won't know how to fetch the other files. However, once that is done, a whole web site can continue without fetching any further HTML at all, only JavaScript, which can insert HTML on its own and download further data on the side, as you said.
 
Hey, balanga ! It's not the 20th century anymore. ;)
Seems most sites nowadays are using compressed (minified) cascading style sheets and compressed/minified JavaScript. Unwinding it all, along with all the remote references to even more additional data, will be a challenge for you, to say the least. There are utilities (even in the FreeBSD ports tree) that can assist with such tasks. But what is the data in those pages actually worth to you? Unless this is an ongoing task you want/need to perform, you'll probably be better off simply selecting and copying that data from the pages themselves, from within your web browser. Otherwise, it will be no small undertaking.
Don't believe me? Open any of those pages, and try to decipher those l-o-n-g strings of text/base64, and whatnot inside the pages. :)
BTW those base64(1) strings might contain anything from images, to text, to ??? You'll need to unpack them to even have any idea.
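As a toy example of what "unpacking" means, here is a decode of a made-up string, assuming your base64 decoder takes the usual -d option (some implementations spell it -D or --decode):

# decode a made-up base64 string, then ask file(1) what the payload actually is
echo 'SGVsbG8sIHdvcmxkIQ==' | base64 -d > blob.bin
file blob.bin   # identifies the decoded payload (plain text, PNG, gzip, ...)

The strings in real pages will be far longer, and may well decode to compressed or binary data rather than anything readable.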

HTH

--Chris
 