wget - Trouble scraping a site that requires login

Hello community!

I have been venturing into online dating and I'd like to optimise the process of finding a match, since the dating site does not offer a keyword search.
To put the time spent eyeballing profiles to better use, I thought I could get ftp/wget to do the work for me.

So the plan is to have ftp/wget use a session cookie so it can scrape the profiles while logged in with my ID, and then have a script go through the files and delete the scraped pages that don't match the keyword.

Steps taken:

1. Get the cookies for the login session of my ID

At first I tried logging in and saving cookies with wget, but without success (the parameters --user=foo and --password=bar, as well as --post-data 'user=foo&password=bar', may not work with this website).

I decided to have Firefox store the cookies for me and checked their viability with the addon "Cookie Manager". To be extra sure, I used another addon to export the cookies in the required Netscape format, as per:
wget(1) - "Load cookies from file before the first HTTP retrieval. file is a textual file in the format originally used by Netscape's cookies.txt file."
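Before handing an exported file to --load-cookies, it can help to check what is actually in it. The following is a minimal Python sketch, not part of wget, that reads the seven tab-separated fields of the Netscape format. The sample data, domain and cookie names are made up for illustration. It relies on two assumptions worth verifying against your own export: that the exporter writes session cookies with an expiry of 0, and that it may mark HttpOnly cookies with the `#HttpOnly_` line prefix (a convention curl uses), which a reader that treats every `#` line as a comment would skip.

```python
from typing import List, NamedTuple


class CookieLine(NamedTuple):
    domain: str
    path: str
    secure: bool
    expires: int      # Unix time; 0 is assumed to mean "session cookie"
    name: str
    value: str
    http_only: bool   # True if the line carried the "#HttpOnly_" prefix


def parse_netscape_cookies(text: str) -> List[CookieLine]:
    """Parse cookies.txt content: 7 tab-separated fields per cookie line."""
    cookies = []
    for line in text.splitlines():
        http_only = line.startswith("#HttpOnly_")
        if http_only:
            line = line[len("#HttpOnly_"):]
        elif not line.strip() or line.startswith("#"):
            continue  # blank line or ordinary comment
        fields = line.split("\t")
        if len(fields) != 7 or not fields[4].isdigit():
            continue  # not a well-formed cookie line
        domain, _subdomains, path, secure, expires, name, value = fields
        cookies.append(CookieLine(domain, path, secure == "TRUE",
                                  int(expires), name, value, http_only))
    return cookies


# Hypothetical sample; real files come from the browser export.
SAMPLE = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t1893456000\tprefs\tabc\n"
    "#HttpOnly_www.example.com\tFALSE\t/\tTRUE\t0\tsession_id\txyz\n"
)

cookies = parse_netscape_cookies(SAMPLE)
for c in cookies:
    kind = "session" if c.expires == 0 else "persistent"
    flag = " (HttpOnly prefix)" if c.http_only else ""
    print(f"{c.name}: {kind}{flag}")
```

If the login session cookie shows up only with the `#HttpOnly_` prefix, that is one possible reason a tool would not send it.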

2. Use the cookie file to scrape my profile as an example, in two ways

For the first attempt I decided to pass the parameters --no-cookies --header (because other methods have failed on me before):

Code:
wget -vk -e robots=off --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" --no-cookies --header "Cookie: <cookienamehere>=<cookievaluehere>" https://www.happypancake.com/min-sida/

--2018-07-09 17:31:04--  https://www.happypancake.com/min-sida/
Resolving www.happypancake.com (www.happypancake.com)... 104.24.14.10, 104.24.15.10, 2400:cb00:2048:1::6818:f0a, ...
Connecting to www.happypancake.com (www.happypancake.com)|104.24.14.10|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /Error404 [following]
--2018-07-09 17:31:05--  https://www.happypancake.com/Error404
Reusing existing connection to www.happypancake.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /Error404 [following]
--2018-07-09 17:31:05--  https://www.happypancake.com/Error404
Reusing existing connection to www.happypancake.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /Error404 [following]
repeating...

For the second attempt I used this approach:

Code:
wget -vk -e robots=off --load-cookies allcookies_netscape.txt --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" https://www.happypancake.com/min-sida/

--2018-07-09 17:24:03--  https://www.happypancake.com/min-sida/
Resolving www.happypancake.com (www.happypancake.com)... 104.24.15.10, 104.24.14.10, 2400:cb00:2048:1::6818:e0a, ...
Connecting to www.happypancake.com (www.happypancake.com)|104.24.15.10|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /login/ [following]
--2018-07-09 17:24:03--  https://www.happypancake.com/login/
Reusing existing connection to www.happypancake.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.1’

index.html.1            [ <=>               ]  38.31K  --.-KB/s    in 0.08s

2018-07-09 17:24:03 (506 KB/s) - ‘index.html.1’ saved [39229]

Converting links in index.html.1... 5-50
Converted links in 1 files in 0.002 seconds.

index.html.1 is a page stating: "You must be logged in to access this page", so this method didn't work either.

Conclusion and new steps to be made

I notice that I get the server response 302 from happypancake.com, which starts a redirect. I should get redirected to my profile if the cookie is accepted. I'm not sure which attempt came closer, and to go further I have to pinpoint where the problem is.

Questions

1. Why doesn't using the name and value of the session cookie work in either of the two attempts?
2. Is there a better way to make ftp/wget handle the redirection, and how can I use Firefox to inspect the requests and responses sent when redirecting to my profile?
3. Are the wget parameters correct in this case?
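
One way to pinpoint where a redirect goes wrong is to request a single URL without following the redirect and look at the status code and Location header of that one hop. The Python sketch below shows the idea against a small stand-in server running on localhost. The server, its cookie name `session=ok` and its behaviour are invented for illustration and say nothing about how happypancake.com actually decides. In Firefox, the Network tab of the developer tools shows the same per-request information.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeSite(BaseHTTPRequestHandler):
    """Stand-in site: /min-sida/ redirects to /login/ unless a cookie is sent."""

    def do_GET(self):
        has_session = "session=ok" in self.headers.get("Cookie", "")
        if self.path == "/min-sida/" and not has_session:
            self.send_response(302)
            self.send_header("Location", "/login/")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"profile" if self.path == "/min-sida/" else b"login form"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch_once(port, path, cookie=None):
    """Send one GET without following redirects; return (status, Location)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    headers = {"Cookie": cookie} if cookie else {}
    conn.request("GET", path, headers=headers)
    resp = conn.getresponse()
    resp.read()
    result = (resp.status, resp.getheader("Location"))
    conn.close()
    return result


# Port 0 lets the OS pick a free port for the throwaway server.
server = HTTPServer(("127.0.0.1", 0), FakeSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

without_cookie = fetch_once(port, "/min-sida/")
with_cookie = fetch_once(port, "/min-sida/", cookie="session=ok")
print("no cookie:  ", without_cookie)
print("with cookie:", with_cookie)

server.shutdown()
server.server_close()
```

Comparing the two results hop by hop makes it visible whether the server accepted the cookie or sent the client to the login page.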


Thanks!
- - - michael_hackson
 
Cheater! You're not supposed to automate the process. They don't like it when you do that.

FTR: trying to reverse engineer someone's website for robot info gathering is a moving target. There are all kinds of things they can be doing to foul your attempts and the cookies are probably "session cookies" so wouldn't work well outside of the browser session they were created in.

My 2 cents worth: join a club and meet folks in real life. online dating is generally a game for masochists.
 
Hi! I am not too fond of wasting time, so when it comes to things like this I'd rather cheat and learn than sit and watch. ;)

Yeah, generally it goes against the whole concept of running an online dating service, where I am supposed to sit and invest dead time. However, this website in particular is the forgiving kind, and they have no policy against running bots on their site.
Their concept is not about making money from people, unlike other sites.

I share your 2 cents, which is why online dating comes second to computer studies for me, so this is more like going fishing. :D
 
Btw:
FTR: trying to reverse engineer someone's website for robot info gathering is a moving target. There are all kinds of things they can be doing to foul your attempts and the cookies are probably "session cookies" so wouldn't work well outside of the browser session they were created in.

Isn't the session cookie exactly what is supposed to be used with wget?
Because what I want is to have a session with wget.

https://stackoverflow.com/questions/1324421/how-to-get-past-the-login-page-with-wget

Currently I have 4 cookies in allcookies_netscape.txt:
__cfduid
foo_SessionId
HPC_B
HPC_SE
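
The first attempt above sent a single name=value pair in the Cookie header. If the site expects several of these cookies together (an assumption, since it is not known which ones the login depends on), they can all go into one header, separated by "; ". A small Python sketch that builds that header value follows; the values are placeholders, not real cookie contents.

```python
def cookie_header(cookies: dict) -> str:
    """Join name/value pairs into one Cookie header value ("a=1; b=2")."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Cookie names from the exported file; the values are placeholders.
cookies = {
    "__cfduid": "VALUE1",
    "foo_SessionId": "VALUE2",
    "HPC_B": "VALUE3",
    "HPC_SE": "VALUE4",
}

header = cookie_header(cookies)
print("Cookie: " + header)
```

The printed line is what would go inside wget's --header "..." argument.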
 
I'm not up on all the techniques, but what if they are using something like "canvas fingerprinting" in combination with expected cookies? That would mean the session information stored in the cookies does not match the expected state of the browser canvas, as wget has no browser canvas per se. As I say, I'm not up on all the techniques, but it is a moving target if they want to be sneaky.

see How websites track you
 
Hmm, if that is the case, the backup plan is to search for a Firefox addon that can essentially do the same thing; I think I read about one in another thread.
It's not much fun to rely on addons, though, if this can be done with wget.
I'm also open to suggestions if curl has this function.

I should be able to verify easily whether they use "canvas fingerprinting". I think they don't, because I tend to get an alert when a website uses it.

Don't we have a wget master here on the forum?

I know ronaldlees has some expertise in this area.
 
I'll see whether it matters if the cookies are saved with wget or with Firefox. The best thing would be to have a session cookie linked to wget from the start.

I haven't fully learnt how to use --post-data 'user=foo&password=bar' the right way, since I doubt it is written that way for all websites.

A curl on happypancake.com gives this:
Code:
<input name="ctl00$ctl00$ContentPlaceHolderDefault$mainContent$loginForStart$txtUserName" type="text" id="ContentPlaceHolderDefault_mainContent_loginForStart_txtUserName" class="text" placeholder="Smeknamn" />
<input name="ctl00$ctl00$ContentPlaceHolderDefault$mainContent$loginForStart$txtPassword" type="password" id="ContentPlaceHolderDefault_mainContent_loginForStart_txtPassword" class="text" placeholder="Lösenord" />
So I'll go with --post-data 'text=username&password=password' next.

Currently I am working with --use-askpass=command and trying to figure out which program should handle the input.

"--use-askpass=COMMAND will request the username and password for a given
URL by executing the external program COMMAND."


Tried running the above in Firefox and got the string:
/path/to/working/directory/Username for 'https://www.happypancake.com':

But wget exited with an error because FF couldn't supply the input.
 
Maybe this might help. But I also recommend going out into real life and meeting people in sports clubs, museums, concerts, ...
 
I have done some extensive work with web browsers, such as integrating a new protocol (SIP), migrating web sessions and, in another case, grabbing documents from sites requiring authentication. Based on that, I think you will be better off extending an existing addon to do the scraping for you, owing to today's complexities surrounding web (data) security & privacy.

And I feel you; how much leisure time do addicts like us :):) have to socialize? I read about someone doing exactly what you are trying to do a few months back. All the best!
 
"text" and "password" are the HTML input types, not their names.

https://www.w3schools.com/htmL/html_forms.asp
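
To make the difference between the type and the name of an input concrete, here is a short Python sketch that parses the two input tags from the curl output posted earlier in the thread and prints both attributes for each. It uses only the standard library's html.parser; the class name InputCollector is made up for this example.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect (type, name) for every <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag == "input":
            attributes = dict(attrs)
            self.inputs.append((attributes.get("type"), attributes.get("name")))


# The two fields from the login form, as shown in the curl output.
LOGIN_HTML = """
<input name="ctl00$ctl00$ContentPlaceHolderDefault$mainContent$loginForStart$txtUserName" type="text" id="ContentPlaceHolderDefault_mainContent_loginForStart_txtUserName" class="text" placeholder="Smeknamn" />
<input name="ctl00$ctl00$ContentPlaceHolderDefault$mainContent$loginForStart$txtPassword" type="password" id="ContentPlaceHolderDefault_mainContent_loginForStart_txtPassword" class="text" placeholder="Lösenord" />
"""

collector = InputCollector()
collector.feed(LOGIN_HTML)
for input_type, input_name in collector.inputs:
    print(f"type={input_type!r}  name={input_name!r}")
```

The name column is what a form submission uses as the key; the type only tells the browser how to render the field.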

Always nice to get a corrective post. The question I do have, and couldn't find an answer to in wget(1), is what to pass with the --post-data parameter. Across different threads, some use 'user=foo&password=bar' and others use 'j_username=foo&j_password=bar'. That tells me that ftp/wget is "dumb" in that sense: you have to tell it exactly what data to post, since it can't work out the "username + password" fields on its own.

If I am right, how can I find out what names to use when posting data to a login?
I tried UserName and Password, since that is what appeared in the placeholder ID, but I'm still not able to log in. :7 Is it case sensitive?

Addition:
Compare these two:

https://unix.stackexchange.com/ques...-through-username-and-password-with-wget?rq=1

https://stackoverflow.com/questions/1324421/how-to-get-past-the-login-page-with-wget
 
Thank you for your reply and the suggestion of going with an addon. If I can, I'd rather not rely on an addon, because I like to have deeper insight into the process. I found wget quite recently and it has helped me halve my number of stored bookmarks. With a thousand bookmarks or more, it's nice to find another way to store articles etc.

Haha and yes, I prioritize studies over online dating, it's more profitable. :D

Meeting people in the wild is better than eyeballing profiles so now you all understand why I'd rather speed up the process. ;)
 
Look at the name property of the input fields of the form. And looking at those horrid names I suspect the site was built with ASP.NET. I'd try with txtUserName and txtPassword.
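
Since wget transmits the --post-data string as given, the names and values have to be URL-encoded beforehand. The Python sketch below builds such a string. It assumes the form expects the full name attributes from the curl output earlier in the thread (txtUserName and txtPassword are the last segments of those names), and it uses foo/bar as placeholder credentials. The `extra` parameter is a hypothetical hook for any hidden inputs the page may contain; ASP.NET WebForms pages commonly carry fields such as __VIEWSTATE, but whether this site requires them is not confirmed here.

```python
from urllib.parse import urlencode

# Full name attributes copied from the login form's input tags.
USER_FIELD = "ctl00$ctl00$ContentPlaceHolderDefault$mainContent$loginForStart$txtUserName"
PASS_FIELD = "ctl00$ctl00$ContentPlaceHolderDefault$mainContent$loginForStart$txtPassword"


def build_post_data(username, password, extra=None):
    """Return an application/x-www-form-urlencoded body for the login form.

    `extra` is an optional dict of additional fields (for example hidden
    inputs copied from the page); it is a hypothetical convenience.
    """
    fields = dict(extra or {})
    fields[USER_FIELD] = username
    fields[PASS_FIELD] = password
    return urlencode(fields)  # encodes "$" as %24, spaces as "+", etc.


body = build_post_data("foo", "bar")
print(body)
```

The resulting string contains no literal "$", so it can be passed to --post-data in single or double quotes without the shell trying to expand anything.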

Ding ding! You are correct. It's ASP.NET. :p Didn't see that '$' before the names. Thank you!