How to detect changes on a webpage using Python

The test page

First off, we need a page to test on. I used a small Flask app with flask_simplelogin.

from flask import Flask
from flask_simplelogin import SimpleLogin, login_required
from datetime import datetime as dt

app = Flask(__name__)
app.config['SECRET_KEY'] = 'whatever'
app.config['SIMPLELOGIN_USERNAME'] = 'foo'
app.config['SIMPLELOGIN_PASSWORD'] = 'bar'
SimpleLogin(app)

@app.route('/')
@login_required
def index():
    if dt.now().minute % 2 == 0:
        return "yes"
    else:
        return "no"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=4002)

This page requires a login and then shows whether the current minute is even or odd. The content changes every minute, which makes it a good example for detecting changes.

This page will be running at simplelogin-example.yadamiel.com

Gathering information

Now that that's out of the way, we can get started. You will need the following tools:

  • A web browser with developer tools (Firefox is used in screenshots)
  • An editor for Python code
  • A recent Python 3 version
  • A basic understanding of Python and HTTP requests
  • The requests and beautifulsoup4 libraries

Now we will visit our webpage at https://simplelogin-example.yadamiel.com/ and open developer tools in the network tab.

Here we can see the requests done by the webpage. The marked settings should be activated to see all requests and persist requests after reload/redirect. In the top left corner there is a trash can symbol used to clear requests.

After you have made sure that all requests are visible, clear them, reload the page, and log in. This ensures that you see one concise session.

Clicking on a request shows its details. This is where you figure out how the page handles login. The most common way is a session cookie, which this page uses too. You can see it on the Headers tab of the POST request to /login/. A POST request sends data to the web server; when someone logs in, for example, it carries the login data. Now look for the data sent by the POST request, which you can find in the Params tab. Here we can see that this page sends 4 values:

  • username
  • password
  • next
  • csrf_token

Three of these are pretty self-explanatory: username and password contain the login credentials, and next contains the path visited after login, in this case / for https://simplelogin-example.yadamiel.com + /.
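To make the Params tab less abstract, here is a minimal sketch of how a form with these four fields is typically encoded into a POST body. It uses only the standard library, and the token value is a made-up placeholder, not a real token from the page.

```python
from urllib.parse import parse_qs, urlencode

# The four fields seen in the Params tab. The csrf_token value is a
# made-up placeholder; the real one changes for every visitor.
form_fields = {
    "username": "foo",
    "password": "bar",
    "next": "/",
    "csrf_token": "made-up-token-123",
}

# Form data is usually sent URL-encoded, as key=value pairs joined by "&".
encoded_body = urlencode(form_fields)
print(encoded_body)

# Decoding the body again gives back the same fields.
decoded = {key: values[0] for key, values in parse_qs(encoded_body).items()}
print(decoded)
```

The requests library does this encoding for you when a dictionary is passed as data, so the sketch is only meant to show what travels over the wire.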

The csrf_token is the difficult part. These tokens are generated uniquely for each person accessing the webpage, to prevent others from faking requests made by you. More information can be found here

Now we need to go back to our first request and look for the CSRF token. This is done in the Response tab of the first request.

On a page this simple it is easy to find this token, but for larger pages I would encourage you to copy the HTML code you see into a text editor with search capabilities.

Writing Code

Now we have all the parts needed to write our code. Make sure you have the requests and beautifulsoup4 libraries installed.

First we need to import these libraries

import requests
from bs4 import BeautifulSoup

import time

(bs4 is the short name for beautifulsoup4)

Requests can easily store cookies (like the session cookie), but it needs to store them somewhere.

sess = requests.Session()

This defines a persistent session for our requests, in which cookies are stored.

Then we need to get the csrf_token which is required for login.

page = sess.get("https://simplelogin-example.yadamiel.com/login/")

This simply requests the page with a GET request and stores the response in the page variable.

Then we need to get the csrf_token somehow. This is where beautifulsoup4 comes in handy, because it can easily get different elements from HTML pages.

soup = BeautifulSoup(page.content, features="html.parser")

This creates a "soup" object from the page's content, which we use to search for elements.

csrf_token_field = soup.find(id="csrf_token")

csrf_token = csrf_token_field["value"]

Here we search for the first element which matches the HTML id csrf_token. This is the whole <input> tag, so we save that in a variable and extract the value a line later.
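If the field is missing (for example because the login page changed), indexing the result fails with a confusing error. The following is a minimal sketch of the same lookup with an explicit check. It swaps in the standard library's html.parser instead of beautifulsoup4 so it runs without extra packages; the names InputValueFinder and find_input_value and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Remembers the value attribute of the first <input> with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_match = tag == "input" and attributes.get("id") == self.element_id
        if is_match and self.value is None:
            self.value = attributes.get("value")


def find_input_value(html, element_id):
    """Return the value of the matching <input>, or raise ValueError."""
    finder = InputValueFinder(element_id)
    finder.feed(html)
    finder.close()
    if finder.value is None:
        raise ValueError(f"no <input> with id {element_id!r} and a value found")
    return finder.value


# Made-up stand-in for the login page's HTML.
SAMPLE_LOGIN_HTML = """
<form method="post">
  <input id="csrf_token" name="csrf_token" type="hidden" value="made-up-token-123">
  <input id="username" name="username" type="text">
</form>
"""

token = find_input_value(SAMPLE_LOGIN_HTML, "csrf_token")
print(token)
```

With a real response you would pass page.text instead of the sample string.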

Now we have all the parts needed to construct our POST request's data:

login_params = {
    "csrf_token":csrf_token,
    "username":"foo",
    "password":"bar",
    "next":"/"
}

Then we just need to send the login POST request and we're good to go, since the session will store that we are logged in.

sess.post("https://simplelogin-example.yadamiel.com/login/", data=login_params)
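The POST call does not tell you by itself whether the login worked. A rough heuristic, sketched below, is to look at the URL you end up on after redirects: if it is still the login page, the credentials or the token were probably rejected. This is an assumption about how the test page behaves, not something the library guarantees, and looks_logged_in is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def looks_logged_in(final_url, login_path="/login/"):
    """Heuristic: we are probably logged in if we did not land on the login page.

    final_url is the URL reached after all redirects; with requests this is
    available as the url attribute of the response.
    """
    final_path = urlsplit(final_url).path.rstrip("/")
    return final_path != login_path.rstrip("/")


# Landing on the start page suggests success, landing on /login/ does not.
print(looks_logged_in("https://simplelogin-example.yadamiel.com/"))
print(looks_logged_in("https://simplelogin-example.yadamiel.com/login/?next=/"))
```

In the script you would keep the return value of sess.post(...) in a variable and pass its url attribute to this function.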

Now we can request the page that is hidden behind the login:

is_minutes_even = sess.get("https://simplelogin-example.yadamiel.com/").content

(the page shows yes when the minute is even and no when it isn't.)

All that's left to do now is repeatedly request the page and see if anything changes. To make things easier we are going to use the hash of the returned page content to check for changes.

previous_hash = hash(is_minutes_even)

Then we're going to run an endless loop in which we request the page repeatedly to see if anything changes.

while True:
    is_minutes_even = sess.get("https://simplelogin-example.yadamiel.com/").content
    current_hash = hash(is_minutes_even)
    if previous_hash != current_hash:
        print("It changed!")
    previous_hash = current_hash
    time.sleep(1)

Here we have an infinite loop that requests the page, computes the hash of the page content, and checks whether it differs from the previous hash. Different content almost always produces a different hash, which makes comparing hashes a convenient way to detect changes. Then we overwrite the previous hash with the current one and wait one second before requesting again.
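One caveat: Python's built-in hash() is randomized per process for bytes and strings, so its values are only comparable within a single run of the script. If you want to store a hash and compare it after a restart, a digest from hashlib is stable. The sketch below uses SHA-256; content_digest and has_changed are illustrative names, not part of any library.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return a SHA-256 hex digest that is the same in every run."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_digest: str, content: bytes):
    """Return (changed, current_digest) for the given page content."""
    current_digest = content_digest(content)
    return current_digest != previous_digest, current_digest


# The test page only ever returns "yes" or "no".
digest_for_yes = content_digest(b"yes")
changed, latest_digest = has_changed(digest_for_yes, b"no")
print(changed)
```

The digest is a plain string, so it can be written to a file and read back the next time the script starts.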

What you will notice when running this is that it prints "It changed!" roughly every minute.
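Waiting a minute for every test run gets tedious. One way around that, sketched here, is to pass the fetching step in as a function so the loop can be tried with fake page content. This variant compares the raw content directly instead of hashes and stops after a fixed number of checks; watch and its parameters are made-up names for illustration.

```python
import time


def watch(fetch, checks, delay=1.0, sleep=time.sleep):
    """Call fetch() repeatedly and report which checks saw a change.

    fetch  -- function without arguments that returns the page content
    checks -- how many times to re-request after the initial request
    Returns a list with the numbers of the checks where the content
    differed from the previous request.
    """
    changed_at = []
    previous = fetch()
    for check in range(1, checks + 1):
        sleep(delay)
        current = fetch()
        if current != previous:
            changed_at.append(check)
        previous = current
    return changed_at


# Fake page content standing in for real requests.
fake_pages = iter([b"yes", b"yes", b"no", b"no", b"yes"])
changes = watch(lambda: next(fake_pages), checks=4, delay=0, sleep=lambda seconds: None)
print(changes)
```

For the real page, fetch would be a small function that returns the content of sess.get(...) for the page URL.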

Full code

import requests
from bs4 import BeautifulSoup

import time

sess = requests.Session()
page = sess.get("https://simplelogin-example.yadamiel.com/login/")
soup = BeautifulSoup(page.content, features="html.parser")

csrf_token_field = soup.find(id="csrf_token")

csrf_token = csrf_token_field["value"]

login_params = {
    "csrf_token":csrf_token,
    "username":"foo",
    "password":"bar",
    "next":"/"
}

sess.post("https://simplelogin-example.yadamiel.com/login/",data=login_params)

is_minutes_even = sess.get("https://simplelogin-example.yadamiel.com/").content

previous_hash = hash(is_minutes_even)

while True:
    is_minutes_even = sess.get("https://simplelogin-example.yadamiel.com/").content
    current_hash = hash(is_minutes_even)
    if previous_hash != current_hash:
        print("It changed!")
    previous_hash = current_hash
    time.sleep(1)

Notes on other authentication methods

A multitude of other methods to log in users are used across different websites. Some of these may include:

  • Token-based: in this case you can add the token in the appropriate request header like this: requests.get(url, headers={"Token": mytoken})
  • HTTP basic auth: in this case the username and password can be passed in the URL or in a header: https://username:password@example.com
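As a rough illustration of what these two methods put into the request, the sketch below builds the headers by hand. For basic auth, the Authorization header carries "username:password" encoded as Base64. The function names are made up, and the header name "Token" is only an example; the real name depends on the website.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + encoded}


def token_header(token, header_name="Token"):
    """Build a header carrying a token; the header name varies per site."""
    return {header_name: token}


print(basic_auth_header("username", "password"))
print(token_header("made-up-token"))
```

You rarely need to build the basic auth header yourself: requests accepts auth=(username, password) and creates it for you.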

In any case, looking for patterns in the requests is the best approach. Some services may even provide a proper API, in which case you should use the API instead of scraping the page.

Notes on blocked requests when using python

Some websites don't like being requested like this. They will try to block you. You can try to bypass this by sending a User-Agent header that looks like a normal browser. You can get your current user agent here
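A minimal sketch of such a header set follows. The User-Agent string is just an example of what a Firefox browser might send; replace it with your own browser's value. browser_like_headers is a hypothetical helper name.

```python
# Example value only; use the user agent your own browser reports.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def browser_like_headers(extra=None):
    """Return request headers with a browser-like User-Agent.

    extra -- optional dictionary of further headers to include
    """
    headers = {"User-Agent": EXAMPLE_USER_AGENT}
    if extra:
        headers.update(extra)
    return headers


headers = browser_like_headers({"Accept-Language": "en-US,en;q=0.5"})
print(headers)
```

With a requests session, sess.headers.update(browser_like_headers()) applies these headers to every request made through that session.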

Another reason might be that the page requires JavaScript to work. This is much harder to work around, and I can't cover it here. If this is the case, I suggest reconsidering whether this approach is feasible and taking a look at this Stack Overflow thread

Useful Links