Rails Web Parser


Preface:

This was a technical test for a job application in which I placed in the top 4% of applicants, though I was ultimately unsuccessful.

Part of the test was to submit a PR explaining my thought process; to save rephrasing, it is pasted below.

Submission PR

This pull request implements functionality permitting the on-demand generation of a website's analysis: its title, word count, top ten most frequent words, and table of contents.

The implementation attempts to follow Rails conventions as far as my understanding of them goes, with further consideration given to the writings of individuals affiliated with 37Signals, on the basis that their opinions are likely reflected throughout 37Signals codebases.

Implementation

URL Parsing

On submission of a URL deemed valid by the <input type="url"> validation, open-uri attempts to access and read the associated webpage.

Parsing into Nokogiri

Once the webpage is persisted in the database, it is parsed once, when required, via Nokogiri::HTML() and then made available to the relevant methods. Nokogiri handles malformed markup automatically, though, so as not to adopt the library blindly, the limits of its parsing were checked in the console against the inputs in this gist.
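
As a rough, hypothetical illustration of that behaviour (not one of the gist's inputs), Nokogiri will repair an unclosed tag while parsing:

require "nokogiri"

# Nokogiri closes the unclosed <p> and wraps the fragment in <html>/<body>.
doc = Nokogiri::HTML("<p>An unclosed paragraph")
doc.at_css("p").text # => "An unclosed paragraph"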

On open-uri error:

The application redirects to root_path, passing the error message as a notice for user visibility.

On open-uri success where a Page with that URL already exists:

The application returns the relevant show page for the submitted URL; this avoids polluting the database with identical entries.

On success and no existing entry:

Page.new(page_params) is called with the provided URL and the content read by open-uri.

The decision was made to create the Page (@page) after these checks to avoid instantiating a new object without a guarantee that @page.content could be populated; doing so would be a redundant operation, providing no value beyond being another object to dispose of.
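
A minimal sketch of that controller flow, assuming a conventional PagesController; names and rescue clauses are illustrative rather than the exact submission code:

# app/controllers/pages_controller.rb (hypothetical sketch)
require "open-uri"

class PagesController < ApplicationController
  def create
    content = URI.open(page_params[:url]).read

    if (existing = Page.find_by(url: page_params[:url]))
      # Avoid duplicate entries by showing the already-persisted page.
      redirect_to existing
    else
      @page = Page.new(page_params.merge(content: content))
      @page.save!
      redirect_to @page
    end
  rescue OpenURI::HTTPError, SocketError, Errno::ECONNREFUSED => e
    # Surface the open-uri failure to the user on the submission form.
    redirect_to root_path, notice: e.message
  end

  private

  def page_params
    params.require(:page).permit(:url)
  end
end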

page_analyzer.rb

Initially, all of the processing logic lived within page.rb, but once completed I began to feel this breached the single-responsibility principle. Rails defines a model as "the layer of the system responsible for representing data and business logic", so having a Page responsible for representing its data and also interpreting its stored content for output felt at odds with this.

As such, the logic was extracted into page_analyzer.rb, implemented as a PORO: it lacks any dependency on the existing Rails libraries, so including ActiveModel was unnecessary. page.rb then creates a PageAnalyzer object for use throughout, so @page.word_count still returns the relevant figure but queries generation methods held outside of page.rb.

The reasoning here is that it remains possible to essentially talk to the models without the code being unclear about the intention of the model: Page is responsible for displaying information related to it, and PageAnalyzer is responsible for generating the analysis to be displayed, clearly separating the concerns.
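
A condensed sketch of the PORO's shape, assuming it receives the stored HTML (the bodies of the analysis methods are covered in the sections below):

# app/models/page_analyzer.rb (hypothetical sketch)
class PageAnalyzer
  def initialize(html)
    # Parsed once on initialisation; every method reuses the same document.
    @document = Nokogiri::HTML(html)
  end

  def title
    @document.title
  end

  def word_count
    get_words_from_page.length
  end

  def generate_frequent_words
    # stop-word rejection and tally, described under Top Ten Words
  end

  private

  def get_words_from_page
    # element fall-through and stripping, described under Word Count
  end
end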

I did consider implementing this as a concern, based on Good Concerns, which would likely have looked similar to:

# app/models/page.rb
class Page < ApplicationRecord
  include Analyzable
end

# app/models/page/analyzable.rb
module Page::Analyzable
  extend ActiveSupport::Concern

  # method implementations
end

This was opted against primarily because of the guidance that "They need to feature a genuine "has trait" or "acts as" semantics to work", which I don't believe Analyzable fits: while a Page is analysable, it still wouldn't be inherently responsible for said analysis.

Furthermore, memoising the PageAnalyzer in page.rb allows @page_analyzer to behave similarly to a static class, meaning the HTML content is only parsed on initialisation, offering a slight performance bonus over parsing on every method call.
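
A sketch of that memoised delegation inside the model, assuming content is the persisted HTML column:

# app/models/page.rb (hypothetical sketch)
class Page < ApplicationRecord
  def title
    page_analyzer.title
  end

  def word_count
    page_analyzer.word_count
  end

  private

  def page_analyzer
    # Memoised so the HTML content is only parsed on first use.
    @page_analyzer ||= PageAnalyzer.new(content)
  end
end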

Page Title

The title of the parsed document is extracted using Nokogiri’s title method, querying the contents of <title> or returning nil. If nil is returned it is displayed as “Title Unavailable”.

The decision was made not to look elsewhere for a title in the absence of <title>, as the HTML specification states "it is a required element in most situations" and continues to say it is only superseded by higher-level protocols, which likely places such sites outside the remit of this application.
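
In the analyser this amounts to little more than the following; whether the fallback string lives here or in the view is an assumption of the sketch:

def title
  # Nokogiri's #title returns the <title> text, or nil when the element is absent.
  @document.title || "Title Unavailable"
end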

Word Count

Based on the provided readme, word count was interpreted as a count of meaningful words, following the same logic by which not every part of an academic essay contributes to its word count (e.g. appendices, references).

The implementation aims to be functional across websites by accessing progressively less specific elements by their semantic naming, element ID, or class name, in an attempt to isolate important content from the rest of the webpage, only falling back to the content held within <body> when nothing else is available.

The implementation functions as follows, with steps 1-6 living within a method get_words_from_page (a sketch follows the list):

  1. Attempt to identify <main>, based on it being defined as "the dominant contents of the document". If unavailable, fall through elements of similar intent until defaulting to <body>, storing the result as content
  2. Strip obviously irrelevant elements, such as <script> and <style>, from the selected primary element in an attempt to retain a relevant word count regardless of how specific the selected content was for the given site
  3. Strip identifiable sections from content whose naming suggests they contain references/citations/footnotes. This is also done to keep the word count based on relevant words*
  4. Strip special characters from the remaining text using regex
  5. Strip standalone numbers using regex
  6. Split the remaining text into an array on spaces, rejecting empty indexes
  7. Return the array length as the word count.
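
A sketch of those steps, where the specific fallback selectors, ID/class names, and regexes are assumptions used for illustration:

def get_words_from_page
  # 1. Fall through progressively less specific containers, defaulting to <body>.
  content = @document.at_css("main") ||
            @document.at_css("article") ||
            @document.at_css("#content") ||
            @document.at_css("body")

  # 2. Remove elements that never contribute meaningful words.
  content.css("script, style, nav, footer").each(&:remove)

  # 3. Remove sections whose naming suggests references, citations, or footnotes.
  content.css("#references, .references, .citations, .footnotes").each(&:remove)

  text = content.text
  # 4. Strip special characters, keeping word characters, whitespace, and hyphens.
  text = text.gsub(/[^\w\s-]/, " ")
  # 5. Strip standalone numbers.
  text = text.gsub(/\b\d+\b/, " ")

  # 6. Split on whitespace, rejecting empty entries; 7. callers take .length for the count.
  text.split(/\s+/).reject(&:empty?)
end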

The method was implemented in a subtractive way, meaning known undesired elements are removed, as opposed to explicitly extracting the elements likely to contain the needed information; this helps keep it functional across websites.

Extracting the desired elements would require maintaining a list of them, likely longer than the list of undesired elements based on the MDN elements reference. It would also increase the risk of inaccuracy where a website does not conform to conventional structures, whereas undesired text generally has to live in places such as <script> to be functional.

This count is rendered in the view using number_with_delimiter to insert the relevant commas based on the total.
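
In the view this is a single helper call (assuming @page delegates word_count as sketched above):

<%= number_with_delimiter(@page.word_count) %>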

Top Ten Words

Calculating the top ten words in the text also uses get_words_from_page since the array needs to be identical to that of the word count.

Once provided with the array, generate_frequent_words uses reject to remove any words included in a list of stop-words based on stopwords-en, in an attempt to return meaningful words in the context of the website.

The list of stop-words is ~1000 words long, implemented as a set and loaded via an initialiser. This was done to ensure the list is only loaded once and to permit O(1) lookup, compared to iterating an array to find a match every time. As the list is entirely lowercase, each word passed to the set for comparison is downcased, rejecting it regardless of case without introducing a need to track whether a word began with a capital letter at the time of the check.
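
A sketch of that initialiser, where the constant name and file location are assumptions:

# config/initializers/stop_words.rb (hypothetical path)
require "set"

# Loaded once at boot; Set membership checks are O(1).
STOP_WORDS = File.readlines(
  Rails.root.join("lib", "stopwords-en.txt"), chomp: true
).to_set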

Once purged of stop-words, the remaining words are tallied and the top ten by count extracted via .tally.max_by(10) { |_, v| v }. Originally the top ten were selected using sort_by, reverse, and limit, but exploring the Enumerable docs led me to realise that sorting the entire array just to pull the first ten elements is inefficient compared to other available methods.
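
Putting the stop-word rejection and tally together in the analyser (STOP_WORDS is the assumed constant name from the sketch above):

def generate_frequent_words
  words = get_words_from_page.reject { |word| STOP_WORDS.include?(word.downcase) }

  # Tally occurrences and take the ten largest counts without sorting everything.
  words.tally.max_by(10) { |_word, count| count }.to_h
end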

Once passed to the view, the frequent-words hash is iterated by key and value, with each pair populating a column per row.

Table of Contents

The generation of the table of contents adopts similar logic to the word-count calculation, in the sense that it falls back through specific elements in an attempt to identify a pre-existing table of contents and the anchors within it, generating one from the available page headings if an existing table cannot be identified.

The logical implementation is as follows (a sketch of the heading-based fallback follows the list):

  1. Attempt to identify an HTML element with reference to providing a table of contents or page outline, falling back to <h1> - <h6> if required
  2. Extract its level of nesting via heading strength or the sum of list ancestors, depending on the identified ToC element, as an element with multiple list-declaration ancestors will always be a nested list
  3. For each element extracted, strip numbered prefixes via regex
  4. Create a stack tracking nest level, assigning children to nested hashes where appropriate
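
A sketch of the heading-based fallback path, showing the stack approach; the selectors, prefix regex, and hash shape are assumptions:

def table_of_contents
  headings = @document.css("h1, h2, h3, h4, h5, h6")

  root  = { children: [] }
  stack = [[0, root]] # pairs of [nesting level, entry]

  headings.each do |heading|
    level = heading.name[1].to_i                   # "h2" -> 2
    text  = heading.text.sub(/\A\s*[\d.]+\s*/, "") # strip numbered prefixes such as "1.2 "
    entry = { text: text, children: [] }

    # Pop back to the nearest shallower ancestor, then attach as its child.
    stack.pop while stack.last[0] >= level
    stack.last[1][:children] << entry
    stack << [level, entry]
  end

  root[:children]
end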

Once this hash is returned, it is rendered via a partial containing an if children.present? conditional. If true, the partial recursively renders itself until .present? evaluates to false, allowing the nested elements to display appropriately.
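
A sketch of that recursive partial, where the partial and local names are assumptions:

<%# app/views/pages/_toc_entry.html.erb (hypothetical) %>
<li>
  <%= entry[:text] %>
  <% if entry[:children].present? %>
    <ul>
      <%= render partial: "toc_entry", collection: entry[:children], as: :entry %>
    </ul>
  <% end %>
</li>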

Consideration was given to other methods of rendering, such as implementing JavaScript via Stimulus to create the list elements dynamically instead of iterating a simple partial, or rethinking the data structure used to hold the table so it was not nested to begin with. Through use, however, it was noticed that the show page for the ruby-lang documentation for String loaded in its entirety in 230ms, significantly less time than is required for users to lose attention on a task, despite that site's sidebar being longer than one would expect from most sites.

Testing

This submission aims to have an effective yet concise test suite by testing the public interface of the analysis using HTML fixtures, removing the need for a fetching library to be functional in order for the suite to run.

The tests make an effort not to evaluate functions belonging to third-party libraries: while there is fallback logic around element selection, the tests ensure the relevant fallback happens when it should. They do not evaluate whether Nokogiri is detecting the appropriate elements via its methods (mainly .css()) to begin with; that responsibility falls within Nokogiri's own test suite.
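
As an illustration of that boundary, a test in this shape exercises the fallback through the analyser's public interface only; the fixture names and their assumed contents are illustrative:

# test/models/page_analyzer_test.rb (hypothetical sketch)
require "test_helper"

class PageAnalyzerTest < ActiveSupport::TestCase
  test "falls back to body when no main element is present" do
    html = file_fixture("no_main_element.html").read

    # The assumed fixture holds five meaningful words directly in <body>.
    assert_equal 5, PageAnalyzer.new(html).word_count
  end

  test "ignores script and style content when counting words" do
    html = file_fixture("page_with_scripts.html").read

    # The assumed fixture contains three meaningful words plus inline scripts.
    assert_equal 3, PageAnalyzer.new(html).word_count
  end
end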

There is a system test present, written to ensure URL submission returns correctly depending on the result, though it is not entirely comprehensive, as the system is straightforward enough to be mainly tested through use. There is no JavaScript that would necessitate a full system test and, even then, the argument could be made that system tests have failed.