#specter
2018-01-08
aaelony08:01:17

New to Specter. I'm scraping http://docs.h2o.ai/h2o/latest-stable/h2o-docs/rest-api-reference.html to build a vector of maps where each map will have keys for :http-verb, :rest-path, :inputs, and :outputs. One challenge is that the HTML appears to be in 4 conceptual sections: 1) a section of a href links with REST endpoints, 2) a section of h2 headings with the HTTP verb and REST endpoint, each followed by a table with Input and Output, 3) a section of a href links with schema nouns, and 4) a final section of h2 headings with the schema noun name, each followed by a table of keys and their descriptions. How might I keep the four sections separate before combining them? I'm also unclear whether I should use select, collect, codewalker, or continue-then-stay to collect and surface nested pieces of information. Thanks in advance.

nathanmarz14:01:15

@aaelony you're going to have to be more specific

nathanmarz14:01:53

you want to use specter to extract information out of html?

nathanmarz14:01:23

can you paste a sample of the html you're scraping, and what you want as output?

aaelony16:01:34

ok, let me take some time to formulate a better question.

aaelony20:01:43

hi @nathanmarz, here is the code in clojure that I'm wondering how to produce in Specter.

aaelony20:01:45

(ns testing
  (:require [net.cgrand.enlive-html :as html]
            [org.httpkit.client :as http]
            [clojure.string :as str]))

(->> (html/html-snippet
      (:body @(http/get ""
                        {:insecure false})))
     (filterv #(= (:tag %) :html))
     first
     :content
     (filterv #(= (:tag %) :body))
     first
     :content
     (filterv #(= (:tag %) :div))
     first
     :content
     (filterv #(= (:tag %) :h2))
     (mapv #(let [[verb endpoint] (-> %
                                      :content
                                      first
                                      (str/split #" "))
                  inputs (if endpoint
                           (re-seq #"\{(.*?)\}" endpoint))]
              {:verb verb :endpoint endpoint :inputs inputs}))
     (filterv #(or (= (:verb %) "GET")
                   (= (:verb %) "POST")
                   (= (:verb %) "DELETE")
                   (= (:verb %) "HEAD"))))
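
For comparison, here is one way the same traversal might look as a single Specter path. This is only a sketch, not an answer given in the channel: it assumes Specter 1.x (`com.rpl.specter`) is on the classpath, and the namespace `h2o-endpoints-sketch`, the helper names `h2-texts`, `parse-heading`, and `endpoints`, and the sample node data are all made up for illustration. The trailing `string?` guard is an addition, so headings whose first child is not plain text are skipped rather than passed to `str/split`.

```clojure
(ns h2o-endpoints-sketch
  (:require [clojure.string :as str]
            [com.rpl.specter :as sp]))

(def http-verbs #{"GET" "POST" "DELETE" "HEAD"})

(defn h2-texts
  "Selects the first child of every h2 under the first html > body > div,
  keeping only children that are strings."
  [nodes]
  (sp/select [(sp/filterer #(= (:tag %) :html)) sp/FIRST
              :content (sp/filterer #(= (:tag %) :body)) sp/FIRST
              :content (sp/filterer #(= (:tag %) :div)) sp/FIRST
              :content sp/ALL #(= (:tag %) :h2)
              :content sp/FIRST string?]
             nodes))

(defn parse-heading
  "Splits a heading such as \"GET /some/{param}\" into verb, endpoint, and inputs."
  [text]
  (let [[verb endpoint] (str/split text #" ")]
    {:verb verb
     :endpoint endpoint
     :inputs (when endpoint (re-seq #"\{(.*?)\}" endpoint))}))

(defn endpoints
  "Parses every h2 heading and keeps only those that start with an HTTP verb."
  [nodes]
  (->> (h2-texts nodes)
       (mapv parse-heading)
       (filterv (comp http-verbs :verb))))

;; Hypothetical enlive-style nodes standing in for the scraped page.
(def sample-nodes
  [{:tag :html
    :content [{:tag :head :content []}
              {:tag :body
               :content [{:tag :div
                          :content [{:tag :h2 :content ["GET /3/Frames/{frame_id}"]}
                                    {:tag :h2 :content ["FrameV3"]}
                                    {:tag :p :content ["ignored"]}]}]}]}])

(endpoints sample-nodes)
```

The path only replaces the chain of `filterv` / `first` / `:content` steps; the string parsing and the verb filter stay as ordinary Clojure, since they operate on the already extracted values.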