In Part 1 of this series we created a simple Go function to extract the content of a web page. In this post, we are going to take that function and use Go's well-developed net/http package to expose it as a microservice.
Making a Go Program a Webservice
Our first step in creating a microservice from our simple Go function is to reorganize our code and track it in GitHub. We want to create an installable library and binary such that a container can pull down the latest version of the source and start the application with no problems. To do this, we create two files.
In a lib subdirectory we include the contents of the Go package. We will import this in the web service Go file as well. Take note that there is a strange, very subtle way of exposing symbols of a package externally: you CAPITALIZE the first letter of a symbol, in this case a function name, to expose it for use externally. I'm not sure that it's an intuitive thing to someone new to Go, or even a sane syntactical decision on the Go folks' part, but it is pretty nice eventually.
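As a quick illustration (a standalone sketch using a hypothetical package, not part of the scraper):

package mylib

// Greet is exported: the capital G makes it callable as
// mylib.Greet from any package that imports mylib.
func Greet() string {
    return "hello"
}

// helper is unexported: the lowercase h keeps it private
// to package mylib.
func helper() string {
    return "internal only"
}

With that in mind, here is the library code: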
package scrape

import (
    "bytes"
    "log"
    "net/http"
    "regexp"

    "github.com/PuerkitoBio/goquery"
)

// fetchContent downloads the page at url and collects the text of the
// content-bearing tags (plus the content attribute of meta tags).
func fetchContent(url string) string {
    response, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    doc, err := goquery.NewDocumentFromResponse(response)
    if err != nil {
        log.Fatal(err)
    }
    var buffer bytes.Buffer
    tags := "h1, h2, h3, p, ul, ol, u, b, a, title, meta"
    doc.Find(tags).Each(func(i int, s *goquery.Selection) {
        tagName := s.Get(0).Data
        if tagName == "meta" {
            // meta tags carry their text in the content attribute
            a, exists := s.Attr("content")
            if exists {
                buffer.WriteString(a + "\n")
            }
        } else {
            buffer.WriteString(s.Text() + "\n")
        }
    })
    return buffer.String()
}

// removeTags strips script blocks and any remaining markup, then
// collapses runs of whitespace into single spaces. Errors from
// regexp.Compile are discarded because the patterns are fixed;
// regexp.MustCompile would be a common alternative.
func removeTags(content string) string {
    script, _ := regexp.Compile("<script([\\s\\S]*?)</script>")
    cleansed := script.ReplaceAllString(content, "")
    tag, _ := regexp.Compile("<[^>]+>")
    recleansed := tag.ReplaceAllString(cleansed, "")
    whitespace, _ := regexp.Compile("\\s+")
    return whitespace.ReplaceAllString(recleansed, " ")
}

// Scrape is the package's single exported function: fetch a page and
// return its cleaned-up text content.
func Scrape(url string) string {
    return removeTags(fetchContent(url))
}
As you can see, we are only exposing a single function, Scrape, to the outside world.
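To see this from a consumer's perspective, here is a minimal smoke test (a hypothetical file of my own in the lib directory, not part of the repo; it requires network access, so treat it as illustrative):

package scrape_test

import (
    "testing"

    scrape "github.com/allengeer/scrape/lib"
)

// TestScrape checks that Scrape returns some text for a live page.
func TestScrape(t *testing.T) {
    content := scrape.Scrape("https://example.com")
    if content == "" {
        t.Error("expected non-empty content from Scrape")
    }
}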
In the parent directory we write the Go code for the binary. The binary is responsible for standing up an HTTP server, which passes the url query parameter through to our library and returns the extracted content. For this we use the "net/http" package, which is part of the standard library and simple to use.
The app.go file is as follows:
package main

import (
    "fmt"
    "net/http"

    "github.com/allengeer/scrape/lib"
)

// status is a simple health-check endpoint.
func status(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "RUNNING")
}

// scraper reads the url parameter (from the query string, or from the
// form body as a fallback) and writes the extracted content.
func scraper(w http.ResponseWriter, r *http.Request) {
    url := r.URL.Query().Get("url")
    if url == "" {
        r.ParseForm()
        url = r.FormValue("url")
    }
    fmt.Fprint(w, scrape.Scrape(url))
}

func main() {
    http.HandleFunc("/status", status)
    http.HandleFunc("/scrape", scraper)
    http.ListenAndServe(":5000", nil)
}
Now you can see we define two routes, /status and /scrape. Each route has a handler function which we define, and each handler function has a specific contract:
func functionname(w http.ResponseWriter, r *http.Request) |
During the scrape call, we extract the url parameter from the request pointer (r), and then call our library's exported Scrape function with that value:
scrape.Scrape(url) |
That function comes from the package imported at the top. Note that although the import path ends in /lib, the files in that directory declare package scrape, so the function is called as scrape.Scrape:
import(..."github.com/allengeer/scrape/lib"...) |
Finally, in the main function we register the routes and start the server. Also note that the file declares package main, which tells Go this is a binary we want installed, with the main function as its entry point.
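One small hardening tweak worth considering (my addition, not in the original code): http.ListenAndServe returns an error if the server fails to start, and wrapping it in log.Fatal surfaces that failure (this assumes "log" is added to the imports):

func main() {
    http.HandleFunc("/status", status)
    http.HandleFunc("/scrape", scraper)
    // Exit with a non-zero status if startup fails,
    // e.g. when port 5000 is already in use.
    log.Fatal(http.ListenAndServe(":5000", nil))
}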
We commit this to Git and we are ready to install it to our Go workspace.
Installing a Go Webservice from Git
Assuming you have a Go workspace set up (with $GOPATH configured), simply run
go get github.com/allengeer/scrape |
Go will fetch the latest source from GitHub – in this case the source I have checked into my repo, but you can replace the allengeer part with your GitHub user name. It compiles the library and places it in your workspace, and it compiles the binary and places it on the bin path of your workspace.
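Under the classic $GOPATH layout, the result looks roughly like this (the pkg path varies by OS and architecture):

$GOPATH/src/github.com/allengeer/scrape/    <- fetched source (app.go, lib/)
$GOPATH/pkg/<os_arch>/github.com/allengeer/scrape/lib.a    <- compiled library
$GOPATH/bin/scrape    <- installed binary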
Now to start the webservice (assuming $GOPATH/bin is on your $PATH)
scrape |
Now visit http://localhost:5000/status
You should see RUNNING. Now give this a try: http://localhost:5000/scrape?url=http://allengeer.com/part-6-creating-a-go-web-content-extraction-microservice
You should see the text content of this page. How cool is that?!
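You can also exercise both routes from the command line, for example with curl (pointing the scraper at whatever page you like):

curl http://localhost:5000/status
curl "http://localhost:5000/scrape?url=https://example.com"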
Containerize a Go Microservice with Docker
The final step in creating our microservice is to package it in a container using Docker. For a Go app that is tracked in GitHub – such as we have done here – the Dockerfile is super simple.
FROM golang:latest
RUN go get github.com/PuerkitoBio/goquery
RUN go get github.com/allengeer/scrape
ENTRYPOINT /go/bin/scrape
EXPOSE 5000
As you can see, we start from the golang image, fetch the two required Go packages, and run the scrape binary as the entrypoint. We expose port 5000. We can build with the standard
docker build -t scraper:latest . |
and then run our container with
docker run -d -p 5000:5000 scraper:latest |
and then scale it up by running additional containers mapped to other host ports
docker run -d -p 5001:5000 scraper:latest |
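With both containers running, each mapped host port reaches an independent instance of the service (you can confirm they are up with docker ps):

curl http://localhost:5000/status
curl http://localhost:5001/status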
In the next post we are going to look at how to combine the microservice we created in Part 5 with our web extraction microservice, starting to build a system for extracting web page content and analyzing it with IBM Bluemix.
Source Code
The source code for this post is available at https://github.com/allengeer/scrape