Insights Operator is an OpenShift cloud-native application based on the [Operator Framework](https://github.com/operator-framework).
The Operator Framework is a toolkit for managing other cloud-native applications.

Tip:
Try installing operator-sdk and generating a new operator by following https://sdk.operatorframework.io/docs/building-operators/golang/quickstart/
You will see how much code operator-sdk generates and what an operator provides by default.

The main goal of Insights Operator is to periodically gather anonymized data from applications in the cluster and periodically upload it
to `cloud.redhat.com` for analysis.

Insights Operator does not manage any applications itself; it only runs on the Operator Framework infrastructure.
As is the convention for Operator applications, most of its code is structured in the `pkg` package, and `pkg/controller/operator.go`
hosts the Operator controller. Typically, Operator controllers read configuration and start some periodic tasks.

## How Insights Operator reads configuration
In the case of Insights Operator, the configuration is a combination of the file [config/pod.yaml](config/pod.yaml) and the configuration stored in
the `support` secret in the `openshift-config` namespace. The secret holds the endpoint and the interval. It doesn't exist by default,
but when it exists, it overrides the default settings that IO reads from config/pod.yaml.
The support secret has
- endpoint (where to upload to),
- interval (baseline for how often to gather and upload),
- httpProxy, httpsProxy, noProxy, which optionally set a custom proxy that overrides the cluster proxy just for Insights Operator uploads,
- username, password: if set, the insights client upload is authenticated with basic authorization using this username/password. By default it uses the token from the `pull-secret` secret.

The pull-secret has `.dockerconfigjson` with a list of tokens for various Docker repositories plus the authentication token for Insights Operator uploads:
- under the `auths` object, the property `cloud.openshift.com` has the property `auth` with the Bearer token for cloud.redhat.com authentication.



Content of openshift-config secret support:
```
$ oc get secret support -n openshift-config -o=yaml
apiVersion: v1
data:
  endpoint: aHR0cHM6Ly9jbG91ZC5yZWRoYXQuY29tL2FwaS9pbmdyZXNzL3YxL3VwbG9hZA==
  interval: Mmg=
kind: Secret
metadata:
  creationTimestamp: "2020-10-05T05:37:34Z"
  name: support
  namespace: openshift-config
  resourceVersion: "823414"
  selfLink: /api/v1/namespaces/openshift-config/secrets/support
  uid: 0e522987-4c02-479d-8d10-e4f551e60b65
type: Opaque

$ oc get secret support -n openshift-config -o=json | jq -r .data.endpoint | base64 -d
https://cloud.redhat.com/api/ingress/v1/upload
$ oc get secret support -n openshift-config -o=json | jq -r .data.interval | base64 -d
2h
```
The support secret can also configure an Insights Operator specific HTTP proxy using the keys httpProxy, httpsProxy and noProxy.

To authenticate to cloud.redhat.com, Insights Operator reads a preconfigured token from the `pull-secret` secret in the
`openshift-config` namespace (where cluster-wide tokens are stored). The token for cloud.redhat.com is stored in `.dockerconfigjson`, inside the `auths` section.
```
oc get secret/pull-secret -n openshift-config -o json | jq -r ".data | .[]" | base64 --decode | jq
{
  "auths": {
    ...
    "cloud.openshift.com": {
      "email": "cee-ops-admins@redhat.com",
      "auth": "BASE64-ENCODED-JWT-TOKEN-REMOVED"
    },
    ...
  }
}
```
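
The lookup in the JSON shown above can be sketched in Go with the standard library only. This is an illustration under assumptions, not the operator's actual parsing code: the type `dockerConfig` and the function `tokenFor` are hypothetical names, and the sample token is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dockerConfig mirrors only the parts of .dockerconfigjson used here.
type dockerConfig struct {
	Auths map[string]struct {
		Email string `json:"email"`
		Auth  string `json:"auth"`
	} `json:"auths"`
}

// tokenFor (hypothetical helper) returns the "auth" value stored for host.
func tokenFor(dockerConfigJSON []byte, host string) (string, error) {
	var cfg dockerConfig
	if err := json.Unmarshal(dockerConfigJSON, &cfg); err != nil {
		return "", fmt.Errorf("parsing .dockerconfigjson: %w", err)
	}
	entry, ok := cfg.Auths[host]
	if !ok || entry.Auth == "" {
		return "", fmt.Errorf("no auth entry for %q", host)
	}
	return entry.Auth, nil
}

func main() {
	// Placeholder content; a real pull-secret holds more registries.
	data := []byte(`{"auths":{"cloud.openshift.com":{"email":"user@example.com","auth":"TOKEN"}}}`)
	token, err := tokenFor(data, "cloud.openshift.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(token) // TOKEN
}
```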
The configuration secrets are periodically refreshed by [configobserver](pkg/config/configobserver/configobserver.go). Any code can register to
receive a signal through a channel by calling `config.ConfigChanged()`, as for example `insightsuploader.go` does. It is then notified whenever the config changes.
```
configCh, cancelFn := c.configurator.ConfigChanged()
```
Internally the configObserver keeps an array of subscribers, so all of them get the signal.
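
The subscribe-and-notify idea can be sketched as follows. This is a simplified illustration, not the code from `configobserver.go`: the `observer` type and its `notify` method are hypothetical, and a map is used instead of an array so that the returned cancel function can remove a subscriber easily.

```go
package main

import (
	"fmt"
	"sync"
)

// observer is a hypothetical stand-in for the config observer.
type observer struct {
	mu        sync.Mutex
	nextID    int
	listeners map[int]chan struct{}
}

func newObserver() *observer {
	return &observer{listeners: make(map[int]chan struct{})}
}

// ConfigChanged registers a subscriber and returns its signal channel
// together with a function that unregisters it again.
func (o *observer) ConfigChanged() (<-chan struct{}, func()) {
	o.mu.Lock()
	defer o.mu.Unlock()
	id := o.nextID
	o.nextID++
	ch := make(chan struct{}, 1) // buffered so notify never blocks
	o.listeners[id] = ch
	return ch, func() {
		o.mu.Lock()
		defer o.mu.Unlock()
		delete(o.listeners, id)
	}
}

// notify signals every registered subscriber once.
func (o *observer) notify() {
	o.mu.Lock()
	defer o.mu.Unlock()
	for _, ch := range o.listeners {
		select {
		case ch <- struct{}{}:
		default: // a signal is already pending for this subscriber
		}
	}
}

func main() {
	o := newObserver()
	configCh, cancelFn := o.ConfigChanged()
	defer cancelFn()
	o.notify()
	<-configCh
	fmt.Println("config changed")
}
```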


## How is Insights Operator scheduling gathering
A commonly used pattern in Insights Operator is that a task is started as a goroutine and runs its own cycle of periodic actions.
These actions are mostly started from `operator.go`.
They usually use `wait.Until`, which runs a function periodically, after a short delay, until the end is signalled.
These are the main scheduled tasks:
- Gatherer
- Uploader
- Config Observer
- Disk Pruner

### Scheduling of Gatherer
The Gatherer uses this logic to start gathering information from the cluster; this is handled in [periodic.go](pkg/controller/periodic/periodic.go).

So far we have only one Gatherer (called `clusterconfig`). It has several gather-functions, each collecting different data from the cluster.
The workflow of the gather-functions is managed by the Gatherer.
Only one Gatherer runs at a time, because we only have one Gatherer at the moment (i.e. we can add concurrency here when it's needed).
When IO starts, there is an initial delay before the first `Gather` happens. After that, a `Gather` is initiated every interval; this is done by `periodicTrigger`.
`periodic.Run` handles the initial delay and starts the `periodicTrigger` like `go wait.Until(func() { c.periodicTrigger(stopCh) }, time.Second, stopCh)`.

`Gather` uses `ExponentialBackoff` to retry (the number of retries is specified in `status.GatherFailuresCountThreshold`) if a Gatherer returns any errors. These errors are mostly caused by a collected resource not being ready yet, so it can't be collected right now and we should retry later.
It's important that all retries finish before the next gather period starts, so that we don't have potential conflicts; the backoff is calibrated to take this into account.
Errors that occurred during a gather-function are logged in the metadata part of the IO archive (`insights-operator/gathers.json`).

### Scheduling and running of Uploader
`operator.go` starts the background task defined in `pkg/insights/insightsuploader/insightsuploader.go`. The insights uploader periodically checks whether there is any data to upload by calling the summarizer.
If no data to upload is found, the uploader continues with the next cycle.
The uploader cycle runs the `wait.Poll` function, which waits until the config changes or until it is time to upload. The time to upload is set by initialDelay.
If this is the first upload (the lastReportedTime from status is not set), the uploader uses `interval/8+random(interval/8*2)` as the next upload time. This can be reset to 0, though, if it is Safe to upload immediately. If any upload was already reported, the next upload interval is going to be `now - lastReported + interval + 1.2 Jitter`.
Code: This line sets the next interval in regular polling:
```
wait.Jitter(now.Sub(next), 1.2)
```
After initialDelay is calculated, `wait.Until` runs a function that waits for either a config change or for initialDelay to elapse, and then continues. `wait.Until` retriggers this function every 15 seconds. For example, if ClusterVersion is not populated (because the Gatherer hasn't finished its initial gathering), it will retry in 15 seconds.
Eventually the uploader uses `insightsclient.Send` to run the upload itself. It then reports any errors to its Status reporter.

## How is Uploader authenticating to cloud.redhat.com
`insightsclient.Send` creates an HTTP client with the Get method and a URL, which can be configured in config/pod.yaml or, optionally, by the endpoint value of the support secret. The transport is encrypted with TLS, which is set up in the clientTransport() method. This method uses `pkg/authorizer/clusterauthorizer.go` to
add the Bearer token, which is read from the pull-secret secret, section `.auths["cloud.openshift.com"].auth`. clientTransport also sets the proxy, which
can come from the Proxy settings, from the support secret, or from environment variables (cluster-wide).

## Summarising the content before upload
The summarizer is defined in `pkg/record/diskrecorder/diskrecorder.go` and merges all existing archives. That is, it merges together all archives whose names match the pattern `insights-*.tar.gz`, which weren't removed and which are newer than the last check time. The mergeReader then takes one file after another and adds each of them to the archive under its path.
If the file names are unstable (for example when reading from the Api with a Limit and reaching the Limit), it could merge together more files than specified in the Api limit.

## Scheduling the ConfigObserver
Another background task started from the Operator is defined in `pkg/config/configobserver/configobserver.go`. The operator creates the configObserver by calling `configObserver.New`, which sets the default observing interval to 5 minutes.
The Run method again runs wait.Poll every 5 minutes and reads both the support and pull-secret secrets.

## Scheduling diskpruner and what it does
By default, the Insights Operator Gather calls diskrecorder to save newly collected data in a new file, but doesn't remove old files. This is the task of diskpruner. The Operator calls the `recorder.PeriodicallyPrune()` function. It again uses the wait.Until pattern and runs approximately once every second interval.
Internally it calls `diskrecorder.Prune` with `maxAge = interval*6*24` (with 2h it is 12 days); everything older is removed from the IO archive path (by default `/tmp/insights-operator`).



## How is Insights operator setting operator Status
The operator status is based on K8s [Pod conditions](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
Code: What the Insights Operator status conditions look like:
```
$ oc get co insights -o=json | jq '.status.conditions'
[
  {
    "lastTransitionTime": "2020-10-03T04:13:50Z",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-10-03T04:13:50Z",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2020-10-03T04:14:05Z",
    "message": "Monitoring the cluster",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-10-03T04:14:05Z",
    "status": "False",
    "type": "Disabled"
  }
]
```
The status is updated by `pkg/controller/status/status.go`. Status has a background task, which periodically updates
the Operator status from its internal list of Sources. Any component that wants to participate in the Operator's status adds a
SimpleReporter, which returns its current Status. The SimpleReporter is defined in controllerstatus.

Code: In `operator.go`, components add their reporters to the Status Sources:
```
statusReporter.AddSources(uploader)
```
This periodic status updater calls updateStatus, which sets the Operator status after calling merge on all the provided Sources.
The uploader updateStatus determines whether it is Safe to upload, i.e. whether the Cluster Operator status and the Pod's last Exit Code are both healthy.
It relies on the fact that updateStatus is called at the start of the status cycle.



## How is Insights Operator using various Api Clients
Internally, Insights Operator talks to the Kubernetes Api server over Http Rest queries. Each query is authenticated by a Bearer token.
To see an actual Rest query being used, you can try:
```
$ oc get pods -A -v=9
I1006 12:26:33.972634   66541 loader.go:375] Config loaded from file:  /home/mkunc/.kube/config
I1006 12:26:33.977546   66541 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: oc/4.5.0 (linux/amd64) kubernetes/9933eb9" -H "Authorization: Bearer Xy9HoVzNdsRifGr3oCIl7pfxwkeqE2u058avw6o969w" 'https://api.sharedocp4upi43.lab.upshift.rdu2.redhat.com:6443/api/v1/pods?limit=500'
I1006 12:26:36.075230   66541 round_trippers.go:443] GET https://api.sharedocp4upi43.lab.upshift.rdu2.redhat.com:6443/api/v1/pods?limit=500 200 OK in 2097 milliseconds
I1006 12:26:36.075284   66541 round_trippers.go:449] Response Headers:
I1006 12:26:36.075300   66541 round_trippers.go:452]     Audit-Id: 53ad17b9-c3fe-4166-9693-2bacf60f7dcc
I1006 12:26:36.075313   66541 round_trippers.go:452]     Cache-Control: no-cache, private
I1006 12:26:36.075326   66541 round_trippers.go:452]     Content-Type: application/json
I1006 12:26:36.075347   66541 round_trippers.go:452]     Vary: Accept-Encoding
I1006 12:26:36.075370   66541 round_trippers.go:452]     Date: Tue, 06 Oct 2020 10:26:36 GMT
I1006 12:26:36.467245   66541 request.go:1068] Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1","metadata":{"selfLink":"/api/v1/pods"
... CUT HERE
```

But adding the Bearer token and creating the Rest query is all handled automatically for us by Clients, which are generated, type-safe golang libraries,
like [github.com/openshift/client-go](https://github.com/openshift/client-go) or [github.com/kubernetes/client-go](https://github.com/kubernetes/client-go).
Both of these libraries are generated by automation, which specifies the Api repo and the Api Group they are generated from.

All clients are created at or near the place where they are going to be used; we pass around the configs that were created from the KUBECONFIG envvar defined in the cluster.
The reason for doing this is that there are many clients, each of which is cheap to create, and passing around the config is simple, while the config also doesn't change much over time.
On the other hand, it's quite cumbersome to pass around a bunch of clients, the number of which changes by the day, with no benefit.

## How are the credentials used in clients
The IO deployment [manifest](manifests/06-deployment.yaml) specifies the service account operator (serviceAccountName: operator). This is the account under which Insights Operator runs, reads its configuration, and also reads the metrics.
Because Insights Operator needs quite powerful credentials to access cluster-wide resources, it has one more service account called gather. It is created
in the [manifest](manifests/03-clusterrole.yaml).
Code: To verify that the gather account has the right permissions to call the verb list on the machinesets resource, I can use:
```
kubectl auth can-i list machinesets --as=system:serviceaccount:openshift-insights:gather
yes
```
This account is used for impersonation by any client used in Gather Api calls. The impersonated account is set in operator.go:
Code: In operator.go, a specific Api client uses the impersonated account
```
	gatherKubeConfig := rest.CopyConfig(controller.KubeConfig)
	if len(s.Impersonate) > 0 {
		gatherKubeConfig.Impersonate.UserName = s.Impersonate
	}
	// ... and later on this impersonated config is used to create other clients
	gatherConfigClient, err := configv1client.NewForConfig(gatherKubeConfig)
```

Code: The impersonated account is specified in config/pod.yaml (or config/local.yaml) using:
```
impersonate: system:serviceaccount:openshift-insights:gather
```
To test whether the client has the right permissions, the command mentioned above can be used with the verb, api and service account.

Note: I was only able to test missing permissions on OCP 4.3; on later versions the check always seems to pass. Maybe higher versions
don't have RBAC enabled.

Code: Example error returned from the Api, in this case when getting the config from imageregistry.
```
configs.imageregistry.operator.openshift.io "cluster" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "configs" in API group "imageregistry.operator.openshift.io" at the cluster scope
```

## How Api extensions works
If a cloud-native application wants to add a Kubernetes Api endpoint, it needs to define it using [K8s Api extensions](https://kubernetes.io/docs/concepts/extend-kubernetes/) and it needs to define a Custom Resource Definition. OpenShift itself defines them in [github.com/openshift/api](https://github.com/openshift/api) (ClusterOperators, Proxy, Image, ..). Thus, to use the Api of OpenShift, we need to use OpenShift's client-go generated client.
If we need to use the Api of some other Operator, we need to find out whether that Operator defines an Api.

Typically, when an operator defines a new CRD type, this type is defined inside its repo (for example [Machine Config Operator's MachineConfig](https://github.com/openshift/machine-config-operator/tree/master/pkg/apis/machineconfiguration.openshift.io)).

To talk to a specific Api, we need generated clientset and generated lister types for the CRD type. There are three possibilities:
- The Operator generates neither clientset nor lister types
- The Operator generates only lister types
- The Operator generates both clientset and lister types

Machine Config Operator defines:
- its Lister types [here](https://github.com/openshift/machine-config-operator/tree/master/pkg/generated/listers/machineconfiguration.openshift.io/v1)
- its ClientSet [here](https://github.com/openshift/machine-config-operator/blob/master/pkg/generated/clientset/versioned/clientset.go)

Normally such generated code is not intended for other consumers, unless it is prepared in a separate api library. For example,
[Operator Lifecycle Manager](https://github.com/operator-framework/operator-lifecycle-manager) defines its CRD types [here](https://github.com/operator-framework/api/tree/master/pkg/operators/v1alpha1). The Operator Framework exposes only CRD and lister types in its Api, not a ClientSet.

One problem with adding a new operator to go.mod is that the other operator will usually have its own reference to k8s/api (and the related k8s/library-go), which might be different than what Insights Operator is using; this can cause issues during compilation (when the referenced Operator uses Api from a newer k8s api).

If it is impossible to reference the operator, or the operator doesn't expose generated Lister or ClientSet types, then in all these cases where we don't have a type-safe
Api we can still use the non-type-safe [dynamic client](https://pkg.go.dev/k8s.io/client-go/dynamic). There are two cases: Lister types exist but no ClientSet, or no Lister types exist at all. Both have examples [here](https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.6.2/pkg/client#example-Client-List).
Such a client is used in [GatherMachineSet](pkg/gather/clusterconfig/clusterconfig.go).


## Gathering the data

### clusterconfig
When `periodic.go` calls the Gather method of the `clusterconfig` Gatherer, the call is handled [here](https://github.com/openshift/insights-operator/blob/master/pkg/gather/clusterconfig/0_gatherer.go#L99).

The clusterconfig Gatherer starts each gather-function in its own goroutine with a dedicated channel for sending back its results.
Each gather-function is a separate entity; each creates its own clients using the configs present in the `Gatherer` object that was passed down as a parameter.
We further divided the gather-functions into 2 main parts:
1. the 'adapter-part', which is called by `Gatherer.Gather` and named `Gather<Something>`; it handles the creation of the clients and the communication with the `Gatherer`.
2. the 'core-part', which holds the actual logic of what to gather and is named `gather<Something>`; the clients required for this are passed in as arguments by the 'adapter-part'.

Gather-functions are IO bound and don't use much CPU, so giving each of them a goroutine doesn't stress the CPU but gives us an 'async' way of making REST calls, which improves performance greatly.

After starting the goroutines, the Gatherer starts monitoring the channels. When it receives a result, it will:
- Store the received `record`s using the provided `record.Interface`'s `Record` method.
- Store some metadata about the gather-function.
- Collect the errors accordingly. Errors are accumulated over all the gather-functions and returned as one combined error.

Each result is stored in a record.Item as a Marshalable item. It uses either the golang Json marshaller or the K8s Api serializers. The latter have to be explicitly registered in the init func. The record is written to the archive under its Name, which specifies the full relative path including folders. The extension of a particular record file is defined by the GetExtension() func; most of them are "json" today, except for metrics or id.

The `gatherFunctions` map is where we reference all the gather-functions we have within the `clusterconfig` package.
Each has an id (the key in the map), which can be used to execute only a selection of the gather-functions (according to the default config, all gather-functions are executed).
Furthermore, each gather-function is categorized as either:
- `important`, meaning that if the gather-function has an error, we notify `periodic.go` about it, which handles it accordingly.
- `failable`, meaning that if the gather-function has an error, we just log it and add it to our metadata.

This is necessary as we are expanding into gathering data about resources that are not guaranteed to be present on the cluster. By default, if a resource is not present we shouldn't see an error, but it's better to be safe.

## Downloading and exposing Archive Analysis
After the archive is uploaded successfully, the progress monitoring task starts. By default it waits 1m before checking whether the results of the analysis of the archive (done by an external pipeline in cloud.redhat.com) are available. The report contains a LastUpdatedAt timestamp, which is used to verify whether the report has changed its state (for this cluster) since the last time. If there was no
update (yet), it retries the download of the analysis, because we have uploaded the data, but no analysis was provided back yet.
The successfully downloaded report is then exposed as the IO metric health_statuses_insights.
Code: Example of reported metrics:
```
# HELP health_statuses_insights [ALPHA] Information about the cluster health status as detected by Insights tooling.
# TYPE health_statuses_insights gauge
health_statuses_insights{metric="critical"} 0
health_statuses_insights{metric="important"} 0
health_statuses_insights{metric="low"} 1
health_statuses_insights{metric="moderate"} 1
health_statuses_insights{metric="total"} 2
```

## Configuring what to gather
In the yaml config there is a field named `gather`. It expects a list of strings, where each string is an id connected to a gather function.
Adding such an id to the list means that the corresponding gather function is run.
If nothing is set in the `gather` list, no gathering takes place and an error is raised.
There is a special id named `ALL`; if it is in the list, every gather function is run.
The id of each gather function can be found in `docs/gathered-data.md`, beside the `Id in config:` text for each section.
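
The selection rules can be sketched as a small function. The name `selectGatherers` and the error messages are hypothetical, and rejecting unknown ids is an assumption made for this illustration; the text above does not say how unknown ids are treated.

```go
package main

import (
	"errors"
	"fmt"
)

// allID is the special id that selects every gather function.
const allID = "ALL"

// selectGatherers resolves the configured gather list against the known ids.
func selectGatherers(configured, known []string) ([]string, error) {
	if len(configured) == 0 {
		return nil, errors.New("no gather functions are specified to run")
	}
	knownSet := make(map[string]bool, len(known))
	for _, id := range known {
		knownSet[id] = true
	}
	var selected []string
	for _, id := range configured {
		if id == allID {
			// Return a copy so callers cannot modify the known list.
			return append([]string(nil), known...), nil
		}
		if !knownSet[id] {
			// Assumption: unknown ids are treated as a configuration error.
			return nil, fmt.Errorf("unknown gather function id %q", id)
		}
		selected = append(selected, id)
	}
	return selected, nil
}

func main() {
	known := []string{"pdbs", "metrics", "nodes"}
	fmt.Println(selectGatherers([]string{"ALL"}, known))   // [pdbs metrics nodes] <nil>
	fmt.Println(selectGatherers([]string{"nodes"}, known)) // [nodes] <nil>
	fmt.Println(selectGatherers(nil, known))               // [] no gather functions are specified to run
}
```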

#### Example for using special id `ALL`
```
gather:
  - ALL
```

#### Example for using individual ids
```
gather:
 - pdbs
 - metrics
 - operators
 - container_images
 - nodes
 - config_maps
 - version
 - id
 - infrastructures
 - networks
 - authentication
 - image_registries
 - image_pruners
 - feature_gates
 - oauths
 - ingress
 - proxies
 - certificate_signing_requests
 - crds
 - host_subnets
 - machine_sets
 - install_plans
 - service_accounts
 - machine_config_pools
 - container_runtime_configs
 - stateful_sets
```