This article shows a simple NiFi data flow from the web to HDFS that demonstrates several fundamental capabilities of NiFi, including:

- iterating the flow from an external configuration file
- Groovy scripting that imports an external jar library

The article may be particularly useful for newcomers to NiFi.

At a very high level, an InvokeHTTP processor retrieves, in sequence, the RSS feeds from a list of locally configured URLs. RSS feeds return XML content, and one feed URL returns multiple XML item elements, each representing one news story and structured by child elements such as title and description. A Groovy script transforms each XML-formatted feed response to produce a set of records. Each record is one news story (item element), with the title, description, and so on extracted from the XML as tab-separated values. Additionally, HTML formatting is stripped from those values.

The TSV results for all configured feed URLs are merged and put to HDFS as a single file. From here, Spark, Hive, Pig, HBase, and so on can take over.

Using a configuration file that lists the URLs prevents us from hard-coding one InvokeHTTP processor for each URL, and it allows us to change the list on the fly. We can use this same flow identically for 100 URLs or for one.

This processor reads the content of a configuration file stored locally. Each line of the configuration file nf is the URL of an RSS feed. Important configurations for this processor: it is scheduled to read the file each day at 23:00.

This processor takes the content from GetFile and splits each line into a separate string, creating a list of URLs that are passed to the next processor as splits.

This processor takes each split and extracts its text, which I assign to a new property named target.url. Note that the text is extracted by regex. The matched text for the target.url property of each split is sent in sequence to the next processor. Note that in our case we are matching the original text, to get each URL exactly as originally configured.

Here is where the magic of NiFi processors can be seen. The InvokeHTTP processor performs all the low-level work: it sends an HTTP request to each target.url property passed to it and pulls in the XML response (in the RSS case), which it passes to the next processor. This is done in sequence for each target.url it receives from upstream.

Here the XML response from each feed is transformed by a Groovy script, in this case referenced as a locally stored file. Note that the script uses an external library (commons-io-2.5.jar) that is placed in a directory referenced by the configuration.

Let's take a look at the rss_etl.groovy script.
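The Groovy script itself is not reproduced in this text. As a rough illustration of the kind of transformation it performs — one tab-separated record per RSS item element, with HTML formatting stripped from the values — here is a minimal Python sketch. The sample feed, the function names, and the tag-stripping regex are illustrative assumptions, not the actual rss_etl.groovy code; the real script runs inside NiFi's scripting processor and imports commons-io-2.5.jar, whereas this sketch uses only the Python standard library.

```python
# Hedged sketch of the RSS XML -> TSV transformation described in the article.
# The feed content is inlined here instead of being fetched by InvokeHTTP.
import re
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>First story</title>
      <description>&lt;p&gt;Some &lt;b&gt;HTML&lt;/b&gt; body&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second story</title>
      <description>Plain text body</description>
    </item>
  </channel>
</rss>"""

def strip_html(text):
    """Drop HTML tags and collapse whitespace, mirroring the article's
    'HTML formatting is stripped from those values' step."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text or "")).strip()

def feed_to_tsv(xml_text):
    """Emit one TSV record per <item>: title and description, tab-separated."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        title = strip_html(item.findtext("title"))
        description = strip_html(item.findtext("description"))
        rows.append(f"{title}\t{description}")
    return "\n".join(rows)

print(feed_to_tsv(SAMPLE_FEED))
```

In the NiFi flow, the equivalent Groovy logic reads each flowfile's XML content (the response InvokeHTTP pulled for one target.url) and writes the TSV records back as the flowfile content, which downstream processors then merge and put to HDFS.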