OMIT( ) function

Returns a string with one or more specified substrings removed.

Syntax

OMIT(string1, string2 <,case_sensitive>)

Parameters

string1

Character. The string to remove the substring or substrings from.

string2

Character. The substring or substrings to remove. This parameter must be specified using a quoted string. Use commas to separate multiple substrings. Use a space after a comma only if it is part of the substring you want to remove. If the double quotation mark character occurs in any of the substrings, enclose the entire parameter in single quotes (i.e., ‘ ’). To omit a comma, place a single comma last in the list of substrings, followed immediately by the closing quotation mark (see the final example below).

case_sensitive

Optional. Logical. Specify T to make substrings specified for removal case-sensitive, or F to ignore case. If the case_sensitive parameter is omitted the default value of T is used.

Output

Character.

Remarks

The OMIT( ) function removes one or more substrings from a string. It differs from functions such as CLEAN( ), EXCLUDE( ), INCLUDE( ), and REMOVE( ) because it matches and removes characters on a substring basis rather than on a character-by-character basis, allowing you to remove specific words, abbreviations, or repeated sequences of characters from a string without affecting the remainder of the string.

OMIT( ) can improve the effectiveness of the LEVDIST( ) or ISFUZZYDUP( ) functions, or the FUZZYDUP command, by removing generic elements such as “Corporation” or “Inc.” from field values. Removal of generic elements focuses the string comparisons performed by LEVDIST( ), ISFUZZYDUP( ), and FUZZYDUP on just the portion of the strings where a meaningful difference may occur.

If you specify multiple substrings for removal, the order in which you list them in the string2 parameter can affect the output results. When the OMIT( ) function is processed, the first substring is removed from all values that contain it, the second substring is then removed from all values that contain it, and so on. If one substring forms part of another substring – for example, “Corp” and “Corporation” – removing the shorter substring first also alters values containing the longer substring (“Corporation” becomes “oration”) and prevents the longer substring from being found. To avoid this situation, specify longer substrings before any shorter substrings they contain. For example, OMIT(vendor," Corporation, Corp., Corp").

You can specify single-character substrings, such as punctuation marks, special characters, and a blank space, which further reduces the generic material in strings. It may be more efficient to remove a single character first, such as a period or a blank, which reduces the number of substring variations you subsequently need to specify. Compare the third and fourth examples below. They both return the same results, but the fourth example is more efficient.

Blanks or spaces in substrings are treated like any other character. You must explicitly specify each blank you want removed as part of a substring – for example, if you specify an ampersand without any blanks (“&”), “Ricoh Sales & Service” becomes “Ricoh Sales  Service”. If you include blanks (“ & ”), “Ricoh Sales & Service” becomes “Ricoh SalesService”. If you specify a blank that is not part of the substring, the substring will not be found – for example, if you specify an ampersand with blanks (“ & ”), “Ricoh Sales&Service” remains unchanged. When using commas to separate multiple substrings, follow the comma with a space only if it corresponds with the actual substring you want to remove. One approach to dealing with blanks is to remove all blanks from a test field first, by specifying a blank as a single-character substring prior to specifying any other substrings.

After using OMIT( ) to created a computed field, review the contents of the field to confirm you have not inadvertently omitted meaningful portions of strings. For example, omitting “Co” gets rid of a common abbreviation for “Company”, but it also removes the letters “Co” from two locations in “Coca-Cola”.

Examples

Example Return value

OMIT("Intercity Couriers Corporation", " Corporation, Corp.")

Intercity Couriers

OMIT("Inter-city Couriers Corp.", " Corporation, Corp.")

Inter-city Couriers

The Levenshtein Distance between this returned value and the returned value in the example above is 1. If the generic elements are not removed, the distance between the two examples is 8, which could allow the values to escape detection as fuzzy duplicates.

OMIT(vendor," Corporation, Corp., Corp, Inc., Inc, Ltd., Ltd")

The vendor field values with all generic elements such as “Corporation” and “Inc.” removed.

OMIT(vendor," ,.,Corporation,Corp,Inc,Ltd")

The vendor field values with all generic elements such as “Corporation” and “Inc.” removed.

This example and the one preceding it both return the same results but the syntax for this example is more efficient.

OMIT(vendor," Corporation, Corp,")

The vendor field values with “Corporation” and “Corp” removed, and all commas removed.

Related reference
FUZZYDUP command
ISFUZZYDUP( ) function
LEVDIST( ) function


(C) 2013 ACL Services Ltd. All Rights Reserved. | Send feedback