Test 9 - Phantom Employees

A phantom (ghost) employee is someone on the payroll who doesn't work for the company. By falsifying personnel or payroll records, a fraudster can issue paychecks to this phantom employee. Then, the fraudster, or even an accomplice, can cash or deposit these paychecks. The phantom may be completely fictitious, or they might be a real person who doesn't work for the company. If the phantom is a real person, they're often a friend or relative of the perpetrator. In order for a phantom employee scheme to work, four things must happen:

Phantom added to the payroll
Timekeeping and wage information collected and entered
Paycheck issued to the phantom
Check delivered to the perpetrator or an accomplice.

Because they're already in the payroll system, one of the easiest ways to create a phantom is to extend the termination date of an employee who is leaving, as we saw in the “Paid After Termination” test.

Because phantoms lurk in the guise of genuine employees, they're extremely difficult to detect, especially in companies with large numbers of employees. To uncover phantom employees, there are several tests that we can run in Analytics, such as searching for PO box addresses, duplicate addresses, duplicate social security numbers, or duplicate bank account numbers. If these tests don’t produce any results, it doesn’t necessarily mean that there aren't any phantoms. Sometimes, the only way to detect them is through a basic manual review, such as a head count.

In our test today, we'll create a conditional computed field to harmonize common address endings and convert all of our characters to upper case. Harmonizing this data will make it much easier to look for duplicates (for example, consider the problems that might arise when trying to compare "Boulevard" to "Blvd"). Then, we'll run Analytics' Fuzzy Duplicates command on our cleaned up address field to identify matches that may be indicative of phantom employees.

PO Boxes

Another common red flag to look out for in address fields is PO Boxes. Uncovering a specific string can be difficult, particularly when there are several fields that it might be hiding in. For cases where the string might be embedded and/or within one of several fields, the FIND( ) function comes in handy!

Risk

Phantom employees on the payroll may be used to channel funds to an unauthorized party, or as a vehicle for fraud.

Objective

Identify employees with addresses similar enough that they may be the same.

Analyzing Data

In the Empmast table, create a conditional computed field, c_AddressCleanup, to harmonize the spelling of common address endings and convert all characters to upper case. The expression should find “STREET”, “ROAD”, AVENUE”, and “BOULEVARD” and rename them to “ST”, “RD”, AVE”, and “BLVD”:

Show me

Here's the logic behind the c_AddressCleanup field:

Condition	Value
FIND( " street " , address )	REPLACE( UPPER(address), " STREET ", " ST "
FIND( " road " , address )	REPLACE( UPPER(address), “ ROAD ", " RD ")
FIND( " avenue " , address )	REPLACE( UPPER(address), " AVENUE ", " AVE ")
FIND( " boulevard " , address )	REPLACE( UPPER(address), " BOULEVARD ", " BLVD ")
Default value	UPPER(address)

OPEN EmpMast
DEFINE FIELD c_AddressCleanup COMPUTED

REPLACE(UPPER(address), " STREET ", " ST ") IF FIND( " street" , address )
REPLACE( UPPER(address), " ROAD ", " RD ") IF FIND( " road " , address )
REPLACE( UPPER(address), " AVENUE ", " AVE ") IF FIND( " avenue " , address )
REPLACE( UPPER(address), " BOULEVARD ", " BLVD ") IF FIND( " boulevard " , address )
UPPER(address)

Create a computed field, c_FuzzyKey, that concatenates (combines) the c_AddressCleanup, city, and state_province fields into one and will remove any blank spaces, using the syntax:
- EXCLUDE(c_AddressCleanup + city + state_province," ")
Show me
OPEN EmpMast

DEFINE FIELD c_AddressCleanup COMPUTED
REPLACE(UPPER(address), " STREET ", " ST ") IF FIND( " street" , address )
REPLACE( UPPER(address), " ROAD ", " RD ") IF FIND( " road " , address )
REPLACE( UPPER(address), " AVENUE ", " AVE ") IF FIND( " avenue " , address )
REPLACE( UPPER(address), " BOULEVARD ", " BLVD ") IF FIND( " boulevard " , address )
UPPER(address)
DEFINE FIELD c_FuzzyKey COMPUTED EXCLUDE(c_AddressCleanup + city + state_province," ")
Run the Fuzzy Duplicates command on the c_FuzzyKey field. Enter a Difference Threshold of 2 and deselect the checkboxes for Difference Threshold and Difference Percentage. From List Fields, select all fields. Name the new table r_SimilarAddress.
Difference threshold is a measurement of the smallest amount, in bytes, that compared values can differ in order to be considered "different." Using a difference threshold of two means values that differ by two bytes or less are included in the result. For example, "MAIN STRET" and "MAIN STREET" have a difference of one byte, so they'd be included in the result.
Show me
OPEN EmpMast
DEFINE FIELD c_AddressCleanup COMPUTED
REPLACE(UPPER(address), " STREET ", " ST ") IF FIND( " street" , address )
REPLACE( UPPER(address), " ROAD ", " RD ") IF FIND( " road " , address )
REPLACE( UPPER(address), " AVENUE ", " AVE ") IF FIND( " avenue " , address )
REPLACE( UPPER(address), " BOULEVARD ", " BLVD ") IF FIND( " boulevard " , address )
UPPER(address)
DEFINE FIELD c_FuzzyKey COMPUTED EXCLUDE(c_AddressCleanup + city + state_province," ")
FUZZYDUP ON c_FuzzyKey OTHER ALL LEVDISTANCE 2 TO "r_SimilarAddress"
OPEN r_SimilarAddress

Conclusion

While manually inspecting records may sometimes be the only way to uncover phantom employees, it's nice to perform some analysis, such as fuzzy duplicate matching, before resorting to such tedious work. As we've just seen, Analytics' Fuzzy Duplicates command can be a more powerful tool than the Duplicates command when looking for values that are similar but not necessarily identical.

Fuzzy Duplicates matching is particularly useful when comparing free-text fields that may be more prone to typos and spelling mistakes. Depending on your data and results, you can adjust the Difference Threshold accordingly. Alternatively, you may wish to set a Difference Percentage, which will measure the number of bytes that differ as a percentage from the total number of bytes.