
Your Books Have Patterns You Cannot See. AI Agents Can.

In October 2024, TechSupply GmbH sends your accounts payable department invoice #INV-2024-0847 for 14,200 euros. It is a legitimate invoice for server hardware. The AP clerk reviews the purchase order, confirms the delivery receipt, codes it to the correct expense account, and processes the payment. Everything is clean.

Eight days later, TechSupply sends invoice #INV-2024-0892 for 14,200 euros. Same vendor. Same amount. Different invoice number. The formatting is slightly different -- the line items are described with different wording, the VAT is broken out on a separate line instead of included inline. The AP clerk processing this batch is the same person who handled the first invoice, but she processes 40 to 60 invoices per week. She does not remember the specifics of a single invoice from eight days ago. She checks the PO, confirms the amount, and pays it.

Four months later, during the quarterly audit review, an auditor pulls a vendor spend report sorted by amount. Two identical payments to TechSupply GmbH catch her eye. She flags them. The investigation confirms the duplicate. The refund request goes to TechSupply, which takes three weeks to process it. The accounting entries to reverse and re-record the transaction take another cycle to clean up. Total elapsed time from the duplicate payment to full resolution: nearly five months.

This story is not unusual. The Association of Certified Fraud Examiners estimates that the typical organization loses 5 percent of revenue to fraud and error annually. Duplicate payments alone account for 0.1 to 0.5 percent of total disbursements at most companies -- numbers that sound small until you calculate them against a $50 million spend base. That is $50,000 to $250,000 per year walking out the door because a human being cannot remember the details of an invoice they processed last Tuesday.

Why Human Review Misses Patterns

The instinct is to blame the AP clerk, or to add another layer of review. But the problem is not negligence. It is architecture. Human beings are reviewing financial data under conditions that are structurally hostile to pattern detection.

The first condition is volume. A mid-market company processing 500 to 1,500 vendor invoices per month is asking its AP team to hold the details of every transaction in working memory, or at minimum, to recognize when a new transaction resembles a previous one. Cognitive science is clear on this: human working memory holds roughly four to seven discrete items. Asking someone to cross-reference invoice #892 against the 60 invoices they processed in the prior two weeks is asking them to do something their brain is not designed for.

The second condition is cognitive fatigue. The 30th invoice reviewed in a batch gets less scrutiny than the 3rd. This is not laziness. It is a well-documented phenomenon called vigilance decrement -- the measurable decline in detection accuracy that occurs during sustained attention tasks. Air traffic controllers rotate every two hours for exactly this reason. AP clerks review invoices for eight hours straight.

The third condition is confirmation bias. When an invoice has a valid PO, a matching amount, and a recognized vendor name, the brain's pattern-matching machinery says "this looks right." The clerk is not looking for reasons to reject the invoice. They are looking for confirmation that it is valid. And the duplicate provides exactly that confirmation: valid vendor, valid amount, valid PO reference. Everything checks out, because everything that the clerk is checking genuinely is correct. The problem is the thing they are not checking -- whether this exact payment was already made.

The fourth condition is temporal distance. The original invoice was processed days or weeks ago. Human memory for specific numerical details degrades rapidly. After 48 hours, the probability that an AP clerk remembers the exact amount and vendor of a specific invoice approaches zero, unless that invoice was unusual in some way. Duplicates, by definition, are not unusual. They look exactly like the legitimate transactions around them.

These four conditions are not fixable with training or process improvements. They are inherent to human cognition operating at the volume and velocity of modern financial operations.

Five Types of Financial Anomalies

Duplicate invoices are the most tangible example, but financial data contains at least five distinct categories of anomalies that are invisible at human review speed and visible at computational speed.

Duplicate invoice detection

The detection logic is straightforward in principle: same vendor, same amount (within a configurable tolerance), within a defined date window. The default parameters might be an exact amount match within a 30-day window, but the system should also catch near-duplicates -- invoices where the amount differs by less than 1 percent or $5, which can indicate the same service invoiced with slightly different tax calculations or rounding.

The critical subtlety is that true duplicates often have different invoice numbers. A vendor's billing system generates a new invoice number for each issuance, even if the underlying charge is identical. This means that naive duplicate detection based on invoice numbers alone misses the most common case. The detection must operate on the combination of vendor identity, amount proximity, and temporal proximity -- precisely the dimensions that human memory handles poorly.
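
As a sketch, the matching rule might look like the following in Python. The field names and default thresholds here are illustrative, not a reference implementation:

```python
from datetime import date

# Illustrative defaults -- a real scanner would make these configurable.
AMOUNT_TOLERANCE_PCT = 0.01   # 1 percent
AMOUNT_TOLERANCE_ABS = 5.00   # or $5, whichever is greater
DATE_WINDOW_DAYS = 30

def is_potential_duplicate(inv_a: dict, inv_b: dict) -> bool:
    """Flag two invoices as a potential duplicate pair.

    Matches on vendor identity, amount proximity, and date proximity --
    deliberately ignoring invoice numbers, which usually differ between
    true duplicates.
    """
    if inv_a["vendor_id"] != inv_b["vendor_id"]:
        return False
    tolerance = max(
        AMOUNT_TOLERANCE_ABS,
        AMOUNT_TOLERANCE_PCT * max(inv_a["amount"], inv_b["amount"]),
    )
    if abs(inv_a["amount"] - inv_b["amount"]) > tolerance:
        return False
    return abs((inv_a["date"] - inv_b["date"]).days) <= DATE_WINDOW_DAYS

# The TechSupply pair from the opening story would match: same vendor,
# same amount, eight days apart (exact dates invented for the example).
a = {"vendor_id": "V-TECHSUPPLY", "amount": 14_200.00, "date": date(2024, 10, 2)}
b = {"vendor_id": "V-TECHSUPPLY", "amount": 14_200.00, "date": date(2024, 10, 10)}
assert is_potential_duplicate(a, b)
```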

Benford's Law analysis

In 1938, physicist Frank Benford observed that in naturally occurring numerical datasets, the leading digit is not uniformly distributed. The digit 1 appears as the first digit approximately 30.1 percent of the time. The digit 2 appears 17.6 percent of the time. The digit 9 appears only 4.6 percent of the time. This distribution, now called Benford's Law, holds across a remarkably wide range of datasets: population figures, river lengths, stock prices, and -- critically for our purposes -- financial transaction amounts.

When a set of financial transactions deviates significantly from the Benford distribution, it suggests one of two things: either the data was generated by a non-natural process (such as manual fabrication, where humans tend to over-represent certain digits), or there is a systematic pattern in the data that warrants investigation (such as a large number of transactions clustering at a specific amount).

The chi-squared goodness-of-fit test provides a rigorous way to quantify the deviation. A chi-squared statistic above the critical value for the chosen significance level (typically 0.05) indicates that the observed distribution is unlikely to have occurred by chance. This does not prove fraud. It proves that the data is worth investigating -- which is precisely what an anomaly detector should do.

A human auditor performing Benford's Law analysis needs to extract the data, calculate the digit frequencies, compute the chi-squared statistic, and interpret the results. This takes hours. An automated agent does it in seconds, every night, across every expense category.
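
A minimal sketch of that nightly check, assuming the period's transaction amounts are available as plain floats (the helper names here are hypothetical):

```python
import math
from collections import Counter

# Benford's expected first-digit frequencies: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x: float) -> int:
    # Scientific notation puts the leading significant digit first.
    return int(f"{abs(x):.6e}"[0])

def benford_chi_squared(amounts: list[float]) -> float:
    """Chi-squared goodness-of-fit statistic against Benford's Law."""
    digits = [first_digit(a) for a in amounts if a != 0]
    n = len(digits)
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )

# The critical value for 8 degrees of freedom at alpha = 0.05 is about
# 15.51; a statistic above that flags the category for investigation.
```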

Segregation of duties violations

One of the oldest principles in internal controls is that the person who creates a transaction should not be the same person who approves it. This prevents a single individual from both initiating and authorizing a payment -- a control whose absence, according to the ACFE, is a factor in over 30 percent of occupational fraud cases.

Detecting segregation of duties violations requires comparing the creator and approver fields across all transactions. In a system with proper audit logging, this is a simple query. But "simple" does not mean "done." Most companies check segregation of duties during annual audits, not continuously. An employee who both creates and approves a transaction in March might not be flagged until December -- nine months of exposure.
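
Assuming an audit log that records creator and approver user IDs (the field names below are hypothetical), the continuous version of the check reduces to a one-line filter run every day:

```python
def sod_violations(transactions: list[dict]) -> list[dict]:
    """Transactions where the creator also approved -- the equivalent of
    SELECT * FROM transactions WHERE created_by = approved_by.
    """
    return [
        t for t in transactions
        if t.get("approved_by") and t["created_by"] == t["approved_by"]
    ]
```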

Unusual payment patterns

Financial data contains timing and distribution patterns that are invisible in day-to-day processing but become obvious in aggregate. Round-number clustering is a classic indicator: if 15 percent of your expense reimbursements are for exactly $4,999 and your approval threshold is $5,000, that is not a coincidence. It is a pattern of threshold avoidance that warrants investigation.

Weekend and holiday transactions can also be informative. If your business operates Monday through Friday but your accounting data shows journal entries posted at 2 AM on a Sunday, that entry deserves scrutiny. Time-of-day patterns can reveal automation issues (batch jobs running at unexpected times) or, in more concerning cases, unauthorized activity conducted outside business hours.

Payment frequency anomalies round out this category. A vendor that historically receives one payment per month and suddenly receives three in a single week may be legitimate (a large project ramping up) or may indicate duplicate processing or unauthorized payments.
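
A sketch of two of these checks -- threshold clustering and off-hours posting -- with illustrative field names and limits:

```python
APPROVAL_THRESHOLD = 5_000.00   # illustrative approval limit
NEAR_BAND = 0.02                # "just below" = within 2 percent of threshold

def threshold_avoidance_rate(amounts: list[float]) -> float:
    """Share of transactions clustered just below the approval threshold."""
    if not amounts:
        return 0.0
    near = [
        a for a in amounts
        if APPROVAL_THRESHOLD * (1 - NEAR_BAND) <= a < APPROVAL_THRESHOLD
    ]
    return len(near) / len(amounts)

def off_hours_entries(entries: list[dict]) -> list[dict]:
    """Journal entries posted on weekends or outside 06:00-20:00."""
    return [
        e for e in entries
        if e["posted_at"].weekday() >= 5        # Saturday or Sunday
        or not 6 <= e["posted_at"].hour < 20    # e.g. 2 AM on a Sunday
    ]
```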

Vendor risk indicators

New vendors are inherently higher risk than established ones. A vendor created last week with a first transaction of $75,000 is not necessarily fraudulent, but it is statistically more likely to be problematic than a vendor with a two-year transaction history and consistent payment patterns. The anomaly is the combination of newness and magnitude.

Other vendor risk indicators include PO box addresses (which can be legitimate but are over-represented in fraudulent vendor schemes), multiple vendors sharing the same bank account (a strong indicator of fictitious vendor fraud), and vendors whose names are suspiciously similar to employee names (a classic indicator of employee-created shell companies).
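
One way to combine these indicators is a simple additive score. The weights below are illustrative, not calibrated values:

```python
from datetime import date

def vendor_risk_score(vendor: dict, today: date) -> float:
    """Additive score over the indicators above; no single one is conclusive."""
    score = 0.0
    is_new = (today - vendor["created_on"]).days < 30
    if is_new and vendor["first_txn_amount"] > 50_000:
        score += 0.4    # brand-new vendor with a large first transaction
    if vendor.get("is_po_box_address"):
        score += 0.2    # sometimes legitimate, over-represented in fraud
    if vendor.get("shares_bank_account"):
        score += 0.3    # strong fictitious-vendor indicator
    if vendor.get("name_matches_employee"):
        score += 0.3    # possible employee-created shell company
    return min(score, 1.0)
```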

None of these indicators are conclusive on their own. Each one has legitimate explanations. But taken together, and monitored continuously rather than annually, they form a detection mesh that catches problems months earlier than traditional audit cycles.

How It Works

An Anomaly Detector agent is not a traditional software feature. It is an autonomous process that runs on a schedule, applies analytical techniques to financial data, and produces findings for human review. Understanding what it does -- and what it does not do -- is important.

The agent runs on two schedules. A daily quick scan covers the most time-sensitive checks: duplicate invoice detection for the day's transactions, segregation of duties verification for the day's approvals, and a scan of any new vendor activity. This daily scan completes in minutes and produces immediate findings if something warrants attention.

A monthly deep scan runs more computationally intensive analyses: Benford's Law testing across all expense categories for the prior month, payment pattern analysis across all vendors, vendor risk scoring for all active vendors, and trend analysis comparing current-month patterns against established baselines. This deeper analysis produces a monthly forensic report.
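
As a sketch, a configuration for the two schedules might look like this (the check names and cron expressions are illustrative, not Artifi's actual settings):

```python
SCAN_SCHEDULES = {
    "daily_quick_scan": {
        "cron": "0 2 * * *",              # every night at 02:00
        "checks": [
            "duplicate_invoices",          # day's txns vs. 30-day lookback
            "segregation_of_duties",       # day's approvals
            "new_vendor_activity",
        ],
    },
    "monthly_deep_scan": {
        "cron": "0 3 1 * *",              # 03:00 on the first of the month
        "checks": [
            "benford_by_expense_category",
            "payment_pattern_analysis",
            "vendor_risk_scoring",
            "baseline_trend_comparison",
        ],
        "output": "monthly_forensic_report",
    },
}
```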

Critically, the agent is read-only. It observes transactions. It analyzes patterns. It creates findings. But it cannot modify, delete, or approve any transaction. This is a fundamental design principle, not just a permission setting. An agent that can both detect anomalies and modify transactions has a conflict of interest. The detection agent observes and reports. A human being -- informed by the agent's findings -- makes decisions.

Each finding carries a confidence score. A duplicate invoice match where the amount is identical, the vendor is the same, and the date gap is three days might receive a confidence score of 0.95 -- high enough to warrant immediate attention. A Benford's Law deviation that is statistically significant but mild might receive a score of 0.65 -- informative but not urgent. This confidence scoring serves a critical purpose: it prevents alarm fatigue. A system that flags everything with equal urgency quickly gets ignored. A system that distinguishes between "you should look at this today" and "this is worth knowing about" maintains its credibility over time.

The agent also maintains memory. After each scan, it stores baselines: average monthly spend by vendor, typical transaction volumes by day of week, digit distribution profiles by expense category. These baselines are not static. They update with each month of data, which means the agent's understanding of "normal" for your specific organization becomes more precise over time.
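
A minimal sketch of one such baseline -- an incrementally updated per-vendor spend average (the structure is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class VendorBaseline:
    """Rolling per-vendor spend baseline, refined after each monthly scan."""
    monthly_spend_avg: float = 0.0
    months_observed: int = 0

    def update(self, month_spend: float) -> None:
        # Incremental mean: every new month sharpens "normal" for this vendor.
        self.months_observed += 1
        self.monthly_spend_avg += (
            month_spend - self.monthly_spend_avg
        ) / self.months_observed
```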

The Audit Trail Advantage

Every finding the Anomaly Detector produces is logged as a structured audit event. The log includes the scan timestamp, the detection rule that triggered, the specific transactions involved, the confidence score, the baseline data used for comparison, and the eventual resolution (when a human reviews and dispositions the finding).
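
An audit event for the TechSupply duplicate from the opening story might carry a shape like this (illustrative values, not Artifi's actual schema):

```python
finding_event = {
    "scan_timestamp": "2024-11-03T02:14:07Z",
    "rule": "duplicate_invoice",
    "transactions": ["INV-2024-0847", "INV-2024-0892"],
    "confidence": 0.92,
    "baseline": {"vendor_monthly_spend_avg": 14_500.00},
    "resolution": {
        "status": "confirmed_duplicate",
        "reviewed_by": "user:ap-manager",
        "reviewed_at": "2024-11-03T09:41:00Z",
    },
}
```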

This audit trail has value beyond the immediate detection. When external auditors arrive for the annual review and ask "what controls do you have in place for duplicate payment detection?", the answer is not "our AP team tries to catch them." The answer is "an automated system scans every transaction daily, applies six detection heuristics, and logs all findings with confidence scores and resolutions. Here is the complete log for the audit period."

This transforms the conversation with auditors. Instead of defending the absence of controls, you are demonstrating the presence of systematic, documented, continuous monitoring. The audit trail does not just record what the agent found. It records that the agent looked -- every day, without exception, across every transaction.

For companies subject to SOX compliance, this continuous monitoring provides evidence of control effectiveness that is difficult to achieve through periodic manual testing. The control is not "we review invoices" (subjective, unverifiable). The control is "an automated agent applies defined detection rules to 100 percent of transactions daily, with logged results" (objective, verifiable, continuous).

Duplicate Detection in Practice

Let us walk through the duplicate detection process in detail, because it illustrates how computational analysis differs from human review.

The agent begins by loading all accounts payable invoices for the scan period. For a daily scan, this is the current day's transactions plus a lookback window (typically 30 days, configurable). For a monthly scan, this is the full month plus the prior month for overlap detection.

The first step is grouping by vendor. The agent creates clusters of transactions for each vendor, which immediately reduces the comparison space. A company with 200 vendors and 1,000 monthly transactions averages 5 transactions per vendor -- far more manageable than comparing all 1,000 transactions against each other.

Within each vendor cluster, the agent compares amounts. An exact match is the strongest signal, but the agent also checks for near-matches: amounts within a configurable tolerance (default: 1 percent or $5, whichever is greater). This catches duplicates where tax calculations differ slightly, where currency conversion produces rounding differences, or where a credit note was applied to one instance but not the other.

For each amount match, the agent checks the date proximity. Two invoices from the same vendor for the same amount twelve months apart are almost certainly separate legitimate transactions. Two invoices within five days of each other are much more likely to be duplicates. The date window is configurable because different businesses have different transaction patterns -- a company that pays vendors weekly has a different normal pattern than one that pays monthly.

When a potential duplicate is identified, the agent examines the invoice numbers. If the numbers are identical, it is a true duplicate -- the same invoice submitted and processed twice. If the numbers differ, it is a near-duplicate -- potentially the same charge invoiced separately, which is actually the harder case to catch and the more common one in practice.

The confidence score integrates all of these signals. An exact amount match, within three days, with different invoice numbers, scores around 0.92. A near-match within 1 percent, within seven days, scores around 0.75. An exact match within 30 days scores around 0.80. The scoring function is not arbitrary -- it reflects the empirical probability that a given combination of signals represents a genuine duplicate.

The finding is then created with all supporting data: both transaction IDs, both invoice numbers, both dates, the amount comparison, the confidence score, and a human-readable explanation of why this was flagged. The human reviewer sees everything they need to make a decision without performing any additional research.
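
Putting the steps together, here is a sketch of the whole scan -- grouping, amount and date comparison, invoice-number check, scoring, and finding creation. Field names and score weights are illustrative:

```python
from collections import defaultdict

TOL_PCT, TOL_ABS, WINDOW_DAYS = 0.01, 5.00, 30   # illustrative defaults

def scan_for_duplicates(invoices: list[dict]) -> list[dict]:
    """Run the walkthrough above end to end and return structured findings."""
    findings = []
    by_vendor = defaultdict(list)
    for inv in invoices:                               # step 1: group by vendor
        by_vendor[inv["vendor_id"]].append(inv)

    for cluster in by_vendor.values():
        cluster.sort(key=lambda i: i["date"])
        for idx, a in enumerate(cluster):
            for b in cluster[idx + 1:]:
                gap = (b["date"] - a["date"]).days     # step 3: date proximity
                if gap > WINDOW_DAYS:
                    break                              # sorted: no later match
                tol = max(TOL_ABS, TOL_PCT * max(a["amount"], b["amount"]))
                if abs(a["amount"] - b["amount"]) > tol:
                    continue                           # step 2: amount proximity
                exact = a["amount"] == b["amount"]
                same_number = a["invoice_no"] == b["invoice_no"]   # step 4
                # Step 5: blend the signals into a confidence score
                # (values mirror the illustrative examples above).
                if same_number:
                    confidence = 0.95
                elif exact and gap <= 3:
                    confidence = 0.92
                elif exact:
                    confidence = 0.80
                else:
                    confidence = 0.75
                findings.append({
                    "rule": "duplicate_invoice",
                    "transactions": [a["id"], b["id"]],
                    "invoice_numbers": [a["invoice_no"], b["invoice_no"]],
                    "gap_days": gap,
                    "confidence": confidence,
                    "explanation": (
                        f"{'Exact' if exact else 'Near'} amount match to vendor "
                        f"{a['vendor_id']} within {gap} days"
                    ),
                })
    return findings
```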

The Compounding Baseline

The most significant advantage of automated anomaly detection is not any single scan. It is the compounding effect of baseline accumulation over time.

In the first month of operation, the agent has no history. Its baselines are generic: Benford's Law distributions from statistical theory, industry-standard thresholds for duplicate detection, default parameters for timing analysis. The initial findings will include false positives -- transactions flagged as anomalous that are actually normal for your specific business.

By month three, the agent has accumulated enough data to begin building organization-specific baselines. It knows that your company typically processes 450 to 550 AP invoices per month. It knows that Vendor A typically bills between $8,000 and $12,000 per month. It knows that your expense digit distribution skews slightly toward the digit 2 (because your most common expense category falls in the $200 to $299 range), and that this skew is normal for your business, not an anomaly.

By month six, the baselines are robust. The false positive rate has dropped significantly because the agent's definition of "anomalous" is calibrated to your organization's specific patterns, not to generic thresholds. A transaction that would have been flagged in month one -- because it deviated from the textbook distribution -- is now correctly identified as normal for your business.

By month twelve, the agent has a full annual cycle of data. It understands seasonality: December expenses are higher because of annual renewals, March has a spike from Q1 bonuses, August is quieter because half the team is on vacation. Seasonal patterns that would confuse a month-over-month analysis are accounted for in the baseline.
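
One simple way to make a baseline seasonality-aware is to keep history per calendar month, as in this sketch (the structure and tolerance band are illustrative):

```python
from collections import defaultdict

class SeasonalBaseline:
    """Spend baseline kept per calendar month, so December renewals or a
    quiet August read as normal rather than anomalous."""

    def __init__(self) -> None:
        self.history = defaultdict(list)    # month (1-12) -> observed totals

    def observe(self, month: int, total: float) -> None:
        self.history[month].append(total)

    def is_anomalous(self, month: int, total: float, band: float = 0.25) -> bool:
        past = self.history[month]
        if not past:
            return False                    # no baseline yet: cannot judge
        expected = sum(past) / len(past)
        return abs(total - expected) > band * expected
```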

The practical effect is that the signal-to-noise ratio improves continuously. In month one, the agent might produce 20 findings, of which 5 are genuinely worth investigating. By month twelve, it might produce 5 findings, of which 4 are genuinely worth investigating. The total number of findings decreases not because there are fewer anomalies, but because the definition of "anomalous" has been refined by twelve months of learning what is normal.

This compounding baseline is impossible to replicate with periodic manual analysis. An auditor who performs duplicate payment testing once a year starts from scratch each time -- or at best, from the prior year's workpapers. The institutional knowledge of "what is normal" lives in individual auditors' heads, not in a structured, continuously updated data model.

What This Costs Versus External Audit

A forensic audit engagement from a Big Four accounting firm typically runs between $50,000 and $200,000, depending on the scope and complexity. For a mid-market company, this might cover a focused analysis of one or two risk areas -- say, vendor payments and expense reports -- for a single fiscal year.

That engagement happens once. It produces a report. The report identifies findings as of a specific point in time. By the time the report is delivered, the data is three to six months old. Any anomalies identified during the audit period have been accumulating consequences for months.

The alternative is not to replace external audit -- there are regulatory and governance reasons to maintain independent review. The alternative is to close the gap between what happens and when it is detected. An anomaly that surfaces on day two costs a refund request and a correcting journal entry. An anomaly that surfaces on month eight costs a refund request, a correcting entry, a restatement analysis, a control deficiency report, and a remediation plan.

The economics are straightforward. Continuous automated monitoring costs a fraction of periodic manual testing. But the cost comparison understates the real advantage, which is temporal. The value of detecting a duplicate payment on day two versus month eight is not proportional to the cost difference. It is proportional to the compound consequences of the delay: interest on overpayments, management time spent on investigation, auditor concerns about control effectiveness, and the erosion of trust with vendors who may wonder why you keep paying them twice.

Early detection is not just cheaper. It is categorically different from late detection. The corrective action is simpler, the consequences are contained, and the control environment is demonstrably stronger.

The Shift from Reactive to Continuous

The traditional approach to financial anomaly detection is fundamentally reactive. Errors are found when someone happens to notice them, or when an auditor specifically looks for them, or when a vendor calls to report an overpayment. The detection is opportunistic, not systematic. It covers whatever the auditor's sample happens to include, not the full population of transactions.

This reactive posture was acceptable when the alternative -- continuous monitoring -- was prohibitively expensive. Hiring a forensic accountant to review every transaction every day would cost more than the anomalies they would find. The economics did not support it.

Autonomous agents change that equation entirely. The marginal cost of scanning an additional transaction is effectively zero. The cost of running the analysis nightly instead of annually is negligible. The question is no longer "can we afford to look at everything?" It is "can we afford not to?"

At Artifi, the Anomaly Detector is one of several autonomous agents that operate within the finance system -- each one responsible for a specific domain, each one running continuously, each one producing structured findings with confidence scores and full audit trails. The human finance team reviews findings, makes decisions, and applies judgment where it is needed. The computational work of scanning, comparing, and scoring happens in the background, every night, across every transaction.

Your books have patterns. Some of those patterns are exactly what they should be. Some of them are problems waiting to surface at the worst possible time. The difference between the two is not visible at human speed. But it is visible at machine speed, if you build the system to look.
