About Deduplication in Interfolio Web Profiles
Records should appear only once in search results on the university website, whether at the university level, department level, or elsewhere. However, duplicate records may still occur for a variety of reasons, including but not limited to:
- A user enters the record more than once in FAR (Faculty Activity Reporting).
- A user collaborates with coauthors from the same institution.
Our technology runs several jobs, each applying a set of rules, to prevent duplicate records from showing on the web profile.
This article explains how those rules work so you can better understand why a record may have been deduplicated (and therefore does not appear), or why a record was not deduplicated (and may appear to be a duplicate on the site).
Deduplication Rules: Identifying and Removing Same Content
- Deduplication checks occur within a scholarship type (e.g., we compare conference proceedings to conference proceedings, not conference proceedings to journal articles). Scholars often present their work before publication, so a presentation and a publication with the same title may appear to be duplicates, but they are actually separate activities.
- DOI Check: Records with the same Digital Object Identifier (DOI) are deduplicated if there is also an 80% match on Title.
- When records do not match on DOI, we compare several fields. All of the following must be true for a record to be identified as a duplicate. Publication B must match publication A on:
  - Title and subtitle (80% similarity)
  - Number of pages
  - Persons
  - Publication date (within a 1-year difference)
  - Pages
  - ISBN
  - Host publication title (e.g., journal title)
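The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Interfolio's implementation: the field names (`type`, `title`, `subtitle`, `doi`, `year`, and so on) are hypothetical, `difflib.SequenceMatcher` stands in for whatever similarity measure is actually used, the publication date is reduced to a year, and the non-title fields are assumed to require an exact match.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.80  # the 80% similarity named in the rules above

# Hypothetical field names; the article does not say how these fields are
# compared, so exact equality is assumed here.
EXACT_FIELDS = ("number_of_pages", "persons", "pages", "isbn", "host_title")


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio (assumed measure, case-insensitive)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def full_title(record: dict) -> str:
    """Join title and subtitle into one string for comparison."""
    return f"{record.get('title', '')} {record.get('subtitle', '')}".strip()


def is_duplicate(a: dict, b: dict) -> bool:
    # Checks only run within one scholarship type.
    if a["type"] != b["type"]:
        return False
    # DOI check: same DOI plus an 80% title match.
    if a.get("doi") and a.get("doi") == b.get("doi"):
        return similarity(a["title"], b["title"]) >= TITLE_THRESHOLD
    # No DOI match: every remaining criterion must hold.
    return (
        similarity(full_title(a), full_title(b)) >= TITLE_THRESHOLD
        and abs(a["year"] - b["year"]) <= 1
        and all(a.get(f) == b.get(f) for f in EXACT_FIELDS)
    )
```

Under these assumptions, a journal article and a conference proceeding with identical titles are never flagged, because the type comparison returns early. That mirrors the first rule in the list.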
Merge Different Variants of the Same Scholarship
In the case of institutional co-authorship (the second cause listed at the top of this article), scholars use FAR to denote coauthors who are faculty members at their current institution. As a result, every coauthor/FAR user on the record has their own version of that scholarship activity in FAR. This can cause at least three problems:
- Inconsistency in representation of the same scholarship activity across coauthor profile pages.
- Unit and institutional activity counts being artificially inflated by counting all the versions as unique scholarship activities.
- Search results being less useful given they will also show duplicates.
We aim to prevent these problems by merging the records. Our merging process includes the following steps:
- Among the identified duplicate candidates, the most complete record received by the web profile from FAR becomes the basis for what is shown on the web profile. "Most complete" is defined as the record with the largest number of non-blank fields.
- If the most complete record (the ‘target’ record) is missing a field, the value for that field is taken from the identified duplicates (by majority or, in the event of a tie, at random).
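The two merge steps above can be illustrated with a short Python sketch. It is a hedged example, not the production logic: records are assumed to be plain dictionaries of hashable values, "blank" is assumed to mean `None` or an empty string, and `merge_duplicates` is a hypothetical name. The article does not say how a tie between equally complete records is broken; here the first one wins.

```python
import random
from collections import Counter


def is_blank(value) -> bool:
    """Assumed definition of a blank field: None or an empty string."""
    return value is None or value == ""


def merge_duplicates(records: list, rng=random) -> dict:
    # Step 1: the target is the record with the most non-blank fields.
    target = max(records, key=lambda r: sum(not is_blank(v) for v in r.values()))
    merged = dict(target)

    # Step 2: fill each blank target field from the duplicates.
    all_fields = sorted({field for r in records for field in r})
    for field in all_fields:
        if not is_blank(merged.get(field)):
            continue
        candidates = [r[field] for r in records if not is_blank(r.get(field))]
        if not candidates:
            continue  # no duplicate has a value, so the field stays blank
        counts = Counter(candidates)
        top = max(counts.values())
        winners = [value for value, count in counts.items() if count == top]
        # Majority wins; a tie is broken at random.
        merged[field] = winners[0] if len(winners) == 1 else rng.choice(winners)
    return merged
```

Passing `rng` as a parameter is a design choice for this sketch: a seeded `random.Random` instance makes the tie-break repeatable when testing.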
The merging process runs daily so that all user changes made within that timeframe are considered in the next day's duplicate identification and record merging. For data types not covered by Scopus (see below, “Book, Book Chapter, Journal Article, and Conference Proceedings”), faculty members are encouraged to coordinate with internal coauthors to ensure that the record is displayed as desired.
Book, Book Chapter, Journal Article, and Conference Proceedings
If a record in FAR is identified as existing in Scopus, the Web Profile will display the Scopus record regardless of what is in FAR. If a discrepancy exists between what is displayed and what is expected, faculty are encouraged to contact the publisher to make the necessary changes.
Visibility: “Publicly Display” and Scholarship Data
If any of the identified internal authors sets “Publicly Display” to No for a record, the record is hidden from the Web Profile.
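As a minimal sketch of this rule, assume each internal author's “Publicly Display” choice is available as a boolean keyed by a hypothetical author identifier. The record is shown only if no author has chosen No:

```python
def is_publicly_visible(display_flags: dict) -> bool:
    """Return True only if every internal author allows public display.

    display_flags maps a hypothetical author ID to that author's
    "Publicly Display" setting (True for Yes, False for No).
    """
    return all(display_flags.values())
```

In other words, a single No from any coauthor outweighs any number of Yes settings.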
Enrichment: Fetch from outside of FAR to add metadata to the record
After the deduplication and merge processes, we attempt to match every record sent to the web profile to a record in Scopus in order to enrich it with key metadata. This process can take up to a week.