Semantic Expert Finding for Service Portfolio Creation
HP Innovation Research Program 2011. Grant CW267313
[September 5, 2012] The paper Linked Open Data to support Content-based Recommender Systems has been awarded the Best Research Paper Award at the ISemantics 2012 Conference
[June 1, 2012] The paper Exploiting the Web of Data in Model-based Recommender Systems has been accepted for presentation at the 6th ACM Conference on Recommender Systems, to be held in Dublin (9-13 September 2012)
[May 25, 2012] The paper Linked Open Data to support Content-based Recommender Systems has been accepted for presentation at the ISemantics 2012 Conference, to be held in Graz (5-7 September 2012)
[May 15, 2012] The workshop Semantic Technologies meet Recommender Systems & Big Data (SeRSy) will be co-located with the 11th International Semantic Web Conference (ISWC 2012)
[May 3, 2012] HP Labs Innovation Research Program (IRP) project "Semantic Expert Finding for Service Portfolio Creation" has been selected to receive continued funding in 2012
For a service organization to be effective, it is very important that knowledge is transferred effectively between its members. In particular, when much of the information resides "informally" within the organization - as is the case in agile service portfolio creation (topic 22, HP Labs IRP CFP) - it is of fundamental importance that the organization is proficient at finding an expert with specific competencies, creating a team of workers with particular abilities, identifying the area of expertise of a department or a research group, and other similar tasks.
Usually, in order to effectively fulfill such tasks, the organization needs to (i) have a deep knowledge of the environment and of the people operating in it, (ii) have a detailed knowledge and (ideally) experience of the requested skills, and (iii) carefully analyze each profile and then (try to) match it with the task requirements.
By way of example, if one wanted to identify the core competencies of a research group or to find a person with specific competences, it would not be enough to look at the resume or job description of each employee, since a person's competences are not necessarily listed in her job description and, often, are not kept up to date. On the other hand, a lot of information can be inferred from the actual "work life" of employees: their participation in specific projects, the data they expose on company-owned social networks, chat sessions between colleagues, their publications.
This valuable information should be collected and leveraged by the company. The crucial point, however, is that it is in some way "hidden" within the data extracted from these heterogeneous information sources. First, the data need to be cleaned, contextualized and expanded. Then, the resulting information should be stored in a way that facilitates its exploration, reuse and exploitation to accomplish the tasks mentioned above.
Description of Proposed Research
The first challenge of our research is to collect data about people (employees, candidates, scientists) from miscellaneous sources (social networks, resumes, emails, chat sessions, research projects, publications) and to bind such data to shared structured datasets.
The data extraction process is critical, since each source is different and has to be handled appropriately. The same holds for accessing the information contained within datasets. For example, we may have datasets (e.g. DBLP for publications, O*NET for skills and competencies, DBpedia for general-purpose information) that can be queried through structured queries (SPARQL, SQL) or through specific APIs.
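As a concrete sketch of such structured access, the following Python snippet builds a SPARQL query for a person's publications. The endpoint URL and the property IRIs used here are illustrative assumptions, not a confirmed DBLP schema; actually sending the request is left out.

```python
# Sketch: building a SPARQL query to fetch a person's publications from a
# DBLP-style endpoint. Endpoint URL and property IRIs are assumptions.
from urllib.parse import urlencode

DBLP_ENDPOINT = "https://dblp.org/sparql"  # assumed endpoint

def publications_query(author_name: str) -> str:
    """Return a SPARQL query selecting titles of papers by the author."""
    return f"""
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?title WHERE {{
        ?paper  dc:creator ?author .
        ?author foaf:name  "{author_name}" .
        ?paper  dc:title   ?title .
    }}
    """

def request_url(author_name: str) -> str:
    # URL for a GET request; issuing it is outside this sketch.
    return DBLP_ENDPOINT + "?" + urlencode({"query": publications_query(author_name)})

q = publications_query("Jane Doe")
```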
Concerning resumes, it is possible to run CV analyzers to obtain (semi-)structured data. Graph analysis can be performed on social networks, while text analysis will be conducted on text documents or text fragments. The outcome of the data extraction step will be a set of "raw data": basically, a bag of tags/resources (i.e., competences, skills, etc.) associated with each profile (user).

The extracted data will be noisy after the data extraction step. Hence, the data extraction process will be followed by a data cleansing one (disambiguating tags, contextualizing them, discarding non-informative data, etc.). The final step in data processing will then be to attach an explicit semantics to the tags: what is the meaning of the tag Java? Does it refer to the Indonesian island or to the programming language? Associating an explicit meaning with a tag brings two main benefits. On the one hand, we are able to uniquely identify a resource (datum/tag). On the other hand, we may relate meanings via an underlying knowledge base, moving from a pure set of tags to a semantic network of tags. Semantic data associated with a profile may also go through a process of tag expansion: thanks to the semantic relations among tags, the meaning of each tag is implicitly enriched.
After the data cleansing step, tags will no longer be simple keywords: they will be uniquely identified and have their own meaning, i.e. their own semantics.
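A minimal sketch of the disambiguation and expansion steps, using a hand-made mini knowledge base; in the project this role would be played by DBpedia or similar, and the concepts listed below are invented for illustration.

```python
# Toy sketch of tag disambiguation and semantic expansion.
# Candidate meanings for an ambiguous tag, each with related concepts.
KB = {
    "Java": {
        "Java (programming language)": {"JVM", "OOP", "software"},
        "Java (island)": {"Indonesia", "volcano", "geography"},
    }
}

def disambiguate(tag, context_tags, kb=KB):
    """Pick the candidate meaning sharing most concepts with the context."""
    candidates = kb.get(tag)
    if not candidates:
        return tag  # unambiguous or unknown: keep as-is
    return max(candidates, key=lambda m: len(candidates[m] & set(context_tags)))

def expand(meaning, kb=KB):
    """Semantic expansion: enrich a resolved meaning with related concepts."""
    for senses in kb.values():
        if meaning in senses:
            return {meaning} | senses[meaning]
    return {meaning}

sense = disambiguate("Java", {"OOP", "software", "databases"})
enriched = expand(sense)
```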
This last step will be possible thanks to the exploitation of multiple knowledge bases (KBs), e.g. DBpedia, O*NET and SOC, just to cite a few. This leads to the second challenge of the project: the process of knowledge construction, which builds an ad-hoc, context-aware semantic network starting from heterogeneous KBs. Here the challenge lies in identifying links among the different KBs and extracting, from each of them, a subset related only to the knowledge of interest (e.g. IT-related knowledge). For this reason, a context-bounding module is needed to select and extract from each KB only those resources belonging to a specific context of interest; for example, one could be interested only in the IT domain or, more specifically, in the database domain. Starting from heterogeneous knowledge bases freely available on the (Semantic) Web, the context-bounding module will filter out non-relevant and out-of-context resources.
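The context-bounding idea can be sketched as a simple category filter; the resource and category names below are invented for illustration, and a real module would work over KB category hierarchies rather than flat sets.

```python
# Sketch of a context-bounding filter: given KB resources annotated with
# categories, keep only those whose categories intersect the context of
# interest (here, an IT context). All names are illustrative.
RESOURCES = {
    "Relational database": {"Databases", "Information technology"},
    "PostgreSQL": {"Databases", "Software"},
    "Mount Merapi": {"Volcanoes", "Geography"},
}

def bound_to_context(resources, context_categories):
    """Filter out resources that do not belong to the context of interest."""
    ctx = set(context_categories)
    return {name for name, cats in resources.items() if cats & ctx}

it_subset = bound_to_context(
    RESOURCES, {"Databases", "Software", "Information technology"}
)
```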
As said before, the output of the knowledge construction process will be an ad-hoc, context-aware semantic network: ad hoc, because each KB, being different from the others, has to be handled in a specific way; context-aware, because from each KB we will extract only the information related to a specific domain of interest, leaving out irrelevant information.
The obtained semantic network will be a weighted graph where nodes are competences, skills and works, while edge weights represent the similarities between them.
Once the data have been cleansed, a process of information extraction allows us to build a graph where people are associated, via a similarity value, with their skills, competences, etc.
The weighted graph will be stored in a graph database. Compared with relational databases, graph databases are faster for associative data sets and scale better to large data sets, since they do not require expensive join operations. Moreover, since graph databases do not depend rigidly on a fixed schema, they can more easily handle changes in the data and evolving schemas.
We propose a multi-layered graph, where each layer refers to a specific dimension (e.g. people, skills and expertise, projects, publications). The resources in each layer are linked with resources in the same tier and with resources in other tiers via a similarity value. The similarity value is computed exploiting similarity metrics that take into account the semantics of two resources of interest.
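A minimal in-memory sketch of such a multi-layered weighted graph, with plain Python dicts standing in for the actual graph database; node and layer names are invented for illustration.

```python
# Sketch of the multi-layered weighted graph: every node lives in a layer
# (people, skills, projects, ...) and edges carry a similarity weight, both
# within a layer and across layers.
class MultiLayerGraph:
    def __init__(self):
        self.layer = {}   # node -> layer name
        self.edges = {}   # node -> {neighbor: similarity weight}

    def add_node(self, node, layer):
        self.layer[node] = layer
        self.edges.setdefault(node, {})

    def link(self, a, b, similarity):
        # undirected weighted edge
        self.edges[a][b] = similarity
        self.edges[b][a] = similarity

    def neighbors_in_layer(self, node, layer):
        """Neighbors of a node restricted to one layer, with their weights."""
        return {n: w for n, w in self.edges[node].items()
                if self.layer[n] == layer}

g = MultiLayerGraph()
g.add_node("alice", "people")
g.add_node("bob", "people")
g.add_node("SQL", "skills")
g.link("alice", "SQL", 0.9)   # cross-layer link (person -> skill)
g.link("alice", "bob", 0.4)   # intra-layer link (person -> person)
```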
The information gathered in the graph will then be used for a twofold purpose: (i) to exploit graph mining techniques in order to discover laws and patterns, e.g. important relations among people, projects, etc., and (ii) to support exploratory search tasks.
As concerns exploratory search, we would like to give the user two abilities: the visual exploration of the graph and the formulation of non-atomic queries over it.
The visual exploration of the graph is suitable when the user wants to explore the semantic network without having a clear idea of what she is looking for, starting from an initial vague idea or from an overview and then refining it through exploration. This type of exploration is suitable, for example, for visually identifying clusters of nodes, such as research groups within a department. Here we will propose different forms of graph visualization, using force-based algorithms (e.g. to identify clusters of competences, or people with similar expertise) or hyperbolic geometry (e.g. to obtain an overview of the graph to explore).
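To illustrate the force-based idea, here is a tiny self-contained spring layout: connected nodes attract, all pairs repel, so linked nodes settle near each other and clusters emerge. The constants are arbitrary illustrative choices, not the algorithm the project would actually adopt.

```python
# Minimal force-directed layout sketch (spring model): pairwise repulsion
# plus attraction along edges, with a capped step to keep it stable.
import math
import random

def spring_layout(nodes, edges, iters=200, k=0.5, step=0.05, seed=0):
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    for _ in range(iters):
        disp = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:                          # repulsion between all pairs
            for b in nodes:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
        for a, b in edges:                       # attraction along edges
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        for n in nodes:                          # capped update step
            length = math.hypot(*disp[n]) or 1e-9
            scale = min(length, step) / length
            pos[n][0] += disp[n][0] * scale
            pos[n][1] += disp[n][1] * scale
    return pos

pos = spring_layout(["ann", "ben", "carl", "dora"],
                    [("ann", "ben"), ("carl", "dora")])
```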
In a semantic network, an atomic query is a query involving only one node or one edge of the network. In our domain, non-atomic queries are needed, for instance, when the user is looking for someone with specific skills, or when an expert must be found and allocated to a given project. With respect to the underlying semantic network, the query itself can be represented as a labelled graph. In order to formulate such queries, we will propose a "smart" user interface exploiting simple input fields (text fields, sliders, etc.). The smartness of the user interface will lie in its ability to automatically infer the graph representing the actual query by exploiting: (1) the semantics encoded within the multi-layered graph built during the information extraction task, (2) the semantics of the semantic network built during the knowledge construction task and (3) the semantics of the values coming from the input fields filled in by the user.
Once the query has been expressed, the system will query the multi-layered graph and, based on the same similarity metrics adopted in the information extraction task, it will return a ranked list of resources matching the semantics of the query.
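The matching-and-ranking step can be sketched with a set-based similarity; here Jaccard similarity stands in for the project's actual semantic metrics, and the profiles and query tags are invented examples.

```python
# Sketch of the final ranking step: profiles are matched against the
# (expanded) tag set of a query and sorted by decreasing similarity.
def jaccard(a, b):
    """Set-based similarity in [0, 1]; a stand-in for semantic metrics."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_experts(profiles, query_tags):
    """Return (name, score) pairs sorted by decreasing similarity."""
    scored = [(name, jaccard(tags, query_tags))
              for name, tags in profiles.items()]
    return sorted(scored, key=lambda t: -t[1])

profiles = {
    "alice": {"Java", "databases", "SPARQL"},
    "bob": {"marketing", "sales"},
}
ranking = rank_experts(profiles, {"databases", "SPARQL"})
```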