Circuit Mind is developing smart systems that augment electronic engineers' workflow and enable them to innovate orders of magnitude faster than before. A key piece of this vision is building a database of digital twins of electronic components. Digital twins are exact counterparts of physical assets and objects, and they are already unlocking tremendous value in the construction and infrastructure management industries. Similarly, with a library of detailed electronic component digital twins, new applications beyond our wildest imagination can be unlocked: AI-assisted circuit design, automated extraction and categorisation of reusable building blocks, and rapidly finding replacement components for out-of-stock parts, among many others.
Electronic engineers consult a variety of engineering resources to acquire the information required to complete a design. One such resource is the datasheet, which describes the engineering parameters and functional details of an electronic component. Information found in a datasheet, such as pinouts, functional attributes, and interface descriptions, provides arguably the most important parameters for a component during design. It is therefore imperative that electronic component digital twins incorporate a thorough, structured representation of datasheets. Inferring this information is not straightforward, though, and manual creation of digital twins is prohibitively time-consuming. Hence, we are developing a range of algorithms that facilitate the creation of digital twins by extracting various pieces of information from datasheets automatically.
Need for a Reusable Language Model
A reusable language model is crucial to address three main obstacles in extracting information from datasheets: data formats are inconsistent, design guidelines are given in natural language, and some data points are buried in the document body.
First, most electronic component manufacturers delegate the writing of a datasheet to the engineering team that designed the component in question. Coupled with often missing guidelines on naming and structuring, this practice results in highly inconsistent naming and data structuring across datasheets from different manufacturers or even different design teams in the same organisation. To be able to accurately identify and discern data entities in datasheets our system needs to have an abstract understanding of the meanings of each term. Semantic language models do just that by learning to represent text as an abstract embedding of its meaning. These representations can be employed to normalise data that uses different naming conventions.
Moreover, information is often only disclosed in paragraphs in free-form text such as descriptions of PCB layout constraints or required external components. Sophisticated language models can recognise and interpret such information in a way that enables the extraction of engineering constraints and functional details in a structured, machine-interpretable way.
Lastly, important functional information is often hidden in lists and paragraphs, so engineers tend to gloss over it unless they are specifically looking for that piece of data. Our system must be able to recognise and extract these data fragments automatically in order to compile the required data about components. Entity recognition systems built on natural language models are applied with increasing success to carry out this exact task in other domains such as healthcare and finance.
Importance of reusability
Training modern language models, even from pre-trained generalist models, requires significant resources that can considerably drive up development costs. For this reason, it is important that the model we train is reusable and is applicable for many downstream tasks that might come up in understanding datasheets.
Transformers for Electronics Literature
Transformers are a class of natural language models that aim to build representations of phrases, sentences, and paragraphs and the relationships between them. These representations can in turn be used for other tasks such as entity recognition, question answering, and semantic search. Transformers rely on a mechanism called attention that enables them to filter important information and understand connections between entities very efficiently, making them the most successful class of semantic language models to date. The intuition behind how transformers work has been explained by dozens of other sources, such as this video explanation of the original paper by Google.
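To make the attention mechanism more concrete, the following is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only, not our production code: the three tokens and their two-dimensional vectors are made-up values, and real transformers use learned query, key, and value projections with many attention heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up toy embeddings: "resistor" and "ohm" point in similar directions.
tokens = ["resistor", "ohm", "novel"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
```

In this toy setup the row of weights for "resistor" puts more mass on "ohm" than on "novel", which is the filtering behaviour described above.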
A wide range of transformer models, such as Google's popular BERT or the more recent Transformer-XL, are available from open-source repositories with weights from pre-training on large corpora. After experimenting with these pre-trained models, it quickly became apparent that they are not quite fit for the task at hand. These models are trained on sources such as Wikipedia articles and collections of novels, resulting in models that capture semantic relationships well in everyday language. The issue is that datasheets are written in highly technical language, often employing very specific terms and abbreviations.
To bring pre-trained general-purpose transformers to the electronics domain, we employed two strategies: domain-specific preprocessing of text, and fine-tuning pre-trained general language models on electronics literature such as textbooks, datasheets, and technical articles.
Domain-specific preprocessing involves a variety of steps, most importantly expanding common abbreviations and handling special entities such as units and quantities. Abbreviations are expanded by referring to a large curated index of common abbreviations used in electronics engineering. Special entities are replaced by special markers and are handled separately using multi-modal machine learning in downstream tasks.
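As a rough illustration of these two preprocessing steps, the sketch below expands abbreviations from a lookup table and replaces quantities with a marker while keeping the original values for separate handling. Everything named here is hypothetical: the three-entry `ABBREVIATIONS` table stands in for the much larger curated index, the `[QTY]` marker and the `QUANTITY` pattern are assumptions made for the example, and a real pipeline would cover far more units and edge cases.

```python
import re

# Hypothetical miniature abbreviation index (the curated index is far larger).
ABBREVIATIONS = {
    "ldo": "low dropout regulator",
    "gnd": "ground",
    "pwm": "pulse width modulation",
}

# A number, an optional SI prefix, and a unit symbol, e.g. "3.3 V" or "100 nF".
QUANTITY = re.compile(r"(?<![\w.])\d+(?:\.\d+)?\s?[pnuµmkMG]?(?:V|A|Hz|Ω|ohm|F|W)\b")


def preprocess(text):
    """Replace quantities with a marker, then expand known abbreviations."""
    quantities = []

    def stash(match):
        # Keep the raw quantity so a downstream step can handle it separately.
        quantities.append(match.group(0))
        return "[QTY]"

    marked = QUANTITY.sub(stash, text)

    def expand(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)

    expanded = re.sub(r"[A-Za-z]+", expand, marked)
    return expanded, quantities


text = "Connect GND to the LDO output at 3.3 V with a 100 nF capacitor."
expanded, quantities = preprocess(text)
print(expanded)
print(quantities)
```

Quantities are marked before abbreviations are expanded so that unit symbols are never mistaken for abbreviations.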
Fine-tuning Transformers on Electronics Literature
As a starting point for fine-tuning on electronics literature, we are using pre-trained transformer models. These models are freely and publicly available enabling sophisticated language systems to be developed without the time and financial burdens of running long training processes. As automating datasheet understanding requires domain-literate models, we fine-tuned models following the general training strategy outlined below.
The goal of this fine-tuning is to allow the model to better understand the context of domain-specific text. For example, we would ideally like the word “resistor” to be more closely aligned with the word “ohm” than with the word “amp”. A typical approach to training domain-specific language models involves taking publicly available pre-trained weights and fine-tuning them on supervised examples, i.e. human-annotated data. Supervised fine-tuning of a public model would normally be adequate for most tasks. However, manually collecting supervised training samples can be expensive and time-consuming. To address the lack of supervised samples, we trained our language models in an unsupervised manner on electronics literature that we have collected and curated to be a clean and representative sample of the type of text our tools might encounter in datasheets.
Specifically, we experimented with two approaches: the first using Masked Language Modelling (MLM) as popularised by BERT, the second using the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) proposed by Wang et al. Both of these techniques involve taking a noisy text sequence as input to the model with the objective of reproducing the original non-noisy sequence correctly. MLM is the method used to train popular models such as BERT. It involves masking out random words in sentences and optimising the model to correctly predict the masked words, thus teaching the model language structure and contextual dependencies.
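The masking step of MLM can be sketched in a few lines. This is a simplified illustration rather than our training code: it assumes whitespace tokenisation, uses the 15% masking rate from the BERT paper, and leaves out BERT's additional rule of sometimes swapping a selected token for a random word or leaving it unchanged. The function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Mask a random subset of tokens and return the prediction targets."""
    rng = random.Random(seed)
    # Always mask at least one token so every sentence yields a training signal.
    count = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), count))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # what the model must predict at this position
        masked[pos] = MASK
    return masked, labels


sentence = "the output voltage is set by an external resistor divider on the feedback pin"
tokens = sentence.split()
masked, labels = mask_tokens(tokens, seed=0)
print(" ".join(masked))
print(labels)
```

During training, the model sees `masked` as input and is penalised only on the positions stored in `labels`.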
TSDAE is similar to MLM but trains models to generate descriptive whole-sentence embeddings by introducing pooling before the decoding stage. Pooling transforms every input sequence into a single embedding regardless of the original length of the text. This forces the encoder to generate an accurate and descriptive single embedding of the whole sequence, from which the decoder must reproduce the non-noisy text. For supervised training, we only use the encoder and pooling layers to generate high-quality domain-specific sentence embeddings.
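The two ingredients that distinguish TSDAE from MLM, input noise and pooling, can be sketched as follows. The assumptions are stated plainly: noise is applied by deleting tokens with a ratio of 0.6, which is the setting recommended in the TSDAE paper; mean pooling is used here purely for illustration and may differ from the pooling used in our models; the token vectors are made-up numbers rather than encoder outputs.

```python
import random


def delete_tokens(tokens, ratio=0.6, seed=None):
    """Corrupt a sentence by dropping each token with probability `ratio`."""
    rng = random.Random(seed)
    kept = [tok for tok in tokens if rng.random() > ratio]
    if not kept:
        # Never return an empty input; keep one random token instead.
        kept = [rng.choice(tokens)]
    return kept


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence embedding."""
    length = len(token_vectors)
    return [sum(column) / length for column in zip(*token_vectors)]


tokens = "place the decoupling capacitor close to the supply pin".split()
noisy = delete_tokens(tokens, seed=1)

# Made-up token vectors: sequences of different lengths pool to the same size.
short_embedding = mean_pool([[1.0, 2.0], [3.0, 4.0]])
long_embedding = mean_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
```

The decoder only ever sees the pooled vector, so all the information needed to rebuild the original sentence from `noisy` has to pass through that single embedding.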
To compare the performance of MLM and TSDAE pretraining we present results from a recent project in which we used a language model to read a non-standardised textual description of an electrical parameter from an external source and identify the most appropriate standardised parameter in our internal database.
As a baseline, we use a DistilBERT model (a lightweight version of BERT). Among other metrics, we used accuracy, precision, and area under the ROC (Receiver Operating Characteristic) curve to evaluate the various language models on this task.
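For readers unfamiliar with these metrics, the sketch below computes each of them from scratch. It assumes, for illustration only, that the task is framed as a binary decision (does a candidate parameter match or not) with a score per candidate; the labels and scores are made-up values and not results from our evaluation.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def precision(labels, predictions):
    """Fraction of predicted positives that are truly positive."""
    predicted_positive = [l for l, p in zip(labels, predictions) if p == 1]
    if not predicted_positive:
        return 0.0
    return sum(predicted_positive) / len(predicted_positive)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 means "correct match", scores are model confidences.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(labels, predictions))
print(precision(labels, predictions))
print(roc_auc(labels, scores))
```

Accuracy and precision depend on the chosen threshold (0.5 here), whereas the area under the ROC curve summarises how well the scores rank matches above non-matches across all thresholds.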
Our DistilBERT baseline, which has no electronics literature training, scores 90.80% accuracy on this particular task. We found that both MLM and TSDAE increased performance. However, TSDAE clearly outperforms both the baseline and MLM on total accuracy, consistent with the results reported in the TSDAE paper.
Precision shows a similar tendency: it is slightly higher for the MLM-trained model than for the baseline, while TSDAE significantly outperforms both models. This further supports the case for using a model that is fine-tuned on electronics literature.
The area under the ROC curve is virtually unchanged, indicating no significant decrease in the model's ability to discriminate between phrases.
Applying the model to a downstream task
This pre-training strategy has allowed us to robustly train electrical-engineering-literate models which can be fine-tuned for downstream language tasks. We have found that a robustly pre-trained model can achieve very high performance on a range of tasks without the need for large supervised datasets, thus saving valuable time and resources.
One example of such a task is normalising component attributes. Different data sources describe component attributes in different ways, with slight variations in terminology or even completely different acronyms. Hence we need a model that takes as input an unstandardised natural language description of an attribute and is able to match it to one of our internal standardised attributes.
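One common way to perform this kind of matching with sentence embeddings is nearest-neighbour search by cosine similarity, sketched below. This is an assumption about how such a matcher could work, not a description of our production system: the attribute names in `STANDARD_ATTRIBUTES` are hypothetical, and the three-dimensional vectors are made-up stand-ins for embeddings that a fine-tuned encoder would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def best_match(query_vector, catalogue):
    """Return the catalogue entry whose embedding is closest to the query."""
    return max(catalogue, key=lambda name: cosine(query_vector, catalogue[name]))


# Hypothetical standardised attributes with made-up embeddings.
STANDARD_ATTRIBUTES = {
    "supply_voltage_max": [0.9, 0.1, 0.0],
    "operating_temperature_min": [0.0, 0.2, 0.9],
    "output_current_max": [0.1, 0.9, 0.1],
}

# Pretend embedding of an external description such as "Vcc (max)".
query = [0.8, 0.2, 0.1]
match = best_match(query, STANDARD_ATTRIBUTES)
print(match)
```

Because the comparison happens in embedding space, descriptions that share no words with the standardised name can still be matched, provided the encoder places them close together.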
We have been able to successfully apply our pre-trained TSDAE model to this task by fine-tuning on a small (<500 samples) supervised dataset. The resulting model had a matching accuracy of 95%. We have also been able to produce well-performing models on other relevant tasks, again without the need for large supervised datasets.
Accelerating the digital twin future
Building digital twins of physical assets has far-reaching implications, including for digital twins of electronic components. Having built a robust reusable language model is an important step in enabling our platform to accelerate the creation of a large component digital twin database. Using this model in conjunction with our other data extraction algorithms unlocks high-efficiency data entry across all areas of the digital twin creation process.