Drupal is one of the most widely used and powerful content management systems in the world. However, as a new developer, I found importing content with complex structures to be an extremely frustrating task. Finding a reliable way to import data into a Drupal system was challenging and unnecessarily difficult. A fitting example is the Migrate module that Drupal 8 provides.
Migration plugins are responsible for mapping the data extracted from a source to the Drupal definition of that data. When I took a methodical approach to the migration, I first learned that a custom migration plugin has to be created to import data. This is done in a YAML file that declares the source, process and destination plugins. The procedure is straightforward for content types such as 'Articles,' but far more complex for structures like Custom Entities. After learning how to write this file, I had to test it by loading the configuration and running the migration. I realized that for every change I made, I had to load the configuration again and re-run the migration. After much trial and error, I managed to get the migration right for a simple Content Type with three fields. But when I had to migrate a Custom Entity with 30 fields, the task seemed impossible to complete within a realistic time frame. This challenge encouraged me to walk a trail with no footprints and led me to devise two new approaches for performing the complete data import.
Approach 1: Talend Data Integration
The first approach uses the Talend Data Integration tool. Talend is a versatile ETL tool that can connect directly to the Drupal MySQL database.
I used this platform instead of a Drupal module to import data into the system. To do this, it is very important to understand the purpose of all the tables in the Drupal database and their relationships with each other. For instance, in Drupal 8, when we import node data, Drupal inserts rows into tables such as node, node_revision, node__body, node_field_data, node__access and many more. The official Drupal website has an entity relationship diagram of the Drupal database, which is a good resource for learning its structure.
Every content type that we create gets stored in the node and other tables in the database.
The node table has the following columns (nid, vid, type, uuid, langcode):
- nid – Node Identifier.
- vid – Version ID. A new version of a node can be created every time you save it, and each version has its own ID.
- type – This is the name of the bundle such as article/customer/page etc.
- uuid - Universal Unique Identifier
- langcode - For every language, there is a code such as "en" for English.
node_revision has columns like nid, vid, timestamp, revision log, etc., to keep track of every version of a node that has been saved from the front end. The revision log stores a message describing the changes made in a given vid. Revisions can be turned off with the checkbox option called “Create new revision.”
node_revision__body and node_revision__comment are simply revision tables for their respective field tables.
node_field_data stores, for every node (nid), version (vid) and type (bundle), the default 'title' field and other details such as timestamps. Its columns are: nid, vid, type, langcode, status, uid, title, created, changed, promote, sticky, default_langcode, revision_translation_affected.
node__field_image and node__field_tags store the data for the image and tags fields of every node.
node__body stores the body of each node; its columns are bundle, deleted, entity_id, revision_id, langcode, delta, body_value, body_summary and body_format.
node__access stores data about access for all nodes.
The following diagram gives a rough idea of how a node gets stored in the Drupal database. Consider a “Customer” content type with 4 fields: Title, Age, Contract Date and Discount. Every time a Customer node is added, changes are reflected in the following tables (a JDBC sketch of the corresponding inserts follows the list):
node
node__body
node_field_data
node__field_age
node__field_contract_date
node__field_discount
And all corresponding revision tables.
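To make those relationships concrete, here is a minimal JDBC sketch of the kind of inserts the Talend job performs for a single Customer node. This is not part of the original Job; the connection URL, credentials, field values and the "basic_html" text format are assumptions, the column lists follow the tables described above, and the remaining field and revision tables would need equivalent inserts.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

// Minimal sketch: inserts one "Customer" node directly into the Drupal MySQL tables.
// Connection details and values are placeholders; revision tables and the remaining
// field tables (node__field_age, node__field_contract_date, ...) are omitted.
public class DirectNodeInsertSketch {
    public static void main(String[] args) throws Exception {
        long nid = 101, vid = 101;                      // last nid/vid in the system + running count
        long now = System.currentTimeMillis() / 1000L;  // Drupal stores Unix timestamps in seconds

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/drupal", "drupal_user", "drupal_pass")) {

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO node (nid, vid, type, uuid, langcode) VALUES (?, ?, ?, ?, ?)")) {
                ps.setLong(1, nid);
                ps.setLong(2, vid);
                ps.setString(3, "customer");
                ps.setString(4, UUID.randomUUID().toString());
                ps.setString(5, "en");
                ps.executeUpdate();
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO node_field_data (nid, vid, type, langcode, status, uid, title, "
                  + "created, changed, promote, sticky, default_langcode, revision_translation_affected) "
                  + "VALUES (?, ?, ?, ?, 1, 1, ?, ?, ?, 0, 0, 1, 1)")) {
                ps.setLong(1, nid);
                ps.setLong(2, vid);
                ps.setString(3, "customer");
                ps.setString(4, "en");
                ps.setString(5, "Sample customer");
                ps.setLong(6, now);
                ps.setLong(7, now);
                ps.executeUpdate();
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO node__body (bundle, deleted, entity_id, revision_id, langcode, "
                  + "delta, body_value, body_summary, body_format) VALUES (?, 0, ?, ?, ?, 0, ?, '', ?)")) {
                ps.setString(1, "customer");
                ps.setLong(2, nid);
                ps.setLong(3, vid);
                ps.setString(4, "en");
                ps.setString(5, "Body text for this customer record");
                ps.setString(6, "basic_html");          // assumed text format machine name
                ps.executeUpdate();
            }
            // ...same pattern for node_revision, node__field_age, node__field_contract_date,
            // node__field_discount and all corresponding revision tables.
        }
    }
}
```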
These are the steps for the first approach.
- Prepare a CSV file with the data that should be imported into the Drupal system. This file will hold all the field values to be imported.
- Create a Talend project and a Job, and read all the data records from the CSV input file into the Talend Job using a component called tFileInputDelimited.
- Then, make a connection to the system’s database and retrieve the schema of all the tables that will be needed for the Insert operation.
- The tMap component maps input data points to output data points. Use it to map the input received from the datasheet to the columns of the tables in the Drupal database.
- In the same component, generate all the other variables like Timestamps, UUIDs, etc. and map them to the columns in the tables.
- In the above figure, we can see the variables that have been assigned a constant value or a data expression reused for every record imported from the CSV file (a Java sketch of these expressions follows this list).
- The variable "type" has been assigned the constant String value of "liferafts" which is the machine name in Drupal for a content type called “Liferafts” made in this system.
- The variable “nid” assigns the Node ID used in the insert operations for every record in the import. It is calculated by reading the last Node ID present in the “node” table before the import starts and adding the running count of records imported so far.
- The variable “vid” assigns the Version ID for every node in the same way as nid, except that the starting value is read from the “node_field_data” table.
- The variable “langcode” is assigned the constant String value of “en” which stands for English.
- The function TalendDate.getCurrentDate().getTime() captures the Timestamp for when the variable var1 is used to import a record into the system.
- The variable "delta" has been assigned a constant value of 0
- A universally unique identifier (UUID) is a value used to identify information in computer systems. In this Job, I generated it using MySQL’s UUID() function; you can also use other methods, such as a Java function.
- Then make an individual table connection using a component named tDBOutput, with the Action on table set to “Insert”. This has to be done for every table we need to add our data to.
- For every field added to a content type, a table gets created in the Drupal database for that field. Hence, as the number of fields increases, so does the number of table connections.
- Read all the columns from our datasheet, one at a time, generate other variables using data expression, map the data to output columns and load into the tables.
- This is what the final Job will look like. The tLogRow components are not mandatory; they are used only to print the output records. Similarly, the tFlowToIterate and tFixedFlowInput components are used for extra preprocessing, such as declaring global variables for the columns coming from the CSV file.
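For reference, here is a rough Java sketch of those tMap variable expressions. In the actual Job they live inside the tMap component; the names lastNid, lastVid and recordIndex are assumptions standing in for the last IDs read from the node and node_field_data tables and the running row count.

```java
import java.util.UUID;

// A sketch of the tMap variables expressed as plain Java, not a component from the Job itself.
public class TMapVariablesSketch {

    public static Object[] buildVariables(long lastNid, long lastVid, long recordIndex) {
        String type     = "liferafts";                          // machine name of the content type
        long   nid      = lastNid + recordIndex;                // last Node ID in the system + rows imported so far
        long   vid      = lastVid + recordIndex;                // same idea, starting from node_field_data
        String langcode = "en";                                 // English
        long   created  = System.currentTimeMillis() / 1000L;   // the Job uses TalendDate.getCurrentDate().getTime();
                                                                // Drupal's created/changed columns expect seconds
        int    delta    = 0;                                    // index for multi-value fields; 0 for single values
        String uuid     = UUID.randomUUID().toString();         // Java alternative to MySQL's UUID()

        return new Object[] { type, nid, vid, langcode, created, delta, uuid };
    }
}
```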
Approach 2: RESTful Web Service
Some people may argue that Approach 1 unnecessarily uses another platform when Drupal already has multiple ways to support data import. My counterargument is that by leaving Drupal out of the process and loading the data directly into the MySQL tables, we can be certain that the data is now in our system. However, if your import has many fields, it requires inserting into many more tables.
A sophisticated way of handling such a situation is to use a REST API configured in Drupal and an external platform like Talend to make the request calls. This is the second approach I followed, using a universal way to talk to the system: a RESTful web service.
This approach scales easily, and the same steps apply regardless of how many fields the content type has.
- Prepare the CSV file. Consider that we are importing the Title and Body of a sample content type.
- Configure a REST API for the content type for which the records have to be added.
- Change its configuration to enable the POST method and Basic HTTP Authentication. The Basic Auth module takes the username and password out of the request and authenticates them against Drupal. We need to enable this module because it implements the HTTP Basic protocol, in which the login credentials are encoded and added to the Authorization header of the request. Go to Configuration → REST → rest_api/(the name of your REST API) → check basic_auth as the Authentication Provider.
- Construct a routine in Java using the okhttp library to make the API request. This can also be done by using other REST libraries and programming languages.
- HTTP verbs are used in requests to interact with resources in a REST system: We use POST to create a new resource.
- In the request headers, the client states the type of content it can receive from the server (the Accept header) and the type of content it is sending (the Content-Type header), which ensures that neither side has to process data it cannot understand. We will set Content-Type to "application/hal+json". We are using the JSON (HAL) format; XML can also be used.
- Requests must contain the path of the resource that the operation should be performed on, i.e. the URL. Here, “http://drupal-8-new.dd:8083/entity/node” is the URL of the local site: “/entity” because we want to add an entity, and “/node” because the type of entity is Node.
The Routine Libraries can be added by going to Code > Routines > Create a New Folder > Create a New Routine > Right Click > Edit Routine Libraries > Browse > Add the Library File > Finish.
5. This is a sample Java routine if we were to add only the title and body of a content type.
First, import the okhttp library and create the client object. This client behaves like the one in Postman: it sends the Request object and captures the server's reply in a Response object.
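The routine itself appears as a screenshot in the original article; below is a minimal sketch along the same lines, assuming okhttp 3.x, a content type with the machine name "article", and a placeholder Authorization value (step 7 shows how to generate it). Real code should also JSON-escape the title and body values.

```java
package routines; // Talend generates user routines in the "routines" package

import java.io.IOException;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

public class DrupalNodePost {

    // One shared client, reused for every record sent during the import.
    private static final OkHttpClient client = new OkHttpClient();

    public static String postNode(String title, String body) throws IOException {
        // hal+json payload: the _links/type href tells Drupal which bundle (content type) to create.
        String json = "{"
                + "\"_links\":{\"type\":{\"href\":\"http://drupal-8-new.dd:8083/rest/type/node/article\"}},"
                + "\"title\":[{\"value\":\"" + title + "\"}],"
                + "\"body\":[{\"value\":\"" + body + "\"}]"
                + "}";

        RequestBody requestBody = RequestBody.create(
                MediaType.parse("application/hal+json"), json);

        Request request = new Request.Builder()
                // Depending on your REST configuration you may need to append ?_format=hal_json.
                .url("http://drupal-8-new.dd:8083/entity/node")
                .post(requestBody)
                .addHeader("Content-Type", "application/hal+json")
                .addHeader("Authorization", "Basic <encoded username:password>") // see step 7
                .build();

        try (Response response = client.newCall(request).execute()) {
            return response.body().string();
        }
    }
}
```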
6. Before creating the routine in Java, you can also construct the call in the Postman client to study and test the POST request. This step assured me that my REST call was constructed correctly, so I could confidently proceed with importing thousands of records into the system. This is how it would look in Postman when importing just the Title and Body of a content type.
7. If you don’t know how to get the encoded value for Authorization, you can simply go to Postman > Authorization > Select Type as Basic Auth > Enter Username and Password.
This automatically generates the encoded value in the header, which can be copy-pasted into the Java routine.
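If you prefer not to rely on Postman for this, the same value can be produced with the Java standard library; the username and password below are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
    public static void main(String[] args) {
        // Replace with your actual Drupal username and password.
        String credentials = "admin:secret";
        String header = "Basic "
                + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        System.out.println(header); // paste this value into the Authorization header
    }
}
```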
8. After making the Java routine in Talend, read the data from the CSV file using the tFileInputDelimited component, as in Approach 1, and connect it to a component called tJavaFlex, which allows you to call a Java function. In tJavaFlex, every field value is taken and dynamically passed into the REST body.
The link row2 carries all the field names read by the tFileInputDelimited component, and these are passed to the Java function (a sketch of this call follows).
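As a rough sketch, and assuming the routine is named DrupalNodePost as above and the incoming link is row2, the main code of the tJavaFlex component can be as small as a single call:

```java
// tJavaFlex main code: executed once for every row read from the CSV file.
// row2.title and row2.body come from the schema defined on the tFileInputDelimited link.
String response = DrupalNodePost.postNode(row2.title, row2.body);
System.out.println(response); // optional: log Drupal's reply for each record
```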
9. After running this job, for every record in the datasheet, a REST call is made and data is sent to the system.
This method eliminates the hassle of maintaining complex table relationships in the database. Both approaches follow the same procedure for importing Nodes, Users, Taxonomy terms and Custom Entities. They eliminate the tasks of keeping up with the latest module versions or fixing other compatibility issues, and they are reliable, easy ways to import data into a Drupal system.
Finally, let’s compare each data import approach side by side.
| Database Connection Approach | REST Approach |
| --- | --- |
| Very fast: thousands of records take only a couple of seconds to import into multiple tables. | Slower: sending each record one at a time over REST may take 10-15 minutes for a few thousand records. |
| For a content type with 4 fields, building the job takes around 1-2 hours and execution takes a couple of seconds for a few hundred records. | For a content type with 4 fields, building the job takes around 30-45 minutes and execution takes a couple of minutes for a few hundred records. |
| Not very scalable: as the data grows, the number of table connections increases. | Very scalable: we simply add more fields to the REST body. |
| More scope for error: manual mapping is involved, so the method fails if it is not done carefully. | Less scope for error: only a Java routine is involved, so the method works as long as the POST request is constructed correctly. |
| Not very flexible: most of the approach follows a fixed way of doing the import. | Flexible: any data format, REST library, programming language or platform other than Talend can be used. |
| Can be used to import Nodes, Users, Taxonomy terms and Custom Entities. | It may not be possible to configure a REST API for Taxonomy, Users, etc. |
| The Talend Job in this approach is not easily reusable: if the content type changes, all the field table connections need to be changed, which takes time. | Easily reusable: the URL, headers and method stay the same for the same website; only the request body needs changes for a different content type. |
One of the greatest strengths of an open source CMS like Drupal is that there is often no single “right” way to solve a problem. These are just a few of the methods developers can use to import content and data in Drupal.
For more on how Fantail Consulting and Technologies uses open source software solutions like Drupal, visit our website or check out our profile on Drupal.org.