File Import API

The File Import API provides endpoints for uploading, analyzing, and previewing files that can be imported into various SQL engines. This API simplifies the process of creating database tables from files like CSV, TSV, and Excel spreadsheets.

Note: All API endpoints require authentication. For endpoints that work with remote files (import_type=remote), the API uses the file system permissions associated with the authenticated user's session.

Overview

The File Import API allows you to:

  • Upload files from your local system
  • Analyze file metadata to determine format and characteristics
  • Check whether files have headers
  • Preview file content with data types
  • Get SQL type mappings for different SQL dialects

Typical Import Workflow

A typical workflow for importing a file into a database table involves these steps (a complete, runnable example combining them appears in the Complete Workflow Example section below):

  1. Upload the file using the /api/importer/upload/file/ endpoint
  2. Detect file metadata using the /api/importer/file/guess_metadata/ endpoint
  3. Determine if the file has a header using the /api/importer/file/guess_header/ endpoint
  4. Preview the file with column type detection using the /api/importer/file/preview/ endpoint
  5. Use the preview data to create a table in your SQL engine of choice

Upload a Local File

Upload a file from your local system to the Hue server.

Endpoint: /api/importer/upload/file/

Method: POST

Content Type: multipart/form-data

Request Parameters:

Name Type Required Description
file File Yes The file to upload (CSV, TSV, or Excel)

Example using cURL:

curl -X POST \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@/path/to/sales_data.csv" \
  https://demo.gethue.com/api/importer/upload/file/

Example using JavaScript:

// Using fetch API
const formData = new FormData();
formData.append('file', fileInputElement.files[0]);

fetch('https://demo.gethue.com/api/importer/upload/file/', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  },
  body: formData
})
.then(response => response.json())
.then(data => console.log(data))
.catch(error => console.error('Error:', error));

Response:

{
  "file_path": "/tmp/username_abc123_sales_data.csv"
}

Status Codes:

  • 201 Created - File was uploaded successfully
  • 400 Bad Request - Invalid file format or size
  • 500 Internal Server Error - Server-side error

Restrictions:

  • Maximum file size is determined by the configuration IMPORTER.MAX_LOCAL_FILE_SIZE_UPLOAD_LIMIT
  • Certain file extensions may be restricted based on IMPORTER.RESTRICT_LOCAL_FILE_EXTENSIONS
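
As an example, a client can fail fast on oversized files before the request is sent. The limit is a server-side configuration value that the client cannot query directly, so MAX_UPLOAD_BYTES below is a hypothetical client-side constant you would keep in sync with your server's setting:

// Sketch: fail fast on oversized files before uploading.
// MAX_UPLOAD_BYTES is a hypothetical client-side constant; the authoritative
// limit is the server's IMPORTER.MAX_LOCAL_FILE_SIZE_UPLOAD_LIMIT setting.
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024; // e.g. 100 MiB

function validateFileForUpload(file) {
  if (file.size > MAX_UPLOAD_BYTES) {
    throw new Error(`File too large: ${file.size} bytes (limit: ${MAX_UPLOAD_BYTES})`);
  }
  return file;
}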

Guess File Metadata

Analyze a file to determine its type and metadata properties such as delimiters for CSV files or sheet names for Excel files.

Endpoint: /api/importer/file/guess_metadata/

Method: GET

Request Parameters:

Name Type Required Description
file_path String Yes Full path to the file to analyze
import_type String Yes Type of import, either local or remote

Example using cURL:

# For a local file uploaded previously
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/guess_metadata/?file_path=/tmp/username_abc123_sales_data.csv&import_type=local"

# For a remote file on HDFS
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/guess_metadata/?file_path=/user/hue/data/sales_data.csv&import_type=remote"

Example using JavaScript:

// Using fetch API for a local file
fetch('https://demo.gethue.com/api/importer/file/guess_metadata/?file_path=/tmp/username_abc123_sales_data.csv&import_type=local', {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
})
.then(response => response.json())
.then(metadata => console.log(metadata))
.catch(error => console.error('Error:', error));

Response Examples:

For CSV files:

{
  "type": "csv",
  "field_separator": ",",
  "quote_char": "\"",
  "record_separator": "\n"
}

For TSV files:

{
  "type": "tsv",
  "field_separator": "\t",
  "quote_char": "\"",
  "record_separator": "\n"
}

For Excel files:

{
  "type": "excel",
  "sheet_names": ["Sales 2024", "Sales 2025", "Analytics"]
}

Status Codes:

  • 200 OK - Metadata detected successfully
  • 400 Bad Request - File not found or invalid parameters
  • 500 Internal Server Error - Server-side error during detection

Guess File Header

Analyze a file to determine if it has a header row. This API uses heuristics to detect if the first row appears to contain column names rather than data.

Endpoint: /api/importer/file/guess_header/

Method: GET

Request Parameters:

Name Type Required Description
file_path String Yes Full path to the file to analyze
file_type String Yes Type of file (csv, tsv, excel, delimiter_format)
import_type String Yes Type of import, either local or remote
sheet_name String No Sheet name (required for Excel files)

Example using cURL:

# For a CSV file
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/guess_header/?file_path=/tmp/username_abc123_sales_data.csv&file_type=csv&import_type=local"

# For an Excel file
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/guess_header/?file_path=/tmp/username_abc123_financial_report.xlsx&file_type=excel&import_type=local&sheet_name=Q1_Results"

Example using JavaScript:

const params = new URLSearchParams({
  file_path: '/tmp/username_abc123_sales_data.csv',
  file_type: 'csv',
  import_type: 'local'
});

fetch(`https://demo.gethue.com/api/importer/file/guess_header/?${params.toString()}`, {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
})
.then(response => response.json())
.then(result => {
  console.log("Has header:", result.has_header);
  // Use this information to configure preview or import
})
.catch(error => console.error('Error:', error));

Response:

{
  "has_header": true
}

How Header Detection Works:

The header detection algorithm uses multiple factors to make an educated guess:

  1. Checks if the first row contains text while subsequent rows contain numeric values
  2. Examines if the first row's data types differ from the majority of other rows
  3. Looks for common header naming patterns such as id, name, or date
  4. Checks if the first row has no duplicate values while data rows might have duplicates
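
To make these heuristics concrete, here is a simplified, illustrative sketch of the first two factors. This is not the server's actual implementation, just an approximation of the idea applied to a parsed sample of rows:

// Illustrative sketch only -- not the server's implementation.
// rows is an array of string arrays, e.g. [["id", "name"], ["1", "Alice"], ...]
function looksLikeHeader(rows) {
  if (rows.length < 2) return false;
  const isNumeric = value =>
    typeof value === 'string' && value.trim() !== '' && !isNaN(Number(value));

  // A column whose first cell is text but whose data cells are all numeric
  // suggests the first row contains column names rather than data.
  return rows[0].some((cell, col) => {
    const dataCells = rows.slice(1).map(row => row[col]);
    return !isNumeric(cell) && dataCells.every(isNumeric);
  });
}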

Status Codes:

  • 200 OK - Header detection successful
  • 400 Bad Request - Invalid parameters or file not found
  • 500 Internal Server Error - Server-side error

Preview File

Generate a preview of a file's content with column type mapping for creating SQL tables.

Endpoint: /api/importer/file/preview/

Method: GET

Request Parameters:

Name Type Required Description
file_path String Yes Full path to the file to preview
file_type String Yes Type of file (csv, tsv, excel, delimiter_format)
import_type String Yes Type of import (local or remote)
sql_dialect String Yes SQL dialect for type mapping (hive, impala, trino, phoenix, sparksql)
has_header Boolean Yes Whether the file has a header row
sheet_name String No Sheet name (required for Excel files)
field_separator String No Field separator character (defaults to , for CSV and \t for TSV; required for delimiter_format)
quote_char String No Quote character (defaults to ")
record_separator String No Record separator character (defaults to \n)

Example using cURL:

# For a CSV file with header
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/preview/?file_path=/tmp/username_abc123_sales_data.csv&file_type=csv&import_type=local&sql_dialect=hive&has_header=true"

# For an Excel file with header
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/preview/?file_path=/tmp/username_abc123_financial_report.xlsx&file_type=excel&import_type=local&sql_dialect=impala&has_header=true&sheet_name=Q1_Results"

# For a custom pipe-delimited file using delimiter_format
# (separator characters must be URL-encoded: | is %7C, " is %22, \n is %0A)
curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/file/preview/?file_path=/tmp/username_abc123_pipe_data.txt&file_type=delimiter_format&import_type=local&sql_dialect=hive&has_header=true&field_separator=%7C&quote_char=%22&record_separator=%0A"

Example using JavaScript:

// Parameters for previewing a CSV file
const params = new URLSearchParams({
  file_path: '/tmp/username_abc123_sales_data.csv',
  file_type: 'csv',
  import_type: 'local',
  sql_dialect: 'hive',
  has_header: 'true'
});

// Make the fetch request
fetch(`https://demo.gethue.com/api/importer/file/preview/?${params.toString()}`, {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
})
.then(response => response.json())
.then(previewData => {
  // Use preview data to display to user or generate CREATE TABLE statement
  console.log(previewData);
})
.catch(error => console.error('Error:', error));

Response:

For a sales data CSV file:

{
  "type": "csv",
  "columns": [
    {
      "name": "transaction_id",
      "type": "INT"
    },
    {
      "name": "product_name",
      "type": "STRING"
    },
    {
      "name": "category",
      "type": "STRING"
    },
    {
      "name": "price",
      "type": "DOUBLE"
    },
    {
      "name": "quantity",
      "type": "INT"
    },
    {
      "name": "purchase_date",
      "type": "DATE"
    }
  ],
  "preview_data": [
    ["1001", "Laptop XPS 13", "Electronics", "1299.99", "1", "2025-01-15"],
    ["1002", "Wireless Headphones", "Electronics", "149.99", "2", "2025-01-15"],
    ["1003", "Office Chair", "Furniture", "249.50", "1", "2025-01-16"],
    ["1004", "Notebook Set", "Office Supplies", "24.95", "5", "2025-01-17"]
  ]
}

Using Preview Data to Generate a CREATE TABLE Statement:

You can use the preview data to generate SQL statements for table creation. Here's an example for Hive:

CREATE TABLE sales_data (
  transaction_id INT,
  product_name STRING,
  category STRING,
  price DOUBLE,
  quantity INT, 
  purchase_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

About delimiter_format File Type:

The delimiter_format file type allows you to process custom delimited files that don't follow standard CSV or TSV formats. When using this file type:

  • field_separator is required and must be explicitly specified
  • quote_char and record_separator should be provided for proper parsing
  • Common examples include pipe-delimited files (|), semicolon-delimited files (;), or files with custom record separators
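
Because characters such as |, tab, and newline are significant in URLs, they must be URL-encoded in the query string. URLSearchParams handles this automatically, which is one reason to prefer it over hand-built query strings:

// URLSearchParams percent-encodes the separator characters for you:
// '|' becomes %7C, '"' becomes %22, '\n' becomes %0A.
const params = new URLSearchParams({
  file_path: '/tmp/username_abc123_pipe_data.txt',
  file_type: 'delimiter_format',
  import_type: 'local',
  sql_dialect: 'hive',
  has_header: 'true',
  field_separator: '|',
  quote_char: '"',
  record_separator: '\n'
});

console.log(params.toString());
// ...&field_separator=%7C&quote_char=%22&record_separator=%0A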

Parameter Validation and Best Practices:

  • For Excel files, sheet_name is required and the API will return a 400 error if not provided
  • For standard delimited files (csv, tsv), default values will be applied if optional parameters are not specified
  • For delimiter_format files, you must explicitly provide at least the field_separator
  • For all file types, it's recommended to pass the values returned by the guess_metadata endpoint:
    • Use field_separator from guess_metadata response
    • Use quote_char from guess_metadata response
    • Use record_separator from guess_metadata response
  • When import_type is set to remote, the API uses the file system associated with the user's request

Status Codes:

  • 200 OK - Preview generated successfully
  • 400 Bad Request - Invalid parameters or file not found
  • 500 Internal Server Error - Server-side error

Get SQL Type Mapping

Get the mapping from Polars data types to SQL types for a specific SQL dialect. This helps translate detected column types into appropriate SQL data types when creating tables.

Endpoint: /api/importer/sql_type_mapping/

Method: GET

Request Parameters:

Name Type Required Description
sql_dialect String Yes SQL dialect for type mapping (hive, impala, trino, phoenix, sparksql)

Example using cURL:

curl -X GET \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  "https://demo.gethue.com/api/importer/sql_type_mapping/?sql_dialect=hive"

Example using JavaScript:

fetch('https://demo.gethue.com/api/importer/sql_type_mapping/?sql_dialect=hive', {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
})
.then(response => response.json())
.then(typeMappings => {
  console.log(typeMappings);
  // Use type mappings to generate SQL schema
})
.catch(error => console.error('Error:', error));

Response:

{
  "Int8": "TINYINT",
  "Int16": "SMALLINT",
  "Int32": "INT",
  "Int64": "BIGINT",
  "UInt8": "TINYINT",
  "UInt16": "SMALLINT",
  "UInt32": "INT",
  "UInt64": "BIGINT",
  "Float32": "FLOAT",
  "Float64": "DOUBLE",
  "Decimal": "DECIMAL",
  "Boolean": "BOOLEAN",
  "Utf8": "STRING",
  "String": "STRING",
  "Binary": "BINARY",
  "Date": "DATE",
  "Time": "TIMESTAMP",
  "Datetime": "TIMESTAMP",
  "Duration": "INTERVAL DAY TO SECOND",
  "Array": "ARRAY",
  "List": "ARRAY",
  "Struct": "STRUCT"
}
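
One way to use this mapping is to translate Polars dtype names from your own schema inference into dialect-specific SQL column types. The STRING fallback below is an assumption for unmapped types, not documented server behavior:

// Sketch: translate Polars dtype names to SQL types using the fetched mapping.
function toSqlColumns(columns, typeMappings) {
  return columns.map(({ name, dtype }) => ({
    name,
    type: typeMappings[dtype] || 'STRING' // assumed fallback for unmapped dtypes
  }));
}

// Usage:
// toSqlColumns([{ name: 'price', dtype: 'Float64' }], typeMappings)
// -> [{ name: 'price', type: 'DOUBLE' }]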

Dialect-Specific Type Mappings

Different SQL dialects have variations in type names. Here are some key differences:

Trino Types (compared to Hive):

{
  "Int32": "INTEGER",
  "UInt32": "INTEGER",
  "Utf8": "VARCHAR",
  "String": "VARCHAR",
  "Binary": "VARBINARY",
  "Float32": "REAL",
  "Struct": "ROW",
  "Object": "JSON"
}

Phoenix Types (compared to Hive):

{
  "UInt8": "UNSIGNED_TINYINT",
  "UInt16": "UNSIGNED_SMALLINT",
  "UInt32": "UNSIGNED_INT",
  "UInt64": "UNSIGNED_LONG",
  "Utf8": "VARCHAR",
  "String": "VARCHAR",
  "Binary": "VARBINARY",
  "Time": "TIME"
}

Status Codes:

  • 200 OK - Mapping retrieved successfully
  • 400 Bad Request - Invalid SQL dialect
  • 500 Internal Server Error - Server-side error

Error Handling

All endpoints follow a consistent error response format:

{
  "error": "Error message describing what went wrong"
}

Example Error Responses:

File not found:

{
  "error": "File /tmp/nonexistent_file.csv not found"
}

Invalid parameter:

{
  "error": "Sheet name is required for Excel files"
}

File size limit exceeded:

{
  "error": "File too large. Maximum file size is 100 MiB."
}

Authentication errors will return a standard HTTP 401 Unauthorized response.
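
Since every endpoint shares this error shape, a small response handler can centralize error extraction. This is a sketch built on the documented { "error": ... } format:

// Sketch: shared response handler for the File Import API endpoints.
async function handleImporterResponse(response) {
  if (response.ok) {
    return response.json();
  }
  if (response.status === 401) {
    throw new Error('Unauthorized: check your authentication token');
  }
  // Fall back to the status code if the body is not the expected JSON shape.
  const body = await response.json().catch(() => ({}));
  throw new Error(body.error || `Request failed with status ${response.status}`);
}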

Note: Authentication tokens can be obtained through Hue's authentication API. See the Authentication API documentation for details.

Complete Workflow Example

Here's an example workflow that combines all the APIs to import a CSV file into a Hive table:

Step 1: Upload the file

// Upload the CSV file
const formData = new FormData();
formData.append('file', fileInputElement.files[0]);

const uploadResponse = await fetch('https://demo.gethue.com/api/importer/upload/file/', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  },
  body: formData
});

const uploadResult = await uploadResponse.json();
const filePath = uploadResult.file_path;

Step 2: Detect file metadata

// Get file metadata
const metadataParams = new URLSearchParams({
  file_path: filePath,
  import_type: 'local'
});

const metadataResponse = await fetch(`https://demo.gethue.com/api/importer/file/guess_metadata/?${metadataParams.toString()}`, {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
});

const metadata = await metadataResponse.json();
const fileType = metadata.type;
const fieldSeparator = metadata.field_separator;
const quoteChar = metadata.quote_char;
const recordSeparator = metadata.record_separator;

Step 3: Check for header row

// Detect if file has a header
const headerParams = new URLSearchParams({
  file_path: filePath,
  file_type: fileType,
  import_type: 'local'
});

const headerResponse = await fetch(`https://demo.gethue.com/api/importer/file/guess_header/?${headerParams.toString()}`, {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
});

const headerResult = await headerResponse.json();
const hasHeader = headerResult.has_header;

Step 4: Preview the file with column type detection

// Generate file preview with SQL types
const previewParams = new URLSearchParams({
  file_path: filePath,
  file_type: fileType,
  import_type: 'local',
  sql_dialect: 'hive',
  has_header: hasHeader,
  field_separator: fieldSeparator,
  quote_char: quoteChar,
  record_separator: recordSeparator
});

const previewResponse = await fetch(`https://demo.gethue.com/api/importer/file/preview/?${previewParams.toString()}`, {
  method: 'GET',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  }
});

const preview = await previewResponse.json();

Step 5: Generate SQL CREATE TABLE statement

// Generate SQL CREATE TABLE statement
const tableName = 'imported_sales_data';
const columns = preview.columns.map(col => `  ${col.name} ${col.type}`).join(',\n');

const createTableSQL = `CREATE TABLE ${tableName} (\n${columns}\n)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '${fieldSeparator}'
STORED AS TEXTFILE;`;

console.log(createTableSQL);

// You can then execute this SQL using the SQL execution API
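
// Hedged illustration: assuming Hue's Editor API endpoint
// /api/editor/execute/<dialect> with a "statement" form field is available
// in your deployment (confirm against Hue's API documentation), the
// generated statement could be submitted like this:
const executeResponse = await fetch('https://demo.gethue.com/api/editor/execute/hive', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer <YOUR_JWT_TOKEN>'
  },
  body: new URLSearchParams({ statement: createTableSQL })
});
console.log(await executeResponse.json());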

Best Practices

  1. Error Handling: Always implement proper error handling for API responses, especially for file uploads which may fail due to size limitations or network issues.

  2. Progress Feedback: For large file uploads, consider implementing progress indicators using XMLHttpRequest (XHR) instead of fetch to track upload progress (see the sketch after this list).

  3. Type Checking: When parsing and validating file data, always check that the inferred types match what you expect for your business data.

  4. Sequential Processing: Process API calls in the proper sequence as shown in the workflow example to ensure each step has the required data.

  5. User Confirmation: Allow users to review and modify the automatic type detection before creating tables, especially for large datasets.

  6. Access Control: Ensure file permissions are properly set for any uploaded files, especially in multi-user environments.

  7. Cleanup: Consider implementing cleanup mechanisms for temporary uploaded files that are no longer needed. Files uploaded via the /api/importer/upload/file/ endpoint are stored in temporary locations (typically /tmp/) and should be cleaned up after processing.
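
As referenced in best practice 2, here is a minimal sketch of upload progress tracking with XMLHttpRequest, whose upload.onprogress event exposes byte-level progress that fetch does not readily provide:

// Sketch: upload with progress tracking via XMLHttpRequest.
function uploadWithProgress(file, onProgress) {
  return new Promise((resolve, reject) => {
    const formData = new FormData();
    formData.append('file', file);

    const xhr = new XMLHttpRequest();
    xhr.open('POST', 'https://demo.gethue.com/api/importer/upload/file/');
    xhr.setRequestHeader('Authorization', 'Bearer <YOUR_JWT_TOKEN>');

    // Fires periodically during the upload with bytes sent so far.
    xhr.upload.onprogress = event => {
      if (event.lengthComputable) onProgress(event.loaded / event.total);
    };

    xhr.onload = () => {
      xhr.status === 201
        ? resolve(JSON.parse(xhr.responseText))
        : reject(new Error(`Upload failed with status ${xhr.status}`));
    };
    xhr.onerror = () => reject(new Error('Network error during upload'));
    xhr.send(formData);
  });
}

// Usage:
// uploadWithProgress(fileInputElement.files[0], p => console.log(`${Math.round(p * 100)}%`));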