

Problem Statement
Why multi-modal genre classification?
Movie genre classification is an important part of recommendation systems, and correct classification is key to ensuring movies reach the right audience. However, because movies often span several genres, and because the boundaries between genre definitions are rarely hard, relying on a single data modality limits our ability to classify movies correctly. Our team will design our models using a multi-modal approach that combines textual data (movie plot overviews) and visual data (movie posters). At the current stage we treat the problem as multiclass, with at most 3 genres per label. We will evaluate the performance of an image-only model, a text-only model, and a fusion model.
Approach & Goals
The following sections will outline our methods for handling data, image modeling, text modeling, and fusion modeling. Our goals in this approach are to learn and experiment with data preprocessing, model selection and design, and model fusion, as well as to simulate the workflow of projects in the professional world.
Data Preprocessing
Data Lead: Max Pierson
My overall strategy for dataset creation was to collect as much data as possible. Once collection was complete, the most frequently occurring multigenre combinations were used in our datasets, and we undersampled among these chosen classes to create a balanced final dataset. Several different datasets were created and tested throughout this project; however, collection of the data for all sets followed the same basic plan:
- Movie Data Collection
Using the TMDB API with the tmdbsimple wrapper, I searched movies by ID and collected the following into a CSV file:
- ID
- Title
- Overview
- Poster Path
- Genres
- Poster Collection
Using the poster path, all images were collected as 500x750 JPEGs from TMDB. In the final datasets, each poster is saved as a 256x256 JPEG inside the folder ‘posters.zip’ and is labelled with its unique TMDB movie ID.
- Data Cleaning
After data was collected, any rows with missing values or bad poster paths were removed. This ‘bad’ data was saved for future use, if needed.
- Undersampling
After the data was cleaned, heavily represented classes (generally at least 1,000 instances) were selected, then undersampled to create a balanced distribution.
- Train-Test Split
The remaining data was split into training, test, and validation sets at a ratio of 80% training, 10% test, and 10% validation.
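A minimal sketch of this collection-and-cleaning pipeline, assuming a valid TMDB API key and the tmdbsimple wrapper mentioned above (the ID range and field handling are illustrative):

```python
import csv
import tmdbsimple as tmdb

tmdb.API_KEY = "YOUR_TMDB_API_KEY"  # assumed: reader supplies their own key

def collect_movie(movie_id: int):
    """Fetch one movie's record from TMDB; return None for 'bad' data."""
    try:
        info = tmdb.Movies(movie_id).info()
    except Exception:
        return None  # invalid ID or request failure
    if not info.get("overview") or not info.get("poster_path"):
        return None  # rows with missing values are removed during cleaning
    return {
        "id": info["id"],
        "title": info["title"],
        "overview": info["overview"],
        "poster_path": info["poster_path"],  # image at image.tmdb.org/t/p/w500<path>
        "genres": [g["name"] for g in info["genres"]],
    }

with open("movies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "overview", "poster_path", "genres"])
    writer.writeheader()
    for movie_id in range(1, 1_000):  # toy range; the real runs sample far more IDs
        row = collect_movie(movie_id)
        if row:
            writer.writerow(row)
```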


ID Selection
My strategy for ID data was to collect as many data points as possible to generate more instances of the rarer genres, allowing us to undersample the more strongly represented genres and still have a large final dataset. To this end, TMDB IDs were collected using the following three methods:

IDs were randomly selected from all available IDs in TMDB until 100,000 movies were collected. This resulted in quite messy data, with only 47,324 movies remaining after cleaning. It also revealed a significant disparity in genre representation, so a second method was developed:

IDs were collected using genre requests through tmdbsimple, which return up to 500 pages of movies for each requested genre. This yielded 117,370 movies after cleaning, and a much better representation of the rarer genres. However, to ensure the robustness of our dataset, we added a third set:

IDs were collected from the Kaggle dataset 'TMDB_movie_dataset_v11.csv', which offers data on over 1 million movies. After cleaning, I was left with 361,940 movies.
Podge
I then combined these three datasets, removing any duplicate IDs. During this process, I also noted that some movies in the database were duplicates listed under different IDs. Though some movies share the same title without being duplicates, we decided the safest measure was to remove any duplicate titles from the combined set. This process resulted in the podge dataset, named for its origin as a hodgepodge of different datasets, which contains data on 346,934 unique movies.
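A sketch of this merge-and-deduplication step in pandas (the CSV file names are hypothetical):

```python
import pandas as pd

# Hypothetical file names for the three ID-collection runs described above.
frames = [pd.read_csv(p) for p in ("random_ids.csv", "genre_ids.csv", "kaggle_ids.csv")]
podge = pd.concat(frames, ignore_index=True)
podge = podge.drop_duplicates(subset="id")     # exact duplicate TMDB IDs
podge = podge.drop_duplicates(subset="title")  # same-title entries, per the text
```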

As seen above, a significant imbalance in genre representation remains. I further reduced this dataset to movies carrying 1 to 3 labels and, to address the imbalance, grouped related genres together. The result is podge_combi_3g, a set containing 333,783 movies belonging to 366 different genre combinations. Of these combinations, 39 have over 2,000 instances, and each single-genre label has over 3,000 instances. By undersampling these labels down to those limits, I derived two multiclass datasets: one multigenre and one single-genre.
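The undersampling step can be sketched as a grouped sample, continuing from the podge frame above (the 'label' column name is an assumption; the cap is 2,000 for multigenre labels and 3,000 for single-genre labels):

```python
# Assumes a 'label' column holding each movie's genre-combination string.
cap = 2_000
balanced = (podge.groupby("label", group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), cap), random_state=42)))
```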


Podge_3g_combi_2000
- 39 multigenre labels
- 78,000 movies, 2,000 per label

Podge_1g_combi_3000
- 13 single-genre labels
- 38,857 movies
- Note: Crime_Mystery has only 2,857 movies, all others have 3,000
Image Model
Image Model Lead: Max Pierson
For this project I experimented with transfer learning on several architectures, using pretrained ImageNet1k weights. The only changes made to these architectures were the output shape, adjusted to fit our dataset, and the dropout probability, capped at p=0.5. After some experimentation, I found the best performance came from training all layers without freezing any. Using this strategy, I experimented on Podge_1g_combi_3000 with the following model architectures:
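As an illustration, a minimal sketch of this setup in torchvision (the helper names are mine; only the head swap and the dropout cap reflect the text above):

```python
import torch.nn as nn
from torchvision import models

def build_resnet50(num_classes: int = 13) -> nn.Module:
    # Start from ImageNet1k-pretrained weights; no layers are frozen.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    # Replace the 1000-class ImageNet head with our genre head.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def cap_dropout(model: nn.Module, p_max: float = 0.5) -> None:
    # Cap every dropout layer's probability at p_max (p=0.5 at most, per the text).
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = min(m.p, p_max)
```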
VGG
VGG was designed as an experiment in increasing the depth of a CNN. A CNN uses convolutional layers to apply filters to an input image to extract important features, similar to edge and shape detection. VGG uses a standard CNN architecture with 3x3 convolutional filters and increases depth to up to 19 layers. Though no longer considered state of the art, early experimentation with it provides a baseline for loss and accuracy to build from.
EfficientNet
EfficientNet uses a set of fixed scaling coefficients for the width, depth, and resolution in order to scale up model size efficiently, allowing for improved accuracy while maintaining a lower computational cost and resource expenditure on parameter searching. Experimentation with this model will be done with versions B0 and B2, as we do not currently have the computational resources to handle the larger versions.
ResNet
ResNet addresses the issue of vanishing gradients in deep CNNs: the tendency for gradients to approach zero during backpropagation through the early layers of a deep network. It achieves this with the residual block, a CNN block in which a copy of the block's input is added directly to its output. This allows the model to go much deeper without losing training stability. We expect this architecture to provide the best results for our dataset.
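The core idea can be sketched as a simplified residual block (an illustration of the skip connection, not torchvision's exact ResNet50 bottleneck block):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x  # the skip connection carries the input forward unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # add the input back before activation
```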
Vision Transformer
A Vision Transformer (ViT) is a self-attention-based architecture that eliminates the need for convolutional layers by converting the input image into a sequence of linearly embedded patches, allowing the image to be treated like a tokenized sequence of words. This technique has been shown to be successful on very large datasets (well over 10 million images) but is not expected to outperform CNN architectures at our dataset's size.
Image Transforms
To reduce the chance of overfitting, I apply several transforms to the incoming training data. For alignment I use RandomResizedCrop followed by RandomHorizontalFlip; for color, ColorJitter and RandomGrayscale. After some experimentation, I noted some benefit from adding TrivialAugmentWide. Finally, the image is converted to a tensor and normalized, with the mean and standard deviation computed inline for the chosen image and batch sizes.
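A sketch of this augmentation pipeline in torchvision; the jitter strengths and grayscale probability are illustrative, and the ImageNet statistics below stand in for the inline per-batch mean/std calculation described above:

```python
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # assumed strengths
    transforms.RandomGrayscale(p=0.1),                                     # assumed probability
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet stats as placeholders
                         std=[0.229, 0.224, 0.225]),
])
```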
Experimentation & Results
Baseline: VGG19
- SGD with momentum = 0, weight decay = 1e-5
- Learning Rate Scheduler: Cosine Annealing with a 5 epoch linear warmup
- Max LR: 0.05
- Training Length: 30 epochs
Final Loss and Accuracy Values:
- Train loss: 1.7032 | Train Micro Accuracy: 0.4371
- Val Loss: 1.7619 | Val Micro Accuracy: 0.4439
- Test Loss: 1.7031 | Test Micro Accuracy: 0.4478


As seen above, preventing overfitting on VGG required regularization in the form of weight decay, which resulted in a less smooth loss curve. Even so, these results show an ability to generalize at around 44% accuracy. Tests with my subsequent model choices initially used the same training settings as VGG19's and were then iteratively altered to find the best results for each model.
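For reference, one way to realize the baseline optimizer and schedule in PyTorch, chaining a 5-epoch linear warmup into cosine annealing (the warmup start factor is a guess):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from torchvision import models

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.0, weight_decay=1e-5)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)  # 5-epoch linear warmup
cosine = CosineAnnealingLR(optimizer, T_max=25)                 # anneal over remaining epochs
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(30):
    # ... one training epoch ...
    scheduler.step()  # stepped once per epoch
```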
EfficientNet_B0
- SGD with momentum = 0, weight decay = 0
- Learning Rate Scheduler: Cosine Annealing with a 5 epoch linear warmup
- Max LR: 0.05
- Training Length: 30 epochs
Final Loss and Accuracy Values:
- Train loss: 1.6621 | Train Micro Accuracy: 0.4513
- Val Loss: 1.6618 | Val Micro Accuracy: 0.4521
- Test Loss: 1.9171 | Test Micro Accuracy: 0.4604



EfficientNet_B0's training yielded a much smoother loss curve and slightly better results, with a final accuracy of 46%. Looking at some of the mislabelled posters shows several occurrences where the predicted label is understandable: for instance, a Sherlock Holmes movie that should have been classified as Mystery, but whose poster depicts a man in a historical setting.
ResNet50
- SGD with momentum = 0, weight decay = 0
- Learning Rate Scheduler: Cosine Annealing with a 5 epoch linear warmup
- Max LR: 0.05
- Training Length: 30 epochs
Final Loss and Accuracy Values:
- Train loss: 1.6008 | Train Micro Accuracy: 0.4718
- Val Loss: 1.6314 | Val Micro Accuracy: 0.4696
- Test Loss: 1.6091 | Test Micro Accuracy: 0.4776



ResNet50 shows improvement over EfficientNet, but once again we see mislabelling from ambiguous, and potentially mislabelled, data. This will be discussed further at the end of this section.
ViT_B_32
- AdamW with weight decay = 0.2
- Learning Rate Scheduler: Cosine Annealing with a 5 epoch linear warmup
- Max LR: 1e-3
- Training Length: 40 epochs
Final Loss and Accuracy Values:
- Train loss: 1.6324 | Train Micro Accuracy: 0.4601
- Val Loss: 1.6873 | Val Micro Accuracy: 0.4547
- Test Loss: 1.6962 | Test Micro Accuracy: 0.4478



My testing with ViT_B_32 required significant alterations to the optimizer settings to prevent overfitting, and it was unable to improve upon ResNet50's loss and accuracy. This aligns with my earlier hypothesis that a vision transformer would be limited by our relatively small dataset compared to its usual use cases.
Further Analysis
At this stage I have determined that ResNet is the best model to fit our current dataset. Though it is able to achieve the best accuracy and loss of our tested models, our results are still far from ideal. Further analysis of the model's ability to handle our data can be done by looking at the precision and recall of each individual label in the validation set, as well as the confusion matrix:
ResNet50 Label Metrics:
| Genre | Precision | Recall |
|---|---|---|
| Action_Adventure | 0.4770 | 0.4806 |
| Animation | 0.6338 | 0.7575 |
| Comedy | 0.5452 | 0.5779 |
| Crime_Mystery | 0.3600 | 0.2595 |
| Documentary | 0.3826 | 0.3932 |
| Drama | 0.3829 | 0.3983 |
| Drama, Family_TV Movie | 0.4056 | 0.3385 |
| Fantasy_Science Fiction | 0.3449 | 0.3024 |
| History_War_Western | 0.5872 | 0.6210 |
| Horror | 0.5122 | 0.6172 |
| Music | 0.5242 | 0.5418 |
| Romance | 0.5040 | 0.5497 |
| Thriller | 0.3359 | 0.2792 |

Aside from the standout of Animation, the model seems to be underfitting in all categories. A quick look at some of the mislabelled movies shows that some posters can be quite ambiguous:


Others can be attributed to limitations of the dataset. On further analysis of this film, I found that it should have been labelled as a combination of Comedy and Horror, but TMDB lists only Comedy. Overall, this shows the importance of a multi-modal architecture for cases where label boundaries are not distinct, as the additional context provided by a plot overview could dramatically improve our predictive capabilities. It also shows the importance of a robust dataset with well-vetted labels.
Image Model Discussion
At this stage, training ResNet50 with pretrained ImageNet1k weights on our dataset yields 47.76% test accuracy with 1.6091 loss. Working through this project, I gained experience in the use of APIs, web scraping, and dataset maintenance and analysis, and learned that the simplest and safest strategy to prevent overfitting is to collect more data for underrepresented labels. Additionally, I experimented with model training, regularization, learning rate scheduling, and transfer learning, and found that the newest and biggest model is not always the best for a given dataset. Our next steps would be to cross-reference our TMDB genre data with genre data from another site, such as IMDb, to ensure the accuracy of our labels. Additionally, future attempts would explore treating this problem as multilabel, using binary cross-entropy loss, rather than as multiclass. Initial experimentation showed a validation accuracy of 80% after just 3 epochs, but further analysis showed that this measurement was not reflected in the predictions:

Further research into prediction thresholds, suitable multilabel metrics, and techniques for sampling well-balanced train and test sets from heavily imbalanced and overlapping multilabel data will be required before this method could prove viable.
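As a toy illustration of how that inflated figure arises: with 13 genre slots and only a few positives per movie, element-wise accuracy rewards a model that predicts nothing at all (the numbers below are synthetic):

```python
import torch

# 32 movies, 13 genre slots, exactly one positive label each (toy setup).
targets = torch.zeros(32, 13)
targets[torch.arange(32), torch.randint(0, 13, (32,))] = 1.0
preds = torch.zeros(32, 13)  # degenerate model: predicts no genres at all
acc = (preds == targets).float().mean()
print(f"element-wise accuracy: {acc:.3f}")  # 12/13 ≈ 0.923 with zero true positives
```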
Text Model
This implementation develops a multi-label movie genre classification model using a Bidirectional Long Short-Term Memory (BiLSTM) neural network to predict multiple genres from movie plots, leveraging a dataset of 62,957 movies. Below is detailed documentation of every step I performed, organized into paragraphs to clearly outline the process, its purpose, and outcomes.
Environment Setup: I began by configuring the Python environment to support the project's requirements. I imported the essential libraries: pandas and NumPy for data manipulation, NLTK for text preprocessing, scikit-learn for data splitting and evaluation metrics, PyTorch for building and training the BiLSTM model, Gensim for Word2Vec embeddings, and Matplotlib with Seaborn for visualization. I downloaded NLTK's punkt (for word tokenization) and stopwords (for removing common English words) resources, caching them to avoid redundant downloads. To ensure reproducibility, I set random seeds to 42 for both PyTorch and NumPy, guaranteeing consistent data splits and model initialization. I configured the system to use a CUDA GPU if available.
Data Loading and Initial Exploration: I loaded the dataset into a pandas DataFrame, revealing 62,957 movie records with seven columns: Unnamed: 0 (an index artifact), id (unique movie identifier), title (movie name), genre (single primary genre), plot (textual plot description), poster_path (poster URL), and genres (multi-label genres as string lists, e.g., ['Animation', 'Crime', 'Mystery', 'Action']). To explore plot variability, I added a length column with the character count of each plot, finding, for example, 291 characters for Detective Conan: The Lost Ship in the Sky. This step helped assess the diversity of plot descriptions, which ranged from 98 to 452 characters in the sample data, informing preprocessing decisions.
Text Preprocessing: I defined a text cleaning function to standardize plot descriptions and remove noise. The function converts text to lowercase, removes URLs (including links starting with pic), and replaces all characters except letters and apostrophes with spaces. It eliminates standalone single letters (e.g., “ a ”), strips punctuation, and uses NLTK's word_tokenize to split text into words. Common English stopwords (e.g., “the,” “is”) and words with two or fewer characters are removed to focus on meaningful terms. Multiple spaces are normalized, and leading/trailing spaces are trimmed. I applied this function to all plots, creating cleaned versions (e.g., “A terrorist group invades a laboratory” becomes “terrorist group invades laboratory”).
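An approximation of this cleaning function (the regex patterns are my reconstruction of the described behavior, not the exact originals):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # common English words
STOP_WORDS = set(stopwords.words("english"))

def clean_plot(text: str) -> str:
    text = text.lower()
    text = re.sub(r"(pic|http)\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z']", " ", text)        # keep only letters and apostrophes
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)

print(clean_plot("A terrorist group invades a laboratory"))
# -> "terrorist group invades laboratory"
```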
Genre Encoding and Data Splitting: I parsed the genres column, which contained string representations of lists, into actual Python lists. I used scikit-learn's MultiLabelBinarizer to encode these lists into a binary matrix, where each movie is represented by a vector of 19 elements (corresponding to unique genres like Action, Drama, etc.), with 1 indicating a genre's presence and 0 its absence. For example, ['Action', 'Drama'] might yield [1, 0, 0, 1, ...]. I split the padded sequences and genre matrix into training (80%, ~50,366 samples) and testing (20%, ~12,591 samples) sets using train_test_split, ensuring randomized but reproducible splits due to the fixed seed.
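A self-contained sketch of the encoding step; ast.literal_eval parses the string-typed lists, and the toy inputs follow the examples above:

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

raw = ["['Action', 'Drama']", "['Animation', 'Crime', 'Mystery', 'Action']"]
genre_lists = [ast.literal_eval(s) for s in raw]  # string reprs -> Python lists
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genre_lists)
print(mlb.classes_)  # ['Action' 'Animation' 'Crime' 'Drama' 'Mystery']
print(y)             # [[1 0 0 1 0]
                     #  [1 1 1 0 1]]
```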
Word Embeddings: To capture semantic relationships, I trained a Word2Vec model using Gensim on the tokenized plots, generating 100-dimensional word embeddings. The model was configured with a minimum word frequency and context window to optimize embedding quality. I constructed an embedding matrix of shape 10,000 × 100, mapping each word in the tokenizer’s vocabulary to its Word2Vec embedding. Words absent from the Word2Vec model were assigned zero vectors, ensuring compatibility with the model’s embedding layer.
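A sketch of the embedding-matrix construction (the tiny corpus and the tokenizer index map are stand-ins for the real tokenized plots and vocabulary):

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny stand-in corpus; the report trains on all tokenized plots.
tokenized = [["terrorist", "group", "invades", "laboratory"],
             ["detective", "solves", "mystery", "aboard", "airship"]]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, seed=42)

# Hypothetical tokenizer vocabulary: word -> integer index (0 reserved for padding).
word_index = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}

embedding_matrix = np.zeros((10_000, 100))
for word, idx in word_index.items():
    if idx < 10_000 and word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]  # words absent from Word2Vec stay zero
```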
BiLSTM Model Architecture: I designed a BiLSTM classifier in PyTorch with the following components: an embedding layer, a dropout layer (rate 0.3 to prevent overfitting), a two-layer bidirectional LSTM with dropout between layers, and a fully connected layer mapping the concatenated final hidden states to 19 genre scores. The bidirectional LSTM processes sequences forward and backward, capturing contextual dependencies, while the output layer produces raw scores for multi-label classification.
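A sketch of the described architecture (the hidden size of 128 is an assumption; the report does not state it):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, num_genres=19):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.dropout = nn.Dropout(0.3)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden_dim * 2, num_genres)  # concatenated fwd+bwd states

    def forward(self, x):
        emb = self.dropout(self.embedding(x))
        _, (h_n, _) = self.lstm(emb)
        # h_n: (num_layers * 2, batch, hidden); take the last layer's two directions.
        h = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(h)  # raw logits, paired with BCEWithLogitsLoss
```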
Model Training: I created a custom PyTorch Dataset to pair padded sequences with genre vectors, enabling batch processing via DataLoader objects (batch size ~32, with shuffling for training). I used Binary Cross-Entropy with Logits Loss, which combines sigmoid activation and loss computation, and the Adam optimizer (learning rate ~0.001). I trained the model for 10 epochs, computing training and validation losses per epoch. Training involved forward passes, loss calculation, backpropagation, and weight updates, while validation assessed performance without updates. Losses were plotted, showing a downward trend that indicates learning. I saved the model weights to bilstm_model.pth, metadata (e.g., vocabulary size, genre names) to metadata.json, and the tokenizer and MultiLabelBinarizer to tokenizer.pkl and mlb.pkl.
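Continuing the sketch above, a condensed training loop with dummy tensors standing in for the real padded sequences and genre matrix:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Dummy stand-ins: 256 plots padded to 100 token IDs, and 19 binary genre targets.
X_train = torch.randint(0, 10_000, (256, 100))
y_train = torch.randint(0, 2, (256, 19)).float()

model = BiLSTMClassifier().to(device)   # class from the sketch above
criterion = nn.BCEWithLogitsLoss()      # sigmoid + binary cross-entropy in one op
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

for epoch in range(10):
    model.train()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb.to(device)), yb.to(device))
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "bilstm_model.pth")
```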
Model Evaluation: I evaluated the model on the test set in evaluation mode, generating predictions by applying a sigmoid function to the output scores. I produced a classification report detailing precision, recall, and F1-score per genre, noting stronger performance for common genres (e.g., Drama, Action) and weaker results for rare ones (e.g., Western) due to class imbalance. For each genre, I generated confusion matrices (e.g., Action: TN=8338, FP=1833, FN=682, TP=1738), visualizing them as heatmaps with Seaborn and saving them as PNGs (e.g., confusion_matrix_Action.png), providing insight into prediction errors such as false positives.
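The evaluation step, continuing the same sketch (model and device come from the training block above; the 0.5 cutoff is an assumed default threshold):

```python
import torch
from sklearn.metrics import classification_report, multilabel_confusion_matrix

# Dummy held-out split standing in for the real test tensors.
X_test = torch.randint(0, 10_000, (64, 100))
y_test = torch.randint(0, 2, (64, 19)).numpy()

model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(X_test.to(device))).cpu().numpy()
y_pred = (probs >= 0.5).astype(int)
print(classification_report(y_test, y_pred, zero_division=0))
cms = multilabel_confusion_matrix(y_test, y_pred)  # one 2x2 matrix per genre
```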
Inference on Sample Plot: I tested the model on a sample plot: “A group of elite hackers is recruited by a secret government agency to stop a rogue AI that threatens to take over the world's financial systems. As they race against time, they uncover a conspiracy that could change the fate of humanity.” I loaded the saved model weights, metadata, tokenizer, and MultiLabelBinarizer. The plot was cleaned, tokenized, and padded to 100 tokens. The model predicted probabilities, yielding the genres Action, Adventure, Science Fiction, and Thriller, aligning with the plot's themes. This demonstrated the model's practical utility.
Conclusion: The implementation successfully delivers a robust BiLSTM model for multi-label genre classification, with saved components enabling reuse. Challenges like class imbalance and genre overlap suggest future enhancements, such as transformer models or data augmentation, to boost performance.
Fusion Model Overview
This project tackles the problem of multi-label movie genre classification using a fusion model that integrates both visual and textual modalities. Unlike single-label classifiers, our model predicts multiple applicable genres (e.g., Action + Sci-Fi + Thriller) for a single movie. The key innovation lies in leveraging two distinct but complementary data sources—movie posters and plot summaries—to provide a richer understanding of the content. This approach is particularly relevant for recommendation systems, content-based filtering, and media analytics platforms.
The fusion architecture addresses challenges such as:
- High inter-class overlap (e.g., Action vs. Thriller)
- Class imbalance (frequent genres like Drama vs. rare ones like Western)
- Subtle genre cues that may exist in only one modality (e.g., Family themes in text, Horror cues in posters)
Model Selection and Rationale
- ResNet18: Initially used for image classification due to its computational efficiency and fast convergence. It served as a strong baseline but lacked the depth to capture high-level abstractions in movie posters.
- ResNet50: Introduced to overcome limitations of ResNet18. It is a deeper convolutional neural network capable of capturing intricate patterns, visual themes, and compositional structure—crucial for visually rich genres like Fantasy, Horror, and Sci-Fi.
- BERT: Used to encode plot summaries. As a transformer-based model, BERT captures contextual relationships between words and understands long-range dependencies, enabling genre inference even from abstract or indirect plot descriptions. The final `[CLS]` token output was used as the plot representation.
The fusion model combines these features (2048-dim image vector + 768-dim BERT embedding) and passes them through a fully connected classifier. This multi-modal setup significantly improves performance by allowing the model to attend to both literal and visual genre indicators.
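A minimal sketch of such a fusion head; the 2048- and 768-dim inputs come from the text above, while the 512-unit hidden layer and dropout rate are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class FusionClassifier(nn.Module):
    def __init__(self, num_genres: int = 19):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # pooled 2048-dim features
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Sequential(                         # assumed head shape
            nn.Linear(2048 + 768, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_genres),
        )

    def forward(self, images, input_ids, attention_mask):
        img = self.cnn(images).flatten(1)                        # (B, 2048)
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                        # [CLS] token, (B, 768)
        return self.classifier(torch.cat([img, cls], dim=1))     # logits for BCE
```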
Training and Optimization
To train the fusion model effectively:
- Used the AdamW optimizer with weight decay for regularization.
- Employed Binary Cross-Entropy Loss (BCEWithLogitsLoss) with per-genre weights to handle label imbalance.
- Performed dynamic threshold tuning per genre to convert predicted probabilities into binary labels that maximize F1 score (see the sketch after this list).
- Implemented learning rate scheduling and dropout regularization to improve generalization and prevent overfitting.
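The per-genre threshold search can be sketched as a grid over candidate cutoffs (the grid spacing is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(probs: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """For each genre column, pick the cutoff that maximizes that genre's F1."""
    candidates = np.arange(0.05, 0.95, 0.05)
    best = np.full(probs.shape[1], 0.5)
    for g in range(probs.shape[1]):
        scores = [f1_score(targets[:, g], (probs[:, g] >= t).astype(int),
                           zero_division=0) for t in candidates]
        best[g] = candidates[int(np.argmax(scores))]
    return best
```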
Performance Summary
- ResNet18 (Image-Only): Final F1 Score = 0.1896; struggled with recall due to limited feature depth.
- ResNet50 (Image-Only, Optimized): Final F1 Score = 0.3582; deeper visual features led to noticeable gains in genre differentiation.
- Fusion (ResNet50 + BERT): Final F1 Score = 0.6552; significantly better than unimodal models, confirming the value of multi-modal learning.
Visualizations and Insights
The following plots were generated using the fusion model to gain deeper insights into genre prediction behavior:
- F1 Score per Genre: Shows genre-wise prediction quality. Genres like Crime, Documentary, and Romance scored high, while low-frequency genres like Family and Western had weaker performance.

This bar chart presents the F1 score for each movie genre, offering a clear picture of how well the fusion model performs on individual labels. Genres like Romance, Crime, and Documentary exhibit the highest F1 scores—above 0.80—indicating that the model excels at identifying these genres accurately and consistently. This strong performance is likely due to the distinctive combination of visual and textual patterns present in both the poster and plot for these genres. Mid-range performance is observed in genres like Action, Comedy, and Science Fiction, where the model maintains a good balance between precision and recall. However, it’s important to note that performance dips significantly for underrepresented or ambiguous genres such as Family, Western, and War—a direct consequence of class imbalance in the training dataset. This plot is valuable in highlighting where the model is confident and reliable, and where further improvement is needed through techniques such as data augmentation, oversampling, or fine-grained label restructuring.


This stacked bar chart breaks down the model’s predictions into three key components—True Positives (TP), False Positives (FP), and False Negatives (FN)—for each genre. It offers a granular view into how well the model handles each class beyond just a single metric like accuracy or F1 score. Genres like Drama, Romance, and Documentary show a dominant blue section, indicating a high number of true positives. This confirms that the model reliably identifies these genres when they are present. However, genres such as Comedy and Crime display a significant proportion of false positives (orange), suggesting that while the model is actively predicting these genres, it occasionally does so incorrectly—possibly due to overlapping features with other genres like Action or Thriller. Notably, genres like Family, TV Movie, and Thriller exhibit a large share of false negatives (green), indicating the model often misses these genres when they should be predicted. This reveals an important limitation—these genres may lack strong or unique cues in the input data, or they may be underrepresented in the training set. This chart is especially useful for debugging genre-specific weaknesses, as it visually highlights where the model is overconfident (high FP) or under-sensitive (high FN). It reinforces the need for balancing the dataset, fine-tuning genre-specific thresholds, or employing targeted augmentation strategies.

The ROC (Receiver Operating Characteristic) curve provides a genre-by-genre evaluation of the model's ability to distinguish between positive and negative classes. The curve plots the true positive rate (sensitivity) against the false positive rate for different decision thresholds. The Area Under the Curve (AUC) score, shown in parentheses for each genre, quantifies this performance: a score close to 1.0 indicates excellent separability. Genres such as Documentary (AUC = 0.95), Crime (0.92), Romance (0.90), and Animation (0.94) demonstrate outstanding AUC values, suggesting the model is highly effective at correctly identifying these genres with minimal false alarms. Even visually and semantically nuanced genres like Fantasy and Horror achieve AUC scores above 0.80, highlighting the strength of the multi-modal fusion approach in complex prediction scenarios. On the other hand, genres like Drama (AUC = 0.68) show relatively lower AUC scores. This may reflect their high intra-class variability or overlapping features with other genres (e.g., Romance, History). Furthermore, several genres (including War, Western, and TV Movie) returned an AUC of NaN, indicating either insufficient positive samples in the test set or an absence of predictions. This reinforces the importance of addressing class imbalance during training. Overall, this ROC curve visualization validates the fusion model's robust discriminative capability, especially for well-represented genres, and also helps identify where further dataset enrichment is needed.

This histogram visualizes the distribution of the model’s predicted confidence scores across all movie-genre pairs. Each bar represents the number of predictions that fall within a specific confidence interval, giving insight into how confident the model is when assigning genres. The skewed distribution toward the lower end—particularly around 0.1 to 0.3—indicates that the majority of predictions are made with relatively low confidence. This is typical in multi-label classification, where the model often outputs probabilities for multiple genres, many of which may not be relevant. A long tail extending toward the right suggests that the model is confidently predicting a smaller subset of genre associations—likely for more distinguishable and well-represented genres. This plot is particularly useful for evaluating calibration: whether the predicted probabilities reflect actual correctness. In this case, the distribution supports the decision to implement threshold tuning per genre, allowing the model to make sharper distinctions for high-confidence predictions while suppressing uncertain ones. Overall, this analysis helps ensure that genre predictions are not only accurate but also trustworthy—especially critical for downstream applications like recommendation systems.

This bar chart compares the F1 scores of three different model configurations: Text Only, Image Only, and the full Fusion Model. The goal of this ablation study is to evaluate how each input modality contributes to the overall performance and to validate whether combining them truly adds value. The Fusion Model, which integrates both ResNet50 (image features) and BERT (text features), achieves the highest F1 score—demonstrating that the two modalities provide complementary information that significantly boosts genre classification performance. On the other hand, while the Image Only and Text Only models perform reasonably well in isolation, they lag behind the fusion approach, each missing important context that the other provides. This result strongly justifies the decision to pursue a multi-modal architecture. It also highlights that neither the visual nor textual features alone are sufficient to capture the full complexity of multi-genre movie classification. The fusion strategy delivers a balanced understanding of both what a movie is about and how it is visually presented—leading to more accurate and reliable predictions.
Label Distribution
The dataset contains over 29,000 labeled movie records with multi-label annotations. Genres like Drama, Comedy, and Action are highly represented, while genres such as Music, Western, and War are rare. This imbalance influenced model performance and was addressed with weighted loss functions and threshold optimization.
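One common way to realize the per-genre weighting mentioned above is the negative-to-positive ratio heuristic (my assumption; y_train below is a dummy stand-in for the binary genre matrix):

```python
import numpy as np
import torch

# Dummy binary genre matrix standing in for the real training labels.
y_train = np.random.binomial(1, 0.1, size=(1000, 19)).astype(np.float32)

pos = y_train.sum(axis=0)                        # positives per genre
neg = y_train.shape[0] - pos
pos_weight = torch.tensor(neg / np.clip(pos, 1, None), dtype=torch.float32)
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # up-weights rare genres
```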
Conclusion and Future Work
This project demonstrates a successful implementation of multi-modal learning for complex classification tasks. By combining state-of-the-art visual and textual models—ResNet50 and BERT—we achieved high predictive performance and deep genre understanding. The approach is scalable to other domains like book classification, news categorization, or content moderation.
Future work includes:
- Using attention-based fusion layers instead of simple concatenation.
- Exploring deeper architectures like EfficientNet-V2 and RoBERTa.
- Balancing the dataset with synthetic samples or generative augmentation for rare genres.
- Deploying the model via a web interface for interactive genre prediction.