2023-05-25 Update
Deep Image Compression Using Scene Text Quality Assessment
Authors: Shohei Uchigasaki, Tomo Miyazaki, Shinichiro Omachi
Image compression is a fundamental technology for Internet communication engineering. However, a high compression rate with general-purpose methods may degrade images, rendering their text unreadable. In this paper, we propose an image compression method that maintains text quality. We developed a scene text image quality assessment model to assess text quality in compressed images. The assessment model iteratively searches for the most compressed image that still retains high-quality text. Objective and subjective results showed that the proposed method is superior to existing methods. Furthermore, the proposed assessment model outperformed other deep-learning regression models.
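A minimal sketch of the iterative search described in the abstract, not the authors' implementation: it assumes a hypothetical text-quality regressor `quality_model` (mapping an image tensor to a readability score in [0, 1]) and uses JPEG as a stand-in codec.

```python
import io
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

def search_best_compression(img: Image.Image,
                            quality_model: torch.nn.Module,
                            min_text_score: float = 0.8) -> bytes:
    """Return the smallest JPEG whose predicted scene-text quality stays acceptable."""
    quality_model.eval()
    best_bytes = None
    buf = io.BytesIO()
    for q in range(95, 4, -10):                      # sweep JPEG quality 95 -> 5
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=q)
        decoded = Image.open(io.BytesIO(buf.getvalue())).convert("RGB")
        x = to_tensor(decoded).unsqueeze(0)          # 1 x 3 x H x W, values in [0, 1]
        with torch.no_grad():
            score = quality_model(x).item()          # predicted text readability
        if score >= min_text_score:
            best_bytes = buf.getvalue()              # smaller file, text still readable
        else:
            break                                    # text degraded; stop searching
    return best_bytes if best_bytes is not None else buf.getvalue()
```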
PDF Accepted by Pattern Recognition, 2023
Click here to view paper screenshots
Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
Authors: Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao
Text-driven 3D scene generation is widely applicable to video games, film production, and metaverse applications, which have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which can generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF so that it reflects the scene description. Specifically, we employ the diffusion model to infer a text-related image as the content prior and use a monocular depth estimation method to provide the geometric prior. Both the content and geometric priors are used to update the NeRF model. To guarantee textural and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data, only a natural language description of the scene as input. Extensive experiments demonstrate that Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts.
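A high-level sketch of the progressive inpainting-and-updating loop suggested by the abstract, not the released Text2NeRF code; all helpers (`diffusion_model`, `depth_estimator`, `nerf`, the camera `poses`) are hypothetical placeholders standing in for the content prior, geometric prior, and NeRF optimizer.

```python
def text2nerf_sketch(prompt, nerf, diffusion_model, depth_estimator, poses, steps=2000):
    # 1. Content prior: a text-conditioned image generated for the reference view.
    ref_rgb = diffusion_model.generate(prompt)
    # 2. Geometric prior: monocular depth estimated for the same view.
    ref_depth = depth_estimator(ref_rgb)
    nerf.fit(views=[(poses[0], ref_rgb, ref_depth)], steps=steps)

    # 3. Progressive inpainting: render each novel view, fill unseen (disoccluded)
    #    regions with the diffusion model, re-estimate depth, and update the NeRF
    #    so that texture and geometry stay consistent across views.
    for pose in poses[1:]:
        rgb, unseen_mask = nerf.render(pose)          # mask marks unobserved pixels
        rgb = diffusion_model.inpaint(rgb, unseen_mask, prompt)
        depth = depth_estimator(rgb)
        nerf.fit(views=[(pose, rgb, depth)], steps=steps)
    return nerf
```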
PDF Homepage: https://eckertzhang.github.io/Text2NeRF.github.io/
Click here to view paper screenshots
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
Pre-trained vision-language (VL) models are the de facto foundation models for various downstream tasks. However, this trend has not extended to the field of scene text recognition (STR), despite the potential of CLIP to serve as a powerful scene text reader. CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. With such merits, we introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. We believe our method establishes a simple but strong baseline for future STR research with VL models.
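A minimal sketch of the dual predict-and-refine decoding described in the abstract, not the official CLIP4STR code; `visual_branch` and `cross_modal_branch` are hypothetical encoder-decoder modules standing in for the two branches.

```python
import torch

@torch.no_grad()
def predict_and_refine(image_feat: torch.Tensor,
                       visual_branch: torch.nn.Module,
                       cross_modal_branch: torch.nn.Module) -> torch.Tensor:
    # 1. Visual branch: initial character predictions from the visual feature alone.
    initial_logits = visual_branch(image_feat)               # B x T x vocab
    initial_tokens = initial_logits.argmax(dim=-1)           # B x T character ids
    # 2. Cross-modal branch: refine the prediction by reconciling the visual
    #    feature with the text semantics of the initial hypothesis.
    refined_logits = cross_modal_branch(image_feat, initial_tokens)
    return refined_logits.argmax(dim=-1)                     # final character ids
```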
PDF Preprint, work in progress