
New Retail Scenes (Convenience Store)

1. Introduction

Usage Scenario

The New Retail Scenario Multimodal Perception Memory Reasoning Model is designed for new retail settings such as convenience stores and unmanned supermarkets, and can be used by application developers, researchers, and educators.

Target Users

  • Convenience Store & Unmanned Supermarket Application Developers: The New Retail Scenario Multimodal Perception Memory Reasoning Model (Convenience Store) currently includes 221 categories, which can be applied to scenarios such as convenience store product detection, product classification, and intelligent ordering. It can also quickly build a personal product library for classification detection, shortening the development cycle.
  • Research Users: Supports researchers in optimizing algorithms, scene perception, and large model testing in the field of robot vision, promoting research on intelligent applications.
  • Educational Users: Used for robot perception and vision detection experiments in teaching, allowing students to practice theoretical knowledge and design and complete related experimental projects.

2. Quick Start

Basic Environment Preparation

Supporting Version Requirements

| Project | Version |
| --- | --- |
| Operating System | Ubuntu 20.04 |
| Architecture | x86 |
| Graphics Driver | nvidia-driver-535 |
| Python | 3.8 |
| pip | 24.9.1 |

Install conda and Python Environment

Install the conda package management tool and the corresponding Python environment. For details, please refer to Install Conda and Python Environment.

Build Python Environment

  1. Create a conda virtual environment.

    bash
    conda create -n faiss python=3.10 -y
  2. Activate the virtual environment.

    bash
    conda activate faiss
  3. Check the Python version.

    bash
    python -V
  4. Check the pip version.

    bash
    pip -V
  5. Update pip to the latest version.

    bash
    pip install -U pip

Install Python Environment Third-Party Package Dependencies

  1. Install the basic environment + faiss.

    bash
    conda install -n faiss pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
    conda install -n faiss faiss-gpu -c pytorch
  2. Install dependent libraries.

    bash
    python -m pip install -U huggingface_hub keyboard psutil datasets pillow IPython transformers==4.38.2
  3. Install yolo dependencies.

    bash
    cd ../ObjectLocator_Classifer
    python -m pip install -r requirements.txt
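
After the installation steps above, a quick sanity check (a minimal sketch, not part of the original instructions) can confirm that the GPU builds are importable:

    python
    import torch
    import faiss
    import transformers

    # PyTorch build and CUDA availability
    print(torch.__version__, torch.cuda.is_available())

    # faiss-gpu build and visible GPU count
    print(faiss.__version__, faiss.get_num_gpus())

    # Pinned transformers version (4.38.2)
    print(transformers.__version__)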

Get the Code

Please visit Code Download to obtain the model code.

Get the Model (Provided in the zip package)

Please visit Model Download to obtain the model files. Below is the corresponding model file description.

| No. | File Name | Description |
| --- | --- | --- |
| 1 | shelve_platform_augment_truncation_1206.pt | Detection model for the video stream; produces detection boxes from RGB images and depth images. |
| 2 | dino_vectors_huojia_0401 | Vector comparison library; input feature vectors are compared against this library to obtain the specific category of the detected object. |
| 3 | dinov2_model | Extracts features from the detected box images; the resulting feature vectors are used for the subsequent vector similarity search. |
| 4 | CDNet.pth | Performs depth restoration on RGB images to obtain the corresponding depth images. |
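
Before running the code below, it can help to verify that the unpacked model files sit where the path variables expect them. This is a minimal sketch; the directory layout and model_root placeholder are assumptions based on the table above, not a layout prescribed by the package.

    python
    import os

    # Hypothetical local layout after unpacking the model zip; adjust to your machine
    model_root = "/path/to/models"

    paths = {
        "rgbd_model_path": os.path.join(model_root, "yolo-rgbd"),            # assumed to contain shelve_platform_augment_truncation_1206.pt
        "rgbd_init_weight_path": os.path.join(model_root, "CDNet.pth"),      # depth restoration weights
        "clip_model_path": os.path.join(model_root, "dinov2_model"),         # dinov2 feature extractor
        "faiss_path": os.path.join(model_root, "dino_vectors_huojia_0401"),  # faiss vector library
    }

    for name, path in paths.items():
        print(f"{name}: {path} -> {'found' if os.path.exists(path) else 'missing'}")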

Run the Code

python
# cv2 and the project modules (realsense, Solver, memory_Inference) come from the downloaded
# model code; the exact import paths depend on the repository layout.

# Set the model paths
rgbd_model_path = "Set to the local path of the yolo-rgbd directory"
rgbd_init_weight_path = "Set to the local path of the CDNet.pth file"
clip_model_path = "Set to the local path of the dinov2_model directory"
faiss_path = "Set to the local path of the dino_vectors_huojia_0401 directory"

# Initialize parameters
camera_width=1280
camera_height=720
target_imgWidth = 640
target_imgHeight = 360
conf=0.55

# Instantiate the object
solver = Solver()
rgbd_dinov2_inference = memory_Inference()

# Initialize the model
rgbd_model, solver, clip_model, clip_processor, category_name_faiss, index_faiss = rgbd_dinov2_inference.gen_model(
    rgbd_model_path, solver, rgbd_init_weight_path, clip_model_path, faiss_path)

# Instantiate the camera class
camera = realsense.RealSenseCamera()
camera.set_resolution(camera_width, camera_height)

# Start the camera stream and get intrinsic information
camera.start_camera()
while True:
    # Camera reads the aligned RGB + depth stream
    color_frame, _, _, point_cloud, _ = camera.read_align_frame()

3. API Invocation

Model Invocation Start

python
    # The following runs inside the while loop started in the previous section.

    # Reconstruct the depth image from the RGB frame (deep-learning depth restoration with CDNet)
    resize_color_frame = cv2.resize(color_frame, (target_imgWidth, target_imgHeight))
    depth_img_restore = cv2.cvtColor(solver.test(img=resize_color_frame), cv2.COLOR_GRAY2RGB)
    if len(color_frame):

        # Perform RGBD positioning
        rgbd_result, width_scale, height_scale = rgbd_dinov2_inference.detect_inerence(
            rgbd_model, color_frame, depth_img_restore, conf=conf,
            target_imgWidth=target_imgWidth, target_imgHeight=target_imgHeight)

        # Display detection results on the frame
        annotated_frame = rgbd_result[0].plot()

        # Extract image features with the dinov2 classifier and perform vector similarity search with faiss
        rgbd_result, results, total_cls_results = rgbd_dinov2_inference.classify_memory_inference(
            rgbd_result, color_frame, width_scale, height_scale, clip_model, clip_processor,
            category_name_faiss, index_faiss, k=4)

        # Visualization
        rgbd_dinov2_inference.plot(rgbd_result, total_cls_results)

Key Parameter Explanation:

  • Invocation Input: color_frame: The camera image frame.
  • Result Output: rgbd_result: Detection box coordinates and categories. results: Short-term memory, long-term memory, knowledge memory, and network memory obtained from the perception reasoning process. total_cls_results: Perception category results (see the sketch below).
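
The structures of these outputs are defined by the model code. As a rough illustration, the sketch below assumes that rgbd_result[0] follows the Ultralytics YOLO Results format (with boxes.xyxy) and that total_cls_results is a list of category names parallel to the detection boxes; both are assumptions, not confirmed by the shipped code.

    python
    # Minimal sketch of consuming the outputs inside the loop (assumed structures, see above)
    boxes = rgbd_result[0].boxes
    for box, category in zip(boxes.xyxy, total_cls_results):
        x1, y1, x2, y2 = [int(v) for v in box.tolist()]
        print(f"{category}: ({x1}, {y1}) -> ({x2}, {y2})")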

4. Feature Introduction

The New Retail Scenario Multimodal Perception Memory Reasoning Model covers a total of 221 product categories in the new retail scenario, with the ability to perceive and classify them.

Usage Conditions

The new retail items to be detected must be within the camera's field of view.

Feature Details

  • The yolo-rgbd recognition model processes the captured video to obtain the object's position and segmentation results, which are used both to build the feature library and to provide fine-grained crops to the classification model.
  • From the constructed feature library, the clip classifier model extracts features from the image library; the extracted feature vectors are saved to files and serve as the feature vector library for real-time comparison.
  • During real-time recognition, the original image is cropped according to the positions returned by the detector, resized to the input size required by the model, and fed into the clip classifier model.
  • The dinov2 classifier extracts features from the cropped image, and the faiss library compares them against the feature vector library to produce the category result (see the sketch after this list).
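
As a rough illustration of the extract-and-search step described above, the sketch below assumes the dinov2_model directory can be loaded with the Hugging Face transformers AutoModel / AutoImageProcessor API and that the vector library is a flat faiss index over L2-normalized features. The file names and the library/category placeholders are hypothetical and only stand in for the shipped dino_vectors_huojia_0401 data; the real pipeline lives in memory_Inference.

    python
    import faiss
    import numpy as np
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    def extract_feature(image, processor, model):
        """Return one L2-normalized dinov2 feature vector (CLS token) for a cropped image."""
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        vec = outputs.last_hidden_state[:, 0].numpy().astype("float32")
        faiss.normalize_L2(vec)
        return vec

    # Load the feature extractor from the local dinov2_model directory (assumed transformers format)
    processor = AutoImageProcessor.from_pretrained(clip_model_path)
    model = AutoModel.from_pretrained(clip_model_path)

    # Hypothetical library crops and labels; in the shipped pipeline these come from the feature library
    library_crops = [Image.open(p) for p in ["crop_cola.jpg", "crop_chips.jpg"]]
    category_names = ["cola", "chips"]

    # Build a cosine-similarity index (inner product over normalized vectors)
    library_vectors = np.vstack([extract_feature(img, processor, model) for img in library_crops])
    index = faiss.IndexFlatIP(library_vectors.shape[1])
    index.add(library_vectors)

    # Compare one detected crop against the library and keep the top matches
    detected_crop = Image.open("detected_box.jpg")  # hypothetical crop from the detector
    scores, ids = index.search(extract_feature(detected_crop, processor, model), 4)
    print([category_names[i] for i in ids[0] if i >= 0], scores[0])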

5. Update Log

| Update Date | Update Content |
| --- | --- |
| 2025.04.16 | New content added |

6. Support and Feedback

  1. To learn more about Realman's product series, please visit: Realman Official Website
  2. To learn more about usage issues of Realman products, please visit: Realman Academy
  3. For more questions about Realman products, please contact us via email. Official email: sales@realman-robot.com
  4. To stay up-to-date with the latest news about Realman products, please follow the official WeChat public account.

This project follows the MIT License.