
New Retail Scenes (Convenience Store)

1. Introduction

Usage Scenario

The New Retail Scenario Multimodal Perception Memory Reasoning Model is designed for new retail settings such as convenience stores and unmanned supermarkets, and can be used by application developers, researchers, and educators.

Target Users

  • Convenience Store & Unmanned Supermarket Application Developers: The New Retail Scenario Multimodal Perception Memory Reasoning Model (Convenience Store) currently includes 221 categories, which can be applied to scenarios such as convenience store product detection, product classification, and intelligent ordering. It can also quickly build a personal product library for classification detection, shortening the development cycle.
  • Research Users: Supports researchers in optimizing algorithms, scene perception, and large model testing in the field of robot vision, promoting research on intelligent applications.
  • Educational Users: Used for robot perception and vision detection experiments in teaching, allowing students to practice theoretical knowledge and design and complete related experimental projects.

2. Quick Start

Basic Environment Preparation

Supporting Version Requirements

| Project | Version |
| --- | --- |
| Operating System | Ubuntu 20.04 |
| Architecture | x86 |
| Graphics Driver | nvidia-driver-535 |
| Python | 3.8 |
| pip | 24.9.1 |

Install conda and Python Environment

Install the conda package management tool and the corresponding Python environment. For details, please refer to Install Conda and Python Environment.

Build Python Environment

  1. Create a conda virtual environment.

    bash
    conda create -n faiss python=3.10 -y
  2. Activate the virtual environment.

    bash
    conda activate faiss
  3. Check the Python version.

    bash
    python -V
  4. Check the pip version.

    bash
    pip -V
  5. Update pip to the latest version.

    bash
    pip install -U pip

Install Python Environment Third-Party Package Dependencies

  1. Install the basic environment + faiss.

    bash
    conda install -n faiss pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
    conda install -n faiss faiss-gpu -c pytorch
  2. Install dependent libraries.

    bash
    python -m pip install -U huggingface_hub keyboard psutil datasets pillow IPython transformers==4.38.2
  3. Install yolo dependencies.

    bash
    cd ../ObjectLocator_Classifer
    python -m pip install -r requirements.txt
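
After the installation steps above, a quick sanity check (a minimal sketch, not part of the original instructions) can confirm that the GPU builds are importable:

    python
    import torch
    import faiss
    import transformers

    # PyTorch build and CUDA availability
    print(torch.__version__, torch.cuda.is_available())

    # faiss-gpu build and visible GPU count
    print(faiss.__version__, faiss.get_num_gpus())

    # Pinned transformers version (4.38.2)
    print(transformers.__version__)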

Get the Code

Please visit Code Download to obtain the model code.

Get the Model (Provided in the zip package)

Please visit Model Download to obtain the model files. Below is the corresponding model file description.

| No. | File Name | Description |
| --- | --- | --- |
| 1 | shelve_platform_augment_truncation_1206.pt | Detection model for the video stream; produces detection boxes from RGB images and depth images. |
| 2 | dino_vectors_huojia_0401 | Vector comparison library; input feature vectors are compared against this library to obtain the specific category of the detected object. |
| 3 | dinov2_model | Extracts features from the detected box images; the resulting feature vectors are used for the subsequent vector similarity search. |
| 4 | CDNet.pth | Performs depth restoration on RGB images to obtain the corresponding depth images. |
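
Before running the code below, it can help to verify that the unpacked model files sit where the path variables expect them. This is a minimal sketch; the directory layout and model_root placeholder are assumptions based on the table above, not a layout prescribed by the package.

    python
    import os

    # Hypothetical local layout after unpacking the model zip; adjust to your machine
    model_root = "/path/to/models"

    paths = {
        "rgbd_model_path": os.path.join(model_root, "yolo-rgbd"),            # assumed to contain shelve_platform_augment_truncation_1206.pt
        "rgbd_init_weight_path": os.path.join(model_root, "CDNet.pth"),      # depth restoration weights
        "clip_model_path": os.path.join(model_root, "dinov2_model"),         # dinov2 feature extractor
        "faiss_path": os.path.join(model_root, "dino_vectors_huojia_0401"),  # faiss vector library
    }

    for name, path in paths.items():
        print(f"{name}: {path} -> {'found' if os.path.exists(path) else 'missing'}")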

Run the Code

python
# cv2 and the project modules (realsense, Solver, memory_Inference) come from the downloaded
# model code; the exact import paths depend on the repository layout.

# Set the model paths
rgbd_model_path = "Set to the local path of the yolo-rgbd directory"
rgbd_init_weight_path = "Set to the local path of the CDNet.pth file"
clip_model_path = "Set to the local path of the dinov2_model directory"
faiss_path = "Set to the local path of the dino_vectors_huojia_0401 directory"

# Initialize parameters
camera_width=1280
camera_height=720
target_imgWidth = 640
target_imgHeight = 360
conf=0.55

# Instantiate the object
solver = Solver()
rgbd_dinov2_inference = memory_Inference()

# Initialize the model
rgbd_model, solver, clip_model, clip_processor, category_name_faiss, index_faiss = rgbd_dinov2_inference.gen_model(
    rgbd_model_path, solver, rgbd_init_weight_path, clip_model_path, faiss_path)

# Instantiate the camera class
camera = realsense.RealSenseCamera()
camera.set_resolution(camera_width, camera_height)

# Start the camera stream and get intrinsic information
camera.start_camera()
while True:
    # Camera reads the aligned RGB + depth stream
    color_frame, _, _, point_cloud, _ = camera.read_align_frame()

3. API Invocation

Model Invocation Start

python
    # The following runs inside the while loop started in the previous section.

    # Reconstruct the depth image from the RGB frame (deep-learning depth restoration with CDNet)
    resize_color_frame = cv2.resize(color_frame, (target_imgWidth, target_imgHeight))
    depth_img_restore = cv2.cvtColor(solver.test(img=resize_color_frame), cv2.COLOR_GRAY2RGB)
    if len(color_frame):

        # Perform RGBD positioning
        rgbd_result, width_scale, height_scale = rgbd_dinov2_inference.detect_inerence(
            rgbd_model, color_frame, depth_img_restore, conf=conf,
            target_imgWidth=target_imgWidth, target_imgHeight=target_imgHeight)

        # Display detection results on the frame
        annotated_frame = rgbd_result[0].plot()

        # Extract image features with the dinov2 classifier and perform vector similarity search with faiss
        rgbd_result, results, total_cls_results = rgbd_dinov2_inference.classify_memory_inference(
            rgbd_result, color_frame, width_scale, height_scale, clip_model, clip_processor,
            category_name_faiss, index_faiss, k=4)

        # Visualization
        rgbd_dinov2_inference.plot(rgbd_result, total_cls_results)

Key Parameter Explanation:

  • Invocation Input: color_frame: The camera image frame.
  • Result Output: rgbd_result: Detection box coordinates and categories. results: Short-term memory, long-term memory, knowledge memory, and network memory obtained from the perception reasoning process. total_cls_results: Perception category results (see the sketch below).
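
The structures of these outputs are defined by the model code. As a rough illustration, the sketch below assumes that rgbd_result[0] follows the Ultralytics YOLO Results format (with boxes.xyxy) and that total_cls_results is a list of category names parallel to the detection boxes; both are assumptions, not confirmed by the shipped code.

    python
    # Minimal sketch of consuming the outputs inside the loop (assumed structures, see above)
    boxes = rgbd_result[0].boxes
    for box, category in zip(boxes.xyxy, total_cls_results):
        x1, y1, x2, y2 = [int(v) for v in box.tolist()]
        print(f"{category}: ({x1}, {y1}) -> ({x2}, {y2})")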

4. Feature Introduction

The New Retail Scenario Multimodal Perception Memory Reasoning Model covers a total of 221 product categories in the new retail scenario, with the ability to perceive and classify them.

Usage Conditions

The new retail items to be detected must be within the camera's field of view.

Feature Details

  • The yolo-rgbd recognition model processes the captured video to obtain the object's position and segmentation results, which are used both to build the feature library and to provide fine-grained crops to the classification model.
  • From the constructed feature library, the clip classifier model extracts features from the image library; the extracted feature vectors are saved to files and serve as the feature vector library for real-time comparison.
  • During real-time recognition, the original image is cropped according to the positions returned by the detector, resized to the input size required by the model, and fed into the clip classifier model.
  • The dinov2 classifier extracts features from the cropped image, and the faiss library compares them against the feature vector library to produce the category result (see the sketch after this list).
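
As a rough illustration of the extract-and-search step described above, the sketch below assumes the dinov2_model directory can be loaded with the Hugging Face transformers AutoModel / AutoImageProcessor API and that the vector library is a flat faiss index over L2-normalized features. The file names and the library/category placeholders are hypothetical and only stand in for the shipped dino_vectors_huojia_0401 data; the real pipeline lives in memory_Inference.

    python
    import faiss
    import numpy as np
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    def extract_feature(image, processor, model):
        """Return one L2-normalized dinov2 feature vector (CLS token) for a cropped image."""
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        vec = outputs.last_hidden_state[:, 0].numpy().astype("float32")
        faiss.normalize_L2(vec)
        return vec

    # Load the feature extractor from the local dinov2_model directory (assumed transformers format)
    processor = AutoImageProcessor.from_pretrained(clip_model_path)
    model = AutoModel.from_pretrained(clip_model_path)

    # Hypothetical library crops and labels; in the shipped pipeline these come from the feature library
    library_crops = [Image.open(p) for p in ["crop_cola.jpg", "crop_chips.jpg"]]
    category_names = ["cola", "chips"]

    # Build a cosine-similarity index (inner product over normalized vectors)
    library_vectors = np.vstack([extract_feature(img, processor, model) for img in library_crops])
    index = faiss.IndexFlatIP(library_vectors.shape[1])
    index.add(library_vectors)

    # Compare one detected crop against the library and keep the top matches
    detected_crop = Image.open("detected_box.jpg")  # hypothetical crop from the detector
    scores, ids = index.search(extract_feature(detected_crop, processor, model), 4)
    print([category_names[i] for i in ids[0] if i >= 0], scores[0])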

5. Update Log

| Update Date | Update Content |
| --- | --- |
| 2025.04.16 | New content added |

6. Support and Feedback

  1. To learn more about Realman's product series, please visit: Realman Official Website
  2. To learn more about usage issues of Realman products, please visit: Realman Academy
  3. For more questions about Realman products, please contact us via email. Official email: sales@realman-robot.com
  4. To stay up-to-date with the latest news about Realman products, please follow the official WeChat public account.

This project follows the MIT License.