Application Scenario: New Retail Scenes (Convenience Store)

1. Introduction
Usage Scenario
The New Retail Scenario Multimodal Perception Memory Reasoning Model is designed for new retail settings and can be used by developers of convenience-store and unmanned-supermarket applications, researchers, and educators.
Target Users
- Convenience Store & Unmanned Supermarket Application Developers: The New Retail Scenario Multimodal Perception Memory Reasoning Model (Convenience Store) currently includes 221 categories and can be applied to scenarios such as convenience store product detection, product classification, and intelligent ordering. It can also be used to quickly build a personal product library for classification and detection, shortening the development cycle.
- Research Users: Supports researchers in optimizing algorithms, scene perception, and large model testing in the field of robot vision, promoting research on intelligent applications.
- Educational Users: Used for robot perception and vision detection experiments in teaching, allowing students to practice theoretical knowledge and design and complete related experimental projects.
2. Quick Start
Basic Environment Preparation
Supporting Version Requirements
| Item | Version |
| --- | --- |
| Operating System | Ubuntu 20.04 |
| Architecture | x86 |
| Graphics Driver | nvidia-driver-535 |
| Python | 3.8 |
| pip | 24.9.1 |
Install conda and Python Environment
Install the conda package management tool and the corresponding Python environment. For details, please refer to Install Conda and Python Environment.
Build Python Environment
Create a conda virtual environment.
```bash
conda create -n faiss python=3.10 -y
```
Activate the virtual environment.
```bash
conda activate faiss
```
Check the Python version.
```bash
python -V
```
Check the pip version.
```bash
pip -V
```
Update pip to the latest version.
```bash
pip install -U pip
```
Install Python Environment Third-Party Package Dependencies
Install the basic environment + faiss.
```bash
conda install -n faiss pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -n faiss faiss-gpu -c pytorch
```
Install dependent libraries.
```bash
python -m pip install -U huggingface_hub keyboard psutil datasets pillow IPython transformers==4.38.2
```
Install yolo dependencies.
```bash
cd ../ObjectLocator_Classifer
python -m pip install -r requirements.txt
```
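Optionally, you can verify that the environment sees the GPU before continuing. The short check below is a minimal sketch, not part of the provided code; it only assumes the packages installed above (PyTorch and faiss-gpu).

```python
# Optional sanity check for the "faiss" environment; not part of the provided code.
import torch
import faiss

print("torch version:", torch.__version__)           # expected 2.1.0
print("CUDA available:", torch.cuda.is_available())  # should be True with nvidia-driver-535
print("faiss GPUs visible:", faiss.get_num_gpus())   # > 0 when faiss-gpu can see the GPU
```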
Get the Code
Please visit Code Download to obtain the model code.
Get the Model (Provided in the zip package)
Please visit Model Download to obtain the model files. Below is the corresponding model file description.
| No. | File Name | Description |
| --- | --- | --- |
| 1 | shelve_platform_augment_truncation_1206.pt | Detection model weights. Receives video stream data and produces detection boxes from RGB images and depth images. |
| 2 | dino_vectors_huojia_0401 | Vector comparison library. The input feature vectors are compared against this library to obtain the specific category of each detected object. |
| 3 | dinov2_model | Feature extraction model. Extracts features from the cropped detection-box images; the resulting feature vectors are used in the subsequent vector similarity search. |
| 4 | CDNet.pth | Performs depth restoration on RGB images to obtain the corresponding depth images. |
Run the Code
```python
import cv2
# Solver, memory_Inference, and realsense are provided by the downloaded model code.

# Set the paths
rgbd_model_path = "Set to the local path of the yolo-rgbd directory"
rgbd_init_weight_path = "Set to the local path of the CDNet.pth file"
clip_model_path = "Set to the local path of the dinov2_model directory"
faiss_path = "Set to the local path of the dino_vectors_huojia_0401 directory"

# Initialize parameters
camera_width = 1280
camera_height = 720
target_imgWidth = 640
target_imgHeight = 360
conf = 0.55

# Instantiate the objects
solver = Solver()
rgbd_dinov2_inference = memory_Inference()

# Initialize the models
rgbd_model, solver, clip_model, clip_processor, category_name_faiss, index_faiss = rgbd_dinov2_inference.gen_model(
    rgbd_model_path, solver, rgbd_init_weight_path, clip_model_path, faiss_path)

# Instantiate the camera class
camera = realsense.RealSenseCamera()
camera.set_resolution(1280, 720)

# Start the camera stream and get intrinsic information
camera.start_camera()

while True:
    # The camera reads the aligned RGB + depth stream
    color_frame, _, _, point_cloud, _ = camera.read_align_frame()
```
3. API Invocation
Model Invocation Start
```python
# Continues inside the while loop above, once per camera frame.
# Perform deep-learning depth restoration
resize_color_frame = cv2.resize(color_frame, (target_imgWidth, target_imgHeight))
depth_img_restore = cv2.cvtColor(solver.test(img=resize_color_frame), cv2.COLOR_GRAY2RGB)
if len(color_frame):
    # Perform RGB-D localization
    rgbd_result, width_scale, height_scale = rgbd_dinov2_inference.detect_inerence(
        rgbd_model, color_frame, depth_img_restore, conf=conf,
        target_imgWidth=target_imgWidth, target_imgHeight=target_imgHeight)
    # Display results on the frame
    annotated_frame = rgbd_result[0].plot()
    # Extract image features with the dinov2 classifier and run a faiss vector similarity search
    rgbd_result, results, total_cls_results = rgbd_dinov2_inference.classify_memory_inference(
        rgbd_result, color_frame, width_scale, height_scale, clip_model, clip_processor,
        category_name_faiss, index_faiss, k=4)
    # Visualization
    rgbd_dinov2_inference.plot(rgbd_result, total_cls_results)
```
Key Parameter Explanation:
- Invocation input:
  - color_frame: the camera image frame.
- Result output:
  - rgbd_result: detection box coordinates and categories.
  - results: short-term memory, long-term memory, knowledge memory, and network memory obtained from the perception reasoning process.
  - total_cls_results: perception category results.
4. Feature Introduction
The New Retail Scenario Multimodal Perception Memory Reasoning Model covers 221 product categories in the new retail scenario and can perceive and classify all of them.
Usage Conditions
The detected new retail items must be within the camera detection range.
Feature Details
- The yolo-rgbd recognition model processes the captured video to obtain precise localization and segmentation results for each object; these are used to build a feature library and to provide fine-grained crops to the classification model.
- Based on the constructed feature library, the clip classifier model extracts features from the image library; the extracted feature vectors are saved to files and used as the feature vector library for real-time feature comparison.
- During real-time recognition, the original image is cropped according to the positions reported by the detector, resized to the input size required by the model, and fed into the clip classifier model.
- The dinov2 classifier extracts features from the image, and the faiss library compares these features against the feature vector library to obtain the category result (a minimal sketch of this retrieval step follows this list).
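To illustrate how the last three steps fit together, the sketch below extracts a DINOv2 feature for one cropped detection and looks it up in a faiss index. It is a minimal, self-contained example rather than the shipped pipeline: the checkpoint name facebook/dinov2-base, the randomly generated feature library, and the placeholder category names are assumptions for illustration; in the actual model, the dinov2_model and dino_vectors_huojia_0401 files play these roles.

```python
# Illustrative sketch: DINOv2 feature extraction + faiss nearest-neighbour lookup.
# The checkpoint, feature library, and category names below are placeholders.
import numpy as np
import torch
import faiss
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")  # assumed checkpoint
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def extract_feature(image: Image.Image) -> np.ndarray:
    """Return an L2-normalised DINOv2 CLS feature for one cropped image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0]   # CLS token, shape (1, 768)
    cls = torch.nn.functional.normalize(cls, dim=-1)
    return cls.cpu().numpy().astype("float32")

# Build a toy feature vector library (in practice this is the saved vector file).
library = np.random.rand(221, 768).astype("float32")
faiss.normalize_L2(library)
index = faiss.IndexFlatIP(768)                 # inner product == cosine after normalisation
index.add(library)
category_names = [f"category_{i}" for i in range(221)]  # placeholder names

# Query with one crop and keep the top-4 matches, as in the pipeline above (k=4).
query = extract_feature(Image.new("RGB", (224, 224)))
scores, ids = index.search(query, 4)
print([(category_names[i], float(s)) for i, s in zip(ids[0], scores[0])])
```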
5. Update Log
| Update Date | Update Content |
| --- | --- |
| 2025.04.16 | New content added |
6. Support and Feedback
- To learn more about Realman's product series, please visit: Realman Official Website
- To learn more about usage issues of Realman products, please visit: Realman Academy
- For more questions about Realman products, please contact us by email. Official email: sales@realman-robot.com
- To stay up-to-date with the latest news about Realman products, please follow: Official WeChat public account
7. Copyright and License Agreement
This project follows the MIT License.