licanglong 3585aeb20b init 3 月之前
..
1_Pooling 3585aeb20b init 3 月之前
2_Dense 3585aeb20b init 3 月之前
img 3585aeb20b init 3 月之前
result 3585aeb20b init 3 月之前
.gitattributes 3585aeb20b init 3 月之前
README.md 3585aeb20b init 3 月之前
config.json 3585aeb20b init 3 月之前
model.safetensors 3585aeb20b init 3 月之前
modules.json 3585aeb20b init 3 月之前
pytorch_model.bin 3585aeb20b init 3 月之前
sentence_bert_config.json 3585aeb20b init 3 月之前
special_tokens_map.json 3585aeb20b init 3 月之前
tokenizer.json 3585aeb20b init 3 月之前
tokenizer_config.json 3585aeb20b init 3 月之前
vocab.txt 3585aeb20b init 3 月之前

README.md


pipeline_tag: sentence-similarity tags:

  • mteb
  • sentence-transformers
  • feature-extraction
  • sentence-similarity model-index:
  • name: acge_text_embedding results:
    • task: type: STS dataset: type: C-MTEB/AFQMC name: MTEB AFQMC config: default split: validation revision: b44c3b0110 metrics:
    • type: cos_sim_pearson value: 54.03434872650919
    • type: cos_sim_spearman value: 58.80730796688325
    • type: euclidean_pearson value: 57.47231387497989
    • type: euclidean_spearman value: 58.80775026351807
    • type: manhattan_pearson value: 57.46332720141574
    • type: manhattan_spearman value: 58.80196022940078
    • task: type: STS dataset: type: C-MTEB/ATEC name: MTEB ATEC config: default split: test revision: 0f319b1142 metrics:
    • type: cos_sim_pearson value: 53.52621290548175
    • type: cos_sim_spearman value: 57.945227768312144
    • type: euclidean_pearson value: 61.17041394151802
    • type: euclidean_spearman value: 57.94553287835657
    • type: manhattan_pearson value: 61.168327500057885
    • type: manhattan_spearman value: 57.94477516925043
    • task: type: Classification dataset: type: mteb/amazon_reviews_multi name: MTEB AmazonReviewsClassification (zh) config: zh split: test revision: 1399c76144 metrics:
    • type: accuracy value: 48.538000000000004
    • type: f1 value: 46.59920995594044
    • task: type: STS dataset: type: C-MTEB/BQ name: MTEB BQ config: default split: test revision: e3dda5e115 metrics:
    • type: cos_sim_pearson value: 68.27529991817154
    • type: cos_sim_spearman value: 70.37095914176643
    • type: euclidean_pearson value: 69.42690712802727
    • type: euclidean_spearman value: 70.37017971889912
    • type: manhattan_pearson value: 69.40264877917839
    • type: manhattan_spearman value: 70.34786744049524
    • task: type: Clustering dataset: type: C-MTEB/CLSClusteringP2P name: MTEB CLSClusteringP2P config: default split: test revision: 4b6227591c metrics:
    • type: v_measure value: 47.08027536192709
    • task: type: Clustering dataset: type: C-MTEB/CLSClusteringS2S name: MTEB CLSClusteringS2S config: default split: test revision: e458b3f541 metrics:
    • type: v_measure value: 44.0526024940363
    • task: type: Reranking dataset: type: C-MTEB/CMedQAv1-reranking name: MTEB CMedQAv1 config: default split: test revision: 8d7f1e9425 metrics:
    • type: map value: 88.65974993133156
    • type: mrr value: 90.64761904761905
    • task: type: Reranking dataset: type: C-MTEB/CMedQAv2-reranking name: MTEB CMedQAv2 config: default split: test revision: 23d1867505 metrics:
    • type: map value: 88.90396838907245
    • type: mrr value: 90.90932539682541
    • task: type: Retrieval dataset: type: C-MTEB/CmedqaRetrieval name: MTEB CmedqaRetrieval config: default split: dev revision: cd540c506d metrics:
    • type: map_at_1 value: 26.875
    • type: map_at_10 value: 39.995999999999995
    • type: map_at_100 value: 41.899
    • type: map_at_1000 value: 42.0
    • type: map_at_3 value: 35.414
    • type: map_at_5 value: 38.019
    • type: mrr_at_1 value: 40.635
    • type: mrr_at_10 value: 48.827
    • type: mrr_at_100 value: 49.805
    • type: mrr_at_1000 value: 49.845
    • type: mrr_at_3 value: 46.145
    • type: mrr_at_5 value: 47.693999999999996
    • type: ndcg_at_1 value: 40.635
    • type: ndcg_at_10 value: 46.78
    • type: ndcg_at_100 value: 53.986999999999995
    • type: ndcg_at_1000 value: 55.684
    • type: ndcg_at_3 value: 41.018
    • type: ndcg_at_5 value: 43.559
    • type: precision_at_1 value: 40.635
    • type: precision_at_10 value: 10.427999999999999
    • type: precision_at_100 value: 1.625
    • type: precision_at_1000 value: 0.184
    • type: precision_at_3 value: 23.139000000000003
    • type: precision_at_5 value: 17.004
    • type: recall_at_1 value: 26.875
    • type: recall_at_10 value: 57.887
    • type: recall_at_100 value: 87.408
    • type: recall_at_1000 value: 98.721
    • type: recall_at_3 value: 40.812
    • type: recall_at_5 value: 48.397
    • task: type: PairClassification dataset: type: C-MTEB/CMNLI name: MTEB Cmnli config: default split: validation revision: 41bc36f332 metrics:
    • type: cos_sim_accuracy value: 83.43956704750451
    • type: cos_sim_ap value: 90.49172854352659
    • type: cos_sim_f1 value: 84.28475486903963
    • type: cos_sim_precision value: 80.84603822203135
    • type: cos_sim_recall value: 88.02899228431144
    • type: dot_accuracy value: 83.43956704750451
    • type: dot_ap value: 90.46317132695233
    • type: dot_f1 value: 84.28794294628929
    • type: dot_precision value: 80.51948051948052
    • type: dot_recall value: 88.4264671498714
    • type: euclidean_accuracy value: 83.43956704750451
    • type: euclidean_ap value: 90.49171785256486
    • type: euclidean_f1 value: 84.28235820561584
    • type: euclidean_precision value: 80.8022308022308
    • type: euclidean_recall value: 88.07575403320084
    • type: manhattan_accuracy value: 83.55983162958509
    • type: manhattan_ap value: 90.48046779812815
    • type: manhattan_f1 value: 84.45354259069714
    • type: manhattan_precision value: 82.21877767936226
    • type: manhattan_recall value: 86.81318681318682
    • type: max_accuracy value: 83.55983162958509
    • type: max_ap value: 90.49172854352659
    • type: max_f1 value: 84.45354259069714
    • task: type: Retrieval dataset: type: C-MTEB/CovidRetrieval name: MTEB CovidRetrieval config: default split: dev revision: 1271c78090 metrics:
    • type: map_at_1 value: 68.54599999999999
    • type: map_at_10 value: 77.62400000000001
    • type: map_at_100 value: 77.886
    • type: map_at_1000 value: 77.89
    • type: map_at_3 value: 75.966
    • type: map_at_5 value: 76.995
    • type: mrr_at_1 value: 68.915
    • type: mrr_at_10 value: 77.703
    • type: mrr_at_100 value: 77.958
    • type: mrr_at_1000 value: 77.962
    • type: mrr_at_3 value: 76.08
    • type: mrr_at_5 value: 77.118
    • type: ndcg_at_1 value: 68.809
    • type: ndcg_at_10 value: 81.563
    • type: ndcg_at_100 value: 82.758
    • type: ndcg_at_1000 value: 82.864
    • type: ndcg_at_3 value: 78.29
    • type: ndcg_at_5 value: 80.113
    • type: precision_at_1 value: 68.809
    • type: precision_at_10 value: 9.463000000000001
    • type: precision_at_100 value: 1.001
    • type: precision_at_1000 value: 0.101
    • type: precision_at_3 value: 28.486
    • type: precision_at_5 value: 18.019
    • type: recall_at_1 value: 68.54599999999999
    • type: recall_at_10 value: 93.625
    • type: recall_at_100 value: 99.05199999999999
    • type: recall_at_1000 value: 99.895
    • type: recall_at_3 value: 84.879
    • type: recall_at_5 value: 89.252
    • task: type: Retrieval dataset: type: C-MTEB/DuRetrieval name: MTEB DuRetrieval config: default split: dev revision: a1a333e290 metrics:
    • type: map_at_1 value: 25.653
    • type: map_at_10 value: 79.105
    • type: map_at_100 value: 81.902
    • type: map_at_1000 value: 81.947
    • type: map_at_3 value: 54.54599999999999
    • type: map_at_5 value: 69.226
    • type: mrr_at_1 value: 89.35
    • type: mrr_at_10 value: 92.69
    • type: mrr_at_100 value: 92.77
    • type: mrr_at_1000 value: 92.774
    • type: mrr_at_3 value: 92.425
    • type: mrr_at_5 value: 92.575
    • type: ndcg_at_1 value: 89.35
    • type: ndcg_at_10 value: 86.55199999999999
    • type: ndcg_at_100 value: 89.35300000000001
    • type: ndcg_at_1000 value: 89.782
    • type: ndcg_at_3 value: 85.392
    • type: ndcg_at_5 value: 84.5
    • type: precision_at_1 value: 89.35
    • type: precision_at_10 value: 41.589999999999996
    • type: precision_at_100 value: 4.781
    • type: precision_at_1000 value: 0.488
    • type: precision_at_3 value: 76.683
    • type: precision_at_5 value: 65.06
    • type: recall_at_1 value: 25.653
    • type: recall_at_10 value: 87.64999999999999
    • type: recall_at_100 value: 96.858
    • type: recall_at_1000 value: 99.13300000000001
    • type: recall_at_3 value: 56.869
    • type: recall_at_5 value: 74.024
    • task: type: Retrieval dataset: type: C-MTEB/EcomRetrieval name: MTEB EcomRetrieval config: default split: dev revision: 687de13dc7 metrics:
    • type: map_at_1 value: 52.1
    • type: map_at_10 value: 62.629999999999995
    • type: map_at_100 value: 63.117000000000004
    • type: map_at_1000 value: 63.134
    • type: map_at_3 value: 60.267
    • type: map_at_5 value: 61.777
    • type: mrr_at_1 value: 52.1
    • type: mrr_at_10 value: 62.629999999999995
    • type: mrr_at_100 value: 63.117000000000004
    • type: mrr_at_1000 value: 63.134
    • type: mrr_at_3 value: 60.267
    • type: mrr_at_5 value: 61.777
    • type: ndcg_at_1 value: 52.1
    • type: ndcg_at_10 value: 67.596
    • type: ndcg_at_100 value: 69.95
    • type: ndcg_at_1000 value: 70.33500000000001
    • type: ndcg_at_3 value: 62.82600000000001
    • type: ndcg_at_5 value: 65.546
    • type: precision_at_1 value: 52.1
    • type: precision_at_10 value: 8.309999999999999
    • type: precision_at_100 value: 0.941
    • type: precision_at_1000 value: 0.097
    • type: precision_at_3 value: 23.400000000000002
    • type: precision_at_5 value: 15.36
    • type: recall_at_1 value: 52.1
    • type: recall_at_10 value: 83.1
    • type: recall_at_100 value: 94.1
    • type: recall_at_1000 value: 97.0
    • type: recall_at_3 value: 70.19999999999999
    • type: recall_at_5 value: 76.8
    • task: type: Classification dataset: type: C-MTEB/IFlyTek-classification name: MTEB IFlyTek config: default split: validation revision: 421605374b metrics:
    • type: accuracy value: 51.773759138130046
    • type: f1 value: 40.341407912920054
    • task: type: Classification dataset: type: C-MTEB/JDReview-classification name: MTEB JDReview config: default split: test revision: b7c64bd89e metrics:
    • type: accuracy value: 86.69793621013133
    • type: ap value: 55.46718958939327
    • type: f1 value: 81.48228915952436
    • task: type: STS dataset: type: C-MTEB/LCQMC name: MTEB LCQMC config: default split: test revision: 17f9b096f8 metrics:
    • type: cos_sim_pearson value: 71.1397780205448
    • type: cos_sim_spearman value: 78.17368193033309
    • type: euclidean_pearson value: 77.4849177602368
    • type: euclidean_spearman value: 78.17369079663212
    • type: manhattan_pearson value: 77.47344305182406
    • type: manhattan_spearman value: 78.16454335155387
    • task: type: Reranking dataset: type: C-MTEB/Mmarco-reranking name: MTEB MMarcoReranking config: default split: dev revision: 8e0c766dbe metrics:
    • type: map value: 27.76160559006673
    • type: mrr value: 28.02420634920635
    • task: type: Retrieval dataset: type: C-MTEB/MMarcoRetrieval name: MTEB MMarcoRetrieval config: default split: dev revision: 539bbde593 metrics:
    • type: map_at_1 value: 65.661
    • type: map_at_10 value: 74.752
    • type: map_at_100 value: 75.091
    • type: map_at_1000 value: 75.104
    • type: map_at_3 value: 72.997
    • type: map_at_5 value: 74.119
    • type: mrr_at_1 value: 67.923
    • type: mrr_at_10 value: 75.376
    • type: mrr_at_100 value: 75.673
    • type: mrr_at_1000 value: 75.685
    • type: mrr_at_3 value: 73.856
    • type: mrr_at_5 value: 74.82799999999999
    • type: ndcg_at_1 value: 67.923
    • type: ndcg_at_10 value: 78.424
    • type: ndcg_at_100 value: 79.95100000000001
    • type: ndcg_at_1000 value: 80.265
    • type: ndcg_at_3 value: 75.101
    • type: ndcg_at_5 value: 76.992
    • type: precision_at_1 value: 67.923
    • type: precision_at_10 value: 9.474
    • type: precision_at_100 value: 1.023
    • type: precision_at_1000 value: 0.105
    • type: precision_at_3 value: 28.319
    • type: precision_at_5 value: 17.986
    • type: recall_at_1 value: 65.661
    • type: recall_at_10 value: 89.09899999999999
    • type: recall_at_100 value: 96.023
    • type: recall_at_1000 value: 98.455
    • type: recall_at_3 value: 80.314
    • type: recall_at_5 value: 84.81
    • task: type: Classification dataset: type: mteb/amazon_massive_intent name: MTEB MassiveIntentClassification (zh-CN) config: zh-CN split: test revision: 31efe3c427 metrics:
    • type: accuracy value: 75.86751849361131
    • type: f1 value: 73.04918450508
    • task: type: Classification dataset: type: mteb/amazon_massive_scenario name: MTEB MassiveScenarioClassification (zh-CN) config: zh-CN split: test revision: 7d571f9278 metrics:
    • type: accuracy value: 78.4364492266308
    • type: f1 value: 78.120686034844
    • task: type: Retrieval dataset: type: C-MTEB/MedicalRetrieval name: MTEB MedicalRetrieval config: default split: dev revision: 2039188fb5 metrics:
    • type: map_at_1 value: 55.00000000000001
    • type: map_at_10 value: 61.06399999999999
    • type: map_at_100 value: 61.622
    • type: map_at_1000 value: 61.663000000000004
    • type: map_at_3 value: 59.583
    • type: map_at_5 value: 60.373
    • type: mrr_at_1 value: 55.2
    • type: mrr_at_10 value: 61.168
    • type: mrr_at_100 value: 61.726000000000006
    • type: mrr_at_1000 value: 61.767
    • type: mrr_at_3 value: 59.683
    • type: mrr_at_5 value: 60.492999999999995
    • type: ndcg_at_1 value: 55.00000000000001
    • type: ndcg_at_10 value: 64.098
    • type: ndcg_at_100 value: 67.05
    • type: ndcg_at_1000 value: 68.262
    • type: ndcg_at_3 value: 61.00600000000001
    • type: ndcg_at_5 value: 62.439
    • type: precision_at_1 value: 55.00000000000001
    • type: precision_at_10 value: 7.37
    • type: precision_at_100 value: 0.881
    • type: precision_at_1000 value: 0.098
    • type: precision_at_3 value: 21.7
    • type: precision_at_5 value: 13.719999999999999
    • type: recall_at_1 value: 55.00000000000001
    • type: recall_at_10 value: 73.7
    • type: recall_at_100 value: 88.1
    • type: recall_at_1000 value: 97.8
    • type: recall_at_3 value: 65.10000000000001
    • type: recall_at_5 value: 68.60000000000001
    • task: type: Classification dataset: type: C-MTEB/MultilingualSentiment-classification name: MTEB MultilingualSentiment config: default split: validation revision: 46958b007a metrics:
    • type: accuracy value: 77.52666666666667
    • type: f1 value: 77.49784731367215
    • task: type: PairClassification dataset: type: C-MTEB/OCNLI name: MTEB Ocnli config: default split: validation revision: 66e76a618a metrics:
    • type: cos_sim_accuracy value: 81.10449377368705
    • type: cos_sim_ap value: 85.17742765935606
    • type: cos_sim_f1 value: 83.00094966761633
    • type: cos_sim_precision value: 75.40983606557377
    • type: cos_sim_recall value: 92.29144667370645
    • type: dot_accuracy value: 81.10449377368705
    • type: dot_ap value: 85.17143850809614
    • type: dot_f1 value: 83.01707779886148
    • type: dot_precision value: 75.36606373815677
    • type: dot_recall value: 92.39704329461456
    • type: euclidean_accuracy value: 81.10449377368705
    • type: euclidean_ap value: 85.17856775343333
    • type: euclidean_f1 value: 83.00094966761633
    • type: euclidean_precision value: 75.40983606557377
    • type: euclidean_recall value: 92.29144667370645
    • type: manhattan_accuracy value: 81.05035192203573
    • type: manhattan_ap value: 85.14464459395809
    • type: manhattan_f1 value: 82.96155671570953
    • type: manhattan_precision value: 75.3448275862069
    • type: manhattan_recall value: 92.29144667370645
    • type: max_accuracy value: 81.10449377368705
    • type: max_ap value: 85.17856775343333
    • type: max_f1 value: 83.01707779886148
    • task: type: Classification dataset: type: C-MTEB/OnlineShopping-classification name: MTEB OnlineShopping config: default split: test revision: e610f2ebd1 metrics:
    • type: accuracy value: 93.71000000000001
    • type: ap value: 91.83202232349356
    • type: f1 value: 93.69900560334331
    • task: type: STS dataset: type: C-MTEB/PAWSX name: MTEB PAWSX config: default split: test revision: 9c6a90e430 metrics:
    • type: cos_sim_pearson value: 39.175047651512415
    • type: cos_sim_spearman value: 45.51434675777896
    • type: euclidean_pearson value: 44.864110004132286
    • type: euclidean_spearman value: 45.516433048896076
    • type: manhattan_pearson value: 44.87153627706517
    • type: manhattan_spearman value: 45.52862617925012
    • task: type: STS dataset: type: C-MTEB/QBQTC name: MTEB QBQTC config: default split: test revision: 790b0510dc metrics:
    • type: cos_sim_pearson value: 34.249579701429084
    • type: cos_sim_spearman value: 37.30903127368978
    • type: euclidean_pearson value: 35.129438425253355
    • type: euclidean_spearman value: 37.308544018709085
    • type: manhattan_pearson value: 35.08936153503652
    • type: manhattan_spearman value: 37.25582901077839
    • task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (zh) config: zh split: test revision: eea2b4fe26 metrics:
    • type: cos_sim_pearson value: 61.29309637460004
    • type: cos_sim_spearman value: 65.85136090376717
    • type: euclidean_pearson value: 64.04783990953557
    • type: euclidean_spearman value: 65.85036859610366
    • type: manhattan_pearson value: 63.995852552712186
    • type: manhattan_spearman value: 65.86508416749417
    • task: type: STS dataset: type: C-MTEB/STSB name: MTEB STSB config: default split: test revision: 0cde68302b metrics:
    • type: cos_sim_pearson value: 81.5595940455587
    • type: cos_sim_spearman value: 82.72654634579749
    • type: euclidean_pearson value: 82.4892721061365
    • type: euclidean_spearman value: 82.72678504228253
    • type: manhattan_pearson value: 82.4770861422454
    • type: manhattan_spearman value: 82.71137469783162
    • task: type: Reranking dataset: type: C-MTEB/T2Reranking name: MTEB T2Reranking config: default split: dev revision: 76631901a1 metrics:
    • type: map value: 66.6159547610527
    • type: mrr value: 76.35739406347057
    • task: type: Retrieval dataset: type: C-MTEB/T2Retrieval name: MTEB T2Retrieval config: default split: dev revision: 8731a845f1 metrics:
    • type: map_at_1 value: 27.878999999999998
    • type: map_at_10 value: 77.517
    • type: map_at_100 value: 81.139
    • type: map_at_1000 value: 81.204
    • type: map_at_3 value: 54.728
    • type: map_at_5 value: 67.128
    • type: mrr_at_1 value: 90.509
    • type: mrr_at_10 value: 92.964
    • type: mrr_at_100 value: 93.045
    • type: mrr_at_1000 value: 93.048
    • type: mrr_at_3 value: 92.551
    • type: mrr_at_5 value: 92.81099999999999
    • type: ndcg_at_1 value: 90.509
    • type: ndcg_at_10 value: 85.075
    • type: ndcg_at_100 value: 88.656
    • type: ndcg_at_1000 value: 89.25699999999999
    • type: ndcg_at_3 value: 86.58200000000001
    • type: ndcg_at_5 value: 85.138
    • type: precision_at_1 value: 90.509
    • type: precision_at_10 value: 42.05
    • type: precision_at_100 value: 5.013999999999999
    • type: precision_at_1000 value: 0.516
    • type: precision_at_3 value: 75.551
    • type: precision_at_5 value: 63.239999999999995
    • type: recall_at_1 value: 27.878999999999998
    • type: recall_at_10 value: 83.941
    • type: recall_at_100 value: 95.568
    • type: recall_at_1000 value: 98.55000000000001
    • type: recall_at_3 value: 56.374
    • type: recall_at_5 value: 70.435
    • task: type: Classification dataset: type: C-MTEB/TNews-classification name: MTEB TNews config: default split: validation revision: 317f262bf1 metrics:
    • type: accuracy value: 53.687
    • type: f1 value: 51.86911933364655
    • task: type: Clustering dataset: type: C-MTEB/ThuNewsClusteringP2P name: MTEB ThuNewsClusteringP2P config: default split: test revision: 5798586b10 metrics:
    • type: v_measure value: 74.65887489872564
    • task: type: Clustering dataset: type: C-MTEB/ThuNewsClusteringS2S name: MTEB ThuNewsClusteringS2S config: default split: test revision: 8a8b2caeda metrics:
    • type: v_measure value: 69.00410995984436
    • task: type: Retrieval dataset: type: C-MTEB/VideoRetrieval name: MTEB VideoRetrieval config: default split: dev revision: 58c2597a59 metrics:
    • type: map_at_1 value: 59.4
    • type: map_at_10 value: 69.214
    • type: map_at_100 value: 69.72699999999999
    • type: map_at_1000 value: 69.743
    • type: map_at_3 value: 67.717
    • type: map_at_5 value: 68.782
    • type: mrr_at_1 value: 59.4
    • type: mrr_at_10 value: 69.214
    • type: mrr_at_100 value: 69.72699999999999
    • type: mrr_at_1000 value: 69.743
    • type: mrr_at_3 value: 67.717
    • type: mrr_at_5 value: 68.782
    • type: ndcg_at_1 value: 59.4
    • type: ndcg_at_10 value: 73.32300000000001
    • type: ndcg_at_100 value: 75.591
    • type: ndcg_at_1000 value: 75.98700000000001
    • type: ndcg_at_3 value: 70.339
    • type: ndcg_at_5 value: 72.246
    • type: precision_at_1 value: 59.4
    • type: precision_at_10 value: 8.59
    • type: precision_at_100 value: 0.96
    • type: precision_at_1000 value: 0.099
    • type: precision_at_3 value: 25.967000000000002
    • type: precision_at_5 value: 16.5
    • type: recall_at_1 value: 59.4
    • type: recall_at_10 value: 85.9
    • type: recall_at_100 value: 96.0
    • type: recall_at_1000 value: 99.1
    • type: recall_at_3 value: 77.9
    • type: recall_at_5 value: 82.5
    • task: type: Classification dataset: type: C-MTEB/waimai-classification name: MTEB Waimai config: default split: test revision: 339287def2 metrics:
    • type: accuracy value: 88.53
    • type: ap value: 73.56216166534062
    • type: f1 value: 87.06093694294485 ---
icon

acge model

acge模型来自于合合信息技术团队,对外技术试用平台TextIn, github开源链接为github。合合信息是行业领先的人工智能及大数据科技企业,致力于通过智能文字识别及商业大数据领域的核心技术、C端和B端产品以及行业解决方案为全球企业和个人用户提供创新的数字化、智能化服务。

技术交流请联系,商务合作请联系,可以点击图片,扫面二维码来加入我们的微信社群。想加入合合信息,做“文档解析”、“文档检索”、“文档预研”的同学可以投简历给min_du@intsig.net,也可直接添加HR微信详聊岗位内容。

acge是一个通用的文本编码模型,是一个可变长度的向量化模型,使用了Matryoshka Representation Learning,如图所示:

matryoshka-small

建议使用的维度为1024或者1792

Model Name Model Size (GB) Dimension Sequence Length Language Need instruction for retrieval?
acge-text-embedding 0.65 [1024, 1792] 1024 Chinese NO

Metric

C-MTEB leaderboard (Chinese)

测试的时候因为数据的随机性、显卡、推理的数据类型导致每次推理的结果不一致,我总共测试了4次,不同的显卡(A10 A100),不同的数据类型,测试结果放在了result文件夹中,选取了一个精度最低的测试作为最终的精度测试。 根据infgrad的建议,选取不用的输入的长度作为测试,Sequence Length为512时测试最佳。

Model Name GPU tensor-type Model Size (GB) Dimension Sequence Length Average (35) Classification (9) Clustering (4) Pair Classification (2) Reranking (4) Retrieval (8) STS (8)
acge_text_embedding NVIDIA TESLA A10 bfloat16 0.65 1792 1024 68.91 72.76 58.22 87.82 67.67 72.48 62.24
acge_text_embedding NVIDIA TESLA A100 bfloat16 0.65 1792 1024 68.91 72.77 58.35 87.82 67.53 72.48 62.24
acge_text_embedding NVIDIA TESLA A100 float16 0.65 1792 1024 68.99 72.76 58.68 87.84 67.89 72.49 62.24
acge_text_embedding NVIDIA TESLA A100 float32 0.65 1792 1024 68.98 72.76 58.58 87.83 67.91 72.49 62.24
acge_text_embedding NVIDIA TESLA A100 float16 0.65 1792 768 68.95 72.76 58.68 87.84 67.86 72.48 62.07
acge_text_embedding NVIDIA TESLA A100 float16 0.65 1792 512 69.07 72.75 58.7 87.84 67.99 72.93 62.09

Reproduce our results

C-MTEB:

import torch
import argparse
import functools
from C_MTEB.tasks import *
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from mteb import MTEB, DRESModel


class RetrievalModel(DRESModel):
    def __init__(self, encoder, **kwargs):
        self.encoder = encoder

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        input_texts = ['{}'.format(q) for q in queries]
        return self._do_encode(input_texts)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
        input_texts = ['{}'.format(t) for t in input_texts]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        return self.encoder.encode(
            sentences=input_texts,
            batch_size=512,
            normalize_embeddings=True,
            convert_to_numpy=True
        )


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', default="acge_text_embedding", type=str)
    parser.add_argument('--task_type', default=None, type=str)
    parser.add_argument('--pooling_method', default='cls', type=str)
    parser.add_argument('--output_dir', default='zh_results',
                        type=str, help='output directory')
    parser.add_argument('--max_len', default=1024, type=int, help='max length')
    return parser.parse_args()


if __name__ == '__main__':
    args = get_args()
    encoder = SentenceTransformer(args.model_name_or_path).half()
    encoder.encode = functools.partial(encoder.encode, normalize_embeddings=True)
    encoder.max_seq_length = int(args.max_len)

    task_names = [t.description["name"] for t in MTEB(task_types=args.task_type,
                                                      task_langs=['zh', 'zh-CN']).tasks]
    TASKS_WITH_PROMPTS = ["T2Retrieval", "MMarcoRetrieval", "DuRetrieval", "CovidRetrieval", "CmedqaRetrieval",
                          "EcomRetrieval", "MedicalRetrieval", "VideoRetrieval"]
    for task in task_names:
        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
        if task in TASKS_WITH_PROMPTS:
            evaluation.run(RetrievalModel(encoder), output_folder=args.output_dir, overwrite_results=False)
        else:
            evaluation.run(encoder, output_folder=args.output_dir, overwrite_results=False)


Usage

acge 中文系列模型

在sentence-transformer库中的使用方法:

from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('acge_text_embedding')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

在sentence-transformer库中的使用方法,选取不同的维度:

from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('acge_text_embedding')
embeddings = model.encode(sentences, normalize_embeddings=False)
matryoshka_dim = 1024
embeddings = embeddings[..., :matryoshka_dim]  # Shrink the embedding dimensions
embeddings = normalize(embeddings, norm="l2", axis=1)
print(embeddings.shape)
# => (2, 1024)