In positron emission tomography (PET) reconstruction, time-of-flight (TOF) information can significantly improve the signal-to-noise ratio (SNR), which in turn places greater demands on the precision of the TOF measurement. To address this, we trained our network on two distinct waveform datasets. One comprises simulated waveforms produced by a comprehensive simulation process built with Geant4 and GosSip. The other consists of real waveforms collected from lutetium yttrium orthosilicate (LYSO) scintillators and silicon photomultiplier (SiPM) detectors placed at various positions. Our network, which combines a Transformer with a convolutional neural network (CNN), predicts the TOF of coincidence events from PET detector waveforms. Across multiple positions, it achieved an average full width at half maximum (FWHM) of 189 ps, which is 82 ps and 13 ps lower than the constant fraction discriminator (CFD) and a CNN, respectively, and it reduced the average bias by 10.3 ps compared with the CNN. Visualizing the attention map showed that the Transformer strongly emphasizes the rising edge of the waveforms. We also demonstrated the robustness of the proposed network by including waveforms with scattered events in the real training dataset. Data augmentation through translation and flipping was investigated and yielded an improvement of 5 ps. Furthermore, we analyzed the characteristic differences between real and simulated waveforms, which provides guidance for generating more realistic simulated data in the future. Overall, the network improved the average FWHM and bias, leading to enhanced SNR and clearer imaging, and data augmentation effectively expanded the dataset and eased the data collection process.
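
For readers unfamiliar with the CFD baseline and the FWHM metric quoted above, the following is a minimal sketch, not the implementation used in this work. It uses a simplified digital constant-fraction threshold (the time at which the leading edge first reaches a fixed fraction of the pulse peak) rather than the classic delay-and-subtract zero-crossing circuit, and it converts a spread of TOF values to FWHM under a Gaussian assumption. The function names, the fraction of 0.2, the 100 ps sampling interval, and the synthetic pulse shape are all illustrative assumptions, none of which is taken from the study.

```python
import numpy as np

def make_pulse(t_ps, t0_ps, rise_ps=1000.0, decay_ps=40000.0):
    """Hypothetical pulse: linear rise starting at t0, then exponential decay."""
    rise = np.clip((t_ps - t0_ps) / rise_ps, 0.0, 1.0)
    decay = np.exp(-np.maximum(t_ps - t0_ps - rise_ps, 0.0) / decay_ps)
    return rise * decay

def cfd_time(waveform, dt_ps, fraction=0.2):
    """Time (ps) at which the leading edge first reaches fraction * peak.

    Simplified digital constant-fraction threshold with linear
    interpolation between the two samples that bracket the crossing.
    """
    w = np.asarray(waveform, dtype=float)
    threshold = fraction * w.max()
    i = int(np.argmax(w >= threshold))  # first sample at or above threshold
    if i == 0:
        return 0.0
    frac = (threshold - w[i - 1]) / (w[i] - w[i - 1])
    return (i - 1 + frac) * dt_ps

def fwhm_gaussian(tof_values_ps):
    """FWHM of a TOF distribution, assuming it is Gaussian: 2*sqrt(2 ln 2)*sigma."""
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * np.std(tof_values_ps)

# Two detector waveforms whose true arrival times differ by 230 ps.
dt_ps = 100.0                      # assumed sampling interval
t = np.arange(0.0, 20000.0, dt_ps)
pulse_a = make_pulse(t, t0_ps=1000.0)
pulse_b = make_pulse(t, t0_ps=1230.0)

# TOF estimate for this coincidence: difference of the two pick-off times.
tof_ps = cfd_time(pulse_b, dt_ps) - cfd_time(pulse_a, dt_ps)
print(f"estimated TOF: {tof_ps:.1f} ps")
```

In this sketch the timing resolution of a method would be summarized by applying `fwhm_gaussian` to the TOF estimates of many coincidence events recorded at a fixed source position, and the bias by the offset of their mean from the known value.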