The recent trend of deploying programmable packet processors in cloud environments enhances packet processing capability without sacrificing the flexibility to adapt functions at runtime. In particular, distributed edge clouds can have a heterogeneous programmable processing substrate made up of different classes of devices, such as CPUs, NPUs, and FPGAs. However, managing the allocation of workloads on such a substrate, in particular deciding where to instantiate a given function, is a non-trivial task that depends on many functional and QoS-related factors. In this paper, we propose a mathematical model for optimizing the embedding of Service Function Chains implemented in P4. The model considers the functional and QoS requirements associated with embedding requests, as well as the various types of processing devices, which differ in processing delay and supported features. To satisfy delay requirements, the problem formulation uses performance models to predict the forwarding latency of the candidate embedding options. Furthermore, we propose a greedy solution that solves the problem efficiently. Finally, a detailed numerical evaluation assesses the formulated model under varying workload and infrastructure characteristics and examines the effectiveness of the proposed greedy solution.